This project analyzes the ex1221 dataset from the Sleuth2 R package to explore the factors influencing nitrate concentration (NO3) in river mouths.

# Load required packages
library(Sleuth2)
library(ggplot2)
library(GGally)
library(cowplot)
library(lmtest)

# Suppress package startup messages if needed
Sys.setenv(`_R_S3_METHOD_REGISTRATION_NOTE_OVERWRITES_` = "false")
suppressPackageStartupMessages(library(zoo))

Task 1: Data Exploration

Load the dataset and perform basic statistical investigations:

  • Briefly describe the data and individual variables.
  • Determine the most important statistical measures that best characterize the data.
  • Represent the data appropriately using selected graphs.
attach(ex1221)
df <- ex1221
ex1221
A data.frame: 42 x 11
RiverCountryDischargeRunoffAreaDensityNO3ExportDepNPrecPrec
<chr><fct><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1AdigeItaly223.0018.31220102.0067.01224.71237.546.084.8
2AmazonS_America175000.0024.870500001.003.074.5120.62.1181.1
3CaraghIreland7.2945.61607.153.6164.086.52.6104.9
4ColumbiaUSA7900.0011.867000010.0026.6313.662.82.099.1
5DanubeRumania6500.008.180500090.0046.0371.4826.445.057.9
6DelawareUSA336.0019.117600100.0061.01167.2851.725.0107.4
7FraserCanada3550.0016.12200002.006.4103.3739.716.0145.8
8GangesIndia16000.0014.91070000300.0091.31361.4294.35.8160.0
9GlaamaNorway706.0016.94177012.0024.0405.7975.045.068.3
10HuangheChina1470.002.0750000200.00139.0272.6286.428.032.3
11HudsonUSA560.0016.134700150.0047.8771.4851.725.0107.4
12Kazan_and_BackCanada1900.006.13120000.401.16.760.97.027.4
13MackenzieCanada10600.005.917870000.155.733.873.97.033.3
14MagdalenaColumbia7500.0031.324000030.0017.0531.387.52.6106.2
15MekongSE_Asia15000.0019.278300043.0017.0325.7334.17.6139.2
16MerseyEngland21.0017.51200200.00156.02730.0919.428.9100.3
17MeuseNthlnds/Belgium317.009.134900250.00230.02089.1742.336.065.0
18MississippiUSA16100.005.0322000030.0063.0315.0691.719.0114.8
19Murray-DarlingAustralia318.200.310730001.5015.04.474.84.453.6
20NelsonCanada2370.002.210700002.005.011.1248.621.037.3
21NigerW_Africa7000.006.2112500020.007.043.6555.29.6181.6
22NileNE_Africa950.000.3296000050.0020.06.450.910.215.7
23OrangeS_Africa170.000.2102000020.0050.08.3154.923.018.1
24OrinocoVenezuela33900.0033.910000002.006.0203.492.53.097.3
25ParanaArgentina15900.005.7280000010.0014.280.6216.29.975.8
26PoItaly1470.0022.066700232.00102.02247.31237.546.084.8
27RhineEurope2200.0011.9185300300.00286.03395.61647.960.086.6
28RhoneFrance1700.0017.796000100.0057.21012.9695.930.073.2
29ShannonIreland190.0013.51400035.0054.0727.7252.88.692.7
30StikineCanada/USA1100.0022.0500001.006.1134.276.81.0242.1
31St._LawrenceCanada/USA10700.0010.4102500015.0016.0167.0673.221.0101.1
32SusquehannaUSA1100.0015.173000100.0066.0994.5821.525.0103.6
33TeesEngland50.0027.71806100.0075.02076.5608.733.058.2
34ThamesEngland78.007.89950400.00520.04076.41125.161.058.2
35TiberItaly230.0013.517000262.00100.01352.91237.546.084.8
36UruguayS_America3850.0010.536500010.0029.0305.9355.613.786.1
37VistulaPoland1100.005.5200000120.0070.5387.8832.847.055.9
38VolgaRussia8200.006.1135000050.0030.0182.2151.813.036.8
39YangtzeChina29000.0015.41900000200.0058.2897.0370.510.0116.8
40YukonCanada6180.007.48310000.409.369.2185.47.878.5
41ZaireZaire39730.0010.4382000011.706.062.4467.210.0147.3
42ZambeziSE_Africa3200.002.5130000015.009.322.9138.58.451.8

Data Description

Rising nitrate levels in river mouths cause an increase in algae in coastal waters. Therefore, data was collected to investigate the relationship between nitrate concentration in rivers and human population density. We have the following variables available:

VariableDescriptionUnits
RiverRiver name
CountryTerritory through which the river flows
DischargeAverage annual river discharge into the oceanm³/sec
RunoffAverage annual specific runoff from the watershedliter/(sec x km²)
AreaArea of the watershedkm²
ExportNitrate export (product of Discharge and NO3)
DepDeposition (proportional to the product of NPrec and Prec)
NPrecPrecipitation - nitrate concentrationmicromol NO3/(sec × km²))
PrecPrecipitationcm/year
DensityPopulation densitypeople/km²
NO3Nitrate concentrationmicromol NO3/liter

In the dataset, we observe the artificial variables Export and Dep, which were calculated from other measured values:

options(repr.plot.width = 10)
options(repr.plot.height = 5)
p1 <- ggplot(df, aes(x = Runoff * NO3, y = Export)) +
    geom_point() + theme_bw()
p2 <- ggplot(df, aes(x = NPrec * Prec, y = Dep)) +
    geom_point() + theme_bw()
plot_grid(p1, p2, ncol = 2)

png

Statistical Measures

Nitrate Concentration

options(repr.plot.width = 10)
options(repr.plot.height = 8)
plot.ecdf(NO3, xlab = "NO3")

options(repr.plot.height = 6)
hist(NO3, 18)

options(repr.plot.height = 4)
boxplot(NO3, horizontal = T, xlab = "NO3")

png

png

png

cat("Var: ", var(NO3), "\n")
cat("Std: ", sd(NO3), "\n")
summary(NO3)
Var:  8856.962 
Std:  94.11144 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.10    9.30   29.50   62.32   66.75  520.00 

The mean of the response variable is 62.2, and the median (which is more representative in this case) is 29.5. The sample variance of nitrate concentration is 8857.

Outliers: The nitrate concentration values in the selected rivers range from 1 to 150, with the exception of a few outliers:

RiverAreaNO3 Concentration
MerseyEngland156
MeuseNetherlands/Belgium230
RhineEurope286
ThamesEngland520

Population Density

options(repr.plot.width = 10)
options(repr.plot.height = 8)
plot.ecdf(Density, xlab = "Density")
options(repr.plot.height = 6)
hist(Density, 10)
options(repr.plot.height = 4)
boxplot(Density, horizontal = T, xlab = "Density")

png

png

png

cat("Var: ", var(Density), "\n")
cat("Std: ", sd(Density), "\n")
summary(Density)
Var:  10880.2 
Std:  104.3082 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.15   10.00   32.50   85.36  115.50  400.00 

Outliers: In the population density data, we observe outliers (similar to NO3 concentration) for rivers flowing through large cities:

RiverAreaPopulation Density
GangesIndia300.00
RhineEurope300.00
ThamesEngland400.00

Other Explanatory Variables

summary(subset(df, select = c("River", "Country", "Discharge",
                              "Runoff", "Area", "NPrec", "Prec")))
    River                 Country     Discharge             Runoff     
 Length:42          Canada    : 5   Min.   :     7.29   Min.   : 0.20  
 Class :character   USA       : 5   1st Qu.:   392.00   1st Qu.: 6.10  
 Mode  :character   England   : 3   Median :  2050.00   Median :11.85  
                    Italy     : 3   Mean   : 10342.30   Mean   :13.24  
                    Canada/USA: 2   3rd Qu.:  8125.00   3rd Qu.:17.65  
                    China     : 2   Max.   :175000.00   Max.   :45.60  
                    (Other)   :22                                      
      Area             NPrec            Prec       
 Min.   :    160   Min.   : 1.00   Min.   : 15.70  
 1st Qu.:  43828   1st Qu.: 7.65   1st Qu.: 57.98  
 Median : 517500   Median :14.85   Median : 89.84  
 Mean   : 937888   Mean   :20.79   Mean   : 89.84  
 3rd Qu.:1072250   3rd Qu.:29.73   3rd Qu.:107.10  
 Max.   :7050000   Max.   :61.00   Max.   :242.10  

Interactions Between Variables

options(repr.plot.width = 14, repr.plot.height = 8)
ggpairs(subset(df, select = c("NO3", "Density", "Discharge",
                              "Runoff", "Area", "NPrec", "Prec")))

png

Interesting Interactions:

  • We observe a very strong positive correlation of 0.84 between population density (Density) and NO3 concentration.
  • Nitrate precipitation (NPrec) also has a strong positive correlation of 0.68 with NO3 concentration.
  • We also note a significant positive correlation of 0.67 between Density and NPrec themselves.

Task 2: Simple Linear Regression

Examine the relationship between the response variable and a single numerical regressor (Density).

We investigate the dependence of NO3 concentration in rivers on population density Density.

fit <- lm(NO3 ~ Density)
summary(fit)

options(repr.plot.width = 14, repr.plot.height = 8)
plot(NO3 ~ Density)
abline(fit)

options(repr.plot.width = 14, repr.plot.height = 5)
par(mfrow = c(1, 2))
plot(fit, which = 1)
plot(fit, which = 3)
Call:
lm(formula = NO3 ~ Density)

Residuals:
     Min       1Q   Median       3Q      Max 
-133.880  -11.892    2.416   10.857  218.940 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.45705   10.32855  -0.238    0.813    
Density      0.75879    0.07718   9.831 3.15e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 51.55 on 40 degrees of freedom
Multiple R-squared:  0.7073, Adjusted R-squared:    0.7 
F-statistic: 96.65 on 1 and 40 DF,  p-value: 3.148e-12

png

png

Model Interpretation: A simple linear regression was chosen because the scatter plot shows a linear trend. The adjusted R-squared is 0.70. The intercept is not significant at the 5% level, while the coefficient for Density is highly significant. The model suggests that for every additional person per km², the NO3 concentration increases by 0.7587 micromols/liter. The negative intercept of -2.457 at zero population density is physically unrealistic.

Model Quality: This simple model does not explain the response variable very well. While it doesn’t show systematic errors in the residuals plot, it is inaccurate for higher values, and the intercept is flawed. The data points seem to diverge at higher values, suggesting that a single regression line may not be sufficient. However, the model is very simple and effectively captures the positive relationship between NO3 and Density.


Task 3: Analysis of Variance (ANOVA)

Examine the relationship between the response variable and a single categorical regressor (Continent).

We examine NO3 concentration based on the region where the river flows. We group the regions into the following 4 categories:

  • Africa
  • America
  • Asia-Pacific
  • Europe
Continent <- c("Europe", "America", "Europe", "America", "Europe",
  "America", "America", "AsiaPacific", "Europe", "AsiaPacific",
  "America", "America", "America", "America", "AsiaPacific",
  "Europe", "Europe", "America", "AsiaPacific", "America",
  "Africa", "Africa", "Africa", "America", "America", "Europe",
  "Europe", "Europe", "Europe", "America", "America", "America",
  "Europe", "Europe", "Europe", "America", "Europe", "Europe",
  "AsiaPacific", "America", "Africa", "Africa")
Continent <- as.factor(Continent)
df$Continent <- Continent

tapply(NO3, Continent, summary)

fit <- lm(NO3 ~ Continent)
summary(fit)

options(repr.plot.width = 14, repr.plot.height = 8)
boxplot(NO3 ~ Continent, ylab = "", xlab = "continent")

options(repr.plot.width = 14, repr.plot.height = 5)
par(mfrow = c(1, 2))
plot(fit, which = 1)
plot(fit, which = 3)
$Africa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6.00    7.00    9.30   18.46   20.00   50.00 

$America
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.10    6.00   14.20   22.54   29.00   66.00 

$AsiaPacific
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   15.0    17.0    58.2    64.1    91.3   139.0 

$Europe
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    3.6    50.0    70.5   121.4   129.0   520.0 

Call:
lm(formula = NO3 ~ Continent)

Residuals:
    Min      1Q  Median      3Q     Max 
-117.82  -40.18  -12.85   20.56  398.58 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)  
(Intercept)            18.460     37.990   0.486   0.6298  
ContinentAmerica        4.081     43.217   0.094   0.9253  
ContinentAsiaPacific   45.640     53.725   0.850   0.4009  
ContinentEurope       102.960     43.867   2.347   0.0242 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 84.95 on 38 degrees of freedom
Multiple R-squared:  0.2449, Adjusted R-squared:  0.1853 
F-statistic: 4.108 on 3 and 38 DF,  p-value: 0.0128

png

png

Observation: NO3 concentration in Africa and the Americas is significantly lower than in Europe and the Asia-Pacific region.

Test: We test if the mean NO3 concentration is the same for all continents.

\(H_0 : \mu_\text{Africa} = \mu_\text{America} = \mu_\text{AsiaPacific} = \mu_\text{Europe}\)

\(H_A : \text{Not } H_0\)

aov(NO3 ~ Continent)
anova(aov(NO3 ~ Continent))
Call:
   aov(formula = NO3 ~ Continent)

Terms:
                Continent Residuals
Sum of Squares   88926.52 274208.94
Deg. of Freedom         3        38

Residual standard error: 84.94719
Estimated effects may be unbalanced
A anova: 2 x 5
DfSum SqMean SqF valuePr(>F)
<int><dbl><dbl><dbl><dbl>
Continent388926.5229642.1744.1078260.01280309
Residuals38274208.947216.025NANA

Conclusion: At a 5% significance level, we reject the null hypothesis, concluding that the mean NO3 concentration differs by continent. We see that the concentration is highest in Europe, slightly lower in the Asia-Pacific region, and significantly lower in the Americas and Africa.

Interpretation: The average NO3 concentration is 18.46 in Africa, 22.54 in America, 64.1 in AsiaPacific, and 121.4 in Europe. At the 5% level, only the coefficient for Europe is significant relative to the baseline (Africa).

Model Quality: The model evaluating the mean values for each continent is usable but very coarse. It doesn’t provide a complete picture, especially for Europe and the Asia-Pacific region, where the variance is large.


Task 4: Interaction Model

Consider a regression model containing both previous regressors (Density and Continent) including their interaction.

fit <- lm(NO3 ~ Density + Density:Continent)
summary(fit)

options(repr.plot.width = 14, repr.plot.height = 8)
tmp <- suppressWarnings(predict(fit, interval = "prediction"))
ggplot(cbind(df, tmp), aes(x = Density, y = NO3, group = Continent, color = Continent)) +
    geom_point() + geom_line(aes(y = fit)) + theme_bw()

options(repr.plot.width = 14, repr.plot.height = 5)
par(mfrow = c(1, 2))
plot(fit, which = 1)
plot(fit, which = 3)
Call:
lm(formula = NO3 ~ Density + Density:Continent)

Residuals:
     Min       1Q   Median       3Q      Max 
-130.834  -14.438    2.416   11.910  168.013 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                    0.8182     9.7710   0.084    0.934
Density                        0.6156     0.7971   0.772    0.445
Density:ContinentAmerica      -0.1054     0.7915  -0.133    0.895
Density:ContinentAsiaPacific  -0.2259     0.7888  -0.286    0.776
Density:ContinentEurope        0.2623     0.7842   0.335    0.740

Residual standard error: 44.4 on 37 degrees of freedom
Multiple R-squared:  0.7991, Adjusted R-squared:  0.7774 
F-statistic:  36.8 on 4 and 37 DF,  p-value: 2.005e-12

png

png

Assumption: The NO3 concentration at zero population density (Density = 0) is the same on all continents. We chose the model lm(NO3 ~ Density + Density:Continent) which reflects this assumption by allowing different slopes for each continent but a common intercept.

Interpretation: The adjusted R-squared is 0.78. At zero population density, the natural NO3 concentration is 0.8182 micromol/liter. For each additional person per km², the NO3 concentration increases by the following amounts (in micromols NO3 per liter):

ContinentCalculationCoefficient
Asia-Pacific0.6156 - 0.22590.3897
America0.6156 - 0.10540.5102
Africa0.6156 (base)0.6156
Europe0.6156 + 0.26230.8779

Result: All coefficients are insignificant at the 5% level. This suggests the model might be over-parameterized and could be simplified.

Model Quality: This model explains the response variable slightly better than the previous regressions, as it accounts for continents and is more flexible at higher values. However, it is much more complex and still inaccurate at high values.


Task 5: Final Model Selection

Find a suitable final model that explains the behavior well but does not contain insignificant components.

Approach: We will start with a complex model including all regressors and then simplify it sequentially. The goal is a model that explains the data well without unnecessary complexity.

Initial Model: We start with a model containing the previous two regressors plus other features from the dataset. We also add NPrec:Continent and Prec:Continent interactions, as precipitation might have different effects on NO3 concentration in different regions.

fit_overparam <- lm(NO3 ~ Density + Density:Continent + Continent +
                    Discharge + Runoff + Area + NPrec + NPrec:Continent + Prec + Prec:Continent)
summary(fit_overparam)
Call:
lm(formula = NO3 ~ Density + Density:Continent + Continent + 
    Discharge + Runoff + Area + NPrec + NPrec:Continent + Prec + 
    Prec:Continent)

Residuals:
     Min       1Q   Median       3Q      Max 
-126.862   -9.479   -1.329    7.749  130.537 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                  -1.741e+01  1.241e+02  -0.140    0.890
Density                       2.456e-02  2.093e+00   0.012    0.991
ContinentAmerica              1.611e+01  1.275e+02   0.126    0.901
ContinentAsiaPacific          2.587e+01  1.563e+02   0.166    0.870
ContinentEurope               6.059e+01  1.387e+02   0.437    0.666
Discharge                    -2.048e-04  7.873e-04  -0.260    0.797
Runoff                        1.420e-01  1.406e+00   0.101    0.920
Area                          3.116e-06  1.657e-05   0.188    0.852
NPrec                         2.796e+00  5.009e+00   0.558    0.582
Prec                         -4.542e-02  4.546e-01  -0.100    0.921
Density:ContinentAmerica      2.757e-01  2.160e+00   0.128    0.900
Density:ContinentAsiaPacific  2.764e-01  2.107e+00   0.131    0.897
Density:ContinentEurope       1.070e+00  2.101e+00   0.509    0.615
ContinentAmerica:NPrec       -2.157e+00  5.491e+00  -0.393    0.698
ContinentAsiaPacific:NPrec   -2.336e-01  6.618e+00  -0.035    0.972
ContinentEurope:NPrec        -3.493e+00  5.098e+00  -0.685    0.500
ContinentAmerica:Prec         9.982e-02  5.189e-01   0.192    0.849
ContinentAsiaPacific:Prec    -1.558e-01  9.208e-01  -0.169    0.867
ContinentEurope:Prec         -8.138e-01  9.178e-01  -0.887    0.384

Residual standard error: 49.12 on 23 degrees of freedom
Multiple R-squared:  0.8472, Adjusted R-squared:  0.7276 
F-statistic: 7.083 on 18 and 23 DF,  p-value: 1.216e-05

Observation: The R-squared of 0.8472 is very good, but all coefficients are insignificant, so we should simplify the model.

Model-Submodel Test: We use a stepwise selection process (based on AIC) to reduce the full model.

step(lm(NO3 ~ 1), trace=T, scope = list(
    lower = ~1,
    upper = ~NO3 ~ Density + Density:Continent + Continent + Discharge + Runoff + Area + NPrec + NPrec:Continent + Prec + Prec:Continent
))
Start:  AIC=382.72
NO3 ~ 1

            Df Sum of Sq    RSS    AIC
+ Density    1    256842 106294 333.12
+ NPrec      1    168973 194163 358.43
+ Continent  3     88927 274209 376.93
+ Area       1     27124 336011 381.46
<none>                   363135 382.72
+ Discharge  1     11001 352134 383.43
+ Prec       1     10049 353087 383.55
+ Runoff     1      3984 359151 384.26

Step:  AIC=333.12
NO3 ~ Density

            Df Sum of Sq    RSS    AIC
+ NPrec      1      9395  96899 331.24
<none>                   106294 333.12
+ Prec       1      4244 102050 333.41
+ Continent  3     12904  93390 333.69
+ Runoff     1      3034 103259 333.91
+ Discharge  1       269 106024 335.02
+ Area       1       113 106180 335.08
- Density    1    256842 363135 382.72

Step:  AIC=331.24
NO3 ~ Density + NPrec

            Df Sum of Sq    RSS    AIC
<none>                    96899 331.24
+ Runoff     1      1744  95154 332.47
+ Prec       1      1014  95885 332.80
+ Area       1       275  96624 333.12
- NPrec      1      9395 106294 333.12
+ Discharge  1        53  96846 333.21
+ Continent  3      6137  90762 334.49
- Density    1     97264 194163 358.43

Call:
lm(formula = NO3 ~ Density + NPrec)

Coefficients:
(Intercept)      Density        NPrec  
   -16.1835       0.6282       1.1965  
fit_reduced_1 <- lm(NO3 ~ Density + NPrec)
summary(fit_reduced_1)
Call:
lm(formula = NO3 ~ Density + NPrec)

Residuals:
     Min       1Q   Median       3Q      Max 
-103.440  -14.440    2.526   17.014  211.923 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -16.1835    12.2300  -1.323   0.1935    
Density       0.6282     0.1004   6.257 2.28e-07 ***
NPrec         1.1965     0.6153   1.945   0.0591 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 49.85 on 39 degrees of freedom
Multiple R-squared:  0.7332, Adjusted R-squared:  0.7195 
F-statistic: 53.58 on 2 and 39 DF,  p-value: 6.484e-12

Commentary: The stepwise algorithm stopped at the model with the lowest AIC: NO3 ~ Density + NPrec. However, this model still has insignificant components at the 5% level, which we will now remove manually.

# remove intercept
fit_reduced_2 <- lm(NO3 ~ 0 + Density + NPrec)
summary(fit_reduced_2)
Call:
lm(formula = NO3 ~ 0 + Density + NPrec)

Residuals:
     Min       1Q   Median       3Q      Max 
-101.327  -18.167   -6.716    4.405  224.464 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
Density   0.6280     0.1013   6.197 2.49e-07 ***
NPrec     0.7265     0.5072   1.433     0.16    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 50.31 on 40 degrees of freedom
Multiple R-squared:  0.8076, Adjusted R-squared:  0.798 
F-statistic: 83.95 on 2 and 40 DF,  p-value: 4.833e-15
# remove NPrec
fit_reduced_3 <- lm(NO3 ~ 0 + Density)
summary(fit_reduced_3)
Call:
lm(formula = NO3 ~ 0 + Density)

Residuals:
     Min       1Q   Median       3Q      Max 
-132.824  -12.885    0.547    8.433  221.168 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
Density  0.74708    0.05875   12.72 8.15e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 50.95 on 41 degrees of freedom
Multiple R-squared:  0.7977, Adjusted R-squared:  0.7928 
F-statistic: 161.7 on 1 and 41 DF,  p-value: 8.151e-16

Quality Comparison: It is interesting to track the R-squared values during this simplification process.

FitModelR-squaredAdjusted R-squared
fit_overparamNO3 ~ …0.84720.7276
fit_reduced_1NO3 ~ Density + NPrec0.73320.7195
fit_reduced_2NO3 ~ 0 + Density + NPrec0.80760.7980
fit_reduced_3NO3 ~ 0 + Density0.79770.7928

The highest adjusted R-squared (which accounts for the number of parameters) belongs to fit_reduced_2. However, fit_reduced_3 is simpler and only marginally worse.

anova(fit_reduced_3,
      fit_reduced_2,
      fit_reduced_1,
      fit_overparam,
      test="F")
A anova: 4 x 6
Res.DfRSSDfSum of SqFPr(>F)
<dbl><dbl><dbl><dbl><dbl><dbl>
141106444.01NANANANA
240101249.3015194.7072.1529620.1558373
33996898.7014350.6051.8031210.1924385
42355494.841641403.8581.0724970.4293205

Model-Submodel Test Result: The F-test shows that there is no statistically significant difference between the models. Therefore, we can reduce the more complex models to the final model fit_reduced_3 (NO3 ~ 0 + Density).

Final Model Interpretation: At zero population density, the natural NO3 concentration is zero, and with each additional person per km², it increases by 0.74708 micromols per liter.

Final Model Quality: Like the other models, it is inaccurate at higher values. On the other hand, it is easily interpretable and, importantly, it has a permissible NO3 concentration value (non-negative) at Density = 0.

options(repr.plot.width = 14, repr.plot.height = 8)
plot(NO3 ~ 0 + Density)
abline(fit_reduced_3)

png


Task 6: Verification of Model Assumptions

For the final model, verify the assumptions of the methods used with appropriate tests.

fit_residual <- lm(NO3 ~ Density + Density:Continent + Continent +
          Discharge + Runoff + Area + NPrec + NPrec:Continent + Prec + Prec:Continent)
residuals <- resid(fit_residual)

options(repr.plot.width = 14, repr.plot.height = 5)
par(mfrow = c(1, 2))
plot(residuals)
hist(residuals)
plot(fit, which = 1)
plot(fit, which = 3)

png

png

To use the previous diagnostic methods, several assumptions must be verified.

Regression Model: Requires, in addition to a linear relationship, independent and identically distributed errors.

Analysis of Variance (ANOVA): Similar to LR, requires independence, homogeneity of variances, and, for the model selection method, normality of errors.

Normality of Residuals

The residuals plot shows they are distributed quite symmetrically around zero, but their center is not a straight line parallel to the x-axis.

Test: We examine the normality of the residuals using a Q-Q plot and the Shapiro-Wilk normality test:

\(H_0 :\) The residuals come from a normal distribution.

\(H_A :\) The residuals do not come from a normal distribution.

shapiro.test(residuals)
options(repr.plot.width = 10, repr.plot.height = 6)
plot(fit_residual, which = 2)
 Shapiro-Wilk normality test

data:  residuals
W = 0.79481, p-value = 3.439e-06

Conclusion (Shapiro-Wilk, Q-Q): At a 5% significance level, we reject the null hypothesis that the residuals come from a normal distribution. The result is consistent with the Q-Q plot, where the values of low and high quantiles differ significantly from normal. The assumptions for using linear regression and ANOVA methods were not met.

Solution: In linear regression, we can estimate parameters using the method of least squares and approximately test the significance of regressors using the T-statistic. For ANOVA, the Kruskal-Wallis test can be used, which can identify differences between several categories without assuming normality of residuals.

Homoscedasticity of Residuals

Test: We test the homogeneity of residual variances with the Breusch-Pagan test.

\(H_0 :\) The residuals have constant variance.

\(H_A :\) The residuals do not have constant variance.

bptest(fit_residual)
 studentized Breusch-Pagan test

data:  fit_residual
BP = 20.876, df = 18, p-value = 0.2857

Conclusion (BP test): At a 5% significance level, we do not reject the homogeneity of variances for the initial model.

bptest(fit_reduced_1)
 studentized Breusch-Pagan test

data:  fit_reduced_1
BP = 17.647, df = 2, p-value = 0.0001472

However, after the first reduction, this assumption is violated, and the residuals no longer have constant variance (we would reject the test). This is evident in the plots, where the data is clearly heteroscedastic. In such a case, it is possible to transform the response variable, for example, using a Box-Cox transformation.

Other Assumptions

Multicollinearity: The matrix \(\mathbf{X}^T\mathbf{X}\) can be singular or ill-conditioned, and its inverse may not exist, or the coefficient estimates may have high variance. In the initial analysis of fit_overparam, we observed a high correlation between Area and Discharge or NPrec and Density. In such a case, we could opt for ridge regression, which introduces a bias to improve the matrix’s condition.

However, we do not encounter a multicollinearity problem with our final model, as we use only one regressor without an intercept (making \(\mathbf{X}^T\mathbf{X}\) a 1x1 matrix).

Outliers: Outliers are problematic because they can strongly influence the final model. In our case, the outliers are relatively symmetrically distributed (caused by heteroscedasticity), but if we wanted a more robust model estimate, robust regression would be a suitable option.


detach(ex1221)