Dataset

K <- 16
L <- 3
M <- ((K + L) * 47)%%11 + 1
cat("K    =", K, "\n")
cat("L    =", L, "\n")
cat("M    =", M, "\n")
cat("YEAR =", 2011 + M)
K    = 16 
L    = 3 
M    = 3 
YEAR = 2014
M Dataset Year Description Source
3 nama_10_pc 2014 Main GDP aggregates per capita (eurostat database) [src]
# .libPaths('/tmp')

# install.packages("eurostat")
# install.packages("ggplot2")
# install.packages("GGally")
# install.packages("car")
# install.packages("lmtest")

library(eurostat)
library(ggplot2)
library(GGally)
library(car)
library(MASS)
library(lmtest)

Introduction

Gross Domestic Product (GDP) is a crucial economic indicator that measures the monetary value of all finished goods and services produced within a country's borders in a specific time period. Analyzing the distribution of GDP allows us to understand the economic performance and productivity of different nations. This task involves presenting the distribution of GDP numerically and graphically to highlight its characteristics, followed by a discussion on which country-specific data points could significantly impact GDP. By examining these elements, we gain deeper insights into the factors that drive economic growth and the disparities in economic output across countries.

id <- "nama_10_pc"
df <- get_eurostat(id = id)

hdp <- subset(df, TIME_PERIOD == "2014-01-01")
hdp <- subset(hdp, na_item == "B1GQ")
hdp <- subset(hdp, unit == "CP_EUR_HAB")
hdp <- subset(hdp, select = c("geo", "values"))
hdp <- subset(hdp, !grepl("^EA", geo))
hdp <- subset(hdp, !grepl("^EU", geo))
hdp <- label_eurostat(hdp)

Main GDP aggregates per capita

  • Code: nama_10_pc
  • Time: 2014
  • Time frequency: Annual
  • National accounts indicator: Gross domestic product at market prices
  • Unit of measure: Current prices, euro per capita

nama_10_pc_MAP_CNTR.png

hdp$geo
  1. 'Albania'
  2. 'Austria'
  3. 'Belgium'
  4. 'Bulgaria'
  5. 'Switzerland'
  6. 'Cyprus'
  7. 'Czechia'
  8. 'Germany'
  9. 'Denmark'
  10. 'Estonia'
  11. 'Greece'
  12. 'Spain'
  13. 'Finland'
  14. 'France'
  15. 'Croatia'
  16. 'Hungary'
  17. 'Ireland'
  18. 'Iceland'
  19. 'Italy'
  20. 'Liechtenstein'
  21. 'Lithuania'
  22. 'Luxembourg'
  23. 'Latvia'
  24. 'Montenegro'
  25. 'North Macedonia'
  26. 'Malta'
  27. 'Netherlands'
  28. 'Norway'
  29. 'Poland'
  30. 'Portugal'
  31. 'Romania'
  32. 'Serbia'
  33. 'Sweden'
  34. 'Slovenia'
  35. 'Slovakia'
  36. 'Türkiye'
  37. 'United Kingdom'

We observe that the Nordic countries, along with Ireland, Switzerland, and Luxembourg, have the highest GDP per capita among the selected countries. Western European countries also show high GDP per capita. In contrast, the lowest values are seen in Eastern Europe.

Data description

In this project, we examine the GDP of European countries in 2014, measured in euros per capita (CP_EUR_HAB) and expressed in market prices (B1GQ).

options(repr.plot.width = 10, repr.plot.height = 6)
plot.ecdf(hdp$values, xlab = "value")

options(repr.plot.width = 10, repr.plot.height = 6)
hist(hdp$values, breaks = 9, xlab = "value")
head(hdp[order(hdp$values, decreasing = T), ], 5)
tail(hdp[order(hdp$values, decreasing = T), ], 5)

options(repr.plot.width = 10, repr.plot.height = 3)
boxplot(hdp$values, horizontal = T, xlab = "value")
A tibble: 5 × 2
geovalues
<chr><dbl>
Liechtenstein133180
Luxembourg 92760
Norway 73670
Switzerland 66920
Denmark 47090
A tibble: 5 × 2
geovalues
<chr><dbl>
Bulgaria 5960
Montenegro 5560
Serbia 4970
North Macedonia4470
Albania 3450
cat("Count:", length(hdp$values), "\n")
cat("Var:", var(hdp$values), "\n")
cat("Std:", sd(hdp$values), "\n")
summary(hdp$values)
Count: 37 
Var: 730769837 
Std: 27032.75 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3450   10770   20170   29245   38990  133180 

We see that only a few countries have a very high GDP per capita, while most fall within the range of 1,000 to 4,000 EUR.

Factors affecting GDP

A country's GDP is influenced by many factors, some of which are listed below. The bolded items are chosen as regressors in the next section.

Economy and trade

  • Inflation rate and interest rates: These factors influence consumer and investor confidence. Low inflation rates boost confidence, contributing to economic growth.
  • Exports and imports: High exports indicate strong economic performance and self-sufficiency, whereas high imports suggest dependence on foreign trade and potential economic imbalance.

Labor market

  • Employment: Higher employment typically leads to higher GDP, as more people have income and greater purchasing power.
  • Income levels: Besides employment, the quality of jobs matters. A high rate of low-wage jobs does not necessarily increase GDP per capita. On the other hand people with higher incomes tend to spend more on goods and services.

Other factors

  • Happiness and satisfaction: A happier and more satisfied population usually performs better at work and tends to contribute to economic growth.
  • Adaptation to technological changes: A country's ability to adapt to technological advancements enhances its competitiveness, significantly boosts productivity, and fosters new opportunities in production, services, and trade.

  • Climatic conditions: Floods, droughts, or extreme temperatures can negatively impact agricultural production and infrastructure. Conversely, favorable climatic conditions can support various industries, such as tourism and agriculture.


Indicators

We choose the following regressors:

  • Employment rate
  • Average monthly income
  • Internet usage
  • Climate region

In the following section, we present the important properties of these selected indicators and investigate their relationship with GDP.

All indicators are statistically significant and explain the modeled variable well. At the standard significance level, we reject the hypothesis that the correlation between each regressor and GDP is zero.

id <- "lfsa_ergan"
emp <- get_eurostat(id = id)

emp <- subset(emp, TIME_PERIOD == "2014-01-01")
emp <- subset(emp, sex == "T")
emp <- subset(emp, citizen == "TOTAL")
emp <- subset(emp, age == "Y_GE25")
emp <- subset(emp, !grepl("^EU", geo))
emp <- subset(emp, !grepl("^EA", geo))
emp <- subset(emp, select = c("geo", "values"))
emp <- label_eurostat(emp)

# https://data.worldbank.org/indicator/SL.EMP.TOTL.SP.NE.ZS?locations=LI-AL
emp <- rbind(emp, c("Albania", 43.8))
emp <- rbind(emp, c("Liechtenstein", 59.4))

# ===========================================

id <- "earn_ses14_19"
earn <- get_eurostat(id = id)

earn <- subset(earn, cpayagr == "TOTAL")
earn <- subset(earn, currency == "EUR")
earn <- subset(earn, sizeclas == "GE10")
earn <- subset(earn, TIME_PERIOD == "2014-01-01")
earn <- subset(earn, indic_se == "ERN")
earn <- subset(earn, sex == "T")
earn <- subset(earn, nace_r2 == "B-S_X_O")
earn <- subset(earn, !grepl("^EU", geo))
earn <- subset(earn, !grepl("^EA", geo))
earn <- subset(earn, select = c("geo", "values"))
earn <- label_eurostat(earn)

# https://ndiqparate.al/?p=9775&lang=en
earn <- rbind(earn, c("Albania", 480))
# https://archiv.llv.li/files/as/jahrbuch-2017.pdf (p. 121)
earn <- rbind(earn, c("Liechtenstein", 6692))

# ===========================================

id <- "tin00028"
int <- get_eurostat(id = id)

int <- subset(int, indic_is == "I_IU3")
int <- subset(int, TIME_PERIOD == "2014-01-01")
int <- subset(int, !grepl("^EU", geo))
int <- subset(int, !grepl("^EA", geo))
int <- subset(int, select = c("geo", "values"))
int <- label_eurostat(int)

# https://data.worldbank.org/indicator/IT.NET.USER.ZS?locations=AL-LI-ME-RS
int <- rbind(int, c("Albania", 54))
int <- rbind(int, c("Liechtenstein", 95))
int <- rbind(int, c("Montenegro", 61))
int <- rbind(int, c("Serbia", 62))

# ===========================================

countries <- c(
    "Albania", "Greece", "Italy", "Cyprus", "Croatia", "Spain", "North Macedonia", "Malta",
    "Slovenia", "Serbia", "Türkiye", "Montenegro", "Portugal",
    
    "Austria", "Belgium", "Bulgaria", "Switzerland", "Liechtenstein", "Luxembourg", "Netherlands",
    "Slovakia", "United Kingdom", "Poland", "Romania", "Czechia", "Germany", "France", "Hungary",
    "Ireland",
    
    "Denmark", "Estonia", "Finland", "Iceland", "Latvia", "Lithuania", "Norway", "Sweden"
)
climate_zones <- c(rep("Mediterranean climate", 13), rep("Temperate", 16), rep("Cold", 8))
clima <- data.frame(geo = countries, climate = climate_zones)

# ===========================================

colnames(emp)[colnames(emp) == "values"] <- "employment"
colnames(earn)[colnames(earn) == "values"] <- "earnings"
colnames(int)[colnames(int) == "values"] <- "internet"
colnames(clima)[colnames(clima) == "values"] <- "climate"

df <- merge(hdp, emp, by = "geo")
df <- merge(df, earn, by = "geo")
df <- merge(df, int, by = "geo")
df <- merge(df, clima, by = "geo")

df$employment <- as.double(df$employment)
df$earnings <- as.double(df$earnings)
df$internet <- as.double(df$internet)
df$climate <- as.factor(df$climate)

Employment rate

  • Code: lfsa_ergan
  • Name: Employment rates by sex, age and citizenship
  • Time: 2014
  • Sex: Total
  • Age: 25 years and over
  • Country of citizenship: Total
  • Time frequency: Annual
  • Unit of measure: Percentage

lfsa_ergan_MAP_CNTR.png

head(df[order(df$employment, decreasing = T), c("geo", "employment")], 5)
tail(df[order(df$employment, decreasing = T), c("geo", "employment")], 5)
A data.frame: 5 × 2
geoemployment
<chr><dbl>
15Iceland 79.8
35Switzerland65.0
26Norway 64.9
34Sweden 62.6
21Luxembourg 62.3
A data.frame: 5 × 2
geoemployment
<chr><dbl>
17Italy 46.4
30Serbia 46.1
25North Macedonia45.4
1Albania 43.8
13Greece 41.4

Regressor analysis

cat("Var:", var(df$employment), "\n")
cat("Std:", sd(df$employment), "\n")
summary(df$employment)
Var: 52.02444 
Std: 7.212797 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   41.4    51.6    55.8    55.5    59.4    79.8 
options(repr.plot.width = 10, repr.plot.height = 6)
plot.ecdf(df$employment, xlab = "employment [%]")

options(repr.plot.width = 10, repr.plot.height = 6)
hist(df$employment, xlab = "employment [%]")

options(repr.plot.width = 10, repr.plot.height = 3)
boxplot(df$employment, horizontal = T, xlab = "employment [%]")

Dependence on GDP

cor.test(df$values, df$employment)
	Pearson's product-moment correlation

data:  df$values and df$employment
t = 3.6397, df = 35, p-value = 0.0008728
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2408721 0.7249318
sample estimates:
      cor 
0.5239948 
fit <- lm(df$values ~ df$employment)
summary(fit)

options(repr.plot.width = 10, repr.plot.height = 6)
plot(df$values ~ df$employment)
abline(fit)
Call:
lm(formula = df$values ~ df$employment)

Residuals:
   Min     1Q Median     3Q    Max 
-35047 -14744  -3677   7849  96276 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -79749.9    30191.4  -2.641 0.012255 *  
df$employment   1963.9      539.6   3.640 0.000873 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23350 on 35 degrees of freedom
Multiple R-squared:  0.2746,	Adjusted R-squared:  0.2538 
F-statistic: 13.25 on 1 and 35 DF,  p-value: 0.0008728

Average monthly income

  • Code: earn_ses14_19
  • Name: Mean monthly earnings by sex, economic activity and collective pay agreement
  • Time: 2014
  • Sex: Total
  • collective pay agreement: Total
  • Currency: Euro
  • Economic activities: Industry, construction and services
  • Indicator: Gross earnings
  • Time frequency: Annual

earn_ses14_19_MAP_CNTR.png

head(df[order(df$earnings, decreasing = T), c("geo", "earnings")], 5)
tail(df[order(df$earnings, decreasing = T), c("geo", "earnings")], 5)
A data.frame: 5 × 2
geoearnings
<chr><dbl>
19Liechtenstein6692
35Switzerland 6011
26Norway 5031
21Luxembourg 4206
8Denmark 4194
A data.frame: 5 × 2
geoearnings
<chr><dbl>
30Serbia 574
29Romania 521
25North Macedonia494
1Albania 480
4Bulgaria 431

Regressor analysis

cat("Var:", var(df$earnings), "\n")
cat("Std:", sd(df$earnings), "\n")
summary(df$earnings)
Var: 2644510 
Std: 1626.195 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    431     811    1720    2201    3151    6692 
options(repr.plot.width = 10, repr.plot.height = 6)
plot.ecdf(df$earnings, xlab = "earnings [EUR]")

options(repr.plot.width = 10, repr.plot.height = 6)
hist(df$earnings, xlab = "earnings [EUR]")

options(repr.plot.width = 10, repr.plot.height = 3)
boxplot(df$earnings, horizontal = T, xlab = "earnings [EUR]")

Dependence on GDP

cor.test(df$values, df$earnings)
	Pearson's product-moment correlation

data:  df$values and df$earnings
t = 14.946, df = 35, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8669865 0.9635352
sample estimates:
      cor 
0.9298047 
fit <- lm(df$values ~ df$earnings)
summary(fit)

options(repr.plot.width = 10, repr.plot.height = 6)
plot(df$values ~ df$earnings)
abline(fit)
Call:
lm(formula = df$values ~ df$earnings)

Residuals:
   Min     1Q Median     3Q    Max 
-21215  -5599   -857   3008  34519 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4773.653   2816.531  -1.695    0.099 .  
df$earnings    15.456      1.034  14.946   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10090 on 35 degrees of freedom
Multiple R-squared:  0.8645,	Adjusted R-squared:  0.8607 
F-statistic: 223.4 on 1 and 35 DF,  p-value: < 2.2e-16

Internet use by individuals

  • Code: tin00028
  • Name: Internet use by individuals
  • Time: 2014
  • Individual type: All individuals
  • Last internet use: in last 3 months
  • Time frequency: Annual
  • Unit of measure: Percantage of individuals

tin00028_MAP_CNTR.png

Regressor analysis

cat("Var:", var(df$internet), "\n")
cat("Std:", sd(df$internet), "\n")
summary(df$internet)
Var: 193.9805 
Std: 13.92769 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  48.49   66.60   76.19   76.80   89.73   98.16 
options(repr.plot.width = 10, repr.plot.height = 6)
plot.ecdf(df$internet, xlab = "internet usage [%]")

options(repr.plot.width = 10, repr.plot.height = 6)
hist(df$internet, xlab = "internet usage [%]")

options(repr.plot.width = 10, repr.plot.height = 3)
boxplot(df$internet, horizontal = T, xlab = "internet usage [%]")

Dependence on GDP

cor.test(df$values, df$internet)
	Pearson's product-moment correlation

data:  df$values and df$internet
t = 6.1225, df = 35, p-value = 5.327e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5151387 0.8460142
sample estimates:
      cor 
0.7191251 
fit <- lm(df$values ~ df$internet)
summary(fit)

options(repr.plot.width = 10, repr.plot.height = 6)
plot(df$values ~ df$internet)
abline(fit)
Call:
lm(formula = df$values ~ df$internet)

Residuals:
   Min     1Q Median     3Q    Max 
-24393 -12278  -4654   6025  78528 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -77947      17786  -4.383 0.000102 ***
df$internet     1396        228   6.122 5.33e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19050 on 35 degrees of freedom
Multiple R-squared:  0.5171,	Adjusted R-squared:  0.5033 
F-statistic: 37.48 on 1 and 35 DF,  p-value: 5.327e-07

Climate area

Europe_Climate_Map_zones!.jpg

source: https://www.barenbrug.biz/forage/european-climate-zones

There are four climate zones on the map:

  1. Nordic climate (cold winters and mild humid summers)
  2. Eastern-continental climate (cold winters and hot summers)
  3. Oceanic climate (mild winters and humid summers)
  4. Mediterranean climate

For simplicity, we merge categories 2 and 3 into one climate zone.

Regressor analysis

tapply(df$values, df$climate, summary)
$Cold
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  11850   14550   39900   35632   45470   73670 

$`Mediterranean climate`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3450    5560   16270   13776   20170   26980 

$Temperate
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5960   13260   35865   38620   40368  133180 
options(repr.plot.width = 10, repr.plot.height = 6)
barplot(table(df$climate), xlab = "climate", ylab = "Frequency")

Dependence on GDP

fit <- lm(df$values ~ df$climate)
summary(fit)

options(repr.plot.width = 10, repr.plot.height = 8)
boxplot(df$values ~ df$climate, ylab = "value", xlab = "climate")
Call:
lm(formula = df$values ~ df$climate)

Residuals:
   Min     1Q Median     3Q    Max 
-32660 -10326  -2470   6394  94560 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        35632       8883   4.011 0.000314 ***
df$climateMediterranean climate   -21856      11290  -1.936 0.061240 .  
df$climateTemperate                 2988      10880   0.275 0.785291    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 25130 on 34 degrees of freedom
Multiple R-squared:  0.1841,	Adjusted R-squared:  0.1361 
F-statistic: 3.836 on 2 and 34 DF,  p-value: 0.03146

Test: We examine whether the mean value of GDP is the same for all climate regions or if it varies significantly.

$H_0 : \mu_\text{cold} = \mu_\text{temperate} = \mu_\text{meditteranean}$

$H_A : H_0 \text{ does not hold.}$

aov(df$values ~ df$climate)
anova(aov(df$values ~ df$climate))
Call:
   aov(formula = df$values ~ df$climate)

Terms:
                 df$climate   Residuals
Sum of Squares   4843358867 21464355258
Deg. of Freedom           2          34

Residual standard error: 25125.77
Estimated effects may be unbalanced
A anova: 2 × 5
DfSum SqMean SqF valuePr(>F)
<int><dbl><dbl><dbl><dbl>
df$climate 2 484335886724216794333.8359920.0314623
Residuals3421464355258 631304566 NA NA

Conclusion: At the 5% significance level, we reject the null hypothesis in favor of the alternative that mean GDP varies by climate region.

Dependencies between regressors

d <- subset(df, select = c("employment", "earnings", "internet", "climate"))

cor(d[c(1, 2, 3)])

options(repr.plot.width = 12, repr.plot.height = 8)
ggpairs(d, lower = list(combo = "box"))
A matrix: 3 × 3 of type dbl
employmentearningsinternet
employment1.00000000.54643160.7414445
earnings0.54643161.00000000.7712457
internet0.74144450.77124571.0000000

We observe that the correlations between the numerical indicators are statistically significant. Additionally, there is a significant dependence on the categorical indicator (climate zone).

aov(df$employment ~ df$climate)
anova(aov(df$employment ~ df$climate))
Call:
   aov(formula = df$employment ~ df$climate)

Terms:
                df$climate Residuals
Sum of Squares    939.3983  933.4817
Deg. of Freedom          2        34

Residual standard error: 5.239785
Estimated effects may be unbalanced
A anova: 2 × 5
DfSum SqMean SqF valuePr(>F)
<int><dbl><dbl><dbl><dbl>
df$climate 2939.3983469.6991317.107757.229859e-06
Residuals34933.4817 27.45535 NA NA
aov(df$earnings ~ df$climate)
anova(aov(df$earnings ~ df$climate))
Call:
   aov(formula = df$earnings ~ df$climate)

Terms:
                df$climate Residuals
Sum of Squares    17438893  77763451
Deg. of Freedom          2        34

Residual standard error: 1512.336
Estimated effects may be unbalanced
A anova: 2 × 5
DfSum SqMean SqF valuePr(>F)
<int><dbl><dbl><dbl><dbl>
df$climate 21743889387194463.8123460.03207555
Residuals34777634512287160 NA NA
aov(df$internet ~ df$climate)
anova(aov(df$internet ~ df$climate))
Call:
   aov(formula = df$internet ~ df$climate)

Terms:
                df$climate Residuals
Sum of Squares    3195.714  3787.583
Deg. of Freedom          2        34

Residual standard error: 10.5546
Estimated effects may be unbalanced
A anova: 2 × 5
DfSum SqMean SqF valuePr(>F)
<int><dbl><dbl><dbl><dbl>
df$climate 23195.7141597.856914.343483.041784e-05
Residuals343787.583 111.3995 NA NA

Statistical modeling

Using a linear regression model or its variants, we will examine the dependence of GDP on all chosen regressors. We will practically interpret the values of the regression coefficients and evaluate the quality of the model. Additionally, we will identify outlying observations and investigate multicollinearity. We will test the model assumptions and, if they are not met, design and test methods that compensate for the failure or dispense with the assumptions. Using appropriate tools, we will aim to find a final sub-model that explains GDP behavior well but does not contain insignificant components. The results will be interpreted practically.

Dependence of GDP on all regressors

fit <- lm(df$values ~ df$employment + df$earnings + df$internet + df$climate)
summary(fit)

options(repr.plot.width = 10, repr.plot.height = 6)
plot(fit, which = 1)
plot(fit, which = 3)
Call:
lm(formula = df$values ~ df$employment + df$earnings + df$internet + 
    df$climate)

Residuals:
   Min     1Q Median     3Q    Max 
-22452  -3813   -697   2572  34164 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     -10160.722  25819.538  -0.394    0.697    
df$employment                      103.315    401.955   0.257    0.799    
df$earnings                         15.163      1.803   8.409 1.69e-09 ***
df$internet                        -19.648    273.609  -0.072    0.943    
df$climateMediterranean climate    917.870   7153.894   0.128    0.899    
df$climateTemperate               3433.730   5092.107   0.674    0.505    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10590 on 31 degrees of freedom
Multiple R-squared:  0.8679,	Adjusted R-squared:  0.8466 
F-statistic: 40.72 on 5 and 31 DF,  p-value: 1.004e-12

Result: It appears that except for the variable "earnings," all coefficients are statistically insignificant at the 5% significance level, suggesting that they can be individually omitted from the model. It is possible to further explore the model by gradually dropping regressors, clustering them more broadly, or returning to simpler models to refine our understanding.

Interpretation:

  • For every 1% increase in the employment rate, GDP increases by €103.
  • For every €1 increase in average monthly income, GDP increases by €15.
  • For every 1% increase in internet usage among the population, GDP decreases by €20.
  • In Nordic countries, when all regressors are at zero, GDP at market prices is estimated to be €-10,160 per capita. Compared to this baseline, GDP is €917 higher in Mediterranean countries and €3,433 higher in temperate climate countries.

Quality of the model: The current model is inadequate due to the abovementioned issues. Initially, the analysis suggests a multicollinearity problem, where variables are highly correlated with each other. This can distort the interpretation of individual coefficients and lead to incorrect conclusions about their significance. Additionally, the model shows inconsistencies, such as the unexpected negative relationship between internet usage and GDP, which contradicts intuitive expectations. Furthermore, the model appears to be sensitive to outliers, particularly affecting higher GDP values, which warrants further investigation in the subsequent sections.

Outliers

plot(fit, which = 4)
df[c(19, 21, 35), ]
A data.frame: 3 × 6
geovaluesemploymentearningsinternetclimate
<chr><dbl><dbl><dbl><dbl><fct>
19Liechtenstein13318059.4669295.00Temperate
21Luxembourg 9276062.3420694.67Temperate
35Switzerland 6692065.0601189.73Temperate

Outliers are Liechtenstein, Luxembourg and Switzerland. These are relatively small countries with very strong financial sectors.

Multicollinearity

Conditionality of the matrix $X^TX$

Xnum <- df[, c("employment", "earnings", "internet")]
X <- as.matrix(Xnum)
XTX <- t(X) %*% X

XTX
cat("Singular values : ", round(svd(XTX)$d), "\n")
cat("Condition number: ", round(kappa(XTX, norm="2")))
A matrix: 3 × 3 of type dbl
employmentearningsinternet
employment 115842.1 4750378 160384.7
earnings4750378.12744363776882837.6
internet 160384.7 6882838 225202.8
Singular values :  274691304 85341 777 
Condition number:  497733

The matrix $X^TX$ is very ill-conditioned (497733 $\gg$ 1).

Correlation

round(cor(Xnum), 3)
A matrix: 3 × 3 of type dbl
employmentearningsinternet
employment1.0000.5460.741
earnings0.5461.0000.771
internet0.7410.7711.000

The regressors are strongly correlated.

Variance inflation factor

data.frame(round(vif(lm(df$values ~ df$employment + df$earnings + df$internet)), 3))
A data.frame: 3 × 1
round.vif.lm.df.values...df.employment...df.earnings...df.internet....
<dbl>
df$employment2.229
df$earnings2.477
df$internet3.858

VIF is not too large for any of the regressors.

Ridge regression

To further explore multicollinearity, we will apply ridge regression and compare it with standard regression without regularization.

fit_rr <- lm.ridge(df$values ~ df$employment + df$earnings + df$internet + df$climate, lambda = 10)
fit_lm <- lm(df$values ~ df$employment + df$earnings + df$internet + df$climate)

data.frame(fit_lm$coef, coef(fit_rr, scale = FALSE))
A data.frame: 6 × 2
fit_lm.coefcoef.fit_rr..scale...FALSE.
<dbl><dbl>
(Intercept)-10160.72250-30059.60907
df$employment 103.31457 189.79038
df$earnings 15.16323 9.93583
df$internet -19.64774 322.65346
df$climateMediterranean climate 917.87002 868.71395
df$climateTemperate 3433.72978 4206.47306

We see an improvement in terms of interpretation. Internet usage has a positive coefficient in the ridge regression model.

Model assumptions

In order to apply some diagnostic methods to submodel selection, several assumptions need to be verified.

Regression Model: requires independent and equally distributed variances in addition to a linear relationship between the explained and explanatory variables.

Analysis of variance: similar to LR requires independence, homogeneity of variances and, for the model selection method, additionally normality of variances.

fit <- lm(df$values ~ df$employment + df$earnings + df$internet + df$climate)
residuals <- resid(fit)

options(repr.plot.width = 14, repr.plot.height = 5)
par(mfrow = c(1, 2))
plot(residuals)
hist(residuals, 10)
plot(fit, which = 1)
# plot(fit, which = 2)
plot(fit, which = 3)
# plot(fit, which = 4)

Residue normality

The residue plot shows that the residues are distributed quite symmetrically around zero.

Test: We examine the normality of the residuals using a Q-Q plot and the Shapiro-Wilk normality test:

$H_0:$ The residuals come from a normal distribution.\ $H_A:$ The residuals do not come from a normal distribution.

shapiro.test(residuals)
options(repr.plot.width = 10, repr.plot.height = 6)
plot(fit, which = 2)
	Shapiro-Wilk normality test

data:  residuals
W = 0.79839, p-value = 1.21e-05

At the 5% significance level, we reject the normality of the residuals.

Homogeneity of variance of residuals

Test: The homogeneity of the residue variance is tested by the Breusch-Pagan test.

$H_0:$ The residuals have identical variances.\ $H_A:$ The residuals do not have identical variances.

bptest(fit)
	studentized Breusch-Pagan test

data:  fit
BP = 18.577, df = 5, p-value = 0.002303

Conclusion: At the 5% significance level, the fit of the variances of the initial model is rejected.

Correction

Since the residuals increase with the explanatory variable, we transform the GDP values. Here we use the square root and remove the outliers that are very different from the rest of the dataset.

df2 <- df[-c(19, 21, 35),]
df2$values <- sqrt(df2$values)
fit2 <- lm(df2$values ~ df2$employment + df2$earnings + df2$internet + df2$climate)
summary(fit2)
residuals2 <- resid(fit2)

options(repr.plot.width = 14, repr.plot.height = 5)
par(mfrow = c(1, 2))
plot(residuals2)
hist(residuals2, 10)
plot(fit2, which = 1)
plot(fit2, which = 2)
Call:
lm(formula = df2$values ~ df2$employment + df2$earnings + df2$internet + 
    df2$climate)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.9576  -6.0405   0.1392   7.0079  16.2545 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      -4.54867   25.28852  -0.180   0.8585    
df2$employment                    0.97844    0.39462   2.479   0.0194 *  
df2$earnings                      0.03548    0.00229  15.494 2.89e-15 ***
df2$internet                      0.31952    0.28165   1.134   0.2662    
df2$climateMediterranean climate  3.10285    6.96468   0.446   0.6594    
df2$climateTemperate              3.46487    5.08587   0.681   0.5013    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.15 on 28 degrees of freedom
Multiple R-squared:  0.969,	Adjusted R-squared:  0.9634 
F-statistic: 174.8 on 5 and 28 DF,  p-value: < 2.2e-16

Residue normality

shapiro.test(residuals2)
options(repr.plot.width = 10, repr.plot.height = 6)
plot(fit2, which = 2)
	Shapiro-Wilk normality test

data:  residuals2
W = 0.95774, p-value = 0.2088

Conclusion: Here, at the 5% significance level, we fail to reject the normality of residuals.

Homogeneity of variance of residuals

bptest(fit2)
	studentized Breusch-Pagan test

data:  fit2
BP = 6.0235, df = 5, p-value = 0.3039

Conclusion: Even here, at the 5% significance level, we fail to reject the homogeneity of residue variance.

Choosing the final model

Selecting the final model: Now we proceed to the selection of the final model. Such a model should achieve high accuracy. However, we prefer one that is both simple and easy to interpret.

Initially, we will consider a model with all regressors. We will simplify this model as we go along. The final model will be achieved if the reduced version explains the variable under study well while containing no insignificant components

Model: We will consider an all-flags model with a transformed explained variable and no outliers.

fit_overparam <- lm(df2$values ~ df2$employment + df2$earnings + df2$internet + df2$climate)
summary(fit_overparam)
Call:
lm(formula = df2$values ~ df2$employment + df2$earnings + df2$internet + 
    df2$climate)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.9576  -6.0405   0.1392   7.0079  16.2545 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      -4.54867   25.28852  -0.180   0.8585    
df2$employment                    0.97844    0.39462   2.479   0.0194 *  
df2$earnings                      0.03548    0.00229  15.494 2.89e-15 ***
df2$internet                      0.31952    0.28165   1.134   0.2662    
df2$climateMediterranean climate  3.10285    6.96468   0.446   0.6594    
df2$climateTemperate              3.46487    5.08587   0.681   0.5013    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.15 on 28 degrees of freedom
Multiple R-squared:  0.969,	Adjusted R-squared:  0.9634 
F-statistic: 174.8 on 5 and 28 DF,  p-value: < 2.2e-16

Observation: The coefficient of determination of 0.96 is very good, but a large part of the coefficients is insignificant and we should try to simplify the model.

Test model-submodel

We test whether a richer model can be reduced to a simpler submodel. We choose a sequential procedure in which we define the smallest and largest sets of parameters used:

  • smallest: HDP' ~ 1 (constant)
  • largest: HDP' ~ employment + earnings + internet + climate
step(lm(df2$values ~ 1), trace=T, scope = list(
    lower = ~1,
    upper = ~df2$values ~ df2$employment + df2$earnings + df2$internet + df2$climate
))
Start:  AIC=271.08
df2$values ~ 1

                 Df Sum of Sq   RSS    AIC
+ df2$earnings    1     88288  4714 171.69
+ df2$internet    1     64228 28774 233.19
+ df2$employment  1     35915 57087 256.48
+ df2$climate     2     24387 68615 264.74
<none>                        93002 271.08

Step:  AIC=171.69
df2$values ~ df2$earnings

                 Df Sum of Sq   RSS    AIC
+ df2$employment  1      1661  3053 158.92
+ df2$internet    1      1089  3625 164.75
+ df2$climate     2       730  3984 169.97
<none>                         4714 171.69
- df2$earnings    1     88288 93002 271.08

Step:  AIC=158.92
df2$values ~ df2$earnings + df2$employment

                 Df Sum of Sq   RSS    AIC
<none>                         3053 158.92
+ df2$internet    1       118  2935 159.58
+ df2$climate     2        33  3020 162.55
- df2$employment  1      1661  4714 171.69
- df2$earnings    1     54034 57087 256.48
Call:
lm(formula = df2$values ~ df2$earnings + df2$employment)

Coefficients:
   (Intercept)    df2$earnings  df2$employment  
       8.96734         0.03727         1.15468  
fit_reduced_1 <- lm(df2$values ~ df2$earnings + df2$employment)
summary(fit_reduced_1)
Call:
lm(formula = df2$values ~ df2$earnings + df2$employment)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.6950  -7.1481   0.3365   9.1403  16.8266 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     8.967342  14.205352   0.631 0.532495    
df2$earnings    0.037269   0.001591  23.423  < 2e-16 ***
df2$employment  1.154683   0.281167   4.107 0.000271 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.924 on 31 degrees of freedom
Multiple R-squared:  0.9672,	Adjusted R-squared:  0.9651 
F-statistic: 456.7 on 2 and 31 DF,  p-value: < 2.2e-16

Comment: The sequential selection algorithm stopped at the model with the lowest AIC: GDP' ~ earnings + employment. However, such a model still has a statistically insignificant intercept at the 5% significance level.

# -intercept
fit_reduced_2 <- lm(df2$values ~ 0 + df2$earnings + df2$employment)
summary(fit_reduced_2)
Call:
lm(formula = df2$values ~ 0 + df2$earnings + df2$employment)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.1454  -7.1348   0.7599   8.8256  16.7243 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
df2$earnings   0.036914   0.001474   25.04   <2e-16 ***
df2$employment 1.327927   0.060566   21.93   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.83 on 32 degrees of freedom
Multiple R-squared:  0.9961,	Adjusted R-squared:  0.9958 
F-statistic:  4068 on 2 and 32 DF,  p-value: < 2.2e-16

This simplification got rid of all the irrelevant components.

Quality comparison: It may be interesting to observe the values of the coefficient of determination during this gradual simplification.

Fit Model R-squared Adjusted R-squared
fit_overparam HDP' ~ ... 0.969 0.963
fit_reduced_1 HDP' ~ earnings + employment 0.967 0.965
fit_reduced_2 HDP' ~ 0 + earnings + employment 0.996 0.996

The last model examined has the highest adjusted coefficient of determination.

Test model-submodel:

anova(fit_reduced_2,
      fit_reduced_1,
      fit_overparam,
      test="F")
A anova: 3 × 6
Res.DfRSSDfSum of SqFPr(>F)
<dbl><dbl><dbl><dbl><dbl><dbl>
1323092.262NA NA NA NA
2313053.017 1 39.245590.38058830.5422729
3282887.311 3165.705690.53564940.6616838

Result of model-submodel test: More complex models can be reduced to the resulting submodel GDP' ~ 0 + earnings + employment without significant loss in accuracy.

Final model

m <- lm(df2$values ~ 0 + df2$earnings + df2$employment)
r <- resid(m)
summary(m)

options(repr.plot.width = 14, repr.plot.height = 6)
par(mfrow = c(1, 2))
plot(r)
plot(m, which = 3)
Call:
lm(formula = df2$values ~ 0 + df2$earnings + df2$employment)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.1454  -7.1348   0.7599   8.8256  16.7243 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
df2$earnings   0.036914   0.001474   25.04   <2e-16 ***
df2$employment 1.327927   0.060566   21.93   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.83 on 32 degrees of freedom
Multiple R-squared:  0.9961,	Adjusted R-squared:  0.9958 
F-statistic:  4068 on 2 and 32 DF,  p-value: < 2.2e-16

Interpretation: With zero income and zero employment, GDP is zero. For every percent of employment, the square root of GDP increases by 1.33 and for every euro in average income, GDP increases by 0.04.

Quality of the model: The final model is appropriate. It has a high coefficient of determination, does not make systematic errors and is relatively simple. In addition, unlike previous models, it models only valid values even with, for example, low values of the regressors.