Here’s a breakdown of the features used in this analysis:

  • Year: The year of observation.
  • Status: Indicates whether the country is Developed or Developing.
  • Life expectancy: Life expectancy in years – this is our target variable to predict.
  • Adult Mortality: Adult mortality rate (probability of dying between 15 and 60 years per 1,000 population).
  • infant deaths: Number of infant deaths per 1,000 population.
  • Alcohol: Recorded per capita consumption (15+) of pure alcohol (in liters).
  • percentage expenditure: Percentage of gross domestic product (GDP) spent on health per capita (%).
  • Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%).
  • Measles: Number of reported measles cases per 1,000 population.
  • BMI: Average Body Mass Index of the entire population.
  • under-five deaths: Number of under-five deaths per 1,000 population.
  • Polio: Polio (Pol3) immunization coverage among 1-year-olds (%).
  • Total expenditure: Government health expenditure as a percentage of total government expenditure (%).
  • Diphtheria: Diphtheria, Tetanus, and Pertussis (DTP3) immunization coverage among 1-year-olds (%).
  • HIV/AIDS: Deaths per 1,000 live births due to HIV/AIDS (0-4 years).
  • GDP: Gross Domestic Product per capita (in USD).
  • Population: Country’s population.
  • thinness 1-19 years: Prevalence of thinness among children and adolescents aged 10-19 (BMI more than 2 standard deviations below the median) (%). Despite the column name, the age range is 10-19.
  • thinness 5-9 years: Prevalence of thinness among children aged 5-9 (BMI more than 2 standard deviations below the median) (%).
  • Income composition of resources: Human Development Index (HDI) based on income composition of resources (index ranging from 0 to 1).
  • Schooling: Number of years of schooling (years).

📦 Library Imports

This section handles the necessary imports, sets a random seed for reproducibility, and configures plotting styles.

from itertools import chain

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import missingno as msno
import pycountry_convert as pcc

from matplotlib.ticker import MaxNLocator
from matplotlib.ticker import FormatStrFormatter
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeRegressor
from sklearn.base import RegressorMixin
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.utils import resample
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.compose import ColumnTransformer
np.random.seed(42)
plt.style.use('ggplot')
red =  np.array((226/255, 74/255, 51/255))
blue = np.array((52/255, 138/255, 189/255))
grey = np.array((100/255, 100/255, 100/255))
cobalt = np.array((0/255, 71/255, 171/255))
main_color = cobalt
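
The global seed matters because scikit-learn falls back to NumPy's global random state whenever `random_state` is left unset, as it is in the `train_test_split` calls later on. A minimal sketch on toy data (the `toy` array and the `split_with_global_seed` helper are illustrative, not part of the notebook) compares that approach with passing an explicit `random_state`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy = np.arange(20)  # illustrative stand-in for the real dataset

def split_with_global_seed(seed):
    # With random_state unset, scikit-learn draws from NumPy's global RNG,
    # so reseeding it right before the call reproduces the same split.
    np.random.seed(seed)
    return train_test_split(toy, test_size=0.3)

train_a, test_a = split_with_global_seed(42)
train_b, test_b = split_with_global_seed(42)

# An explicit random_state makes each call reproducible on its own,
# regardless of what else consumed the global RNG in between.
train_c, test_c = train_test_split(toy, test_size=0.3, random_state=42)
train_d, test_d = train_test_split(toy, test_size=0.3, random_state=42)
```

The explicit variant is usually the more robust choice in notebooks, where cells may be re-run out of order.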

⚙️ Data Preprocessing

This section covers loading the dataset, performing initial exploratory data analysis, visualizing features, and splitting the data into training, validation, and test sets.

📚 Dataset Loading

raw_data = pd.read_csv('data.csv')
raw_data.head().T

                                             0            1            2            3            4
Country                            Afghanistan  Afghanistan  Afghanistan  Afghanistan  Afghanistan
Year                                      2015         2014         2013         2012         2011
Status                              Developing   Developing   Developing   Developing   Developing
Life expectancy                           65.0         59.9         59.9         59.5         59.2
Adult Mortality                          263.0        271.0        268.0        272.0        275.0
infant deaths                               62           64           66           69           71
Alcohol                                   0.01         0.01         0.01         0.01         0.01
percentage expenditure               71.279624    73.523582    73.219243    78.184215     7.097109
Hepatitis B                               65.0         62.0         64.0         67.0         68.0
Measles                                   1154          492          430         2787         3013
BMI                                       19.1         18.6         18.1         17.6         17.2
under-five deaths                           83           86           89           93           97
Polio                                      6.0         58.0         62.0         67.0         68.0
Total expenditure                         8.16         8.18         8.13         8.52         7.87
Diphtheria                                65.0         62.0         64.0         67.0         68.0
HIV/AIDS                                   0.1          0.1          0.1          0.1          0.1
GDP                                  584.25921   612.696514   631.744976      669.959    63.537231
Population                          33736494.0     327582.0   31731688.0    3696958.0    2978599.0
thinness 1-19 years                       17.2         17.5         17.7         17.9         18.2
thinness 5-9 years                        17.3         17.5         17.7         18.0         18.2
Income composition of resources          0.479        0.476         0.47        0.463        0.454
Schooling                                 10.1         10.0          9.9          9.8          9.5

💡 Dataset Overview

pd.DataFrame([raw_data.count(), raw_data.nunique(), raw_data.dtypes], index=['Non-Null Count', 'Unique Values', 'Dtype']).T

                                   Non-Null Count  Unique Values    Dtype
Country                                      2718            183   object
Year                                         2718             16    int64
Status                                       2718              2   object
Life expectancy                              2718            359  float64
Adult Mortality                              2718            423  float64
infant deaths                                2718            195    int64
Alcohol                                      2564           1055  float64
percentage expenditure                       2718           2185  float64
Hepatitis B                                  2188             87  float64
Measles                                      2718            909    int64
BMI                                          2692            600  float64
under-five deaths                            2718            239    int64
Polio                                        2700             73  float64
Total expenditure                            2529            792  float64
Diphtheria                                   2700             81  float64
HIV/AIDS                                     2718            197  float64
GDP                                          2317           2317  float64
Population                                   2116           2110  float64
thinness 1-19 years                          2692            194  float64
thinness 5-9 years                           2692            200  float64
Income composition of resources              2576            613  float64
Schooling                                    2576            173  float64

raw_data.describe().T

                                    count          mean           std         min            25%           50%           75%           max
Year                               2718.0  2.007114e+03  4.537979e+00  2000.00000    2003.000000  2.007000e+03  2.011000e+03  2.015000e+03
Life expectancy                    2718.0  6.920453e+01  9.612530e+00    36.30000      63.100000  7.220000e+01  7.580000e+01  8.900000e+01
Adult Mortality                    2718.0  1.644323e+02  1.255128e+02     1.00000      73.250000  1.420000e+02  2.270000e+02  7.230000e+02
infant deaths                      2718.0  3.082524e+01  1.217866e+02     0.00000       0.000000  3.000000e+00  2.200000e+01  1.800000e+03
Alcohol                            2564.0  4.672512e+00  4.051664e+00     0.01000       0.990000  3.820000e+00  7.832500e+00  1.787000e+01
percentage expenditure             2718.0  7.570717e+02  2.007472e+03     0.00000       5.832385  6.768701e+01  4.468877e+02  1.947991e+04
Hepatitis B                        2188.0  8.088483e+01  2.501008e+01     1.00000      77.000000  9.200000e+01  9.700000e+01  9.900000e+01
Measles                            2718.0  2.371000e+03  1.117424e+04     0.00000       0.000000  1.800000e+01  3.720000e+02  2.121830e+05
BMI                                2692.0  3.831434e+01  1.995480e+01     1.00000      19.200000  4.345000e+01  5.610000e+01  7.760000e+01
under-five deaths                  2718.0  4.276748e+01  1.657044e+02     0.00000       0.000000  4.000000e+00  2.800000e+01  2.500000e+03
Polio                              2700.0  8.252815e+01  2.329438e+01     3.00000      77.000000  9.300000e+01  9.700000e+01  9.900000e+01
Total expenditure                  2529.0  5.943606e+00  2.488801e+00     0.37000       4.260000  5.730000e+00  7.530000e+00  1.760000e+01
Diphtheria                         2700.0  8.213593e+01  2.384957e+01     2.00000      78.000000  9.300000e+01  9.700000e+01  9.900000e+01
HIV/AIDS                           2718.0  1.788263e+00  5.221587e+00     0.10000       0.100000  1.000000e-01  8.000000e-01  5.060000e+01
GDP                                2317.0  7.646460e+03  1.445559e+04     1.68135     459.291200  1.741143e+03  6.337883e+03  1.191727e+05
Population                         2116.0  1.261063e+07  6.238395e+07    34.00000  182922.000000  1.365022e+06  7.383590e+06  1.293859e+09
thinness 1-19 years                2692.0  4.892236e+00  4.434584e+00     0.10000       1.600000  3.400000e+00  7.200000e+00  2.770000e+01
thinness 5-9 years                 2692.0  4.925149e+00  4.522269e+00     0.10000       1.600000  3.400000e+00  7.300000e+00  2.860000e+01
Income composition of resources    2576.0  6.266968e-01  2.133229e-01     0.00000       0.492000  6.790000e-01  7.810000e-01  9.380000e-01
Schooling                          2576.0  1.199608e+01  3.364109e+00     0.00000      10.100000  1.230000e+01  1.430000e+01  2.070000e+01

✂️ Data Splitting

y_data = raw_data.loc[:, 'Life expectancy']
X_data = raw_data.drop('Life expectancy', axis=1)
test_ratio = 0.3
val_ratio = 0.2
X_train_val, X_test, y_train_val, y_test = train_test_split(X_data, y_data, test_size=test_ratio)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=val_ratio/(1-test_ratio))
split_df = pd.DataFrame([X_train.shape[0], X_val.shape[0], X_test.shape[0]], index=['Train', 'Val', 'Test'], columns=['Size'])
split_df['Relative size'] = split_df['Size'] / split_df['Size'].sum()
split_df['Relative size'] = split_df['Relative size'].round(3)
split_df

       Size  Relative size
Train  1358            0.5
Val     544            0.2
Test    816            0.3
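
The split sizes follow directly from the two ratios. Because `val_ratio` is expressed relative to the full dataset, it has to be rescaled to the 70 % that remains after the test set is removed. The arithmetic can be sketched with the standard library alone, assuming scikit-learn rounds the test portion of each split up (which reproduces the sizes in the table):

```python
import math

n_samples = 2718          # rows in the dataset (from the overview above)
test_ratio, val_ratio = 0.3, 0.2

# First split: the test size is rounded up.
n_test = math.ceil(test_ratio * n_samples)
n_train_val = n_samples - n_test

# Second split: val_ratio refers to the full dataset, so it is rescaled
# to the remaining 70 % before splitting again.
val_share = val_ratio / (1 - test_ratio)
n_val = math.ceil(val_share * n_train_val)
n_train = n_train_val - n_val

print(n_train, n_val, n_test)  # 1358 544 816
```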

🕵️ Exploratory Data Analysis

This exploratory analysis focuses on:

  • Features with missing values.
  • The scale of individual features.
  • The correlation of individual features with the target variable.

❓ Missing Values

fig, ax = plt.subplots(1, 1, figsize=(12, 5), layout='constrained')
fig.suptitle('Missing values (heatmap)', fontsize=20)

msno.matrix(X_train, fontsize=14, sparkline=False, ax=ax, color=main_color)
ax.get_yaxis().set_visible(False)

[Figure: Missing values (heatmap)]

fig, ax = plt.subplots(1, 1, figsize=(12, 8), layout='constrained')
fig.suptitle('Missing values (counts)', fontsize=20)

missings = X_train.isna().sum().sort_values()
bars = ax.barh(missings.index, missings, color=main_color)
ax.set_xlabel('Count', fontsize=14)
ax.tick_params(axis='both', which='major', labelsize=12)
for bar, count in zip(bars, missings):
    ax.text(bar.get_width()+2, bar.get_y() + bar.get_height() / 2, f'{count}', va='center', fontsize=10)
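
Absolute counts are easier to judge next to the share of rows they represent. A small sketch on a toy frame (the values are illustrative, not taken from the dataset) shows how the per-column fraction of missing entries could be computed:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train; the values are illustrative only.
toy = pd.DataFrame({
    'GDP': [1.0, np.nan, 3.0, np.nan],
    'Population': [np.nan, 2.0, 3.0, 4.0],
    'Year': [2000, 2001, 2002, 2003],
})

# isna().mean() gives the share of missing entries per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```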

✨ Features

Here we examine the distribution of individual features. Features whose values span several orders of magnitude are plotted on a logarithmic x-axis with logarithmically spaced bins.

cat_columns = ['Status', 'Country']
num_columns = ['Year', 'Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
               'Measles', 'BMI', 'under-five deaths', 'Polio', 'Total expenditure', 'Diphtheria', 'HIV/AIDS', 'GDP', 'Population',
               'thinness  1-19 years', 'thinness 5-9 years', 'Income composition of resources', 'Schooling']
logscaled_columns = ['infant deaths', 'percentage expenditure', 'Measles', 'under-five deaths', 'HIV/AIDS', 'GDP', 'Population']
fig, axes = plt.subplots(7, 3, figsize=(12, 16), layout='constrained')
fig.suptitle('Features', size=20)
fig.supylabel('Count', size=18)

for col, ax in zip(num_columns, fig.axes):
    data = X_train[col].dropna()
    if col in logscaled_columns:
        hist, bins = np.histogram(data, bins=20)
        logbins = np.logspace(np.log10(max(bins[0], 0.001)), np.log10(bins[-1]), len(bins))
        ax.set_xscale('log')
        ax.hist(data, bins=logbins, color=main_color)
    else:
        ax.hist(data, color=main_color)
        if col == 'Year':
            ax.xaxis.set_major_formatter(FormatStrFormatter('%d'))
    ax.set_xlabel(col)

ax = axes[6][1]
states = X_train['Status'].value_counts()
ax.bar(states.index, states, color=main_color)
ax.set_xlabel('Status')

axes[6][2].axis('off')

plt.show()

[Figure: Features]
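
The logarithmic binning used for the skewed features can be looked at in isolation. The sketch below builds log-spaced bin edges for a synthetic heavy-tailed sample (the sample and the variable names are assumptions made for illustration). Each bin then covers the same ratio of values rather than the same width:

```python
import numpy as np

# Illustrative heavy-tailed sample; the notebook uses X_train columns instead.
rng = np.random.default_rng(0)
data = rng.lognormal(mean=3.0, sigma=2.0, size=1000)

n_bins = 20
lower = max(data.min(), 0.001)   # clamp to a small positive value, log10(0) is undefined
edges = np.logspace(np.log10(lower), np.log10(data.max()), n_bins + 1)

counts, _ = np.histogram(data, bins=edges)
```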

print('List of countries:')
print(', '.join(X_train['Country'].unique()))

List of countries:
Rwanda, Ukraine, Micronesia (Federated States of), Grenada, Montenegro, Sierra Leone, Mali, Lebanon, Cabo Verde, Tajikistan, Georgia, Spain, Algeria, Thailand, Fiji, Azerbaijan, United States of America, Cambodia, Togo, Russian Federation, France, Kyrgyzstan, Seychelles, Estonia, Gabon, Switzerland, Barbados, Nigeria, Turkmenistan, Comoros, Uruguay, Côte d'Ivoire, Samoa, Kazakhstan, Yemen, Belize, Iran (Islamic Republic of), Benin, Sri Lanka, Belgium, Italy, Zambia, Lithuania, Sudan, Burundi, Republic of Moldova, Papua New Guinea, Ethiopia, Sweden, Serbia, Jamaica, Denmark, Venezuela (Bolivarian Republic of), Afghanistan, South Africa, Democratic People's Republic of Korea, Maldives, Albania, Cyprus, India, Philippines, Tunisia, Saint Lucia, South Sudan, Zimbabwe, Angola, Nepal, Viet Nam, Croatia, Somalia, Brunei Darussalam, Antigua and Barbuda, Mauritius, Niger, Syrian Arab Republic, Morocco, Burkina Faso, New Zealand, United Kingdom of Great Britain and Northern Ireland, Guatemala, Malta, Republic of Korea, Honduras, Sao Tome and Principe, Trinidad and Tobago, Chile, Ghana, Cameroon, Oman, Ireland, Indonesia, Tonga, Hungary, Kenya, Malawi, Luxembourg, El Salvador, Slovenia, Jordan, Haiti, Namibia, Netherlands, Eritrea, Guyana, Argentina, Timor-Leste, Costa Rica, Guinea, Senegal, Germany, China, Uzbekistan, Ecuador, Suriname, Czechia, Equatorial Guinea, Greece, Belarus, Malaysia, Israel, Bosnia and Herzegovina, Bangladesh, Latvia, Pakistan, Myanmar, Iceland, Mauritania, Turkey, Mexico, Congo, Lao People's Democratic Republic, Austria, Mozambique, United Republic of Tanzania, Saudi Arabia, Singapore, Swaziland, United Arab Emirates, Qatar, The former Yugoslav republic of Macedonia, Bahrain, Nicaragua, Panama, Bhutan, Poland, Bolivia (Plurinational State of), Australia, Norway, Peru, Djibouti, Dominican Republic, Portugal, Armenia, Madagascar, Bahamas, Chad, Saint Vincent and the Grenadines, Mongolia, Botswana, Libya, Vanuatu, Bulgaria, Egypt, Canada, Guinea-Bissau, Lesotho, Democratic Republic of the Congo, Brazil, Kiribati, Iraq, Gambia, Slovakia, Kuwait, Japan, Paraguay, Uganda, Colombia, Finland, Liberia, Romania, Cuba, Solomon Islands, Central African Republic

🎯 Target Variable

Here, we examine the target variable and its relationship with other features.

fig, axes = plt.subplots(2, 1, figsize=(12, 6), sharex=True)
fig.subplots_adjust(hspace=0.03)
fig.suptitle('Target variable', size=18)

ax = axes[0]
ax.hist(y_train, bins=35, color=main_color)
ax.set_ylabel('Count')
ax.tick_params(labelbottom=False)
ax.tick_params(axis='x', which='both', length=0)

medianprops = dict(linewidth=2.5, color=main_color)
flierprops = dict(marker='o', markerfacecolor='none', markersize=7, markeredgecolor=main_color)
ax = axes[1]
ax.boxplot(y_train, vert=False, widths=[0.4], showfliers=True, flierprops=flierprops, medianprops=medianprops)
ax.get_yaxis().set_visible(False)
ax.set_xlabel('Life expectancy [years]')

plt.show()

[Figure: Target variable]

We then examine how each feature relates to the target and how the features correlate with one another, flagging highly correlated ones for special handling later.

fig, axes = plt.subplots(7, 3, figsize=(12, 28), layout='constrained')
fig.suptitle('Life expectancy vs Features', size=20)
# fig.supxlabel('Life expectancy', size=18)

alpha = 0.3
for col, ax in zip(num_columns, fig.axes):
    data = X_train[col]
    if col in logscaled_columns:
        ax.set_yscale('log')
    ax.scatter(y_train, data, alpha=alpha, color=main_color)
    ax.set_ylabel(col)
    ax.set_title(f'Life expectancy vs {col}', fontsize=12)

axes[6][1].axis('off')
axes[6][2].axis('off')

plt.show()

[Figure: Life expectancy vs Features]

colors_r = plt.get_cmap('Blues')(np.linspace(0, 1, 128))
colors_l = colors_r[::-1]
ggcmap_bi = mcolors.LinearSegmentedColormap.from_list('ggplot_like', np.vstack((colors_l, colors_r)))

fig, ax = plt.subplots(figsize=(12, 8), layout='constrained')
fig.suptitle('Correlation matrix', size=20)
corr = X_train[num_columns].corr()
sns.heatmap(corr.round(1), ax=ax, square=True, annot=True, linewidths=0.3,
            annot_kws={"size": 8}, cmap=ggcmap_bi, vmin=-1, vmax=1)
ax.tick_params(axis='both', which='both',length=0)
plt.show()

[Figure: Correlation matrix]

high_corr_columns = ['under-five deaths', 'thinness 5-9 years', 'GDP']
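
The list above appears to be read off the heatmap by hand. As a hedged alternative, highly correlated pairs could also be extracted programmatically. The sketch below uses synthetic columns and an assumed cutoff of 0.9; neither comes from the notebook:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'a': base,
    'b': base * 2.0 + rng.normal(scale=0.01, size=200),  # nearly a copy of 'a'
    'c': rng.normal(size=200),                           # unrelated
})

threshold = 0.9  # assumed cutoff, not taken from the notebook
corr = toy.corr().abs()
# Keep only the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(row, col) for row in upper.index for col in upper.columns
         if upper.loc[row, col] > threshold]
print(pairs)
```

From each reported pair, one column would then be a candidate for removal or separate treatment.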

🏭 Feature Transformations

This section prepares the feature preprocessing steps, including handling missing values, logarithmic transformations, and encoding. A new continent feature is also created.

🔧 Transformation Preparation

🗂️ Categorical Features

One-Hot Encoding (OHE) is prepared for categorical features.

X_train[cat_columns].describe().T

         count  unique          top  freq
Status    1358       2   Developing  1108
Country   1358     183  Afghanistan    13

cat_encoder = Pipeline([
    ('ohe_encoder', OneHotEncoder(handle_unknown='ignore')),
])
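
The `handle_unknown='ignore'` setting matters if the validation or test set contains a category that never appears in the training split. A minimal sketch with made-up category values shows the behavior:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(np.array([['Developed'], ['Developing']], dtype=object))

# 'Unknown' was never seen during fit, so it encodes to an all-zero row
# instead of raising an error.
encoded = encoder.transform(
    np.array([['Developing'], ['Unknown']], dtype=object)
).toarray()
print(encoded)
```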

🔢 Numerical Features

An imputer is prepared to handle missing values in numerical features.

X_train[num_columns].describe().T

                                    count          mean           std         min            25%           50%           75%           max
Year                               1358.0  2.007104e+03  4.535324e+00  2000.00000    2003.000000  2.007000e+03  2.011000e+03  2.015000e+03
Adult Mortality                    1358.0  1.660037e+02  1.273147e+02     1.00000      73.000000  1.420000e+02  2.290000e+02  7.230000e+02
infant deaths                      1358.0  3.252651e+01  1.300906e+02     0.00000       0.000000  3.000000e+00  2.300000e+01  1.800000e+03
Alcohol                            1278.0  4.806667e+00  4.091876e+00     0.01000       0.962500  4.055000e+00  8.042500e+00  1.787000e+01
percentage expenditure             1358.0  8.191974e+02  2.179784e+03     0.00000       7.342322  6.750684e+01  4.571048e+02  1.909905e+04
Hepatitis B                        1098.0  8.152823e+01  2.407928e+01     2.00000      77.000000  9.200000e+01  9.600000e+01  9.900000e+01
Measles                            1358.0  2.237757e+03  1.072481e+04     0.00000       0.000000  1.700000e+01  3.517500e+02  2.121830e+05
BMI                                1346.0  3.818388e+01  1.996276e+01     1.00000      18.925000  4.325000e+01  5.610000e+01  7.760000e+01
under-five deaths                  1358.0  4.508689e+01  1.772966e+02     0.00000       0.000000  4.000000e+00  3.175000e+01  2.500000e+03
Polio                              1350.0  8.232741e+01  2.357581e+01     3.00000      77.250000  9.300000e+01  9.700000e+01  9.900000e+01
Total expenditure                  1262.0  6.037964e+00  2.425639e+00     0.37000       4.370000  5.845000e+00  7.700000e+00  1.720000e+01
Diphtheria                         1350.0  8.246296e+01  2.356940e+01     3.00000      78.000000  9.250000e+01  9.700000e+01  9.900000e+01
HIV/AIDS                           1358.0  1.904271e+00  5.498987e+00     0.10000       0.100000  1.000000e-01  8.000000e-01  5.060000e+01
GDP                                1170.0  7.999548e+03  1.527320e+04     1.68135     453.543922  1.705259e+03  6.476837e+03  1.157616e+05
Population                         1079.0  1.234136e+07  5.776179e+07    41.00000  187528.500000  1.373513e+06  7.417429e+06  1.293859e+09
thinness 1-19 years                1346.0  4.997845e+00  4.645902e+00     0.10000       1.600000  3.400000e+00  7.400000e+00  2.770000e+01
thinness 5-9 years                 1346.0  5.060550e+00  4.749024e+00     0.10000       1.525000  3.400000e+00  7.400000e+00  2.860000e+01
Income composition of resources    1295.0  6.269668e-01  2.166401e-01     0.00000       0.489000  6.810000e-01  7.840000e-01  9.380000e-01
Schooling                          1295.0  1.207097e+01  3.396963e+00     0.00000      10.100000  1.240000e+01  1.440000e+01  2.050000e+01

num_encoder = Pipeline([
    # ('knn_imputer', KNNImputer(missing_values=np.nan, n_neighbors=5, weights='distance')),
    ('mean_imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
])
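
The imputer learns its fill value when it is fitted, so fitting on the training split alone keeps validation and test statistics out of the preprocessing. A toy sketch with illustrative arrays:

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[1.0], [3.0], [np.nan]])
val = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(train)               # learns mean = 2.0 from the training rows only
val_filled = imputer.transform(val)
```

The missing validation entry is filled with 2.0, the training mean, not with the validation mean of 10.0.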

🪵 Logarithmic Features

Features plotted on a logarithmic scale are transformed with log(x + 1), which stays defined for zero values.

def log_transform(x):
    return np.log(x + 1)

log_encoder = Pipeline([
    # ('knn_imputer', KNNImputer(missing_values=np.nan, n_neighbors=5, weights='distance')),
    ('mean_imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('log_transformer', FunctionTransformer(log_transform))
])
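
NumPy ships this transformation as `np.log1p`, with `np.expm1` as its inverse. As a sketch of an alternative (not what the notebook uses), passing both to `FunctionTransformer` makes the step invertible, which can help when inspecting transformed values:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p(x) == log(x + 1); expm1 undoes it.
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)

values = np.array([[0.0], [9.0], [1e6]])
logged = log_transformer.fit_transform(values)
restored = log_transformer.inverse_transform(logged)
```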

🌍 Continent Feature

A new feature indicating the continent for each country is created.

def country_to_continent(country):
    try:
        alpha2 = pcc.country_name_to_country_alpha2(country)
        continent = pcc.country_alpha2_to_continent_code(alpha2)
        return continent
    except KeyError:
        return np.nan

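class ContinentTransformer(BaseEstimator, TransformerMixin):
    """Replace each country name in the 'Country' column with its continent code."""

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data.
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified in place.
        X = X.copy()
        X['Country'] = X['Country'].apply(country_to_continent)
        return X

continent_encoder = Pipeline([
    ('continent_transformer', ContinentTransformer()),
    ('ohe_encoder', OneHotEncoder(handle_unknown='ignore')),
])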
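
A custom transformer like this one is easiest to reason about when it leaves its input untouched. The sketch below shows the same pattern without the `pycountry_convert` dependency. The `CONTINENTS` lookup and the `MappingTransformer` class are hypothetical stand-ins written for this illustration:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical lookup used only for this sketch; the notebook derives
# continents with pycountry_convert instead.
CONTINENTS = {'France': 'EU', 'Kenya': 'AF'}

class MappingTransformer(BaseEstimator, TransformerMixin):
    """Map one column through a dict without modifying the caller's frame."""

    def __init__(self, column='Country', mapping=None):
        self.column = column
        self.mapping = mapping

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        X = X.copy()  # work on a copy so the input stays untouched
        X[self.column] = X[self.column].map(self.mapping)
        return X

df = pd.DataFrame({'Country': ['France', 'Kenya', 'Atlantis']})
out = MappingTransformer(mapping=CONTINENTS).fit_transform(df)
```

Names missing from the lookup become NaN, which mirrors the `np.nan` fallback in `country_to_continent`.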

🗜️ Applying Transformations

This section sets up imputer and scaler transformations on the training data and creates suitable preprocessors for each model.

🌳 Random Forest Preprocessor

column_transformer = ColumnTransformer([
    # ('continent_encoder', continent_encoder, ['Country']),
    ('categorical_encoder', cat_encoder, cat_columns),
    ('numerical_encoder', num_encoder, num_columns),
])

preprocessor_random_forest = Pipeline([
    ('preprocessor', column_transformer),
])

preprocessor_random_forest.fit(X_train)

Pipeline(steps=[('preprocessor',
             ColumnTransformer(transformers=[('categorical_encoder',
                                              Pipeline(steps=[('ohe_encoder',
                                                               OneHotEncoder(handle_unknown='ignore'))]),
                                              ['Status', 'Country']),
                                             ('numerical_encoder',
                                              Pipeline(steps=[('mean_imputer',
                                                               SimpleImputer())]),
                                              ['Year', 'Adult Mortality',
                                               'infant deaths', 'Alcohol',
                                               'percentage expenditure',
                                               'Hepatitis B', 'Measles',
                                               'BMI', 'under-five deaths',
                                               'Polio', 'Total expenditure',
                                               'Diphtheria', 'HIV/AIDS',
                                               'GDP', 'Population',
                                               'thinness  1-19 years',
                                               'thinness 5-9 years',
                                               'Income composition of '
                                               'resources',
                                               'Schooling'])]))])
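
To make the effect of such a `ColumnTransformer` concrete, the sketch below applies a reduced version to a three-row toy frame (the data and the `sparse_threshold=0` setting, which forces dense output for easy inspection, are assumptions for illustration). The one-hot columns come first, followed by the imputed numerical column, in the order the transformers are listed:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    'Status': ['Developed', 'Developing', 'Developing'],
    'GDP': [100.0, np.nan, 300.0],
})

transformer = ColumnTransformer([
    ('categorical_encoder', OneHotEncoder(handle_unknown='ignore'), ['Status']),
    ('numerical_encoder', SimpleImputer(strategy='mean'), ['GDP']),
], sparse_threshold=0)  # always return a dense array

features = transformer.fit_transform(toy)
# Two one-hot columns for Status, then GDP with the gap filled by the mean (200.0).
```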

⛰️ Ridge Regression Preprocessor

column_transformer = ColumnTransformer([
    ('continent_encoder', continent_encoder, ['Country']),
    ('categorical_encoder', cat_encoder, cat_columns),
    # Exclude the highly correlated columns; test membership against the list itself, not a nested list.
    ('numerical_encoder', num_encoder, [col for col in num_columns if col not in high_corr_columns]),
    ('logscaled_encoder', log_encoder, logscaled_columns),
])

preprocessor_ridge = Pipeline([
    ('preprocessor', column_transformer),
    # ('basis_functions', PolynomialFeatures(include_bias=True)),
    ('scaler', MaxAbsScaler()),
])
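
`MaxAbsScaler` divides each column by its largest absolute value. It does not shift the data, so zeros stay zero, which suits the mostly-zero one-hot columns produced earlier in the pipeline. A toy sketch with an illustrative matrix:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[0.0, -4.0],
              [1.0,  2.0],
              [0.0,  8.0]])

scaled = MaxAbsScaler().fit_transform(X)
# Each column is divided by its largest absolute value, so zeros stay zero
# and every entry ends up in [-1, 1].
```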

preprocessor_ridge.fit(X_train)
Pipeline(steps=[('preprocessor',
             ColumnTransformer(transformers=[('continent_encoder',
                                              Pipeline(steps=[('continent_transformer',
                                                               ContinentTransformer()),
                                                              ('ohe_encoder',
                                                               OneHotEncoder(handle_unknown='ignore'))]),
                                              ['Country']),
                                             ('categorical_encoder',
                                              Pipeline(steps=[('ohe_encoder',
                                                               OneHotEncoder(handle_unknown='ignore'))]),
                                              ['Status', 'Country']),
                                             ('numerical_...
                                               'thinness 5-9 years',
                                               'Income composition of '
                                               'resources',
                                               'Schooling']),
                                             ('logscaled_encoder',
                                              Pipeline(steps=[('mean_imputer',
                                                               SimpleImputer()),
                                                              ('log_transformer',
                                                               FunctionTransformer(func=<function log_transform at 0x7fbbd2fbe480>))]),
                                              ['infant deaths',
                                               'percentage expenditure',
                                               'Measles',
                                               'under-five deaths',
                                               'HIV/AIDS', 'GDP',
                                               'Population'])])),
            ('scaler', MaxAbsScaler())])

k-NN Preprocessor

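# 'Country' enters only through its continent; 'Status' is the sole one-hot-encoded raw column.
column_transformer = ColumnTransformer([
    ('continent_encoder', continent_encoder, ['Country']),
    ('categorical_encoder', cat_encoder, ['Status']),
    ('numerical_encoder', num_encoder, num_columns),
])

# MaxAbsScaler divides each feature by its maximum absolute value, so all features
# lie in [-1, 1] and contribute on a comparable scale to k-NN distances.
preprocessor_knn = Pipeline([
    ('preprocessor', column_transformer),
    ('scaler', MaxAbsScaler()),
])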

preprocessor_knn.fit(X_train)
Pipeline(steps=[('preprocessor',
             ColumnTransformer(transformers=[('continent_encoder',
                                              Pipeline(steps=[('continent_transformer',
                                                               ContinentTransformer()),
                                                              ('ohe_encoder',
                                                               OneHotEncoder(handle_unknown='ignore'))]),
                                              ['Country']),
                                             ('categorical_encoder',
                                              Pipeline(steps=[('ohe_encoder',
                                                               OneHotEncoder(handle_unknown='ignore'))]),
                                              ['Status']),
                                             ('numerical_encoder',
                                              P...steps=[('mean_imputer',
                                                               SimpleImputer())]),
                                              ['Year', 'Adult Mortality',
                                               'infant deaths', 'Alcohol',
                                               'percentage expenditure',
                                               'Hepatitis B', 'Measles',
                                               'BMI', 'under-five deaths',
                                               'Polio', 'Total expenditure',
                                               'Diphtheria', 'HIV/AIDS',
                                               'GDP', 'Population',
                                               'thinness  1-19 years',
                                               'thinness 5-9 years',
                                               'Income composition of '
                                               'resources',
                                               'Schooling'])])),
            ('scaler', MaxAbsScaler())])

🌳 Random Forest

🏗️ Model Implementation

class CustomRandomForest(RegressorMixin):
    """
    Custom Random Forest Regressor.
    This model utilizes DecisionTreeRegressor from sklearn as its base estimators.
    """
    def __init__(self, n_estimators, max_samples_fraction, max_depth, **kwargs):
        """
        Model constructor.
        Key hyperparameters:
            n_estimators: Number of decision tree sub-models.
            max_samples_fraction: The fraction of samples to bootstrap for each sub-model (0 to 1).
            max_depth: Maximum depth of each decision tree sub-model.
            kwargs: (Optional) Additional hyperparameters for the DecisionTreeRegressor sub-models.
        """
        self.n_estimators = n_estimators
        self.max_samples_fraction = max_samples_fraction
        self.max_depth = max_depth
        self.decision_tree_kwargs = kwargs
        
    def fit(self, X, y):
        """
        Trains the model. Training data is provided in X and y.
        Sub-models are trained using bootstrapping, with sample size determined by max_samples_fraction.
        """
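        self.estimators = []
        # Each tree is trained on a bootstrap sample of this many rows (drawn with replacement).
        n_samples = int(X.shape[0] * self.max_samples_fraction)
        for _ in range(self.n_estimators):
            X_sample, y_sample = resample(X, y, replace=True, n_samples=n_samples)
            # splitter='random' draws random split thresholds, which helps decorrelate the trees.
            tree = DecisionTreeRegressor(splitter='random', max_depth=self.max_depth, **self.decision_tree_kwargs)
            tree.fit(X_sample, y_sample)
            self.estimators.append(tree)
        return self  # scikit-learn convention: fit returns the estimator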
            
    def predict(self, X):
        """
        Predicts y for the given data points in X.
        """
        estimations = np.zeros((self.n_estimators, X.shape[0]))
        for i, estimator in enumerate(self.estimators):
            estimations[i] = estimator.predict(X)
        y_predicted = estimations.mean(axis=0)
        return y_predicted
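
As a small illustration of the bootstrapping step in `fit`, the sketch below draws one bootstrap sample with `sklearn.utils.resample`, the same call used in the class above. The toy arrays and the seed are made up for this example. With `max_samples_fraction=1`, a sample drawn with replacement typically contains only about 63% of the distinct training rows, so each tree sees a somewhat different dataset.

```python
import numpy as np
from sklearn.utils import resample

# Toy data (hypothetical): row i has feature value i and target i,
# which makes it easy to check that X and y stay aligned.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# Same call pattern as in CustomRandomForest.fit, plus a fixed seed for repeatability.
X_boot, y_boot = resample(X, y, replace=True, n_samples=len(X), random_state=0)

# Fraction of distinct original rows that made it into the bootstrap sample.
unique_fraction = len(np.unique(y_boot)) / len(y)
print(f"distinct rows in bootstrap sample: {unique_fraction:.1%}")
```

The rows that are left out of a given sample differ from tree to tree, which is one source of the diversity between sub-models.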

👍 Model Suitability

The Random Forest model is well-suited for this dataset due to its robustness.

  • It handles outliers and mixed feature types effectively.
  • Bootstrap sampling and the randomized splits (splitter='random') in this implementation produce diverse trees, which promotes good generalization.
  • Compared with a single decision tree, it is less sensitive to changes in the training data and generally yields stronger results.
  • It requires minimal preprocessing and is less prone to overfitting than a single tree.
  • Its drawbacks are longer training time (although the trees can be trained in parallel) and lower interpretability than a single tree.
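
The claim that averaging bootstrapped trees generalizes better than a single tree can be checked on synthetic data. This is a minimal sketch, not the notebook's experiment: the Friedman #1 dataset, the 50 trees, and the seeds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

# Synthetic regression problem (assumption: stands in for the real dataset).
X, y = make_friedman1(n_samples=600, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


# Baseline: one fully grown tree with random split thresholds.
single = DecisionTreeRegressor(splitter='random', random_state=0).fit(X_tr, y_tr)

# Ensemble: 50 trees, each fitted on its own bootstrap sample, predictions averaged.
predictions = []
for seed in range(50):
    X_b, y_b = resample(X_tr, y_tr, replace=True, random_state=seed)
    tree = DecisionTreeRegressor(splitter='random', random_state=seed).fit(X_b, y_b)
    predictions.append(tree.predict(X_te))
bagged_pred = np.mean(predictions, axis=0)

single_rmse = rmse(y_te, single.predict(X_te))
bagged_rmse = rmse(y_te, bagged_pred)
print(f"single tree RMSE: {single_rmse:.3f}")
print(f"bagged trees RMSE: {bagged_rmse:.3f}")
```

On this kind of noisy data the averaged prediction should come out with a clearly lower test RMSE than the single tree, because averaging reduces the variance of the individual trees.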

⏱️ Hyperparameters for Tuning

param_grid = ParameterGrid({
    'n_estimators': range(20, 300, 20),
})

✨ Best Model Selection

X_train_np = preprocessor_random_forest.transform(X_train)
X_val_np = preprocessor_random_forest.transform(X_val)

log_rf = pd.DataFrame(columns=['n_estimators', 'train_rmse', 'val_rmse'])
estimators_rf = []
for params in param_grid:
    reg = CustomRandomForest(max_depth=None, max_samples_fraction=1, **params)
    reg.fit(X_train_np, y_train)
    train_rmse = mse(y_train, reg.predict(X_train_np), squared=False)
    val_rmse = mse(y_val, reg.predict(X_val_np), squared=False)
    log_rf.loc[len(log_rf.index)] = [params['n_estimators'], train_rmse, val_rmse]
    estimators_rf.append(reg)
log_rf.sort_index(inplace=True)

fig, axes = plt.subplots(2, 1, figsize=(12, 5), layout='constrained', sharex=True)
fig.suptitle('Random forest learning curve', size=20)
fig.supylabel('RMSE', size=16)
fig.supxlabel('Number of estimators', size=16)

ax = axes[0]
ax.plot(log_rf['n_estimators'], log_rf['train_rmse'], label='train', color=main_color, linewidth=2.5)
ax.legend()

ax = axes[1]
ax.plot(log_rf['n_estimators'], log_rf['val_rmse'], label='validation', color='green', linewidth=2.5)
ax.legend()

plt.show()

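[Figure: Random forest learning curve. Train RMSE (top) and validation RMSE (bottom) against the number of estimators.]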
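
A note on the RMSE computation in the loop above: it relies on the `squared=False` argument, which newer scikit-learn releases deprecate and eventually remove. Assuming `mse` is an alias for `sklearn.metrics.mean_squared_error` (the import is not shown in this section), a version-independent helper could look like the following sketch. The name `rmse` is hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    """Root mean squared error without relying on the `squared` keyword."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Tiny check: errors are 2 and 0, so MSE = 2 and RMSE = sqrt(2).
example = rmse([3.0, 5.0], [1.0, 5.0])
print(example)
```

Calls such as `mse(y_val, reg.predict(X_val_np), squared=False)` could then be written as `rmse(y_val, reg.predict(X_val_np))`.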

log_rf = log_rf.sort_values('val_rmse')
log_rf.head(5)

    n_estimators  train_rmse  val_rmse
8          180.0    0.703142  2.009168
6          140.0    0.695845  2.013311
7          160.0    0.703251  2.023062
3           80.0    0.722281  2.023428
12         260.0    0.689763  2.023804

best_rf = estimators_rf[log_rf.sort_values('val_rmse').index[0]]

📈 Best Model Evaluation

X_val_np = preprocessor_random_forest.transform(X_val)
rf_eval = pd.DataFrame([
   mse(y_val, best_rf.predict(X_val_np), squared=False),
   mae(y_val, best_rf.predict(X_val_np)),
], index=['RMSE', 'MAE'], columns=['random forest'])
rf_eval

       alpha  train_rmse  val_rmse
8   0.000808    1.592143  2.146361
9   0.000909    1.593479  2.146424
7   0.000707    1.590830  2.146496
10  0.001010    1.594850  2.146761
11  0.001111    1.596189  2.146996

best_ridge = estimators_ridge[log_ridge.sort_values('val_rmse').index[0]]

📈 Best Model Evaluation

X_val_np = preprocessor_ridge.transform(X_val)
ridge_eval = pd.DataFrame([
   mse(y_val, best_ridge.predict(X_val_np), squared=False),
   mae(y_val, best_ridge.predict(X_val_np)),
], index=['RMSE', 'MAE'], columns=['ridge'])
ridge_eval

         ridge
RMSE  2.146361
MAE   1.222629

k-NN regression

👍 Model Suitability

k-Nearest Neighbors (k-NN) is a suitable method for this task:

  • k-NN is a non-parametric model that requires no explicit training; fitting simply stores the training data.
  • Prediction can be computationally expensive on large datasets, but the training set here is small enough that this is not an issue.
  • The features are scaled and one-hot encoded, so no single feature dominates the distance computation, and the dimensionality is low enough to avoid the curse of dimensionality.
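
To see why scaling matters for a distance-based model, the sketch below compares the nearest neighbor of one row before and after `MaxAbsScaler`. The three rows are made-up values: the first column mimics a large-magnitude feature such as a population count, the second an index between 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical rows: [large-magnitude feature, index in [0, 1]].
X = np.array([
    [1_000_000.0, 0.10],  # row 0: the query point
    [1_000_500.0, 0.90],  # row 1: close in column 0, far in column 1
    [1_200_000.0, 0.11],  # row 2: far in column 0, close in column 1
])


def nearest_to_first(data):
    """Index of the row closest (Euclidean distance) to row 0."""
    distances = np.linalg.norm(data[1:] - data[0], axis=0 + 1)
    return int(np.argmin(distances)) + 1


raw_neighbor = nearest_to_first(X)
scaled_neighbor = nearest_to_first(MaxAbsScaler().fit_transform(X))
print(raw_neighbor, scaled_neighbor)  # prints "1 2"
```

Without scaling, the large-magnitude column alone decides which row is nearest; after scaling, both columns contribute and the neighbor changes.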

⏱️ Hyperparameters for Tuning

param_grid = ParameterGrid({
    'n_neighbors': range(1, 20),
    'weights': ['uniform', 'distance'],
})
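
`ParameterGrid` enumerates the combinations in a fixed order: keys are sorted alphabetically and the last key varies fastest. Since the selection loop appends one row to `log_knn` per combination, a row's index label is its position in this sequence. A short sketch using the same grid:

```python
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({
    'n_neighbors': range(1, 20),
    'weights': ['uniform', 'distance'],
})

combos = list(grid)
print(len(combos))  # 19 values of n_neighbors x 2 weightings = 38
print(combos[0])    # {'n_neighbors': 1, 'weights': 'uniform'}
print(combos[7])    # {'n_neighbors': 4, 'weights': 'distance'}
```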

✨ Best Model Selection

X_train_np = preprocessor_knn.transform(X_train)
X_val_np = preprocessor_knn.transform(X_val)

log_knn = pd.DataFrame(columns=['n_neighbors', 'weights', 'train_rmse', 'val_rmse'])
estimators_knn = []
for params in param_grid:
    reg = KNeighborsRegressor(n_neighbors=params['n_neighbors'], weights=params['weights'])
    reg.fit(X_train_np, y_train)
    train_rmse = mse(y_train, reg.predict(X_train_np), squared=False)
    val_rmse = mse(y_val, reg.predict(X_val_np), squared=False)
    log_knn.loc[len(log_knn.index)] = [params['n_neighbors'], params['weights'], train_rmse, val_rmse]
    estimators_knn.append(reg)
log_knn.sort_index(inplace=True)
df = log_knn[log_knn['weights'] == 'distance']

fig, axes = plt.subplots(2, 1, figsize=(12, 5), layout='constrained', sharex=True)
fig.suptitle('KNN learning curve (distance)', size=20)
fig.supylabel('RMSE', size=16)
fig.supxlabel('Number of neighbors', size=16)

ax = axes[0]
ax.plot(df['n_neighbors'], df['train_rmse'], label='train', color=main_color, linewidth=2.5)
ax.legend(loc='lower right')

ax = axes[1]
ax.plot(df['n_neighbors'], df['val_rmse'], label='validation', color='chocolate', linewidth=2.5)
ax.legend(loc='lower right')

ax.xaxis.set_major_locator(MaxNLocator(integer=True))
plt.show()

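[Figure: KNN learning curve (distance). Train RMSE (top) and validation RMSE (bottom) against the number of neighbors.]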

log_knn = log_knn.sort_values('val_rmse')
log_knn.head(5)

    n_neighbors   weights    train_rmse  val_rmse
7             4  distance  9.274417e-07  3.015674
5             3  distance  7.163777e-07  3.032370
9             5  distance  1.097045e-06  3.057202
13            7  distance  1.454370e-06  3.059293
11            6  distance  1.284174e-06  3.077836

best_knn = estimators_knn[log_knn.sort_values('val_rmse').index[0]]
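
The near-zero training RMSE of the distance-weighted models is expected and says nothing about generalization: with `weights='distance'`, each training point is its own nearest neighbor at distance zero and therefore dominates its own prediction. The sketch below reproduces the effect on synthetic data; the data and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (hypothetical): a noisy linear target on three features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=200)

train_rmse = {}
for weights in ('uniform', 'distance'):
    model = KNeighborsRegressor(n_neighbors=4, weights=weights).fit(X, y)
    residuals = y - model.predict(X)
    train_rmse[weights] = float(np.sqrt(np.mean(residuals ** 2)))
    print(weights, train_rmse[weights])
```

With uniform weights the training error stays clearly above zero, while with distance weights it is essentially zero. This is why the validation RMSE, not the training RMSE, is used to choose the model.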

📈 Best Model Evaluation

X_val_np = preprocessor_knn.transform(X_val)
knn_eval = pd.DataFrame([
   mse(y_val, best_knn.predict(X_val_np), squared=False),
   mae(y_val, best_knn.predict(X_val_np)),
], index=['RMSE', 'MAE'], columns=['knn'])
knn_eval

           knn
RMSE  3.015674
MAE   1.878656

⚖️ Model Comparison

This section compares the best models from each family, selects the model with the lowest RMSE, retrains it on the full training data (training + validation), and estimates the RMSE on new data using the test set.

eval = pd.concat([rf_eval.T, ridge_eval.T, knn_eval.T])
eval

                   RMSE       MAE
random forest  2.009168  1.151335
ridge          2.146361  1.222629
knn            3.015674  1.878656

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# One bar chart per metric; the dashed line marks the random forest score for reference.
for metric, ax in zip(['RMSE', 'MAE'], axes):
    ax.set_title(metric)
    bars = ax.bar(eval.index, eval.loc[:, metric], color=['green', 'darkorchid', 'chocolate'], width=0.5)
    ax.axhline(bars[0].get_height(), color='black', linestyle='--', alpha=0.5, linewidth=1)
    ax.set_yticks([0, *[b.get_height() for b in bars]])

plt.show()

[Figure: Bar charts of validation RMSE and MAE for random forest, ridge, and knn.]

🏆 Final Model

On the validation data, the Random Forest model achieves the lowest RMSE (2.01) and the lowest MAE (1.15), so it is selected as the final model.

The chosen model is now retrained on the combined training and validation data. Its RMSE on new, unseen data is then estimated on the held-out test set.

X_train_val_np = preprocessor_random_forest.fit_transform(X_train_val)
best_rf.fit(X_train_val_np, y_train_val)

X_test_np = preprocessor_random_forest.transform(X_test)
print(mse(y_test, best_rf.predict(X_test_np), squared=False))
1.7480988972377174

We therefore expect an RMSE of approximately 1.75 years on new data.
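
The overall procedure (select on validation, refit on training plus validation, evaluate once on the test set) can be sketched end to end on synthetic data. This is an illustration under stated assumptions, not the notebook's pipeline: scikit-learn's `RandomForestRegressor` stands in for `CustomRandomForest`, and the dataset, split sizes, and candidate values are made up.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data and a train / validation / test split (all hypothetical).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=0)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# 1) Model selection: fit on the training set, compare on the validation set.
val_scores = {}
for n_estimators in (20, 60, 100):
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    val_scores[n_estimators] = rmse(y_val, model.predict(X_val))
best_n = min(val_scores, key=val_scores.get)

# 2) Refit the selected configuration on training + validation data.
final_model = RandomForestRegressor(n_estimators=best_n, random_state=0)
final_model.fit(X_train_val, y_train_val)

# 3) Touch the test set exactly once to estimate the error on new data.
test_rmse = rmse(y_test, final_model.predict(X_test))
print(best_n, round(test_rmse, 3))
```

Keeping the test set out of every selection step is what makes the final number a fair estimate of performance on unseen data.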

🎯 Evaluating evaluation.csv

eval_data = pd.read_csv('evaluation.csv')
eval_data_np = preprocessor_random_forest.transform(eval_data)
eval_data['Life expectancy'] = best_rf.predict(eval_data_np)
eval_data.to_csv('results.csv', columns=['Country', 'Year', 'Life expectancy'], header=True, index=False)