Modelling COVID-19 Government Responses on Effectiveness

In this notebook two datasets are used. One dataset contains information of the daily number of cases, the daily reproduction rate along with some national statistics per country. This dataset is sourced from ourworldindata.org. The other dataset contains information on non-pharmaceutical interventions (NPIs) implemented by governments around the world from acaps.org. For the sake of consistency only data from Europe is considered.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

from tqdm import tqdm
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

Data Exploration & Processing

The datasets are loaded into pandas dataframes, and the columns of interest are extracted (others are dropped).

In [2]:
# Specify columns of interest
cas_col = ['iso_code', 'location', 'continent', 'date', 'total_cases', 'new_cases',
           'reproduction_rate', 'population','population_density', 'gdp_per_capita']
res_col = ['ISO', 'REGION', 'COUNTRY', 'MEASURE', 'DATE_IMPLEMENTED', 'ENTRY_DATE']

cas_df = pd.read_csv('data/cases.csv')[cas_col]
res_df = pd.read_csv('data/responses.csv')[res_col]

# Limit data to Europe
cas_df = cas_df[cas_df.continent == 'Europe']
res_df = res_df[res_df.REGION == 'Europe']

The first dataset daily COVID-19 reports per country, along with some data describing the country.

In [3]:
cas_df.head()
Out[3]:
iso_code location continent date total_cases new_cases reproduction_rate population population_density gdp_per_capita
345 ALB Albania Europe 2020-03-09 2.0 2.0 NaN 2877800.0 104.871 11803.431
346 ALB Albania Europe 2020-03-10 10.0 8.0 NaN 2877800.0 104.871 11803.431
347 ALB Albania Europe 2020-03-11 12.0 2.0 NaN 2877800.0 104.871 11803.431
348 ALB Albania Europe 2020-03-12 23.0 11.0 NaN 2877800.0 104.871 11803.431
349 ALB Albania Europe 2020-03-13 33.0 10.0 NaN 2877800.0 104.871 11803.431

The second dataset contains the covid-measures taken by a country, with a start and an end date.

In [4]:
res_df.head()
Out[4]:
ISO REGION COUNTRY MEASURE DATE_IMPLEMENTED ENTRY_DATE
58 ALB Europe Albania Closure of businesses and public services NaN 4/15/2020
59 ALB Europe Albania Strengthening the public health system 1/24/2020 4/15/2020
60 ALB Europe Albania Other public health measures enforced 1/24/2020 4/15/2020
61 ALB Europe Albania Strengthening the public health system 1/31/2020 4/15/2020
62 ALB Europe Albania Strengthening the public health system 2/26/2020 4/15/2020

From this we learn the number of unique measures in the dataset, and the number of unique countries. To prevent potential mismatches between the datasets, the ISO codes for countries are used, and these numbers are taken from the responses dataset.

In [5]:
unique_countries = list(res_df.ISO.unique())
print(f'Number of unique countries: {len(unique_countries)}')

unique_measures = list(res_df.MEASURE.unique())
print(f'Number of unique measures: {len(unique_measures)}')
Number of unique countries: 43
Number of unique measures: 33

We can also extract the total number of times each unique measure is implemented.

In [6]:
stat = res_df.MEASURE.value_counts().sort_values(ascending=False)
print('The total number of times that a measure has been implemented (by any country):')
print(stat.to_string(header=False, max_rows=8))
The total number of times that a measure has been implemented (by any country):
Economic measures                                    1119
Limit public gatherings                              1017
Closure of businesses and public services            1001
Strengthening the public health system                511
                                                     ... 
Limit product imports/exports                          18
Lockdown of refugee/idp camps or other minorities      15
Checkpoints within the country                          9
Humanitarian exemptions                                 1

And the number of countries that have implemented a measure at least once.

In [7]:
stat = res_df.groupby('MEASURE')['COUNTRY'].nunique().sort_values(ascending=False)
print('The number of countries that have implemented a measure at least once:')
print(stat.to_string(header=False, max_rows=8))
The number of countries that have implemented a measure at least once:
Strengthening the public health system               43
Schools closure                                      43
Other public health measures enforced                43
Economic measures                                    43
                                                     ..
Full lockdown                                        10
Lockdown of refugee/idp camps or other minorities     5
Checkpoints within the country                        5
Humanitarian exemptions                               1

With this information it makes sense to remove some measures from the dataset, either beceause only a small number of countries have implemented them (i.e. Humanitarian exemptions), or because the measure has only been implemented a small number of times (i.e. Checkpoints within the country and Lockdown of refugee/idp camps or other minorities). With these measures removed, the modelling approach is less prone to data skewness.

In [8]:
unique_measures.remove('Humanitarian exemptions')
unique_measures.remove('Checkpoints within the country')
unique_measures.remove('Lockdown of refugee/idp camps or other minorities')

The data can now be pre-processed, such that it is easier to work with.

In [9]:
# Convert date strings to datetime objects
cas_df['date'] = pd.to_datetime(cas_df.date)
res_df['DATE_IMPLEMENTED'] = pd.to_datetime(res_df.DATE_IMPLEMENTED)
res_df['ENTRY_DATE'] = pd.to_datetime(res_df.ENTRY_DATE)

# Fill missing case numbers rates with 0
cas_df['total_cases'] = cas_df.total_cases.fillna(0)
cas_df['new_cases'] = cas_df.new_cases.fillna(0)
cas_df['reproduction_rate'] = cas_df.reproduction_rate.fillna(0)

Which helps in finding the span of the datasets.

In [10]:
print(cas_df.date.min(), cas_df.date.max())
2020-01-24 00:00:00 2021-02-02 00:00:00

Splitting the dataframes per country will also make the processing easier.

In [11]:
cas_dfs = { c:cas_df[cas_df.iso_code == c].reset_index(drop=True) for c in unique_countries }
res_dfs = { c:res_df[res_df.ISO == c].reset_index(drop=True) for c in unique_countries }

Many countries results in many forms of data. The reproduction rate of covid over time in the United Kingdom, looks way different from the reproduction rate of covid over time in Sweden.

In [12]:
plt.figure(figsize=(15, 5))
plt.plot(cas_dfs['GBR']['date'], cas_dfs['GBR']['reproduction_rate'], '#00bfff', label='GBR')
plt.plot(cas_dfs['SWE']['date'], cas_dfs['SWE']['reproduction_rate'], '#006080', label='SWE')
plt.tick_params(axis='both', which='major', labelsize=15)
plt.ylabel('$r$', fontsize=20)
plt.legend(fontsize=20)
plt.show()

A seven-day average of the reproduction rate is implemented to get more similar data (in terms of sudden changes) acrross all countries. A rather crude, but seemingly effective way to account for the different source of the data.

In [13]:
_, axs = plt.subplots(11, 4, figsize=(15, 25), sharex=True, sharey=True)
for i, c in enumerate(unique_countries):
    i, j = divmod(i, 4)
    axs[i, j].plot(cas_dfs[c]['reproduction_rate'], '#00bfff')
    cas_dfs[c]['reproduction_rate'] = cas_dfs[c].reproduction_rate.rolling(7, min_periods=1).mean()
    axs[i, j].plot(cas_dfs[c]['reproduction_rate'], '#006080')
    axs[i, j].set_title(c)
plt.show()

Comparing GBR with SWE again

In [14]:
plt.figure(figsize=(15, 5))
plt.plot(cas_dfs['GBR']['date'], cas_dfs['GBR']['reproduction_rate'], '#00bfff', label='GBR')
plt.plot(cas_dfs['SWE']['date'], cas_dfs['SWE']['reproduction_rate'], '#006080', label='SWE')
plt.tick_params(axis='both', which='major', labelsize=15)
plt.ylabel('$r$', fontsize=20)
plt.legend(fontsize=20)
plt.show()

For each dataframe the future $\Delta r$ with a 1 week period for each country is calculated, as target for the modelling approach.

In [15]:
for country in cas_dfs:
    cas_dfs[country]['delta_r'] = cas_dfs[country].reproduction_rate.diff(-7)
    cas_dfs[country]['delta_r'] = cas_dfs[country].delta_r.fillna(0)

With all the data ready, the datasets can be combined. This means that for each day, the reproduction rate $r$ and the 4 week (future) change $\Delta r$ are combined, along with which measures are active on that day. For each measure, the amount of time (in days) that a measure has been active is stored.

In [16]:
def encode_features(row):
    country_res_df = res_dfs[row.iso_code]
    
    # Combine data we already have access to
    result = [row.iso_code, row.location, row.gdp_per_capita, row.population,
              row.population_density, int(row.total_cases), int(row.new_cases),
              row.reproduction_rate, row.delta_r]
    
    active_measures = []
    for measure in unique_measures:
        measure_value = 0
        
        # Get the start and end date for the current measure
        m_data = country_res_df[country_res_df.MEASURE == measure]
        s_dates, e_dates = list(m_data.DATE_IMPLEMENTED), list(m_data.ENTRY_DATE)
        
        # Loop over all dates for current measure and encode if measure is active
        for s_date, e_date in zip(s_dates, e_dates):
            correction = pd.to_timedelta((e_date - s_date).days // 4, unit='d')
            if row.date > s_date and row.date < e_date - correction:
                measure_value = 1
                break
        
        active_measures.append(measure_value)
    result += [len(active_measures)] + active_measures
    return pd.Series(result, index=cols)

This feature encoding must be done for all countries, after which the data for each country can be combined.

In [17]:
cols = ['ISO', 'Country', 'PPP', 'Population', 'Density', 'Total Cases',
        'New Cases', 'R', 'Delta R', 'N Measures'] + unique_measures
features_df = pd.DataFrame(columns=cols)

for country in tqdm(unique_countries):
    # Get dataframe of daily cases for this country
    country_cas_df = cas_dfs[country]

    # Encode data and append to result
    if country_cas_df.isnull().sum().sum() == 0:
        country_data = country_cas_df.apply(encode_features, axis=1)
        features_df = pd.concat((features_df, country_data), sort=False)

features_df.to_csv('data/combined.csv')
100%|██████████| 43/43 [07:42<00:00, 10.76s/it]

Modelling the Measures

With this combined dataframe, a multiple regression model can be trained to predict the change in reproduction rate for the next couple of days. By interpreting the coefficients of this model, the impact of each measure on the change in reproduction rate can be extracted.

In [18]:
features = ['PPP', 'Population', 'Density', 'N Measures'] + unique_measures
target = 'Delta R'

X = features_df[features]
y = features_df[target]

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)
print(f'Model MSE: {mean_squared_error(y_hat, y)}')
Model MSE: 0.08193557283384217

This model results in the following coefficients:

In [19]:
pd.set_option('display.float_format', lambda x: '%.7f' % x)
coeff = pd.Series(model.coef_, index=features)
coeff_str = coeff.sort_values(ascending=True).to_string(header=False)
print(coeff_str)
print(f'Intercept\t\t\t\t\t\t\t{model.intercept_:.7f}')
Health screenings in airports and border crossings             -0.1391302
Amendments to funeral and burial regulations                   -0.1087074
International flights suspension                               -0.0895136
Changes in prison-related policies                             -0.0810768
Limit product imports/exports                                  -0.0803689
State of emergency declared                                    -0.0791075
Awareness campaigns                                            -0.0770310
Partial lockdown                                               -0.0598347
Closure of businesses and public services                      -0.0563162
Border closure                                                 -0.0539507
Mass population testing                                        -0.0499771
Limit public gatherings                                        -0.0475034
Curfews                                                        -0.0464392
Surveillance and monitoring                                    -0.0433568
Border checks                                                  -0.0399447
Isolation and quarantine policies                              -0.0377187
General recommendations                                        -0.0371980
Schools closure                                                -0.0369984
Strengthening the public health system                         -0.0321312
Military deployment                                            -0.0318143
Additional health/documents requirements upon arrival          -0.0309593
Visa restrictions                                              -0.0277604
Domestic travel restrictions                                   -0.0259970
Emergency administrative structures activated or established   -0.0224603
Requirement to wear protective gear in public                  -0.0221275
Testing policy                                                 -0.0186716
Psychological assistance and medical social work               -0.0151096
Economic measures                                              -0.0148937
Full lockdown                                                  -0.0092277
Other public health measures enforced                          -0.0016979
Density                                                        -0.0000089
PPP                                                            -0.0000001
Population                                                     -0.0000000
N Measures                                                     -0.0000000
Intercept							0.0152948