In this notebook two datasets are used. One dataset contains information of the daily number of cases, the daily reproduction rate along with some national statistics per country. This dataset is sourced from ourworldindata.org. The other dataset contains information on non-pharmaceutical interventions (NPIs) implemented by governments around the world from acaps.org. For the sake of consistency only data from Europe is considered.
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
The datasets are loaded into pandas dataframes, and the columns of interest are extracted (others are dropped).
# Specify columns of interest
cas_col = ['iso_code', 'location', 'continent', 'date', 'total_cases', 'new_cases',
'reproduction_rate', 'population','population_density', 'gdp_per_capita']
res_col = ['ISO', 'REGION', 'COUNTRY', 'MEASURE', 'DATE_IMPLEMENTED', 'ENTRY_DATE']
cas_df = pd.read_csv('data/cases.csv')[cas_col]
res_df = pd.read_csv('data/responses.csv')[res_col]
# Limit data to Europe
cas_df = cas_df[cas_df.continent == 'Europe']
res_df = res_df[res_df.REGION == 'Europe']
The first dataset daily COVID-19 reports per country, along with some data describing the country.
cas_df.head()
The second dataset contains the covid-measures taken by a country, with a start and an end date.
res_df.head()
From this we learn the number of unique measures in the dataset, and the number of unique countries. To prevent potential mismatches between the datasets, the ISO codes for countries are used, and these numbers are taken from the responses dataset.
unique_countries = list(res_df.ISO.unique())
print(f'Number of unique countries: {len(unique_countries)}')
unique_measures = list(res_df.MEASURE.unique())
print(f'Number of unique measures: {len(unique_measures)}')
We can also extract the total number of times each unique measure is implemented.
stat = res_df.MEASURE.value_counts().sort_values(ascending=False)
print('The total number of times that a measure has been implemented (by any country):')
print(stat.to_string(header=False, max_rows=8))
And the number of countries that have implemented a measure at least once.
stat = res_df.groupby('MEASURE')['COUNTRY'].nunique().sort_values(ascending=False)
print('The number of countries that have implemented a measure at least once:')
print(stat.to_string(header=False, max_rows=8))
With this information it makes sense to remove some measures from the dataset, either beceause only a small number of countries have implemented them (i.e. Humanitarian exemptions), or because the measure has only been implemented a small number of times (i.e. Checkpoints within the country and Lockdown of refugee/idp camps or other minorities). With these measures removed, the modelling approach is less prone to data skewness.
unique_measures.remove('Humanitarian exemptions')
unique_measures.remove('Checkpoints within the country')
unique_measures.remove('Lockdown of refugee/idp camps or other minorities')
The data can now be pre-processed, such that it is easier to work with.
# Convert date strings to datetime objects
cas_df['date'] = pd.to_datetime(cas_df.date)
res_df['DATE_IMPLEMENTED'] = pd.to_datetime(res_df.DATE_IMPLEMENTED)
res_df['ENTRY_DATE'] = pd.to_datetime(res_df.ENTRY_DATE)
# Fill missing case numbers rates with 0
cas_df['total_cases'] = cas_df.total_cases.fillna(0)
cas_df['new_cases'] = cas_df.new_cases.fillna(0)
cas_df['reproduction_rate'] = cas_df.reproduction_rate.fillna(0)
Which helps in finding the span of the datasets.
print(cas_df.date.min(), cas_df.date.max())
Splitting the dataframes per country will also make the processing easier.
cas_dfs = { c:cas_df[cas_df.iso_code == c].reset_index(drop=True) for c in unique_countries }
res_dfs = { c:res_df[res_df.ISO == c].reset_index(drop=True) for c in unique_countries }
Many countries results in many forms of data. The reproduction rate of covid over time in the United Kingdom, looks way different from the reproduction rate of covid over time in Sweden.
plt.figure(figsize=(15, 5))
plt.plot(cas_dfs['GBR']['date'], cas_dfs['GBR']['reproduction_rate'], '#00bfff', label='GBR')
plt.plot(cas_dfs['SWE']['date'], cas_dfs['SWE']['reproduction_rate'], '#006080', label='SWE')
plt.tick_params(axis='both', which='major', labelsize=15)
plt.ylabel('$r$', fontsize=20)
plt.legend(fontsize=20)
plt.show()
A seven-day average of the reproduction rate is implemented to get more similar data (in terms of sudden changes) acrross all countries. A rather crude, but seemingly effective way to account for the different source of the data.
_, axs = plt.subplots(11, 4, figsize=(15, 25), sharex=True, sharey=True)
for i, c in enumerate(unique_countries):
i, j = divmod(i, 4)
axs[i, j].plot(cas_dfs[c]['reproduction_rate'], '#00bfff')
cas_dfs[c]['reproduction_rate'] = cas_dfs[c].reproduction_rate.rolling(7, min_periods=1).mean()
axs[i, j].plot(cas_dfs[c]['reproduction_rate'], '#006080')
axs[i, j].set_title(c)
plt.show()
Comparing GBR with SWE again
plt.figure(figsize=(15, 5))
plt.plot(cas_dfs['GBR']['date'], cas_dfs['GBR']['reproduction_rate'], '#00bfff', label='GBR')
plt.plot(cas_dfs['SWE']['date'], cas_dfs['SWE']['reproduction_rate'], '#006080', label='SWE')
plt.tick_params(axis='both', which='major', labelsize=15)
plt.ylabel('$r$', fontsize=20)
plt.legend(fontsize=20)
plt.show()
For each dataframe the future $\Delta r$ with a 1 week period for each country is calculated, as target for the modelling approach.
for country in cas_dfs:
cas_dfs[country]['delta_r'] = cas_dfs[country].reproduction_rate.diff(-7)
cas_dfs[country]['delta_r'] = cas_dfs[country].delta_r.fillna(0)
With all the data ready, the datasets can be combined. This means that for each day, the reproduction rate $r$ and the 4 week (future) change $\Delta r$ are combined, along with which measures are active on that day. For each measure, the amount of time (in days) that a measure has been active is stored.
def encode_features(row):
country_res_df = res_dfs[row.iso_code]
# Combine data we already have access to
result = [row.iso_code, row.location, row.gdp_per_capita, row.population,
row.population_density, int(row.total_cases), int(row.new_cases),
row.reproduction_rate, row.delta_r]
active_measures = []
for measure in unique_measures:
measure_value = 0
# Get the start and end date for the current measure
m_data = country_res_df[country_res_df.MEASURE == measure]
s_dates, e_dates = list(m_data.DATE_IMPLEMENTED), list(m_data.ENTRY_DATE)
# Loop over all dates for current measure and encode if measure is active
for s_date, e_date in zip(s_dates, e_dates):
correction = pd.to_timedelta((e_date - s_date).days // 4, unit='d')
if row.date > s_date and row.date < e_date - correction:
measure_value = 1
break
active_measures.append(measure_value)
result += [len(active_measures)] + active_measures
return pd.Series(result, index=cols)
This feature encoding must be done for all countries, after which the data for each country can be combined.
cols = ['ISO', 'Country', 'PPP', 'Population', 'Density', 'Total Cases',
'New Cases', 'R', 'Delta R', 'N Measures'] + unique_measures
features_df = pd.DataFrame(columns=cols)
for country in tqdm(unique_countries):
# Get dataframe of daily cases for this country
country_cas_df = cas_dfs[country]
# Encode data and append to result
if country_cas_df.isnull().sum().sum() == 0:
country_data = country_cas_df.apply(encode_features, axis=1)
features_df = pd.concat((features_df, country_data), sort=False)
features_df.to_csv('data/combined.csv')
With this combined dataframe, a multiple regression model can be trained to predict the change in reproduction rate for the next couple of days. By interpreting the coefficients of this model, the impact of each measure on the change in reproduction rate can be extracted.
features = ['PPP', 'Population', 'Density', 'N Measures'] + unique_measures
target = 'Delta R'
X = features_df[features]
y = features_df[target]
model = LinearRegression().fit(X, y)
y_hat = model.predict(X)
print(f'Model MSE: {mean_squared_error(y_hat, y)}')
This model results in the following coefficients:
pd.set_option('display.float_format', lambda x: '%.7f' % x)
coeff = pd.Series(model.coef_, index=features)
coeff_str = coeff.sort_values(ascending=True).to_string(header=False)
print(coeff_str)
print(f'Intercept\t\t\t\t\t\t\t{model.intercept_:.7f}')