Visualise the Johns Hopkins Covid-19 dataset with a little help from nbdev.
- Introduction
- Installation
- Graphing current counts
- Graphing time series counts
- Graphing new vs. existing cases
- Graphing current and time series counts using Covid API
The accompanying covid module, built using nbdev, provides convenience utilities for graphing the Covid-19 dataset published by Johns Hopkins University (JHU). The JHU dataset is updated daily, with the latest figures held in separate time series CSV files:
time_series_covid19_confirmed_global.csv
time_series_covid19_deaths_global.csv
time_series_covid19_recovered_global.csv
Daily reports are kept in a separate directory and are named using the format mm-dd-2020.csv.
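As a rough sketch of how a daily report file can be located, the helpers below build the file name and a raw download URL for a given date. The function names and the raw GitHub URL layout are assumptions made for illustration; they are not part of the covid module.

```python
from datetime import date, timedelta

# Assumed location of the JHU daily reports on GitHub (raw file view).
BASE_URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
            "csse_covid_19_data/csse_covid_19_daily_reports")

def daily_report_name(day):
    """Return the report file name for `day` in month-day-year order."""
    return day.strftime("%m-%d-%Y") + ".csv"

def daily_report_url(day):
    """Return the (assumed) raw download URL for the report of `day`."""
    return f"{BASE_URL}/{daily_report_name(day)}"

def yesterday(today=None):
    """Return the date one day before `today` (defaults to the current date)."""
    today = today or date.today()
    return today - timedelta(days=1)

print(daily_report_url(yesterday()))
```

Passing an explicit date to yesterday keeps the helper easy to check without depending on the current day.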
This code is not yet on PyPI. You can clone the repo, and the functions described below will all be available in the accompanying covid module. The covid module has the following dependencies, which need to be installed with pip: requests, pandas, matplotlib, seaborn.
You can use getCountriesDailyReport to obtain a pandas dataframe df holding the latest values for each of ["Confirmed","Deaths","Recovered"] by both Province_State and Country_Region as follows:
which = getYesterday()
df = getCountriesDailyReport(which)
You can view the structure of df as follows:
n = 1
nrows,ncols = df.shape
print(f'df has {nrows} rows and {ncols} columns with column names {df.columns.to_list()}')
print(f'First {n} rows are:')
print(df.iloc[:n,:])
df has 3134 rows and 12 columns with column names ['FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update', 'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active', 'Combined_Key']
First 1 rows are:
FIPS Admin2 Province_State Country_Region Last_Update \
0 45001.0 Abbeville South Carolina US 2020-04-26 02:30:51
Lat Long_ Confirmed Deaths Recovered Active \
0 34.223334 -82.461707 24 0 0 24
Combined_Key
0 Abbeville, South Carolina, US
You can plot this data aggregated by country and kind as follows. Note that setDefaults configures the graphs to be drawn using the seaborn visualisation library when the visualisation parameter is set to matplotlib. Alternatively, we can use the altair visualisation library by setting viz accordingly:
setDefaults()
viz = 'altair'
plotCountriesDailyReport(getCountriesDailyReport(which), which, topN=15,
                         color='red', kind='Deaths', visualisation=viz)
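To make the aggregation by country concrete, here is a minimal pandas sketch that sums one kind of count per country and keeps the largest totals. The helper name top_countries and the sample rows are hypothetical; the covid module may aggregate differently.

```python
import io
import pandas as pd

# Tiny hand-made sample in the daily report layout (values are illustrative only).
SAMPLE = """Province_State,Country_Region,Confirmed,Deaths,Recovered
South Carolina,US,24,0,0
New York,US,100,10,5
Hubei,China,50,4,40
,Italy,80,9,20
"""

def top_countries(df, kind="Deaths", topN=15):
    """Sum `kind` over all rows of each country and keep the topN largest."""
    totals = df.groupby("Country_Region")[kind].sum()
    return totals.sort_values(ascending=False).head(topN)

sample_df = pd.read_csv(io.StringIO(SAMPLE))
print(top_countries(sample_df, kind="Deaths", topN=2))
```

Summing before sorting matters because countries such as the US are reported as many province or county rows.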
We can also dig into the breakdown per country if available as follows:
plotCountryDailyReport(getCountriesDailyReport(which), 'US', which, topN=15,
                       color='red', kind='Deaths', visualisation=viz)
We can look at how infection and death counts have varied for a country over time if we aggregate with a groupby on country. We should see an equal number of values per country following this aggregation:
df = procTimeSeriesConfirmed()
print(f'Found {df.shape} (rows, cols) of cols={df.columns.values}')
ddf = df.groupby('country')['Confirmed'].count().sort_values(ascending=True)
print(f'max={ddf.max()}, min={ddf.min()}, count={len(ddf)}')
Found (17575, 4) (rows, cols) of cols=['day' 'country' 'Confirmed' 'LogConfirmed']
max=95, min=95, count=185
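The reshaping behind a frame like this can be sketched as follows: the JHU time series files hold one column per date, so they need to be melted into one row per day and country before grouping. The function name to_long, the sample values, the use of a parsed date for day and the choice of a base-10 log are all assumptions; procTimeSeriesConfirmed may work differently.

```python
import io
import numpy as np
import pandas as pd

# Tiny hand-made sample in the wide time series layout (values are illustrative only).
WIDE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
Hubei,China,30.9,112.2,10,20,40
Beijing,China,40.1,116.4,0,5,10
,Italy,43.0,12.0,0,0,3
"""

def to_long(wide, kind="Confirmed"):
    """Reshape one-column-per-date data into one row per (day, country)."""
    id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
    tidy = wide.melt(id_vars=id_cols, var_name="day", value_name=kind)
    tidy["day"] = pd.to_datetime(tidy["day"], format="%m/%d/%y")
    # Sum the provinces of each country so every country has one value per day.
    tidy = (tidy.rename(columns={"Country/Region": "country"})
                .groupby(["day", "country"], as_index=False)[kind].sum())
    # log(0) is undefined, so zero counts become NaN rather than -inf.
    tidy["Log" + kind] = np.log10(tidy[kind].where(tidy[kind] > 0))
    return tidy

long_df = to_long(pd.read_csv(io.StringIO(WIDE)))
print(long_df)
```

Because every country appears once per date column after the sum, the per-country counts come out equal, which is the check made above.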
Now we can plot a time series of confirmed cases of Covid-19 in China, Italy, Spain, the US and the UK as follows:
plotCountriesTimeSeries(df, ['China', 'Italy', 'Spain', 'US', 'United Kingdom'],
                        which, x='day', y='Confirmed', visualisation=viz)
And we can plot a time series of recorded deaths in these same countries as follows:
df = procTimeSeriesDeaths()
plotCountriesTimeSeries(df, ['China', 'Italy', 'Spain', 'US', 'United Kingdom'],
                        which, x='day', y='Deaths', visualisation=viz)
We can also view these as a log series over time:
plotCountriesTimeSeries(df, ['China', 'Italy', 'Spain', 'US', 'United Kingdom'],
                        which, x='day', y='LogDeaths', visualisation=viz)
This video provides an excellent demystifier on how to view the Covid data using the following ground rules:
- Use a log scale
- Focus on change not absolute numbers
- Don't plot against time
From this analysis we see that we want to diff Confirmed cases between days to build up a New column and then plot the logs of both against each other as follows:
ndf = procNewCasesTimeSeries(procTimeSeriesConfirmed(), 'Confirmed')
plotCountriesTimeSeries(ndf, ['China', 'US'], which, x='LogConfirmed', y='LogNew', visualisation=viz)
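The differencing step can be sketched in pandas as below. The helper name add_new_column, the sample data and the base-10 log are assumptions for illustration; procNewCasesTimeSeries may be implemented differently.

```python
import numpy as np
import pandas as pd

def add_new_column(df, kind="Confirmed"):
    """Add New and LogNew columns holding the day-on-day change of `kind`."""
    out = df.sort_values(["country", "day"]).copy()
    # Diff within each country so the first day of one country is not
    # compared with the last day of the previous one.
    out["New"] = out.groupby("country")[kind].diff()
    # Only positive changes have a log; zero or missing values become NaN.
    out["LogNew"] = np.log10(out["New"].where(out["New"] > 0))
    return out

# Illustrative cumulative counts for two made-up countries.
cumulative = pd.DataFrame({
    "day": [1, 2, 3, 1, 2, 3],
    "country": ["A", "A", "A", "B", "B", "B"],
    "Confirmed": [10, 30, 130, 5, 5, 15],
})
print(add_new_column(cumulative))
```

The first day of each country has no previous value, so its New entry is left as NaN rather than guessed.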
We can look at the same data across a wider range of countries as follows:
countries = ['China', 'Italy', 'Spain', 'US', 'United Kingdom']
plotCountriesTimeSeries(ndf, countries, which, x='LogConfirmed', y='LogNew', visualisation=viz)
We can also view the same set of countries in a similar way in respect of deaths. Note here the grid is being removed for clarity:
ndf = procNewCasesTimeSeries(procTimeSeriesDeaths(), 'Deaths')
plotCountriesTimeSeries(ndf, countries, which, x='LogDeaths', y='LogNew', grid=False, visualisation=viz)
It would also be nice to fix up the display of the log axis markers so that they show the actual numbers, and to filter out some of the low data values to make the trends a bit clearer. We can do that by setting log to True and leaving the grid on (note that this only works for altair right now):
plotCountriesTimeSeries(ndf, countries, which, x='Deaths', y='New', clampx=100, clampy=5,
                        log=True, grid=True, visualisation=viz)
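One plausible reading of clampx and clampy is a simple row filter that drops points below a minimum value on either axis. The sketch below works under that assumption; the helper name clamp and the sample numbers are hypothetical.

```python
import pandas as pd

def clamp(df, x, y, clampx=0, clampy=0):
    """Keep only rows where both plotted values reach their minimum."""
    return df[(df[x] >= clampx) & (df[y] >= clampy)]

# Illustrative values only.
points = pd.DataFrame({"Deaths": [20, 150, 400, 900], "New": [3, 4, 60, 120]})
kept = clamp(points, "Deaths", "New", clampx=100, clampy=5)
print(kept)
```

Rows with missing values fail both comparisons, so they are dropped as well.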
Finally, we can apply loess local regression to smooth these curves and produce the kind of graphic that you see in print and online media:
plotCountriesTimeSeries(ndf, countries, which, x='Deaths', y='New', clampx=100, clampy=5,
                        log=True, useLoess=True, grid=True, visualisation=viz)
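For readers curious about what loess does, here is a minimal numpy sketch of locally weighted linear regression with tricube weights. It is a simplified illustration, not the smoothing routine used by the covid module or by altair, and the function name loess_smooth and the frac parameter are assumptions.

```python
import numpy as np

def loess_smooth(x, y, frac=0.5):
    """Smooth y against x with locally weighted linear fits (tricube weights)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(3, int(np.ceil(frac * n))))  # points considered per local fit
    design = np.column_stack([np.ones(n), x])   # intercept and slope columns
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        # Bandwidth: distance to the k-th nearest point (guarded against zero).
        h = max(np.sort(dist)[k - 1], np.finfo(float).eps)
        w = np.clip(1.0 - (dist / h) ** 3, 0.0, 1.0) ** 3  # tricube kernel
        sw = np.sqrt(w)
        # Weighted least squares: scale rows by sqrt(weight), then solve.
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted

xs = np.arange(10.0)
rng = np.random.default_rng(0)
noisy = 2 * xs + 1 + rng.normal(0, 1, size=10)
print(loess_smooth(xs, noisy, frac=0.8))
```

A larger frac uses more neighbours in each fit and gives a smoother curve; for log-log plots the smoothing would be applied to the logged values.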
This site details a REST API that wraps up the same JHU dataset and presents it as JSON, which allows us to go from API call to formatted graph showing cases and deaths by country using altair as follows:
plotCountriesDailyReportFromAPI(visualisation=viz)
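The step from a JSON response to ranked rows ready for plotting can be sketched with the standard library alone. The payload and its field names below are hypothetical and only stand in for whatever the real API returns; no network call is made.

```python
import json

# Hypothetical payload; the real API's structure and field names may differ.
PAYLOAD = """
[
  {"country": "Alpha", "confirmed": 1200, "deaths": 40},
  {"country": "Beta", "confirmed": 300, "deaths": 55},
  {"country": "Gamma", "confirmed": 5000, "deaths": 12}
]
"""

def top_countries_from_json(text, field, topN=15):
    """Parse a JSON list of country records and rank them by `field`."""
    records = json.loads(text)
    ranked = sorted(records, key=lambda rec: rec[field], reverse=True)
    return [(rec["country"], rec[field]) for rec in ranked[:topN]]

print(top_countries_from_json(PAYLOAD, "deaths", topN=2))
```

In practice the text would come from an HTTP response body, for example one fetched with the requests library listed in the dependencies.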
There used to be an issue with the normalisation of this data, with Iran and South Korea appearing twice, but that seems to have been fixed.
It's also possible to plot a time series by country using this API and altair, as follows for UK confirmed cases:
country = 'united-kingdom'
plotCategoryTimeSeriesByCountryFromAPI('Confirmed', country, color='orange', visualisation=viz)
We can also look at the data in log format:
plotCategoryTimeSeriesByCountryFromAPI('Confirmed', country, color='orange', log=True, visualisation=viz)
We can also retrieve multiple categories for a country as follows, again for the UK:
plotCategoriesTimeSeriesByCountryFromAPI(country, which)
Here's that same data on a log scale:
plotCategoriesTimeSeriesByCountryFromAPI(country, which, log=True)
Finally let's look at the data for the US from the Covid API:
plotCategoriesTimeSeriesByCountryFromAPI('united-states', which, log=True)