malminhas / covid

Various tools for analysing the JHU covid dataset

covid

Visualise the Johns Hopkins Covid-19 dataset with a little help from nbdev.

Contents

  1. Introduction
  2. Installation
  3. Graphing current counts
  4. Graphing time series counts
  5. Graphing new versus existing cases
  6. Graphing current and time series counts using Covid API

1. Introduction

The accompanying covid module, built using nbdev, provides convenience utilities for graphing the Covid-19 dataset published by Johns Hopkins University (JHU) here. The JHU dataset is updated daily, with the latest figures held in separate time series CSV files here:

  • time_series_covid19_confirmed_global.csv
  • time_series_covid19_deaths_global.csv
  • time_series_covid19_recovered_global.csv

Daily reports are kept in this directory, one file per day, named in the month-first format mm-dd-yyyy.csv (for example 04-25-2020.csv).
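
As a minimal sketch of how a daily report can be located, the snippet below builds the file name and URL for a given day. Note that file names in the JHU repository put the month first. The helper names daily_report_filename and daily_report_url are hypothetical and not part of the covid module, and BASE_URL is an assumption about where the raw files are served from:

```python
from datetime import date, timedelta

# Assumed location of the raw JHU daily report files on GitHub.
BASE_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_daily_reports/"
)

def daily_report_filename(day: date) -> str:
    """Return the month-first file name for one day's report."""
    return day.strftime("%m-%d-%Y") + ".csv"

def daily_report_url(day: date) -> str:
    """Return the full URL of one day's report."""
    return BASE_URL + daily_report_filename(day)

# The latest complete report is normally yesterday's.
yesterday = date.today() - timedelta(days=1)
print(daily_report_url(yesterday))
```

The resulting URL can be passed straight to pandas.read_csv to obtain a dataframe.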

2. Installation

This code is not yet on PyPI. Clone the repo and all the functions described below will be available in the accompanying covid module. The covid module has the following dependencies, which need to be installed with pip: requests, pandas, matplotlib and seaborn.

3. Graphing current counts

You can use getCountriesDailyReport to obtain a pandas dataframe df holding the latest values for each of ["Confirmed","Deaths","Recovered"] by both Province_State and Country_Region as follows:

which = getYesterday()
df = getCountriesDailyReport(which)

You can view the structure of df as follows:

n = 1
nrows,ncols = df.shape
print(f'df has {nrows} rows and {ncols} columns with column names {df.columns.to_list()}')
print(f'First {n} rows are:')
print(df.iloc[:n,:])

df has 3134 rows and 12 columns with column names ['FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update', 'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active', 'Combined_Key']
First 1 rows are:
      FIPS     Admin2  Province_State Country_Region         Last_Update  \
0  45001.0  Abbeville  South Carolina             US 2020-04-26 02:30:51   

         Lat      Long_  Confirmed  Deaths  Recovered  Active  \
0  34.223334 -82.461707         24       0          0      24   

                    Combined_Key  
0  Abbeville, South Carolina, US  

You can plot this data aggregated by country and kind as follows. Note that setDefaults configures the graphs to be drawn using the seaborn visualisation library when the visualisation parameter is set to matplotlib. We can also use the altair visualisation library as an alternative, as here:

setDefaults()
viz = 'altair'
plotCountriesDailyReport(getCountriesDailyReport(which), which, topN=15, 
                         color='red', kind='Deaths',visualisation=viz)
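
The aggregation behind a plot like this can be sketched with plain pandas. This is only an illustration of the idea, not the module's actual implementation: top_countries is a hypothetical helper and the sample rows are made up, although the column names match the daily report shown earlier:

```python
import pandas as pd

def top_countries(df: pd.DataFrame, kind: str = "Deaths", top_n: int = 15) -> pd.Series:
    """Sum one count column per country and keep the largest top_n totals."""
    totals = df.groupby("Country_Region")[kind].sum()
    return totals.sort_values(ascending=False).head(top_n)

# Tiny made-up sample using the daily report's column names.
sample = pd.DataFrame({
    "Province_State": ["South Carolina", "New York", "Hubei", None],
    "Country_Region": ["US", "US", "China", "Italy"],
    "Confirmed": [24, 100, 50, 80],
    "Deaths": [0, 10, 5, 8],
    "Recovered": [0, 1, 40, 20],
})

top = top_countries(sample, kind="Deaths", top_n=2)
print(top)
```

Rows for the same country (here the two US rows) are summed into a single total before ranking.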

We can also dig into the breakdown per country if available as follows:

plotCountryDailyReport(getCountriesDailyReport(which), 'US', which, topN=15, 
                       color='red', kind='Deaths', visualisation=viz)
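
A per-country breakdown amounts to filtering on Country_Region and then grouping by Province_State. The sketch below assumes that approach. top_provinces is a hypothetical helper and the figures are made up:

```python
import pandas as pd

def top_provinces(df: pd.DataFrame, country: str, kind: str = "Deaths",
                  top_n: int = 15) -> pd.Series:
    """Sum one count column per province within a single country."""
    rows = df[df["Country_Region"] == country]
    totals = rows.groupby("Province_State")[kind].sum()
    return totals.sort_values(ascending=False).head(top_n)

# Made-up county-level rows using the daily report's column names.
sample = pd.DataFrame({
    "Admin2": ["Abbeville", "Aiken", "Albany", None],
    "Province_State": ["South Carolina", "South Carolina", "New York", "Hubei"],
    "Country_Region": ["US", "US", "US", "China"],
    "Deaths": [0, 3, 7, 5],
})

by_state = top_provinces(sample, "US")
print(by_state)
```

Because groupby drops missing keys by default, a country reported without any Province_State values yields an empty result, which is why the breakdown is only available for some countries.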

4. Graphing time series counts

We can look at how infection and death counts have varied for a country over time if we aggregate by doing a groupby on country. We should see an equal number of values per country following this aggregation:

df = procTimeSeriesConfirmed()
print(f'Found {df.shape} (rows, cols) of cols={df.columns.values}')
ddf = df.groupby('country')['Confirmed'].count().sort_values(ascending=True)
print(f'max={ddf.max()}, min={ddf.min()}, count={len(ddf)}')

Found (17575, 4) (rows, cols) of cols=['day' 'country' 'Confirmed' 'LogConfirmed']
max=95, min=95, count=185
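
For readers curious how a long table with these four columns can be derived from JHU's wide CSV layout (one column per date), here is a minimal sketch. It is not necessarily how procTimeSeriesConfirmed works: to_long is a hypothetical helper, the sample figures are made up, treating day as a datetime is an assumption, and so is the choice of log10 with zero counts clipped to 1:

```python
import numpy as np
import pandas as pd

def to_long(wide: pd.DataFrame, kind: str = "Confirmed") -> pd.DataFrame:
    """Reshape a wide JHU time series table to one row per day and country."""
    id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
    # Turn every date column into a (day, count) row.
    tidy = wide.melt(id_vars=id_cols, var_name="day", value_name=kind)
    tidy["day"] = pd.to_datetime(tidy["day"], format="%m/%d/%y")
    tidy = tidy.rename(columns={"Country/Region": "country"})
    # Sum the provinces so each country has exactly one value per day.
    out = tidy.groupby(["day", "country"], as_index=False)[kind].sum()
    # Assumption: log10, with zero counts clipped to 1 so their log is 0.
    out["Log" + kind] = np.log10(out[kind].clip(lower=1))
    return out

# Made-up sample in the wide layout of the global time series files.
wide = pd.DataFrame({
    "Province/State": ["Hubei", "Beijing", None],
    "Country/Region": ["China", "China", "Italy"],
    "Lat": [30.9, 40.1, 41.8],
    "Long": [112.2, 116.4, 12.5],
    "1/22/20": [444, 14, 0],
    "1/23/20": [444, 22, 0],
})

df_long = to_long(wide)
counts = df_long.groupby("country")["Confirmed"].count()
print(df_long)
```

After the groupby every country has the same number of rows, one per date column, which is the property checked above.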

Now we can plot a time series of confirmed cases of Covid-19 in China, Italy, Spain, the US and the UK as follows:

plotCountriesTimeSeries(df, ['China', 'Italy', 'Spain', 'US', 'United Kingdom'], 
                        which, x='day', y='Confirmed', visualisation=viz)

And we can plot a time series of recorded deaths in these same countries as follows:

df = procTimeSeriesDeaths()
plotCountriesTimeSeries(df, ['China', 'Italy', 'Spain', 'US', 'United Kingdom'], 
                        which, x='day', y='Deaths', visualisation=viz)

We can also view these as a log series over time:

plotCountriesTimeSeries(df, ['China', 'Italy', 'Spain', 'US', 'United Kingdom'], 
                        which, x='day', y='LogDeaths', visualisation=viz)
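
The appeal of the log view is that exponential growth turns into a straight line. A small sketch with made-up counts that double every day shows the log values rising by a constant step (log10 is used here as an assumption about the base):

```python
import math

# Made-up counts that double every day (pure exponential growth).
counts = [100 * 2 ** day for day in range(6)]

# On a log scale each doubling adds the same amount.
logs = [math.log10(c) for c in counts]
steps = [later - earlier for earlier, later in zip(logs, logs[1:])]
print(steps)
```

Any flattening of a country's curve on the log plot therefore signals that growth has dropped below exponential.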

5. Graphing new versus existing cases

This video provides an excellent demystifier on how to view the Covid data using the following ground rules:

  • Use a log scale
  • Focus on change not absolute numbers
  • Don't plot against time

From this analysis we see that we want to diff Confirmed cases between days to build up a New column and then plot the logs of both against each other as follows:

ndf = procNewCasesTimeSeries(procTimeSeriesConfirmed(), 'Confirmed')
plotCountriesTimeSeries(ndf, ['China', 'US'], which, x='LogConfirmed', y='LogNew', visualisation=viz)
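
The diff step can be sketched in pandas as follows. This is an illustration under stated assumptions rather than the body of procNewCasesTimeSeries: add_new_column is a hypothetical helper, the sample data is made up, and clipping values below 1 before taking log10 is an assumption:

```python
import numpy as np
import pandas as pd

def add_new_column(df: pd.DataFrame, kind: str = "Confirmed") -> pd.DataFrame:
    """Add New and LogNew columns holding the day-on-day change per country."""
    out = df.sort_values(["country", "day"]).copy()
    # Diff within each country; the first day has no predecessor, so use 0.
    out["New"] = out.groupby("country")[kind].diff().fillna(0)
    # Assumption: zeros and negative corrections are clipped to 1 before the log.
    out["LogNew"] = np.log10(out["New"].clip(lower=1))
    return out

# Made-up cumulative counts for two countries over three days.
sample = pd.DataFrame({
    "day": [1, 2, 3, 1, 2, 3],
    "country": ["China"] * 3 + ["US"] * 3,
    "Confirmed": [100, 150, 250, 1, 11, 111],
})

new_cases = add_new_column(sample)
print(new_cases)
```

Grouping before the diff matters: without it, the first row of each country would be differenced against the last row of the previous one.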

We can look at the same data across a wider range of countries as follows:

countries = ['China', 'Italy', 'Spain', 'US', 'United Kingdom']
plotCountriesTimeSeries(ndf, countries, which, x='LogConfirmed', y='LogNew', visualisation=viz)

We can also view the same set of countries in a similar way in respect of deaths. Note here the grid is being removed for clarity:

ndf = procNewCasesTimeSeries(procTimeSeriesDeaths(), 'Deaths')
plotCountriesTimeSeries(ndf, countries, which, x='LogDeaths', y='LogNew', grid=False, visualisation=viz)

It would be nice to fix up the display of the log axis markers so they show the actual numbers, and to filter out some of the low data values to make the trends a bit clearer. We can do that by setting log to True and leaving the grid on, as follows (note this only works for altair right now):

plotCountriesTimeSeries(ndf, countries, which, x='Deaths', y='New', clampx=100, clampy=5, 
                        log=True, grid=True, visualisation=viz)
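
One plausible reading of clampx and clampy is a simple lower-bound filter applied before plotting. The sketch below assumes exactly that. The clamp helper is hypothetical and the figures are made up:

```python
import pandas as pd

def clamp(df: pd.DataFrame, x: str, y: str,
          clampx: float = 0, clampy: float = 0) -> pd.DataFrame:
    """Keep only the rows where x >= clampx and y >= clampy."""
    return df[(df[x] >= clampx) & (df[y] >= clampy)]

# Made-up cumulative deaths and daily new deaths.
sample = pd.DataFrame({
    "Deaths": [10, 120, 300, 500],
    "New": [2, 20, 3, 60],
})

kept = clamp(sample, x="Deaths", y="New", clampx=100, clampy=5)
print(kept)
```

Filtering like this also removes zero values, which cannot be drawn on a log axis at all.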

Finally we can apply loess local regression to smooth these curves to produce the kind of graphic that you see in print and online media:

plotCountriesTimeSeries(ndf, countries, which, x='Deaths', y='New', clampx=100, clampy=5, 
                        log=True, useLoess=True, grid=True, visualisation=viz)
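
For intuition about what loess does, here is a minimal pure-Python sketch of local linear regression with tricube weights. It is a simplified variant (degree 1, no robustness iterations) and is not necessarily what the covid module or altair runs internally. The function name loess and the bandwidth parameter are assumptions for illustration:

```python
def loess(xs, ys, bandwidth=0.5):
    """Smooth ys with local linear regression using tricube weights.

    bandwidth is the fraction of points that define each local window.
    """
    n = len(xs)
    k = max(2, int(round(bandwidth * n)))  # points per local window
    smoothed = []
    for x0 in xs:
        # The distance to the k-th nearest point sets the window width.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0
        # Tricube weights: nearby points count most, distant ones count 0.
        ws = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * y for w, y in zip(ws, ys)) / sw
        # Weighted least-squares slope of the local straight line.
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx if sxx else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed

# Made-up noisy series: a straight line with alternating +/-3 noise.
xs = list(range(10))
noisy = [2 * x + 1 + (3 if x % 2 else -3) for x in xs]
print(loess(xs, noisy, bandwidth=0.6))
```

A larger bandwidth averages over more neighbours and gives a smoother curve, at the cost of blurring genuine changes in trend.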

6. Graphing current and time series counts using Covid API

This site details an API that nicely wraps up the same JHU dataset and presents it as JSON via a REST interface. This allows us to go from an API call to a formatted graph showing cases and deaths by country using altair as follows:

plotCountriesDailyReportFromAPI(visualisation=viz)

There used to be an issue with normalisation of this data, with Iran and South Korea appearing twice, but that seems to have been fixed.

It's also possible to plot a time series by country using this API with altair, as follows for UK confirmed cases:

country = 'united-kingdom'
plotCategoryTimeSeriesByCountryFromAPI('Confirmed', country, color='orange', visualisation=viz)

We can also look at the data in log format:

plotCategoryTimeSeriesByCountryFromAPI('Confirmed', country, color='orange', log=True, visualisation=viz)

We can also retrieve multiple categories for a country as follows, again for the UK:

plotCategoriesTimeSeriesByCountryFromAPI(country, which)

Here's that same data on a log scale:

plotCategoriesTimeSeriesByCountryFromAPI(country, which, log=True)

Finally let's look at the data for the US from the Covid API:

plotCategoriesTimeSeriesByCountryFromAPI('united-states', which, log=True)

License: Apache License 2.0

