This repository contains datasets of daily time-series data related to COVID-19 for 30+ countries around the world. For most countries, the data is at the spatial resolution of states/provinces, although for US, UK, NL and CO, it is at the finer resolution of county/municipality. All regions are assigned a unique key, which resolves discrepancies between ISO/ NUTS/ FIPS codes, etc.
There are multiple types of data:
- Outcome data
Y(i,t)
, such as cases, deaths, tests, for regions i and time t - Static covariate data
X(i)
, such as population size, GDP, latitude/ longitude - Dynamic covariate data
X(i,t)
, such as mobility, weather - Dynamic interventional data
A(i,t)
, such as government lockdowns
The data is drawn from multiple sources, as listed below.
The data is stored in separate csv/ json files, which can be easily merged due to the use of consistent geographic (and temporal) keys.
Table | Keys1 | Content | URL | Source2 |
---|---|---|---|---|
Master | [key][date] |
Flat table with records from all other tables joined by key and date |
master.csv | All tables below |
Index | [key] |
Various names and codes, useful for joining with other datasets | index.csv, index.json | Wikidata, DataCommons |
Demographics | [key] |
Various (current3) population statistics | demographics.csv, demographics.json | Wikidata, DataCommons |
Economy | [key] |
Various (current3) economic indicators | economy.csv, economy.json | Wikidata, DataCommons |
Epidemiology | [key][date] |
COVID-19 cases, deaths, recoveries and tests | epidemiology.csv, epidemiology.json | Various2 |
Geography | [key] |
Geographical information about the region | geography.csv, geography.json | Wikidata |
Health | [key] |
Health indicators for the region | health.csv, health.json | Wikidata, WorldBank |
Hospitalizations | [key][date] |
Information related to patients of COVID-19 and hospitals | hospitalizations.csv, hospitalizations.json | Various2 |
Mobility | [key][date] |
Various metrics related to the movement of people | mobility.csv, google-mobility.json | Google, Apple |
Oxford Government Response | [key][date] |
Government interventions and their relative stringency | oxford-government-response.csv, oxford-government-response.json | University of Oxford |
Weather | [key][date] |
Dated meteorological information for each region | weather.csv, weather.json | NOAA |
WorldBank | [key] |
Latest record for each indicator from WorldBank for all reporting countries | worldbank.csv, worldbank.json | WorldBank |
By Age | [key][date] |
Epidemiology and hospitalizations data stratified by age | by-age.csv, by-age.json | Various2 |
By Sex | [key][date] |
Epidemiology and hospitalizations data stratified by sex | by-sex.csv, by-sex.json | Various2 |
1 key
is a unique string for the specific geographical region built from a combination
of codes such as ISO 3166
, NUTS
, FIPS
and other local equivalents.
2 Refer to the data sources for specifics about each data source and
the associated terms of use.
3 Datasets without a date
column contain the most recently reported information for
each datapoint to date.
For more information about how to use these files see the section about using the data, and for more details about each dataset see the section about understanding the data.
There are many other public COVID-19 datasets. However, we believe this dataset is unique in the way that it merges multiple global sources, at a fine spatial resolution, using a consistent set of region keys. We hope this will make it easier for researchers to use. We are also very transparent about the data sources, and the code for ingesting and merging the data is easy to understand and modify.
A simple visualization tool was built to explore the Open COVID-19 datasets, the Open COVID-19 Explorer: | If you want to see interactive charts with a unique UX, don't miss what @Mahks built using the Open COVID-19 dataset: |
You can also check out the great work of @quixote79, a MapBox-powered interactive map site: | Experience clean, clear graphs with smooth animations thanks to the work of @jmullo: |
Become an armchair epidemiologist with the COVID-19 timeline simulation tool built by @LeviticusMB: | Whether you want an interactive map, compare stats or look at charts, @saadmas has you covered with a COVID-19 Daily Tracking site: |
Compare per-million data at Omnimodel thanks to @OmarJay1: |
The data is available as CSV and JSON files, which are published in Github Pages so they can be served directly to Javascript applications without the need of a proxy to set the correct headers for CORS and content type. Even if you only want the CSV files, using the URL served by Github Pages is preferred in order to avoid caching issues and potential, future breaking changes.
For the purpose of making the data as easy to use as possible, there is a master table
which contains the columns of all other tables joined by key
and date
. However,
performance-wise, it may be better to download the data separately and join the tables locally.
Each table has a full version as well as subsets with only the last 30, 14, 7 and 1 days of data. The full version is accessible at the URL described in the table above. The subsets can be found by appending the number of days to the path. For example, the subsets of the master table are available at the following locations:
- Full version: https://open-covid-19.github.io/data/v2/master.csv
- Last 30 days: https://open-covid-19.github.io/data/v2/30/master.csv
- Last 14 days: https://open-covid-19.github.io/data/v2/14/master.csv
- Last 7 days: https://open-covid-19.github.io/data/v2/7/master.csv
- Latest: https://open-covid-19.github.io/data/v2/latest/master.csv
Note that the latest
version contains the last non-null record for each key, whereas all others
contain the last N
days of data (all of which could be null for some keys).
If you are trying to use this data alongside your own datasets, then you can use the Index table to get access to the ISO 3166 / NUTS / FIPS code, although administrative subdivisions are not consistent among all reporting regions. For example, for the intra-country reporting, some EU countries use NUTS2, others NUTS3 and many ISO 3166-2 codes.
You can find several examples in the examples subfolder with code showcasing how to load and analyze the data for several programming environments. If you want the short version, here are a few snippets to get started.
You can use Google Colab if you want to run your analysis without having to install anything in your computer, simply go to this URL: https://colab.research.google.com/github/open-covid-19/data.
If you prefer R, then this is all you need to do to load the epidemiology data:
data <- read.csv("https://open-covid-19.github.io/data/v2/master.csv")
In Python, you need to have the package pandas
installed to get started:
import pandas
data = pandas.read_csv("https://open-covid-19.github.io/data/v2/master.csv")
Loading the JSON file using jQuery can be done directly from the output folder,
this code snippet loads the master table into the data
variable:
$.getJSON("https://open-covid-19.github.io/data/v2/master.json", data => { ... }
You can also use Powershell to get the latest data for a country directly from the command line, for example to query the latest data for Australia:
Invoke-WebRequest 'https://open-covid-19.github.io/data/v2/latest/master.csv' | ConvertFrom-Csv | `
where Key -eq 'AU' | select country_name,date,total_confirmed,total_deceased,total_recovered
Make sure that you are using the URL linked at the table above and not the raw GitHub file, the latter is subject to change at any moment in non-compatible ways, and due to the configuration of GitHub's raw file server you may run into potential caching issues.
Missing values will be represented as nulls, whereas zeroes are used when a true value of zero is reported.
Flat table with records from all other tables joined by key
and date
. See below for information
about all the different tables and columns.
Non-temporal data related to countries and regions. It includes keys, codes and names for each region, which is helpful for displaying purposes or when merging with other data:
Name | Type | Description | Example |
---|---|---|---|
key | string |
Unique string identifying the region | US_CA_06001 |
wikidata | string |
Wikidata ID corresponding to this key | Q107146 |
datacommons | string |
DataCommons ID corresponding to this key | geoId/06001 |
country_code | string |
ISO 3166-1 alphanumeric 2-letter code of the country | US |
country_name | string |
American English name of the country, subject to change | United States of America |
subregion1_code | string |
(Optional) ISO 3166-2 or NUTS 2/3 code of the subregion | CA |
subregion1_name | string |
(Optional) American English name of the subregion, subject to change | California |
subregion2_code | string |
(Optional) FIPS code of the county (or local equivalent) | 06001 |
subregion2_name | string |
(Optional) American English name of the county (or local equivalent), subject to change | Alameda County |
3166-1-alpha-2 | string |
ISO 3166-1 alphanumeric 2-letter code of the country | US |
3166-1-alpha-3 | string |
ISO 3166-1 alphanumeric 3-letter code of the country | USA |
aggregation_level | integer [0-2] |
Level at which data is aggregated, i.e. country, state/province or county level | 2 |
Information related to the population demographics for each region:
Name | Type | Description | Example |
---|---|---|---|
key | string |
Unique string identifying the region | KR |
population | integer |
Total count of humans | 51606633 |
male_population | integer |
Total count of males | 25846211 |
female_population | integer |
Total count of females | 25760422 |
rural_population | integer |
Population in a rural area | 9568386 |
urban_population | integer |
Population in an urban area | 42038247 |
largest_city_population | integer |
Population in the largest city of the region | 9963497 |
clustered_population | integer |
Population in urban agglomerations of more than 1 million | 25893097 |
population_density | double [persons per squared kilometer] |
Population per squared kilometer of land area | 529.3585 |
human_development_index | double [0-1] |
Composite index of life expectancy, education, and per capita income indicators | 0.903 |
Information related to the economic development for each region:
Name | Name | Description | Example |
---|---|---|---|
key | string |
Unique string identifying the region | CN_HB |
gdp | integer [USD] |
Gross domestic product; monetary value of all finished goods and services | 24450604878 |
gdp_per_capita | integer [USD] |
Gross domestic product divided by total population | 1148 |
human_capital_index | double [0-1] |
Mobilization of the economic and professional potential of citizens | 0.765 |
Information related to the COVID-19 infections for each date-region pair:
Name | Type | Description | Example |
---|---|---|---|
date | string |
ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-30 |
key | string |
Unique string identifying the region | CN_HB |
new_confirmed1 | integer |
Count of new cases confirmed after positive test on this date | 34 |
new_deceased1 | integer |
Count of new deaths from a positive COVID-19 case on this date | 2 |
new_recovered1 | integer |
Count of new recoveries from a positive COVID-19 case on this date | 13 |
new_tested2 | integer |
Count of new COVID-19 tests performed on this date | 13 |
total_confirmed3 | integer |
Cumulative sum of cases confirmed after positive test to date | 6447 |
total_deceased3 | integer |
Cumulative sum of deaths from a positive COVID-19 case to date | 133 |
total_tested2,3 | integer |
Cumulative sum of COVID-19 tests performed to date | 133 |
1Values can be negative, typically indicating a correction or an adjustment in the way
they were measured. For example, a case might have been incorrectly flagged as recovered one date so
it will be subtracted from the following date.
2When the reporting authority makes a distinction between PCR and antibody testing, only
PCR tests are reported here.
3Total count will not always amount to the sum of daily counts, because many authorities
make changes to criteria for counting cases, but not always make adjustments to the data. There is
also potential missing data. All of that makes the total counts drift away from the sum of all
daily counts over time, which is why the cumulative values, if reported, are kept in a separate
column.
Information related to the geography for each region:
Name | Type | Description | Example |
---|---|---|---|
key | string |
Unique string identifying the region | CN_HB |
latitude | double |
Floating point representing the geographic coordinate | 30.9756 |
longitude | double |
Floating point representing the geographic coordinate | 112.2707 |
elevation | integer [meters] |
Elevation above the sea level | 875 |
area | integer [squared kilometers] |
Area encompassing this region | 3729 |
Health related indicators for each region:
Name | Type | Description | Example |
---|---|---|---|
key | string |
Unique string identifying the region | BN |
life_expectancy | double [years] |
Average years that an individual is expected to live | 75.722 |
smoking_prevalence | double [%] |
Percentage of smokers in population | 16.9 |
diabetes_prevalence | double [%] |
Percentage of persons with diabetes in population | 13.3 |
infant_mortality_rate | double |
Infant mortality rate (per 1,000 live births) | 9.8 |
adult_male_mortality_rate | double |
Mortality rate, adult, male (per 1,000 male adults) | 143.719 |
adult_female_mortality_rate | double |
Mortality rate, adult, female (per 1,000 male adults) | 98.803 |
pollution_mortality_rate | double |
Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population) | 13.3 |
comorbidity_mortality_rate | double [%] |
Mortality from cardiovascular disease, cancer, diabetes or cardiorespiratory disease between exact ages 30 and 70 | 16.6 |
hospital_beds | double |
Hospital beds (per 1,000 people) | 2.7 |
nurses | double |
Nurses and midwives (per 1,000 people) | 5.8974 |
physicians | double |
Physicians (per 1,000 people) | 1.609 |
health_expenditure | double [USD] |
Health expenditure per capita | 671.4115 |
out_of_pocket_health_expenditure | double [USD] |
Out-of-pocket health expenditure per capita | 34.756348 |
Note that the majority of the health indicators are only available at the country level.
Information related to patients of COVID-19 and hospitals:
Name | Type | Description | Example |
---|---|---|---|
key | string |
Unique string identifying the region | CN_HB |
new_hospitalized* | integer |
Count of new cases hospitalized after positive test on this date | 34 |
new_intensive_care* | integer |
Count of new cases admitted into ICU after a positive COVID-19 test on this date | 2 |
new_ventilator* | integer |
Count of new COVID-19 positive cases which require a ventilator on this date | 13 |
total_hospitalized** | integer |
Cumulative sum of cases hospitalized after positive test to date | 6447 |
total_intensive_care** | integer |
Cumulative sum of cases admitted into ICU after a positive COVID-19 test to date | 133 |
total_ventilator** | integer |
Cumulative sum of COVID-19 positive cases which require a ventilator to date | 133 |
current_hospitalized** | integer |
Count of current (active) cases hospitalized after positive test to date | 34 |
current_intensive_care** | integer |
Count of current (active) cases admitted into ICU after a positive COVID-19 test to date | 2 |
current_ventilator** | integer |
Count of current (active) COVID-19 positive cases which require a ventilator to date | 13 |
*Values can be negative, typically indicating a correction or an adjustment in the way they were measured. For example, a case might have been incorrectly flagged as recovered one date so it will be subtracted from the following date.
**Total count will not always amount to the sum of daily counts, because many authorities make changes to criteria for counting cases, but not always make adjustments to the data. There is also potential missing data. All of that makes the total counts drift away from the sum of all daily counts over time, which is why the cumulative values, if reported, are kept in a separate column.
Google's and Apple's Mobility Reports] are presented in CSV form as mobility.csv with the following columns:
Name | Type | Description | Example |
---|---|---|---|
date | string |
ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-30 |
key | string |
Unique string identifying the region | US_CA |
mobility_driving | double [%] |
Percentage change in movement via driving compared to baseline | -15 |
mobility_transit | double [%] |
Percentage change in movement via public transit compared to baseline | -15 |
mobility_walking | double [%] |
Percentage change in movement via walking compared to baseline | -15 |
mobility_transit_stations | double [%] |
Percentage change in visits to transit station locations compared to baseline | -15 |
mobility_retail_and_recreation | double [%] |
Percentage change in visits to retail and recreation locations compared to baseline | -15 |
mobility_grocery_and_pharmacy | double [%] |
Percentage change in visits to grocery and pharmacy locations compared to baseline | -15 |
mobility_parks | double [%] |
Percentage change in visits to park locations compared to baseline | -15 |
mobility_residential | double [%] |
Percentage change in visits to residential locations compared to baseline | -15 |
mobility_workplaces | double [%] |
Percentage change in visits to workplace locations compared to baseline | -15 |
Summary of a government's response to the events, including a stringency index, collected from University of Oxford:
Name | Type | Description | Example |
---|---|---|---|
date | string |
ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-30 |
key | string |
Unique string identifying the region | US_CA |
school_closing | integer [0-3] |
Schools are closed | 2 |
workplace_closing | integer [0-3] |
Workplaces are closed | 2 |
cancel_public_events | integer [0-3] |
Public events have been cancelled | 2 |
restrictions_on_gatherings | integer [0-3] |
Gatherings of non-household members are restricted | 2 |
public_transport_closing | integer [0-3] |
Public transport is not operational | 0 |
stay_at_home_requirements | integer [0-3] |
Self-quarantine at home is mandated for everyone | 0 |
restrictions_on_internal_movement | integer [0-3] |
Travel within country is restricted | 1 |
international_travel_controls | integer [0-3] |
International travel is restricted | 3 |
income_support | integer [USD] |
Value of fiscal stimuli, including spending or tax cuts | 20449287023 |
debt_relief | integer [0-3] |
Debt/contract relief for households | 0 |
fiscal_measures | integer [USD] |
Value of fiscal stimuli, including spending or tax cuts | 20449287023 |
international_support | integer [USD] |
Giving international support to other countries | 274000000 |
public_information_campaigns | integer [0-2] |
Government has launched public information campaigns | 1 |
testing_policy | integer [0-3] |
Country-wide COVID-19 testing policy | 1 |
contact_tracing | integer [0-2] |
Country-wide contact tracing policy | 1 |
emergency_investment_in_healthcare | integer [USD] |
Emergency funding allocated to healthcare | 500000 |
investment_in_vaccines | integer [USD] |
Emergency funding allocated to vaccine research | 100000 |
stringency_index | double [0-100] |
Overall stringency index | 71.43 |
For more information about each field and how the overall stringency index is computed, see the Oxford COVID-19 government response tracker.
Daily weather information from nearest station reported by NOAA:
Name | Type | Description | Example |
---|---|---|---|
date | string |
ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-30 |
key | string |
Unique string identifying the region | US_CA |
noaa_station* | string |
Identifier for the weather station | USC00206080 |
noaa_distance* | double [kilometers] |
Distance between the location coordinates and the weather station | 28.693 |
average_temperature | double [celsius] |
Recorded hourly average temperature | 11.2 |
minimum_temperature | double [celsius] |
Recorded hourly minimum temperature | 1.7 |
maximum_temperature | double [celsius] |
Recorded hourly maximum temperature | 19.4 |
rainfall | double [millimeters] |
Rainfall during the entire day | 51.0 |
snowfall | double [millimeters] |
Snowfall during the entire day | 0.0 |
*The reported weather station refers to the nearest station which provides temperature measurements, but rainfall and snowfall may come from a different nearby weather station. In all cases, only weather stations which are at most 300km from the location coordinates are considered.
Most recent value for each indicator of the WorldBank Database.
Name | Type | Description | Example |
---|---|---|---|
key | string |
Unique string identifying the region | ES |
<indicator> |
double |
Value of the indicator corresponding to this column, column name is indicator code | 0 |
Refer to the WorldBank documentation for more details, or refer to the
worldbank_indicators.csv file for a short description of each
indicator. Each column uses the indicator code as its name, and the rows are filled with the values
for the corresponding key
.
Note that WorldBank data is only available at the country level and it's not included in the master table. If no values are reported by WorldBank for the country since 2015, the row value will be null.
Epidemiology and hospitalizations data stratified by age:
Name | Type | Description | Example |
---|---|---|---|
date | string |
ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-30 |
key | string |
Unique string identifying the region | FR |
${var}_age_${index} |
integer |
Instances of ${var} for age bin ${index} |
139 |
age_bin_${index} |
integer |
Range for the age values inside of bin ${index} , both ends inclusive |
30-39 |
Values in this table are stratified versions of the columns available in the
epidemiology and hospitalizatons tables. Each row contains up
to 10 distinct bins, for example:
{new_deceased_age_00: 1, new_deceased_age_01: 45, ... , new_deceased_age_09: 32}
.
Each row may have different bins, depending on the data source. This table tries to capture the raw
data with as much fidelity as possible up to 10 bins. The range of each bin is encoded into the
age_bin_${index}
variable, for example:
{age_bin_00: 0-9, age_bin_01: 10-19, age_bin_02: 20-29, ... , age_bin_09: 90-}
.
Several things worth noting about this table:
- This table contains very sparse data, with very few combinations of regions and variables available.
- Records without a known age bin are discarded, so the sum of all ages may not necessary amount to the variable from the corresponding table.
- The upper and lower range of the range are inclusive values. For example, range
0-9
includes individuals with age zero up to (and including) 9. - A row may have less than 10 bins, but never more than 10. For example:
{age_bin_00: 0-49, age_bin_01: 50-, age_bin_02: null, ...}
Epidemiology and hospitalizations data stratified by sex:
Name | Type | Description | Example |
---|---|---|---|
date | string |
ISO 8601 date (YYYY-MM-DD) of the datapoint | 2020-03-30 |
key | string |
Unique string identifying the region | FR |
${var}_sex_male |
integer |
Instances of ${var} for male individuals |
87 |
${var}_sex_female |
integer |
Instances of ${var} for female individuals |
68 |
Values in this table are stratified versions of the columns available in the
epidemiology and hospitalizatons tables. Each row contains
each variable with either _male
or _female
suffix:
{new_deceased_male: 45, new_deceased_female: 32, new_tested_male: 45, new_tested_female: 32, ...}
.
Several things worth noting about this table:
- This table contains very sparse data, with very few combinations of regions and variables available.
- Records without a known sex are discarded, so the sum of all ages may not necessary amount to the variable from the corresponding table.
For countries where both country-level and subregion-level data is available, the entry which has a
null value for the subregion level columns in the index
table indicates upper-level aggregation.
For example, if a data point has values
{country_code: US, subregion1_code: CA, subregion2_code: null, ...}
then that record will have
data aggregated at the subregion1 (i.e. state/province) level. If subregion1_code
were null, then
it would be data aggregated at the country level.
Another way to tell the level of aggregation is the aggregation_level
of the index
table, see
the schema documentation for more details about how to interpret it.
Please note that, sometimes, the country-level data and the region-level data come from different sources so adding up all region-level values may not equal exactly to the reported country-level value. See the data loading tutorial for more information.
There is also a notices.csv file which is manually updated with quirks about the data. The goal is to be able to query by key and date, to get a list of applicable notices to the requested subset.
Please note that the following datasets are maintained only to preserve backwards compatibility, but shouldn't be used in any new projects:
The output data files are published under the CC BY-SA license. All data is subject to the terms of agreement individual to each data source, refer to the sources of data table for more details. All other code and assets are published under the Apache License 2.0.
All data in this repository is retrieved automatically. When possible, data is retrieved directly from the relevant authorities, like a country's ministry of health.
To update the contents of the output folder, first install the dependencies:
pip install -r requirements.txt
Then run the following script from the source folder to update all datasets:
cd src
python update.py
See the source documentation for more technical details.
From within the src folder run the following command:
python -m unittest discover test
If you spot an error in the data, or there's a country you would like to include, the best way to contribute to this project is by helping maintain the data on the relevant Wikipedia article. Not only can that data be parsed automatically by this project, but it will also help inform millions of others that receive their information from Wikipedia.
For code contributions, take a look at the source directory for more information.
If you do something cool with the data (e.g., visualization or analysis), please let us know!
The main creator of this project is Oscar Wahltinez. Other contributors will be listed here in the future.