tankwin08/data

A collection of public data sets for testing visualization methods. These data sets are at various stages of preparation: some are just raw data, some are CSV files, and some are exposed as AMD modules. This collection is messy, but with some digging you may find hidden gems.

Targets for import:

TopoJSON Collection - World countries and subdivisions
Classic datasets from Petra Isenberg et al.
Soul of the Community (American Statistical Association)
World Population Prospects (United Nations)
Employment (Bureau of Labor Statistics)
Healthy People (Centers for Disease Control)
GapMinder Data
NASA Satellite-Derived Environmental Indicators
IMF Public Finances in Modern History Database
Executions in the US by type over time
Datasets used in the book, An Introduction to Categorical Data Analysis
Energy Information Administration Open Data
Data sets from Five Thirty Eight
Data sets in the Infovis Wiki
Data sets from Andy Kirk's Link Archive
Makeover Monday Datasets
SOCR Datasets

Here's a listing of data sets with more detail. Columns will be marked in terms of their type for visualization, including:

Q = Quantitative, continuously varying numeric columns
T = Temporal, a timestamp
O = Ordered, distinct categories with a natural order (e.g. Low, Medium, High)
N = Nominal, distinct categories with no natural order (e.g. Ethnicity)
G = Geospatial identifiers (e.g. Country, City)

UCI Machine Learning Repository - Adult (3.8 MB)

This data set demonstrates a mix of quantitative, ordinal, and nominal columns. To explore it visually, it would be useful to aggregate the data on the fly before rendering.

age: Q
workclass: N
education: O
education-num: Q
marital-status: N
occupation: N
relationship: N
race: N
sex: N
capital-gain: Q
capital-loss: Q
hours-per-week: Q
native-country: N
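
As a minimal sketch of what aggregating on the fly could look like, the snippet below groups rows by categorical columns and computes a count and mean of one quantitative column. The sample rows are made up for illustration and are not records from the data set; the `aggregate` helper is a hypothetical name, not part of this repository.

```python
from collections import defaultdict

# Illustrative rows only; these are NOT real records from the Adult data set.
rows = [
    {"education": "Bachelors", "sex": "Female", "hours-per-week": 40},
    {"education": "Bachelors", "sex": "Male", "hours-per-week": 50},
    {"education": "HS-grad", "sex": "Female", "hours-per-week": 30},
    {"education": "HS-grad", "sex": "Female", "hours-per-week": 40},
]


def aggregate(rows, group_by, measure):
    """Group rows by N/O columns; return count and mean of one Q column."""
    sums = defaultdict(lambda: [0, 0.0])  # key -> [count, running sum]
    for row in rows:
        key = tuple(row[col] for col in group_by)
        sums[key][0] += 1
        sums[key][1] += row[measure]
    return {
        key: {"count": n, "mean": total / n}
        for key, (n, total) in sums.items()
    }


# One small summary row per (education, sex) pair, ready for a bar chart.
summary = aggregate(rows, ["education", "sex"], "hours-per-week")
```

The summary has one entry per distinct group, so its size depends on the number of categories rather than the number of rows, which is what makes it practical to recompute as the user changes the grouping.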

Data Canvas Sense Your City (237MB or Real-time API)

This data set contains measures collected by DIY sensor kits across several major cities ["San Francisco", "Bangalore", "Boston", "Geneva", "Rio de Janeiro", "Shanghai", "Singapore"]. There is a visualization competition for this data set, submissions due March 20.

city: G
timestamp: T
temperature: Q
light: Q
airquality: Q
sound: Q
humidity: Q
dust: Q

Medical Store Geospatial Challenge (< 100KB)

This data set is small, but comes with a set of real-world questions about the data. This is also a competition, with submissions due April 25.

Referrers - Each row corresponds to information on a particular client referral source.
referrer_code: N
visit_count: Q
city: G (referrer city)
postal_code_referrer: G
(latitude, longitude): G
Clients - Each row corresponds to a client visit to the store
client_id: N
referrer_code: N
city: G (referrer city)
postal_code_referrer: G
(latitude, longitude): G
initial_visit_date: T
product_count: Q

UCI Machine Learning Repository - Individual household electric power consumption (20 MB)

This data set would be a great candidate to show multi-scale temporal aggregation.

timestamp: T
global_active_power: Q
global_reactive_power: Q
voltage: Q
global_intensity: Q
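
One way to think about multi-scale temporal aggregation is to floor each timestamp to the start of its hour, day, or month and average the values that land in the same bin. The sketch below does this with the standard library only. The readings are hypothetical values, and `SCALES` and `mean_by_scale` are illustrative names, not part of this repository.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (timestamp, global_active_power) readings; values are made up.
readings = [
    (datetime(2007, 1, 1, 0, 15), 2.0),
    (datetime(2007, 1, 1, 0, 45), 4.0),
    (datetime(2007, 1, 1, 13, 0), 6.0),
    (datetime(2007, 1, 2, 9, 30), 8.0),
]

# Each scale maps a timestamp to the start of the bin that contains it.
SCALES = {
    "hour": lambda t: t.replace(minute=0, second=0, microsecond=0),
    "day": lambda t: t.replace(hour=0, minute=0, second=0, microsecond=0),
    "month": lambda t: t.replace(day=1, hour=0, minute=0, second=0,
                                 microsecond=0),
}


def mean_by_scale(readings, scale):
    """Average the measure within each time bin at the requested scale."""
    floor = SCALES[scale]
    bins = defaultdict(list)
    for timestamp, value in readings:
        bins[floor(timestamp)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(bins.items())}


hourly = mean_by_scale(readings, "hour")
daily = mean_by_scale(readings, "day")
monthly = mean_by_scale(readings, "month")
```

A visualization could switch between these summaries as the user zooms the time axis, showing the coarse scale for the overview and finer scales for shorter ranges.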

BrightKite User Check-ins (57.2 MB)

This data set would be a useful example for multi-scale aggregation in both space and time. This has been used as the motivating example for several Big Data visualization systems based on data cubes (imMens: Real‐time Visual Querying of Big Data, Nanocubes for real-time exploration of spatiotemporal datasets).

user-id: N
timestamp: T
(latitude, longitude): G
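
The sketch below illustrates the general idea of binning check-ins in both space and time at a chosen resolution. It is a plain counting approach, not the data structure used by imMens or Nanocubes. The check-ins are made-up values, and `bin_key` and `count_checkins` are hypothetical helper names.

```python
import math
from collections import Counter
from datetime import datetime

# Hypothetical (user-id, timestamp, latitude, longitude) check-ins.
checkins = [
    ("u1", datetime(2009, 3, 1, 8, 0), 37.77, -122.42),
    ("u2", datetime(2009, 3, 1, 9, 0), 37.80, -122.27),
    ("u1", datetime(2009, 3, 2, 8, 0), 37.33, -121.89),
    ("u3", datetime(2009, 4, 5, 12, 0), 40.71, -74.01),
]


def bin_key(timestamp, lat, lon, cell_degrees, time_format):
    """Map one check-in to a (time bin, grid row, grid column) key."""
    row = math.floor(lat / cell_degrees)
    col = math.floor(lon / cell_degrees)
    return (timestamp.strftime(time_format), row, col)


def count_checkins(checkins, cell_degrees, time_format):
    """Count check-ins per space-time bin at the given resolution."""
    return Counter(
        bin_key(t, lat, lon, cell_degrees, time_format)
        for _, t, lat, lon in checkins
    )


# Coarse view: 1-degree cells by month. Fine view: 0.1-degree cells by day.
coarse = count_checkins(checkins, 1.0, "%Y-%m")
fine = count_checkins(checkins, 0.1, "%Y-%m-%d")
```

Precomputing counts at several resolutions like this trades memory for query speed, which is the same trade-off that data cube approaches manage more carefully.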

ACLED (Armed Conflict Location and Event Data Project) (35MB)

This data set contains entries for each violent event in Africa from 1997 to 2014. This data set would be a good candidate for visualization with a linked timeline and choropleth map, where selections in the timeline can drive the filtering of data shown on the map.

timestamp: T
(latitude, longitude): G
country: G
number of fatalities: Q
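
The linked-view idea can be sketched as a filter followed by an aggregation: the timeline selection supplies a date range, and the map is redrawn from per-country totals for that range. The events and country labels below are placeholders rather than ACLED records, and `fatalities_by_country` is a hypothetical helper name.

```python
from collections import Counter
from datetime import date

# Placeholder (timestamp, country, number of fatalities) events; not real data.
events = [
    (date(2001, 5, 1), "Country A", 3),
    (date(2001, 9, 12), "Country B", 10),
    (date(2005, 2, 20), "Country A", 7),
    (date(2013, 11, 3), "Country B", 1),
]


def fatalities_by_country(events, start, end):
    """Sum fatalities per country for events with start <= timestamp < end."""
    totals = Counter()
    for day, country, fatalities in events:
        if start <= day < end:
            totals[country] += fatalities
    return dict(totals)


# A timeline selection covering 2001 through 2005 drives the map's values.
selection = fatalities_by_country(events, date(2001, 1, 1), date(2006, 1, 1))
```

Each time the selection changes, the map would re-run this aggregation and recolor the countries from the returned totals.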

Safecast (3.2GB)

Grassroots sensor data about nuclear radiation in Japan

Statistical Computing Statistical Graphics Data Expo - Airline on-time performance (12GB)

A great data set for scalability testing. This is the data set used in the Crossfilter Demo.

The GDELT Data Set (~100GB)

This would be a great data set for more extreme scalability testing. There is an Open Source project for loading this data set into Spark on AWS.

The Indian Census has lots of public data.

Best Buy has a developer portal for querying their data via a Web API.
