niderhoff / big-data-datasets

Curated list of Publicly available Big Data datasets. Uncompressed size in brackets. No Blockchains.

Big Data Datasets

Structured

Text

  • CommonCrawl (AWS) - A corpus of web crawl data composed of over 25 billion web pages.
    • Semi-Structured (includes Metadata): 250 TB
  • DBpedia - curated Wikipedia data
  • Freebase
    • Freebase: 22 GB (250 GB)
    • Freebase Deleted Triples: 2 GB (8 GB)
    • Freebase/Wikidata Mappings: 22 MB (243 MB)
  • StackOverflow Data (BigQuery) - 182 GB

Image

Audio

Bonus: API / Streamdata / "Self-Service"

Bonus: Opendata / Census / Government data

Meta / Lists / Sources

These pages might link to datasets that are already in the list.
