data-preservation-programs / slingshot

Official public repository for feedback and data collection in Filecoin Slingshot

[dataset request] Various AWS open datasets

xinaxu opened this issue · comments

Name Description Size Format URL
World Bank - Light Every Night Light Every Night - World Bank Nighttime Light Data – provides open access to all nightly imagery and data from the Visible Infrared Imaging Radiometer Suite Day-Night Band (VIIRS DNB) from 2012-2020 and the Defense Meteorological Satellite Program Operational Linescan System (DMSP-OLS) from 1992-2013. The underlying data are sourced from the NOAA National Centers for Environmental Information (NCEI) archive. Additional processing by the University of Michigan enables access in Cloud Optimized GeoTIFF format (COG) and search using the Spatial Temporal Asset Catalog (STAC) standard. The data is published and openly available under the terms of the World Bank’s open data license. 273.18 TiB Various
TSBench TSBench comprises thousands of benchmark evaluations for time series forecasting methods. It provides various metrics (i.e. measures of accuracy, latency, number of model parameters, ...) of 13 time series forecasting methods across 44 heterogeneous datasets. Time series forecasting methods include both classical and deep learning methods while several hyperparameters settings are evaluated for the deep learning methods. 570.53 GiB Various
Coupled Model Intercomparison Project Phase 5 (CMIP5) University of Wisconsin-Madison Probabilistic Downscaling Dataset The University of Wisconsin Probabilistic Downscaling (UWPD) is a statistically downscaled dataset based on the Coupled Model Intercomparison Project Phase 5 (CMIP5) climate models. UWPD consists of three variables, daily precipitation and maximum and minimum temperature. The spatial resolution is 0.1°x0.1° degree resolution for the United States and southern Canada east of the Rocky Mountains. 3.44 TiB Various
Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET) The TaRGET (Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription) Program is a research consortium funded by the National Institute of Environmental Health Sciences (NIEHS). The goal of the collaboration is to address the role of environmental exposures in disease pathogenesis as a function of epigenome perturbation, including understanding the environmental control of epigenetic mechanisms and assessing the utility of surrogate tissue analysis in mouse models of disease-relevant environmental exposures. 15.95 TiB Various
AI2 Meaningful Citations Data Set 630 paper annotations 397.26 GiB Various
OpenEEW Grillo has developed an IoT-based earthquake early-warning system, 1.81 TiB Various
COVID-19 Molecular Structure and Therapeutics Hub Aggregating critical information to accelerate drug discovery for the molecular modeling and simulation community. 1.27 TiB Various
CBERS on AWS Imagery acquired 269.04 GiB Various
MODIS The Moderate Resolution Imaging Spectroradiometer (MODIS) MCD43A4 Version 6 Nadir Bidirectional Reflectance Distribution Function (BRDF)-Adjusted Reflectance (NBAR) dataset is produced daily using 16 days of Terra and Aqua MODIS data at 500 meter (m) resolution. The view angle effects are removed from the directional reflectances, resulting in a stable and consistent NBAR product. Data are temporally weighted to the ninth day which is reflected in the Julian date in the file name. 26.89 TiB Various
Sentinel-5P Level 2 This data set consists of observations from the Sentinel-5 Precursor (Sentinel-5P) satellite of the European Commission’s Copernicus Earth Observation Programme. Sentinel-5P is a polar orbiting satellite that completes 14 orbits of the Earth a day. It carries the TROPOspheric Monitoring Instrument (TROPOMI) which is a spectrometer that senses ultraviolet (UV), visible (VIS), near (NIR) and short wave infrared (SWIR) to monitor ozone, methane, formaldehyde, aerosol, carbon monoxide, nitrogen dioxide and sulphur dioxide in the atmosphere. The satellite was launched in October 2017 and entered routine operational phase in March 2019. Data is available from July 2018 onwards. 93.04 TiB Various
Nanopore Reference Human Genome This dataset includes the sequencing and assembly of a reference standard human genome (GM12878) using the MinION nanopore sequencing instrument with the R9.4 1D chemistry. 90.4 TiB Various
Cornell EAS Data Lake Earth & Atmospheric Sciences at Cornell University has created a public data lake of climate data. The data is stored in columnar storage formats (ORC) to make it straightforward to query using standard tools like Amazon Athena or Apache Spark. The data itself is originally intended to be used for building decision support tools for farmers and digital agriculture. The first dataset is the historical NDFD / NDGD data distributed by NCEP / NOAA / NWS. The NDFD (National Digital Forecast Database) and NDGD (National Digital Guidance Database) contain gridded forecasts and observations at 2.5km resolution for the Contiguous United States (CONUS). There are also 5km grids for several smaller US regions and non-contiguous territories, such as Hawaii, Guam, Puerto Rico and Alaska. NOAA distributes archives of the NDFD/NDGD via its NOAA Operational Model Archive and Distribution System (NOMADS) in Grib2 format. The data has been converted to ORC to optimize storage space and, more importantly, to simplify data access via standard data analytics tools. 5.14 TiB Various
1000 Genomes The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals. 696.81 TiB Various
NOAA National Blend of Models (NBM) The National Blend of Models (NBM) is a nationally consistent and skillful suite of calibrated forecast guidance based on a blend of both NWS and non-NWS numerical weather prediction model data and post-processed model guidance. The goal of the NBM is to create a highly accurate, skillful and consistent starting point for the gridded forecast. 640.54 TiB Various
Transiting Exoplanet Survey Satellite (TESS) The Transiting Exoplanet Survey Satellite (TESS) is a multi-year survey that will discover exoplanets in orbit around bright stars across the entire sky using high-precision photometry. The survey will also enable a wide variety of stellar astrophysics, solar system science, and extragalactic variability studies. More information about TESS is available at MAST and the TESS Science Support Center. 226.18 TiB Various
Kepler Mission Data The Kepler mission observed the brightness of more than 180,000 stars near the Cygnus constellation at a 30 minute cadence for 4 years in order to find transiting exoplanets, study variable stars, and find eclipsing binaries. More information about the Kepler mission is available at MAST. 17.18 TiB Various
Tabula Muris Tabula Muris is a compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 100,000 cells from 20 organs and tissues. These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations, and allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as T-lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3’-end counting, enabled the survey of thousands of cells at relatively low coverage, while the other, FACS-based full length transcript analysis, enabled characterization of cell types with high sensitivity and coverage. The cumulative data provide the foundation for an atlas of transcriptomic cell biology. See: 4.27 TiB Various
Legal Entity Identifier (LEI) and Legal Entity Reference Data (LE-RD) The Legal Entity Identifier (LEI) is a 20-character, alpha-numeric code based on the ISO 17442 standard developed by the International Organization for Standardization (ISO). It connects to key reference information that enables clear and unique identification of legal entities participating in financial transactions. Each LEI contains information about an entity’s ownership structure and thus answers the questions of 'who is who’ and ‘who owns whom’. Simply put, the publicly available LEI data pool can be regarded as a global directory, which greatly enhances transparency in the global marketplace. The Financial Stability Board (FSB) has reiterated that global LEI adoption underpins “multiple financial stability objectives” such as improved risk management in firms as well as better assessment of micro and macro prudential risks. As a result, it promotes market integrity while containing market abuse and financial fraud. Last but not least, LEI rollout “supports higher quality and accuracy of financial data overall”. The publicly available LEI data pool is a unique key to standardized information on legal entities globally. The data is registered and regularly verified according to protocols and procedures established by the Regulatory Oversight Committee. In cooperation with its partners in the Global LEI System, the Global Legal Entity Identifier Foundation (GLEIF) continues to focus on further optimizing the quality, reliability and usability of LEI data, empowering market participants to benefit from the wealth of information available with the LEI population. The drivers of the LEI initiative, i.e. the Group of 20, the FSB and many regulators around the world, have emphasized the need to make the LEI a broad public good. The Global LEI Index, made available by GLEIF, greatly contributes to meeting this objective. 
It puts the complete LEI data at the disposal of any interested party, conveniently and free of charge. The benefits for the wider business community to be generated with the Global LEI Index grow in line with the rate of LEI adoption. To maximize the benefits of entity identification across financial markets and beyond, firms are therefore encouraged to engage in the process and get their own LEI. Obtaining an LEI is easy. Registrants simply contact their preferred business partner from the list of LEI issuing organizations available on the GLEIF website. 4.67 TiB Various
Textbook Question Answering (TQA) 1,076 textbook lessons, 26,260 questions, 6229 images 397.26 GiB Various
NASA NEX A collection of Earth science datasets maintained by NASA, including climate change projections and satellite images of the Earth's surface. 53.83 TiB Various
Provision of Web-Scale Parallel Corpora for Official European Languages (ParaCrawl) ParaCrawl is a set of large parallel corpora to/from English for all official EU languages by a broad web crawling effort. State-of-the-art methods are applied for the entire processing chain from identifying web sites with translated text all the way to collecting, cleaning and delivering parallel corpora that are ready as training data for CEF.AT and translation memories for DG Translation. 76.44 TiB Various
Sentinel-3 This data set consists of observations from the Sentinel-3 satellite of the European Commission’s Copernicus Earth Observation Programme. Sentinel-3 is a polar orbiting satellite that completes 14 orbits of the Earth a day. It carries the Ocean and Land Colour Instrument (OLCI) for medium resolution marine and terrestrial optical measurements, the Sea and Land Surface Temperature Radiometer (SLSTR), the SAR Radar Altimeter (SRAL), the MicroWave Radiometer (MWR) and the Precise Orbit Determination (POD) instruments. The satellite was launched in 2016 and entered routine operational phase in 2017. Data is available from July 2017 onwards. 900.04 TiB Various
Coupled Model Intercomparison Project 6 The sixth phase of global coupled ocean-atmosphere general circulation model ensemble. 1.67 PiB Various
CAFE60 reanalysis The CSIRO Climate retrospective Analysis and Forecast Ensemble system: version 1 (CAFE60v1) provides a large ensemble retrospective analysis of the global climate system from 1960 to present with sufficiently many realizations and at spatio-temporal resolutions suitable to enable probabilistic climate studies. Using a variant of the ensemble Kalman filter, 96 climate state estimates are generated over the most recent six decades. These state estimates are constrained by monthly mean ocean, atmosphere and sea ice observations such that their trajectories track the observed state while enabling estimation of the uncertainties in the approximations to the retrospective mean climate over recent decades. Strongly coupled data assimilation (SCDA) is implemented via an ensemble transform Kalman filter in order to constrain a general circulation climate model to observations. Satellite (altimetry, sea surface temperature, sea ice concentration) and in situ ocean temperature and salinity profiles are directly assimilated each month, whereas atmospheric observations are sub-sampled from the JRA55 atmospheric reanalysis. Strong coupling is implemented via explicit cross domain covariances between ocean, atmosphere, sea ice and ocean biogeochemistry. Atmospheric and surface ocean fields are available at daily resolution and monthly resolution for the land, subsurface ocean and sea ice. The system also produces a complete data archive of initial conditions potentially enabling individual forecasts for all members each month over the 60 year period. The size of the ensemble and application of strongly coupled data assimilation lead to new insights for future reanalyses. CAFE60v1 has been validated in comparison to empirical indices of the major climate teleconnections and blocking from various reanalysis products (ERA5, JRA55, NCEP NR1). 
Estimates of the large scale ocean structure and transports have been compared to those derived from gridded observational products (WOA18, HadISST, ERSSTv5) and climate model projections (CMIP). Sea ice (extent, concentration and variability) and land surface (precipitation and surface air temperatures) are also compared to a variety of model (ERA5, CMIP) and observational (GPCP, AWAP, HadCRU4, GIOMAS, NSIDC, HadISST) products. This analysis shows that CAFE60v1 is a useful, comprehensive and unique data resource for studying internal climate variability and predictability, including the recent climate response to anthropogenic forcing on multi-year to decadal time scales. 58.89 TiB Various
Longitudinal Nutrient Deficiency Dataset associated with the 2021 AAAI Paper- Detection and Prediction of Nutrient Deficiency Stress using Longitudinal Aerial Imagery. The dataset contains 3 image sequences of aerial imagery from 386 farm parcels which have been annotated for nutrient deficiency stress. 1.79 GiB Various
New Jersey Statewide Digital Aerial Imagery Catalog The New Jersey Office of GIS, NJ Office of Information Technology manages a series of 11 digital orthophotography and scanned aerial photo maps collected at various years ranging from 1930 to 2017. Each year’s worth of imagery are available as Cloud Optimized GeoTIFF (COG) files and some years are available as compressed MrSID and/or JP2 files. Additionally, each year of imagery is organized into a tile grid scheme covering the entire geography of New Jersey. Many years share the same tiling grid while others have unique grids as defined by the project at the time. 10.06 TiB Various
Basic Local Alignment Sequences Tool (BLAST) Databases A centralized repository of pre-formatted BLAST databases created by the National Center for Biotechnology Information (NCBI). 142.61 TiB Various
InRad COVID-19 X-Ray and CT Scans This dataset is a collection of anonymized thoracic radiographs (X-Rays) and computed tomography (CT) scans of patients with suspected COVID-19. Images are accompanied by a positive or negative diagnosis for SARS-CoV2 infection via RT-PCR. These images were provided by Hospital das Clínicas da Universidade de São Paulo, Hospital Sirio-Libanes, and by Laboratory Fleury. 266.3 GiB Various
Ohio State Cardiac MRI Raw Data (OCMR) OCMR is an open-access repository that provides multi-coil k-space data for cardiac cine. The fully sampled MRI datasets are intended for quantitative comparison and evaluation of image reconstruction methods. The free-breathing, prospectively undersampled datasets are intended to evaluate their performance and generalizability qualitatively. 179.96 GiB Various
District of Columbia - Classified Point Cloud LiDAR LiDAR point cloud data for Washington, DC is available for anyone to use on Amazon S3. 314.15 GiB Various
NOAA Global Extratropical Surge and Tide Operational Forecast System (Global ESTOFS) NOAA's Global Extratropical Surge and Tide Operational Forecast System (Global ESTOFS) provides users with nowcasts (analyses of near present conditions) and forecast guidance of water level conditions for the entire globe. Global ESTOFS has been developed to serve the marine navigation, weather forecasting, and disaster mitigation user communities. Global ESTOFS was developed in a collaborative effort between the NOAA/National Ocean Service (NOS)/Office of Coast Survey, the NOAA/National Weather Service (NWS)/National Centers for Environmental Prediction (NCEP) Central Operations (NCO), the University of Notre Dame, the University of North Carolina, and The Water Institute of the Gulf. The model generates forecasts out to 180 hours four times per day; forecast output includes water levels caused by the combined effects of storm surge and tides, by astronomical tides alone, and by sub-tidal water levels (isolated storm surge). 71.61 TiB Various
COVID-19 Genome Sequence Dataset A centralized sequence repository for all records containing sequence associated with the novel coronavirus (SARS-CoV-2) submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). Included are both the original sequences submitted by the principal investigator as well as SRA-processed sequences that require the SRA Toolkit for analysis. Additionally, submitter provided metadata included in associated BioSample and BioProject records is available alongside NCBI calculated data, such as k-mer based taxonomy analysis results, contiguous assemblies (contigs) and associated statistics such as contig length, blast results for the assembled contigs, contig annotation, blast databases of contigs and their annotated peptides, and VCF files generated for each record relative to the SARS-CoV-2 RefSeq record. Finally, metadata is additionally made available in parquet format to facilitate search and filtering using the AWS Athena Service. 1.02 PiB Various
NOAA Coastal Lidar Data Lidar (light detection and ranging) is a technology that can measure the 3-dimensional location of objects, including the solid earth surface. The data consists of a point cloud of the positions of solid objects that reflected a laser pulse, typically from an airborne platform. In addition to the position, each point may also be attributed by the type of object it reflected from, the intensity of the reflection, and other system dependent metadata. The NOAA Coastal Lidar Data is a collection of lidar projects from many different sources and agencies, geographically focused on the coastal areas of the United States of America. The data is provided in Entwine Point Tiles format, which is a lossless streamable octree of the point cloud. Datasets are maintained in their original projects and care should be taken when merging projects. The coordinate reference system for the data is the NAD83(2011) UTM zone appropriate for the center of each data set and the orthometric datum appropriate for that area (for example, NAVD88 in the mainland United States, PRVD02 in Puerto Rico, or GUVD03 in Guam). The geoid model used is reflected in the data set resource name. 123.63 TiB Various
AgricultureVision Agriculture-Vision aims to be a publicly available large-scale aerial agricultural image dataset that is high-resolution, multi-band, and with multiple types of patterns annotated by agronomy experts. The original dataset affiliated with the 2020 CVPR paper includes 94,986 512x512 images sampled from 3,432 farmlands with nine types of annotations: double plant, drydown, endrow, nutrient deficiency, planter skip, storm damage, water, waterway and weed cluster. All of these patterns have substantial impacts on field conditions and the final yield. These farmland images were captured between 2017 and 2019 across multiple growing seasons in numerous farming locations in the US. Each field image contains four color channels: Near-infrared (NIR), Red, Green and Blue. We first randomly split the 3,432 farmland images with a 6/2/2 train/val/test ratio. We then assign each sampled image to the split of the farmland image it is cropped from. This guarantees that no cropped images from the same farmland will appear in multiple splits in the final dataset. The generated (supervised) Agriculture-Vision dataset thus contains 56,944/18,334/19,708 train/val/test images. 914.14 GiB Various
Terra Fusion Data Sampler The Terra Basic Fusion dataset is a fused dataset of the original Level 1 radiances 136.23 TiB Various
SILAM Air Quality Air Quality is a global SILAM atmospheric composition and air quality forecast performed on a daily basis for > 100 species and covering the troposphere and the stratosphere. The output produces 3D concentration fields and aerosol optical thickness. The data are unique: 20km resolution for global AQ models is unseen worldwide. 104.4 GiB Various
KITTI Vision Benchmark Suite Dataset and benchmarks for computer vision research in the context of autonomous driving. The dataset has been recorded in and around the city of Karlsruhe, Germany using the mobile platform AnnieWay (VW station wagon) which has been equipped with several RGB and monochrome cameras, a Velodyne HDL 64 laser scanner as well as an accurate RTK corrected GPS/IMU localization unit. The dataset has been created for computer vision and machine learning research on stereo, optical flow, visual odometry, semantic segmentation, semantic instance segmentation, road segmentation, single image depth prediction, depth map completion, 2D and 3D object detection and object tracking. In addition, several raw data recordings are provided. The datasets are captured by driving around the mid-size city of Karlsruhe, in rural areas and on highways. Up to 15 cars and 30 pedestrians are visible per image. 894.47 GiB Various
NOAA Fundamental Climate Data Records (FCDR) NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC). 80.47 TiB Various
GATK Test Data The GATK test data resource bundle is a collection of files for resequencing human genomic data with the 975.97 GiB Various
NOAA/PMEL Ocean Climate Stations Moorings The mission of the Ocean Climate Stations (OCS) Project is to make meteorological and 1.64 GiB Various
Boreas Autonomous Driving Dataset This autonomous driving dataset includes data from a 128-beam Velodyne Alpha-Prime lidar, a 5MP Blackfly camera, a 360-degree Navtech radar, and post-processed Applanix POS LV GNSS data. This dataset was collected in various weather conditions (sun, rain, snow) over the course of a year. The intended purpose of this dataset is to enable benchmarking of long-term all-weather odometry and metric localization across various sensor types. In the future, we hope to also support an object detection benchmark. 4.39 TiB Various
Allen Ivy Glioblastoma Atlas This dataset consists of images of glioblastoma human brain tumor tissue sections that have been probed for expression of particular genes believed to play a role in development of the cancer. Each tissue section is adjacent to another section that was stained with a reagent useful for identifying histological features of the tumor. Each of these types of images has been completely annotated for tumor features by a machine learning process trained by expert medical doctors. 8.52 TiB Various
OpenAQ Global, aggregated physical air quality data from public data sources provided by government, research-grade and other sources. These awesome groups do the hard work of measuring these data and publicly sharing them, and our community makes them more universally-accessible to both humans and machines. 1.04 TiB Various
3000 Rice Genomes Project The 3000 Rice Genome Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries. 96.64 TiB Various
ChEMBL - Data Lakehouse Ready ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs. This representation of ChEMBL is stored in Parquet format and most easily utilized through Amazon Athena. Follow the documentation for install instructions (< 2 minute install). New ChEMBL releases occur sporadically; the most up to date information on ChEMBL releases can be found here. 8.43 GiB Various
Open Targets - Data Lakehouse Ready This is a Parquet representation of the Open Targets Platform's latest export. The Open Targets Platform integrates evidence from genetics, genomics, transcriptomics, drugs, animal models and scientific literature to score and rank target-disease associations for drug target identification. The Open Targets Platform is a freely available resource for the integration of genetics, genomics, and chemical data to aid systematic drug target identification and prioritisation. 15.72 GiB Various
Smithsonian Open Access The Smithsonian’s mission is the "increase and diffusion of knowledge" and has been collecting since 1846. The Smithsonian, through its efforts to digitize its multidisciplinary collections, has created millions of digital assets and related metadata describing the collection objects. On February 25th, 2020, the Smithsonian released over 2.8 million CC0 interdisciplinary 2-D and 3-D images, related metadata, and additionally, research data from researchers across the Smithsonian. The 2.8 million "open access" collections are a subset of the Smithsonian’s 155 million objects, 2.1 million library volumes and 156,000 cubic feet of archival collections held in 19 museums, 9 research centers, libraries, archives and the National Zoo. Digitization of collections is ongoing. 618.51 TiB Various
Multi-Scale Ultra High Resolution (MUR) Sea Surface Temperature (SST) A global, gap-free, gridded, daily 1 km Sea Surface Temperature (SST) dataset created by merging multiple Level-2 satellite SST datasets. Those input datasets include the NASA Advanced Microwave Scanning Radiometer-EOS (AMSR-E), the JAXA Advanced Microwave Scanning Radiometer 2 (AMSR-2) on GCOM-W1, the Moderate Resolution Imaging Spectroradiometers (MODIS) on the NASA Aqua and Terra platforms, the US Navy microwave WindSat radiometer, the Advanced Very High Resolution Radiometer (AVHRR) on several NOAA satellites, and in situ SST observations from the NOAA iQuam project. Data are available from 2002 to present in Zarr format. The original source of the MUR data is the NASA JPL Physical Oceanography DAAC. 8.59 TiB Various
OpenSurfaces A large database of annotated surfaces created from real-world consumer photographs. 1.08 TiB Various
Tabula Muris Senis Tabula Muris Senis is a comprehensive compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 500,000 cells from 18 organs and tissues across the mouse lifespan. We discovered cell-specific changes occurring across multiple cell types and organs, as well as age related changes in the cellular composition of different organs. Using single-cell transcriptomic data we were able to assess cell type specific manifestations of different hallmarks of aging, such as senescence, changes in the activity of metabolic pathways, depletion of stem-cell populations, genomic instability and the role of inflammation as well as other changes in the organism’s immune system. Tabula Muris Senis provides a wealth of new molecular information about how the most significant hallmarks of aging are reflected in a broad range of tissues and cell types. See: 105.61 TiB Various
Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3) The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the 74.35 GiB Various
NOAA Unified Forecast System Subseasonal to Seasonal Prototypes The Unified Forecast System Subseasonal to Seasonal prototypes consist of reforecast data from the UFS atmosphere-ocean coupled model experimental prototype versions 5, 6, and 7 produced by the Medium Range and Subseasonal to Seasonal Application team of the UFS-R2O project. The UFS prototypes are the first dataset released to the broader weather community for analysis and feedback as part of the development of the next generation operational numerical weather prediction system from NWS. The dataset includes all the major weather variables for atmosphere, land, ocean, sea ice, and ocean waves. 152.21 TiB Various
Storm EVent ImageRy (SEVIR) Collection of spatially and temporally aligned GOES-16 ABI satellite imagery, NEXRAD radar mosaics, and GOES-16 GLM lightning detections. 969.68 GiB Various
Yale-CMU-Berkeley (YCB) Object and Model Set This project primarily aims to facilitate performance benchmarking in robotics research. The dataset provides mesh models, RGB, RGB-D and point cloud images of over 80 objects. The physical objects are also available via the YCB benchmarking project. The data are collected by two state of the art systems: UC Berkeley's scanning rig and the Google scanner. The UC Berkeley scanning rig data provide meshes generated with Poisson reconstruction, meshes generated with volumetric range image integration, textured versions of both meshes, Kinbody files for using the meshes with OpenRAVE, 600 High-resolution RGB images, 600 RGB-D images, and 600 point cloud images for each object. The Google scanner data provides 3 meshes with different resolutions (16k, 64k, and 512k polygons), textured versions of each mesh, and Kinbody files for using the meshes with OpenRAVE. 209.1 GiB Various
NapierOne Mixed File Dataset NapierOne is a modern cybersecurity mixed file data set, primarily aimed at, but not limited to, ransomware detection and forensic analysis. The dataset contains over 500,000 distinct files, representing 44 distinct popular file types. It was designed to address the known deficiency in research reproducibility and improve consistency by facilitating research replication and repeatability. The data set was inspired by the Govdocs1 data set and it is intended that ‘NapierOne’ be used as a complement to this original data set. An investigation was performed with the goal of determining the common file types currently in use. No specific research was found that explicitly provided this information, so an alternative consensus approach was employed. This involved combining the findings from multiple sources of file type usage into an overall ranked list. After that, 5,000 real-world example files were gathered, and a specific data subset was created, for each of the common file types identified. In some circumstances, multiple data subsets were created for a specific file type, each subset representing a specific characteristic for that file type. For example, there are multiple data subsets for the ZIP file type with each subset containing examples of a specific compression method. Ransomware execution tends to produce files that have high entropy, so examples of file types that naturally have this attribute are also present. The resulting entire data set comprises more than 90 separate data subsets divided between 44 distinct file types, resulting in over 500,000 unique files in total. Currently, the data set contains examples of the following file types: APK, BIN, BMP, CSS, CSV, DOC, DOCX, DWG, ELF, EPS, EPUB, EXE, GIF, GZIP, HTML, ICS, JS, JPG, JSON, MKV, MP3, MP4, ODS, OXPS, PDF, PNG, PPT, PPTX, PS1, RAR, SVG, TAR, TIF, TXT, WEBP, XLS, XLSX, XML, ZIP, ZLIB, 7Zip 1.6 TiB Various
NOAA Global Mosaic of Geostationary Satellite Imagery (GMGSI) NOAA/NESDIS Global Mosaic of Geostationary Satellite Imagery (GMGSI) visible (VIS), shortwave infrared (SIR), longwave infrared (LIR) imagery, and water vapor imagery (WV) are composited from data from several geostationary satellites orbiting the globe, including the GOES-East and GOES-West Satellites operated by U.S. NOAA/NESDIS, the Meteosat-11 and Meteosat-8 satellites from the Meteosat Second Generation (MSG) series of satellites operated by European Organization for the Exploitation of Meteorological Satellites (EUMETSAT), and the Himawari-8 satellite operated by the Japan Meteorological Agency (JMA). GOES-East is positioned at 75 deg W longitude over the equator. GOES-West is located at 137.2 deg W longitude over the equator. Both satellites cover an area from the eastern Atlantic Ocean to the central Pacific Ocean region. The Meteosat-11 satellite is located at 0 deg E longitude to cover Europe and Africa regions. The Meteosat-8 satellite is located at 41.5 deg E longitude to cover the Indian Ocean region. The Himawari-8 satellite is located at 140.7 deg E longitude to cover the Asia-Oceania region. The visible imagery indicates cloud cover and ice and snow cover. The shortwave, or mid-infrared, indicates cloud cover and fog at night. The longwave, or thermal infrared, depicts cloud cover and land/sea temperature patterns. The water vapor imagery indicates the amount of water vapor contained in the mid to upper levels of the troposphere, with the darker grays indicating drier air and the brighter grays/whites indicating more saturated air. GMGSI composite images have an approximate 8 km (5 mile) horizontal resolution and are updated every hour. 141.73 GiB Various
SpaceNet SpaceNet, launched in August 2016 as an open innovation project offering a repository of freely available 10.65 TiB Various
PoroTomo Released to the public as part of the Department of Energy's Open Energy Data 271.67 TiB Various
Sophos/ReversingLabs 20 Million malware detection dataset A dataset intended to support research on machine learning 9.38 TiB Various
NOAA High-Resolution Rapid Refresh (HRRR) Model The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh. 1.8 PiB Various
NOAA Atmospheric Climate Data Records NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC). 6.9 TiB Various
MODIS MYD13A1, MOD13A1, MYD11A1, MOD11A1, MCD43A4 Data from the Moderate Resolution Imaging Spectroradiometer (MODIS), managed by 564.5 GiB Various
Allen Mouse Brain Atlas The Allen Mouse Brain Atlas is a genome-scale collection of cellular resolution gene expression profiles using in situ hybridization (ISH). Highly methodical data production methods and comprehensive anatomical coverage via dense, uniformly spaced sampling facilitate data consistency and comparability across >20,000 genes. The use of an inbred mouse strain with minimal animal-to-animal variance allows one to treat the brain essentially as a complex but highly reproducible three-dimensional tissue array. The entire Allen Mouse Brain Atlas dataset and associated tools are available through an unrestricted web-based viewing application. The collection of > 650,000 images has been made available in this Open Data bucket to enable efficient access and analysis of this dataset. 190.01 TiB Various
Amazon-PQA Amazon product questions and their answers, along with the public product information. 19.42 GiB Various
OpenCell on AWS The OpenCell project is a proteome-scale effort to measure the localization and interactions of human proteins 849.63 GiB Various
NOAA Global Ensemble Forecast System (GEFS) The Global Ensemble Forecast System (GEFS), previously known as the GFS Global ENSemble (GENS), is a weather forecast model made up of 21 separate forecasts, or ensemble members. The National Centers for Environmental Prediction (NCEP) started the GEFS to address the nature of uncertainty in weather observations, which are used to initialize weather forecast models. The GEFS attempts to quantify the amount of uncertainty in a forecast by generating an ensemble of multiple forecasts, each minutely different, or perturbed, from the original observations. With global coverage, GEFS is produced four times a day with weather forecasts going out to 16 days. 1.43 PiB Various
Speedtest by Ookla Global Fixed and Mobile Network Performance Maps Global fixed broadband and mobile (cellular) network performance, allocated to zoom level 16 web mercator tiles (approximately 610.8 meters by 610.8 meters at the equator). Data is provided in both Shapefile format as well as Apache Parquet with geometries represented in Well Known Text (WKT) projected in EPSG:4326. Download speed, upload speed, and latency are collected via the Speedtest by Ookla applications for Android and iOS and averaged for each tile. Measurements are filtered to results containing GPS-quality location accuracy. 16.37 GiB Various
Sudachi Language Resources Japanese dictionaries and pre-trained models (word embeddings and language models) for natural language processing. 73.0 GiB Various
Discrete Reasoning Over the content of Paragraphs (DROP) The DROP dataset contains 96k Question and Answer pairs (QAs) over 6.7K paragraphs, split between train (77k QAs), development (9.5k QAs) and a hidden test partition (9.5k QAs). 397.26 GiB Various
ESA WorldCover The European Space Agency (ESA) WorldCover is a global land cover map with 11 different land cover classes produced at 10m resolution based on a combination of both Sentinel-1 and Sentinel-2 data. In areas where Sentinel-2 images are covered by clouds for an extended period of time, Sentinel-1 data then provides complementary information on the structural characteristics of the observed land cover. Therefore, the combination of Sentinel-1 and Sentinel-2 data makes it possible to update the land cover map almost in real time. WorldCover Map has been produced for 2020 (01 January to 31 December) with a global coverage as part of the 5th Earth Observation Envelope Programme (EOEP-5). It provides valuable information for applications such as biodiversity, food security, carbon assessment and climate modelling. More information can be found on the WorldCover website and the product User Manual. 115.23 GiB Various
Hecatomb Databases Preprocessed databases for use with the Hecatomb pipeline for viral and phage sequence annotation. 58.65 GiB Various
AWS iGenomes Common reference genomes hosted on AWS S3. Can be used when aligning and analysing raw DNA sequencing data. 6.19 TiB Various
Copernicus Digital Elevation Model (DEM) The Copernicus DEM is a Digital Surface Model (DSM) which represents the surface of the Earth including buildings, infrastructure and vegetation. We provide two instances of Copernicus DEM named GLO-30 Public and GLO-90. GLO-90 provides worldwide coverage at 90 meters. GLO-30 Public provides limited worldwide coverage at 30 meters because a small subset of tiles covering specific countries are not yet released to the public by the Copernicus Programme. Note that in both cases ocean areas do not have tiles; there, one can assume height values equal to zero. Data is provided as Cloud Optimized GeoTIFFs. 614.89 GiB Various
1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 and 3.7 This dataset contains alignment files and short nucleotide, copy number, repeat expansion (STR) and structural variant call files from the 1000 Genomes Project Phase 3 dataset (n=3202) using Illumina DRAGEN v3.5.7b and v3.7.6 software. The v3.7.6 dataset also includes results from joint small variant, de novo structural variant, de novo copy number variant and repeat expansion calls on 602 trio families comprised of members from the 1000 Genomes Project Phase 3 dataset, as well as DRAGEN gVCF Genotyper (v3.8.3) analysis on the entire dataset (n=3202). Improvements and new features in the v3.7.6 individual samples analyses include CYP2D6 variant calling and joint detection (see ‘DRAGEN 3.7 User Guide’ for details on these features) and use of graph-based hg19 and hg38 reference hash tables (see ‘DRAGEN Wins at PrecisionFDA Truth Challenge V2 Showcase Accuracy Gains from Alt-aware Mapping and Graph Reference Genomes’ for details). 696.97 TiB Various
Open City Model (OCM) Open City Model is an initiative to provide cityGML data for all the buildings in the United States. 415.93 GiB Various
NREL Wind Integration National Dataset Released to the public as part of the Department of Energy's Open Energy Data Initiative, 1.6 PiB Various
PubSeq - Public Sequence Resource COVID-19 PubSeq is a free and open online bioinformatics public sequence resource with on-the-fly analysis of sequenced SARS-CoV-2 samples that allows for a quick turnaround in identification of new virus strains. PubSeq allows anyone to upload sequence material in the form of FASTA or FASTQ files with accompanying metadata through the web interface or REST API. 2.92 GiB Various
IBL Neuropixels Brainwide Map on AWS Electrophysiological recordings of mouse brain activity acquired using Neuropixels probes. 555.06 GiB Various
Digital Earth Africa GeoMAD GeoMAD is the Digital Earth Africa (DE Africa) surface reflectance geomedian and triple Median Absolute Deviation data service. It is a cloud-free composite of satellite data compiled over specific timeframes. 174.14 TiB Various
Open Observatory of Network Interference (OONI) A free software, global observation network for detecting censorship, surveillance and traffic manipulation on the internet. 91.42 TiB Various
NIH NCBI Sequence Read Archive (SRA) on AWS The Sequence Read Archive (SRA), produced by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) at the National Institutes of Health (NIH), stores raw DNA sequencing data and alignment information from high-throughput sequencing platforms. The SRA provides open access to these biological sequence data to support the research community's efforts to enhance reproducibility and make new discoveries by comparing data sets. Buckets in this registry contain public SRA data in the original (user submitted) format from select high value and newly-released studies as well as all public-access SRA formatted ETL+BQS data. Also included is all SRA metadata that can be leveraged for attribute-based data discovery. 12.23 PiB Various
Oxford Nanopore Technologies Benchmark Datasets The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support: 1) exploration of the characteristics of nanopore sequence data, 2) assessment and reproduction of performance benchmarks, and 3) development of tools and methods. The data deposited showcases DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly-available reference samples (e.g. GM24385 as reference human). Raw data are provided with metadata and scripts to describe sample and data provenance. 41.98 TiB Various
NOAA World Ocean Database (WOD) The World Ocean Database (WOD) is the largest uniformly formatted, quality-controlled, publicly available historical subsurface ocean profile database. From Captain Cook's second voyage in 1772 to today's automated Argo floats, global aggregation of ocean variable information including temperature, salinity, oxygen, nutrients, and others vs. depth allow for study and understanding of the changing physical, chemical, and to some extent biological state of the World's Oceans. Browse the bucket via the AWS S3 explorer: 123.28 GiB Various
Global Seasonal Sentinel-1 Interferometric Coherence and Backscatter Data Set This data set is the first-of-its-kind spatial representation of multi-seasonal, global SAR repeat-pass interferometric coherence and backscatter signatures. Global coverage comprises all land masses and ice sheets from 82 degrees northern to 79 degrees southern latitude. The data set is derived from high-resolution multi-temporal repeat-pass interferometric processing of about 205,000 Sentinel-1 Single-Look-Complex data acquired in Interferometric Wide-Swath mode (Sentinel-1 IW mode) from 1-Dec-2019 to 30-Nov-2020. The data set was developed by Earth Big Data LLC and Gamma Remote Sensing AG, under contract for NASA's Jet Propulsion Laboratory. The data set covers four sets of seasonal (DJF/MAM/JJA/SON) metrics: 1) Median 6-, 12-, 18-, 24-, 36-, and 48-day repeat coherence estimates for C-band VV and HH polarized data, 2) Mean backscatter (gamma naught) for VV, VH, HH, and HV polarizations, 3) Seasonal coherence decay model parameters rho, tau, and rmse, 4) Local incidence and layover/shadow regions for all relative orbits (175 orbits). Note that in the data set filenames the seasons were referred to as northern hemisphere winter (DJF), spring (MAM), summer (JJA), and fall (SON). The data set is available in two main components: 1) 1x1 degree tiles. Each tile contains GeoTiffs at 3 arcsec pixel spacing of all metrics available in the tile. (s3://sentinel-1-global-coherence-earthbigdata/data/tiles/), 2) Global mosaicked tiles as cloud optimized GeoTIFFs (COG) at 0.01 degree pixel spacing (s3://sentinel-1-global-coherence-earthbigdata/data/mosaics/) for each of the computed metrics. 2.1 TiB Various
NOAA Continuously Operating Reference Stations (CORS) Network (NCN) The NOAA Continuously Operating Reference Stations (CORS) Network (NCN), managed by NOAA/National Geodetic Survey (NGS), provide Global Navigation Satellite System (GNSS) data, supporting three dimensional positioning, meteorology, space weather, and geophysical applications throughout the United States. The NCN is a multi-purpose, multi-agency cooperative endeavor, combining the efforts of hundreds of government, academic, and private organizations. The stations are independently owned and operated. Each agency shares their GNSS/GPS carrier phase and code range measurements and station metadata with NGS, which are analyzed and distributed free of charge. 13.75 TiB Various
Orcasound - bioacoustic data for marine conservation Live-streamed and archived audio data (~2018-present) from underwater microphones (hydrophones) containing marine biological signals as well as ambient ocean noise. Hydrophone placement and passive acoustic monitoring effort prioritizes detection of orca sounds (calls, clicks, whistles) and potentially harmful noise. Geographic focus is on the US/Canada critical habitat of Southern Resident killer whales (northern CA to central BC) with initial focus on inland waters of WA. In addition to the raw lossy or lossless compressed data, we provide a growing archive of annotated bioacoustic bouts. 3.75 TiB Various
1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 - Data Lakehouse Ready The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. There were a total of 3202 individuals sequenced as part of Phase 3 of this project. The high coverage samples were processed using the Illumina DRAGEN v3.5.7b pipeline and are available at s3://1000genomes-dragen/. This dataset contains the VCFs transformed to Parquet/ORC in 3 different schemas - partitioned by samples, partitioned by chromosome and a nested data format. These representations of the 1000 Genomes DRAGEN data are stored in Parquet/ORC format and can be queried through Amazon Athena. To add these tables to your Glue Data Catalog and for sample queries on this dataset, please refer to the link in our Documentation. 1.7 TiB Various
Medical Segmentation Decathlon With recent advances in machine learning, semantic segmentation algorithms are becoming increasingly general purpose and translatable to unseen tasks. Many key algorithmic advances in the field of medical imaging are commonly validated on a small number of tasks, limiting our understanding of the generalisability of the proposed contributions. A model which works out-of-the-box on many tasks, in the spirit of AutoML, would have a tremendous impact on healthcare. The field of medical imaging is also missing a fully open source and comprehensive benchmark for general purpose algorithmic validation and testing covering a large span of challenges, such as: small data, unbalanced labels, large-ranging object scales, multi-class labels, and multimodal imaging, etc. This challenge and dataset aim to provide such a resource through the open sourcing of large medical imaging datasets on several highly different tasks, and by standardising the analysis and validation process. 141.4 GiB Various
Therapeutically Applicable Research to Generate Effective Treatments (TARGET) Therapeutically Applicable Research to Generate Effective Treatments (TARGET) is the collaborative effort of a large, diverse consortium of extramural and NCI investigators. The goal of the effort is to accelerate molecular discoveries that drive the initiation and progression of hard-to-treat childhood cancers and facilitate rapid translation of those findings into the clinic. 68.51 GiB Various
High Resolution Downscaled Climate Data for Southeast Alaska This dataset contains historical and projected dynamically downscaled climate data for the Southeast region of the State of Alaska at 1 and 4km spatial resolution and hourly temporal resolution. Select variables are also summarized into daily resolutions. This data was produced using the Weather Research and Forecasting (WRF) model (Version 4.0). We downscaled both Climate Forecast System Reanalysis (CFSR) historical reanalysis data (1980-2019) and both historical and projected runs from two GCMs from the Coupled Model Inter-comparison Project 5 (CMIP5): GFDL-CM3 and NCAR-CCSM4 (historical run: 1980-2010 and RCP 8.5: 2030-2060). 31.01 TiB Various
Central Weather Bureau OpenData Various kinds of weather raw data and charts from Central Weather Bureau. 25.59 GiB Various
Cloud to Street - Microsoft Flood and Clouds Dataset This dataset consists of chips of Sentinel-1 and Sentinel-2 satellite data. Each Sentinel-1 chip contains a corresponding label for water and each Sentinel-2 chip contains a corresponding label for water and clouds. Data is stored in folders by a unique event identifier as the folder name. Within each event folder there are subfolders for Sentinel-1 (s1) and Sentinel-2 (s2) data. Each chip is contained in its own sub-folder with the folder name being the source image id, followed by a unique chip identifier consisting of a hyphenated set of 5 numbers. All bands of the satellite data, as well as the labels, and overview images are contained within the chip folder. 10.0 GiB Various
Earth Observation Data Cubes for Brazil Earth observation (EO) data cubes produced from analysis-ready data (ARD) of CBERS-4, Sentinel-2 A/B and Landsat-8 satellite images for Brazil. The datacubes are regular in time and use a hierarchical tiling system. Further details are described in Ferreira et al. (2020). 117.79 TiB Various
UniProt The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and host institutions EMBL-EBI, SIB Swiss Institute of Bioinformatics and PIR are committed to the long-term preservation of the UniProt databases. 3.21 TiB Various
Southern California Earthquake Data This dataset contains ground motion velocity and acceleration seismic waveforms recorded by the Southern California Seismic Network (SCSN) and archived at the Southern California Earthquake Data Center (SCEDC). 93.74 TiB Various
NOAA Rapid Refresh Forecast System (RRFS) Ensemble [Prototype] The Rapid Refresh Forecast System (RRFS) is the National Oceanic and Atmospheric Administration’s (NOAA) next generation convection-allowing, rapidly-updated ensemble prediction system, currently scheduled for operational implementation in late 2023. The operational configuration will feature a 3 km grid covering North America and include forecasts every hour out to 18 hours, with extensions to 60 hours four times per day at 00, 06, 12, and 18 UTC. Each forecast is planned to be composed of 9-10 members. The RRFS will provide guidance to support forecast interests including, but not limited to, aviation, severe convective weather, renewable energy, heavy precipitation, and winter weather on timescales where rapidly-updated guidance is particularly useful. 197.94 TiB Various
The Cancer Genome Atlas The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. TCGA has analyzed matched tumor and normal tissues from 11,000 patients, allowing for the comprehensive characterization of 33 cancer types and subtypes, including 10 rare cancers. 35.12 TiB Various
Natural Scenes Dataset Here, we collected and pre-processed a massive, high-quality 7T fMRI dataset that can be used to advance our understanding of how the brain works. A unique feature of this dataset is the massive amount of data available per individual subject. The data were acquired using ultra-high-field fMRI (7T, whole-brain, 1.8-mm resolution, 1.6-s TR). We measured fMRI responses while each of 8 participants viewed 9,000–10,000 distinct, color natural scenes (22,500–30,000 trials) in 30–40 weekly scan sessions over the course of a year. Additional measures were collected including resting-state data, retinotopy, category localizers, anatomical data (T1, T2, diffusion, venogram, angiogram), physiological data (pulse, respiration), eye-tracking data, and additional behavioral assessments outside the scanner. Because of its unprecedented scale and richness, NSD can be used to explore diverse neuroscientific questions with high power at the level of individual subjects. In particular, the number of images sampled in this dataset is sufficiently large that the dataset may be of high interest for computer vision, machine learning, and other data-driven applications. 13.31 TiB Various
OpenNeuro OpenNeuro is a database of openly-available brain imaging data. The data are shared according to a Creative Commons CC0 license, providing a broad range of brain imaging data to researchers and citizen scientists alike. The database primarily focuses on functional magnetic resonance imaging (fMRI) data, but also includes other imaging modalities including structural and diffusion MRI, electroencephalography (EEG), and magnetoencephalography (MEG). OpenfMRI is a project of the Center for Reproducible Neuroscience at Stanford University. Development of the OpenNeuro resource has been funded by the National Science Foundation, National Institute of Mental Health, National Institute on Drug Abuse, and the Laura and John Arnold Foundation. 32.97 TiB Various
Digital Earth Africa Sentinel-2 Level-2A The Sentinel-2 mission is part of the European Union Copernicus programme for Earth observations. Sentinel-2 consists of twin satellites, Sentinel-2A (launched 23 June 2015) and Sentinel-2B (launched 7 March 2017). The two satellites have the same orbit, but 180° apart for optimal coverage and data delivery. Their combined data is used in the Digital Earth Africa Sentinel-2 product. 1.97 PiB Various
U.S. Census ACS PUMS U.S. Census Bureau American Community Survey (ACS) Public Use Microdata Sample (PUMS) available in a linked data format using the Resource Description Framework (RDF) data model. 15.44 GiB Various
Finnish Meteorological Institute Weather Radar Data The up-to-date weather radar from the FMI radar network is available as Open Data. The data contain both single radar data along with composites over Finland in GeoTIFF and HDF5 formats. Available composite parameters consist of radar reflectivity (DBZ), rainfall intensity (RR), and precipitation accumulation of 1, 12, and 24 hours. Single radar parameters consist of radar reflectivity (DBZ), radial velocity (VRAD), rain classification (HCLASS), and Cloud top height (ETOP 20). Raw volume data from single radars are also provided in HDF5 format with ODIM 2.3 conventions. Radar data becomes available as soon as it's received from the radar and pre-processed into deliverable formats. Typically the most recent radar data was collected less than 5 minutes ago. 101.63 TiB Various
NOAA Joint Polar Satellite System (JPSS) Satellites in the JPSS constellation gather global measurements of atmospheric, terrestrial and oceanic conditions, including sea and land surface temperatures, vegetation, clouds, rainfall, snow and ice cover, fire locations and smoke plumes, atmospheric temperature, water vapor and ozone. JPSS delivers key observations for the Nation's essential products and services, including forecasting severe weather like hurricanes, tornadoes and blizzards days in advance, and assessing environmental hazards such as droughts, forest fires, poor air quality and harmful coastal waters. Further, JPSS will provide continuity of critical, global observations of Earth’s atmosphere, oceans and land through 2038. The data will be available from 2012-01-19 to present. 114.01 TiB Various
NOAA U.S. Climate Normals The U.S. Climate Normals are a large suite of data products that provide information about typical climate conditions for thousands of locations across the United States. Normals act both as a ruler to compare today’s weather and tomorrow’s forecast, and as a predictor of conditions in the near future. The official normals are calculated for a uniform 30 year period, and consist of annual/seasonal, monthly, daily, and hourly averages and statistics of temperature, precipitation, and other climatological variables from almost 15,000 U.S. weather stations. 29.33 GiB Various
iNaturalist Licensed Observation Images iNaturalist is a community science effort in which participants share observations of living organisms that they encounter and document with photographic evidence, location, and date. The community works together reviewing these images to identify these observations to species. This collection represents the licensed images accompanying iNaturalist observations. 185.49 TiB Various
NOAA Global Hydro Estimator (GHE) Global Hydro-Estimator provides a global 254.54 GiB Various
Image classification - datasets Some of the most important datasets for image classification research, including 24.46 GiB Various
Crowdsourced Bathymetry Community provided bathymetry data collected in collaboration with the International Hydrographic Organization. 24.97 GiB Various
MWIS VR Instances Large-scale node-weighted conflict graphs for maximum weight independent set solvers 12.87 GiB Various
Multiview Extended Video with Activities (MEVA) The Multiview Extended Video with Activities (MEVA) dataset consists 498.15 GiB Various
Prefeitura Municipal de São Paulo (PMSP) LiDAR Point Cloud The objective of the Mapa 3D Digital da Cidade (M3DC) of the São Paulo City Hall is to publish LiDAR point cloud data. The initial data was acquired in 2017 by aerial surveying and future data will be added. This publicly accessible dataset is provided in the Entwine Point Tiles format as a lossless octree, full density, based on LASzip (LAZ) encoding. 394.46 GiB Various
OpenStreetMap on AWS OSM is a free, editable map of the world, created and maintained by volunteers. Regular OSM data archives are made available in Amazon S3. 3.72 TiB Various
IRS 990 Filings On December 16, 2021 the IRS announced that it would discontinue updates to the IRS 990 Filings dataset on AWS, starting December 31, 2021. 122.85 GiB Various
Pacific Ocean Sound Recordings This project offers passive acoustic data (sound recordings) from a deep-ocean environment off central California. Recording began in July 2015, has been nearly continuous, and is ongoing. These resources are intended for applications 145.64 TiB Various
Scottish Public Sector LiDAR Dataset This dataset is Lidar data that has been collected by the Scottish public sector and made available under the Open Government Licence. The data are available as point cloud (LAS format or in LAZ compressed format), along with the derived Digital Terrain Model (DTM) and Digital Surface Model (DSM) products as Cloud optimized GeoTIFFs (COG) or standard GeoTIFF. The dataset contains multiple subsets of data which were each commissioned and flown in response to different organisational requirements. The details of each can be found at 1.78 TiB Various
4D Nucleome (4DN) The goal of the National Institutes of Health (NIH) Common Fund’s 4D Nucleome (4DN) program 170.53 TiB Various
STOIC2021 Training The STOIC project collected Computed Tomography (CT) images of 10,735 individuals suspected of being infected with SARS-CoV-2 during the first wave of the pandemic in France, from March to April 2020. For each patient in the training set, the dataset contains binary labels for COVID-19 presence, based on RT-PCR test results, and COVID-19 severity, defined as intubation or death within one month from the acquisition of the CT scan. This S3 bucket contains the training sample of the STOIC dataset as used in the STOIC2021 challenge on 243.69 GiB Various
HIRLAM Weather Model HIRLAM (High Resolution Limited Area Model) is an operational synoptic and mesoscale weather prediction model managed by the Finnish Meteorological Institute. 244.26 TiB Various
Broad Genome References Broad maintained human genome reference builds hg19/hg38 and decoy references. 204.58 GiB Various
Reasoning Over Paragraph Effects in Situations (ROPES) 14k QA pairs over 1.7K paragraphs, split between train (10k QAs), development (1.6k QAs) and a hidden test partition (1.7k QAs). 397.26 GiB Various
Atmospheric Models from Météo-France Global and high-resolution regional atmospheric models from Météo-France. 5.05 TiB Various
Digital Earth Africa Landsat Collection 2 Level 2 Digital Earth Africa (DE Africa) provides free and open access to a copy of Landsat Collection 2 Level-2 products over Africa. These products are produced and provided by the United States Geological Survey (USGS). 587.47 TiB Various
1940 Census Population Schedules, Enumeration District Maps, and Enumeration District Descriptions The 1940 Census population schedules were created by the Bureau of the Census in an attempt to enumerate every person living in the United States on April 1, 1940, although some persons were missed. The 1940 census population schedules were digitized by the National Archives and Records Administration (NARA) and released publicly on April 2, 2012. 15.05 TiB Various
COVID-19 Data Lake A centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of the novel coronavirus (SARS-CoV-2) and its associated illness, COVID-19. Globally, there are several efforts underway to gather this data, and we are working with partners to make this crucial data freely available and keep it up-to-date. Hosted on the AWS cloud, we have seeded our curated data lake with COVID-19 case tracking data from Johns Hopkins and The New York Times, hospital bed availability from Definitive Healthcare, and over 45,000 research articles about COVID-19 and related coronaviruses from the Allen Institute for AI. 157.68 GiB Various
NOAA Rapid Refresh (RAP) The Rapid Refresh (RAP) is a NOAA/NCEP operational weather prediction system comprised primarily of a numerical forecast model and analysis/assimilation system to initialize that model. It covers North America and is run with a horizontal resolution of 13 km and 50 vertical layers. The RAP was developed to serve users needing frequently updated short-range weather forecasts, including those in the US aviation community and US severe weather forecasting community. The model is run for every hour of the day; it is integrated to 51 hours for the 03/09/15/21 UTC cycles and to 21 hours for every other cycle. The RAP uses the ARW core of the WRF model and the Gridpoint Statistical Interpolation (GSI) analysis - the analysis is aided with the assimilation of cloud and hydrometeor data to provide more skill in short-range cloud and precipitation forecasts. 163.59 TiB Various
RarePlanes RarePlanes is a unique open-source machine learning dataset from CosmiQ Works and AI.Reverie that incorporates both real and synthetically generated satellite imagery. The RarePlanes dataset specifically focuses on the value of AI.Reverie synthetic data to aid computer vision algorithms in their ability to automatically detect aircraft and their attributes in satellite imagery. Although other synthetic/real combination datasets exist, RarePlanes is the largest openly-available very high resolution dataset built to test the value of synthetic data from an overhead perspective. The real portion of the dataset consists of 253 Maxar WorldView-3 satellite scenes spanning 112 locations and 2,142 km^2 with 14,700 hand-annotated aircraft. The accompanying synthetic dataset is generated via AI.Reverie’s novel simulation platform and features 50,000 synthetic satellite images with ~630,000 aircraft annotations. 475.96 GiB Various
Amazon Bin Image Dataset The Amazon Bin Image Dataset contains over 500,000 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations. 29.4 GiB Various
Human Cancer Models Initiative (HCMI) Cancer Model Development Center The Human Cancer Models Initiative (HCMI) is an international consortium that is generating novel, 7.12 GiB Various
NOAA Operational Forecast System (OFS) For decades, mariners in the United States have depended on NOAA's Tide Tables for the best estimate of expected water levels. These tables provide accurate predictions of the astronomical tide (i.e., the change in water level due to the gravitational effects of the moon and sun and the rotation of the Earth); however, they cannot predict water-level changes due to wind, atmospheric pressure, and river flow, which are often significant. 129.37 TiB Various
NEXRAD on AWS Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network. 598.42 TiB Various
Department of Energy's Open Energy Data Initiative (OEDI) Data released under the Department of Energy's Open Energy Data Initiative 52.16 TiB Various
QIIME 2 User Tutorial Datasets QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency. QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results. This dataset contains the user docs (and related datasets) for QIIME 2. 270.78 GiB Various
The Klarna Product-Page Dataset A collection of 51,701 product pages from 8175 e-commerce websites across 8 markets (US, GB, SE, NL, FI, NO, DE, AT) with 5 manually labelled elements, specifically, the product price, name and image, add-to-cart and go-to-cart buttons. 124.53 GiB Various
Sentinel-1 Sentinel-1 is a pair of European radar imaging (SAR) satellites launched in 2014 and 2016. Its 6-day revisit cycle and ability to observe through clouds make it well suited to sea and land monitoring, emergency response due to environmental disasters, and economic applications. This dataset represents the global Sentinel-1 GRD archive, from beginning to the present, converted to cloud-optimized GeoTIFF format. 34.68 TiB Various
The Human Microbiome Project The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performed for thousands of these samples. In addition, whole genome sequences were generated for isolate strains collected from human body sites to act as reference organisms for analysis. Finally, 16S marker and whole metagenome sequencing was also done on additional samples from people suffering from several disease conditions. 5.33 TiB Various
Voices Obscured in Complex Environmental Settings (VOiCES) VOiCES is a speech corpus recorded in acoustically challenging settings, 465.01 GiB Various
Natural Earth Natural Earth is a public domain map dataset available at 1:10m, 1:50m, and 1:110 million scales. Featuring tightly integrated vector and raster data, with Natural Earth you can make a variety of visually pleasing, well-crafted maps with cartography or GIS software. 26.57 GiB Various
Quoref 24K Question/Answer (QA) pairs over 4.7K paragraphs, split between train (19K QAs), development (2.4K QAs) and a hidden test partition (2.5K QAs). 397.26 GiB Various
University of British Columbia Sunflower Genome Dataset This dataset captures Sunflower's genetic diversity originating 67.22 TiB Various
REDASA COVID-19 Open Data The REaltime DAta Synthesis and Analysis (REDASA) COVID-19 snapshot contains the output of the curation protocol produced by our curator community. A detailed description can be found in our paper. The first S3 bucket listed in Resources contains a large collection of medical documents in text format extracted from the CORD-19 dataset, plus other sources deemed relevant by the REDASA consortium. The second S3 bucket contains a series of documents surfaced by Amazon Kendra that were considered relevant for each medical question asked. The final S3 bucket contains the GroundTruth annotations created by our curator community. 37.26 GiB Various
Daylight Map Distribution of OpenStreetMap Daylight is a complete distribution of global, open map data that’s freely available with support from community and professional mapmakers. Meta combines the work of global contributors to projects like OpenStreetMap with quality and consistency checks from Daylight mapping partners to create a free, stable, and easy-to-use street-scale global map. The Daylight Map Distribution contains a validated subset of the OpenStreetMap database. In addition to the standard OpenStreetMap PBF format, Daylight is available in two parquet formats that are optimized for AWS Athena including geometries (Points, LineStrings, Polygons, or MultiPolygons). First, Daylight OSM Features contains the nearly 1B renderable OSM features. Second, Daylight OSM Elements contains all of OSM, including all 7B nodes without attributes, and relations that do not contain geometries, such as turn restrictions. 2.33 TiB Various
Galaxy Evolution Explorer Satellite (GALEX) The Galaxy Evolution Explorer Satellite (GALEX) was a NASA mission led by the California Institute of Technology, whose primary goal was to investigate how star formation in galaxies evolved from the early Universe up to the present. GALEX used microchannel plate detectors to obtain direct images in the near-UV (NUV) and far-UV (FUV), and a grism to disperse light for low resolution spectroscopy. More information about GALEX is available at MAST. 15.26 TiB Various
District of Columbia - Classified Point Cloud LiDAR Please see here for the latest content about this dataset. 314.15 GiB Various
NOAA Global Surface Summary of Day Global Surface Summary of the Day is derived from The Integrated Surface Hourly (ISH) dataset. The ISH dataset includes global data obtained from the USAF Climatology Center, located in the Federal Climate Complex with NCDC. The latest daily summary data are normally available 1-2 days after the date-time of the observations used in the daily summaries. The online data files begin with 1929 and are at the time of this writing at the Version 8 software level. Over 9000 stations' data are typically available. The daily elements included in the dataset (as available from each station) are: 34.77 GiB Various
Allen Brain Observatory - Visual Coding AWS Public Data Set The Allen Brain Observatory – Visual Coding is a large-scale, standardized survey of physiological activity across the mouse visual cortex, hippocampus, and thalamus. It includes datasets collected with both two-photon imaging and Neuropixels probes, two complementary techniques for measuring the activity of neurons in vivo. The two-photon imaging dataset features visually evoked calcium responses from GCaMP6-expressing neurons in a range of cortical layers, visual areas, and Cre lines. The Neuropixels dataset features spiking activity from distributed cortical and subcortical brain regions, collected under analogous conditions to the two-photon imaging experiments. We hope that experimentalists and modelers will use these comprehensive, open datasets as a testbed for theories of visual information processing. 158.79 TiB Various
Geosnap Data, Center for Geospatial Sciences This bucket contains multiple datasets (as Quilt packages) created by the 74.17 GiB Various
NOAA National Water Model CONUS Retrospective Dataset The NOAA National Water Model Retrospective dataset contains input and output from multi-decade CONUS retrospective simulations. These simulations used meteorological input fields from meteorological retrospective datasets. The output frequency and fields available in this historical NWM dataset differ from those contained in the real-time operational NWM forecast model. 258.32 TiB Various
Tabula Sapiens Tabula Sapiens will be a benchmark, first-draft human cell atlas of two million cells from 25 organs of eight normal human subjects. 74.76 TiB Various
Sentinel-2 The Sentinel-2 mission is 34.22 TiB Various
TIGER Training This dataset contains the training data for the Tumor InfiltratinG lymphocytes in breast cancER or TIGER challenge. TIGER is the first challenge on fully automated assessment of tumor-infiltrating lymphocytes (TILs) in breast cancer histopathology slides. TILs are proving to be an important biomarker in cancer patients as they can play a part in killing tumor cells, particularly in some types of breast cancer. Identifying and measuring TILs can help to better target treatments, particularly immunotherapy, and may result in lower levels of other more aggressive treatments, including chemotherapy. 169.24 GiB Various
CoMMpass from the Multiple Myeloma Research Foundation The Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile study is the Multiple Myeloma Research Foundation (MMRF)’s landmark personalized medicine initiative. CoMMpass is a 1.11 GiB Various
The Genome Modeling System The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster. 363.47 GiB Various
ARPA-E PERFORM Forecast data The ARPA-E PERFORM Program is an ARPA-E funded program that aims to use 414.73 GiB Various
K2 Mission Data The K2 mission observed 100 square degrees for 80 days each across 20 different pointings along the ecliptic, collecting high-precision photometry for a selection of targets within each field. The mission began when the original Kepler mission ended due to loss of the second reaction wheel in 2013. More information about the K2 mission is available at MAST. 4.06 TiB Various
NOAA Water-Column Sonar Data Archive Water-column sonar data archived at the NOAA National Centers for Environmental Information. 147.7 TiB Various
Conformational Space of Short Peptides Co-managed by Toyoko and the Structural Biology Group at the Universidad Nacional de Quilmes, this dataset allows us to explore the conformational space of all possible peptides using the 20 common amino acids. It consists of a collection of exhaustive molecular dynamics simulations of tripeptides and pentapeptides. 1.98 TiB Various
NIH NCBI PMC Article Datasets - Full-Text Biomedical and Life Sciences Journal Articles on AWS PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal articles at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). The PubMed Central (PMC) Article Datasets include full-text articles archived in PMC and made available under license terms that allow for text mining and other types of secondary analysis and reuse. The articles are organized on AWS based on general license type: 591.15 GiB Various
NA-CORDEX - North American component of the Coordinated Regional Downscaling Experiment The NA-CORDEX dataset contains regional climate change scenario data and guidance for North America, for use in impacts, decision-making, and climate science. The NA-CORDEX data archive contains output from regional climate models (RCMs) run over a domain covering most of North America using boundary conditions from global climate model (GCM) simulations in the CMIP5 archive. These simulations run from 1950–2100 with a spatial resolution of 0.22°/25km or 0.44°/50km. This AWS S3 version of the data includes selected variables converted to Zarr format from the original NetCDF. Only daily data are currently available; all daily data were mapped to the Gregorian calendar. Sub-daily data may be added later. Both raw and bias-corrected data are available. Further details about this version of the dataset are available at the documentation link below. 13.15 TiB Various
NOAA National Bathymetric Source Data The National Bathymetric Source (NBS) project creates and maintains 36.16 GiB Various
Xiph.Org Test Media Uncompressed video used for video compression and video processing research. 20.69 TiB Various
Airborne Object Tracking Dataset Airborne Object Tracking (AOT) is a collection of 4,943 flight sequences of around 120 seconds each, collected at 10 Hz in diverse conditions. There are 5.9M+ images and 3.3M+ 2D annotations of airborne objects in the sequences. There are 3,306,350 frames without labels as they contain no airborne objects. For images with labels, there are on average 1.3 labels per image. All airborne objects in the dataset are labelled. 11.27 TiB Various
CAM6 Data Assimilation Research Testbed (DART) Reanalysis: Cloud-Optimized Dataset This is a cloud-hosted subset of the CAM6+DART (Community Atmosphere Model version 6 Data Assimilation Research Testbed) Reanalysis dataset. These data products are designed to facilitate a broad variety of research using the NCAR CESM 2.1 (National Center for Atmospheric Research's Community Earth System Model version 2.1), including model evaluation, ensemble hindcasting, data assimilation experiments, and sensitivity studies. They come from an 80 member ensemble reanalysis of the global troposphere and stratosphere using DART and CAM6. The data products represent states of the atmosphere consistent with observations from 2011 through 2019 at 1 degree horizontal resolution and weekly frequency. Each ensemble member is an equally likely description of the atmosphere, and is also consistent with dynamics and physics of CAM6. The dataset also contains corresponding land surface values at 6-hourly frequency. This dataset is a reformatting, with no change to numerical values, of data from the "CAM6 Data Assimilation Research Testbed (DART) Reanalysis", DOI:10.5065/JG1E-8525. 2.55 TiB Various
AI2 Diagram Dataset (AI2D) 4,817 illustrative diagrams for research on diagram understanding and associated question answering. 397.26 GiB Various
NOAA North American Mesoscale Forecast System (NAM) The North American Mesoscale Forecast System (NAM) is one of the National Centers For Environmental Prediction’s (NCEP) major models for producing weather forecasts. NAM generates multiple grids (or domains) of weather forecasts over the North American continent at various horizontal resolutions. Each grid contains data for dozens of weather parameters, including temperature, precipitation, lightning, and turbulent kinetic energy. NAM uses additional numerical weather models to generate high-resolution forecasts over fixed regions, and occasionally to follow significant weather events like hurricanes. 101.47 TiB Various
Global Database of Events, Language and Tone (GDELT) This project monitors the world's broadcast, print, 4.53 TiB Various
Normalized Difference Urban Index (NDUI) NDUI is combined with cloud shadow-free Landsat Normalized Difference Vegetation Index (NDVI) composite and DMSP/OLS Night Time Light (NTL) to characterize global urban areas at a 30 m resolution, and it can greatly enhance urban areas, which can then be easily distinguished from bare lands including fallows and deserts. With the capability to delineate urban boundaries and, at the same time, to present sufficient spatial details within urban areas, the NDUI has the potential for urbanization studies at regional and global scales. 35.3 GiB Various
Sentinel-1 SLC dataset for South and Southeast Asia, Taiwan, Korea and Japan The S1 Single Look Complex (SLC) dataset contains Synthetic Aperture Radar (SAR) data in the C-Band wavelength. The SAR sensors are installed on a two-satellite (Sentinel-1A and Sentinel-1B) constellation orbiting the Earth with a combined revisit time of six days, operated by the European Space Agency. The S1 SLC data are a Level-1 product that collects radar amplitude and phase information in all-weather, day or night conditions, which is ideal for studying natural hazards and emergency response, land applications, oil spill monitoring, sea-ice conditions, and associated climate change effects. 440.19 TiB Various
Japanese Tokenizer Dictionaries Japanese Tokenizer Dictionaries for use with MeCab. 3.17 GiB Various
NREL National Solar Radiation Database Released to the public as part of the Department of Energy's Open Energy Data Initiative, 692.05 TiB Various
LOFAR ELAIS-N1 cycle 2 observations on AWS These data correspond to the International LOFAR Telescope observations of the sky field ELAIS-N1 (16:10:01 +54:30:36) during the cycle 2 of observations. There are 11 runs of about 8 hours each plus the corresponding observation of the calibration targets before and after the target field. The data are measurement sets (MS) containing the cross-correlated data and metadata divided in 371 frequency sub-bands per target centred at ~150 MHz. 63.85 TiB Various
UK Met Office Global and Regional Weather Forecasts This dataset listing is no longer active. Please go here for current information. Archive data from the UK Met Office Global and Regional Ensemble Prediction System (MOGREPS) available on Amazon S3. Data from two models is available: MOGREPS-UK, a high resolution weather forecast covering the United Kingdom, and MOGREPS-G, a global weather forecast. 58.1 TiB Various
DOE's Water Power Technology Office's (WPTO) US Wave dataset Released to the public as part of the Department of Energy's Open Energy Data Initiative, 44.21 TiB Various
New Jersey Statewide LiDAR Elevation datasets in New Jersey have been collected over several years as several 8.75 TiB Various
CMIP6 GCMs downscaled using WRF High-resolution historical and future climate simulations from 1980-2100 605.6 TiB Various
Refgenie reference genome assets Pre-built refgenie reference genome data assets used for aligning and analyzing DNA sequence data. 9.9 TiB Various
High-Order Accurate Direct Numerical Simulation of Flow over a MTU-T161 Low Pressure Turbine Blade The archive comprises snapshot, point-probe, and time-average data produced via a high-fidelity computational simulation of turbulent air flow over a low pressure turbine blade, which is an important component in a jet engine. The simulation was undertaken using the open source PyFR flow solver on over 5000 Nvidia K20X GPUs of the Titan supercomputer at Oak Ridge National Laboratory under an INCITE award from the US DOE. The data can be used to develop an enhanced understanding of the complex three-dimensional unsteady air flow patterns over turbine blades in jet engines. This could in turn lead to design of greener more fuel efficient aircraft. It could also be used to train a next-generation of Reynolds Averaged Navier-Stokes turbulence models via a machine learning approach, which would have broad applicability to a wide range of science and engineering problems. 10.54 TiB Various
Radiant MLHub Radiant MLHub is an open library for geospatial training data that hosts datasets generated by Radiant Earth Foundation's team as well as other training data catalogs contributed by Radiant Earth’s partners. Radiant MLHub is open to anyone to access, store, register and/or share their training datasets for high-quality Earth observations. All of the training datasets are stored using a SpatioTemporal Asset Catalog (STAC) compliant catalog and exposed through a common API. Training datasets include pairs of imagery and labels for different types of machine learning problems including image classification, object detection, and semantic segmentation. Labels are generated from ground reference data and/or image annotation. 8.59 TiB Various
Community Earth System Model v2 Large Ensemble (CESM2 LENS) The US National Center for Atmospheric Research partnered with the IBS Center for Climate Physics in South Korea to generate the CESM2 Large Ensemble which consists of 100 ensemble members at 1 degree spatial resolution covering the period 1850-2100 under CMIP6 historical and SSP370 future radiative forcing scenarios. Data sets from this ensemble were made downloadable via the Climate Data Gateway on June 14th, 2021. 309.28 TiB Various
ZINC Database 3D models for molecular docking screens. 658.32 TiB Various
NOAA Global Historical Climatology Network Daily (GHCN-D) Global Historical Climatology Network - Daily is a dataset from NOAA that contains daily observations over global land areas. It contains station-based measurements from land-based stations worldwide, about two thirds of which are for precipitation measurement only. Other meteorological elements include, but are not limited to, daily maximum and minimum temperature, temperature at the time of observation, snowfall and snow depth. It is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews. Some data are more than 175 years old. The data is in CSV format. Each file corresponds to a year from 1763 to present and is named as such. 109.33 GiB Various
High Resolution Population Density Maps + Demographic Estimates by CIESIN and Meta Population data for a selection of countries, allocated to 1 arcsecond blocks and provided in a combination of CSV 95.62 GiB Various
Rapid7 FDNS ANY Dataset Subset of FDNS ANY queries against domain names produced by Rapid7 Project Sonar, made available in s3. 151.26 GiB Various
Analysis Ready Sentinel-1 Backscatter Imagery The Sentinel-1 mission is a constellation of 49.7 TiB Various
NOAA Geostationary Operational Environmental Satellites (GOES) 16 & 17 NOAA GOES-T will launch in March 2022!! For more information check out the GOES-T Webpage. 784.79 TiB Various
Genome Ark The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. The VGP is an international collaboration that aims to generate complete and near error-free reference genomes for all extant vertebrate species. These genomes will be used to address fundamental questions in biology and disease, to identify species most genetically at risk for extinction, and to preserve genetic information of life. 429.62 TiB Various
Aristo Mini Corpus 1,197,377 science-relevant sentences 397.26 GiB Various
DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue This bucket contains the checkpoints used to reproduce the baseline results reported in the DialoGLUE benchmark hosted 1.23 GiB Various
Digital Earth Africa Sentinel-1 Radiometrically Terrain Corrected DE Africa’s Sentinel-1 backscatter product is developed to be compliant with the CEOS Analysis Ready Data for Land (CARD4L) specifications. 206.7 TiB Various
COVID-19 Harmonized Data A harmonized collection of the core data pertaining to COVID-19 reported cases by geography, in a format prepared for analysis 76.99 GiB Various
International Neuroimaging Data-Sharing Initiative (INDI) This bucket contains multiple neuroimaging datasets that are part of the International Neuroimaging Data-Sharing Initiative. Raw human and non-human primate neuroimaging data include 1) Structural MRI; 2) Functional MRI; 3) Diffusion Tensor Imaging; 4) Electroencephalogram (EEG) 268.82 TiB Various
NOAA Oceanic Climate Data Records NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC). 228.21 GiB Various
NOAA Climate Forecast System (CFS) The Climate Forecast System (CFS) is a model representing the global interaction between Earth's oceans, land, and atmosphere. Produced by several dozen scientists under guidance from the National Centers for Environmental Prediction (NCEP), this model offers hourly data with a horizontal resolution down to one-half of a degree (approximately 56 km) around Earth for many variables. CFS uses the latest scientific approaches for taking in, or assimilating, observations from data sources including surface observations, upper air balloon observations, aircraft observations, and satellite observations. Please note that the data in this bucket are the CFSv2 Operational Forecasts. To obtain other CFSv2 products such as the Operational Analysis, please visit our website. 357.76 TiB Various
A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018) This dataset is the result of a collaborative project between the Communications Security Establishment (CSE) and The Canadian Institute for Cybersecurity (CIC) that uses the notion of profiles to generate cybersecurity datasets in a systematic manner. It includes a detailed description of intrusions along with abstract distribution models for applications, protocols, or lower level network entities. The dataset includes seven different attack scenarios, namely Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside. The attacking infrastructure includes 50 machines, and the victim organization has 5 departments and includes 420 PCs and 30 servers. This dataset includes the network traffic and log files of each machine from the victim side, along with 80 network traffic features extracted from captured traffic using CICFlowMeter-V3. 452.75 GiB Various
Cell Painting Image Collection The Cell Painting Image Collection is a collection of freely 1.94 TiB Various
YouTube 8 Million - Data Lakehouse Ready This is both the original .tfrecords and a Parquet representation of the YouTube 8 Million dataset. YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. It comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This dataset also includes the YouTube-8M Segments data from June 2019. 3.17 TiB Various
NOAA National Water Model Short-Range Forecast The National Water Model (NWM) is a water resources model that simulates and forecasts water 27.73 TiB Various
SondeHub Radiosonde Telemetry SondeHub Radiosonde telemetry contains global radiosonde (weather balloon) data captured by SondeHub from our participating radiosonde_auto_rx receiving stations. radiosonde_auto_rx is an open source project aimed at receiving and decoding telemetry from airborne radiosondes using software-defined-radio techniques, enabling study of the telemetry and sometimes recovery of the radiosonde itself. 59.05 GiB Various
NOAA Global Forecast System (GFS) The Global Forecast System (GFS) is a weather forecast model produced 936.41 TiB Various
Sea Surface Temperature Daily Analysis: European Space Agency Climate Change Initiative product version 2.1 Global daily-mean sea surface temperatures, presented on a 0.05° latitude-longitude grid, with gaps between available daily observations filled by statistical means, spanning late 1981 to recent time. Suitable for large-scale oceanographic meteorological and climatological applications, such as evaluating or constraining environmental models or case-studies of marine heat wave events. Includes temperature uncertainty information and auxiliary information about land-sea fraction and sea-ice coverage. For reference and citation see: 273.6 GiB Various
NOAA Global Ensemble Forecast System (GEFS) Re-forecast NOAA has generated a multi-decadal reanalysis and reforecast data set to accompany the next-generation version of its ensemble prediction system, the Global Ensemble Forecast System, version 12 (GEFSv12). Accompanying the real-time forecasts are “reforecasts” of the weather, that is, retrospective forecasts spanning the period 2000-2019. These reforecasts are not as numerous as the real-time data; they were generated only once per day, from 00 UTC initial conditions, and only 5 members were provided, with the following exception. Once weekly, an 11-member reforecast was generated, and these extend in lead time to +35 days. 388.8 TiB Various
Deutsche Börse Public Dataset The Deutsche Börse Public Data Set consists of trade data aggregated to one minute intervals from the Eurex and Xetra trading systems. It provides the initial price, lowest price, highest price, final price and volume for every minute of the trading day, and for every tradeable security. If you need higher resolution data, including untraded price movements, please refer to our historical market data product here. Also, be sure to check out our developer's portal. 16.05 GiB Various
Digital Earth Africa ALOS PALSAR, ALOS-2 PALSAR-2 and JERS-1 The ALOS/PALSAR annual mosaic is a global 25 m resolution dataset that combines data from many images captured by JAXA’s PALSAR and PALSAR-2 sensors on ALOS-1 and ALOS-2 satellites respectively. This product contains radar measurement in L-band and in HH and HV polarizations. It has a spatial resolution of 25 m and is available annually for 2007 to 2010 (ALOS/PALSAR) and 2015 to 2020 (ALOS-2/PALSAR-2). 3.1 TiB Various
NOAA Severe Weather Data Inventory (SWDI) The Storm Events Database is an integrated database of severe weather events across the United States from 1950 to this year, with information about a storm event's location, azimuth, distance, impact, and severity, including the cost of damages to property and crops. It contains data documenting: The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce. Rare, unusual, weather phenomena that generate media attention, such as snow flurries in South Florida or the San Diego coastal area. Other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event. Data about a specific event is added to the dataset within 120 days to allow time for damage assessments and other analysis. 71.29 GiB Various
Community Earth System Model v2 Large Ensemble (CESM2 LENS) The US National Center for Atmospheric Research partnered with the IBS Center for Climate Physics in South Korea to generate the CESM2 Large Ensemble which consists of 100 ensemble members at 1 degree spatial resolution covering the period 1850-2100 under CMIP6 historical and SSP370 future radiative forcing scenarios. Data sets from this ensemble were made downloadable via the Climate Data Gateway on June 14th, 2021. 309.28 TiB Various
ZINC Database 3D models for molecular docking screens. 658.32 TiB Various
NOAA Global Historical Climatology Network Daily (GHCN-D) Global Historical Climatology Network - Daily is a dataset from NOAA that contains daily observations over global land areas. It contains station-based measurements from land-based stations worldwide, about two thirds of which are for precipitation measurement only. Other meteorological elements include, but are not limited to, daily maximum and minimum temperature, temperature at the time of observation, snowfall and snow depth. It is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews. Some data are more than 175 years old. The data is in CSV format. Each file corresponds to a year from 1763 to present and is named as such. 109.33 GiB Various
High Resolution Population Density Maps + Demographic Estimates by CIESIN and Meta Population data for a selection of countries, allocated to 1 arcsecond blocks and provided in a combination of CSV 95.62 GiB Various
Image localization - datasets Some of the most important datasets for image localization research, including 15.46 GiB Various
Rapid7 FDNS ANY Dataset Subset of FDNS ANY queries against domain names produced by Rapid7 Project Sonar, made available in s3. 151.26 GiB Various
Analysis Ready Sentinel-1 Backscatter Imagery The Sentinel-1 mission is a constellation of 49.7 TiB Various
NOAA Geostationary Operational Environmental Satellites (GOES) 16 & 17 NOAA GOES-T will launch in March 2022!! For more information check out the GOES-T Webpage. 1.36 PiB Various
Genome Ark The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. The VGP is an international collaboration that aims to generate complete and near error-free reference genomes for all extant vertebrate species. These genomes will be used to address fundamental questions in biology and disease, to identify species most genetically at risk for extinction, and to preserve genetic information of life. 429.62 TiB Various
The Massively Multilingual Image Dataset (MMID) MMID is a large-scale, massively multilingual dataset of images paired with the words they represent collected at the University of Pennsylvania. 2.37 TiB Various
Allen Cell Imaging Collections This bucket contains multiple datasets (as Quilt packages) created by the 54.41 TiB Various
Aristo Mini Corpus 1,197,377 science-relevant sentences 397.26 GiB Various
DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue This bucket contains the checkpoints used to reproduce the baseline results reported in the DialoGLUE benchmark hosted 1.23 GiB Various
Digital Earth Africa Sentinel-1 Radiometrically Terrain Corrected DE Africa’s Sentinel-1 backscatter product is developed to be compliant with the CEOS Analysis Ready Data for Land (CARD4L) specifications. 206.7 TiB Various
NOAA Unified Forecast System (UFS) Marine Reanalysis: 1979-2019 The NOAA UFS Marine Reanalysis is a global sea ice ocean coupled reanalysis product produced by the marine data assimilation team of the UFS Research-to-Operation (R2O) project. Underlying forecast and data assimilation systems are based on the UFS model prototype version-6 and the Next Generation Global Ocean Data Assimilation System (NG-GODAS) release of the Joint Effort for Data assimilation Integration (JEDI) Sea Ice Ocean Coupled Assimilation (SOCA). Covering the 40 year reanalysis time period from 1979 to 2019, the data atmosphere option of the UFS coupled global atmosphere ocean sea ice (DATM-MOM6-CICE6) model was applied with two atmospheric forcing data sets: CFSR from 1979 to 1999 and GEFS from 2000 to 2019. Assimilated observation data sets include extensive space-based marine observations and conventional direct measurements of in situ profile data sets. 6.97 TiB Various
COVID-19 Harmonized Data A harmonized collection of the core data pertaining to COVID-19 reported cases by geography, in a format prepared for analysis 76.99 GiB Various
International Neuroimaging Data-Sharing Initiative (INDI) This bucket contains multiple neuroimaging datasets that are part of the International Neuroimaging Data-Sharing Initiative. Raw human and non-human primate neuroimaging data include 1) Structural MRI; 2) Functional MRI; 3) Diffusion Tensor Imaging; 4) Electroencephalogram (EEG) 268.82 TiB Various
NOAA Oceanic Climate Data Records NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC). 228.21 GiB Various
NOAA Climate Forecast System (CFS) The Climate Forecast System (CFS) is a model representing the global interaction between Earth's oceans, land, and atmosphere. Produced by several dozen scientists under guidance from the National Centers for Environmental Prediction (NCEP), this model offers hourly data with a horizontal resolution down to one-half of a degree (approximately 56 km) around Earth for many variables. CFS uses the latest scientific approaches for taking in, or assimilating, observations from data sources including surface observations, upper air balloon observations, aircraft observations, and satellite observations. Please note that the data in this bucket are the CFSv2 Operational Forecasts. To obtain other CFSv2 products such as the Operational Analysis, please visit our website. 357.76 TiB Various
A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018) This dataset is the result of a collaborative project between the Communications Security Establishment (CSE) and The Canadian Institute for Cybersecurity (CIC) that uses the notion of profiles to generate cybersecurity datasets in a systematic manner. It includes a detailed description of intrusions along with abstract distribution models for applications, protocols, or lower level network entities. The dataset includes seven different attack scenarios, namely Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside. The attacking infrastructure includes 50 machines and the victim organization has 5 departments, including 420 PCs and 30 servers. This dataset includes the network traffic and log files of each machine from the victim side, along with 80 network traffic features extracted from captured traffic using CICFlowMeter-V3. 452.75 GiB Various
Cell Painting Image Collection The Cell Painting Image Collection is a collection of freely 1.94 TiB Various
YouTube 8 Million - Data Lakehouse Ready This dataset contains both the original .tfrecords and a Parquet representation of the YouTube 8 Million dataset. YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. It comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This dataset also includes the YouTube-8M Segments data from June 2019. 3.17 TiB Various
NOAA National Water Model Short-Range Forecast The National Water Model (NWM) is a water resources model that simulates and forecasts water 27.73 TiB Various
SondeHub Radiosonde Telemetry SondeHub Radiosonde telemetry contains global radiosonde (weather balloon) data captured by SondeHub from our participating radiosonde_auto_rx receiving stations. radiosonde_auto_rx is an open source project aimed at receiving and decoding telemetry from airborne radiosondes using software-defined-radio techniques, enabling study of the telemetry and sometimes recovery of the radiosonde itself. 59.05 GiB Various
NOAA Global Forecast System (GFS) The Global Forecast System (GFS) is a weather forecast model produced 936.41 TiB Various
UCSC Genome Browser Sequence and Annotations The UCSC Genome Browser is an online graphical viewer for genomes, a genome browser, hosted by the University of California, Santa Cruz (UCSC). The interactive website offers access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. This dataset is a copy of the MySQL tables in MyISAM binary and tab-sep format and all binary files in custom formats, sometimes referred as 'gbdb'-files. Data from the UCSC Genome Browser is free and open for use by anyone. However, every genome annotation track has been created by an academic research group, or, in a few cases, by commercial companies. Please acknowledge them by citing them. The information can be found by going to, selecting the respective genome assembly and clicking on the data track. At the end of the documentation, we provide a list of references and acknowledgements. 73.11 TiB Various
Sea Surface Temperature Daily Analysis: European Space Agency Climate Change Initiative product version 2.1 Global daily-mean sea surface temperatures, presented on a 0.05° latitude-longitude grid, with gaps between available daily observations filled by statistical means, spanning late 1981 to recent time. Suitable for large-scale oceanographic meteorological and climatological applications, such as evaluating or constraining environmental models or case-studies of marine heat wave events. Includes temperature uncertainty information and auxiliary information about land-sea fraction and sea-ice coverage. For reference and citation see: 273.6 GiB Various
NOAA Global Ensemble Forecast System (GEFS) Re-forecast NOAA has generated a multi-decadal reanalysis and reforecast data set to accompany the next-generation version of its ensemble prediction system, the Global Ensemble Forecast System, version 12 (GEFSv12). Accompanying the real-time forecasts are “reforecasts” of the weather, that is, retrospective forecasts spanning the period 2000-2019. These reforecasts are not as numerous as the real-time data; they were generated only once per day, from 00 UTC initial conditions, and only 5 members were provided, with the following exception. Once weekly, an 11-member reforecast was generated, and these extend in lead time to +35 days. 388.8 TiB Various
Deutsche Börse Public Dataset The Deutsche Börse Public Data Set consists of trade data aggregated to one minute intervals from the Eurex and Xetra trading systems. It provides the initial price, lowest price, highest price, final price and volume for every minute of the trading day, and for every tradeable security. If you need higher resolution data, including untraded price movements, please refer to our historical market data product here. Also, be sure to check out our developer's portal. 16.05 GiB Various
Digital Earth Africa ALOS PALSAR, ALOS-2 PALSAR-2 and JERS-1 The ALOS/PALSAR annual mosaic is a global 25 m resolution dataset that combines data from many images captured by JAXA’s PALSAR and PALSAR-2 sensors on ALOS-1 and ALOS-2 satellites respectively. This product contains radar measurement in L-band and in HH and HV polarizations. It has a spatial resolution of 25 m and is available annually for 2007 to 2010 (ALOS/PALSAR) and 2015 to 2020 (ALOS-2/PALSAR-2). 3.1 TiB Various
GeoNet Aotearoa New Zealand Data GeoNet provides geological hazard information for Aotearoa New Zealand. This dataset contains data and products recorded by the GeoNet sensor network. The dataset currently includes GNSS data, and additional datasets will be added in the near future. GNSS (Global Navigation Satellite System) data include raw data in proprietary and Receiver Independent Exchange Format (RINEX) and local tie-in surveys conducted during equipment changes; more details can be found on 'the GeoNet geodetic page' website. Coastal gauge data include relative measurements of sea level taken by tsunami monitoring gauges. Raw and quality control data are provided in CREX format (Character Form for the Representation and eXchange of meteorological data); more details can be found on 'the GeoNet coastal tsunami monitoring gauges page'. 7.73 TiB Various

@xinaxu this is a lot of datasets! wow!

some of these look to be duplicates (both against what we have in Slingshot as well as in some instances across the proposed table). can i propose that you pick the top 10-15 that you'd like to see onboarded or would like to work on yourself?

@orvn @timelytree had some additional thoughts on adding more datasets as well. tagging them to share!



  • The table above has 234 rows, but some were duplicates
  • Below is the same list, but:
    • Only dataset names
    • Sorted
    • De-duped
    • Some datasets already on Slingshot struck through
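
For anyone who wants to repeat the sort and de-dupe steps listed above, here is a minimal Python sketch. It is an illustration only, not the script actually used for this list: the function name `dedupe_and_sort` and the sample `rows` are assumptions, and it treats names that differ only in case or spacing as the same dataset.

```python
def dedupe_and_sort(names):
    """Return unique dataset names, sorted case-insensitively.

    Hypothetical helper: names that differ only in letter case or in
    whitespace are treated as duplicates, and the first spelling wins.
    """
    seen = set()
    unique = []
    for name in names:
        cleaned = " ".join(name.split())  # collapse runs of whitespace
        key = cleaned.casefold()          # case-insensitive comparison key
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return sorted(unique, key=str.casefold)


# Sample rows taken from the table above (one of them appears twice).
rows = [
    "Deutsche Börse Public Dataset",
    "ZINC Database",
    "Genome Ark",
    "Deutsche Börse Public Dataset",
]
deduped = dedupe_and_sort(rows)
print(deduped)
```

Striking through datasets that are already on Slingshot would still need a separate comparison against the current Slingshot list, which this sketch does not attempt.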


Still, there are just over 200 left, which is more than double the size of Slingshot's 82 current datasets. I think it's valuable to have more datasets, but we also have to consider that adding a large quantity will require modifications to Slingshot's UI, especially:

  • Better dataset filtering in the dataset explorer
  • A search/filter UI for radio buttons when selecting a dataset as a Slingshot participant (since 200+ options will be overwhelming for users)

Deduped list

  1. 1000 Genomes
  2. 1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 - Data Lakehouse Ready
  3. 1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 and 3.7
  4. 1940 Census Population Schedules, Enumeration District Maps, and Enumeration District Descriptions
  5. 3000 Rice Genomes Project
  6. 4D Nucleome (4DN)
  7. A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018)
  8. AgricultureVision
  9. AI2 Diagram Dataset (AI2D)
  10. AI2 Meaningful Citations Data Set
  11. Airborne Object Tracking Dataset
  12. Allen Brain Observatory - Visual Coding AWS Public Data Set
  13. Allen Cell Imaging Collections
  14. Allen Ivy Glioblastoma Atlas
  15. Allen Mouse Brain Atlas
  16. Amazon Bin Image Dataset
  17. Amazon-PQA
  18. Analysis Ready Sentinel-1 Backscatter Imagery
  19. Aristo Mini Corpus
  20. ARPA-E PERFORM Forecast data
  21. Atmospheric Models from Météo-France
  22. AWS iGenomes
  23. Basic Local Alignment Sequences Tool (BLAST) Databases
  24. Boreas Autonomous Driving Dataset
  25. Broad Genome References
  26. CAFE60 reanalysis
  27. CAM6 Data Assimilation Research Testbed (DART) Reanalysis: Cloud-Optimized Dataset
  28. CBERS on AWS
  29. Cell Painting Image Collection
  30. Central Weather Bureau OpenData
  31. ChEMBL - Data Lakehouse Ready
  32. Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3)
  33. Cloud to Street - Microsoft Flood and Clouds Dataset
  34. CMIP6 GCMs downscaled using WRF
  35. CoMMpass from the Multiple Myeloma Research Foundation
  36. Community Earth System Model v2 Large Ensemble (CESM2 LENS)
  37. Conformational Space of Short Peptides
  38. Copernicus Digital Elevation Model (DEM)
  39. Cornell EAS Data Lake
  40. Coupled Model Intercomparison Project 6
  41. Coupled Model Intercomparison Project Phase 5 (CMIP5) University of Wisconsin-Madison Probabilistic Downscaling Dataset
  42. COVID-19 Data Lake
  43. COVID-19 Genome Sequence Dataset
  44. COVID-19 Harmonized Data
  45. COVID-19 Molecular Structure and Therapeutics Hub
  46. Crowdsourced Bathymetry
  47. Daylight Map Distribution of OpenStreetMap
  48. Department of Energy's Open Energy Data Initiative (OEDI)
  49. Deutsche Börse Public Dataset
  50. DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue
  51. Digital Earth Africa ALOS PALSAR, ALOS-2 PALSAR-2 and JERS-1
  52. Digital Earth Africa GeoMAD
  53. Digital Earth Africa Landsat Collection 2 Level 2
  54. Digital Earth Africa Sentinel-1 Radiometrically Terrain Corrected
  55. Digital Earth Africa Sentinel-2 Level-2A
  56. Discrete Reasoning Over the content of Paragraphs (DROP)
  57. District of Columbia - Classified Point Cloud LiDAR
  58. DOE's Water Power Technology Office's (WPTO) US Wave dataset
  59. Earth Observation Data Cubes for Brazil
  60. ESA WorldCover
  61. Finnish Meteorological Institute Weather Radar Data
  62. Galaxy Evolution Explorer Satellite (GALEX)
  63. GATK Test Data
  64. Genome Ark
  65. GeoNet Aotearoa New Zealand Data
  66. Geosnap Data, Center for Geospatial Sciences
  67. Global Database of Events, Language and Tone (GDELT)
  68. Global Seasonal Sentinel-1 Interferometric Coherence and Backscatter Data Set
  69. Hecatomb Databases
  70. High Resolution Downscaled Climate Data for Southeast Alaska
  71. High Resolution Population Density Maps + Demographic Estimates by CIESIN and Meta
  72. High-Order Accurate Direct Numerical Simulation of Flow over a MTU-T161 Low Pressure Turbine Blade
  73. HIRLAM Weather Model
  74. Human Cancer Models Initiative (HCMI) Cancer Model Development Center
  75. IBL Neuropixels Brainwide Map on AWS
  76. Image classification - datasets
  77. Image localization - datasets
  78. iNaturalist Licensed Observation Images
  79. InRad COVID-19 X-Ray and CT Scans
  80. International Neuroimaging Data-Sharing Initiative (INDI)
  81. IRS 990 Filings
  82. Japanese Tokenizer Dictionaries
  83. K2 Mission Data
  84. Kepler Mission Data
  85. KITTI Vision Benchmark Suite
  86. Legal Entity Identifier (LEI) and Legal Entity Reference Data (LE-RD)
  87. LOFAR ELAIS-N1 cycle 2 observations on AWS
  88. Longitudinal Nutrient Deficiency
  89. Medical Segmentation Decathlon
  90. MODIS
  91. MODIS MYD13A1, MOD13A1, MYD11A1, MOD11A1, MCD43A4
  92. Multi-Scale Ultra High Resolution (MUR) Sea Surface Temperature (SST)
  93. Multiview Extended Video with Activities (MEVA)
  94. MWIS VR Instances
  95. NA-CORDEX - North American component of the Coordinated Regional Downscaling Experiment
  96. Nanopore Reference Human Genome
  97. NapierOne Mixed File Dataset
  98. NASA NEX
  99. Natural Earth
  100. Natural Scenes Dataset
  101. New Jersey Statewide Digital Aerial Imagery Catalog
  102. New Jersey Statewide LiDAR
  103. NEXRAD on AWS
  104. NIH NCBI PMC Article Datasets - Full-Text Biomedical and Life Sciences Journal Articles on AWS
  105. NIH NCBI Sequence Read Archive (SRA) on AWS
  106. NOAA Atmospheric Climate Data Records
  107. NOAA Climate Forecast System (CFS)
  108. NOAA Coastal Lidar Data
  109. NOAA Continuously Operating Reference Stations (CORS) Network (NCN)
  110. NOAA Fundamental Climate Data Records (FCDR)
  111. NOAA Geostationary Operational Environmental Satellites (GOES) 16 & 17
  112. NOAA Global Ensemble Forecast System (GEFS)
  113. NOAA Global Ensemble Forecast System (GEFS) Re-forecast
  114. NOAA Global Extratropical Surge and Tide Operational Forecast System (Global ESTOFS)
  115. NOAA Global Forecast System (GFS)
  116. NOAA Global Historical Climatology Network Daily (GHCN-D)
  117. NOAA Global Hydro Estimator (GHE)
  118. NOAA Global Mosaic of Geostationary Satellite Imagery (GMGSI)
  119. NOAA Global Surface Summary of Day
  120. NOAA High-Resolution Rapid Refresh (HRRR) Model
  121. NOAA Joint Polar Satellite System (JPSS)
  122. NOAA National Bathymetric Source Data
  123. NOAA National Blend of Models (NBM)
  124. NOAA National Water Model CONUS Retrospective Dataset
  125. NOAA National Water Model Short-Range Forecast
  126. NOAA North American Mesoscale Forecast System (NAM)
  127. NOAA Oceanic Climate Data Records
  128. NOAA Operational Forecast System (OFS)
  129. NOAA Rapid Refresh (RAP)
  130. NOAA Rapid Refresh Forecast System (RRFS) Ensemble [Prototype]
  131. NOAA Severe Weather Data Inventory (SWDI)
  132. NOAA U.S. Climate Normals
  133. NOAA Unified Forecast System (UFS) Marine Reanalysis: 1979-2019
  134. NOAA Unified Forecast System Subseasonal to Seasonal Prototypes
  135. NOAA Water-Column Sonar Data Archive
  136. NOAA World Ocean Database (WOD)
  137. NOAA/PMEL Ocean Climate Stations Moorings
  138. Normalized Difference Urban Index (NDUI)
  139. NREL National Solar Radiation Database
  140. NREL Wind Integration National Dataset
  141. Ohio State Cardiac MRI Raw Data (OCMR)
  142. Open City Model (OCM)
  143. Open Observatory of Network Interference (OONI)
  144. Open Targets - Data Lakehouse Ready
  145. OpenAQ
  146. OpenCell on AWS
  147. OpenEEW
  148. OpenNeuro
  149. OpenStreetMap on AWS
  150. OpenSurfaces
  151. Orcasound - bioacoustic data for marine conservation
  152. Oxford Nanopore Technologies Benchmark Datasets
  153. Pacific Ocean Sound Recordings
  154. PoroTomo
  155. Prefeitura Municipal de São Paulo (PMSP) LiDAR Point Cloud
  156. Provision of Web-Scale Parallel Corpora for Official European Languages (ParaCrawl)
  157. PubSeq - Public Sequence Resource
  158. QIIME 2 User Tutorial Datasets
  159. Quoref
  160. Radiant MLHub
  161. Rapid7 FDNS ANY Dataset
  162. RarePlanes
  163. Reasoning Over Paragraph Effects in Situations (ROPES)
  164. REDASA COVID-19 Open Data
  165. Refgenie reference genome assets
  166. Scottish Public Sector LiDAR Dataset
  167. Sea Surface Temperature Daily Analysis: European Space Agency Climate Change Initiative product version 2.1
  168. Sentinel-1
  169. Sentinel-1 SLC dataset for South and Southeast Asia, Taiwan, Korea and Japan
  170. Sentinel-2
  171. Sentinel-3
  172. Sentinel-5P Level 2
  173. SILAM Air Quality
  174. Smithsonian Open Access
  175. SondeHub Radiosonde Telemetry
  176. Sophos/ReversingLabs 20 Million malware detection dataset
  177. Southern California Earthquake Data
  178. SpaceNet
  179. Speedtest by Ookla Global Fixed and Mobile Network Performance Maps
  180. STOIC2021 Training
  181. Storm EVent ImageRy (SEVIR)
  182. Sudachi Language Resources
  183. Tabula Muris
  184. Tabula Muris Senis
  185. Tabula Sapiens
  186. Terra Fusion Data Sampler
  187. Textbook Question Answering (TQA)
  188. The Cancer Genome Atlas
  189. The Genome Modeling System
  190. The Human Microbiome Project
  191. The Klarna Product-Page Dataset
  192. The Massively Multilingual Image Dataset (MMID)
  193. Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
  194. TIGER Training
  195. Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)
  196. Transiting Exoplanet Survey Satellite (TESS)
  197. TSBench
  198. U.S. Census ACS PUMS
  199. UCSC Genome Browser Sequence and Annotations
  200. UK Met Office Global and Regional Weather Forecasts
  201. UniProt
  202. University of British Columbia Sunflower Genome Dataset
  203. Voices Obscured in Complex Environmental Settings (VOiCES)
  204. World Bank - Light Every Night
  205. Xiph.Org Test Media
  206. Yale-CMU-Berkeley (YCB) Object and Model Set
  207. YouTube 8 Million - Data Lakehouse Ready
  208. ZINC Database

@dkkapur I want to have most of them eligible for Slingshot at once. It's 40 PiB of data in total; assuming 10x replication, that's 400 PiB, or about 0.4 EiB, of useful data out of the network's current 15 EiB of capacity. It will be a good story to show and tell. Also, there are not many datasets left for Slingshot. Bringing in this list will encourage more people to join Slingshot.
@orvn Thanks for spending time to sort and dedup. These datasets are also duplicates: KITTI, MMID. Hope it's not a great effort to add the filtering on the UI.
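
The capacity estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the 40 PiB total, the 10x replication factor and the 15 EiB network capacity are the figures quoted in the comment, and the function name `replicated_footprint` is made up for illustration.

```python
PIB_PER_EIB = 1024  # binary units: 1 EiB = 1024 PiB


def replicated_footprint(total_pib, replication, network_eib):
    """Return (stored PiB, stored EiB, share of network capacity)."""
    stored_pib = total_pib * replication
    stored_eib = stored_pib / PIB_PER_EIB
    share = stored_eib / network_eib
    return stored_pib, stored_eib, share


# Figures quoted in the comment above (assumed, not independently measured).
stored_pib, stored_eib, share = replicated_footprint(40, 10, 15)
print(f"{stored_pib} PiB = {stored_eib:.2f} EiB, {share:.1%} of network capacity")
```

With these inputs the replicated data comes to 400 PiB, roughly 0.39 EiB, or about 2.6% of a 15 EiB network.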

@xinaxu - I agree that it would be good to scope this in. Proposing that we pull these in (maybe in subsets) for 3.1 with an impending design change in the program (June-ish onwards).

@dkkapur Sounds good. Will wait for that and revisit this.

Closing since those datasets are being used for V3