[dataset request] Various AWS open dataset

xinaxu opened this issue 3 years ago · comments

Name	Description	Size	Format	URL
World Bank - Light Every Night	Light Every Night - World Bank Nighttime Light Data – provides open access to all nightly imagery and data from the Visible Infrared Imaging Radiometer Suite Day-Night Band (VIIRS DNB) from 2012-2020 and the Defense Meteorological Satellite Program Operational Linescan System (DMSP-OLS) from 1992-2013. The underlying data are sourced from the NOAA National Centers for Environmental Information (NCEI) archive. Additional processing by the University of Michigan enables access in Cloud Optimized GeoTIFF format (COG) and search using the SpatioTemporal Asset Catalog (STAC) standard. The data is published and openly available under the terms of the World Bank’s open data license.	273.18 TiB	Various	https://registry.opendata.aws/wb-light-every-night/
TSBench	TSBench comprises thousands of benchmark evaluations for time series forecasting methods. It provides various metrics (e.g. measures of accuracy, latency, number of model parameters, ...) of 13 time series forecasting methods across 44 heterogeneous datasets. Time series forecasting methods include both classical and deep learning methods, while several hyperparameter settings are evaluated for the deep learning methods.	570.53 GiB	Various	https://registry.opendata.aws/tsbench/
Coupled Model Intercomparison Project Phase 5 (CMIP5) University of Wisconsin-Madison Probabilistic Downscaling Dataset	The University of Wisconsin Probabilistic Downscaling (UWPD) is a statistically downscaled dataset based on the Coupled Model Intercomparison Project Phase 5 (CMIP5) climate models. UWPD consists of three variables: daily precipitation, maximum temperature, and minimum temperature. The spatial resolution is 0.1°x0.1° for the United States and southern Canada east of the Rocky Mountains.	3.44 TiB	Various	https://registry.opendata.aws/noaa-cmip5/
Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)	The TaRGET (Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription) Program is a research consortium funded by the National Institute of Environmental Health Sciences (NIEHS). The goal of the collaboration is to address the role of environmental exposures in disease pathogenesis as a function of epigenome perturbation, including understanding the environmental control of epigenetic mechanisms and assessing the utility of surrogate tissue analysis in mouse models of disease-relevant environmental exposures.	15.95 TiB	Various	https://registry.opendata.aws/targetepigenomics/
AI2 Meaningful Citations Data Set	630 paper annotations	397.26 GiB	Various	https://registry.opendata.aws/allenai-meaningful-citations/
OpenEEW	Grillo has developed an IoT-based earthquake early-warning system,	1.81 TiB	Various	https://registry.opendata.aws/grillo-openeew/
COVID-19 Molecular Structure and Therapeutics Hub	Aggregating critical information to accelerate drug discovery for the molecular modeling and simulation community.	1.27 TiB	Various	https://registry.opendata.aws/molssi-covid19-hub/
CBERS on AWS	Imagery acquired	269.04 GiB	Various	https://registry.opendata.aws/cbers/
MODIS	The Moderate Resolution Imaging Spectroradiometer (MODIS) MCD43A4 Version 6 Nadir Bidirectional Reflectance Distribution Function (BRDF)-Adjusted Reflectance (NBAR) dataset is produced daily using 16 days of Terra and Aqua MODIS data at 500 meter (m) resolution. The view angle effects are removed from the directional reflectances, resulting in a stable and consistent NBAR product. Data are temporally weighted to the ninth day which is reflected in the Julian date in the file name.	26.89 TiB	Various	https://registry.opendata.aws/modis/
Sentinel-5P Level 2	This data set consists of observations from the Sentinel-5 Precursor (Sentinel-5P) satellite of the European Commission’s Copernicus Earth Observation Programme. Sentinel-5P is a polar orbiting satellite that completes 14 orbits of the Earth a day. It carries the TROPOspheric Monitoring Instrument (TROPOMI), which is a spectrometer that senses ultraviolet (UV), visible (VIS), near-infrared (NIR) and shortwave infrared (SWIR) to monitor ozone, methane, formaldehyde, aerosol, carbon monoxide, nitrogen dioxide and sulphur dioxide in the atmosphere. The satellite was launched in October 2017 and entered routine operational phase in March 2019. Data is available from July 2018 onwards.	93.04 TiB	Various	https://registry.opendata.aws/sentinel5p/
Nanopore Reference Human Genome	This dataset includes the sequencing and assembly of a reference standard human genome (GM12878) using the MinION nanopore sequencing instrument with the R9.4 1D chemistry.	90.4 TiB	Various	https://registry.opendata.aws/nanopore/
Cornell EAS Data Lake	Earth & Atmospheric Sciences at Cornell University has created a public data lake of climate data. The data is stored in columnar storage formats (ORC) to make it straightforward to query using standard tools like Amazon Athena or Apache Spark. The data itself is originally intended to be used for building decision support tools for farmers and digital agriculture. The first dataset is the historical NDFD / NDGD data distributed by NCEP / NOAA / NWS. The NDFD (National Digital Forecast Database) and NDGD (National Digital Guidance Database) contain gridded forecasts and observations at 2.5km resolution for the Contiguous United States (CONUS). There are also 5km grids for several smaller US regions and non-contiguous territories, such as Hawaii, Guam, Puerto Rico and Alaska. NOAA distributes archives of the NDFD/NDGD via its NOAA Operational Model Archive and Distribution System (NOMADS) in Grib2 format. The data has been converted to ORC to optimize storage space and, more importantly, to simplify data access via standard data analytics tools.	5.14 TiB	Various	https://registry.opendata.aws/cornell-eas-data-lake/
1000 Genomes	The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.	696.81 TiB	Various	https://registry.opendata.aws/1000-genomes/
NOAA National Blend of Models (NBM)	The National Blend of Models (NBM) is a nationally consistent and skillful suite of calibrated forecast guidance based on a blend of both NWS and non-NWS numerical weather prediction model data and post-processed model guidance. The goal of the NBM is to create a highly accurate, skillful and consistent starting point for the gridded forecast.	640.54 TiB	Various	https://registry.opendata.aws/noaa-nbm/
Transiting Exoplanet Survey Satellite (TESS)	The Transiting Exoplanet Survey Satellite (TESS) is a multi-year survey that will discover exoplanets in orbit around bright stars across the entire sky using high-precision photometry. The survey will also enable a wide variety of stellar astrophysics, solar system science, and extragalactic variability studies. More information about TESS is available at MAST and the TESS Science Support Center.	226.18 TiB	Various	https://registry.opendata.aws/tess/
Kepler Mission Data	The Kepler mission observed the brightness of more than 180,000 stars near the Cygnus constellation at a 30 minute cadence for 4 years in order to find transiting exoplanets, study variable stars, and find eclipsing binaries. More information about the Kepler mission is available at MAST.	17.18 TiB	Various	https://registry.opendata.aws/kepler/
Tabula Muris	Tabula Muris is a compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 100,000 cells from 20 organs and tissues. These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations, and allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as T-lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3’-end counting, enabled the survey of thousands of cells at relatively low coverage, while the other, FACS-based full length transcript analysis, enabled characterization of cell types with high sensitivity and coverage. The cumulative data provide the foundation for an atlas of transcriptomic cell biology. See: https://www.nature.com/articles/s41586-018-0590-4	4.27 TiB	Various	https://registry.opendata.aws/tabula-muris/
Legal Entity Identifier (LEI) and Legal Entity Reference Data (LE-RD)	The Legal Entity Identifier (LEI) is a 20-character, alpha-numeric code based on the ISO 17442 standard developed by the International Organization for Standardization (ISO). It connects to key reference information that enables clear and unique identification of legal entities participating in financial transactions. Each LEI contains information about an entity’s ownership structure and thus answers the questions of ‘who is who’ and ‘who owns whom’. Simply put, the publicly available LEI data pool can be regarded as a global directory, which greatly enhances transparency in the global marketplace. The Financial Stability Board (FSB) has reiterated that global LEI adoption underpins “multiple financial stability objectives” such as improved risk management in firms as well as better assessment of micro and macro prudential risks. As a result, it promotes market integrity while containing market abuse and financial fraud. Last but not least, LEI rollout “supports higher quality and accuracy of financial data overall”. The publicly available LEI data pool is a unique key to standardized information on legal entities globally. The data is registered and regularly verified according to protocols and procedures established by the Regulatory Oversight Committee. In cooperation with its partners in the Global LEI System, the Global Legal Entity Identifier Foundation (GLEIF) continues to focus on further optimizing the quality, reliability and usability of LEI data, empowering market participants to benefit from the wealth of information available with the LEI population. The drivers of the LEI initiative, i.e. the Group of 20, the FSB and many regulators around the world, have emphasized the need to make the LEI a broad public good. The Global LEI Index, made available by GLEIF, greatly contributes to meeting this objective. It puts the complete LEI data at the disposal of any interested party, conveniently and free of charge. The benefits for the wider business community to be generated with the Global LEI Index grow in line with the rate of LEI adoption. To maximize the benefits of entity identification across financial markets and beyond, firms are therefore encouraged to engage in the process and get their own LEI. Obtaining an LEI is easy. Registrants simply contact their preferred business partner from the list of LEI issuing organizations available on the GLEIF website.	4.67 TiB	Various	https://registry.opendata.aws/lei/
Textbook Question Answering (TQA)	1,076 textbook lessons, 26,260 questions, 6229 images	397.26 GiB	Various	https://registry.opendata.aws/allenai-tqa/
NASA NEX	A collection of Earth science datasets maintained by NASA, including climate change projections and satellite images of the Earth's surface.	53.83 TiB	Various	https://registry.opendata.aws/nasanex/
Provision of Web-Scale Parallel Corpora for Official European Languages (ParaCrawl)	ParaCrawl is a set of large parallel corpora to/from English for all official EU languages by a broad web crawling effort. State-of-the-art methods are applied for the entire processing chain from identifying web sites with translated text all the way to collecting, cleaning and delivering parallel corpora that are ready as training data for CEF.AT and translation memories for DG Translation.	76.44 TiB	Various	https://registry.opendata.aws/paracrawl/
Sentinel-3	This data set consists of observations from the Sentinel-3 satellite of the European Commission’s Copernicus Earth Observation Programme. Sentinel-3 is a polar orbiting satellite that completes 14 orbits of the Earth a day. It carries the Ocean and Land Colour Instrument (OLCI) for medium resolution marine and terrestrial optical measurements, the Sea and Land Surface Temperature Radiometer (SLSTR), the SAR Radar Altimeter (SRAL), the MicroWave Radiometer (MWR) and the Precise Orbit Determination (POD) instruments. The satellite was launched in 2016 and entered routine operational phase in 2017. Data is available from July 2017 onwards.	900.04 TiB	Various	https://registry.opendata.aws/sentinel-3/
Coupled Model Intercomparison Project 6	The sixth phase of global coupled ocean-atmosphere general circulation model ensemble.	1.67 PiB	Various	https://registry.opendata.aws/cmip6/
CAFE60 reanalysis	The CSIRO Climate retrospective Analysis and Forecast Ensemble system: version 1 (CAFE60v1) provides a large ensemble retrospective analysis of the global climate system from 1960 to present with sufficiently many realizations and at spatio-temporal resolutions suitable to enable probabilistic climate studies. Using a variant of the ensemble Kalman filter, 96 climate state estimates are generated over the most recent six decades. These state estimates are constrained by monthly mean ocean, atmosphere and sea ice observations such that their trajectories track the observed state while enabling estimation of the uncertainties in the approximations to the retrospective mean climate over recent decades. Strongly coupled data assimilation (SCDA) is implemented via an ensemble transform Kalman filter in order to constrain a general circulation climate model to observations. Satellite (altimetry, sea surface temperature, sea ice concentration) and in situ ocean temperature and salinity profiles are directly assimilated each month, whereas atmospheric observations are sub-sampled from the JRA55 atmospheric reanalysis. Strong coupling is implemented via explicit cross domain covariances between ocean, atmosphere, sea ice and ocean biogeochemistry. Atmospheric and surface ocean fields are available at daily resolution and monthly resolution for the land, subsurface ocean and sea ice. The system also produces a complete data archive of initial conditions potentially enabling individual forecasts for all members each month over the 60 year period. The size of the ensemble and application of strongly coupled data assimilation lead to new insights for future reanalyses. CAFE60v1 has been validated in comparison to empirical indices of the major climate teleconnections and blocking from various reanalysis products (ERA5, JRA55, NCEP NR1). Estimates of the large scale ocean structure and transports have been compared to those derived from gridded observational products (WOA18, HadISST, ERSSTv5) and climate model projections (CMIP). Sea ice (extent, concentration and variability) and land surface (precipitation and surface air temperatures) are also compared to a variety of model (ERA5, CMIP) and observational (GPCP, AWAP, HadCRU4, GIOMAS, NSIDC, HadISST) products. This analysis shows that CAFE60v1 is a useful, comprehensive and unique data resource for studying internal climate variability and predictability, including the recent climate response to anthropogenic forcing on multi-year to decadal time scales.	58.89 TiB	Various	https://registry.opendata.aws/csiro-cafe60/
Longitudinal Nutrient Deficiency	Dataset associated with the 2021 AAAI paper "Detection and Prediction of Nutrient Deficiency Stress using Longitudinal Aerial Imagery". The dataset contains 3 image sequences of aerial imagery from 386 farm parcels which have been annotated for nutrient deficiency stress.	1.79 GiB	Various	https://registry.opendata.aws/intelinair_longitudinal_nutrient_deficiency/
New Jersey Statewide Digital Aerial Imagery Catalog	The New Jersey Office of GIS, NJ Office of Information Technology manages a series of 11 digital orthophotography and scanned aerial photo maps collected at various years ranging from 1930 to 2017. Each year’s worth of imagery is available as Cloud Optimized GeoTIFF (COG) files and some years are available as compressed MrSID and/or JP2 files. Additionally, each year of imagery is organized into a tile grid scheme covering the entire geography of New Jersey. Many years share the same tiling grid while others have unique grids as defined by the project at the time.	10.06 TiB	Various	https://registry.opendata.aws/nj-imagery/
Basic Local Alignment Sequences Tool (BLAST) Databases	A centralized repository of pre-formatted BLAST databases created by the National Center for Biotechnology Information (NCBI).	142.61 TiB	Various	https://registry.opendata.aws/ncbi-blast-databases/
InRad COVID-19 X-Ray and CT Scans	This dataset is a collection of anonymized thoracic radiographs (X-Rays) and computed tomography (CT) scans of patients with suspected COVID-19. Images are accompanied by a positive or negative diagnosis for SARS-CoV2 infection via RT-PCR. These images were provided by Hospital das Clínicas da Universidade de São Paulo, Hospital Sirio-Libanes, and by Laboratory Fleury.	266.3 GiB	Various	https://registry.opendata.aws/inlab-covid-19-images-dataset/
Ohio State Cardiac MRI Raw Data (OCMR)	OCMR is an open-access repository that provides multi-coil k-space data for cardiac cine. The fully sampled MRI datasets are intended for quantitative comparison and evaluation of image reconstruction methods. The free-breathing, prospectively undersampled datasets are intended to evaluate their performance and generalizability qualitatively.	179.96 GiB	Various	https://registry.opendata.aws/ocmr_data/
District of Columbia - Classified Point Cloud LiDAR	LiDAR point cloud data for Washington, DC is available for anyone to use on Amazon S3.	314.15 GiB	Various	https://registry.opendata.aws/dc-lidar/
NOAA Global Extratropical Surge and Tide Operational Forecast System (Global ESTOFS)	NOAA's Global Extratropical Surge and Tide Operational Forecast System (Global ESTOFS) provides users with nowcasts (analyses of near present conditions) and forecast guidance of water level conditions for the entire globe. Global ESTOFS has been developed to serve the marine navigation, weather forecasting, and disaster mitigation user communities. Global ESTOFS was developed in a collaborative effort between the NOAA/National Ocean Service (NOS)/Office of Coast Survey, the NOAA/National Weather Service (NWS)/National Centers for Environmental Prediction (NCEP) Central Operations (NCO), the University of Notre Dame, the University of North Carolina, and The Water Institute of the Gulf. The model generates forecasts out to 180 hours four times per day; forecast output includes water levels caused by the combined effects of storm surge and tides, by astronomical tides alone, and by sub-tidal water levels (isolated storm surge).	71.61 TiB	Various	https://registry.opendata.aws/noaa-gestofs/
COVID-19 Genome Sequence Dataset	A centralized sequence repository for all records containing sequence associated with the novel coronavirus (SARS-CoV-2) submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). Included are both the original sequences submitted by the principal investigator as well as SRA-processed sequences that require the SRA Toolkit for analysis. Additionally, submitter provided metadata included in associated BioSample and BioProject records is available alongside NCBI calculated data, such as k-mer based taxonomy analysis results, contiguous assemblies (contigs) and associated statistics such as contig length, blast results for the assembled contigs, contig annotation, blast databases of contigs and their annotated peptides, and VCF files generated for each record relative to the SARS-CoV-2 RefSeq record. Finally, metadata is additionally made available in parquet format to facilitate search and filtering using the AWS Athena Service.	1.02 PiB	Various	https://registry.opendata.aws/ncbi-covid-19/
NOAA Coastal Lidar Data	Lidar (light detection and ranging) is a technology that can measure the 3-dimensional location of objects, including the solid earth surface. The data consists of a point cloud of the positions of solid objects that reflected a laser pulse, typically from an airborne platform. In addition to the position, each point may also be attributed by the type of object it reflected from, the intensity of the reflection, and other system dependent metadata. The NOAA Coastal Lidar Data is a collection of lidar projects from many different sources and agencies, geographically focused on the coastal areas of the United States of America. The data is provided in Entwine Point Tiles (https://entwine.io) format, which is a lossless streamable octree of the point cloud. Datasets are maintained in their original projects and care should be taken when merging projects. The coordinate reference system for the data is the NAD83(2011) UTM zone appropriate for the center of each data set and the orthometric datum appropriate for that area (for example, NAVD88 in the mainland United States, PRVD02 in Puerto Rico, or GUVD03 in Guam). The geoid model used is reflected in the data set resource name.	123.63 TiB	Various	https://registry.opendata.aws/noaa-coastal-lidar/
AgricultureVision	Agriculture-Vision aims to be a publicly available large-scale aerial agricultural image dataset that is high-resolution, multi-band, and with multiple types of patterns annotated by agronomy experts. The original dataset affiliated with the 2020 CVPR paper includes 94,986 512x512 images sampled from 3,432 farmlands with nine types of annotations: double plant, drydown, endrow, nutrient deficiency, planter skip, storm damage, water, waterway and weed cluster. All of these patterns have substantial impacts on field conditions and the final yield. These farmland images were captured between 2017 and 2019 across multiple growing seasons in numerous farming locations in the US. Each field image contains four color channels: Near-infrared (NIR), Red, Green and Blue. We first randomly split the 3,432 farmland images with a 6/2/2 train/val/test ratio. We then assign each sampled image to the split of the farmland image it is cropped from. This guarantees that no cropped images from the same farmland will appear in multiple splits in the final dataset. The generated (supervised) Agriculture-Vision dataset thus contains 56,944/18,334/19,708 train/val/test images.	914.14 GiB	Various	https://registry.opendata.aws/intelinair_agriculture_vision/
Terra Fusion Data Sampler	The Terra Basic Fusion dataset is a fused dataset of the original Level 1 radiances	136.23 TiB	Various	https://registry.opendata.aws/terrafusion/
SILAM Air Quality	Air Quality is a global SILAM atmospheric composition and air quality forecast performed on a daily basis for > 100 species and covering the troposphere and the stratosphere. The output produces 3D concentration fields and aerosol optical thickness. The data are unique: 20km resolution for global AQ models is unseen worldwide.	104.4 GiB	Various	https://registry.opendata.aws/silam/
KITTI Vision Benchmark Suite	Dataset and benchmarks for computer vision research in the context of autonomous driving. The dataset has been recorded in and around the city of Karlsruhe, Germany using the mobile platform AnnieWay (VW station wagon) which has been equipped with several RGB and monochrome cameras, a Velodyne HDL 64 laser scanner as well as an accurate RTK corrected GPS/IMU localization unit. The dataset has been created for computer vision and machine learning research on stereo, optical flow, visual odometry, semantic segmentation, semantic instance segmentation, road segmentation, single image depth prediction, depth map completion, 2D and 3D object detection and object tracking. In addition, several raw data recordings are provided. The datasets are captured by driving around the mid-size city of Karlsruhe, in rural areas and on highways. Up to 15 cars and 30 pedestrians are visible per image.	894.47 GiB	Various	https://registry.opendata.aws/kitti/
NOAA Fundamental Climate Data Records (FCDR)	NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).	80.47 TiB	Various	https://registry.opendata.aws/noaa-cdr-fundamental/
GATK Test Data	The GATK test data resource bundle is a collection of files for resequencing human genomic data with the	975.97 GiB	Various	https://registry.opendata.aws/gatk-test-data/
NOAA/PMEL Ocean Climate Stations Moorings	The mission of the Ocean Climate Stations (OCS) Project is to make meteorological and	1.64 GiB	Various	https://registry.opendata.aws/noaa-ocean-climate-stations/
Boreas Autonomous Driving Dataset	This autonomous driving dataset includes data from a 128-beam Velodyne Alpha-Prime lidar, a 5MP Blackfly camera, a 360-degree Navtech radar, and post-processed Applanix POS LV GNSS data. This dataset was collected in various weather conditions (sun, rain, snow) over the course of a year. The intended purpose of this dataset is to enable benchmarking of long-term all-weather odometry and metric localization across various sensor types. In the future, we hope to also support an object detection benchmark.	4.39 TiB	Various	https://registry.opendata.aws/boreas/
Allen Ivy Glioblastoma Atlas	This dataset consists of images of glioblastoma human brain tumor tissue sections that have been probed for expression of particular genes believed to play a role in development of the cancer. Each tissue section is adjacent to another section that was stained with a reagent useful for identifying histological features of the tumor. Each of these types of images has been completely annotated for tumor features by a machine learning process trained by expert medical doctors.	8.52 TiB	Various	https://registry.opendata.aws/allen-ivy-glioblastoma-atlas/
OpenAQ	Global, aggregated physical air quality data from public data sources provided by government, research-grade and other sources. These awesome groups do the hard work of measuring these data and publicly sharing them, and our community makes them more universally-accessible to both humans and machines.	1.04 TiB	Various	https://registry.opendata.aws/openaq/
3000 Rice Genomes Project	The 3000 Rice Genome Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries.	96.64 TiB	Various	https://registry.opendata.aws/3kricegenome/
ChEMBL - Data Lakehouse Ready	ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs. This representation of ChEMBL is stored in Parquet format and most easily utilized through Amazon Athena. Follow the documentation for install instructions (< 2 minute install). New ChEMBL releases occur sporadically; the most up to date information on ChEMBL releases can be found here.	8.43 GiB	Various	https://registry.opendata.aws/chembl/
Open Targets - Data Lakehouse Ready	This is a Parquet representation of the Open Targets Platform's latest export. The Open Targets Platform integrates evidence from genetics, genomics, transcriptomics, drugs, animal models and scientific literature to score and rank target-disease associations for drug target identification. The Open Targets Platform (https://www.targetvalidation.org) is a freely available resource for the integration of genetics, genomics, and chemical data to aid systematic drug target identification and prioritisation.	15.72 GiB	Various	https://registry.opendata.aws/opentargets/
Smithsonian Open Access	The Smithsonian’s mission is the "increase and diffusion of knowledge" and has been collecting since 1846. The Smithsonian, through its efforts to digitize its multidisciplinary collections, has created millions of digital assets and related metadata describing the collection objects. On February 25th, 2020, the Smithsonian released over 2.8 million CC0 interdisciplinary 2-D and 3-D images, related metadata, and additionally, research data from researchers across the Smithsonian. The 2.8 million "open access" collections are a subset of the Smithsonian’s 155 million objects, 2.1 million library volumes and 156,000 cubic feet of archival collections held in 19 museums, 9 research centers, libraries, archives and the National Zoo. Digitization of collections is ongoing.	618.51 TiB	Various	https://registry.opendata.aws/smithsonian-open-access/
Multi-Scale Ultra High Resolution (MUR) Sea Surface Temperature (SST)	A global, gap-free, gridded, daily 1 km Sea Surface Temperature (SST) dataset created by merging multiple Level-2 satellite SST datasets. Those input datasets include the NASA Advanced Microwave Scanning Radiometer-EOS (AMSR-E), the JAXA Advanced Microwave Scanning Radiometer 2 (AMSR-2) on GCOM-W1, the Moderate Resolution Imaging Spectroradiometers (MODIS) on the NASA Aqua and Terra platforms, the US Navy microwave WindSat radiometer, the Advanced Very High Resolution Radiometer (AVHRR) on several NOAA satellites, and in situ SST observations from the NOAA iQuam project. Data are available from 2002 to present in Zarr format. The original source of the MUR data is the NASA JPL Physical Oceanography DAAC.	8.59 TiB	Various	https://registry.opendata.aws/mur/
OpenSurfaces	A large database of annotated surfaces created from real-world consumer photographs.	1.08 TiB	Various	https://registry.opendata.aws/opensurfaces/
Tabula Muris Senis	Tabula Muris Senis is a comprehensive compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 500,000 cells from 18 organs and tissues across the mouse lifespan. We discovered cell-specific changes occurring across multiple cell types and organs, as well as age related changes in the cellular composition of different organs. Using single-cell transcriptomic data we were able to assess cell type specific manifestations of different hallmarks of aging, such as senescence, changes in the activity of metabolic pathways, depletion of stem-cell populations, genomic instability and the role of inflammation as well as other changes in the organism’s immune system. Tabula Muris Senis provides a wealth of new molecular information about how the most significant hallmarks of aging are reflected in a broad range of tissues and cell types. See: https://www.biorxiv.org/content/10.1101/661728v1	105.61 TiB	Various	https://registry.opendata.aws/tabula-muris-senis/
Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3)	The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the	74.35 GiB	Various	https://registry.opendata.aws/cptac-3/
NOAA Unified Forecast System Subseasonal to Seasonal Prototypes	The Unified Forecast System Subseasonal to Seasonal prototypes consist of reforecast data from the UFS atmosphere-ocean coupled model experimental prototype version 5, 6, and 7 produced by the Medium Range and Subseasonal to Seasonal Application team of the UFS-R2O project. The UFS prototypes are the first dataset released to the broader weather community for analysis and feedback as part of the development of the next generation operational numerical weather prediction system from NWS. The dataset includes all the major weather variables for atmosphere, land, ocean, sea ice, and ocean waves.	152.21 TiB	Various	https://registry.opendata.aws/noaa-ufs-s2s/
Storm EVent ImageRy (SEVIR)	Collection of spatially and temporally aligned GOES-16 ABI satellite imagery, NEXRAD radar mosaics, and GOES-16 GLM lightning detections.	969.68 GiB	Various	https://registry.opendata.aws/sevir/
Yale-CMU-Berkeley (YCB) Object and Model Set	This project primarily aims to facilitate performance benchmarking in robotics research. The dataset provides mesh models, RGB, RGB-D and point cloud images of over 80 objects. The physical objects are also available via the YCB benchmarking project. The data are collected by two state-of-the-art systems: UC Berkeley's scanning rig and the Google scanner. The UC Berkeley scanning rig data provide meshes generated with Poisson reconstruction, meshes generated with volumetric range image integration, textured versions of both meshes, Kinbody files for using the meshes with OpenRAVE, 600 high-resolution RGB images, 600 RGB-D images, and 600 point cloud images for each object. The Google scanner data provide 3 meshes with different resolutions (16k, 64k, and 512k polygons), textured versions of each mesh, and Kinbody files for using the meshes with OpenRAVE.	209.1 GiB	Various	https://registry.opendata.aws/ycb-benchmarks/
NapierOne Mixed File Dataset	NapierOne is a modern cybersecurity mixed file data set, primarily aimed at, but not limited to, ransomware detection and forensic analysis. The dataset contains over 500,000 distinct files, representing 44 distinct popular file types. It was designed to address the known deficiency in research reproducibility and improve consistency by facilitating research replication and repeatability. The data set was inspired by the Govdocs1 data set and it is intended that ‘NapierOne’ be used as a complement to this original data set. An investigation was performed with the goal of determining the common file types currently in use. No specific research was found that explicitly provided this information, so an alternative consensus approach was employed. This involved combining the findings from multiple sources of file type usage into an overall ranked list. Then 5,000 real-world example files were gathered, and a specific data subset was created, for each of the common file types identified. In some circumstances, multiple data subsets were created for a specific file type, each subset representing a specific characteristic for that file type. For example, there are multiple data subsets for the ZIP file type with each subset containing examples of a specific compression method. Ransomware execution tends to produce files that have high entropy, so examples of file types that naturally have this attribute are also present. The resulting entire data set comprises more than 90 separate data subsets divided between 44 distinct file types, resulting in over 500,000 unique files in total. Currently, the data set contains examples of the following file types: APK, BIN, BMP, CSS, CSV, DOC, DOCX, DWG, ELF, EPS, EPUB, EXE, GIF, GZIP, HTML, ICS, JS, JPG, JSON, MKV, MP3, MP4, ODS, OXPS, PDF, PNG, PPT, PPTX, PS1, RAR, SVG, TAR, TIF, TXT, WEBP, XLS, XLSX, XML, ZIP, ZLIB, 7Zip	1.6 TiB	Various	https://registry.opendata.aws/napierone/
NOAA Global Mosaic of Geostationary Satellite Imagery (GMGSI)	NOAA/NESDIS Global Mosaic of Geostationary Satellite Imagery (GMGSI) visible (VIS), shortwave infrared (SIR), longwave infrared (LIR) imagery, and water vapor imagery (WV) are composited from data from several geostationary satellites orbiting the globe, including the GOES-East and GOES-West Satellites operated by U.S. NOAA/NESDIS, the Meteosat-11 and Meteosat-8 satellites from the Meteosat Second Generation (MSG) series of satellites operated by European Organization for the Exploitation of Meteorological Satellites (EUMETSAT), and the Himawari-8 satellite operated by the Japan Meteorological Agency (JMA). GOES-East is positioned at 75 deg W longitude over the equator. GOES-West is located at 137.2 deg W longitude over the equator. Both satellites cover an area from the eastern Atlantic Ocean to the central Pacific Ocean region. The Meteosat-11 satellite is located at 0 deg E longitude to cover Europe and Africa regions. The Meteosat-8 satellite is located at 41.5 deg E longitude to cover the Indian Ocean region. The Himawari-8 satellite is located at 140.7 deg E longitude to cover the Asia-Oceania region. The visible imagery indicates cloud cover and ice and snow cover. The shortwave, or mid-infrared, indicates cloud cover and fog at night. The longwave, or thermal infrared, depicts cloud cover and land/sea temperature patterns. The water vapor imagery indicates the amount of water vapor contained in the mid to upper levels of the troposphere, with the darker grays indicating drier air and the brighter grays/whites indicating more saturated air. GMGSI composite images have an approximate 8 km (5 mile) horizontal resolution and are updated every hour.	141.73 GiB	Various	https://registry.opendata.aws/noaa-gmgsi/
SpaceNet	SpaceNet, launched in August 2016 as an open innovation project offering a repository of freely available	10.65 TiB	Various	https://registry.opendata.aws/spacenet/
PoroTomo	Released to the public as part of the Department of Energy's Open Energy Data	271.67 TiB	Various	https://registry.opendata.aws/nrel-pds-porotomo/
Sophos/ReversingLabs 20 Million malware detection dataset	A dataset intended to support research on machine learning	9.38 TiB	Various	https://registry.opendata.aws/sorel-20m/
NOAA High-Resolution Rapid Refresh (HRRR) Model	The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.	1.8 PiB	Various	https://registry.opendata.aws/noaa-hrrr-pds/
NOAA Atmospheric Climate Data Records	NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).	6.9 TiB	Various	https://registry.opendata.aws/noaa-cdr-atmospheric/
MODIS MYD13A1, MOD13A1, MYD11A1, MOD11A1, MCD43A4	Data from the Moderate Resolution Imaging Spectroradiometer (MODIS), managed by	564.5 GiB	Various	https://registry.opendata.aws/modis-astraea/
Allen Mouse Brain Atlas	The Allen Mouse Brain Atlas is a genome-scale collection of cellular resolution gene expression profiles using in situ hybridization (ISH). Highly methodical data production methods and comprehensive anatomical coverage via dense, uniformly spaced sampling facilitate data consistency and comparability across >20,000 genes. The use of an inbred mouse strain with minimal animal-to-animal variance allows one to treat the brain essentially as a complex but highly reproducible three-dimensional tissue array. The entire Allen Mouse Brain Atlas dataset and associated tools are available through an unrestricted web-based viewing application (http://mouse.brain-map.org). The collection of >650,000 images has been made available in this Open Data bucket to enable efficient access and analysis of this dataset.	190.01 TiB	Various	https://registry.opendata.aws/allen-mouse-brain-atlas/
Amazon-PQA	Amazon product questions and their answers, along with the public product information.	19.42 GiB	Various	https://registry.opendata.aws/amazon-pqa/
OpenCell on AWS	The OpenCell project is a proteome-scale effort to measure the localization and interactions of human proteins	849.63 GiB	Various	https://registry.opendata.aws/czb-opencell/
NOAA Global Ensemble Forecast System (GEFS)	The Global Ensemble Forecast System (GEFS), previously known as the GFS Global ENSemble (GENS), is a weather forecast model made up of 21 separate forecasts, or ensemble members. The National Centers for Environmental Prediction (NCEP) started the GEFS to address the nature of uncertainty in weather observations, which are used to initialize weather forecast models. The GEFS attempts to quantify the amount of uncertainty in a forecast by generating an ensemble of multiple forecasts, each minutely different, or perturbed, from the original observations. With global coverage, GEFS is produced four times a day with weather forecasts going out to 16 days.	1.43 PiB	Various	https://registry.opendata.aws/noaa-gefs/
Speedtest by Ookla Global Fixed and Mobile Network Performance Maps	Global fixed broadband and mobile (cellular) network performance, allocated to zoom level 16 web mercator tiles (approximately 610.8 meters by 610.8 meters at the equator). Data is provided in both Shapefile format as well as Apache Parquet with geometries represented in Well Known Text (WKT) projected in EPSG:4326. Download speed, upload speed, and latency are collected via the Speedtest by Ookla applications for Android and iOS and averaged for each tile. Measurements are filtered to results containing GPS-quality location accuracy.	16.37 GiB	Various	https://registry.opendata.aws/speedtest-global-performance/
Sudachi Language Resources	Japanese dictionaries and pre-trained models (word embeddings and language models) for natural language processing.	73.0 GiB	Various	https://registry.opendata.aws/sudachi/
Discrete Reasoning Over the content of Paragraphs (DROP)	The DROP dataset contains 96k Question and Answer pairs (QAs) over 6.7K paragraphs, split between train (77k QAs), development (9.5k QAs) and a hidden test partition (9.5k QAs).	397.26 GiB	Various	https://registry.opendata.aws/allenai-drop/
ESA WorldCover	The European Space Agency (ESA) WorldCover is a global land cover map with 11 different land cover classes produced at 10m resolution based on a combination of both Sentinel-1 and Sentinel-2 data. In areas where Sentinel-2 images are covered by clouds for an extended period of time, Sentinel-1 data then provides complementary information on the structural characteristics of the observed land cover. Therefore, the combination of Sentinel-1 and Sentinel-2 data makes it possible to update the land cover map almost in real time. WorldCover Map has been produced for 2020 (01 January to 31 December) with a global coverage as part of the 5th Earth Observation Envelope Programme (EOEP-5). It provides valuable information for applications such as biodiversity, food security, carbon assessment and climate modelling. More information can be found on the WorldCover website and the product User Manual.	115.23 GiB	Various	https://registry.opendata.aws/esa-worldcover/
Hecatomb Databases	Preprocessed databases for use with the Hecatomb pipeline for viral and phage sequence annotation.	58.65 GiB	Various	https://registry.opendata.aws/hecatomb/
AWS iGenomes	Common reference genomes hosted on AWS S3. Can be used when aligning and analysing raw DNA sequencing data.	6.19 TiB	Various	https://registry.opendata.aws/aws-igenomes/
Copernicus Digital Elevation Model (DEM)	The Copernicus DEM is a Digital Surface Model (DSM) which represents the surface of the Earth including buildings, infrastructure and vegetation. We provide two instances of Copernicus DEM named GLO-30 Public and GLO-90. GLO-90 provides worldwide coverage at 90 meters. GLO-30 Public provides limited worldwide coverage at 30 meters because a small subset of tiles covering specific countries are not yet released to the public by the Copernicus Programme. Note that in both cases ocean areas do not have tiles; height values there can be assumed to be zero. Data is provided as Cloud Optimized GeoTIFFs.	614.89 GiB	Various	https://registry.opendata.aws/copernicus-dem/
1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 and 3.7	This dataset contains alignment files and short nucleotide, copy number, repeat expansion (STR) and structural variant call files from the 1000 Genomes Project Phase 3 dataset (n=3202) using Illumina DRAGEN v3.5.7b and v3.7.6 software. The v3.7.6 dataset also includes results from joint small variant, de novo structural variant, de novo copy number variant and repeat expansion calls on 602 trio families comprised of members from the 1000 Genomes Project Phase 3 dataset, as well as DRAGEN gVCF Genotyper (v3.8.3) analysis on the entire dataset (n=3202). Improvements and new features in the v3.7.6 individual samples analyses include CYP2D6 variant calling and joint detection (see ‘DRAGEN 3.7 User Guide’ for details on these features) and use of graph-based hg19 and hg38 reference hash tables (see ‘DRAGEN Wins at PrecisionFDA Truth Challenge V2 Showcase Accuracy Gains from Alt-aware Mapping and Graph Reference Genomes’ for details).	696.97 TiB	Various	https://registry.opendata.aws/ilmn-dragen-1kgp/
Open City Model (OCM)	Open City Model is an initiative to provide cityGML data for all the buildings in the United States.	415.93 GiB	Various	https://registry.opendata.aws/opencitymodel/
NREL Wind Integration National Dataset	Released to the public as part of the Department of Energy's Open Energy Data Initiative,	1.6 PiB	Various	https://registry.opendata.aws/nrel-pds-wtk/
PubSeq - Public Sequence Resource	COVID-19 PubSeq is a free and open online bioinformatics public sequence resource with on-the-fly analysis of sequenced SARS-CoV-2 samples that allows for a quick turnaround in identification of new virus strains. PubSeq allows anyone to upload sequence material in the form of FASTA or FASTQ files with accompanying metadata through the web interface or REST API.	2.92 GiB	Various	https://registry.opendata.aws/pubseq/
IBL Neuropixels Brainwide Map on AWS	Electrophysiological recordings of mouse brain activity acquired using Neuropixels probes.	555.06 GiB	Various	https://registry.opendata.aws/ibl-brain-wide-map/
Digital Earth Africa GeoMAD	GeoMAD is the Digital Earth Africa (DE Africa) surface reflectance geomedian and triple Median Absolute Deviation data service. It is a cloud-free composite of satellite data compiled over specific timeframes.	174.14 TiB	Various	https://registry.opendata.aws/deafrica-geomad/
Open Observatory of Network Interference (OONI)	A free software, global observation network for detecting censorship, surveillance and traffic manipulation on the internet.	91.42 TiB	Various	https://registry.opendata.aws/ooni/
NIH NCBI Sequence Read Archive (SRA) on AWS	The Sequence Read Archive (SRA), produced by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) at the National Institutes of Health (NIH), stores raw DNA sequencing data and alignment information from high-throughput sequencing platforms. The SRA provides open access to these biological sequence data to support the research community's efforts to enhance reproducibility and make new discoveries by comparing data sets. Buckets in this registry contain public SRA data in the original (user submitted) format from select high value and newly-released studies as well as all public-access SRA formatted ETL+BQS data. Also included is all SRA metadata that can be leveraged for attribute-based data discovery.	12.23 PiB	Various	https://registry.opendata.aws/ncbi-sra/
Oxford Nanopore Technologies Benchmark Datasets	The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support: 1) exploration of the characteristics of nanopore sequence data, 2) assessment and reproduction of performance benchmarks, and 3) development of tools and methods. The data deposited showcases DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly-available reference samples (e.g. GM24385 as reference human). Raw data are provided with metadata and scripts to describe sample and data provenance.	41.98 TiB	Various	https://registry.opendata.aws/ont-open-data/
NOAA World Ocean Database (WOD)	The World Ocean Database (WOD) is the largest uniformly formatted, quality-controlled, publicly available historical subsurface ocean profile database. From Captain Cook's second voyage in 1772 to today's automated Argo floats, global aggregation of ocean variable information including temperature, salinity, oxygen, nutrients, and others vs. depth allow for study and understanding of the changing physical, chemical, and to some extent biological state of the World's Oceans. Browse the bucket via the AWS S3 explorer: https://noaa-wod-pds.s3.amazonaws.com/index.html	123.28 GiB	Various	https://registry.opendata.aws/noaa-wod/
Global Seasonal Sentinel-1 Interferometric Coherence and Backscatter Data Set	This data set is the first-of-its-kind spatial representation of multi-seasonal, global SAR repeat-pass interferometric coherence and backscatter signatures. Global coverage comprises all land masses and ice sheets from 82 degrees northern to 79 degrees southern latitude. The data set is derived from high-resolution multi-temporal repeat-pass interferometric processing of about 205,000 Sentinel-1 Single-Look-Complex data acquired in Interferometric Wide-Swath mode (Sentinel-1 IW mode) from 1-Dec-2019 to 30-Nov-2020. The data set was developed by Earth Big Data LLC and Gamma Remote Sensing AG, under contract for NASA's Jet Propulsion Laboratory. The data set covers four sets of seasonal (DJF/MAM/JJA/SON) metrics: 1) Median 6-, 12-, 18-, 24-, 36-, and 48-day repeat coherence estimates for C-band VV and HH polarized data, 2) Mean backscatter (gamma naught) for VV, VH, HH, and HV polarizations, 3) Seasonal coherence decay model parameters rho, tau, and rmse, 4) Local incidence and layover/shadow regions for all relative orbits (175 orbits). Note that in the data set filenames the seasons were referred to as northern hemisphere winter (DJF), spring (MAM), summer (JJA), and fall (SON). The data set is available in two main components: 1) 1x1 degree tiles. Each tile contains GeoTiffs at 3 arcsec pixel spacing of all metrics available in the tile. (s3://sentinel-1-global-coherence-earthbigdata/data/tiles/), 2) Global mosaicked tiles as cloud optimized GeoTIFFs (COG) at 0.01 degree pixel spacing (s3://sentinel-1-global-coherence-earthbigdata/data/mosaics/) for each of the computed metrics.	2.1 TiB	Various	https://registry.opendata.aws/ebd-sentinel-1-global-coherence-backscatter/
NOAA Continuously Operating Reference Stations (CORS) Network (NCN)	The NOAA Continuously Operating Reference Stations (CORS) Network (NCN), managed by NOAA/National Geodetic Survey (NGS), provides Global Navigation Satellite System (GNSS) data, supporting three-dimensional positioning, meteorology, space weather, and geophysical applications throughout the United States. The NCN is a multi-purpose, multi-agency cooperative endeavor, combining the efforts of hundreds of government, academic, and private organizations. The stations are independently owned and operated. Each agency shares its GNSS/GPS carrier phase and code range measurements and station metadata with NGS, which are analyzed and distributed free of charge.	13.75 TiB	Various	https://registry.opendata.aws/noaa-ncn/
Orcasound - bioacoustic data for marine conservation	Live-streamed and archived audio data (~2018-present) from underwater microphones (hydrophones) containing marine biological signals as well as ambient ocean noise. Hydrophone placement and passive acoustic monitoring effort prioritizes detection of orca sounds (calls, clicks, whistles) and potentially harmful noise. Geographic focus is on the US/Canada critical habitat of Southern Resident killer whales (northern CA to central BC) with initial focus on inland waters of WA. In addition to the raw lossy or lossless compressed data, we provide a growing archive of annotated bioacoustic bouts.	3.75 TiB	Various	https://registry.opendata.aws/orcasound/
1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 - Data Lakehouse Ready	The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. There were a total of 3202 individuals sequenced as part of Phase 3 of this project. The high coverage samples were processed using the Illumina DRAGEN v3.5.7b pipeline and are available at s3://1000genomes-dragen/. This dataset contains the VCFs transformed to Parquet/ORC in 3 different schemas - partitioned by samples, partitioned by chromosome and a nested data format. These representations of the 1000 Genomes DRAGEN data are stored in Parquet/ORC format and can be queried through Amazon Athena. To add these tables to your Glue Data Catalog and for sample queries on this dataset, please refer to the link in our Documentation.	1.7 TiB	Various	https://registry.opendata.aws/1000-genomes-data-lakehouse-ready/
Medical Segmentation Decathlon	With recent advances in machine learning, semantic segmentation algorithms are becoming increasingly general purpose and translatable to unseen tasks. Many key algorithmic advances in the field of medical imaging are commonly validated on a small number of tasks, limiting our understanding of the generalisability of the proposed contributions. A model which works out-of-the-box on many tasks, in the spirit of AutoML, would have a tremendous impact on healthcare. The field of medical imaging is also missing a fully open source and comprehensive benchmark for general purpose algorithmic validation and testing covering a large span of challenges, such as: small data, unbalanced labels, large-ranging object scales, multi-class labels, and multimodal imaging, etc. This challenge and dataset aim to provide such a resource through the open sourcing of large medical imaging datasets on several highly different tasks, and by standardising the analysis and validation process.	141.4 GiB	Various	https://registry.opendata.aws/msd/
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)	Therapeutically Applicable Research to Generate Effective Treatments (TARGET) is the collaborative effort of a large, diverse consortium of extramural and NCI investigators. The goal of the effort is to accelerate molecular discoveries that drive the initiation and progression of hard-to-treat childhood cancers and facilitate rapid translation of those findings into the clinic.	68.51 GiB	Various	https://registry.opendata.aws/target/
High Resolution Downscaled Climate Data for Southeast Alaska	This dataset contains historical and projected dynamically downscaled climate data for the Southeast region of the State of Alaska at 1 and 4km spatial resolution and hourly temporal resolution. Select variables are also summarized into daily resolutions. This data was produced using the Weather Research and Forecasting (WRF) model (Version 4.0). We downscaled both Climate Forecast System Reanalysis (CFSR) historical reanalysis data (1980-2019) and both historical and projected runs from two GCMs from the Coupled Model Inter-comparison Project 5 (CMIP5): GFDL-CM3 and NCAR-CCSM4 (historical run: 1980-2010 and RCP 8.5: 2030-2060).	31.01 TiB	Various	https://registry.opendata.aws/wrf-se-alaska-snap/
Central Weather Bureau OpenData	Various kinds of weather raw data and charts from Central Weather Bureau.	25.59 GiB	Various	https://registry.opendata.aws/cwb_opendata/
Cloud to Street - Microsoft Flood and Clouds Dataset	This dataset consists of chips of Sentinel-1 and Sentinel-2 satellite data. Each Sentinel-1 chip contains a corresponding label for water and each Sentinel-2 chip contains a corresponding label for water and clouds. Data is stored in folders by a unique event identifier as the folder name. Within each event folder there are subfolders for Sentinel-1 (s1) and Sentinel-2 (s2) data. Each chip is contained in its own sub-folder with the folder name being the source image id, followed by a unique chip identifier consisting of a hyphenated set of 5 numbers. All bands of the satellite data, as well as the labels, and overview images are contained within the chip folder.	10.0 GiB	Various	https://registry.opendata.aws/c2smsfloods/
Earth Observation Data Cubes for Brazil	Earth observation (EO) data cubes produced from analysis-ready data (ARD) of CBERS-4, Sentinel-2 A/B and Landsat-8 satellite images for Brazil. The datacubes are regular in time and use a hierarchical tiling system. Further details are described in Ferreira et al. (2020).	117.79 TiB	Various	https://registry.opendata.aws/brazil-data-cubes/
UniProt	The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and host institutions EMBL-EBI, SIB Swiss Institute of Bioinformatics and PIR are committed to the long-term preservation of the UniProt databases.	3.21 TiB	Various	https://registry.opendata.aws/uniprot/
Southern California Earthquake Data	This dataset contains ground motion velocity and acceleration seismic waveforms recorded by the Southern California Seismic Network (SCSN) and archived at the Southern California Earthquake Data Center (SCEDC).	93.74 TiB	Various	https://registry.opendata.aws/southern-california-earthquakes/
NOAA Rapid Refresh Forecast System (RRFS) Ensemble [Prototype]	The Rapid Refresh Forecast System (RRFS) is the National Oceanic and Atmospheric Administration’s (NOAA) next generation convection-allowing, rapidly-updated ensemble prediction system, currently scheduled for operational implementation in late 2023. The operational configuration will feature a 3 km grid covering North America and include forecasts every hour out to 18 hours, with extensions to 60 hours four times per day at 00, 06, 12, and 18 UTC. Each forecast is planned to be composed of 9-10 members. The RRFS will provide guidance to support forecast interests including, but not limited to, aviation, severe convective weather, renewable energy, heavy precipitation, and winter weather on timescales where rapidly-updated guidance is particularly useful.	197.94 TiB	Various	https://registry.opendata.aws/noaa-rrfs/
The Cancer Genome Atlas	The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. TCGA has analyzed matched tumor and normal tissues from 11,000 patients, allowing for the comprehensive characterization of 33 cancer types and subtypes, including 10 rare cancers.	35.12 TiB	Various	https://registry.opendata.aws/tcga/
Natural Scenes Dataset	Here, we collected and pre-processed a massive, high-quality 7T fMRI dataset that can be used to advance our understanding of how the brain works. A unique feature of this dataset is the massive amount of data available per individual subject. The data were acquired using ultra-high-field fMRI (7T, whole-brain, 1.8-mm resolution, 1.6-s TR). We measured fMRI responses while each of 8 participants viewed 9,000–10,000 distinct, color natural scenes (22,500–30,000 trials) in 30–40 weekly scan sessions over the course of a year. Additional measures were collected including resting-state data, retinotopy, category localizers, anatomical data (T1, T2, diffusion, venogram, angiogram), physiological data (pulse, respiration), eye-tracking data, and additional behavioral assessments outside the scanner. Because of its unprecedented scale and richness, NSD can be used to explore diverse neuroscientific questions with high power at the level of individual subjects. In particular, the number of images sampled in this dataset is sufficiently large that the dataset may be of high interest for computer vision, machine learning, and other data-driven applications.	13.31 TiB	Various	https://registry.opendata.aws/nsd/
OpenNeuro	OpenNeuro is a database of openly-available brain imaging data. The data are shared according to a Creative Commons CC0 license, providing a broad range of brain imaging data to researchers and citizen scientists alike. The database primarily focuses on functional magnetic resonance imaging (fMRI) data, but also includes other imaging modalities including structural and diffusion MRI, electroencephalography (EEG), and magnetoencephalography (MEG). OpenfMRI is a project of the Center for Reproducible Neuroscience at Stanford University. Development of the OpenNeuro resource has been funded by the National Science Foundation, National Institute of Mental Health, National Institute on Drug Abuse, and the Laura and John Arnold Foundation.	32.97 TiB	Various	https://registry.opendata.aws/openneuro/
Digital Earth Africa Sentinel-2 Level-2A	The Sentinel-2 mission is part of the European Union Copernicus programme for Earth observations. Sentinel-2 consists of twin satellites, Sentinel-2A (launched 23 June 2015) and Sentinel-2B (launched 7 March 2017). The two satellites have the same orbit, but 180° apart for optimal coverage and data delivery. Their combined data is used in the Digital Earth Africa Sentinel-2 product.	1.97 PiB	Various	https://registry.opendata.aws/deafrica-sentinel-2/
U.S. Census ACS PUMS	U.S. Census Bureau American Community Survey (ACS) Public Use Microdata Sample (PUMS) available in a linked data format using the Resource Description Framework (RDF) data model.	15.44 GiB	Various	https://registry.opendata.aws/census-dataworld-pums/
Finnish Meteorological Institute Weather Radar Data	The up-to-date weather radar data from the FMI radar network is available as Open Data. The data contain both single radar data and composites over Finland in GeoTIFF and HDF5 formats. Available composite parameters consist of radar reflectivity (DBZ), rainfall intensity (RR), and precipitation accumulation of 1, 12, and 24 hours. Single radar parameters consist of radar reflectivity (DBZ), radial velocity (VRAD), rain classification (HCLASS), and cloud top height (ETOP 20). Raw volume data from single radars are also provided in HDF5 format with ODIM 2.3 conventions. Radar data becomes available as soon as it's received from the radar and pre-processed into deliverable formats. Typically the most recent radar data was collected less than 5 minutes ago.	101.63 TiB	Various	https://registry.opendata.aws/fmi-radar/
NOAA Joint Polar Satellite System (JPSS)	Satellites in the JPSS constellation gather global measurements of atmospheric, terrestrial and oceanic conditions, including sea and land surface temperatures, vegetation, clouds, rainfall, snow and ice cover, fire locations and smoke plumes, atmospheric temperature, water vapor and ozone. JPSS delivers key observations for the Nation's essential products and services, including forecasting severe weather like hurricanes, tornadoes and blizzards days in advance, and assessing environmental hazards such as droughts, forest fires, poor air quality and harmful coastal waters. Further, JPSS will provide continuity of critical, global observations of Earth’s atmosphere, oceans and land through 2038. The data will be available from 2012-01-19 to present.	114.01 TiB	Various	https://registry.opendata.aws/noaa-jpss/
NOAA U.S. Climate Normals	The U.S. Climate Normals are a large suite of data products that provide information about typical climate conditions for thousands of locations across the United States. Normals act both as a ruler to compare today’s weather and tomorrow’s forecast, and as a predictor of conditions in the near future. The official normals are calculated for a uniform 30 year period, and consist of annual/seasonal, monthly, daily, and hourly averages and statistics of temperature, precipitation, and other climatological variables from almost 15,000 U.S. weather stations.	29.33 GiB	Various	https://registry.opendata.aws/noaa-climate-normals/
iNaturalist Licensed Observation Images	iNaturalist is a community science effort in which participants share observations of living organisms that they encounter and document with photographic evidence, location, and date. The community works together reviewing these images to identify these observations to species. This collection represents the licensed images accompanying iNaturalist observations.	185.49 TiB	Various	https://registry.opendata.aws/inaturalist-open-data/
NOAA Global Hydro Estimator (GHE)	Global Hydro-Estimator provides a global	254.54 GiB	Various	https://registry.opendata.aws/noaa-ghe/
Image classification - fast.ai datasets	Some of the most important datasets for image classification research, including	24.46 GiB	Various	https://registry.opendata.aws/fast-ai-imageclas/
Crowdsourced Bathymetry	Community provided bathymetry data collected in collaboration with the International Hydrographic Organization.	24.97 GiB	Various	https://registry.opendata.aws/odp-noaa-nesdis-ncei-csb/
MWIS VR Instances	Large-scale node-weighted conflict graphs for maximum weight independent set solvers	12.87 GiB	Various	https://registry.opendata.aws/mwis-vr-instances/
Multiview Extended Video with Activities (MEVA)	The Multiview Extended Video with Activities (MEVA) dataset consists	498.15 GiB	Various	https://registry.opendata.aws/mevadata/
Prefeitura Municipal de São Paulo (PMSP) LiDAR Point Cloud	The objective of the Mapa 3D Digital da Cidade (M3DC) of the São Paulo City Hall is to publish LiDAR point cloud data. The initial data was acquired in 2017 by aerial surveying and future data will be added. This publicly accessible dataset is provided in the Entwine Point Tiles format as a lossless octree, full density, based on LASzip (LAZ) encoding.	394.46 GiB	Various	https://registry.opendata.aws/pmsp-lidar/
OpenStreetMap on AWS	OSM is a free, editable map of the world, created and maintained by volunteers. Regular OSM data archives are made available in Amazon S3.	3.72 TiB	Various	https://registry.opendata.aws/osm/
IRS 990 Filings	On December 16, 2021 the IRS announced that it would discontinue updates to the IRS 990 Filings dataset on AWS, starting December 31, 2021.	122.85 GiB	Various	https://registry.opendata.aws/irs990/
Pacific Ocean Sound Recordings	This project offers passive acoustic data (sound recordings) from a deep-ocean environment off central California. Recording began in July 2015, has been nearly continuous, and is ongoing. These resources are intended for applications	145.64 TiB	Various	https://registry.opendata.aws/pacific-sound/
Scottish Public Sector LiDAR Dataset	This dataset is Lidar data that has been collected by the Scottish public sector and made available under the Open Government Licence. The data are available as point cloud (LAS format or in LAZ compressed format), along with the derived Digital Terrain Model (DTM) and Digital Surface Model (DSM) products as Cloud optimized GeoTIFFs (COG) or standard GeoTIFF. The dataset contains multiple subsets of data which were each commissioned and flown in response to different organisational requirements. The details of each can be found at https://remotesensingdata.gov.scot/data#/list	1.78 TiB	Various	https://registry.opendata.aws/scottish-lidar/
4D Nucleome (4DN)	The goal of the National Institutes of Health (NIH) Common Fund’s 4D Nucleome (4DN) program	170.53 TiB	Various	https://registry.opendata.aws/4dnucleome/
STOIC2021 Training	The STOIC project collected Computed Tomography (CT) images of 10,735 individuals suspected of being infected with SARS-CoV-2 during the first wave of the pandemic in France, from March to April 2020. For each patient in the training set, the dataset contains binary labels for COVID-19 presence, based on RT-PCR test results, and COVID-19 severity, defined as intubation or death within one month from the acquisition of the CT scan. This S3 bucket contains the training sample of the STOIC dataset as used in the STOIC2021 challenge on grand-challenge.org.	243.69 GiB	Various	https://registry.opendata.aws/stoic2021-training/
HIRLAM Weather Model	HIRLAM (High Resolution Limited Area Model) is an operational synoptic and mesoscale weather prediction model managed by the Finnish Meteorological Institute.	244.26 TiB	Various	https://registry.opendata.aws/hirlam/
Broad Genome References	Broad maintained human genome reference builds hg19/hg38 and decoy references.	204.58 GiB	Various	https://registry.opendata.aws/broad-references/
Reasoning Over Paragraph Effects in Situations (ROPES)	14k QA pairs over 1.7K paragraphs, split between train (10k QAs), development (1.6k QAs) and a hidden test partition (1.7k QAs).	397.26 GiB	Various	https://registry.opendata.aws/allenai-ropes/
Atmospheric Models from Météo-France	Global and high-resolution regional atmospheric models from Météo-France.	5.05 TiB	Various	https://registry.opendata.aws/meteo-france-models/
Digital Earth Africa Landsat Collection 2 Level 2	Digital Earth Africa (DE Africa) provides free and open access to a copy of Landsat Collection 2 Level-2 products over Africa. These products are produced and provided by the United States Geological Survey (USGS).	587.47 TiB	Various	https://registry.opendata.aws/deafrica-landsat/
1940 Census Population Schedules, Enumeration District Maps, and Enumeration District Descriptions	The 1940 Census population schedules were created by the Bureau of the Census in an attempt to enumerate every person living in the United States on April 1, 1940, although some persons were missed. The 1940 census population schedules were digitized by the National Archives and Records Administration (NARA) and released publicly on April 2, 2012.	15.05 TiB	Various	https://registry.opendata.aws/nara-1940-census/
COVID-19 Data Lake	A centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of the novel coronavirus (SARS-CoV-2) and its associated illness, COVID-19. Globally, there are several efforts underway to gather this data, and we are working with partners to make this crucial data freely available and keep it up-to-date. Hosted on the AWS cloud, we have seeded our curated data lake with COVID-19 case tracking data from Johns Hopkins and The New York Times, hospital bed availability from Definitive Healthcare, and over 45,000 research articles about COVID-19 and related coronaviruses from the Allen Institute for AI.	157.68 GiB	Various	https://registry.opendata.aws/aws-covid19-lake/
NOAA Rapid Refresh (RAP)	The Rapid Refresh (RAP) is a NOAA/NCEP operational weather prediction system comprised primarily of a numerical forecast model and analysis/assimilation system to initialize that model. It covers North America and is run with a horizontal resolution of 13 km and 50 vertical layers. The RAP was developed to serve users needing frequently updated short-range weather forecasts, including those in the US aviation community and US severe weather forecasting community. The model is run for every hour of the day; it is integrated to 51 hours for the 03/09/15/21 UTC cycles and to 21 hours for every other cycle. The RAP uses the ARW core of the WRF model and the Gridpoint Statistical Interpolation (GSI) analysis - the analysis is aided with the assimilation of cloud and hydrometeor data to provide more skill in short-range cloud and precipitation forecasts.	163.59 TiB	Various	https://registry.opendata.aws/noaa-rap/
RarePlanes	RarePlanes is a unique open-source machine learning dataset from CosmiQ Works and AI.Reverie that incorporates both real and synthetically generated satellite imagery. The RarePlanes dataset specifically focuses on the value of AI.Reverie synthetic data to aid computer vision algorithms in their ability to automatically detect aircraft and their attributes in satellite imagery. Although other synthetic/real combination datasets exist, RarePlanes is the largest openly-available very high resolution dataset built to test the value of synthetic data from an overhead perspective. The real portion of the dataset consists of 253 Maxar WorldView-3 satellite scenes spanning 112 locations and 2,142 km^2 with 14,700 hand-annotated aircraft. The accompanying synthetic dataset is generated via AI.Reverie’s novel simulation platform and features 50,000 synthetic satellite images with ~630,000 aircraft annotations.	475.96 GiB	Various	https://registry.opendata.aws/rareplanes/
Amazon Bin Image Dataset	The Amazon Bin Image Dataset contains over 500,000 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations.	29.4 GiB	Various	https://registry.opendata.aws/amazon-bin-imagery/
Human Cancer Models Initiative (HCMI) Cancer Model Development Center	The Human Cancer Models Initiative (HCMI) is an international consortium that is generating novel,	7.12 GiB	Various	https://registry.opendata.aws/hcmi-cmdc/
NOAA Operational Forecast System (OFS)	For decades, mariners in the United States have depended on NOAA's Tide Tables for the best estimate of expected water levels. These tables provide accurate predictions of the astronomical tide (i.e., the change in water level due to the gravitational effects of the moon and sun and the rotation of the Earth); however, they cannot predict water-level changes due to wind, atmospheric pressure, and river flow, which are often significant.	129.37 TiB	Various	https://registry.opendata.aws/noaa-ofs/
NEXRAD on AWS	Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network.	598.42 TiB	Various	https://registry.opendata.aws/noaa-nexrad/
Department of Energy's Open Energy Data Initiative (OEDI)	Data released under the Department of Energy's Open Energy Data Initiative	52.16 TiB	Various	https://registry.opendata.aws/oedi-data-lake/
QIIME 2 User Tutorial Datasets	QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency. QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results. This dataset contains the user docs (and related datasets) for QIIME 2.	270.78 GiB	Various	https://registry.opendata.aws/qiime2/
The Klarna Product-Page Dataset	A collection of 51,701 product pages from 8,175 e-commerce websites across 8 markets (US, GB, SE, NL, FI, NO, DE, AT) with 5 manually labelled elements: the product price, name and image, and the add-to-cart and go-to-cart buttons.	124.53 GiB	Various	https://registry.opendata.aws/klarna_productpage_dataset/
Sentinel-1	Sentinel-1 is a pair of European radar imaging (SAR) satellites launched in 2014 and 2016. Its 6-day revisit cycle and ability to observe through clouds make it well suited for sea and land monitoring, emergency response due to environmental disasters, and economic applications. This dataset represents the global Sentinel-1 GRD archive, from beginning to the present, converted to cloud-optimized GeoTIFF format.	34.68 TiB	Various	https://registry.opendata.aws/sentinel-1/
The Human Microbiome Project	The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performed for thousands of these samples. In addition, whole genome sequences were generated for isolate strains collected from human body sites to act as reference organisms for analysis. Finally, 16S marker and whole metagenome sequencing was also done on additional samples from people suffering from several disease conditions.	5.33 TiB	Various	https://registry.opendata.aws/human-microbiome-project/
Voices Obscured in Complex Environmental Settings (VOiCES)	VOiCES is a speech corpus recorded in acoustically challenging settings,	465.01 GiB	Various	https://registry.opendata.aws/lab41-sri-voices/
Natural Earth	Natural Earth is a public domain map dataset available at 1:10m, 1:50m, and 1:110 million scales. Featuring tightly integrated vector and raster data, with Natural Earth you can make a variety of visually pleasing, well-crafted maps with cartography or GIS software.	26.57 GiB	Various	https://registry.opendata.aws/naturalearth/
Quoref	24K Question/Answer (QA) pairs over 4.7K paragraphs, split between train (19K QAs), development (2.4K QAs) and a hidden test partition (2.5K QAs).	397.26 GiB	Various	https://registry.opendata.aws/allenai-quoref/
University of British Columbia Sunflower Genome Dataset	This dataset captures Sunflower's genetic diversity originating	67.22 TiB	Various	https://registry.opendata.aws/ubc-sunflower-genome/
REDASA COVID-19 Open Data	The REaltime DAta Synthesis and Analysis (REDASA) COVID-19 snapshot contains the output of the curation protocol produced by our curator community. A detailed description can be found in our paper. The first S3 bucket listed in Resources contains a large collection of medical documents in text format extracted from the CORD-19 dataset, plus other sources deemed relevant by the REDASA consortium. The second S3 bucket contains a series of documents surfaced by Amazon Kendra that were considered relevant for each medical question asked. The final S3 bucket contains the GroundTruth annotations created by our curator community.	37.26 GiB	Various	https://registry.opendata.aws/redasa-covid-data/
Daylight Map Distribution of OpenStreetMap	Daylight is a complete distribution of global, open map data that’s freely available with support from community and professional mapmakers. Meta combines the work of global contributors to projects like OpenStreetMap with quality and consistency checks from Daylight mapping partners to create a free, stable, and easy-to-use street-scale global map. The Daylight Map Distribution contains a validated subset of the OpenStreetMap database. In addition to the standard OpenStreetMap PBF format, Daylight is available in two parquet formats that are optimized for AWS Athena including geometries (Points, LineStrings, Polygons, or MultiPolygons). First, Daylight OSM Features contains the nearly 1B renderable OSM features. Second, Daylight OSM Elements contains all of OSM, including all 7B nodes without attributes, and relations that do not contain geometries, such as turn restrictions.	2.33 TiB	Various	https://registry.opendata.aws/daylight-osm/
Galaxy Evolution Explorer Satellite (GALEX)	The Galaxy Evolution Explorer Satellite (GALEX) was a NASA mission led by the California Institute of Technology, whose primary goal was to investigate how star formation in galaxies evolved from the early Universe up to the present. GALEX used microchannel plate detectors to obtain direct images in the near-UV (NUV) and far-UV (FUV), and a grism to disperse light for low resolution spectroscopy. More information about GALEX is available at MAST.	15.26 TiB	Various	https://registry.opendata.aws/galex/
District of Columbia - Classified Point Cloud LiDAR	Please see here for the latest content about this dataset.	314.15 GiB	Various	https://registry.opendata.aws/dc-lidar-2015/
NOAA Global Surface Summary of Day	Global Surface Summary of the Day is derived from The Integrated Surface Hourly (ISH) dataset. The ISH dataset includes global data obtained from the USAF Climatology Center, located in the Federal Climate Complex with NCDC. The latest daily summary data are normally available 1-2 days after the date-time of the observations used in the daily summaries. The online data files begin with 1929 and are at the time of this writing at the Version 8 software level. Over 9000 stations' data are typically available. The daily elements included in the dataset (as available from each station) are:	34.77 GiB	Various	https://registry.opendata.aws/noaa-gsod/
Allen Brain Observatory - Visual Coding AWS Public Data Set	The Allen Brain Observatory – Visual Coding is a large-scale, standardized survey of physiological activity across the mouse visual cortex, hippocampus, and thalamus. It includes datasets collected with both two-photon imaging and Neuropixels probes, two complementary techniques for measuring the activity of neurons in vivo. The two-photon imaging dataset features visually evoked calcium responses from GCaMP6-expressing neurons in a range of cortical layers, visual areas, and Cre lines. The Neuropixels dataset features spiking activity from distributed cortical and subcortical brain regions, collected under analogous conditions to the two-photon imaging experiments. We hope that experimentalists and modelers will use these comprehensive, open datasets as a testbed for theories of visual information processing.	158.79 TiB	Various	https://registry.opendata.aws/allen-brain-observatory/
Geosnap Data, Center for Geospatial Sciences	This bucket contains multiple datasets (as Quilt packages) created by the	74.17 GiB	Various	https://registry.opendata.aws/spatial-ucr/
NOAA National Water Model CONUS Retrospective Dataset	The NOAA National Water Model Retrospective dataset contains input and output from multi-decade CONUS retrospective simulations. These simulations used meteorological input fields from meteorological retrospective datasets. The output frequency and fields available in this historical NWM dataset differ from those contained in the real-time operational NWM forecast model.	258.32 TiB	Various	https://registry.opendata.aws/nwm-archive/
Tabula Sapiens	Tabula Sapiens will be a benchmark, first-draft human cell atlas of two million cells from 25 organs of eight normal human subjects.	74.76 TiB	Various	https://registry.opendata.aws/tabula-sapiens/
Sentinel-2	The Sentinel-2 mission is	34.22 TiB	Various	https://registry.opendata.aws/sentinel-2/
TIGER Training	This dataset contains the training data for the Tumor InfiltratinG lymphocytes in breast cancER or TIGER challenge. TIGER is the first challenge on fully automated assessment of tumor-infiltrating lymphocytes (TILs) in breast cancer histopathology slides. TILs are proving to be an important biomarker in cancer patients as they can play a part in killing tumor cells, particularly in some types of breast cancer. Identifying and measuring TILs can help to better target treatments, particularly immunotherapy, and may result in lower levels of other more aggressive treatments, including chemotherapy.	169.24 GiB	Various	https://registry.opendata.aws/tiger/
CoMMpass from the Multiple Myeloma Research Foundation	The Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile study is the Multiple Myeloma Research Foundation (MMRF)’s landmark personalized medicine initiative. CoMMpass is a	1.11 GiB	Various	https://registry.opendata.aws/mmrf-commpass/
The Genome Modeling System	The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.	363.47 GiB	Various	https://registry.opendata.aws/gmsdata/
ARPA-E PERFORM Forecast data	The ARPA-E PERFORM Program is an ARPA-E funded program that aims to use	414.73 GiB	Various	https://registry.opendata.aws/arpa-e-perform/
K2 Mission Data	The K2 mission observed 100 square degrees for 80 days each across 20 different pointings along the ecliptic, collecting high-precision photometry for a selection of targets within each field. The mission began when the original Kepler mission ended due to loss of the second reaction wheel in 2013. More information about the K2 mission is available at MAST.	4.06 TiB	Various	https://registry.opendata.aws/k2/
NOAA Water-Column Sonar Data Archive	Water-column sonar data archived at the NOAA National Centers for Environmental Information.	147.7 TiB	Various	https://registry.opendata.aws/ncei-wcsd-archive/
Conformational Space of Short Peptides	Co-managed by Toyoko and the Structural Biology Group at the Universidad Nacional de Quilmes, this dataset allows us to explore the conformational space of all possible peptides using the 20 common amino acids. It consists of a collection of exhaustive molecular dynamics simulations of tripeptides and pentapeptides.	1.98 TiB	Various	https://registry.opendata.aws/short_peptides/
NIH NCBI PMC Article Datasets - Full-Text Biomedical and Life Sciences Journal Articles on AWS	PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal articles at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). The PubMed Central (PMC) Article Datasets include full-text articles archived in PMC and made available under license terms that allow for text mining and other types of secondary analysis and reuse. The articles are organized on AWS based on general license type:	591.15 GiB	Various	https://registry.opendata.aws/ncbi-pmc/
NA-CORDEX - North American component of the Coordinated Regional Downscaling Experiment	The NA-CORDEX dataset contains regional climate change scenario data and guidance for North America, for use in impacts, decision-making, and climate science. The NA-CORDEX data archive contains output from regional climate models (RCMs) run over a domain covering most of North America using boundary conditions from global climate model (GCM) simulations in the CMIP5 archive. These simulations run from 1950–2100 with a spatial resolution of 0.22°/25km or 0.44°/50km. This AWS S3 version of the data includes selected variables converted to Zarr format from the original NetCDF. Only daily data are currently available; all daily data were mapped to the Gregorian calendar. Sub-daily data may be added later. Both raw and bias-corrected data are available. Further details about this version of the dataset are available at the documentation link below.	13.15 TiB	Various	https://registry.opendata.aws/ncar-na-cordex/
NOAA National Bathymetric Source Data	The National Bathymetric Source (NBS) project creates and maintains	36.16 GiB	Various	https://registry.opendata.aws/noaa-bathymetry/
Xiph.Org Test Media	Uncompressed video used for video compression and video processing research.	20.69 TiB	Various	https://registry.opendata.aws/xiph-media/
Airborne Object Tracking Dataset	Airborne Object Tracking (AOT) is a collection of 4,943 flight sequences of around 120 seconds each, collected at 10 Hz in diverse conditions. There are 5.9M+ images and 3.3M+ 2D annotations of airborne objects in the sequences. There are 3,306,350 frames without labels as they contain no airborne objects. For images with labels, there are on average 1.3 labels per image. All airborne objects in the dataset are labelled.	11.27 TiB	Various	https://registry.opendata.aws/airborne-object-tracking/
CAM6 Data Assimilation Research Testbed (DART) Reanalysis: Cloud-Optimized Dataset	This is a cloud-hosted subset of the CAM6+DART (Community Atmosphere Model version 6 Data Assimilation Research Testbed) Reanalysis dataset. These data products are designed to facilitate a broad variety of research using the NCAR CESM 2.1 (National Center for Atmospheric Research's Community Earth System Model version 2.1), including model evaluation, ensemble hindcasting, data assimilation experiments, and sensitivity studies. They come from an 80 member ensemble reanalysis of the global troposphere and stratosphere using DART and CAM6. The data products represent states of the atmosphere consistent with observations from 2011 through 2019 at 1 degree horizontal resolution and weekly frequency. Each ensemble member is an equally likely description of the atmosphere, and is also consistent with dynamics and physics of CAM6. The dataset also contains corresponding land surface values at 6-hourly frequency. This dataset is a reformatting, with no change to numerical values, of data from the "CAM6 Data Assimilation Research Testbed (DART) Reanalysis", DOI:10.5065/JG1E-8525.	2.55 TiB	Various	https://registry.opendata.aws/ncar-dart-cam6/
AI2 Diagram Dataset (AI2D)	4,817 illustrative diagrams for research on diagram understanding and associated question answering.	397.26 GiB	Various	https://registry.opendata.aws/allenai-diagrams/
NOAA North American Mesoscale Forecast System (NAM)	The North American Mesoscale Forecast System (NAM) is one of the National Centers For Environmental Prediction’s (NCEP) major models for producing weather forecasts. NAM generates multiple grids (or domains) of weather forecasts over the North American continent at various horizontal resolutions. Each grid contains data for dozens of weather parameters, including temperature, precipitation, lightning, and turbulent kinetic energy. NAM uses additional numerical weather models to generate high-resolution forecasts over fixed regions, and occasionally to follow significant weather events like hurricanes.	101.47 TiB	Various	https://registry.opendata.aws/noaa-nam/
Global Database of Events, Language and Tone (GDELT)	This project monitors the world's broadcast, print,	4.53 TiB	Various	https://registry.opendata.aws/gdelt/
Normalized Difference Urban Index (NDUI)	NDUI combines a cloud- and shadow-free Landsat Normalized Difference Vegetation Index (NDVI) composite with DMSP/OLS Night Time Light (NTL) to characterize global urban areas at a 30 m resolution, and it can greatly enhance urban areas, which can then be easily distinguished from bare lands including fallows and deserts. With the capability to delineate urban boundaries and, at the same time, to present sufficient spatial details within urban areas, the NDUI has the potential for urbanization studies at regional and global scales.	35.3 GiB	Various	https://registry.opendata.aws/ndui/
Sentinel-1 SLC dataset for South and Southeast Asia, Taiwan, Korea and Japan	The S1 Single Look Complex (SLC) dataset contains Synthetic Aperture Radar (SAR) data in the C-Band wavelength. The SAR sensors are installed on a two-satellite (Sentinel-1A and Sentinel-1B) constellation orbiting the Earth with a combined revisit time of six days, operated by the European Space Agency. The S1 SLC data are a Level-1 product that collects radar amplitude and phase information in all-weather, day or night conditions, which is ideal for studying natural hazards and emergency response, land applications, oil spill monitoring, sea-ice conditions, and associated climate change effects.	440.19 TiB	Various	https://registry.opendata.aws/sentinel1-slc-seasia-pds/
Japanese Tokenizer Dictionaries	Japanese Tokenizer Dictionaries for use with MeCab.	3.17 GiB	Various	https://registry.opendata.aws/cotonoha-dic/
NREL National Solar Radiation Database	Released to the public as part of the Department of Energy's Open Energy Data Initiative,	692.05 TiB	Various	https://registry.opendata.aws/nrel-pds-nsrdb/
LOFAR ELAIS-N1 cycle 2 observations on AWS	These data correspond to the International LOFAR Telescope observations of the sky field ELAIS-N1 (16:10:01 +54:30:36) during the cycle 2 of observations. There are 11 runs of about 8 hours each plus the corresponding observation of the calibration targets before and after the target field. The data are measurement sets (MS) containing the cross-correlated data and metadata divided in 371 frequency sub-bands per target centred at ~150 MHz.	63.85 TiB	Various	https://registry.opendata.aws/lofar-elais-n1/
UK Met Office Global and Regional Weather Forecasts	This dataset listing is no longer active. Please go here for current information. Archive data from the UK Met Office Global and Regional Ensemble Prediction System (MOGREPS) available on Amazon S3. Data from two models is available: MOGREPS-UK, a high resolution weather forecast covering the United Kingdom, and MOGREPS-G, a global weather forecast.	58.1 TiB	Various	https://registry.opendata.aws/mogreps/
DOE's Water Power Technology Office's (WPTO) US Wave dataset	Released to the public as part of the Department of Energy's Open Energy Data Initiative,	44.21 TiB	Various	https://registry.opendata.aws/wpto-pds-us-wave/
New Jersey Statewide LiDAR	Elevation datasets in New Jersey have been collected over several years as several	8.75 TiB	Various	https://registry.opendata.aws/nj-lidar/
CMIP6 GCMs downscaled using WRF	High-resolution historical and future climate simulations from 1980-2100	605.6 TiB	Various	https://registry.opendata.aws/wrf-cmip6/
Refgenie reference genome assets	Pre-built refgenie reference genome data assets used for aligning and analyzing DNA sequence data.	9.9 TiB	Various	https://registry.opendata.aws/refgenie/
High-Order Accurate Direct Numerical Simulation of Flow over a MTU-T161 Low Pressure Turbine Blade	The archive comprises snapshot, point-probe, and time-average data produced via a high-fidelity computational simulation of turbulent air flow over a low pressure turbine blade, which is an important component in a jet engine. The simulation was undertaken using the open source PyFR flow solver on over 5000 Nvidia K20X GPUs of the Titan supercomputer at Oak Ridge National Laboratory under an INCITE award from the US DOE. The data can be used to develop an enhanced understanding of the complex three-dimensional unsteady air flow patterns over turbine blades in jet engines. This could in turn lead to the design of greener, more fuel-efficient aircraft. It could also be used to train a next generation of Reynolds Averaged Navier-Stokes turbulence models via a machine learning approach, which would have broad applicability to a wide range of science and engineering problems.	10.54 TiB	Various	https://registry.opendata.aws/pyfr-mtu-t161-dns-data/
Radiant MLHub	Radiant MLHub is an open library for geospatial training data that hosts datasets generated by Radiant Earth Foundation's team as well as other training data catalogs contributed by Radiant Earth’s partners. Radiant MLHub is open to anyone to access, store, register and/or share their training datasets for high-quality Earth observations. All of the training datasets are stored using a SpatioTemporal Asset Catalog (STAC) compliant catalog and exposed through a common API. Training datasets include pairs of imagery and labels for different types of machine learning problems including image classification, object detection, and semantic segmentation. Labels are generated from ground reference data and/or image annotation.	8.59 TiB	Various	https://registry.opendata.aws/radiant-mlhub/
Community Earth System Model v2 Large Ensemble (CESM2 LENS)	The US National Center for Atmospheric Research partnered with the IBS Center for Climate Physics in South Korea to generate the CESM2 Large Ensemble which consists of 100 ensemble members at 1 degree spatial resolution covering the period 1850-2100 under CMIP6 historical and SSP370 future radiative forcing scenarios. Data sets from this ensemble were made downloadable via the Climate Data Gateway on June 14th, 2021.	309.28 TiB	Various	https://registry.opendata.aws/ncar-cesm2-lens/
ZINC Database	3D models for molecular docking screens.	658.32 TiB	Various	https://registry.opendata.aws/zinc15/
NOAA Global Historical Climatology Network Daily (GHCN-D)	Global Historical Climatology Network - Daily is a dataset from NOAA that contains daily observations over global land areas. It contains station-based measurements from land-based stations worldwide, about two thirds of which are for precipitation measurement only. Other meteorological elements include, but are not limited to, daily maximum and minimum temperature, temperature at the time of observation, snowfall and snow depth. It is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews. Some data are more than 175 years old. The data is in CSV format. Each file corresponds to a year from 1763 to present and is named as such.	109.33 GiB	Various	https://registry.opendata.aws/noaa-ghcn/
High Resolution Population Density Maps + Demographic Estimates by CIESIN and Meta	Population data for a selection of countries, allocated to 1 arcsecond blocks and provided in a combination of CSV	95.62 GiB	Various	https://registry.opendata.aws/dataforgood-fb-hrsl/
Rapid7 FDNS ANY Dataset	Subset of FDNS ANY queries against domain names produced by Rapid7 Project Sonar, made available in s3.	151.26 GiB	Various	https://registry.opendata.aws/rapid7-fdns-any/
Analysis Ready Sentinel-1 Backscatter Imagery	The Sentinel-1 mission is a constellation of	49.7 TiB	Various	https://registry.opendata.aws/sentinel-1-rtc-indigo/
NOAA Geostationary Operational Environmental Satellites (GOES) 16 & 17	NOAA GOES-T will launch in March 2022!! For more information check out the GOES-T Webpage.	784.79 TiB	Various	https://registry.opendata.aws/noaa-goes/
Genome Ark	The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. The VGP is an international collaboration that aims to generate complete and near error-free reference genomes for all extant vertebrate species. These genomes will be used to address fundamental questions in biology and disease, to identify species most genetically at risk for extinction, and to preserve genetic information of life.	429.62 TiB	Various	https://registry.opendata.aws/genomeark/
Aristo Mini Corpus	1,197,377 science-relevant sentences	397.26 GiB	Various	https://registry.opendata.aws/allenai-aristo-mini/
DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue	This bucket contains the checkpoints used to reproduce the baseline results reported in the DialoGLUE benchmark hosted	1.23 GiB	Various	https://registry.opendata.aws/dialoglue/
Digital Earth Africa Sentinel-1 Radiometrically Terrain Corrected	DE Africa’s Sentinel-1 backscatter product is developed to be compliant with the CEOS Analysis Ready Data for Land (CARD4L) specifications.	206.7 TiB	Various	https://registry.opendata.aws/deafrica-sentinel-1/
COVID-19 Harmonized Data	A harmonized collection of the core data pertaining to COVID-19 reported cases by geography, in a format prepared for analysis	76.99 GiB	Various	https://registry.opendata.aws/talend-covid19/
International Neuroimaging Data-Sharing Initiative (INDI)	This bucket contains multiple neuroimaging datasets that are part of the International Neuroimaging Data-Sharing Initiative. Raw human and non-human primate neuroimaging data include 1) Structural MRI; 2) Functional MRI; 3) Diffusion Tensor Imaging; 4) Electroencephalogram (EEG)	268.82 TiB	Various	https://registry.opendata.aws/fcp-indi/
NOAA Oceanic Climate Data Records	NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).	228.21 GiB	Various	https://registry.opendata.aws/noaa-cdr-oceanic/
NOAA Climate Forecast System (CFS)	The Climate Forecast System (CFS) is a model representing the global interaction between Earth's oceans, land, and atmosphere. Produced by several dozen scientists under guidance from the National Centers for Environmental Prediction (NCEP), this model offers hourly data with a horizontal resolution down to one-half of a degree (approximately 56 km) around Earth for many variables. CFS uses the latest scientific approaches for taking in, or assimilating, observations from data sources including surface observations, upper air balloon observations, aircraft observations, and satellite observations. Please note that the data in this bucket are the CFSv2 Operational Forecasts. To obtain other CFSv2 products such as the Operational Analysis, please visit our website.	357.76 TiB	Various	https://registry.opendata.aws/noaa-cfs/
A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018)	This dataset is the result of a collaborative project between the Communications Security Establishment (CSE) and The Canadian Institute for Cybersecurity (CIC) that uses the notion of profiles to generate a cybersecurity dataset in a systematic manner. It includes a detailed description of intrusions along with abstract distribution models for applications, protocols, or lower level network entities. The dataset includes seven different attack scenarios, namely Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside. The attacking infrastructure includes 50 machines and the victim organization has 5 departments and includes 420 PCs and 30 servers. This dataset includes the network traffic and log files of each machine from the victim side, along with 80 network traffic features extracted from captured traffic using CICFlowMeter-V3.	452.75 GiB	Various	https://registry.opendata.aws/cse-cic-ids2018/
Cell Painting Image Collection	The Cell Painting Image Collection is a collection of freely	1.94 TiB	Various	https://registry.opendata.aws/cell-painting-image-collection/
YouTube 8 Million - Data Lakehouse Ready	This dataset includes both the original .tfrecords and a Parquet representation of the YouTube 8 Million dataset. YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. It comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This dataset also includes the YouTube-8M Segments data from June 2019.	3.17 TiB	Various	https://registry.opendata.aws/yt8m/
NOAA National Water Model Short-Range Forecast	The National Water Model (NWM) is a water resources model that simulates and forecasts water	27.73 TiB	Various	https://registry.opendata.aws/noaa-nwm-pds/
SondeHub Radiosonde Telemetry	SondeHub Radiosonde telemetry contains global radiosonde (weather balloon) data captured by SondeHub from our participating radiosonde_auto_rx receiving stations. radiosonde_auto_rx is an open source project aimed at receiving and decoding telemetry from airborne radiosondes using software-defined-radio techniques, enabling study of the telemetry and sometimes recovery of the radiosonde itself.	59.05 GiB	Various	https://registry.opendata.aws/sondehub-telemetry/
NOAA Global Forecast System (GFS)	The Global Forecast System (GFS) is a weather forecast model produced	936.41 TiB	Various	https://registry.opendata.aws/noaa-gfs-bdp-pds/
Sea Surface Temperature Daily Analysis: European Space Agency Climate Change Initiative product version 2.1	Global daily-mean sea surface temperatures, presented on a 0.05° latitude-longitude grid, with gaps between available daily observations filled by statistical means, spanning late 1981 to recent time. Suitable for large-scale oceanographic meteorological and climatological applications, such as evaluating or constraining environmental models or case-studies of marine heat wave events. Includes temperature uncertainty information and auxiliary information about land-sea fraction and sea-ice coverage. For reference and citation see: www.nature.com/articles/s41597-019-0236-x.	273.6 GiB	Various	https://registry.opendata.aws/surftemp-sst/
NOAA Global Ensemble Forecast System (GEFS) Re-forecast	NOAA has generated a multi-decadal reanalysis and reforecast data set to accompany the next-generation version of its ensemble prediction system, the Global Ensemble Forecast System, version 12 (GEFSv12). Accompanying the real-time forecasts are “reforecasts” of the weather, that is, retrospective forecasts spanning the period 2000-2019. These reforecasts are not as numerous as the real-time data; they were generated only once per day, from 00 UTC initial conditions, and only 5 members were provided, with the following exception. Once weekly, an 11-member reforecast was generated, and these extend in lead time to +35 days.	388.8 TiB	Various	https://registry.opendata.aws/noaa-gefs-reforecast/
Deutsche Börse Public Dataset	The Deutsche Börse Public Data Set consists of trade data aggregated to one minute intervals from the Eurex and Xetra trading systems. It provides the initial price, lowest price, highest price, final price and volume for every minute of the trading day, and for every tradeable security. If you need higher resolution data, including untraded price movements, please refer to our historical market data product here. Also, be sure to check out our developer's portal.	16.05 GiB	Various	https://registry.opendata.aws/deutsche-boerse-pds/
Digital Earth Africa ALOS PALSAR, ALOS-2 PALSAR-2 and JERS-1	The ALOS/PALSAR annual mosaic is a global 25 m resolution dataset that combines data from many images captured by JAXA’s PALSAR and PALSAR-2 sensors on ALOS-1 and ALOS-2 satellites respectively. This product contains radar measurement in L-band and in HH and HV polarizations. It has a spatial resolution of 25 m and is available annually for 2007 to 2010 (ALOS/PALSAR) and 2015 to 2020 (ALOS-2/PALSAR-2).	3.1 TiB	Various	https://registry.opendata.aws/deafrica-alos-jers/
NOAA Severe Weather Data Inventory (SWDI)	The Storm Events Database is an integrated database of severe weather events across the United States from 1950 to this year, with information about a storm event's location, azimuth, distance, impact, and severity, including the cost of damages to property and crops. It contains data documenting: The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce. Rare, unusual, weather phenomena that generate media attention, such as snow flurries in South Florida or the San Diego coastal area. Other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event. Data about a specific event is added to the dataset within 120 days to allow time for damage assessments and other analysis.	71.29 GiB	Various	https://registry.opendata.aws/noaa-swdi/
Community Earth System Model v2 Large Ensemble (CESM2 LENS)	The US National Center for Atmospheric Research partnered with the IBS Center for Climate Physics in South Korea to generate the CESM2 Large Ensemble which consists of 100 ensemble members at 1 degree spatial resolution covering the period 1850-2100 under CMIP6 historical and SSP370 future radiative forcing scenarios. Data sets from this ensemble were made downloadable via the Climate Data Gateway on June 14th, 2021.	309.28 TiB	Various	https://registry.opendata.aws/ncar-cesm2-lens/
ZINC Database	3D models for molecular docking screens.	658.32 TiB	Various	https://registry.opendata.aws/zinc15/
NOAA Global Historical Climatology Network Daily (GHCN-D)	Global Historical Climatology Network - Daily is a dataset from NOAA that contains daily observations over global land areas. It contains station-based measurements from land-based stations worldwide, about two thirds of which are for precipitation measurement only. Other meteorological elements include, but are not limited to, daily maximum and minimum temperature, temperature at the time of observation, snowfall and snow depth. It is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews. Some data are more than 175 years old. The data is in CSV format. Each file corresponds to a year from 1763 to present and is named as such.	109.33 GiB	Various	https://registry.opendata.aws/noaa-ghcn/
High Resolution Population Density Maps + Demographic Estimates by CIESIN and Meta	Population data for a selection of countries, allocated to 1 arcsecond blocks and provided in a combination of CSV	95.62 GiB	Various	https://registry.opendata.aws/dataforgood-fb-hrsl/
Image localization - fast.ai datasets	Some of the most important datasets for image localization research, including	15.46 GiB	Various	https://registry.opendata.aws/fast-ai-imagelocal/
Rapid7 FDNS ANY Dataset	Subset of FDNS ANY queries against domain names produced by Rapid7 Project Sonar, made available in s3.	151.26 GiB	Various	https://registry.opendata.aws/rapid7-fdns-any/
Analysis Ready Sentinel-1 Backscatter Imagery	The Sentinel-1 mission is a constellation of	49.7 TiB	Various	https://registry.opendata.aws/sentinel-1-rtc-indigo/
NOAA Geostationary Operational Environmental Satellites (GOES) 16 & 17	NOAA GOES-T will launch in March 2022!! For more information check out the GOES-T Webpage.	1.36 PiB	Various	https://registry.opendata.aws/noaa-goes/
Genome Ark	The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. The VGP is an international collaboration that aims to generate complete and near error-free reference genomes for all extant vertebrate species. These genomes will be used to address fundamental questions in biology and disease, to identify species most genetically at risk for extinction, and to preserve genetic information of life.	429.62 TiB	Various	https://registry.opendata.aws/genomeark/
The Massively Multilingual Image Dataset (MMID)	MMID is a large-scale, massively multilingual dataset of images paired with the words they represent collected at the University of Pennsylvania.	2.37 TiB	Various	https://registry.opendata.aws/mmid/
Allen Cell Imaging Collections	This bucket contains multiple datasets (as Quilt packages) created by the	54.41 TiB	Various	https://registry.opendata.aws/allen-cell-imaging-collections/
Aristo Mini Corpus	1,197,377 science-relevant sentences	397.26 GiB	Various	https://registry.opendata.aws/allenai-aristo-mini/
DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue	This bucket contains the checkpoints used to reproduce the baseline results reported in the DialoGLUE benchmark hosted	1.23 GiB	Various	https://registry.opendata.aws/dialoglue/
Digital Earth Africa Sentinel-1 Radiometrically Terrain Corrected	DE Africa’s Sentinel-1 backscatter product is developed to be compliant with the CEOS Analysis Ready Data for Land (CARD4L) specifications.	206.7 TiB	Various	https://registry.opendata.aws/deafrica-sentinel-1/
NOAA Unified Forecast System (UFS) Marine Reanalysis: 1979-2019	The NOAA UFS Marine Reanalysis is a global sea ice ocean coupled reanalysis product produced by the marine data assimilation team of the UFS Research-to-Operation (R2O) project. Underlying forecast and data assimilation systems are based on the UFS model prototype version-6 and the Next Generation Global Ocean Data Assimilation System (NG-GODAS) release of the Joint Effort for Data assimilation Integration (JEDI) Sea Ice Ocean Coupled Assimilation (SOCA). Covering the 40 year reanalysis time period from 1979 to 2019, the data atmosphere option of the UFS coupled global atmosphere ocean sea ice (DATM-MOM6-CICE6) model was applied with two atmospheric forcing data sets: CFSR from 1979 to 1999 and GEFS from 2000 to 2019. Assimilated observation data sets include extensive space-based marine observations and conventional direct measurements of in situ profile data sets.	6.97 TiB	Various	https://registry.opendata.aws/noaa-ufs-marinereanalysis/
COVID-19 Harmonized Data	A harmonized collection of the core data pertaining to COVID-19 reported cases by geography, in a format prepared for analysis	76.99 GiB	Various	https://registry.opendata.aws/talend-covid19/
International Neuroimaging Data-Sharing Initiative (INDI)	This bucket contains multiple neuroimaging datasets that are part of the International Neuroimaging Data-Sharing Initiative. Raw human and non-human primate neuroimaging data include 1) Structural MRI; 2) Functional MRI; 3) Diffusion Tensor Imaging; 4) Electroencephalogram (EEG)	268.82 TiB	Various	https://registry.opendata.aws/fcp-indi/
NOAA Oceanic Climate Data Records	NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).	228.21 GiB	Various	https://registry.opendata.aws/noaa-cdr-oceanic/
NOAA Climate Forecast System (CFS)	The Climate Forecast System (CFS) is a model representing the global interaction between Earth's oceans, land, and atmosphere. Produced by several dozen scientists under guidance from the National Centers for Environmental Prediction (NCEP), this model offers hourly data with a horizontal resolution down to one-half of a degree (approximately 56 km) around Earth for many variables. CFS uses the latest scientific approaches for taking in, or assimilating, observations from data sources including surface observations, upper air balloon observations, aircraft observations, and satellite observations. Please note that the data in this bucket are the CFSv2 Operational Forecasts. To obtain other CFSv2 products such as the Operational Analysis, please visit our website.	357.76 TiB	Various	https://registry.opendata.aws/noaa-cfs/
A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018)	This dataset is the result of a collaborative project between the Communications Security Establishment (CSE) and The Canadian Institute for Cybersecurity (CIC) that uses the notion of profiles to generate a cybersecurity dataset in a systematic manner. It includes a detailed description of intrusions along with abstract distribution models for applications, protocols, or lower level network entities. The dataset includes seven different attack scenarios, namely Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside. The attacking infrastructure includes 50 machines and the victim organization has 5 departments and includes 420 PCs and 30 servers. This dataset includes the network traffic and log files of each machine from the victim side, along with 80 network traffic features extracted from captured traffic using CICFlowMeter-V3.	452.75 GiB	Various	https://registry.opendata.aws/cse-cic-ids2018/
Cell Painting Image Collection	The Cell Painting Image Collection is a collection of freely	1.94 TiB	Various	https://registry.opendata.aws/cell-painting-image-collection/
YouTube 8 Million - Data Lakehouse Ready	This dataset includes both the original .tfrecords and a Parquet representation of the YouTube 8 Million dataset. YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. It comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This dataset also includes the YouTube-8M Segments data from June 2019.	3.17 TiB	Various	https://registry.opendata.aws/yt8m/
NOAA National Water Model Short-Range Forecast	The National Water Model (NWM) is a water resources model that simulates and forecasts water	27.73 TiB	Various	https://registry.opendata.aws/noaa-nwm-pds/
SondeHub Radiosonde Telemetry	SondeHub Radiosonde telemetry contains global radiosonde (weather balloon) data captured by SondeHub from our participating radiosonde_auto_rx receiving stations. radiosonde_auto_rx is an open source project aimed at receiving and decoding telemetry from airborne radiosondes using software-defined-radio techniques, enabling study of the telemetry and sometimes recovery of the radiosonde itself.	59.05 GiB	Various	https://registry.opendata.aws/sondehub-telemetry/
NOAA Global Forecast System (GFS)	The Global Forecast System (GFS) is a weather forecast model produced	936.41 TiB	Various	https://registry.opendata.aws/noaa-gfs-bdp-pds/
UCSC Genome Browser Sequence and Annotations	The UCSC Genome Browser is an online graphical viewer for genomes, a genome browser, hosted by the University of California, Santa Cruz (UCSC). The interactive website offers access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. This dataset is a copy of the MySQL tables in MyISAM binary and tab-sep format and all binary files in custom formats, sometimes referred as 'gbdb'-files. Data from the UCSC Genome Browser is free and open for use by anyone. However, every genome annotation track has been created by an academic research group, or, in a few cases, by commercial companies. Please acknowledge them by citing them. The information can be found by going to https://genome.ucsc.edu, selecting the respective genome assembly and clicking on the data track. At the end of the documentation, we provide a list of references and acknowledgements.	73.11 TiB	Various	https://registry.opendata.aws/ucsc-genome-browser/
Sea Surface Temperature Daily Analysis: European Space Agency Climate Change Initiative product version 2.1	Global daily-mean sea surface temperatures, presented on a 0.05° latitude-longitude grid, with gaps between available daily observations filled by statistical means, spanning late 1981 to recent time. Suitable for large-scale oceanographic meteorological and climatological applications, such as evaluating or constraining environmental models or case-studies of marine heat wave events. Includes temperature uncertainty information and auxiliary information about land-sea fraction and sea-ice coverage. For reference and citation see: www.nature.com/articles/s41597-019-0236-x.	273.6 GiB	Various	https://registry.opendata.aws/surftemp-sst/
NOAA Global Ensemble Forecast System (GEFS) Re-forecast	NOAA has generated a multi-decadal reanalysis and reforecast data set to accompany the next-generation version of its ensemble prediction system, the Global Ensemble Forecast System, version 12 (GEFSv12). Accompanying the real-time forecasts are “reforecasts” of the weather, that is, retrospective forecasts spanning the period 2000-2019. These reforecasts are not as numerous as the real-time data; they were generated only once per day, from 00 UTC initial conditions, and only 5 members were provided, with the following exception. Once weekly, an 11-member reforecast was generated, and these extend in lead time to +35 days.	388.8 TiB	Various	https://registry.opendata.aws/noaa-gefs-reforecast/
Deutsche Börse Public Dataset	The Deutsche Börse Public Data Set consists of trade data aggregated to one minute intervals from the Eurex and Xetra trading systems. It provides the initial price, lowest price, highest price, final price and volume for every minute of the trading day, and for every tradeable security. If you need higher resolution data, including untraded price movements, please refer to our historical market data product here. Also, be sure to check out our developer's portal.	16.05 GiB	Various	https://registry.opendata.aws/deutsche-boerse-pds/
Digital Earth Africa ALOS PALSAR, ALOS-2 PALSAR-2 and JERS-1	The ALOS/PALSAR annual mosaic is a global 25 m resolution dataset that combines data from many images captured by JAXA’s PALSAR and PALSAR-2 sensors on ALOS-1 and ALOS-2 satellites respectively. This product contains radar measurement in L-band and in HH and HV polarizations. It has a spatial resolution of 25 m and is available annually for 2007 to 2010 (ALOS/PALSAR) and 2015 to 2020 (ALOS-2/PALSAR-2).	3.1 TiB	Various	https://registry.opendata.aws/deafrica-alos-jers/
GeoNet Aotearoa New Zealand Data	GeoNet provides geological hazard information for Aotearoa New Zealand. This dataset contains data and products recorded by the GeoNet sensor network. The dataset currently includes GNSS data, and additional datasets will be added in the near future. GNSS (Global Navigation Satellite System) data include raw data in proprietary and Receiver Independent Exchange Format (RINEX) and local tie-in surveys conducted during equipment changes; more details can be found on 'the GeoNet geodetic page' website. Coastal gauge data include relative measurement of sea level measured by tsunami monitoring gauges. Raw and quality control data are provided in CREX format (Character Form for the Representation and eXchange of meteorological data); more details can be found on 'the GeoNet coastal tsunami monitoring gauges page'.	7.73 TiB	Various	https://registry.opendata.aws/geonet/

Xinan Xu commented 3 years ago

@dkkapur

Deep Kapur · Answer 1 · Thu Mar 31 2022 03:16:41 GMT+0800 (China Standard Time)

@xinaxu this is a lot of datasets! wow!

some of these look to be duplicates (both against what we have in Slingshot as well as in some instances across the proposed table). can i propose that you pick the top 10-15 that you'd like to see onboarded or would like to work on yourself?

@orvn @timelytree had some additional thoughts on adding more datasets as well. tagging them to share!

orun · Answer 2 · Thu Mar 31 2022 13:03:26 GMT+0800 (China Standard Time)

Overview

The table above has 234 rows, but some were duplicates.
Below is the same list, but:
- Only dataset names
- Sorted
- De-duped
- Some datasets already on Slingshot struck through
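
The clean-up steps listed above could be reproduced roughly like this. It is a minimal sketch, not the script that produced the list below: it assumes the table is available as tab-separated text with the five columns from the issue (Name, Description, Size, Format, URL), and `dedupe_dataset_names` and `already_onboarded` are hypothetical names introduced here for illustration.

```python
def dedupe_dataset_names(table_text, already_onboarded=frozenset()):
    """Return sorted, de-duplicated dataset names from a tab-separated table.

    Names found in `already_onboarded` (hypothetical: datasets already on
    Slingshot) are wrapped in ~~ ~~ so they render struck through.
    """
    names = set()
    for line in table_text.splitlines():
        fields = line.split("\t")
        # Skip the header row and anything that is not a 5-column table row
        # (blank lines, stray terminal prompts, and so on).
        if len(fields) != 5 or fields[0] == "Name":
            continue
        # De-dupe on the name only: the same dataset can appear twice
        # with different sizes.
        names.add(fields[0].strip())
    # Case-insensitive sort, so "iNaturalist" sorts next to "Image ...".
    ordered = sorted(names, key=str.casefold)
    return [f"~~{n}~~" if n in already_onboarded else n for n in ordered]


if __name__ == "__main__":
    # Trimmed sample rows; the descriptions are placeholders.
    sample = "\n".join([
        "Name\tDescription\tSize\tFormat\tURL",
        "ZINC Database\t(description)\t658.32 TiB\tVarious\thttps://registry.opendata.aws/zinc15/",
        "Genome Ark\t(description)\t429.62 TiB\tVarious\thttps://registry.opendata.aws/genomeark/",
        "ZINC Database\t(description)\t658.32 TiB\tVarious\thttps://registry.opendata.aws/zinc15/",
    ])
    for name in dedupe_dataset_names(sample, already_onboarded={"Genome Ark"}):
        print(name)
```

Keying on the name alone means a dataset that shows up twice with different sizes collapses into a single entry, which is enough for a names-only list.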

Implications

Still, there are just over 200 left, which is more than double Slingshot's 82 current datasets. I think it's valuable to have more datasets, but we also have to consider that adding a large quantity will require modifications to Slingshot's UI, especially:

- Better dataset filtering in the dataset explorer
- A search/filter UI for radio buttons when selecting a dataset as a Slingshot participant (since 200+ options will be overwhelming for users)
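
To make the search/filter idea concrete, here is a minimal sketch of the matching logic such a picker might use. It is an assumption about how filtering could work, not Slingshot's actual implementation; `filter_datasets` is a hypothetical helper.

```python
def filter_datasets(names, query):
    """Keep the names that contain every whitespace-separated query term.

    Matching is case-insensitive, and an empty query returns all names.
    """
    terms = query.casefold().split()
    return [n for n in names if all(t in n.casefold() for t in terms)]


if __name__ == "__main__":
    names = [
        "NOAA Global Forecast System (GFS)",
        "NOAA Rapid Refresh (RAP)",
        "Sentinel-2",
        "OpenAQ",
    ]
    # Narrow 200+ radio buttons down to the handful matching what was typed.
    print(filter_datasets(names, "noaa forecast"))
```

Requiring every term to match, in any order, lets a participant type fragments such as an agency and a product type without knowing the exact dataset title.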

Deduped list

1000 Genomes
1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 - Data Lakehouse Ready
1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 and 3.7
1940 Census Population Schedules, Enumeration District Maps, and Enumeration District Descriptions
~~3000 Rice Genomes Project~~
4D Nucleome (4DN)
A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018)
AgricultureVision
AI2 Diagram Dataset (AI2D)
AI2 Meaningful Citations Data Set
Airborne Object Tracking Dataset
Allen Brain Observatory - Visual Coding AWS Public Data Set
Allen Cell Imaging Collections
Allen Ivy Glioblastoma Atlas
Allen Mouse Brain Atlas
Amazon Bin Image Dataset
Amazon-PQA
Analysis Ready Sentinel-1 Backscatter Imagery
Aristo Mini Corpus
ARPA-E PERFORM Forecast data
Atmospheric Models from Météo-France
AWS iGenomes
Basic Local Alignment Sequences Tool (BLAST) Databases
Boreas Autonomous Driving Dataset
Broad Genome References
CAFE60 reanalysis
CAM6 Data Assimilation Research Testbed (DART) Reanalysis: Cloud-Optimized Dataset
CBERS on AWS
Cell Painting Image Collection
Central Weather Bureau OpenData
ChEMBL - Data Lakehouse Ready
Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3)
Cloud to Street - Microsoft Flood and Clouds Dataset
CMIP6 GCMs downscaled using WRF
CoMMpass from the Multiple Myeloma Research Foundation
Community Earth System Model v2 Large Ensemble (CESM2 LENS)
Conformational Space of Short Peptides
Copernicus Digital Elevation Model (DEM)
Cornell EAS Data Lake
Coupled Model Intercomparison Project 6
Coupled Model Intercomparison Project Phase 5 (CMIP5) University of Wisconsin-Madison Probabilistic Downscaling Dataset
COVID-19 Data Lake
COVID-19 Genome Sequence Dataset
COVID-19 Harmonized Data
COVID-19 Molecular Structure and Therapeutics Hub
Crowdsourced Bathymetry
Daylight Map Distribution of OpenStreetMap
Department of Energy's Open Energy Data Initiative (OEDI)
Deutsche Börse Public Dataset
DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue
Digital Earth Africa ALOS PALSAR, ALOS-2 PALSAR-2 and JERS-1
Digital Earth Africa GeoMAD
Digital Earth Africa Landsat Collection 2 Level 2
Digital Earth Africa Sentinel-1 Radiometrically Terrain Corrected
Digital Earth Africa Sentinel-2 Level-2A
Discrete Reasoning Over the content of Paragraphs (DROP)
District of Columbia - Classified Point Cloud LiDAR
DOE's Water Power Technology Office's (WPTO) US Wave dataset
~~Earth Observation Data Cubes for Brazil~~
ESA WorldCover
Finnish Meteorological Institute Weather Radar Data
Galaxy Evolution Explorer Satellite (GALEX)
GATK Test Data
~~Genome Ark~~
GeoNet Aotearoa New Zealand Data
Geosnap Data, Center for Geospatial Sciences
Global Database of Events, Language and Tone (GDELT)
Global Seasonal Sentinel-1 Interferometric Coherence and Backscatter Data Set
Hecatomb Databases
High Resolution Downscaled Climate Data for Southeast Alaska
High Resolution Population Density Maps + Demographic Estimates by CIESIN and Meta
High-Order Accurate Direct Numerical Simulation of Flow over a MTU-T161 Low Pressure Turbine Blade
HIRLAM Weather Model
Human Cancer Models Initiative (HCMI) Cancer Model Development Center
IBL Neuropixels Brainwide Map on AWS
Image classification - fast.ai datasets
Image localization - fast.ai datasets
iNaturalist Licensed Observation Images
InRad COVID-19 X-Ray and CT Scans
International Neuroimaging Data-Sharing Initiative (INDI)
IRS 990 Filings
Japanese Tokenizer Dictionaries
K2 Mission Data
Kepler Mission Data
KITTI Vision Benchmark Suite
Legal Entity Identifier (LEI) and Legal Entity Reference Data (LE-RD)
LOFAR ELAIS-N1 cycle 2 observations on AWS
Longitudinal Nutrient Deficiency
Medical Segmentation Decathlon
MODIS
MODIS MYD13A1, MOD13A1, MYD11A1, MOD11A1, MCD43A4
Multi-Scale Ultra High Resolution (MUR) Sea Surface Temperature (SST)
Multiview Extended Video with Activities (MEVA)
MWIS VR Instances
NA-CORDEX - North American component of the Coordinated Regional Downscaling Experiment
Nanopore Reference Human Genome
NapierOne Mixed File Dataset
NASA NEX
Natural Earth
Natural Scenes Dataset
New Jersey Statewide Digital Aerial Imagery Catalog
New Jersey Statewide LiDAR
NEXRAD on AWS
NIH NCBI PMC Article Datasets - Full-Text Biomedical and Life Sciences Journal Articles on AWS
NIH NCBI Sequence Read Archive (SRA) on AWS
NOAA Atmospheric Climate Data Records
NOAA Climate Forecast System (CFS)
NOAA Coastal Lidar Data
NOAA Continuously Operating Reference Stations (CORS) Network (NCN)
NOAA Fundamental Climate Data Records (FCDR)
NOAA Geostationary Operational Environmental Satellites (GOES) 16 & 17
NOAA Global Ensemble Forecast System (GEFS)
NOAA Global Ensemble Forecast System (GEFS) Re-forecast
NOAA Global Extratropical Surge and Tide Operational Forecast System (Global ESTOFS)
NOAA Global Forecast System (GFS)
NOAA Global Historical Climatology Network Daily (GHCN-D)
NOAA Global Hydro Estimator (GHE)
NOAA Global Mosaic of Geostationary Satellite Imagery (GMGSI)
NOAA Global Surface Summary of Day
NOAA High-Resolution Rapid Refresh (HRRR) Model
NOAA Joint Polar Satellite System (JPSS)
NOAA National Bathymetric Source Data
NOAA National Blend of Models (NBM)
NOAA National Water Model CONUS Retrospective Dataset
NOAA National Water Model Short-Range Forecast
NOAA North American Mesoscale Forecast System (NAM)
NOAA Oceanic Climate Data Records
NOAA Operational Forecast System (OFS)
NOAA Rapid Refresh (RAP)
NOAA Rapid Refresh Forecast System (RRFS) Ensemble [Prototype]
NOAA Severe Weather Data Inventory (SWDI)
NOAA U.S. Climate Normals
NOAA Unified Forecast System (UFS) Marine Reanalysis: 1979-2019
NOAA Unified Forecast System Subseasonal to Seasonal Prototypes
NOAA Water-Column Sonar Data Archive
NOAA World Ocean Database (WOD)
NOAA/PMEL Ocean Climate Stations Moorings
Normalized Difference Urban Index (NDUI)
NREL National Solar Radiation Database
NREL Wind Integration National Dataset
Ohio State Cardiac MRI Raw Data (OCMR)
Open City Model (OCM)
Open Observatory of Network Interference (OONI)
Open Targets - Data Lakehouse Ready
OpenAQ
OpenCell on AWS
OpenEEW
OpenNeuro
OpenStreetMap on AWS
OpenSurfaces
Orcasound - bioacoustic data for marine conservation
Oxford Nanopore Technologies Benchmark Datasets
Pacific Ocean Sound Recordings
PoroTomo
Prefeitura Municipal de São Paulo (PMSP) LiDAR Point Cloud
Provision of Web-Scale Parallel Corpora for Official European Languages (ParaCrawl)
PubSeq - Public Sequence Resource
QIIME 2 User Tutorial Datasets
Quoref
Radiant MLHub
Rapid7 FDNS ANY Dataset
RarePlanes
Reasoning Over Paragraph Effects in Situations (ROPES)
REDASA COVID-19 Open Data
Refgenie reference genome assets
Scottish Public Sector LiDAR Dataset
Sea Surface Temperature Daily Analysis: European Space Agency Climate Change Initiative product version 2.1
Sentinel-1
Sentinel-1 SLC dataset for South and Southeast Asia, Taiwan, Korea and Japan
Sentinel-2
Sentinel-3
Sentinel-5P Level 2
SILAM Air Quality
Smithsonian Open Access
SondeHub Radiosonde Telemetry
Sophos/ReversingLabs 20 Million malware detection dataset
Southern California Earthquake Data
SpaceNet
Speedtest by Ookla Global Fixed and Mobile Network Performance Maps
STOIC2021 Training
Storm EVent ImageRy (SEVIR)
Sudachi Language Resources
Tabula Muris
Tabula Muris Senis
Tabula Sapiens
~~Terra Fusion Data Sampler~~
Textbook Question Answering (TQA)
The Cancer Genome Atlas
The Genome Modeling System
The Human Microbiome Project
The Klarna Product-Page Dataset
The Massively Multilingual Image Dataset (MMID)
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
TIGER Training
Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)
Transiting Exoplanet Survey Satellite (TESS)
TSBench
U.S. Census ACS PUMS
UCSC Genome Browser Sequence and Annotations
UK Met Office Global and Regional Weather Forecasts
UniProt
University of British Columbia Sunflower Genome Dataset
Voices Obscured in Complex Environmental Settings (VOiCES)
World Bank - Light Every Night
Xiph.Org Test Media
Yale-CMU-Berkeley (YCB) Object and Model Set
YouTube 8 Million - Data Lakehouse Ready
ZINC Database

Xinan Xu · Answer 3 · Fri Apr 01 2022 13:18:47 GMT+0800 (China Standard Time)

@dkkapur I want to have most of them eligible for Slingshot at once. It's 40 PiB of data total; assuming 10x replication, that's 400 PiB, or roughly 0.4 EiB, of useful data against the 15 EiB of current network capacity. It will be a good story to show and tell. Also, there are not many datasets left for Slingshot. Bringing in this list will encourage more people to join Slingshot.
@orvn Thanks for spending time to sort and dedup. These datasets are also duplicates: KITTI, MMID. Hope it's not a great effort to add the filtering on the UI.
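
For anyone who wants to check the capacity estimate above, here is a minimal sketch of the arithmetic. The 40 PiB, 10x, and 15 EiB figures are taken from the comment; the binary conversion (1 EiB = 1024 PiB) and the helper names are assumptions made for illustration only.

```python
# Rough check of the capacity estimate from the comment above.
# Assumption: binary units, so 1 EiB = 1024 PiB.
PIB_PER_EIB = 1024


def replicated_size_pib(raw_pib: float, replication: int) -> float:
    """Total stored size in PiB once every byte is replicated `replication` times."""
    return raw_pib * replication


def share_of_capacity(stored_pib: float, capacity_eib: float) -> float:
    """Fraction of the network capacity taken up by the stored data."""
    return stored_pib / (capacity_eib * PIB_PER_EIB)


# Figures quoted in the comment: 40 PiB raw, 10x replication, 15 EiB capacity.
stored_pib = replicated_size_pib(40, 10)
stored_eib = stored_pib / PIB_PER_EIB
share = share_of_capacity(stored_pib, 15)

print(f"{stored_pib:.0f} PiB stored = {stored_eib:.2f} EiB = {share:.1%} of capacity")
```

Under these assumptions the replicated data comes to 400 PiB, a little under 0.4 EiB, or about 2.6% of the quoted 15 EiB.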

Deep Kapur · Answer 4 · Wed May 04 2022 17:35:18 GMT+0800 (China Standard Time)

@xinaxu - I agree that it would be good to scope this in. Proposing that we pull these in (maybe in subsets) for 3.1 with an impending design change in the program (June-ish onwards).

Xinan Xu · Answer 5 · Sat May 07 2022 00:34:46 GMT+0800 (China Standard Time)

@dkkapur Sounds good. Will wait for that and revisit this.

Xinan Xu · Answer 6 · Thu Jul 14 2022 15:15:26 GMT+0800 (China Standard Time)

Closing since those datasets are being used for V3.