stompsjo / ETI.data_manual

This is a manual for data access and use within ETI Thrust Area 1.

ETI Thrust Area 1: Data Access and Collaboration


Introduction

The purpose of this manual is to give members of the Consortium for Enabling Technologies and Innovation (ETI), particularly those from universities, the details needed to access NA-22 relevant data from national laboratories. It introduces data that has been gathered with potential interest to ETI collaborators and provides instructions for accessing that data. The hope is that this will serve as a catalyst for university members, such as graduate students and their PIs, to partner with ventures at national labs that can benefit from academic research.

This manual is organized in two ways. For those interested in specific ventures or projects, the table of contents below links to each project page. In some instances, a user might be interested in a specific form of data (imaging, audio, E&M readings, etc.) but might not be sure which venture or project best fits those needs. To help with this search, the table below attempts to match forms of data with each project. Note that this table is not exhaustive. If a data type is missing from the table but should be listed, please create an issue or pull request on the manual's GitHub page. Each cell in the table links to information on the corresponding venture or project.

Contribution to this document is encouraged as data streams are identified. The process of adding content is detailed in the contributing page.

Table of Contents

  1. Data Streams
  2. Misc. Data Sources
  3. Data Modes
  4. Data Tools

Data Streams

audio biota EM imaging infrasound radiation seismo-acoustic video hyperspectral
MINOS x x x x x x
WAGGLE x x x x x
MUSE x
Topcoder x
GRDC/BDC x x x x
VAST
fMoW x
xView x
SpaceNet x
COWC x

MINOS

Multi-Informatics for Nuclear Operations Scenarios (MINOS) is an NA-22 venture that collects several data modalities for use in nuclear nonproliferation. These data streams are centralized for use by the MINOS team and collaborators. Please see the MINOS page in this manual for more information.

Waggle

This is an Argonne National Laboratory project that uses a system of nodes dispersed throughout an urban area to collect various data streams and develop methods for edge computing and threat detection.

Modeling Urban Scenarios and Experiments

MUSE is an ORNL nuclear dataset designed to help in nuclear nonproliferation research aimed at detecting and assessing threats in an urban environment. DOI: 10.13139/ORNLNCCS/1597414

Topcoder Data Science Competition

This dataset was used for a Topcoder data science competition held in association with several national laboratories. The aim of the competition was to develop algorithms that identify and characterize nuclear threats in urban areas. The datasets and an explanation of the competition can be found here.

Gamma-ray Data Cloud / Berkeley nuclear Data Cloud

GRDC and BDC are LBNL-hosted websites developed to enable access to radiological data and associated contextual data.
GRDC primarily hosts the RadMAP dataset. RadMAP was a truck-borne detector system carrying NaI, HPGe, and liquid scintillator radiation detectors, LiDAR, 4pi cameras, a hyperspectral camera, a weather station, and a GPS/IMU. The RadMAP publication is available here. Some helicopter-borne NaI data is also available. Request GRDC access here.

BDC was developed to improve upon the GRDC data management model, with the goal of making it easier for users to add their own data.
BDC will host datasets similar to RadMAP (and possibly the RadMAP dataset itself one day). For now, it remains under active development, and the first hosted dataset is the Northern Virginia Array (NoVArray), which comprises 18 months of data from up to 18 NaI detectors positioned alongside roadways in the Northern Virginia suburbs of the Washington, DC area. Users can register at bdc.lbl.gov, but manual approval by an administrator is required. The NoVArray paper is available on arXiv.

Both datasets are available to researchers who register on the respective websites.

VAST Challenge 2020

The VAST Challenge is an annual competition utilizing data visualization and analytics. While the aim of the competition may be driven by data visualization, the datasets provided can be scientifically valuable as an alternative open data source.

Functional Map of the World

fMoW was an IARPA challenge to develop classification algorithms for imagery data. The data is still available in TIFF and JPEG formats here. The challenge website provides some context on goals and additional resources for using imagery data. A paper describing the dataset in detail can be found on arXiv.

xView Detection Challenge

This is a publicly available dataset of satellite imagery provided by the Defense Innovation Unit Experimental (DIUx) and the National Geospatial-Intelligence Agency (NGA). xView builds on the work of other imagery challenges in developing classification and detection algorithms. A pre-trained model is already provided using TensorFlow and PyTorch.

SpaceNet

SpaceNet is a commercial satellite imagery dataset with existing labels for developing machine learning classification algorithms. The dataset is publicly available on AWS.

Cars Overhead with Context

COWC is a training dataset with value to machine learning and deep neural networks for classification and detection of cars in overhead imagery. A paper describing the dataset can be found here.


Miscellaneous Data Sources

Several other organizations maintain databases of varying public availability. The Incorporated Research Institutions for Seismology (IRIS) provides several raw datasets in different formats (time series, event, etc.). The U.S. Energy Information Administration (EIA) compiles significant amounts of data on economics and energy generation, both nationally and internationally. OpenEI collects data pertaining to different energy generation methods and also maintains a Geothermal Data Repository (GDR) with data collected across the United States. Finally, data.gov provides environmental datasets organized by level of government (city, county, state, federal, etc.).


Data Modes

This section briefly defines the data modalities described in the data table:

  • Audio - Recorded audio data collected at a detector location. This could be collected as a physical data source or as recordings providing clarification and context to other data streams.

  • Biota - Ecological data describing experimental or detection locations and the wildlife in the surrounding area.

  • EM - Electromagnetic observations.

  • Imaging - Visual imaging data (i.e. static pictures). These can be satellite imagery or ground-level pictures from detectors and ground cameras.

  • Infrasound - Low-frequency sound recordings not normally audible to humans.

  • Radiation - Data collected from detectors that measure different types of radiation.

  • Seismo-acoustic - Low-frequency recordings traditionally from geophysical sources. One detector example would be a seismograph.

  • Video - Recorded video imaging from a detector.

  • Hyperspectral - Visible and near-infrared light imagery, measured in many channels, as opposed to the three channels of RGB visual imagery.
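
To make the difference between hyperspectral and RGB imagery concrete, the sketch below treats one hyperspectral pixel as a list of per-band values and picks the three bands closest to nominal red, green, and blue wavelengths. The band centers, the chosen RGB wavelengths, the pixel values, and the function names are all assumptions made up for illustration; real sensors document their own band layout.

```python
# Hypothetical band centers in nanometers: 61 channels from 400 to 1000 nm.
band_centers_nm = [400 + 10 * i for i in range(61)]


def nearest_band(target_nm, centers):
    """Return the index of the band whose center is closest to target_nm."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - target_nm))


def to_rgb(pixel, centers, rgb_nm=(650, 550, 450)):
    """Reduce a many-channel pixel to three values near red, green, and blue.

    The default wavelengths are nominal choices, not a standard.
    """
    return tuple(pixel[nearest_band(nm, centers)] for nm in rgb_nm)


# Synthetic pixel: one made-up intensity value per band.
pixel = [100 + i for i in range(61)]

rgb = to_rgb(pixel, band_centers_nm)
print(len(pixel), "channels reduced to", rgb)
```

The other 58 channels are simply discarded here, which is exactly the information an RGB camera never records.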


Data Tools

Several advanced computing software packages have been developed that may be useful to ETI research efforts. These machine learning packages are written for use with Python 3:

  1. "Shadow is a PyTorch based library for semi-supervised machine learning." It contains several training algorithms and can be installed via pip. Online documentation includes several examples of using the package and API information.

  2. "MIMOSAS (Multimodal Input Model Output Security Analysis Suite) is a supervised machine learning pipeline developed for classification of multimodal data to inform nuclear security and proliferation detection scenarios." It has a modular framework with the ability to pre-process data as well as train and test models. MIMOSAS is compatible with MINOS data and can be installed from source. Additional information can be found here.

  3. "RADAI (Radiological Anomaly Detection And Identification) is a suite of python tools for handling and manipulating gamma-ray data, implementing spectral analysis algorithms, and (eventually) manipulating a RADAI benchmark dataset that will be building upon the Urban Radiological Search Competition (aka TopCoder above)." The package includes implementations of several benchmark algorithms for anomaly detection and coupled detection/identification.

  4. "Becquerel is a Python package for analyzing nuclear spectroscopic measurements." The core functionalities are reading and writing different spectrum file types, fitting spectral features, performing detector calibrations, and interpreting measurement results. It includes tools for plotting radiation spectra as well as convenient access to tabulated nuclear data. It is intended to be general-purpose enough to be useful to anyone from an undergraduate taking a laboratory course to an advanced researcher.
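
As a rough illustration of what a benchmark anomaly detection algorithm for radiation data can look like, the sketch below applies a generic gross-count k-sigma test to counts per time bin. It does not use RADAI, Becquerel, or any other package listed above, and it is not taken from them; the function name, the threshold `k`, the background rate, and the counts are all assumptions chosen for illustration.

```python
import math


def gross_count_alarms(counts, background_rate, k=5.0):
    """Return indices of time bins whose counts exceed a k-sigma threshold.

    Assumes Poisson statistics, so the standard deviation of the expected
    background in one bin is sqrt(background_rate).
    """
    threshold = background_rate + k * math.sqrt(background_rate)
    return [i for i, c in enumerate(counts) if c > threshold]


# Synthetic counts per one-second bin (made up for illustration).
counts = [98, 103, 95, 110, 101, 180, 175, 99, 104]

# Assumed known mean background of 100 counts per bin.
alarms = gross_count_alarms(counts, background_rate=100.0)
print("Bins flagged as anomalous:", alarms)
```

A test like this only looks at total counts; identifying which source caused an alarm requires spectral information, which is where dedicated spectroscopy tools come in.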

License:Creative Commons Attribution 4.0 International