jacksongoode / nuclear-collections

An exploration of the NCR ADAMS data scraped by the Internet Archive for Earthathon 2023

ADAMS NRC enrichment for Internet Archive

Description

This repo contains code for:

Downloading a mapping between document titles and accession IDs and saving that data fetching.ipynb.
Basic topic modeling of the documents based on those titles modeling.ipynb.

A sample dataset is also generated and available in this repo at docs/2023-01-01_2023-04-30_docs.json.

Processing Steps

Major Obstacles

Data scraped by web archive does not contain any identifying information for organizing and searching
Source data at https://adams.nrc.gov/wba/ can be queried using SQL to obtain a mapping between document titles, accession IDs, and other relevant information, however queries are limited to 1000 items.
Many objects returned by the SQL queries are containers for data, rather than the data itself

Current Solution (Python)

Generate template SQL query to send via requests to source website, filtering specifically for .pdf files
Using a sliding window on when the documents were added, compile a list of all documents added during a specific timeframe
Save as a .json file
Extract document titles for preliminary analysis

To Do

Generate data for desired dates
Filter through relevant documents
Apply and join mapping on top of existing archives

About

An exploration of the NCR ADAMS data scraped by the Internet Archive for Earthathon 2023

Languages

Language:Jupyter Notebook 100.0%