jacksongoode / nuclear-collections

An exploration of the NCR ADAMS data scraped by the Internet Archive for Earthathon 2023

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

ADAMS NRC enrichment for Internet Archive

Description

This repo contains code for:

  • Downloading a mapping between document titles and accession IDs and saving that data fetching.ipynb.
  • Basic topic modeling of the documents based on those titles modeling.ipynb.

A sample dataset is also generated and available in this repo at docs/2023-01-01_2023-04-30_docs.json.

Processing Steps

Major Obstacles

  • Data scraped by web archive does not contain any identifying information for organizing and searching
  • Source data at https://adams.nrc.gov/wba/ can be queried using SQL to obtain a mapping between document titles, accession IDs, and other relevant information, however queries are limited to 1000 items.
  • Many objects returned by the SQL queries are containers for data, rather than the data itself

Current Solution (Python)

  • Generate template SQL query to send via requests to source website, filtering specifically for .pdf files
  • Using a sliding window on when the documents were added, compile a list of all documents added during a specific timeframe
  • Save as a .json file
  • Extract document titles for preliminary analysis

To Do

  • Generate data for desired dates
  • Filter through relevant documents
  • Apply and join mapping on top of existing archives

About

An exploration of the NCR ADAMS data scraped by the Internet Archive for Earthathon 2023


Languages

Language:Jupyter Notebook 100.0%