EleutherAI / pd-books

PD-Books

This repo holds work-in-progress code for matching and analyzing datasets of copyright submissions and renewals, based on the excellent work done by The New York Public Library (NYPL).

Tips

The datasets have been converted to HF datasets: registrations, renewals.

The registration data has been parsed from XML files with this script. To replicate, clone this repo (or download the xml folder) and run:

python ./parse_xml.py --input_path <path_to_xml_folder> --output_path <output.parquet>

The renewals dataset, available in tab-delimited format, has been aggregated and uploaded as well.
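The aggregation step is not shown in this README. As a rough sketch only, assuming the renewals are a folder of `*.tsv` files that share a header row, they could be combined with pandas as below. The function name `aggregate_tsv` and the added `source_file` column are hypothetical and are not taken from this repo's code.

```python
from pathlib import Path

import pandas as pd


def aggregate_tsv(folder):
    """Concatenate every *.tsv file in `folder` into one DataFrame.

    Hypothetical helper: the file layout and the `source_file` column
    are assumptions for illustration, not the repo's actual pipeline.
    """
    files = sorted(Path(folder).glob("*.tsv"))
    if not files:
        raise FileNotFoundError(f"no .tsv files found in {folder}")

    frames = []
    for path in files:
        # Read everything as strings so identifiers keep leading zeros.
        frame = pd.read_csv(path, sep="\t", dtype=str)
        frame["source_file"] = path.name  # remember where each row came from
        frames.append(frame)

    return pd.concat(frames, ignore_index=True)
```

Reading every column as a string avoids silently mangling identifiers; dates and numbers can be parsed later, once the combined table has been inspected.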

Some exploratory analysis and preliminary results are available here.

The main matching criteria for our purposes are the registration date and registration number, which are provided in both datasets. The registration number alone is not a unique identifier across the datasets, so both fields should be used together when matching.
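A minimal sketch of matching on both fields with pandas follows. The column names `regnum`, `regdate`, `title` and `renewal_id`, the function `match_renewals`, and the sample rows are all hypothetical; the actual column names in the HF datasets may differ.

```python
import pandas as pd

KEYS = ["regnum", "regdate"]  # hypothetical column names


def normalize_keys(df):
    """Return a copy with the match keys in a comparable form."""
    out = df.copy()
    out["regnum"] = out["regnum"].astype(str).str.strip().str.upper()
    out["regdate"] = pd.to_datetime(out["regdate"], errors="coerce")
    return out


def match_renewals(registrations, renewals):
    """Left-join registrations to renewals on (regnum, regdate)."""
    reg = normalize_keys(registrations)
    # Drop renewals with unusable keys so missing values never match.
    ren = normalize_keys(renewals).dropna(subset=KEYS)
    merged = reg.merge(ren, on=KEYS, how="left", indicator=True)
    merged["renewed"] = merged["_merge"] == "both"
    return merged.drop(columns="_merge")


# Made-up rows: two registrations share a number but differ in date.
registrations = pd.DataFrame({
    "regnum": ["A12345", "A12345", "A99999"],
    "regdate": ["1930-01-15", "1931-03-02", "1932-05-05"],
    "title": ["Book One", "Book Two", "Book Three"],
})
renewals = pd.DataFrame({
    "regnum": ["a12345 "],
    "regdate": ["1930-01-15"],
    "renewal_id": ["R1"],
})

matched = match_renewals(registrations, renewals)
print(matched[["regnum", "regdate", "title", "renewed"]])
```

Joining on the number alone would mark both `A12345` rows as renewed; adding the date restricts the match to the first one.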

About

License: Apache License 2.0


Languages

Jupyter Notebook 98.1%, Python 1.9%