EleutherAI / the-pile

Congressional Records

thoppe opened this issue

URL: https://www.govinfo.gov/help/crecb#about

Size: Estimated 6-8 GB text uncompressed

The Congressional Record is the official record of the proceedings and debates of the United States Congress. It is published daily when Congress is in session. The Congressional Record began publication in 1873 and is still published today.

At the end of each session of Congress, all of the daily editions are collected, re-paginated, and re-indexed into a permanent, bound edition. This permanent edition, referred to as the Congressional Record (Bound Edition), is made up of one volume per session of Congress, with each volume published in multiple parts, each part containing approximately 10 to 20 days of Congressional proceedings. The primary ways in which the bound edition differs from the daily edition are continuous pagination; somewhat edited, revised, and rearranged text; and the dropping of the prefixes H, S, and E before page numbers.

What is available?

Volumes 144 (1998) and prior are made available as digitized versions of the Congressional Record (Bound Edition) created as a result of a partnership between GPO and the Library of Congress. These volumes include all parts of the official printed edition.

There is an API to access the records that seems straightforward once you get past the idea of collections and packages:

https://api.govinfo.gov/docs/
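For anyone picking this up, here is a rough sketch of what the two kinds of API URL look like: one that lists the packages in a collection and one that fetches the summary of a single package. It only builds the URLs and makes no network calls. The collection code `CRECB` is inferred from the help page URL above, `DEMO_KEY` is the shared demo key and should be swapped for a real one, and the function names and the `offsetMark`/`pageSize` query parameters are my reading of the API docs, so check them there before relying on this.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.govinfo.gov"


def collection_url(collection, modified_since, api_key="DEMO_KEY", page_size=100):
    """URL listing packages in a collection modified since an ISO 8601 timestamp.

    The query parameter names are assumptions taken from the API docs.
    """
    query = urlencode({"offsetMark": "*", "pageSize": page_size, "api_key": api_key})
    return f"{API_ROOT}/collections/{collection}/{modified_since}?{query}"


def package_summary_url(package_id, api_key="DEMO_KEY"):
    """URL for the JSON summary of one package, e.g. one part of a bound volume."""
    return f"{API_ROOT}/packages/{package_id}/summary?{urlencode({'api_key': api_key})}"


if __name__ == "__main__":
    print(collection_url("CRECB", "2018-01-01T00:00:00Z"))
    print(package_summary_url("CRECB-2001-pt1"))
```

The idea would be to walk the collection listing first, collect the package IDs, and then fetch each package.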

The data are all in PDF, so they would require some parsing, but it looks like the documents are already OCR'd.

Example: https://www.govinfo.gov/content/pkg/CRECB-2001-pt1/pdf/CRECB-2001-pt1.pdf
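The example URL suggests that the PDF location can be derived from the package ID alone. A minimal sketch, assuming that pattern holds for every package (it is inferred from this single example, and `package_pdf_url` is just an illustrative name):

```python
def package_pdf_url(package_id):
    """Build the PDF download URL for a package, following the example URL's pattern."""
    return f"https://www.govinfo.gov/content/pkg/{package_id}/pdf/{package_id}.pdf"


if __name__ == "__main__":
    print(package_pdf_url("CRECB-2001-pt1"))
```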

Preliminary experiments with pdfbox show good extraction. Example:

Mr. BYRD. Yes, exactly, one of which
happens to appear to target a facility
for a district represented by a Member
of the House from Texas. We do not
know what that facility is, but it has
been slipped into this measure.
Mr. SARBANES. I say to the distin-
guished Senator, I was not even aware
of that one. That one has not yet risen
to the level of being covered in these
newspaper stories.
Mr. BYRD. I think that is where I got
a glimmer of it, somewhere in a news-
paper story.

The pages are set in columns, so many words are broken by end-of-line hyphens (e.g. "distin- guished"). I don't think that will be a problem for the LM, though.
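If the broken words do turn out to matter, they could be rejoined in a post-processing pass. A minimal sketch, assuming the extracted text keeps its line breaks and that a hyphen at the end of a line followed by a lowercase continuation is always a column break (`dehyphenate` is a hypothetical helper, not part of pdfbox):

```python
import re

# A run of letters, a hyphen at the very end of a line, then a lowercase
# continuation at the start of the next line.
_BROKEN_WORD = re.compile(r"([A-Za-z]+)-\n([a-z]+)")


def dehyphenate(text):
    """Rejoin words that were split across lines by column-break hyphens."""
    return _BROKEN_WORD.sub(r"\1\2", text)


if __name__ == "__main__":
    sample = "Mr. SARBANES. I say to the distin-\nguished Senator, I was not even aware"
    print(dehyphenate(sample))
```

This would also merge genuine compounds that happen to break at a line end ("well-" / "known" becomes "wellknown"), so a dictionary check might be needed if that turns out to be common.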

Additionally, the Congressional Record can be pulled from 1998 forward, but those issues are already digitized and sit behind a different API endpoint.

American congressmen are racist pricks so we aren't going to use this.