ftrotter / mirrulations_regulation_data

A repository with instructions for accessing data from the mirrulations project.

Mirrulations Regulation Data

S3 Access

The Mirrulations project publishes its results at the following S3 Bucket.

s3://mirrulations

Mirrulations uses

How to download

The simplest way to download the data is to use rclone. We have written some helper scripts that automate the rclone commands.

Using the scripts, you can download data by year or by Federal agency, and choose whether to download the raw text only or both the raw text and the originals (generally PDFs and Word documents).

Data License

Regulations data is public domain under the edicts of government principle.

With this in mind, we are formally labeling the data as Public Domain, by using the Creative Commons Public Domain mark:

Public Domain Mark
This work (Mirrulations Regulation Data, by Participants in public regulatory process), identified by Mirrulations Project, is free of known copyright restrictions.

We have also added a text version of this assertion in the data_LICENSE.txt file.

At one time, regulations.gov requested that data downloaded from the service include the following warning:

Regulations.gov and the Federal government cannot verify and are not responsible for the accuracy or authenticity of the data or analyses derived from the data after the data has been retrieved from Regulations.gov.

In other words, "once the data has been downloaded from Regulations.gov, the U.S. Government cannot verify and is not responsible for the quality, accuracy, reliability, or timeliness of any analyses conducted using the downloaded data."

Indeed, Mirrulations uses the Regulations.gov Data API but is neither endorsed nor certified by Regulations.gov.

Warnings

There are unique risks associated with this public dataset. Please read these warnings carefully. There is a reason that the warnings come before the instructions in this ReadMe.

Data Reliability Warning

The public domain status of this information means that you are free to use this data in any way that you would like. However, please note that much of our data is produced by the Mirrulations software, which, among other things, converts PDF to text in various ways. This process can have bugs, and as a result the text it generates may be incorrect, incomplete, or otherwise broken. The MIT license of the Mirrulations software makes it clear that we fully disclaim liability for the use of the program, which means that you use the data at your own risk. We offer no warranty that the data is correct.

Privacy Warning

Further, many people likely did not understand that by contributing to the regulatory comment process, they contributed their data to the public sphere. Many commenters appear to have assumed that only government regulators would be able to see and access their data. This means that there is likely a substantial amount of data inside these comments that people consider private, despite the fact that they have actually made it public information. Please be considerate of this risk as much as possible and do not use this data in a manner that takes advantage of this misunderstanding. In a similar fashion, it is possible that people contributed data without fully understanding how this data could be used against them. In the era of Facebook and other companies that monetize private information, this might seem like a quaint idea, but as much as possible, please do not use this data to harm people.

We cannot make you follow these rules, and anyone can go to regulations.gov and download this data directly, but we implore you to use your best judgement when using this data.

S3 download costs

The authors of Mirrulations cannot afford to subsidize the download of regulation data. Therefore, Mirrulations is published using the Requester Pays bucket feature of Amazon S3.

If you download the S3 bucket in its entirety, this will result in an AWS bill of several hundred dollars. Even downloading just the text portion can be expensive. To save costs, please download only the portions of the text corpus that you need.

Downloading the data

To download the data, you will need to get your own AWS keys and then follow the directions for downloading from Requester Pays S3 buckets.

Not all Amazon S3 clients support the Requester Pays feature. However, rclone does support it, and rclone is the client we intend to support for downloading purposes.
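The rclone commands later in this ReadMe address a remote called "s3". As a minimal sketch of one way to provide that remote without editing an rclone config file, the Python below builds environment variables using rclone's RCLONE_CONFIG_<REMOTE>_<OPTION> convention. It assumes your AWS keys are already exported as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; the function name rclone_env is hypothetical and not part of the helper scripts.

```python
import os


def rclone_env(remote_name="s3", base_env=None):
    """Return an environment dict that defines an rclone S3 remote.

    Assumption: credentials come from the standard AWS environment
    variables, which rclone reads when env_auth is enabled.
    """
    env = dict(os.environ if base_env is None else base_env)
    prefix = f"RCLONE_CONFIG_{remote_name.upper()}_"
    env[prefix + "TYPE"] = "s3"         # backend type
    env[prefix + "PROVIDER"] = "AWS"    # Amazon S3 rather than a compatible service
    env[prefix + "ENV_AUTH"] = "true"   # take keys from the AWS_* variables
    return env
```

The returned dict can be passed as the env argument of subprocess.run when invoking rclone.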

Understanding the Mirrulations folder structure

The Mirrulations project documents the directory structure used in the S3 bucket.

Generally, the structure looks like this:

data
└── <agency> (like DEA, CMS, etc)
    └── <docket id> (like DEA-2016-0015) 
        ├── binary-<docket id> (like 'binary-DEA-2016-0015')
        └── text-<docket id> (like 'text-DEA-2016-0015')
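The naming pattern in the diagram can be sketched in Python. This is an illustration only: docket_folders and DocketFolders are hypothetical names, the agency is assumed to be the part of the docket id before the first hyphen, and the paths are relative to the level that holds the agency folders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocketFolders:
    agency: str
    docket_id: str
    binary_dir: str
    text_dir: str


def docket_folders(docket_id: str) -> DocketFolders:
    """Derive the folder names for one docket from the layout above."""
    # Assumption: "DEA-2016-0015" splits into agency "DEA" and the rest.
    agency, sep, rest = docket_id.partition("-")
    if not agency or not sep or not rest:
        raise ValueError(f"unexpected docket id: {docket_id!r}")
    base = f"{agency}/{docket_id}"
    return DocketFolders(
        agency=agency,
        docket_id=docket_id,
        binary_dir=f"{base}/binary-{docket_id}",
        text_dir=f"{base}/text-{docket_id}",
    )
```

For the docket used in this ReadMe, docket_folders("DEA-2016-0015").text_dir evaluates to "DEA/DEA-2016-0015/text-DEA-2016-0015".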

Generally, the "binary" folder contains a mirror of the PDFs and Word documents that people submit as comments on Federal regulations. More rarely, these directories can contain JPG, PNG, and other image files that are submitted as comments.

You are free to download these binary files, but the whole point of the Mirrulations project is to make the text contained in these PDFs available as raw text. So if you look under the 'text' directory, you will find the text-extracted versions of these resources.

Mirrulations, by default, uses the pikepdf tool to conduct text extraction on the various documents. In the future, for PDFs that pikepdf does not cleanly extract, other extraction tools, including OCR tools, will likely be used. This is the reason that under the OCR directories there is a "" subdirectory, so that you can know what tool did the conversion between the binary file and the text file.

Under the text- directory, the following directories exist:

  • comments - for JSON files for comments submitted as raw text through regulations.gov
  • comments_extracted_text - for the text results of the OCR process for pdfs/word/other files submitted as comments
  • docket - for JSON files regarding the docket itself
  • documents - for JSON files representing the documents that the government published inside the docket
  • documents_extracted_text - for the text files that are the results of the OCR on the documents
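Once a docket has been downloaded, the five subdirectories listed above can be used to check what arrived. The following is a rough sketch, assuming a local copy of a text-<docket id> folder; count_text_files is a hypothetical helper, not part of the project.

```python
from collections import Counter
from pathlib import Path

# Subdirectory names taken from the list above.
TEXT_SUBDIRS = (
    "comments",
    "comments_extracted_text",
    "docket",
    "documents",
    "documents_extracted_text",
)


def count_text_files(text_dir: str) -> Counter:
    """Count regular files under each known subdirectory of a text- folder."""
    root = Path(text_dir)
    counts = Counter()
    for name in TEXT_SUBDIRS:
        sub = root / name
        if sub.is_dir():
            # rglob also descends into nested folders, such as a per-tool one.
            counts[name] = sum(1 for p in sub.rglob("*") if p.is_file())
    return counts
```

Subdirectories that are absent from the local copy are left out of the result, and looking them up in the Counter returns 0.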

Here is a thousand words on the topic:

mirrulations_extract_folder_diagram.png

Using rclone to download portions of the data

Downloading only a specific docket

This command will download both the binary files and the text files associated with docket id DEA-2016-0015:

rclone --s3-requester-pays copy s3:mirrulations/DEA/DEA-2016-0015/ /path/to/your/local/mirrulations/directory/DEA-2016-0015
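If you want to script this per docket, the command above can be assembled in Python. This is a sketch under a few assumptions: the remote is named "s3" as in the command above, the agency is the part of the docket id before the first hyphen, and docket_copy_command is a hypothetical name. It appends rclone's --dry-run flag by default so that rclone only reports what it would copy.

```python
def docket_copy_command(docket_id, local_root, remote="s3",
                        bucket="mirrulations", dry_run=True):
    """Build the rclone argument list that copies one docket."""
    # Assumption: "DEA-2016-0015" lives under the "DEA" agency folder.
    agency = docket_id.split("-", 1)[0]
    source = f"{remote}:{bucket}/{agency}/{docket_id}/"
    dest = f"{local_root.rstrip('/')}/{docket_id}"
    cmd = ["rclone", "--s3-requester-pays", "copy", source, dest]
    if dry_run:
        cmd.append("--dry-run")  # report only, transfer nothing
    return cmd


# Example: print the command instead of running it.
print(" ".join(docket_copy_command("DEA-2016-0015", "/path/to/local/dir")))
```

To run it for real, pass the list to subprocess.run(cmd, check=True) with dry_run=False.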

Downloading only the text corpus

rclone has advanced filtering capacity to download only portions of the data. Using that, you can use the following command to mirror all of the text contained in mirrulations (NOTE: this can be expensive!!):

rclone --s3-requester-pays copy s3:mirrulations /path/to/your/local/mirrulations/directory/ --include "*.txt" --include "*.json" --include "*.htm"

Download any agency

To download every DEA regulation:

rclone --s3-requester-pays copy s3:mirrulations /path/to/your/local/mirrulations/directory/ --include "/DEA/**/*.txt" --include "/DEA/**/*.json" --include "/DEA/**/*.htm"

Replace 'DEA' with your agency of interest to download that agency's regulations instead.

Download any year

The docket number of a regulation generally contains the year that the regulation was published. Thus, to download every regulation from 2022 we write:

rclone --s3-requester-pays copy s3:mirrulations /path/to/your/local/mirrulations/directory/ --include "/*/*2022*/**/*.txt" --include "/*/*2022*/**/*.json" --include "/*/*2022*/**/*.htm"
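The three filter styles above (everything, one agency, one year) follow one pattern, which can be sketched as a small Python function. The name include_filters is hypothetical, and the combined agency-plus-year pattern is an extrapolation from the commands above rather than something this ReadMe documents.

```python
TEXT_EXTENSIONS = ("txt", "json", "htm")


def include_filters(agency=None, year=None, extensions=TEXT_EXTENSIONS):
    """Return rclone --include arguments for the text corpus."""
    flags = []
    for ext in extensions:
        if agency is None and year is None:
            pattern = f"*.{ext}"                     # whole text corpus
        elif year is None:
            pattern = f"/{agency}/**/*.{ext}"        # one agency
        else:
            # Year filter; "*" stands in for the agency when none is given.
            pattern = f"/{agency or '*'}/*{year}*/**/*.{ext}"
        flags += ["--include", pattern]
    return flags
```

The returned list is meant to be appended to an rclone copy command; for example, include_filters(agency="DEA") reproduces the three --include arguments of the DEA command above.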
