mab2400 / Redactions

A Python-based project that utilizes OpenCV to detect and analyze redactions in declassified President's Daily Briefs and Central Intelligence Bulletins from the mid-20th Century.

Redactions Project

A Python-based project that utilizes OpenCV to detect and analyze redactions in declassified President's Daily Briefs and Central Intelligence Bulletins from the mid-20th Century.

Analyzing Redactions

Getting Started

Prepare two directories:

  • A directory containing only PDF files of the documents (the "from" directory)
  • An empty directory (the "to" directory)

As the script analyzes each document, it moves that file from the "from" directory to the "to" directory.

Install the necessary packages:

  • Run ./install.sh

Running the Script

To find the redactions in a batch of documents and generate a CSV file containing the data:

python3 stats.py <doc_type> batch <from_dir> <to_dir>
  • doc_type is the document type of the input files: use pdb for President's Daily Briefs or cib for Central Intelligence Bulletins.
  • from_dir and to_dir are the "from" and "to" directories prepared above.
  • A table is printed to the screen as the script runs. The raw data is also saved to <from_dir>/output.csv for later use.
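
The batch workflow described above (analyze each PDF, record a row, move the file) can be sketched roughly as follows. This is an illustration only, not the code in stats.py: analyze_document and process_batch are hypothetical names, and the CSV columns are placeholders, since the real output format is not documented here.

```python
import csv
import shutil
from pathlib import Path


def analyze_document(pdf_path):
    """Hypothetical stand-in for the real per-document analysis.

    Returns a dict of statistics; the field names are illustrative only.
    """
    return {"filename": pdf_path.name, "redaction_count": 0}


def process_batch(from_dir, to_dir, analyze=analyze_document):
    """Analyze every PDF in from_dir, write one CSV row each, then move it to to_dir."""
    from_dir, to_dir = Path(from_dir), Path(to_dir)
    to_dir.mkdir(parents=True, exist_ok=True)
    output_csv = from_dir / "output.csv"
    processed = []

    with output_csv.open("w", newline="") as handle:
        writer = None
        # sorted() builds the full list first, so moving files is safe while looping.
        for pdf_path in sorted(from_dir.glob("*.pdf")):
            row = analyze(pdf_path)
            if writer is None:
                # The first row decides the CSV header.
                writer = csv.DictWriter(handle, fieldnames=list(row))
                writer.writeheader()
            writer.writerow(row)
            # Move the file once it has been analyzed.
            shutil.move(str(pdf_path), str(to_dir / pdf_path.name))
            processed.append(pdf_path.name)
    return processed
```

Calling process_batch("from", "to") would leave only output.csv in the "from" directory once every PDF has been handled.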

To generate graphs and analyze the data in the CSV file:

python3 stats.py <doc_type> analyze <output_filepath.csv>

The graphs appear in a pop-up window. See the Results section below for examples.

Analyzing and Displaying a Single Page

Analyze the redactions on a single page of a document and display an image with the identified redactions.

Running the Script

python3 redactions_show.py <doc_type> <png_filepath>
  • png_filepath is the path to a single page of a document. It must be a PNG file.
  • The script calculates the number of redactions, the percent of text on the page that was redacted, and the estimated number of words that were redacted. It also opens a pop-up window containing an image that marks the locations of the redactions on the page.
  • Once the pop-up window opens, press any key. This automatically takes a screenshot so you can save the analyzed image for future reference.

Results

NER Top 15 Entities - Redactions per Mention

Mentions of China, Moscow, and the USSR are more heavily redacted. Mentions of Hanoi are less redacted.

Weighted Redactions per Year

How Does It Work?

Our script makes use of OpenCV’s built-in contour and shape detection features to identify the white redaction boxes on each page of a PDB document.

By comparing the area of the redactions with the area of the text on the page, we can estimate the number of words that were redacted and the percent of text on the page that was redacted.
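
The arithmetic behind such an estimate might look like the sketch below. The formulas are assumptions for illustration, not the project's confirmed method: the percent is taken as redacted area over redacted plus visible text area, and the word count divides the redacted area by an assumed average word area. estimate_redaction_stats and its parameters are hypothetical names.

```python
def estimate_redaction_stats(redaction_areas, text_area, avg_word_area):
    """Estimate redaction share and word count from pixel areas.

    redaction_areas: area in pixels of each detected redaction box
    text_area: area in pixels covered by visible text on the page
    avg_word_area: assumed average area in pixels of one printed word
    """
    redacted_area = sum(redaction_areas)
    total_area = redacted_area + text_area
    if total_area == 0 or avg_word_area <= 0:
        # Nothing to measure against, so report zeros rather than divide by zero.
        return {
            "redaction_count": len(redaction_areas),
            "percent_redacted": 0.0,
            "estimated_words": 0,
        }
    return {
        "redaction_count": len(redaction_areas),
        "percent_redacted": 100.0 * redacted_area / total_area,
        "estimated_words": round(redacted_area / avg_word_area),
    }


stats = estimate_redaction_stats([12000, 2400], text_area=43200, avg_word_area=1200)
```

With these made-up numbers, the two boxes cover 14,400 pixels next to 43,200 pixels of visible text, which gives 25 percent redacted and about 12 redacted words.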

Languages

Python 97.6%, Shell 2.4%