ihmc / congressional_hearings

Converting Congressional Hearing .htm files to .csv

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

congressional_hearings

This is an attempt to recreate the csv files for Congressional Hearings for the NLP Unshared Task in PoliInformatics 2014. In addition to the "Speaker" and "Speech" columns that existed in the original files, there are now "Type" and "Comments" columns. The type is one of

  • Speech
  • Statement

And if a record is a Statement, then the Comments column will include the text of the Statement header.

For example

"Speaker" "Speech" "Type" "Comment"
"Chairman Greenspan", "Thank you very much, Mr. Chairman and Members of the Committee. I am, as always, pleased to be here today to present----", "Statement", "STATEMENT OF ALAN GREENSPAN, CHAIRMAN BOARD OF GOVERNORS OF THE FEDERAL RESERVE SYSTEM"

The new files are colocated with the original files in the data directory with a .htm.csv file extension.

To regenerate the files, run

rake poliinformatics:convert

To test the code on your system, run

rspec spec

About

Converting Congressional Hearing .htm files to .csv


Languages

Language:Ruby 100.0%