Clemson-Digital-History / TextAnalysis_StLouisFair

This is the repository for a digital history project that uses distant reading, named entity recognition (NER), and word embedding models (WEM) to assess how local St. Louis newspapers generated cultural and discursive representations of the geopolitical entities at the 1904 World's Fair, as well as a view of the world centered on the American colonial empire.

An Imagined Geography of Empire: Mining cultural representations of the American colonial state at the 1904 World’s Fair

This digital history project uses distant reading and textual data mining to assess how local newspapers in St. Louis produced and promoted their own discursive representations of the world in response to the ideological messages embedded on the grounds of the 1904 world's fair. An Imagined Geography of Empire understands the 1904 Louisiana Purchase Exposition as a complex microcosm of early-twentieth-century modernity, embedded with ritualistic competition, contradictions, and tense power relations. The article pushes for closer scholarly attention to how newspapers engaged with and interpreted the language of empire and American colonialism at the fair. Newspapers relied on the ways in which multiple audiences perceived and engaged with the fair's exhibits in order to write their stories and produce complex representations of participating cultures and the modernizing world. By attending to the cultural commentary about the fair through digital methodologies, the project argues that, in response to the power relations and discursive negotiations embedded on the fairgrounds, newspapers contributed to an "imagined geography" of the modernizing world centered on the United States as an emerging, exceptional colonial power at the turn of the century (Said, 1979; Lefebvre, 1991; Anderson, 1983; Blevins, 2014). They did so, first, by treating the Philippine exhibit as a centerpiece of the exposition and by printing place-names of the United States and the Philippines more often than those of any other geopolitical entity participating in the fair, and, second, by characterizing the Filipino people as a nation under American tutelage, being guided toward civilization.

At its root, the repository contains the final article.docx (with its accompanying .Rmd version), the data-ethics-statement.html (with its accompanying .md file), this README.md file, and the R project file. Note that article.docx has been further revised, so article.Rmd may be an obsolete version; it is still included in the repository for reproducibility purposes. The repository contains four sub-directories: "data", "txt_files", "word_embedding", and "code".

data

  • metadata.csv -> It contains the metadata for each newspaper article collected from Newspapers.com. Variables include: "doc_id", "file_format", "first_page_indicator", "title", "day", "month", "year", "newspaper_id", "article_date", "multipage_article", "page_id", "multipage_id", and "article_id".
  • raw_data.csv -> It contains the same variables as metadata.csv, plus an added variable with the full OCR'ed text of each JPG file. This text has not been cleaned and still contains stop words.
  • text_data.csv -> It contains the same variables as raw_data.csv, but the "text" column has been cleaned. It still contains stop words.
  • tokenized_data.csv -> It contains the same variables as text_data.csv, except the "text" column is replaced by the "word" column after tokenization. This data frame does not contain stop words.
  • dataforgeocoding.csv -> It contains the widened data frame of placenames extracted from text_data.csv using spacy_parse{spacyR}. The widened frame includes the "place_name" and "count" columns, as well as "city", "state", "country", "native_group", and "continent" as variables, which were merged into a "geo_address" column. This new variable was then used for geocoding.
  • google_geodata.csv -> It contains the same variables as dataforgeocoding.csv with the addition of the geocoded "latitude" and "longitude" variables retrieved using Google's API key.
  • geocoded_placenames.csv -> It contains "rowid", "place_name", "count", "geo_address", "latitude", "longitude", and "scale". This is the final version of the geocoded placenames data frame, and it allows for plotting subsets of the data based on scale (i.e., city, state, country, continent, geo_region, native_group); a brief usage sketch follows this list.
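The snippet below is a minimal sketch, not taken from the repository scripts, of how geocoded_placenames.csv can be subset by the "scale" variable before plotting. Column names follow the descriptions above; the relative path and the plotting choices are assumptions.

```r
# Hedged sketch: subset the final geocoded data frame by scale and plot it.
library(dplyr)
library(ggplot2)

geocoded_placenames <- read.csv("data/geocoded_placenames.csv")

# Keep only country-level placenames, ordered by how often they are mentioned
country_counts <- geocoded_placenames %>%
  filter(scale == "country") %>%
  arrange(desc(count))

# Plot the geocoded points, sized by mention count
ggplot(country_counts, aes(x = longitude, y = latitude, size = count)) +
  geom_point(alpha = 0.6) +
  labs(title = "Country-level placenames in the corpus", size = "Mentions")
```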

txt_files

  • This directory contains the original 461 text files resulting from the OCR process. This is the raw text data fed into RStudio for data prep and processing prior to the analysis.

word_embedding

  • This directory contains the fulltext_bngrams.txt and fulltext_bngrams.bin files used to train and store the word embedding model used in the word-vector-analysis.R script.

code

  • data-preprocessing.R --> This script prepares the original JPG files extracted from the online database Newspapers.com for text analysis. The JPG files were OCR'ed using tesseract and stored as plain text files in a separate directory named txt_files. Using readtext, the text files were read into a data frame containing the document ID and the complete text of each file in each row. This data frame was then joined with the metadata.csv table, resulting in a raw version of the data (prior to cleaning and stop-word removal) stored as raw_data.csv. After cleaning the text by converting it to lower case, removing punctuation and special characters, and addressing some OCR errors, the data was stored as text_data.csv. Note that during this step, stop words are NOT yet removed. Lastly, the script tokenizes the data using tidytext for exploratory data analysis (EDA). The tokenization process includes removing stop words using a customized stop list. The output was stored as tokenized_data.csv, which was then used in the following step (EDA). A hedged sketch of this pipeline appears after this list.

  • exploratory-analyses.R --> This script uses tokenized_data.csv to run a few basic analyses. First, it plots the term frequency of the terms "savage" and "native". Second, it uses a sentiment vocabulary (the AFINN lexicon) to explore the frequency of "positive" and "negative" terms throughout the corpus and across the months of the fair. Note that sentiment analysis methodologies are highly problematic for historical analysis; they are applied here cautiously and merely as an added way of exploring the data to inspire new research questions. A sketch of these analyses appears after this list.

  • named-entity-extraction.R --> This script applies the spacy_parse{spacyR} function to text_data.csv to extract geopolitical entities from the corpus through tokenization. When applicable, the function assigns each token a particular entity classification. The two classifications that mattered for this project were "NORP" (nationalities or religious or political groups) and "GPE" (countries, cities, states). Note that this step required close reading to understand the limitations of automated entity recognition, as well as manual interventions to retrieve important locations and placenames that were not automatically recognized. It also accounts for OCR errors that made some placenames unrecognizable by the algorithm, although the author recognizes that many OCR errors persisted and that a cleaner version of the text data could produce variations in the analytic results. A significant portion of this script is devoted to annotation, data cleaning, and mitigating OCR errors. The last part of the script prepares the data frame with extracted entities for geocoding. Three versions of the data are stored in this step: 1) dataforgeocoding.csv stores the data frame with placenames and a full-address column that can be used for geocoding; 2) google_geodata.csv stores the output of the automated geocoding process using Google's API key; 3) geocoded_placenames.csv stores the final version of the data after removing temporary variables and including the "scale" variable (city, state, country, etc.) that allows for plotting subsets of the data based on scale. A sketch of the extraction and geocoding steps appears after this list.

  • word-vector-analysis.R --> This script relies on @bmschmidt's wordVectors package and his example on historical cookbooks for exploring word embedding models in historical analysis. The script creates a fullcorpus_bngrams.txt file and stores it in a separate directory named word_embedding. It then uses a temporary .bin file to train a model on the full corpus file. After exploring the words that are semantically closest to "progress" and "savage", it plots the terms on two-dimensional visualizations based on their semantic similarity scores. Lastly, the script clusters words across the entire corpus in a way that loosely mimics topic modeling and looks for patterns in the first 10 clusters. It then creates a subset cluster based only on the terms "progress" and "savage", plotting the output as a dendrogram. A sketch of the embedding workflow appears after this list.
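The following is a minimal sketch of the preprocessing pipeline described for data-preprocessing.R, not the script itself. It assumes tesseract, readtext, dplyr, stringr, and tidytext; the file paths, the join key, and the cleaning rules are illustrative assumptions.

```r
# Hedged sketch of the preprocessing steps (OCR -> data frame -> cleaning -> tokens).
library(tesseract)
library(readtext)
library(dplyr)
library(stringr)
library(tidytext)

# OCR each JPG into a plain text file (the JPG scans themselves are not in this repo)
eng <- tesseract("eng")
jpg_files <- list.files("jpg_files", pattern = "\\.jpg$", full.names = TRUE)
for (jpg in jpg_files) {
  txt <- ocr(jpg, engine = eng)
  out <- file.path("txt_files", paste0(tools::file_path_sans_ext(basename(jpg)), ".txt"))
  writeLines(txt, out)
}

# Read the text files into a data frame (doc_id + text) and join the metadata
texts <- readtext("txt_files/*.txt") %>%
  mutate(doc_id = tools::file_path_sans_ext(doc_id))
metadata <- read.csv("data/metadata.csv")
raw_data <- inner_join(texts, metadata, by = "doc_id")

# Clean: lower case, drop punctuation and special characters
text_data <- raw_data %>%
  mutate(text = str_to_lower(text),
         text = str_replace_all(text, "[^a-z\\s]", " "),
         text = str_squish(text))

# Tokenize and remove stop words (a customized stop list would be appended here)
tokenized_data <- text_data %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```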
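Likewise, a compact sketch of the exploratory analyses in exploratory-analyses.R, assuming dplyr, ggplot2, and tidytext; the exact plots and aggregation choices in the actual script may differ.

```r
# Hedged sketch: term frequencies for "savage"/"native" and AFINN sentiment by month.
library(dplyr)
library(ggplot2)
library(tidytext)

tokenized_data <- read.csv("data/tokenized_data.csv")

# Monthly frequency of the two focus terms
tokenized_data %>%
  filter(word %in% c("savage", "native")) %>%
  count(month, word) %>%
  ggplot(aes(x = month, y = n, fill = word)) +
  geom_col(position = "dodge") +
  labs(y = "term frequency")

# AFINN sentiment scores summed per month (may prompt a one-time download via textdata)
afinn <- get_sentiments("afinn")
tokenized_data %>%
  inner_join(afinn, by = "word") %>%
  group_by(month) %>%
  summarise(sentiment = sum(value)) %>%
  ggplot(aes(x = month, y = sentiment)) +
  geom_col()
```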
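A sketch of the entity extraction and geocoding steps in named-entity-extraction.R follows. spacy_parse{spacyR} is named in the script description; the use of ggmap for Google geocoding, the spaCy model name, and the placeholder API key are assumptions made for illustration.

```r
# Hedged sketch: extract GPE/NORP entities with spacyr, count them, geocode them.
library(spacyr)
library(dplyr)
library(ggmap)

spacy_initialize(model = "en_core_web_sm")  # model name is an assumption

text_data <- read.csv("data/text_data.csv")
parsed <- spacy_parse(text_data$text, entity = TRUE)

# Keep geopolitical (GPE) and group (NORP) entities and count their mentions
place_counts <- entity_extract(parsed, type = "named") %>%
  filter(entity_type %in% c("GPE", "NORP")) %>%
  count(entity, name = "count") %>%
  rename(place_name = entity)

# Build a full-address column (the real script merges manually added city/state/
# country/continent fields) and geocode it with a Google API key (placeholder)
dataforgeocoding <- place_counts %>%
  mutate(geo_address = place_name)

register_google(key = "YOUR_API_KEY")
google_geodata <- bind_cols(dataforgeocoding, geocode(dataforgeocoding$geo_address))

spacy_finalize()
```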
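Finally, a sketch of the word embedding workflow with the wordVectors package; file names follow the word_embedding directory listing above, while the training parameters and cluster counts are illustrative assumptions.

```r
# Hedged sketch of training and querying the word embedding model with wordVectors.
library(wordVectors)
library(magrittr)

# Bundle the corpus into an n-gram training file (skip if it already exists)
if (!file.exists("word_embedding/fulltext_bngrams.txt")) {
  prep_word2vec(origin = "txt_files",
                destination = "word_embedding/fulltext_bngrams.txt",
                lowercase = TRUE, bundle_ngrams = 2)
}

# Train the model once, then reload it from the .bin file on later runs
if (!file.exists("word_embedding/fulltext_bngrams.bin")) {
  model <- train_word2vec("word_embedding/fulltext_bngrams.txt",
                          "word_embedding/fulltext_bngrams.bin",
                          vectors = 200, threads = 4, window = 12,
                          iter = 5, negative_samples = 0)
} else {
  model <- read.vectors("word_embedding/fulltext_bngrams.bin")
}

# Terms closest in vector space to the two focus words
model %>% closest_to("progress")
model %>% closest_to("savage")

# Rough k-means clustering of the vocabulary, loosely analogous to topic modeling
set.seed(10)
clustering <- kmeans(model, centers = 10, iter.max = 40)
sapply(1:10, function(i) names(clustering$cluster[clustering$cluster == i])[1:10])
```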

The project is currently a work in progress for the HIST8550 Seminar in Digital History at Clemson University, under the supervision of Dr. Amanda Regan.

Languages

HTML 94.1%, R 5.9%