The GeoNewsMiner (GNM): An interactive spatial humanities tool to visualize geographical references in historical newspapers

The GeoNewsMiner (GNM) is an interactive tool that maps and visualises geographical references in historical newspapers. As a use case, we used Italian immigrant newspapers published in the United States from 1898 to 1920, as collected in the corpus ChroniclItaly (Viola 2018). The corpus was previously tagged for entities using a sequence tagging tool (Riedl and Padó 2018). This tagged version of ChroniclItaly is ChroniclItaly 2.0 and it is available as an open access resource. The overarching goal of this project is to offer new perspectives on the geographies of the past. Methodologically, we extracted from the corpus the geographical references, quantified them, geocoded them, and plotted them on a map. As a visualization tool, we developed the GNM App using a Shiny app. This allowed us to present and analyse the results in an intuitive, interactive, and reproduceable way. This documentation is the step-by-step description of the project.

GNM is available as an open access app. Click here to use the GNM App.

Spatial Humanities
ChroniclItaly
Sequence Tagging
Geocoding with Google API
Installation
Shiny app
Remarks
License
Links
References
The team
How to cite GNM

Spatial Humanities

GNM has been created as an exploratory tool to assist researchers working in spatial humanities, a recently emerged interdisciplinary field -especially within the digital humanities -focused on geographic and conceptual space, particularly from a historical perspective. Essentially, spatial humanities are based on Geographic Information Systems (GIS). GIS are used to georeference a dataset, map it, display it, and analyse it by cross-referencing different layers of information. Such process of cross-referencing has provided researchers working in fields such as environmental history, historical demography, and economic, urban and medieval history with new perspectives leading them to challenge traditional positions and even explore new questions. Drawing from the application of GIS and data mining, GNM aims to illustrate how the visualization of ethnic newspapers' place names can provide researchers with new perspectives on the geographies of the past so as to open up new ways to study migration history.

ChroniclItaly

GNM is based on ChroniclItaly (Viola 2018), a corpus of Italian language newspapers published in the United States between 1898 and 1920. The corpus includes the digitized pages of seven Italian language newspapers published in California, Massachusetts, Pennsylvania, Vermont, and West Virginia. The collection includes the following titles: L’Italia, Cronaca sovversiva, La libera parola, The patriot, La ragione, La rassegna, and La sentinella del West Virginia. ChroniclItaly, which totals 4,810 issues and contains 16,624,571 words, gathers the front pages of each issue and it was collected from Chronicling America, an Internet-based, searchable database of U.S. newspapers published in the United States from 1789 to 1963 made available by the Library of Congress. ChroniclItaly is available as an open access resource.

Sequence Tagging

ChroniclItaly was first tagged for entities using a sequence tagging tool (Riedl and Padó 2018) that implements Tensorflow. The Italian language model of the sequence tagging tool was trained on I-CAB (Italian Content Annotation Bank), an open access corpus annotated for entities (i.e. persons, organizations, locations, and geopolitical entities), temporal expressions, and relations between entities. I-CAB contains 525 news articles taken from the newspaper L'Adige and totals up around 180,000 words. Once the training was complete, the output had the following format:

il      il      KNOWN   O       O
principio      principio      KNOWN   O       O
delle   delle   KNOWN   O       O
ostilità       ostilità       KNOWN   O       O
fra     fra     KNOWN   O       O
la      la      KNOWN   O       O
Spagna  spagna  KNOWN   O       B-GPE
e       e       KNOWN   O       O
gli     gli     KNOWN   O       O
Stati   stati   KNOWN   O       B-GPE
Uniti   uniti   KNOWN   O       I-GPE
.       .       KNOWN   O       O

The first column is the input word, the second column specifies the pre-processed, lowercased word, the third column contains a flag, that is whether the word has been known during training (KNOWN) or not (UNKNOWN). If labels are assigned to the input file, these will appear in the third column. The last column contains the predicted tags. The no-entity tag is O. Because some entities (e.g., Stati Uniti "United States") have multiple words, the tagging scheme distinguishes between the beginning (tag B-...) or the inside of an entity (tag I-...). The tags are:

LOC -> Location
GPE -> Geo-political entity
PER -> Person
ORG -> Organization

The sequence tagging retrieved 1369 unique locations which had been referred to 214110 times throughout the whole corpus. Because each individual document is time stamped, the number of references to each location was quantified at any given time within the timeframe of ChroniclItaly (1898-1920). Afterwards, locations were geocoded by using the Google API.

Geocoding with Google API

In order to perform geocoding, users must have an API key. The API key is a unique identifier, a 40-character long string that is used to authenticate requests associated with a project. To get an API key, visit the Google Cloud Platform Console.

Geocoding with Google is a two-stage process which requires Google Geocoding API. First, Google Geocoding API provides users with a Place ID for each location. The Place ID uniquely identifies a place as it is stored in the Google Places database and on Google Maps. Because the language of the dataset is Italian, the language of the API was set to Italian. It was found that setting the API language as the language of the data-set improves the accuracy of the geocoding results. At the same time, however, the results are also returned in Italian. In order to have the results returned in Enlgish, only the Place ID was extracted in this first stage. Once the Place ID was received, it was possible to use the Google Geocoding API to perform reverse geocoding, that is to obtain all the details for the location in English (e.g., geo-coordinates, administrative level). In this way, the results are dispalyed in English thus making the GNM App more user-friendly.

The geocoding process was performed only for locations that occurred 8 times or more in the entire corpus. This methodological decision was made to overcome two limitations: 1) noise in the original data mainly due to OCR mistakes and 2) a relatively high number of false positives found by the sequence tagging tool <8 'locations'. Please refer to /REMARKS.md for further details on the accuracy of the sequence tagger for Italian models.

Understanding the Google Geocoding results

In addition to the geo-coordinates, Google also provides additional details, such as the tag type[] which indicates why Google has provided certain geo-coordinates to a location. Some of these geo-coordinates may have been "misinterpreted", therefore understanding the type[] of a location is important. "Misinterpretations" may happen especially when working with historical data. Because the Google Places database stores places based on a contemporary world map, the locations in a historical dataset may have changed name or may no longer exist. Moreover, categories such as country, city, region, municipality, etc. which Google uses to determine the location type[] are highly dependent on the location itself and consequently, certain categories may not apply or they may change from place to place.

For example, within the level country, Google distinguishes between five sub-levels: administrative_area_level_1, administrative_area_level_2, administrative_area_level_3, and so on. Generally speaking, the levels proceed from bigger (level 1) to smaller (level 5). The type locality corresponds to a city or town. In the United States, for instance, level_1 would correspond to a State, level_2 to a county and so on. However, not every country would have States, counties, etc. In geocoding the places in ChroniclItaly 2.0, we encountered cases in which the level given by Google required a manual edit. Specifically, three main types of manual edits needed to be performed:

- Changing the level to a higher level:

This was the case of cities like Naples which Google tagged as administrative_area_level_2 (municipality) instead of locality;

- Aggregating locations according to a higher level:

This was the case of places like S. Pietro -the square in Vatican City -or New York Harbor. In these cases, we manually changed the type (establishment) to match that of the corresponding country or city (locality);

- Changing a wrong type[]:

This was the case of places that no longer exist or that have changed name. An example of this kind is the city of Zara in Croatia which today is called Zadar and which was a province of the Kingdom of Italy from 1918 to 1947. However, Google tagged Zara as establishment and gave it the geo-coordinates of a Zara department store in the United States.

All the locations that have been manually edited are marked in red in the file data/output_20191622_041444_geocoding_edit.xlsx. This file can be compared with the file output/output_20191622_041444_geocoding.xlsx to see the original output. For more information on manually editing the locations, please refer to the section Fixing geocoding errors in the notebook. For an overview of all the possible types[] tags, please refer to the the Google Maps Platform documentation.

Installation

Please install Python 3.6 or higher to run the notebook. The notebook makes use of the dependencies found in requirements.txt. Install the dependencies with pip by running the following line of code in your terminal.

pip install -r requirements.txt

Shiny app

To explore the data, a Shiny app was developed. A live version of the GNM app is available open access. The GNM app is an interactive tool which allows users to explore references to place names (e.g., city, region, and/or country) by selecting a wide range of filters, as follows:

Time: from 1898 to 1920;
Newspaper’s title
Type of frequency visualization: absolute, percentage
Geographical information: include/exclude references to cities/regions
Percentile: select the least/most mentioned places
Choose the map: 1880, 1914, 1920, 1994.

Users can also share their results, download the raw data which may reflect their filters’ selection, and download the visualised map as a .png file.

The Shiny app requires R. Please install the dependencies found in gnm_shiny_app/global.R. The source code of the app can be found in the /gnm_shiny_app folder.

Remarks

Please see /REMARKS.md for technical remarks regarding OCR, entity recognition and geocoding in this project.

License

The software is licensed under GPL-3.

References

Harris, T. M., Bodenhamer, D. J. and Corrigan, J. (Eds.) 2010. The Spatial Humanities: GIS and the Future of Humanities Scholarship. Bloomington/Indianapolis: Indiana University Press.

Riedl, M. and Padó, S. 2018. "A Named Entity Recognition Shootout for German". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 120–125. Melbourne, Australia, July 15 - 20, 2018.

Viola, L. and Verheul, J. 2019. "Mining ethncity: Discourse-driven topic modelling of immigrant discourses in the USA, 1898-1920". Digital Scholarship in the Humanities. DOI: 10.1093/llc/fqz068.

The team

GeoNewsMiner is a project by Lorella Viola and Jaap Verheul. This project was funded by the Utrecht University Innovation Fund for Research in IT and received support from the Research Engineering team of Utrecht University. The technical implementation was provided by Jonathan de Bruin (j.debruin1@uu.nl) and Casper Caandorp. The Shiny app was developed by Kees van Eijden (k.v.eijden@uu.nl).

How to cite GNM

To cite the GNM repository, please use the following format according to the APA style guide:

Viola, Lorella, de Bruin, Jonathan, van Eijden, Kees and Verheul, Jaap. 2019. The GeoNewsMiner (GNM): An interactive spatial humanities tool to visualize geographical references in historical newspapers (v1.0.0). Utrecht: Utrecht University. Available at https://github.com/lorellav/GeoNewsMiner. DOI: 10.5281/zenodo.3484591

lorellav / GeoNewsMiner