Examples for crawling and preprocessing corpora for conflict study.
./Crawlers: example crawlers for two types of sources. We combine Newspaper3k with manually designed extraction patterns written with Beautiful Soup.
- News wires
- Organizations:
- UN: UN News, UNHCR, UNODC, OHCHR
- Refworld: Amnesty, USCRI, Immigration and Refugee Board of Canada (IRB)
- NGO: Amnesty, HRW.org, New Humanitarian, Rescue.org, PHR.org
- Others: CFR.org, FRUS
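
As a rough illustration of how Newspaper3k and hand-written Beautiful Soup patterns can be combined, the sketch below tries Newspaper3k first and falls back to CSS selectors when no body text is found. It is not the code in ./Crawlers: the function names, the selectors, and the sample HTML are all hypothetical, and each real site needs its own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical page snippet used only for illustration.
SAMPLE_HTML = """
<html><head><title>Sample report</title></head>
<body>
  <div class="nav">Home | About</div>
  <article>
    <h1>Ceasefire talks resume</h1>
    <p>Negotiators met on Monday.</p>
    <p>Fighting continued in the north.</p>
  </article>
</body></html>
"""


def extract_story(html, title_selector="article h1", body_selector="article p"):
    """Extract title and body text with manually chosen CSS selectors.

    The default selectors are assumptions; adjust them per site.
    """
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.select_one(title_selector)
    title = title_tag.get_text(strip=True) if title_tag else ""
    paragraphs = [p.get_text(" ", strip=True) for p in soup.select(body_selector)]
    return {"title": title, "text": "\n".join(p for p in paragraphs if p)}


def crawl(url):
    """Try Newspaper3k first; fall back to the manual selectors.

    Requires network access and the newspaper3k package, so the import
    is kept local and this function is not called in the demo below.
    """
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    if article.text.strip():
        return {"title": article.title, "text": article.text}
    return extract_story(article.html)


# Offline demo on the sample snippet.
story = extract_story(SAMPLE_HTML)
print(story["title"])
print(story["text"])
```

The fallback order is a design choice made for this sketch: a generic extractor covers most news wire pages, while organization sites with unusual layouts get dedicated selectors.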
./Preprocess: preprocessing pipelines for five different types of sources. Each pipeline cleans the raw text and filters for stories in the conflict domain.
- News wires
- Organizations
- Gigaword
- Phoenix Real-Time from the paper
- Wikipedia
- wikiextractor: Modified from the original wikiextractor to output both contexts and categories.
- df_query.csv: Related categories queried from https://petscan.wmflabs.org/
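
One way the extracted categories and the queried category list could be used together is sketched below: load the category names from the CSV, then keep only documents that carry at least one of them. The CSV column name (`title`), the JSON-lines document layout, and the `categories` field are assumptions made for illustration. The actual formats of df_query.csv and of the modified wikiextractor output may differ.

```python
import csv
import io
import json

# Hypothetical stand-ins for df_query.csv and the extractor output.
QUERY_CSV = "title\nCivil_wars\nRebellions\n"
DOCS_JSONL = """
{"title": "Example war", "text": "...", "categories": ["Civil wars", "20th century"]}
{"title": "Example film", "text": "...", "categories": ["Films"]}
"""


def load_categories(handle, column="title"):
    """Read category names from a CSV file object, normalized for matching."""
    return {
        row[column].replace("_", " ").strip().lower()
        for row in csv.DictReader(handle)
        if row.get(column)
    }


def filter_docs(lines, wanted):
    """Yield documents that share at least one category with `wanted`."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        doc = json.loads(line)
        categories = {c.strip().lower() for c in doc.get("categories", [])}
        if categories & wanted:
            yield doc


wanted = load_categories(io.StringIO(QUERY_CSV))
kept = list(filter_docs(DOCS_JSONL.splitlines(), wanted))
print([doc["title"] for doc in kept])
```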
./Patterns: regular expressions built from a statistical summary of the most frequent keywords, used to filter documents for the conflict domain.
- wiki_relevant: patterns for filtering relevant news wires
- irelevant_keywords: patterns for filtering out irrelevant news wires
- wiki_relevant_exclude: additional patterns for filtering out irrelevant Wikipedia documents by category
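
A minimal sketch of keyword-pattern filtering is shown below. The regular expressions and the `min_hits` threshold are made-up placeholders, not the contents of the pattern files listed above. The idea is to drop a story if any exclusion pattern matches and otherwise keep it when enough relevant keywords appear.

```python
import re

# Hypothetical patterns; the real ones live in the files under ./Patterns.
RELEVANT = re.compile(
    r"\b(ceasefire|insurgen\w*|airstrike\w*|militia\w*|rebel\w*)\b",
    re.IGNORECASE,
)
IRRELEVANT = re.compile(
    r"\b(box office|playoff\w*|stock market)\b",
    re.IGNORECASE,
)


def is_conflict_story(text, min_hits=2):
    """Return True if the text looks conflict-related.

    A story is rejected when any exclusion pattern matches, and kept
    when at least `min_hits` relevant keyword matches are found.
    """
    if IRRELEVANT.search(text):
        return False
    return len(RELEVANT.findall(text)) >= min_hits


stories = [
    "Rebels broke the ceasefire after an airstrike.",
    "The rebels lost in the playoffs.",
    "Stocks rose.",
]
kept = [s for s in stories if is_conflict_story(s)]
print(kept)
```

Checking the exclusion pattern first keeps sports or finance stories that borrow conflict vocabulary out of the corpus, at the cost of occasionally dropping a relevant story that mentions an excluded term.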