snowood1 / Corpora-for-Conflict-Study

Corpora-for-Conflict-Study

Examples for crawling and preprocessing corpora for conflict study.

./Crawlers: example crawlers for two types of sources. They combine Newspaper3k with manually designed Beautiful Soup patterns.

  • News wires
  • Organizations:
    • UN:   UN News, UNHCR, UNODC, OHCHR
    • Refworld:   Amnesty, USCRI, Immigration and Refugee Board of Canada (IRB)
    • NGO:   Amnesty, HRW.org, New Humanitarian, Rescue.org, PHR.org
    • Others:   CFR.org, FRUS
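
One way the combination of Newspaper3k and hand-written Beautiful Soup patterns could look is sketched below. This is a minimal illustration, not the repository's code: the `SITE_SELECTORS` table, its CSS selectors, the helper names, and the order of fallback (site pattern first, Newspaper3k second) are all assumptions. It needs the third-party packages `beautifulsoup4` and `newspaper3k`.

```python
from urllib.parse import urlparse

# Hypothetical per-site CSS selectors, for illustration only.
# The real hand-written patterns live in ./Crawlers.
SITE_SELECTORS = {
    "news.un.org": "div.article-body",
    "www.hrw.org": "div.article-body",
}


def selector_for(url):
    """Return the hand-written selector for the URL's host, or None."""
    return SITE_SELECTORS.get(urlparse(url).netloc)


def extract_text(url, html):
    """Extract the story text from already-downloaded HTML."""
    selector = selector_for(url)
    if selector is not None:
        # Imported lazily so the module loads without the optional dependency.
        from bs4 import BeautifulSoup  # third-party: beautifulsoup4

        node = BeautifulSoup(html, "html.parser").select_one(selector)
        if node is not None:
            return node.get_text(" ", strip=True)

    # Assumed fallback: Newspaper3k's generic article extraction.
    from newspaper import Article  # third-party: newspaper3k

    article = Article(url)
    article.download(input_html=html)  # reuse the HTML instead of refetching
    article.parse()
    return article.text
```

Keeping the selectors in a table keyed by host means a new organization can be supported by adding one entry, while sources without an entry still go through the generic extractor.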

./Preprocess: preprocessing pipelines for five types of sources. They clean the stories and filter them to the conflict domain.

./Patterns: regular expressions for the most frequent keywords, summarized statistically, used to filter stories to the conflict domain.

  • wiki_relevant:   patterns for filtering relevant news wires
  • irelevant_keywords:   patterns for filtering out irrelevant news wires
  • wiki_relevant_exclude:   additional patterns for filtering out irrelevant Wikipedia documents by category
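
The include/exclude filtering described above can be sketched as follows. The regular expressions here are invented for illustration and are not the contents of the pattern files, and `compile_any` and `is_conflict_story` are hypothetical helper names. The sketch assumes a story is kept when it matches at least one relevant pattern and no irrelevant one.

```python
import re

# Hypothetical patterns, for illustration only; the real ones are in ./Patterns.
RELEVANT = [
    r"\b(?:armed\s+)?conflicts?\b",
    r"\bcease-?fire\b",
    r"\bair\s?strikes?\b",
]
IRRELEVANT = [
    r"\bbox\s+office\b",
    r"\bplayoffs?\b",
]


def compile_any(patterns):
    """Combine several patterns into one case-insensitive alternation."""
    return re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)


RELEVANT_RE = compile_any(RELEVANT)
IRRELEVANT_RE = compile_any(IRRELEVANT)


def is_conflict_story(text):
    """Keep a story that hits a relevant pattern and no irrelevant pattern."""
    return bool(RELEVANT_RE.search(text)) and not IRRELEVANT_RE.search(text)


stories = [
    "Rebels broke the ceasefire near the border.",
    "The playoffs ended in conflict over a disputed call.",
    "Stocks rose today.",
]
kept = [s for s in stories if is_conflict_story(s)]
```

Compiling each list into a single alternation keeps the per-story check to two regex searches, regardless of how many keywords the lists contain.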

About

License: GNU General Public License v3.0


Languages

  • Jupyter Notebook:   99.5%
  • Python:   0.5%