sharad461 / english-corpus-nepal

Monolingual corpus comprising Nepal-related content in English. Intended mainly for domain-specific purposes for NLP in the Nepali language.

english-corpus-nepal

A monolingual corpus scraped from English-language newspapers in Nepal. The main purpose is to collect Nepal-related content in English for use in domain-specific natural language processing for the Nepali language.

Files

The source articles are zipped inside the source folder, and the consolidated sentence-level files are in the root directory.
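The layout described above can be read with the Python standard library. The following is a minimal sketch, not the author's tooling: it assumes the archives in the source folder are ordinary .zip files holding UTF-8 text, and that each sentence-level file stores one sentence per line. The function names `iter_zipped_articles` and `read_sentences` are hypothetical, as are any paths you pass to them.

```python
import zipfile
from pathlib import Path
from typing import Iterator, List, Tuple


def iter_zipped_articles(source_dir: str) -> Iterator[Tuple[str, str]]:
    """Yield (member_name, text) for every file inside every .zip in source_dir."""
    for archive in sorted(Path(source_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            for info in zf.infolist():
                if info.is_dir():
                    continue
                # Assumption: articles are UTF-8 text; undecodable bytes are replaced.
                yield info.filename, zf.read(info).decode("utf-8", errors="replace")


def read_sentences(path: str) -> List[str]:
    """Read a sentence-level file, assuming one sentence per line."""
    with open(path, encoding="utf-8") as fh:
        # Skip blank lines and strip surrounding whitespace.
        return [line.strip() for line in fh if line.strip()]
```

Streaming the archive members through a generator avoids unpacking everything to disk, which may matter if the archives are large.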

The lists of links that have already been crawled are inside the crawl-lists folder. If you need more data, you can skip these links in your own crawls to avoid collecting the same articles again.
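Skipping already-crawled links can be sketched as a simple set lookup. This example assumes each file in the crawl-lists folder is plain text with one URL per line; the actual file format is not documented here, so adjust the parsing as needed. The names `load_crawled_links` and `filter_new_links` are hypothetical helpers, not part of this repository.

```python
from pathlib import Path
from typing import Iterable, List, Set


def load_crawled_links(crawl_lists_dir: str) -> Set[str]:
    """Collect every link listed in the files of crawl_lists_dir."""
    seen: Set[str] = set()
    for list_file in sorted(Path(crawl_lists_dir).iterdir()):
        if not list_file.is_file():
            continue
        # Assumption: one URL per line, UTF-8 encoded.
        with open(list_file, encoding="utf-8") as fh:
            seen.update(line.strip() for line in fh if line.strip())
    return seen


def filter_new_links(candidates: Iterable[str], crawled: Set[str]) -> List[str]:
    """Return candidate links not yet crawled, in order, without duplicates."""
    skip = set(crawled)  # copy so the caller's set is not modified
    fresh: List[str] = []
    for url in candidates:
        url = url.strip()
        if url and url not in skip:
            fresh.append(url)
            skip.add(url)
    return fresh
```

The comparison is an exact string match, so links that differ only by a trailing slash or a query string would be treated as new. Normalizing URLs before the lookup is a reasonable extension.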

Crawls

  1. October 9, 2019 (The Kathmandu Post): 3849 article items, 115890 sentences after removing repetitions
  2. October 10, 2019 (The Annapurna Express): 385 article items, 15263 sentences
  3. October 10, 2019 (Republica): 6087 article items, 121858 sentences
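The figures above can be totalled, and the "removing repetitions" step can be illustrated, with a short sketch. The counts are copied from the list; the deduplication shown is an assumption (exact-match, order-preserving) and not necessarily the method used to build this corpus. The names `CRAWLS`, `totals`, and `remove_repetitions` are hypothetical.

```python
from typing import Iterable, List, Tuple

# (source, article items, sentences) as listed in the Crawls section.
CRAWLS: List[Tuple[str, int, int]] = [
    ("The Kathmandu Post", 3849, 115890),
    ("The Annapurna Express", 385, 15263),
    ("Republica", 6087, 121858),
]


def totals(crawls: List[Tuple[str, int, int]]) -> Tuple[int, int]:
    """Return (total article items, total sentences) across all crawls."""
    return sum(c[1] for c in crawls), sum(c[2] for c in crawls)


def remove_repetitions(sentences: Iterable[str]) -> List[str]:
    """Drop exact duplicate sentences while keeping first-seen order."""
    # dict preserves insertion order, so fromkeys acts as an ordered set.
    return list(dict.fromkeys(s.strip() for s in sentences if s.strip()))


if __name__ == "__main__":
    items, sentences = totals(CRAWLS)
    print(items, sentences)  # 10321 253011
```

Across the three crawls this gives 10321 article items and 253011 sentences.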

License: MIT License