wikiextract

This package is used to create an artificial dataset of Hindi Errors. It depends on the wikiextractor package from here: https://github.com/attardi/wikiextractor. It does the following:

Downloads a "current" hindi wikipedia dump. Note that it can use any corpus of hindi sentences, such as HindiMonocorp, but we have not tried to do that
Extracts meaningful sentences from the wikipedia dump
Taking these as 'correct' sentences, it creates erroneous sentences using some heuristics and rule based algorithms (see insert_error.py)
For error correction task the erroneous sentences are considered as source file and the correct sentences are considered as the target files

To run, simply:

run bash setup.sh
run bash run_hiwiki.sh
optionally

Note that you may need to update a link inside setup.sh, since wikimedia deletes old dumps.

We do not consider spelling errors to be grammatical errors and as such avoid generating spelling errors. Still, due to inadequacies in the POS tagger that we use, we still generate some spelling errors regardless. In fact much of the code in insert_error.py is based around circumventing these inadequacies.

The ouput of run_hiwiki.sh is a file hiwiki.augmented.edits which contains the parallel corpus of hindi errors in the .edits format. You can convert it to whatever format you like using the tools in scripts/.

About

A package for creating a corpus of artificial errors for grammar correction in Hindi

Languages

Language:Python 97.9%Language:Shell 2.1%