Protozet / WikiDoMiner

Mining Wikipedia

WikiDoMiner: Wikipedia Domain-specific Miner

WikiDoMiner is a tool that automatically generates domain-specific corpora by crawling Wikipedia.

Installation

Clone the repository and install the required libraries:

git clone https://gitlab.uni.lu/sezzini/WikiDoMiner.git
cd WikiDoMiner
pip install -r requirements.txt 

Usage example

CLI:

python WikiDoMiner.py --doc Xfile.txt --output-path ../research/nlp --wiki-depth 1

List the available arguments with:

python WikiDoMiner.py --help
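The CLI call above can also be driven from a Python script, for example to process several documents in a loop. The following is a minimal sketch, not part of WikiDoMiner itself: `build_command` and `run_wikidominer` are hypothetical helper names, and the only flags used are the three shown in the usage example (`--doc`, `--output-path`, `--wiki-depth`).

```python
import subprocess
import sys


def build_command(doc, output_path, wiki_depth=1, script="WikiDoMiner.py"):
    """Build the argument list for the CLI call shown in the usage example.

    Hypothetical helper: it only assembles the three documented flags.
    """
    return [
        sys.executable, script,
        "--doc", str(doc),
        "--output-path", str(output_path),
        "--wiki-depth", str(wiki_depth),
    ]


def run_wikidominer(doc, output_path, wiki_depth=1):
    """Run the CLI from the WikiDoMiner directory; raises CalledProcessError on failure."""
    return subprocess.run(build_command(doc, output_path, wiki_depth), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a document path contains spaces.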

Notebook (can be opened in Colab):

# extract keywords
keywords = getKeywords(document, spacy_pipeline)

# query wikipedia to get your corpus
corpus = getCorpus(keywords, depth=1)

# locally save your corpus 
saveCorpus(corpus, parent_dir='Documents', folder='Corpus')
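To make the `depth` and `parent_dir`/`folder` parameters above more concrete, here is a minimal sketch of what a depth-limited expansion and a save step could look like. It is an illustration under stated assumptions, not the tool's implementation: the corpus is assumed to be a dict mapping article titles to text, the link structure is a toy in-memory dict instead of live Wikipedia queries, and `expand_titles` and `save_corpus_sketch` are hypothetical names.

```python
from collections import deque
from pathlib import Path


def expand_titles(seeds, links, depth=1):
    """Breadth-first expansion: collect every title reachable within `depth` hops.

    `links` is a toy mapping {title: [related titles]} standing in for Wikipedia.
    """
    seen = set(seeds)
    queue = deque((title, 0) for title in seeds)
    while queue:
        title, level = queue.popleft()
        if level == depth:
            continue  # do not expand beyond the requested depth
        for neighbour in links.get(title, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, level + 1))
    return seen


def save_corpus_sketch(corpus, parent_dir="Documents", folder="Corpus"):
    """Write each article to <parent_dir>/<folder>/<title>.txt and return the folder path.

    Assumes `corpus` is a {title: text} dict; the real corpus structure may differ.
    """
    target = Path(parent_dir) / folder
    target.mkdir(parents=True, exist_ok=True)
    for title, text in corpus.items():
        # Replace characters that are unsafe in file names (e.g. "/").
        safe_name = "".join(c if c.isalnum() or c in " -_" else "_" for c in title)
        (target / f"{safe_name}.txt").write_text(text, encoding="utf-8")
    return target
```

In this sketch, `depth=0` keeps only the seed titles and each additional level follows one more hop, so the corpus can grow quickly as the depth increases.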

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

KeyBERT for Keyword Extraction

To run KeyBERT, follow the instructions in the README.md file in the KeyBERT-Master folder. Once all required libraries are installed, change to the WikiDoMiner directory and run bert.py. By default, the script uses the "all_requirements" folder. To run it on different documents, add a folder laid out in the same format as "all_requirements" and set the "directory" variable to the name of that folder.
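As a rough illustration of the folder-plus-`directory` setup described above, the sketch below reads every document in a folder and extracts keywords with KeyBERT. It is not the contents of bert.py: the assumption that the folder holds plain `.txt` files, and the helper names `load_documents` and `extract_keywords`, are hypothetical. The only KeyBERT call used is `KeyBERT().extract_keywords`, which returns a list of (keyword, score) pairs for a single document.

```python
from pathlib import Path

# Folder name taken from the README; the .txt layout is an assumption.
directory = "all_requirements"


def load_documents(directory):
    """Read every .txt file in `directory` into a {filename: text} dict, sorted by name."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(directory).glob("*.txt"))
    }


def extract_keywords(documents, top_n=5):
    """Return {filename: [(keyword, score), ...]} for each loaded document."""
    # Imported lazily so that load_documents works without KeyBERT installed.
    from keybert import KeyBERT

    model = KeyBERT()
    return {
        name: model.extract_keywords(
            text,
            keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
            stop_words="english",
            top_n=top_n,
        )
        for name, text in documents.items()
    }
```

Typical use would be `extract_keywords(load_documents(directory))`; pointing `directory` at another folder switches the input set without touching the rest of the code.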

License

MIT

Command used to run: python WikiDoMiner.py --doc cctns.pdf --output-path ./research/nlp --wiki-depth 1
