This project uses Wikipedia data for NLP tasks.
Use Jupyter notebooks to access the data (see the setup commands below for how to run Jupyter Lab).
After opening Jupyter Lab in the browser, you can create a new notebook and use the objects from the main module:
from main import wiki, data, spc

page_titles = wiki.search("Luís Gama")  # search Wikipedia for matching titles
page = wiki.page(page_titles[0])        # fetch the first matching page
data.save_page(page)                    # persist the page locally
doc = spc.doc(page.content)             # build a spaCy document from the page text
Use wiki to search Wikipedia and fetch pages.
Use data to save and load data locally.
Use spc to process the text data with spaCy (see the sketch below).
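As an illustration of the spc workflow, here is a minimal sketch of inspecting the resulting document. It assumes spc.doc returns a standard spaCy Doc, which is an assumption about this project's wrapper, not something the code above guarantees:

from main import wiki, spc

page = wiki.page(wiki.search("Luís Gama")[0])
doc = spc.doc(page.content)  # assumed to be a standard spaCy Doc

for ent in doc.ents:          # named entities found by the pipeline
    print(ent.text, ent.label_)

for token in doc[:10]:        # token-level annotations: lemma and part of speech
    print(token.text, token.lemma_, token.pos_)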
Note: in the future we plan to use other libraries such as Gensim, Stanza, and NLTK, to mention a few.
The current solution stores the data in flat files on the OS file system.
An alternative would be a NoSQL database, but that has yet to be implemented.
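For illustration, the flat-file storage could look roughly like the sketch below. The directory layout, field names, and the page attributes (title, content) are assumptions for this sketch, not the project's actual implementation:

import json
from pathlib import Path

DATA_DIR = Path("data/pages")  # assumed storage location

def save_page(page):
    # store each page as one JSON flat file, keyed by its title
    # note: real titles may need sanitizing before use as filenames
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    record = {"title": page.title, "content": page.content}
    path = DATA_DIR / f"{page.title}.json"
    path.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")

def load_page(title):
    # read a previously saved page back from disk
    path = DATA_DIR / f"{title}.json"
    return json.loads(path.read_text(encoding="utf-8"))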
To set up the environment and launch Jupyter Lab:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
jupyter lab
Wikipedia allows us to download all article data, but it comes in XML format.
The latest dump can be found in the file ptwiki-20231020-pages-articles-multistream.xml
The articles index can be found in the index file: ptwiki-20231020-pages-articles-multistream-index.txt
A solution is needed to extract the text data from this XML dump file.
There is a GitHub repo that presents a solution that may be a good fit.
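In the meantime, a minimal sketch using only the Python standard library can stream pages out of the dump. Note that the extracted text is still raw wiki markup, which would need further cleaning (for example with a library such as mwparserfromhell):

import xml.etree.ElementTree as ET

def iter_articles(dump_path):
    # stream the dump instead of loading the whole file into memory
    title, text = None, None
    for _, elem in ET.iterparse(dump_path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the MediaWiki XML namespace
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text
        elif tag == "page":
            yield title, text
            elem.clear()  # release processed elements to keep memory flat

for title, _ in iter_articles("ptwiki-20231020-pages-articles-multistream.xml"):
    print(title)
    break  # show just the first article title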
A web interface to view the Wikipedia data and metadata. Moreover, this interface will show linguistic features annotated by the spaCy library.
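Until that interface exists, spaCy's built-in displaCy visualizer can serve a quick preview in the browser. This sketch again assumes spc.doc returns a standard spaCy Doc:

from spacy import displacy
from main import wiki, spc

page = wiki.page(wiki.search("Luís Gama")[0])
doc = spc.doc(page.content)  # assumed to be a standard spaCy Doc

# serves an entity-highlighting view at http://localhost:5000
displacy.serve(doc, style="ent")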