Disclaimer: This repo is experimental and comes with no liability for any damage or misuse. It is not linked to any person, company, or public entity.
Use this repo to automatically generate a word/entity index of the class books in order to prepare for the exam.
- Drop your PDF files in data/allBooqs/*
- Use the naming convention <course_name>_<book_number>.pdf, with numbers starting from 1.
- Run 🏃🏾♂️ the script.
- A CSV file with the index should appear in data/index_<course_name>.csv.
- STUDY!!!
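To make the naming convention concrete, here is a minimal sketch that builds the file paths the steps above imply and reports any PDFs that are missing. The helper names (`expected_book_paths`, `index_path`, `missing_books`) and the course name `mycourse` are hypothetical and not part of this repo.

```python
from pathlib import Path


def expected_book_paths(course_name, n_books, data_dir="data/allBooqs"):
    """Build the PDF paths implied by <course_name>_<book_number>.pdf."""
    # Book numbers start at 1, so the range runs from 1 to n_books inclusive.
    return [
        Path(data_dir) / f"{course_name}_{number}.pdf"
        for number in range(1, n_books + 1)
    ]


def index_path(course_name, out_dir="data"):
    """Location where the generated index CSV is expected to appear."""
    return Path(out_dir) / f"index_{course_name}.csv"


def missing_books(course_name, n_books, data_dir="data/allBooqs"):
    """Return the expected PDFs that are not present on disk."""
    paths = expected_book_paths(course_name, n_books, data_dir)
    return [path for path in paths if not path.exists()]


if __name__ == "__main__":
    # "mycourse" is a placeholder course name used only for illustration.
    for path in missing_books("mycourse", 3):
        print(f"missing: {path}")
    print(f"index will be written to: {index_path('mycourse')}")
```

Running a check like this before the script can catch a misnamed file early, for example numbering that starts at 0 instead of 1.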
Install and execute (with the Poetry package manager):
poetry install
poetry run python run.py --course_name <> --n_books <>
If you are interested in named entity recognition, feel free to play around in sanshacq/ner_tagging.py. It will generate an HTML file with colored entities.
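As an illustration of what "colored entities" in HTML can look like, the sketch below wraps character spans in colored `<mark>` tags using only the standard library. It is not the code in sanshacq/ner_tagging.py. The function name `render_entities`, the color mapping, and the hand-written spans are assumptions made for the example. In practice the spans would come from an NER model.

```python
import html

# Hypothetical label-to-color mapping; any CSS colors would do.
COLORS = {"PERSON": "#ffd966", "ORG": "#9fc5e8", "GPE": "#b6d7a8"}


def render_entities(text, entities):
    """Wrap each (start, end, label) character span in a colored <mark> tag.

    Assumes the spans do not overlap.
    """
    parts = []
    cursor = 0
    for start, end, label in sorted(entities):
        # Escape the plain text between the previous entity and this one.
        parts.append(html.escape(text[cursor:start]))
        color = COLORS.get(label, "#dddddd")
        parts.append(
            f'<mark style="background:{color}">{html.escape(text[start:end])}'
            f" <small>{label}</small></mark>"
        )
        cursor = end
    # Escape whatever follows the last entity.
    parts.append(html.escape(text[cursor:]))
    return "<html><body><p>" + "".join(parts) + "</p></body></html>"


if __name__ == "__main__":
    sample = "Ada Lovelace worked with Charles Babbage in London."
    spans = [(0, 12, "PERSON"), (25, 40, "PERSON"), (44, 50, "GPE")]
    with open("entities.html", "w", encoding="utf-8") as handle:
        handle.write(render_entities(sample, spans))
```

Opening the resulting entities.html in a browser shows each entity highlighted with its label next to it.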
Run sanshacq/n_gram.py to look at the TF-IDF scores of the documents.
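For readers unfamiliar with TF-IDF, here is a minimal sketch of the textbook formula (term frequency times the log of inverse document frequency) in plain Python. It is an assumption that this matches what sanshacq/n_gram.py computes. Libraries often use smoothed or normalized variants, so the actual numbers may differ. The function name `tf_idf` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        scores.append({
            # tf = relative frequency in this document, idf = log(N / df).
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


if __name__ == "__main__":
    books = [
        ["firewall", "packet", "packet"],
        ["firewall", "hash"],
    ]
    for number, book_scores in enumerate(tf_idf(books), start=1):
        print(number, book_scores)
```

A term that appears in every document gets a score of zero under this formula, which is why TF-IDF helps separate book-specific terms from ones that occur everywhere.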
Index:
- Join index table with TF-IDF score table
- Filter based on score
- n-gram
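The items above could be sketched roughly as follows. This is an illustration under assumptions, not the repo's implementation: the row layout (`term`, `book`, `page`), the `(term, book)` score key, and the function names `join_and_filter` and `ngrams` are all hypothetical.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def join_and_filter(index_rows, scores, min_score):
    """Attach a score to each index row and keep rows at or above min_score."""
    kept = []
    for row in index_rows:
        # Hypothetical key: (term, book number). Terms without a score count as 0.
        score = scores.get((row["term"], row["book"]), 0.0)
        if score >= min_score:
            kept.append({**row, "score": score})
    return kept


if __name__ == "__main__":
    index_rows = [
        {"term": "packet", "book": 1, "page": 12},
        {"term": "the", "book": 1, "page": 3},
    ]
    scores = {("packet", 1): 0.46, ("the", 1): 0.0}
    print(join_and_filter(index_rows, scores, min_score=0.1))
    print(ngrams(["packet", "capture", "tool"], 2))
```

Filtering on the score drops low-information terms from the index, and n-grams allow multi-word entries to be indexed as a unit.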
Get the full vocab (assuming nlp is a loaded spaCy pipeline):
list(nlp.vocab.strings)