## Setup

```sh
git clone git@github.com:MarkusSagen/paai_skr_demo.git
cd paai_skr_demo
```

- Create a virtual environment:

  ```sh
  python -m venv skr
  source skr/bin/activate
  ```

- Install the prerequisites:

  ```sh
  pip install -r requirements.txt
  ```
## Run the demo

- Start the demo application:

  ```sh
  streamlit run app.py
  ```
## Download PDFs and create a topic model

- Follow the setup instructions above
- Download the Swedish stopwords and the Stanza Swedish model:

  ```sh
  python -m nltk.downloader stopwords
  python -c "import stanza; stanza.download('sv')"
  ```
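  After the downloads, the Swedish stopword list can be loaded from NLTK and extended with the custom stopwords (per, ska) noted in the DONE list below; a minimal sketch:

  ```python
  from nltk.corpus import stopwords

  # Swedish stopwords from the NLTK corpus downloaded above
  swedish_stopwords = stopwords.words("swedish")

  # Extend with the custom stopwords from the DONE list
  swedish_stopwords += ["per", "ska"]
  print(len(swedish_stopwords), swedish_stopwords[:5])
  ```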
- Download all the library plans and convert them to pandas DataFrames:

  ```sh
  python scraper/scrape.py
  python convert/pdf.py
  ```
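  Once both scripts finish, the resulting CSV (the `biblioteksplaner.csv` passed to the topic model below) can be sanity-checked with pandas; the column names are not confirmed here and depend on `convert/pdf.py`:

  ```python
  import pandas as pd

  # Load the converted library plans for a quick inspection
  df = pd.read_csv("biblioteksplaner.csv")
  print(df.shape)
  print(df.columns.tolist())  # actual columns depend on convert/pdf.py
  print(df.head())
  ```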
- Some of the PDFs are, for an unknown reason, not saved in the correct format and cannot be opened; fixing this has been left for a later date.
- Create a topic model:

  ```sh
  python topics/model.py \
      --input_dataframe biblioteksplaner.csv \
      --model_name kb \
      --num_topics 50 \
      --output models
  ```
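  For reference, a minimal sketch of what such a BERTopic training run can look like; the text column and the exact embedding checkpoint behind the `kb` name are assumptions, not read from `topics/model.py`:

  ```python
  import pandas as pd
  from bertopic import BERTopic
  from sentence_transformers import SentenceTransformer

  df = pd.read_csv("biblioteksplaner.csv")
  docs = df["text"].tolist()  # assumed column name

  # "kb" presumably refers to a Swedish KB/KBLab checkpoint; the exact
  # model used by topics/model.py is an assumption here
  embedding_model = SentenceTransformer("KBLab/sentence-bert-swedish-cased")
  topic_model = BERTopic(embedding_model=embedding_model, nr_topics=50)
  topics, probs = topic_model.fit_transform(docs)
  ```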
- Run the Streamlit application:

  ```sh
  streamlit run app.py
  ```
## Notes
- `mp` seems to give slightly better topic models than `kb`
- For deployment:
  - Don't save the embedding model with the BERTopic model
  - Don't load the embedding model
  - Ensure the Python version is >= 3.8
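A hedged sketch of the save/load pattern the deployment notes describe, using BERTopic's `save_embedding_model` flag (the `models/kb_50` path is illustrative):

```python
from bertopic import BERTopic

# topic_model is the fitted model from the training step above.
# Save without the large embedding model, per the deployment note.
topic_model.save("models/kb_50", save_embedding_model=False)

# In the deployed app, load without re-attaching the embedding model;
# this is enough for reading precomputed topics. Pass embedding_model=...
# to BERTopic.load only if transform() on new documents is needed.
topic_model = BERTopic.load("models/kb_50")
```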
## DONE
- Downloaded most of the PDFs and made a list of which ones are missing
- Create and compare different topic models
- Initial preprocessing of text
- Create a stopword splitter for Swedish
- Save and load the BERTopic model, topics, probs etc.
- Started on the Streamlit application
  - Multiselect on regions (see the sketch after this list)
  - Styling with custom CSS sheets
- Added custom stopwords (per, ska)
- Remove topics
- Load and prepare topics_per_class
- Script for running and creating multiple different topic models
- Prep for doing lemmatization later
- Allow filtering the text based on region and topics
- Add option to filter in sidebar
- Adjust colors, layout, format, and functions (Streamlit, CSS, JavaScript)
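A minimal sketch of the multiselect/sidebar filter pattern from the items above; the DataFrame and the `region` column name are assumptions for illustration:

```python
import pandas as pd
import streamlit as st

df = pd.read_csv("biblioteksplaner.csv")

# Sidebar multiselect on regions ("region" is an assumed column name)
regions = st.sidebar.multiselect("Region", sorted(df["region"].unique()))

# Show everything when nothing is selected, otherwise filter by region
filtered = df[df["region"].isin(regions)] if regions else df
st.write(filtered)
```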
## Next Steps
- Make region and municipality plots
- Re-run and get feedback from Annika
- Add an option to reset the multiselect bars in the filter
- Do lemmatization and find the best topics
- Continue to clean up the text
- Load two different models (50 and 20 topics)
- Lemmatization in Swedish
  Example of using Stanza:

  ```python
  import stanza

  # English example; see the Swedish variant below
  nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma')
  doc = nlp('Barack Obama was born in Hawaii.')
  print(*[f'word: {word.text+" "}\tlemma: {word.lemma}'
          for sent in doc.sentences for word in sent.words], sep='\n')
  ```
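  The same pipeline for Swedish, assuming the `sv` model downloaded in the setup step (the sample sentence is illustrative):

  ```python
  import stanza

  # Swedish pipeline; requires stanza.download('sv') from the setup step
  nlp = stanza.Pipeline(lang='sv', processors='tokenize,pos,lemma')
  doc = nlp('Biblioteken ska vara tillgängliga för alla.')
  print(*[f'word: {word.text}\tlemma: {word.lemma}'
          for sent in doc.sentences for word in sent.words], sep='\n')
  ```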
## TODO
- Find PDFs not downloaded correctly (check for `text/html` content) and add those PDFs
- Skip converting PDFs without the correct encoding so that broken text doesn't show up
- Allow filtering based on more parameters (municipalities, region size)
- Filtering should have a cut-off point to make searching faster
- Aggregate / summarize / cluster the most similar answers found across documents
- Remove web links in the text (Beautiful Soup); see the sketch after this list
- Allow `annotated_text` to take in markdown for links and clickable buttons
- Remove hashtags, web links, bullet points, etc.
- Hyperparameter tune the topic models
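A rough sketch of the link/hashtag cleanup items above, combining Beautiful Soup for HTML remnants with regexes for bare URLs, hashtags, and bullet points; the exact rules are not settled:

```python
import re
from bs4 import BeautifulSoup

def clean_text(text: str) -> str:
    # Strip HTML tags (and their links) with Beautiful Soup
    text = BeautifulSoup(text, "html.parser").get_text(" ")
    # Remove bare web links and hashtags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"#\w+", " ", text)
    # Remove bullet-point characters and collapse whitespace
    text = re.sub(r"[•▪‣]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Se <a href='https://example.com'>planen</a> • #bibliotek"))
```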