MaartenGr / KeyBERT

Minimal keyword extraction with BERT

Home Page: https://MaartenGr.github.io/KeyBERT/


Setup check. Script to get keywords for comparing against SimpleMaths, TextRank and Philology results

hermanpetrov opened this issue · comments

So, currently I am using GPT and some manuals I have read. Did I correctly set up the code and the transformer model? Or are there any suggestions I could use? I will also try with n-grams up to 3.

Maybe some preprocessing suggestions, or how to get results from KeyBERT with POS/NER tags included in the process.

I am currently comparing KeyBERT against SketchEngine (SimpleMaths), TextRank, and philologically found keywords. For my master's I thought KeyBERT would be best. I would like to check its efficiency with Estonian language models (tartuNLP/EstBERT), but also try mBART (facebook/mbart-large-50) and mT5 (google/mt5-base), which could add some value to the research. I also tried reaching out via LinkedIn.

All the best.

import csv
import os
from flair.embeddings import TransformerDocumentEmbeddings
from keybert import KeyBERT
import re

def load_and_preprocess_stopwords(file_path):
    with open(file_path, 'r', encoding='UTF-8') as file:
        # Normalize each stop word by lowering case and removing extra characters
        stopwords = [re.sub(r'\W+', '', line.strip().lower()) for line in file]
    return stopwords

def extract_keywords_and_write(text_path, csv_path, output_path, stopwords):
    # Read the text file
    with open(text_path, 'r', encoding="UTF-8") as file:
        text_content = file.read()

    # Count rows in the corresponding CSV file
    with open(csv_path, 'r', newline='', encoding='UTF-8') as csvfile:
        reader = csv.reader(csvfile)
        next(reader, None)  # Skip the header row
        row_count = sum(1 for row in reader)  # Count rows excluding the header

    # Load the model and extract keywords
    estBERT = TransformerDocumentEmbeddings('tartuNLP/EstBERT')
    kw_model = KeyBERT(model=estBERT)
    keywords = kw_model.extract_keywords(text_content, keyphrase_ngram_range=(1, 1), stop_words=stopwords, nr_candidates=row_count, top_n=row_count)

    # Write keywords to a new CSV file in the output directory
    with open(output_path, 'w', newline='', encoding="UTF-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(("keywords", "score"))  # Write header
        writer.writerows(keywords)  # Write keywords and scores

# Load and preprocess Estonian stopwords
estonian_stopwords = load_and_preprocess_stopwords('estonian-stopwords.txt')

# Define directories
txt_dir = 'raw_text'
csv_dir = 'pre_processed_text_data'
output_dir = 'keybert'
os.makedirs(output_dir, exist_ok=True)

# Process each text file in the txt_dir
for txt_filename in os.listdir(txt_dir):
    if txt_filename.endswith('.txt'):
        base_filename = os.path.splitext(txt_filename)[0]
        csv_filename = f"{base_filename}.csv"
        txt_file_path = os.path.join(txt_dir, txt_filename)
        csv_file_path = os.path.join(csv_dir, csv_filename)
        output_file_path = os.path.join(output_dir, csv_filename)

        if os.path.exists(csv_file_path):
            extract_keywords_and_write(txt_file_path, csv_file_path, output_file_path, estonian_stopwords)




Thanks for sharing your code!

Did I correctly set up the code and the transformer model?

From a quick glance, it seems correctly set up, but it all depends on your definition of "correctly". Are you running into any errors, or do you want to optimize performance/diversity/etc.? What is it exactly that you want checked? What is your use case and what goal do you want to achieve?

It helps if you start by describing your use case first, the problem that you are facing, and the kinds of solutions/feedback you might be looking for. Reading your questions, it isn't clear to me what the main question is.

To illustrate, the following comment is quite broad and does not tell me what kind of feedback you are looking for:

Maybe some preprocessing suggestions, or how to get results from KeyBERT with POS/NER tags included in the process.

In other words, can you specify your question a bit?

Thank you for your reply, really appreciate it!

What is your use case and what goal do you want to achieve?

Most importantly, I wished for the author to check whether the keyword extraction was called correctly and whether my setup was correct, if that makes sense :)

My goal is to create a script in which a user can use three different keyword extraction methods and see each method's rankings. The script will have SketchEngine Simple Maths and TextRank plus KeyBERT with three provided models, all meant for Estonian texts.

My master's goal is to compare LLM-based keywords against existing Simple Maths, TextRank, and philologically found keywords, to measure how accurate LLMs are when used through the minimal and brilliant KeyBERT solution.
The results will include both keywords and keyphrases.

Therefore I need the maximum accuracy that can possibly be achieved with KeyBERT in finding keywords and keyphrases.

What is it exactly that you want checked?

  1. Am I calling the keyword extraction functions correctly with the chosen transformers?
  2. Words like check-in: is it possible to include hyphens? Currently my results have hyphens removed.
  3. Is it possible to have the results without capitals being lower-cased?
  4. For lemmatized KeyBERT results, are there any ways other than pre-processing the text into lemmas and feeding those into KeyBERT?
  5. How do I reference your work properly so I would not leave anything out?
  6. How can I get, along with the keywords extracted by KeyBERT, the corresponding word or phrase NER tags, UPOS, sentence ID or word ID?

Additionally, the question regarding POS and NER tags:
with my own scripts I can get back the word ID, sentence ID, UPOS, and NER tags.
What I meant is that my SimpleMaths and TextRank scripts currently output data with UPOS and NER tags as follows:

in lemma form (with the lemma replaced by the original word):

lemma;upos;ner tag;fc_count;stopword;rfc_count;fcTotalCount;rfcTotalCount;simpleMathsScore
niu;ADJ;O;2;no;278;2127;932062935;725.0390734407061
Merili;PROPN;O;3;no;1993;2127;932062935;449.7504366785572
Merily;PROPN;O;2;no;1255;2127;932062935;401.1511946716528
mustamägi;NOUN;O;3;no;3929;2127;932062935;270.62976758775744

keyphrases in lemma form (lemmas likewise replaced with the original words):

Phrase;UPOS;NER Tags;Keyness Scores;Average Keyness;Count;Log

tund rõhutaja läbuvärk;NOUN + NOUN + NOUN;O + O + O;4.134771512979459 450.8307115352569 470.6407998793465;308.5354276425276;1;"(tund, NOUN, O, t887.csv, sent_id=83, word_id=8); (rõhutaja, NOUN, O, t887.csv, sent_id=83, word_id=9); (läbuvärk, NOUN, O, t887.csv, sent_id=83, word_id=10)"

rajuma sound ansambel;VERB + NOUN + NOUN;O + O + O;431.3461093308251 107.57734338218152 82.77084416232944;207.23143229177867;1;"(rajuma, VERB, O, t887.csv, sent_id=56, word_id=2); (sound, NOUN, O, t887.csv, sent_id=56, word_id=3); (ansambel, NOUN, O, t887.csv, sent_id=56, word_id=4)"

The code has been updated a bit to iterate over all three models and to create, for each model, a subfolder with n-grams ranging from 1 to 3.

import csv
import os
from flair.embeddings import TransformerDocumentEmbeddings
from keybert import KeyBERT
import re

def load_and_preprocess_stopwords(file_path):
    with open(file_path, 'r', encoding='UTF-8') as file:
        stopwords = [re.sub(r'\W+', '', line.strip().lower()) for line in file]
    return stopwords

def extract_keywords_and_write(text_path, csv_path, output_path, stopwords, model_name, ngram):
    # Read the text file
    with open(text_path, 'r', encoding="UTF-8") as file:
        text_content = file.read()

    # Count rows in the corresponding CSV file
    with open(csv_path, 'r', newline='', encoding='UTF-8') as csvfile:
        reader = csv.reader(csvfile)
        next(reader, None)  # Skip the header row
        row_count = sum(1 for row in reader)  # Count rows excluding the header


    # Load the model and extract keywords
    doc_embeddings = TransformerDocumentEmbeddings(model_name)
    kw_model = KeyBERT(model=doc_embeddings)
    keywords = kw_model.extract_keywords(text_content, keyphrase_ngram_range=(ngram, ngram), stop_words=stopwords, nr_candidates=row_count, top_n=row_count)

    # Write keywords to a new CSV file in the output directory
    with open(output_path, 'w', newline='', encoding="UTF-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(("keywords", "score"))  # Write header
        writer.writerows(keywords)  # Write keywords and scores

# Load and preprocess Estonian stopwords
estonian_stopwords = load_and_preprocess_stopwords('estonian-stopwords.txt')

# Define directories
txt_dir = 'raw_text'
csv_dir = 'pre_processed_text_data'
models_dir = 'models'
os.makedirs(models_dir, exist_ok=True)

# Model configurations
models = ['google/mt5-base', 'facebook/mbart-large-50', 'tartuNLP/EstBERT']
ngrams = [1, 2, 3]

# Process each text file in the txt_dir for each model and ngram setting
for model in models:
    model_path = os.path.join(models_dir, model.split('/')[0])  # Model directory
    for ngram in ngrams:
        ngram_path = os.path.join(model_path, f"ngram {ngram}")
        os.makedirs(ngram_path, exist_ok=True)
        
        for txt_filename in os.listdir(txt_dir):
            if txt_filename.endswith('.txt'):
                base_filename = os.path.splitext(txt_filename)[0]
                csv_filename = f"{base_filename}.csv"
                txt_file_path = os.path.join(txt_dir, txt_filename)
                csv_file_path = os.path.join(csv_dir, csv_filename)
                output_file_path = os.path.join(ngram_path, csv_filename)
                
                if os.path.exists(csv_file_path):
                    extract_keywords_and_write(txt_file_path, csv_file_path, output_file_path, estonian_stopwords, model, ngram)

The only repetitive warnings I get are:

C:\Users\herma\anaconda3\envs\FUZE\lib\site-packages\transformers\convert_slow_tokenizer.py:473: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
warnings.warn(

Am I calling the keyword extraction functions correctly with the chosen transformers?

From a pure coding perspective, yes, you are calling the functions correctly. Do note though that sentence-transformers models generally work a bit better (and, in my experience, faster) than flair, so perhaps use one of those instead. Also, make sure to check the MTEB leaderboard for a nice overview of models.
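
For example, a quick sketch of passing a sentence-transformers model directly; the multilingual MiniLM model below is just one possible choice, not a specific recommendation for Estonian:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Pass a sentence-transformers model directly instead of wrapping a flair
# TransformerDocumentEmbeddings around a masked language model.
st_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
kw_model = KeyBERT(model=st_model)

# Short Estonian example ("Tallinn is the capital of Estonia").
keywords = kw_model.extract_keywords("Tallinn on Eesti pealinn.", top_n=5)
```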

Words like check-in: is it possible to include hyphens? Currently my results have hyphens removed.

Any processing during extraction is done through the CountVectorizer, which is where you should change how you would like this processing to be done.

Is it possible to have the results without capitals being lower-cased?

Yes, see my answer above.
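
To make both of the above concrete, here is a rough sketch of a CountVectorizer that keeps in-word hyphens and preserves capitalization; the token pattern is only an illustration, not the only option:

```python
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT

# Token pattern that allows hyphens inside words (e.g. "check-in"), and
# lowercase=False so the original capitalization is kept in the candidates.
vectorizer = CountVectorizer(
    token_pattern=r"(?u)\b\w[\w\-]*\w\b",
    lowercase=False,
    ngram_range=(1, 1),
)

kw_model = KeyBERT()
doc = "The check-in desk at Tallinn Airport opens two hours before departure."
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
```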

For lemmatized KeyBERT results, are there any ways other than pre-processing the text into lemmas and feeding those into KeyBERT?

Not within KeyBERT itself, other than using the CountVectorizer to lemmatize the input words it receives.
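
A rough sketch of that approach, where lemmatize_tokens is a hypothetical placeholder you would replace with an actual Estonian lemmatizer (e.g. EstNLTK or Stanza):

```python
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT

def lemmatize_tokens(doc):
    # Hypothetical placeholder: swap in a real Estonian lemmatizer here
    # (for instance EstNLTK or Stanza). This version only splits on whitespace.
    return doc.split()

# token_pattern=None avoids the warning about the tokenizer overriding the pattern.
vectorizer = CountVectorizer(tokenizer=lemmatize_tokens, token_pattern=None)

kw_model = KeyBERT()
keywords = kw_model.extract_keywords("Tallinn on Eesti pealinn.", vectorizer=vectorizer)
```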

How do I reference your work properly so I would not leave anything out?

Do you mean a citation? If so, then you can follow along with the README.

How can I get, along with the keywords extracted by KeyBERT, the corresponding word or phrase NER tags, UPOS, sentence ID or word ID?

You can use the [KeyphraseVectorizers](https://maartengr.github.io/KeyBERT/guides/countvectorizer.html#keyphrasevectorizers) although I think it is not maintained anymore.
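
Usage would look roughly like the sketch below, assuming the keyphrase-vectorizers package still installs; note that its defaults rely on an English spaCy pipeline and POS pattern, so Estonian would need a different pipeline:

```python
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

# Candidate phrases are selected by part-of-speech pattern instead of a fixed n-gram range.
kw_model = KeyBERT()
doc = "KeyBERT is a minimal keyword extraction technique that leverages BERT embeddings."
keywords = kw_model.extract_keywords(doc, vectorizer=KeyphraseCountVectorizer())
```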

@MaartenGr thank you for your reply and suggestions!

I have now reinstalled my Conda environments and added Torch, and everything runs fast.

For Estonian I chose e5, as it was the first multilingual suggestion fitting Estonian. The results so far are promising. I will for sure update you on the comparison.

I do have a couple more questions.

1. Is the MMR diversity of 0.7 suggested in the example optimal?
2. Regarding the reference, the README is good, but are there any more recent conference articles, journal papers, or blog posts?
3. When I publish my code that uses the KeyBERT package, how should I reference it on GitHub?

Thank you upfront!

Regarding MMR, out of curiosity for my master's:
I am currently going from 0 to 1 to check which value is most accurate for my texts (running it on an RTX 3070 with n-grams from 1:1 to 3:3 and 50 to 200 candidates).

More recent code:

import os
import re
import csv
from torch import cuda
from sentence_transformers import SentenceTransformer
from flair.embeddings import TransformerDocumentEmbeddings
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer

def load_and_preprocess_stopwords(file_path):
    with open(file_path, 'r', encoding='UTF-8') as file:
        stopwords = [re.sub(r'\W+', '', line.strip().lower()) for line in file]
    return stopwords

def custom_tokenizer(doc):
    return re.split(r'[\s,.!?;:()]+', doc)

def read_documents(folder_path):
    docs = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                docs.append((filename, file.read()))
    return docs

def load_model(model_info):
    model_type, model_path = model_info
    if model_type == 'sentence_transformer':
        model = SentenceTransformer(model_path)
    elif model_type == 'flair_transformer':
        model = TransformerDocumentEmbeddings(model_path)
    return KeyBERT(model=model)

def run_models(docs, model, model_name, output_base, ngram_ranges, diversities, lowercase):
    stopwords = load_and_preprocess_stopwords('estonian-stopwords.txt')
    for ngram_range in ngram_ranges:
        vectorizer = CountVectorizer(tokenizer=custom_tokenizer, ngram_range=ngram_range, stop_words=stopwords, token_pattern=None, lowercase=lowercase)
        for diversity in diversities:
            output_dir_path = os.path.join(output_base, f"{model_name}", f"ngram_{ngram_range[0]}_{ngram_range[1]}", f"diversity_{int(diversity*10)}")
            os.makedirs(output_dir_path, exist_ok=True)
            for filename, doc in docs:
                output_path = os.path.join(output_dir_path, f"{filename[:-4]}.csv")
                keywords = model.extract_keywords(doc, use_mmr=True, diversity=diversity, vectorizer=vectorizer, nr_candidates=200, top_n=200)
                with open(output_path, 'w', newline='', encoding='utf-8') as csvfile:
                    writer = csv.writer(csvfile, delimiter=';')
                    writer.writerow(['keyphrase', 'score'])
                    for keyphrase, score in keywords:
                        writer.writerow([keyphrase, score])
            print(f"Finished processing {model_name} at ngram range {ngram_range} and diversity {diversity} with nr_candidates=200 and top_n=200 and lowercase={lowercase}")
    del model  # Free up memory
    if cuda.is_available():
        cuda.empty_cache()

def main():
    base_folders = {
        'raw_text': 'models/raw_text_data',
        'raw_text_lemma': 'models/raw_text_lemma_data',
    }
    lcf_folders = {
        'raw_text': 'models/raw_text_data_LCF',
        'raw_text_lemma': 'models/raw_text_lemma_data_LCF'
    }
    models_info = {
        'LaBSE': ('sentence_transformer', 'sentence-transformers/LaBSE'),
        'multi_e5': ('sentence_transformer', 'intfloat/multilingual-e5-large-instruct'),
        'MiniLM_multi': ('sentence_transformer', 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2'),
        'MiniLM-L12_multi': ('sentence_transformer', 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'),
        'distilbertMulti': ('flair_transformer', 'distilbert/distilbert-base-multilingual-cased'),
        'bertMulti': ('flair_transformer', 'google-bert/bert-base-multilingual-cased'),
        'xlm-roberta': ('flair_transformer', 'FacebookAI/xlm-roberta-base'),
        'EstBERT': ('flair_transformer', 'tartuNLP/EstBERT'),
        'est-roberta': ('flair_transformer', 'EMBEDDIA/est-roberta')
    }
    ngram_ranges = [(1, 1), (2, 2), (3, 3)]
    diversities = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]

    for folder_key in base_folders:
        folder_path = 'raw_text' if 'lemma' not in folder_key else 'raw_text_lemma'
        docs = read_documents(folder_path)
        for model_name, model_info in models_info.items():
            model = load_model(model_info)
            # Process normally
            run_models(docs, model, model_name, base_folders[folder_key], ngram_ranges, diversities, lowercase=True)
            # Process with lowercase=False
            run_models(docs, model, model_name, lcf_folders[folder_key], ngram_ranges, diversities, lowercase=False)

if __name__ == '__main__':
    main()


Is the MMR diversity of 0.7 suggested in the example optimal?

That depends on your use case and your definition of "optimal". For some use cases, a lower value is enough to remove some redundancy, but for others you might want to increase the value if you have many synonyms or are generally interested in more diverse representations. Always make sure to first define what you think is "optimal", "good", "performant", etc.
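
To illustrate the trade-off (0.7 is simply the value used in the documentation example, not a universally optimal setting):

```python
from keybert import KeyBERT

# Illustrative only: compare a lower and a higher diversity value on the same document.
kw_model = KeyBERT()
doc = (
    "Supervised learning is the machine learning task of learning a function "
    "that maps an input to an output based on example input-output pairs."
)

# Lower diversity: keywords stay closer to the document but may be redundant.
print(kw_model.extract_keywords(doc, use_mmr=True, diversity=0.2))

# Higher diversity: more varied keywords, potentially at the cost of relevance.
print(kw_model.extract_keywords(doc, use_mmr=True, diversity=0.7))
```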

  2. Regarding the reference, the README is good, but are there any more recent conference articles, journal papers, or blog posts?

The official documentation contains the most recent information.

  3. When I publish my code that uses the KeyBERT package, how should I reference it on GitHub?

You can cite KeyBERT as mentioned in the README.

Regarding MMR, out of curiosity for my master's:
I am currently going from 0 to 1 to check which value is most accurate for my texts (running it on an RTX 3070 with n-grams from 1:1 to 3:3 and 50 to 200 candidates).

As a small tip: if you have a large dataset, it might be worthwhile to set min_df to at least 2 in the CountVectorizer, as it may significantly reduce memory requirements.
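
For example, something along these lines (a sketch with toy documents; the effect is only noticeable on larger corpora):

```python
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT

# Toy documents; with a real corpus the effect of min_df is much larger.
docs = [
    "KeyBERT extracts keywords using BERT embeddings.",
    "BERT embeddings can also be used for keyword extraction in other languages.",
    "TextRank is a graph-based keyword extraction method.",
]

# min_df=2 ignores candidate terms that appear in fewer than two documents,
# which shrinks the candidate vocabulary (and memory use) on large datasets.
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=2)

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(docs, vectorizer=vectorizer)
```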