Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation
Installation • Reproducibility • Usage • Other Resources • Paper • Blog Post • Citation
This repository contains the code accompanying the paper "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation". We also recommend reading our blog post "How EQT Motherbrain uses LLMs to map companies to industry sectors". For any questions, please contact valentin.buchner@eqtpartners.com.
Installation
After cloning this repository, the necessary packages can be installed with:
pip install -r requirements.txt
pip install -e .
# if using a Vertex AI notebook with CUDA
pip3 install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117 --no-cache-dir
Reproducibility
All experiments, including the hyperparameter search, can be reproduced by running the following shell scripts:
bash preprocessing/preprocessing.sh
bash sectors/experiments/run_experiments_gpu.sh
bash sectors/experiments/run_experiments_cpu.sh
Usage
The scripts can also be run individually:
Preprocessing
The preprocessed data for the hatespeech dataset is already contained in this repository. However, preprocessing can be rerun with:
python preprocessing/get_dataset.py
python preprocessing/preprocess_data.py # this line will take ~10 min as it summarizes long descriptions and keyword lists
The preprocessed dataset can be augmented by applying paraphrasing with vicuna:
python preprocessing/paraphrase_augmentation.py
This will create a new dataset at data/[DATASET]/train_augmented.json.
Running The Experiments
For test runs, all the following commands include the --model_name=bigscience/bloom-560m flag, as this model can easily be run on a CPU. It can be replaced with other Hugging Face hosted LLaMA or Bloom models; by default, huggyllama/llama-7b is used. All experimental results are saved as JSON files in the results/[DATASET]/ directory.
N-shot experiments
python sectors/experiments/nshot/nshot.py --model_name bigscience/bloom-560m
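As an illustration of what an n-shot run does, the sketch below assembles a few labeled examples and a query into a single prompt. The template, function name, and field labels ("Company:", "Sectors:") are illustrative assumptions, not the repository's actual prompt format.

```python
def build_nshot_prompt(examples, query,
                       instruction="Assign industry sectors to the company description."):
    """Build an n-shot prompt from (description, sectors) example pairs.

    Hypothetical helper for illustration; the repo's prompt template may differ.
    """
    parts = [instruction]
    for description, sectors in examples:
        # Each in-context example shows a description and its gold sectors
        parts.append(f"Company: {description}\nSectors: {', '.join(sectors)}")
    # The query is left open-ended for the LM to complete
    parts.append(f"Company: {query}\nSectors:")
    return "\n\n".join(parts)
```

With n such examples, the LM completes the final "Sectors:" line, which is then parsed into label predictions.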
In order to use gpt-3.5-turbo as a model for n-shot prompting, a .env file with the OpenAI API credentials needs to be added to the root directory of this repository:
OPENAI_SECRET_KEY = "secret key"
OPENAI_ORGANIZATION_ID = "org id"
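For reference, a .env file in this format can be read with a minimal parser like the one below. This is a hypothetical stdlib-only sketch; in practice a library such as python-dotenv is the more common choice, and the repo's own loading code may differ.

```python
def load_env(path=".env"):
    """Parse KEY = "value" lines from a .env file into a dict.

    Illustrative helper, not the repository's implementation.
    """
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and anything without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env
```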
Embedding Proximity
For these experiments, the embeddings first have to be generated by running:
python embedding_proximity/generate_embeddings.py --model_name bigscience/bloom-560m
# for augmented data
python embedding_proximity/generate_embeddings.py --model_name bigscience/bloom-560m --augmented augmented
Then, the following commands run all embedding proximity experiments:
python embedding_proximity/vector_similarity.py --model_name bigscience/bloom-560m
python embedding_proximity/vector_similarity.py --model_name bigscience/bloom-560m --augmented augmented
python embedding_proximity/vector_similarity.py --type RadiusNN --model_name bigscience/bloom-560m
python embedding_proximity/vector_similarity.py --type RadiusNN --model_name bigscience/bloom-560m --augmented augmented
python embedding_proximity/classification_head/classification_head.py --model_name bigscience/bloom-560m
python embedding_proximity/classification_head/classification_head.py --model_name bigscience/bloom-560m --augmented augmented
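Conceptually, the vector similarity experiments assign labels by comparing a company embedding against labeled training embeddings. The sketch below shows a kNN-style variant with cosine similarity and a multi-label union over neighbors; all names and the pooling of neighbor labels are illustrative assumptions, not the repo's API.

```python
import numpy as np

def knn_predict(train_emb, train_labels, query_emb, k=3):
    """Predict a label set for a query embedding via its k nearest neighbors.

    train_emb: (n, d) array of training embeddings
    train_labels: list of label sets/lists, one per training row
    Illustrative sketch, not the repository's implementation.
    """
    # Normalize so the dot product equals cosine similarity
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = a @ q
    # Indices of the k most similar training embeddings
    top = np.argsort(sims)[::-1][:k]
    # Multi-label prediction: union of the neighbors' label sets
    labels = set()
    for i in top:
        labels.update(train_labels[i])
    return labels
```

The RadiusNN variant selected via --type RadiusNN would instead keep all neighbors whose similarity exceeds a threshold, rather than a fixed k.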
Prompt Tuning
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --interrupt_threshold 0.01
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --interrupt_threshold 0.01 --augmented augmented
PTEC
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --head ch --scheduler exponential --interrupt_threshold 0.01
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --head ch --scheduler exponential --interrupt_threshold 0.01 --augmented augmented
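The PTEC idea combines a tuned soft prompt with a classification head (--head ch). The numpy sketch below only illustrates the shapes involved: trainable prompt embeddings are prepended to the frozen token embeddings, and a head maps a pooled representation to per-sector logits. Mean pooling and all names here are simplifying assumptions, not the repo's actual forward pass.

```python
import numpy as np

def ptec_forward(token_emb, soft_prompt, head_W, head_b):
    """Illustrative PTEC-style forward pass (shapes only, not the repo's code).

    token_emb:   (seq, d) frozen LM token embeddings for the input text
    soft_prompt: (p, d)   trainable prompt embeddings, prepended to the input
    head_W, head_b:       classification head mapping a pooled vector to sector logits
    """
    x = np.vstack([soft_prompt, token_emb])   # prepend the tuned soft prompt
    pooled = x.mean(axis=0)                   # stand-in for the LM's pooled hidden state
    logits = pooled @ head_W + head_b
    # Multi-label output: independent sigmoid per sector, thresholded at 0.5
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (probs > 0.5).astype(int)
```

During training, only the soft prompt and the head parameters would receive gradients; the LM stays frozen.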
Other Resources
For an example of applying Trie Search, see notebooks/constrained_beam_search.ipynb.
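The core of trie-constrained decoding can be sketched as follows: a trie built from the allowed label token sequences tells the decoder, for any generated prefix, which next tokens keep the output a valid label. This is a minimal illustration, not the notebook's implementation.

```python
def build_trie(sequences):
    """Build a nested-dict trie from allowed token sequences."""
    trie = {}
    for seq in sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def allowed_next(trie, prefix):
    """Tokens that may legally follow `prefix`; used to mask the LM's vocabulary
    during constrained beam search. Empty list means the prefix is complete or invalid."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return list(node.keys())
```

At each decoding step, the beam search would set the logits of all tokens outside allowed_next(trie, prefix) to negative infinity, so only valid label sequences can be generated.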
Citation
If you use or refer to this repository in your research, please cite our paper:
BibTeX
@article{buchner2023prompt,
title={Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation},
author={Valentin Leonhard Buchner and Lele Cao and Jan-Christoph Kalo and Vilhelm von Ehrenheim},
year={2023},
eprint={2309.12075},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
APA
Buchner, V. L., Cao, L., Kalo, J.-C., & von Ehrenheim, V. (2023). Prompt tuned embedding classification for multi-label industry sector allocation [Preprint]. arXiv:2309.12075.
MLA
Buchner, Valentin Leonhard, et al. "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation." arXiv preprint arXiv:2309.12075, 2023.