Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation
Installation • Reproducibility • Usage • Other Resources • Paper • Blog Post • Citation
This repository contains the code accompanying the paper "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation". We also recommend reading our blog post "How EQT Motherbrain uses LLMs to map companies to industry sectors". For any questions, please contact valentin.buchner@eqtpartners.com.
Installation
After cloning this repository, the necessary packages can be installed with:
pip install -r requirements.txt
pip install -e .
# if using a Vertex AI notebook with CUDA
pip3 install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117 --no-cache-dir
Reproducibility
All experiments, including the hyperparameter search, can be reproduced by running the following shell scripts:
bash preprocessing/preprocessing.sh
bash sectors/experiments/run_experiments_gpu.sh
bash sectors/experiments/run_experiments_cpu.sh
Usage
The scripts can also be run individually:
Preprocessing
The preprocessed data for the hatespeech dataset is already contained in this repository. However, preprocessing can be rerun with:
python preprocessing/get_dataset.py
python preprocessing/preprocess_data.py # this line will take ~10 min as it summarizes long descriptions and keyword lists
The preprocessed dataset can be augmented by applying paraphrasing with vicuna:
python preprocessing/paraphrase_augmentation.py
This will create a new dataset at data/[DATASET]/train_augmented.json.
Running The Experiments
For test runs, all the following commands include the --model_name=bigscience/bloom-560m flag, as this model can easily be run on a CPU. It can be replaced with other Hugging Face hosted LLaMA or Bloom models; by default, huggyllama/llama-7b is used. All experimental results are saved as JSON files in the results/[DATASET]/ directory.
N-shot experiments
python sectors/experiments/nshot/nshot.py --model_name bigscience/bloom-560m
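As an illustration of what an n-shot run does, the sketch below assembles a few labeled examples and a query into a single prompt. The template, function name, and field labels ("Company:", "Sectors:") are illustrative assumptions, not the repository's actual prompt format.

```python
def build_nshot_prompt(examples, query,
                       instruction="Assign industry sectors to the company description."):
    """Build an n-shot prompt from (description, sectors) example pairs.

    Hypothetical helper for illustration; the repo's prompt template may differ.
    """
    parts = [instruction]
    for description, sectors in examples:
        # Each in-context example shows a description and its gold sectors
        parts.append(f"Company: {description}\nSectors: {', '.join(sectors)}")
    # The query is left open-ended for the LM to complete
    parts.append(f"Company: {query}\nSectors:")
    return "\n\n".join(parts)
```

With n such examples, the LM completes the final "Sectors:" line, which is then parsed into label predictions.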
In order to use gpt-3.5-turbo as a model for n-shot prompting, a .env file with the OpenAI API credentials needs to be added to the root directory of this repository:
OPENAI_SECRET_KEY = "secret key"
OPENAI_ORGANIZATION_ID = "org id"
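For reference, a .env file in this format can be read with a minimal parser like the one below. This is a hypothetical stdlib-only sketch; in practice a library such as python-dotenv is the more common choice, and the repo's own loading code may differ.

```python
def load_env(path=".env"):
    """Parse KEY = "value" lines from a .env file into a dict.

    Illustrative helper, not the repository's implementation.
    """
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and anything without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env
```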
Embedding Proximity
For these experiments, the embeddings first have to be generated by running:
python embedding_proximity/generate_embeddings.py --model_name bigscience/bloom-560m
# for augmented data
python embedding_proximity/generate_embeddings.py --model_name bigscience/bloom-560m --augmented augmented
Then, the following commands run all embedding proximity experiments:
python embedding_proximity/vector_similarity.py --model_name bigscience/bloom-560m
python embedding_proximity/vector_similarity.py --model_name bigscience/bloom-560m --augmented augmented
python embedding_proximity/vector_similarity.py --type RadiusNN --model_name bigscience/bloom-560m
python embedding_proximity/vector_similarity.py --type RadiusNN --model_name bigscience/bloom-560m --augmented augmented
python embedding_proximity/classification_head/classification_head.py --model_name bigscience/bloom-560m
python embedding_proximity/classification_head/classification_head.py --model_name bigscience/bloom-560m --augmented augmented
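Conceptually, the vector similarity experiments assign labels by comparing a company embedding against labeled training embeddings. The sketch below shows a kNN-style variant with cosine similarity and a multi-label union over neighbors; all names and the pooling of neighbor labels are illustrative assumptions, not the repo's API.

```python
import numpy as np

def knn_predict(train_emb, train_labels, query_emb, k=3):
    """Predict a label set for a query embedding via its k nearest neighbors.

    train_emb: (n, d) array of training embeddings
    train_labels: list of label sets/lists, one per training row
    Illustrative sketch, not the repository's implementation.
    """
    # Normalize so the dot product equals cosine similarity
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = a @ q
    # Indices of the k most similar training embeddings
    top = np.argsort(sims)[::-1][:k]
    # Multi-label prediction: union of the neighbors' label sets
    labels = set()
    for i in top:
        labels.update(train_labels[i])
    return labels
```

The RadiusNN variant selected via --type RadiusNN would instead keep all neighbors whose similarity exceeds a threshold, rather than a fixed k.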
Prompt Tuning
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --interrupt_threshold 0.01
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --interrupt_threshold 0.01 --augmented augmented
PTEC
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --head ch --scheduler exponential --interrupt_threshold 0.01
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --head ch --scheduler exponential --interrupt_threshold 0.01 --augmented augmented
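The PTEC idea combines a tuned soft prompt with a classification head (--head ch). The numpy sketch below only illustrates the shapes involved: trainable prompt embeddings are prepended to the frozen token embeddings, and a head maps a pooled representation to per-sector logits. Mean pooling and all names here are simplifying assumptions, not the repo's actual forward pass.

```python
import numpy as np

def ptec_forward(token_emb, soft_prompt, head_W, head_b):
    """Illustrative PTEC-style forward pass (shapes only, not the repo's code).

    token_emb:   (seq, d) frozen LM token embeddings for the input text
    soft_prompt: (p, d)   trainable prompt embeddings, prepended to the input
    head_W, head_b:       classification head mapping a pooled vector to sector logits
    """
    x = np.vstack([soft_prompt, token_emb])   # prepend the tuned soft prompt
    pooled = x.mean(axis=0)                   # stand-in for the LM's pooled hidden state
    logits = pooled @ head_W + head_b
    # Multi-label output: independent sigmoid per sector, thresholded at 0.5
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (probs > 0.5).astype(int)
```

During training, only the soft prompt and the head parameters would receive gradients; the LM stays frozen.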
Other Resources
For an example of applying Trie Search, see notebooks/constrained_beam_search.ipynb.
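The core of trie-constrained decoding can be sketched as follows: a trie built from the allowed label token sequences tells the decoder, for any generated prefix, which next tokens keep the output a valid label. This is a minimal illustration, not the notebook's implementation.

```python
def build_trie(sequences):
    """Build a nested-dict trie from allowed token sequences."""
    trie = {}
    for seq in sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def allowed_next(trie, prefix):
    """Tokens that may legally follow `prefix`; used to mask the LM's vocabulary
    during constrained beam search. Empty list means the prefix is complete or invalid."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return list(node.keys())
```

At each decoding step, the beam search would set the logits of all tokens outside allowed_next(trie, prefix) to negative infinity, so only valid label sequences can be generated.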
Citation
If you use or refer to this repository in your research, please cite our paper:
BibTeX
@article{buchner2023prompt,
title={Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation},
author={Valentin Leonhard Buchner and Lele Cao and Jan-Christoph Kalo and Vilhelm von Ehrenheim},
year={2023},
eprint={2309.12075},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
APA
Buchner, V. L., Cao, L., Kalo, J.-C., & von Ehrenheim, V. (2023). Prompt tuned embedding classification for multi-label industry sector allocation [Preprint]. arXiv:2309.12075.
MLA
Buchner, Valentin Leonhard, et al. "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation." arXiv preprint arXiv:2309.12075, 2023.