
CultureBank

Quick Links: [Paper] [Project Page] [dataset-tiktok] [dataset-reddit] [Models]


We provide:

  • the CultureBank datasets collected from TikTok and Reddit (see the quick links above)
  • a data processing pipeline for extracting cultural knowledge from online communities
  • evaluation scripts for cultural awareness, including two downstream tasks
  • fine-tuning scripts and released models

Setup

  1. Set up the environment:

conda env create -f environment.yml

  2. Set up the API keys.
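Which keys you need depends on which hosted models you call (e.g., GPT-4 scores questions in the self-refinement step described below). As a minimal sketch, assuming the standard OPENAI_API_KEY environment-variable convention:

import os
# minimal sketch: fail fast if the key is missing before running API-backed components
assert "OPENAI_API_KEY" in os.environ, "export OPENAI_API_KEY before running the pipeline"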

Data process pipeline


The pipeline contains 9 components (see data_process_pipeline/pipeline/main_pipeline.py); a hedged sketch of how the components chain together follows the list.

  1. data_process_pipeline/pipeline/component_0_culture_relevance_classifier.py: classify if a comment is related to culture
  2. data_process_pipeline/pipeline/component_1_knowledge_extractor.py: extract cultural information from the comment
  3. data_process_pipeline/pipeline/component_2_negation_converter.py: convert positive sentences to negative forms
  4. data_process_pipeline/pipeline/component_3_clustering.py: perform clustering
  5. data_process_pipeline/pipeline/component_4_cluster_summarizer.py: summarize the clusters
  6. data_process_pipeline/pipeline/component_5_topic_normalization.py: normalize the cultural groups and topics
  7. data_process_pipeline/pipeline/component_6_agreement_calculator.py: calculate the agreement values
  8. data_process_pipeline/pipeline/component_7_content_moderation.py: identify potentially controversial and PII data for annotation
  9. data_process_pipeline/pipeline/component_8_final_formatter.py: format the final data
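Conceptually, each component reads its input file (typically the previous component's output) and writes its own output under data_process_pipeline/results/. A minimal sketch of that contract, with hypothetical class and method names rather than the repo's actual interface:

# hypothetical sketch of the per-component contract: read input, process, write output
import pandas as pd

class PipelineComponent:
    def __init__(self, input_path, output_path):
        self.input_path = input_path    # usually the previous component's output file
        self.output_path = output_path

    def process(self, df):
        raise NotImplementedError       # each component implements its own step

    def run(self):
        df = pd.read_csv(self.input_path)
        self.process(df).to_csv(self.output_path, index=False)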

How to run the pipeline

  1. Prepare a data file, e.g., the provided dummy data file

  2. Set up the paths in the config, e.g., the provided config_dummy_data_vanilla_mistral.yaml

  3. Run the following command to execute the components with indices 0,1,3,4,5,6,7,8 in order with that config (note that component 2, the negation converter, is skipped in this example):

python data_process_pipeline/main.py -i 0,1,3,4,5,6,7,8 -c ./data_process_pipeline/configs/config_dummy_data_vanilla_mistral.yaml

  4. The final output will be at data_process_pipeline/results/8_final_formatter/output.csv, as specified in config_dummy_data_vanilla_mistral.yaml.
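All input and output paths come from the YAML config, so it can help to sanity-check them before a long run; a quick, hedged way to do that (the key layout is whatever the sample config defines):

# load the config and eyeball the configured paths before launching the pipeline
import yaml

with open("data_process_pipeline/configs/config_dummy_data_vanilla_mistral.yaml") as f:
    config = yaml.safe_load(f)
print(config)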

How to run individual components

We can also run components individually, but we need to make sure each component's input file exists.

# load the 0th component, relevance_classifier
python data_process_pipeline/main.py -i 0 -c ./data_process_pipeline/configs/config_dummy_data_vanilla_mistral.yaml

Some notes

  • The pipeline also generates a file of potentially controversial data for human annotation (output_file_for_manual_annotation); annotate it and place the result at controversial_annotation_file (a hedged sketch of this round trip follows).
  • We provide two sample configs in data_process_pipeline/configs/; config_dummy_data_vanilla_mistral.yaml, used above, is one of them.
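A hedged sketch of the annotation round trip; the file paths and the annotation column are hypothetical stand-ins for whatever your config specifies:

# hypothetical sketch: fill in annotations by hand, then save where the config expects them
import pandas as pd

flagged = pd.read_csv("for_manual_annotation.csv")            # the output_file_for_manual_annotation path from your config
flagged["annotation"] = ""                                    # hypothetical column for your keep/remove decisions
flagged.to_csv("controversial_annotations.csv", index=False)  # the controversial_annotation_file path from your config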

Evaluation scripts


  1. evaluation/convert_to_desc.py: concatenates the fields in CultureBank data and converts them into free-text paragraphs of cultural descriptors.
  2. evaluation/generate_questions.py: generates questions for grounded evaluation based on the cultural descriptors. The released adapter is listed under Released models below.
  3. evaluation/generate_questions_aug.py: same as evaluation/generate_questions.py but with a self-refinement loop: GPT-4 scores each generated question, and generation is retried until the score is good enough or the maximum number of trials is reached (a hedged sketch of this loop follows the list). The released adapter is listed under Released models below.
  4. evaluation/grounded_eval.py: performs grounded evaluation of language models on the generated cultural questions. If -aug (augmentation) is turned on, the golden cultural descriptor is included in the input, and the resulting golden-knowledge-augmented GPT responses can be used for further SFT training.
  5. evaluation/knowledge_entailment.py: computes the knowledge entailment scores of models' generated responses in the grounded evaluations (a minimal NLI-style sketch follows the list).
  6. evaluation/direct_eval.py: performs direct evaluation of language models on CultureBank data.
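For evaluation/generate_questions_aug.py, a hedged sketch of the self-refinement control flow; generate_question and gpt4_score are hypothetical helpers, and the threshold and trial budget are assumptions, not values taken from the repo:

# hedged sketch of the self-refinement loop: retry until GPT-4 scores the question well enough
MAX_TRIALS = 3   # assumption: the actual trial budget may differ
GOOD_SCORE = 4   # assumption: threshold on a 1-5 judge scale

def refine_question(descriptor, generate_question, gpt4_score):
    question = generate_question(descriptor)
    for _ in range(MAX_TRIALS):
        score, feedback = gpt4_score(question, descriptor)           # GPT-4 as the judge
        if score >= GOOD_SCORE:
            break
        question = generate_question(descriptor, feedback=feedback)  # revise using the feedback
    return question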
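For evaluation/knowledge_entailment.py, a minimal NLI-style sketch of scoring whether a model's response entails a cultural descriptor, using an off-the-shelf MNLI model; this is an illustration, not necessarily the repository's exact setup:

# minimal NLI sketch: score whether the model response entails the cultural descriptor
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

response = "People in Japan usually take their shoes off indoors."  # model output
descriptor = "Removing shoes indoors is customary in Japan."        # cultural descriptor

inputs = tokenizer(response, descriptor, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs[0, 2].item())  # label order for this model: contradiction, neutral, entailment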

Evaluation on two downstream tasks

  1. evaluation/downstream_tasks/cultural_nli.py: evaluates on cultural NLI.
  2. evaluation/downstream_tasks/world_value_survey.py: evaluates on the World Values Survey, based on the methods in this paper.

Fine-tuning scripts

  1. finetuning/sft_mixtral.py: a sample script for supervised fine-tuning (SFT) of a Mixtral model on various tasks (extractor, summarizer, culturally aware model, etc.) with proper data preparation (a hedged TRL-based sketch follows the list).
  2. finetuning/dpo_mixtral.py: a sample script for training a Mixtral model with DPO on various tasks (culturally aware model, etc.) with proper data preparation.
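As a hedged illustration of the SFT setup, not the repository's actual script, a minimal sketch with Hugging Face TRL; the dataset file is hypothetical, and SFTTrainer's argument names vary across TRL versions:

# hedged SFT sketch with Hugging Face TRL (argument names vary across TRL versions)
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_data = load_dataset("json", data_files="culturebank_sft.jsonl", split="train")  # hypothetical file
trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",  # smaller stand-in; the repo's script targets Mixtral
    train_dataset=train_data,
    args=SFTConfig(output_dir="sft_out", max_seq_length=2048),
)
trainer.train()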

Released models

  1. Knowledge extractor
  2. Cluster summarizer
  3. Evaluation question generator
  4. Llama2-7B SFT fine-tuned on CultureBank-TikTok
  5. Mixtral-8X7B SFT fine-tuned on CultureBank-TikTok
  6. Mixtral-8X7B DPO fine-tuned on CultureBank-TikTok

Acknowledgement

The codebase is adapted from Candle (paper), which is released under this license. Thanks for the amazing work!

If you find our work helpful, please consider citing our paper:

@misc{shi2024culturebank,
    title={CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies},
    author={Weiyan Shi and Ryan Li and Yutong Zhang and Caleb Ziems and Chunhua Yu and Raya Horesh and Rogério Abreu de Paula and Diyi Yang},
    year={2024},
    eprint={2404.15238},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

We welcome all kinds of contributions. If you have any questions, feel free to open an issue or email us.

About

License: MIT License

