
CultureBank

Quick Links: [Paper] [Project Page] [dataset-tiktok] [dataset-reddit] [Models]


We provide:

  • the CultureBank datasets collected from TikTok and Reddit (see the quick links above)
  • a data processing pipeline for extracting cultural knowledge from online communities
  • evaluation scripts for cultural awareness, including two downstream tasks
  • fine-tuning scripts and released models

Setup

  1. Set up the environment:

conda env create -f environment.yml

  2. Set up the API keys.
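Which keys you need depends on which hosted models you call (e.g., GPT-4 scores questions in the self-refinement step described below). As a minimal sketch, assuming the standard OPENAI_API_KEY environment-variable convention:

import os
# minimal sketch: fail fast if the key is missing before running API-backed components
assert "OPENAI_API_KEY" in os.environ, "export OPENAI_API_KEY before running the pipeline"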

Data process pipeline


The pipeline contains 9 components (see data_process_pipeline/pipeline/main_pipeline.py); a hedged sketch of how the components chain together follows the list.

  1. data_process_pipeline/pipeline/component_0_culture_relevance_classifier.py: classify if a comment is related to culture
  2. data_process_pipeline/pipeline/component_1_knowledge_extractor.py: extract cultural information from the comment
  3. data_process_pipeline/pipeline/component_2_negation_converter.py: convert positive sentences to negative forms
  4. data_process_pipeline/pipeline/component_3_clustering.py: perform clustering
  5. data_process_pipeline/pipeline/component_4_cluster_summarizer.py: summarize the clusters
  6. data_process_pipeline/pipeline/component_5_topic_normalization.py: normalize the cultural groups and topics
  7. data_process_pipeline/pipeline/component_6_agreement_calculator.py: calculate the agreement values
  8. data_process_pipeline/pipeline/component_7_content_moderation.py: identify potentially controversial and PII data for annotation
  9. data_process_pipeline/pipeline/component_8_final_formatter.py: format the final data
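Conceptually, each component reads its input file (typically the previous component's output) and writes its own output under data_process_pipeline/results/. A minimal sketch of that contract, with hypothetical class and method names rather than the repo's actual interface:

# hypothetical sketch of the per-component contract: read input, process, write output
import pandas as pd

class PipelineComponent:
    def __init__(self, input_path, output_path):
        self.input_path = input_path    # usually the previous component's output file
        self.output_path = output_path

    def process(self, df):
        raise NotImplementedError       # each component implements its own step

    def run(self):
        df = pd.read_csv(self.input_path)
        self.process(df).to_csv(self.output_path, index=False)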

How to run the pipeline

  1. Prepare a data file, e.g., the provided dummy data file

  2. Set up the paths in the config, e.g., the provided config_dummy_data_vanilla_mistral.yaml

  3. Run the following command to execute the components with indices 0,1,3,4,5,6,7,8 in order with that config (note that component 2, the negation converter, is skipped in this example):

python data_process_pipeline/main.py -i 0,1,3,4,5,6,7,8 -c ./data_process_pipeline/configs/config_dummy_data_vanilla_mistral.yaml

  4. The final output will be at data_process_pipeline/results/8_final_formatter/output.csv, as specified in config_dummy_data_vanilla_mistral.yaml.
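All input and output paths come from the YAML config, so it can help to sanity-check them before a long run; a quick, hedged way to do that (the key layout is whatever the sample config defines):

# load the config and eyeball the configured paths before launching the pipeline
import yaml

with open("data_process_pipeline/configs/config_dummy_data_vanilla_mistral.yaml") as f:
    config = yaml.safe_load(f)
print(config)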

How to run individual components

We can also run components individually, but we need to make sure each component's input file exists.

# load the 0th component, relevance_classifier
python data_process_pipeline/main.py -i 0 -c ./data_process_pipeline/configs/config_dummy_data_vanilla_mistral.yaml

Some notes

  • The pipeline also generates a file of potentially controversial data for human annotation (output_file_for_manual_annotation); annotate it and place the result at controversial_annotation_file (a hedged sketch of this round trip follows).
  • We provide two sample configs in data_process_pipeline/configs/; config_dummy_data_vanilla_mistral.yaml, used above, is one of them.
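A hedged sketch of the annotation round trip; the file paths and the annotation column are hypothetical stand-ins for whatever your config specifies:

# hypothetical sketch: fill in annotations by hand, then save where the config expects them
import pandas as pd

flagged = pd.read_csv("for_manual_annotation.csv")            # the output_file_for_manual_annotation path from your config
flagged["annotation"] = ""                                    # hypothetical column for your keep/remove decisions
flagged.to_csv("controversial_annotations.csv", index=False)  # the controversial_annotation_file path from your config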

Evaluation scripts


  1. evaluation/convert_to_desc.py: concatenates the fields in CultureBank data and converts them into free-text paragraphs of cultural descriptors.
  2. evaluation/generate_questions.py: generates questions for grounded evaluation based on the cultural descriptors. The released adapter is listed under Released models below.
  3. evaluation/generate_questions_aug.py: same as evaluation/generate_questions.py but with a self-refinement loop: GPT-4 scores each generated question, and generation is retried until the score is good enough or the maximum number of trials is reached (a hedged sketch of this loop follows the list). The released adapter is listed under Released models below.
  4. evaluation/grounded_eval.py: performs grounded evaluation of language models on the generated cultural questions. If -aug (augmentation) is turned on, the golden cultural descriptor is included in the input, and the resulting golden-knowledge-augmented GPT responses can be used for further SFT training.
  5. evaluation/knowledge_entailment.py: computes the knowledge entailment scores of models' generated responses in the grounded evaluations (a minimal NLI-style sketch follows the list).
  6. evaluation/direct_eval.py: performs direct evaluation of language models on CultureBank data.
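For evaluation/generate_questions_aug.py, a hedged sketch of the self-refinement control flow; generate_question and gpt4_score are hypothetical helpers, and the threshold and trial budget are assumptions, not values taken from the repo:

# hedged sketch of the self-refinement loop: retry until GPT-4 scores the question well enough
MAX_TRIALS = 3   # assumption: the actual trial budget may differ
GOOD_SCORE = 4   # assumption: threshold on a 1-5 judge scale

def refine_question(descriptor, generate_question, gpt4_score):
    question = generate_question(descriptor)
    for _ in range(MAX_TRIALS):
        score, feedback = gpt4_score(question, descriptor)           # GPT-4 as the judge
        if score >= GOOD_SCORE:
            break
        question = generate_question(descriptor, feedback=feedback)  # revise using the feedback
    return question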
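For evaluation/knowledge_entailment.py, a minimal NLI-style sketch of scoring whether a model's response entails a cultural descriptor, using an off-the-shelf MNLI model; this is an illustration, not necessarily the repository's exact setup:

# minimal NLI sketch: score whether the model response entails the cultural descriptor
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

response = "People in Japan usually take their shoes off indoors."  # model output
descriptor = "Removing shoes indoors is customary in Japan."        # cultural descriptor

inputs = tokenizer(response, descriptor, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs[0, 2].item())  # label order for this model: contradiction, neutral, entailment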

Evaluation on two downstream tasks

  1. evaluation/downstream_tasks/cultural_nli.py: evaluates on cultural NLI.
  2. evaluation/downstream_tasks/world_value_survey.py: evaluates on the World Values Survey, based on the methods in this paper.

Fine-tuning scripts

  1. finetuning/sft_mixtral.py: a sample script for supervised fine-tuning (SFT) of a Mixtral model on various tasks (extractor, summarizer, culturally aware model, etc.) with proper data preparation (a hedged TRL-based sketch follows the list).
  2. finetuning/dpo_mixtral.py: a sample script for training a Mixtral model with DPO on various tasks (culturally aware model, etc.) with proper data preparation.
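As a hedged illustration of the SFT setup, not the repository's actual script, a minimal sketch with Hugging Face TRL; the dataset file is hypothetical, and SFTTrainer's argument names vary across TRL versions:

# hedged SFT sketch with Hugging Face TRL (argument names vary across TRL versions)
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_data = load_dataset("json", data_files="culturebank_sft.jsonl", split="train")  # hypothetical file
trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",  # smaller stand-in; the repo's script targets Mixtral
    train_dataset=train_data,
    args=SFTConfig(output_dir="sft_out", max_seq_length=2048),
)
trainer.train()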

Released models

  1. Knowledge extractor
  2. Cluster summarizer
  3. Evaluation question generator
  4. Llama2-7B SFT fine-tuned on CultureBank-TikTok
  5. Mixtral-8X7B SFT fine-tuned on CultureBank-TikTok
  6. Mixtral-8X7B DPO fine-tuned on CultureBank-TikTok

Acknowledgement

The codebase is adapted from Candle (paper), which is released under this license. Thanks for the amazing work!

If you find our work helpful, please consider citing our paper:

@misc{shi2024culturebank,
    title={CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies},
    author={Weiyan Shi and Ryan Li and Yutong Zhang and Caleb Ziems and Chunhua Yu and Raya Horesh and Rogério Abreu de Paula and Diyi Yang},
    year={2024},
    eprint={2404.15238},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

We welcome all kinds of contributions. If you have any questions, feel free to open an issue or email us.

About

License: MIT License

