This repository and its contents are provided for non-clinical research only. None of the material constitutes actual diagnosis or advice, and help-seekers should get assistance from professional psychiatrists or clinical practitioners. No warranties, express or implied, are offered regarding the accuracy, completeness, or utility of the predictions and explanations. The authors and contributors are not responsible for any errors, omissions, or any consequences arising from the use of the information herein. Users should exercise their own judgment and consult professionals before making any clinical decisions. The use of the software and information contained in this repository is entirely at the user's own risk.
The raw datasets collected to build our IMHI dataset are from public social media platforms such as Reddit and Twitter, and we strictly follow privacy protocols and ethical principles to protect user privacy and guarantee that anonymity is properly applied to all mental health-related texts. In addition, to minimize misuse, all examples provided in our paper are paraphrased and obfuscated using a moderate disguising scheme.
In addition, recent studies have indicated that LLMs may introduce potential biases, such as gender gaps. Meanwhile, incorrect predictions, inappropriate explanations, and over-generalization also illustrate the potential risks of current LLMs. Therefore, many challenges remain in applying these models to real-world mental health monitoring systems.
By using or accessing the information in this repository, you agree to indemnify, defend, and hold harmless the authors, contributors, and any affiliated organizations or persons from any and all claims or damages.
This project presents our efforts towards interpretable mental health analysis with large language models (LLMs). In early work, we comprehensively evaluated the zero-shot/few-shot performance of the latest LLMs, such as ChatGPT and GPT-4, on generating explanations for mental health analysis. Based on these findings, we built the Interpretable Mental Health Instruction (IMHI) dataset with 105K instruction samples, the first multi-task and multi-source instruction-tuning dataset for interpretable mental health analysis on social media. Based on the IMHI dataset, we propose MentaLLaMA, the first open-source instruction-following LLMs for interpretable mental health analysis. MentaLLaMA can perform mental health analysis on social media data and generate high-quality explanations for its predictions. We also introduce the first holistic evaluation benchmark for interpretable mental health analysis with 19K test samples, covering 8 tasks and 10 test sets. Our contributions are presented in these 2 papers:
The MentaLLaMA Paper | The Evaluation Paper
We provide 4 model checkpoints evaluated in the MentaLLaMA paper:
- MentaLLaMA-chat-13B: This model is fine-tuned based on the Meta LLaMA2-chat-13B foundation model and the full IMHI instruction tuning data. The training data covers 8 mental health analysis tasks. The model can follow instructions to make accurate mental health analysis and generate high-quality explanations for the predictions. Due to the model size, inference is relatively slow.
- MentaLLaMA-chat-7B: This model is fine-tuned based on the Meta LLaMA2-chat-7B foundation model and the full IMHI instruction tuning data. The training data covers 8 mental health analysis tasks. The model can follow instructions to make mental health analysis and generate explanations for the predictions.
- MentalBART: This model is fine-tuned based on the BART-large foundation model and the full IMHI-completion data. The training data covers 8 mental health analysis tasks. The model cannot follow instructions, but can make mental health analysis and generate explanations in a completion-based manner. The smaller size of this model allows faster inference and easier deployment.
- MentalT5: This model is fine-tuned based on the T5-large foundation model and the full IMHI-completion data. The model cannot follow instructions, but can make mental health analysis and generate explanations in a completion-based manner. The smaller size of this model allows faster inference and easier deployment.
You can use the MentaLLaMA models in your Python project with the Hugging Face Transformers library. Here is a simple example of how to load the model:
```python
from transformers import LlamaTokenizer, LlamaForCausalLM
tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map='auto')
```
In this example, `LlamaTokenizer` is used to load the tokenizer, and `LlamaForCausalLM` is used to load the model. The `device_map='auto'` argument automatically uses the GPU if one is available. `MODEL_PATH` denotes your model save path.
After loading the models, you can generate a response. Here is an example:
```python
prompt = 'Consider this post: "work, it has been a stressful week! hope it gets better." Question: What is the stress cause of this post?'
# Move the inputs to the model's device (e.g., the GPU when device_map='auto' is used)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
generate_ids = model.generate(inputs.input_ids, max_length=2048)
print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
```
Running this code on MentaLLaMA-chat-13B produces the following response:
Answer: This post shows the stress cause related to work. Reasoning: The post explicitly mentions work as being stressful and expresses a hope that it gets better. This indicates that the poster is experiencing stress in relation to their work, suggesting that work is the primary cause of their stress in this instance.
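The completion-based checkpoints (MentalBART and MentalT5) can be used in a similar way. Below is a minimal sketch, assuming they load as standard Hugging Face seq2seq checkpoints; the `MODEL_PATH` value and the input format are placeholders for illustration:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_PATH = "path/to/MentalBART"  # placeholder: your local save path for MentalBART or MentalT5

# Load a completion-based (non-instruction-following) checkpoint.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_PATH)

# Completion-style input: the post and question, without an instruction wrapper.
text = 'Consider this post: "work, it has been a stressful week! hope it gets better." Question: What is the stress cause of this post?'
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```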
We introduce the first holistic evaluation benchmark for interpretable mental health analysis with 19K test samples. We collect raw test data from 10 existing datasets covering 8 mental health analysis tasks, and convert them into test data for interpretable mental health analysis. Statistics for the 10 test sets are as follows:
Name | Task | Test samples | Data Source | Annotation |
---|---|---|---|---|
DR | depression detection | 405 | Reddit | Weak labels |
CLP | depression detection | 299 | Reddit | Human annotations |
dreaddit | stress detection | 414 | Reddit | Human annotations |
SWMH | mental disorders detection | 10,882 | Reddit | Weak labels |
T-SID | mental disorders detection | 959 | Twitter | Weak labels |
SAD | stress cause detection | 684 | SMS | Human annotations |
CAMS | depression/suicide cause detection | 625 | Reddit | Human annotations |
loneliness | loneliness detection | 531 | Reddit | Human annotations |
MultiWD | wellness dimensions detection | 2,441 | Reddit | Human annotations |
IRF | interpersonal risk factors detection | 2,113 | Reddit | Human annotations |
We currently release the test data from the following sets: DR, dreaddit, SAD, MultiWD, and IRF. The instruction data is put under `/test_data/test_instruction`. The items are easy to follow: the `query` row denotes the question, and the `gpt-3.5-turbo` row denotes our modified and evaluated predictions and explanations from ChatGPT. `gpt-3.5-turbo` is used as the golden response for evaluation.
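For example, you can inspect one released test file with pandas. This is a hedged sketch, assuming the instruction data is stored as CSV files with `query` and `gpt-3.5-turbo` columns; the file name `DR.csv` is illustrative:

```python
import pandas as pd

# Load one released instruction test file (file name is illustrative).
df = pd.read_csv("test_data/test_instruction/DR.csv")
print(df["query"].iloc[0])          # the question posed to the model
print(df["gpt-3.5-turbo"].iloc[0])  # the golden response used for evaluation
```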
To facilitate testing on models without instruction-following ability, we also release part of the test data for IMHI-completion. The data is put under `/test_data/test_complete`. The file layouts are the same as the instruction-tuning data.
To evaluate your trained model on the IMHI benchmark, first load your model and generate responses for all test items. We use the Hugging Face Transformers library to load the model. For LLaMA-based models, you can generate the responses with the following commands:
```bash
cd src
python IMHI.py --model_path MODEL_PATH --batch_size 8 --model_output_path OUTPUT_PATH --test_dataset IMHI --llama --cuda
```
`MODEL_PATH` and `OUTPUT_PATH` denote the model save path and the save path for generated responses. All generated responses will be put under `../model_output`. Some generated examples are shown in `./examples/response_generation_examples`.
You can also evaluate with the IMHI-completion test set with the following commands:
```bash
cd src
python IMHI.py --model_path MODEL_PATH --batch_size 8 --model_output_path OUTPUT_PATH --test_dataset IMHI-completion --llama --cuda
```
You can also load models that are not based on LLaMA by removing the `--llama` argument. In the generated examples, the `goldens` row denotes the reference explanations and the `generated_text` row denotes the generated responses from your model.
The first evaluation metric for our IMHI benchmark evaluates the classification correctness of the model generations. If your model generates very regular responses, a rule-based classifier can reliably assign a label to each response. We provide a rule-based classifier in `IMHI.py`, which you can use during the response generation process by adding the `--rule_calculate` argument to your command. The classifier requires responses to follow this template:

```
[label] Reasoning: [explanation]
```
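As an illustration, a minimal label extractor for this template could look like the sketch below. This mirrors the idea of the rule-based classifier, not the exact implementation in `IMHI.py`:

```python
import re

def extract_label(response: str):
    """Extract the label from a '[label] Reasoning: [explanation]' response."""
    match = re.match(r"\s*(.+?)\s*Reasoning:", response)
    return match.group(1).strip() if match else None

print(extract_label("Yes Reasoning: The post mentions persistent sadness."))  # -> "Yes"
```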
However, as most LLMs are trained to generate diverse responses, a rule-based label classifier is impractical. For example, MentaLLaMA can have the following response for an SAD query:
This post indicates that the poster's sister has tested positive for ovarian cancer and that the family is devastated. This suggests that the cause of stress in this situation is health issues, specifically the sister's diagnosis of ovarian cancer. The post does not mention any other potential stress causes, making health issues the most appropriate label in this case.
To solve this problem, in our MentaLLaMA paper we train 10 neural network classifiers based on MentalBERT, one for each collected raw dataset. The classifiers are trained to assign a classification label given the explanation. We release these 10 classifiers to facilitate future evaluations on the IMHI benchmark.
The models can be downloaded as follows: ...
You can obtain the labels as follows: ...
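As an illustration, the sketch below shows how such a classifier could be applied, assuming each released classifier is a standard Hugging Face sequence-classification checkpoint; `CLASSIFIER_PATH` is a placeholder for the downloaded classifier of one raw dataset:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CLASSIFIER_PATH = "path/to/DR_classifier"  # placeholder: one of the 10 released classifiers

# Load the MentalBERT-based classifier for one raw dataset.
tokenizer = AutoTokenizer.from_pretrained(CLASSIFIER_PATH)
classifier = AutoModelForSequenceClassification.from_pretrained(CLASSIFIER_PATH)

explanation = "This post shows the stress cause related to work."
inputs = tokenizer(explanation, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = classifier(**inputs).logits
label_id = int(logits.argmax(dim=-1))
print(classifier.config.id2label[label_id])  # the assigned classification label
```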
The second evaluation metric for the IMHI benchmark evaluates the quality of the generated explanations. The results in our evaluation paper show that BART-score is moderately correlated with human annotations across 4 human evaluation aspects, and outperforms other automatic evaluation metrics. Therefore, we utilize BART-score to evaluate the quality of the generated explanations. Specifically, you should first generate responses using the `IMHI.py` script and obtain the response dir as in `examples/response_generation_examples`. Then score your responses with BART-score using the following commands:

```bash
cd src
```
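The repository's own scoring command is elided above. As an illustration only, explanations can also be scored directly with the public BARTScore implementation from https://github.com/neulab/BARTScore; this sketch assumes you have loaded parallel lists of generated and golden explanations from your response files:

```python
from bart_score import BARTScorer  # bart_score.py from https://github.com/neulab/BARTScore

# Toy parallel lists; in practice load these from your response files
# (the generated_text and goldens rows described above).
generated_texts = ["This post shows the stress cause related to work."]
golden_texts = ["The post attributes the poster's stress to their work."]

bart_scorer = BARTScorer(device="cuda:0", checkpoint="facebook/bart-large-cnn")
scores = bart_scorer.score(generated_texts, golden_texts, batch_size=4)
print(sum(scores) / len(scores))  # mean BART-score (higher is better)
```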
We release our human annotations on AI-generated explanations to facilitate future research on aligning automatic evaluation tools for interpretable mental health analysis. Based on these human evaluation results, we tested various existing automatic evaluation metrics for their correlation with human preferences. The results in our evaluation paper show that BART-score is moderately correlated with human annotations.
In our evaluation paper, we manually labeled a subset of the AIGC results for the DR dataset in 4 aspects: fluency, completeness, reliability, and overall. The annotations are released in `/human_evaluation/DR_annotation`, where we labeled 163 ChatGPT-generated explanations for the depression detection dataset DR. The file `chatgpt_data.csv` includes 121 explanations that were correctly classified by ChatGPT. `chatgpt_false_data.csv` includes 42 explanations that were incorrectly classified by ChatGPT. We also include 121 explanations that were correctly classified by InstructGPT-3 in `gpt3_data.csv`.
In our MentaLLaMA paper, we invited one domain expert majoring in quantitative psychology to write an explanation for 350 selected posts (35 posts for each raw dataset). This golden set is used to accurately evaluate the explanation-generation ability of LLMs in an automatic manner. To facilitate future research, we release the expert-written explanations for the following datasets: DR, dreaddit, SWMH, T-SID, SAD, CAMS, loneliness, MultiWD, and IRF (35 samples each). The data is released in this dir:
/test_data/test_instruction_expert
The expert-written explanations are processed to follow the same format as other test datasets to facilitate model evaluations. You can test your model on the expert-written golden explanations with similar commands as in response generation. For example, you can test LLaMA-based models as follows:
```bash
cd src
python IMHI.py --model_path MODEL_PATH --batch_size 8 --model_output_path OUTPUT_PATH --test_dataset expert --llama --cuda
```
If you use MentaLLaMA in your work, please cite our paper.
```bibtex
@misc{yang2023mentalllama,
      title={MentaLLaMA: Interpretable Mental Health Analysis on Social Media with Large Language Models},
      author={Kailai Yang and Tianlin Zhang and Ziyan Kuang and Qianqian Xie and Sophia Ananiadou},
      year={2023},
      eprint={2309.13567},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
MentaLLaMA is licensed under [MIT]. Please find more details in the MIT file.