This repository maintains the source code for "Improving Unsupervised Dialogue Topic Segmentation with Utterance-Pair Coherence Scoring", SIGDIAL 2021.
The paper describes two datasets for training the utterance-pair coherence scoring model:
- DailyDialog (for English)
- NaturalConv (for Chinese)

Both datasets are available for download online.
Once the training data is ready, run data_process.py to generate the positive and negative utterance-pair samples used to train the BERT-based coherence scoring model. Please note that the script generates three files:
- dialogues_text.txt
- dialogues_topic.txt
- dialogues_act.txt
All three files are required together for data loading during model training.
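
To illustrate the idea of positive and negative utterance pairs, here is a minimal sketch. It is not the logic of data_process.py. It assumes the text file follows the DailyDialog convention (one dialogue per line, utterances terminated by `__eou__`) and that the topic file holds one topic id per line, aligned with the text file. The sampling rule (adjacent utterances as positives, a random utterance from a dialogue with a different topic as negatives) and the names `parse_dialogue` and `build_pairs` are assumptions made for illustration only.

```python
import random

EOU = "__eou__"  # utterance terminator in DailyDialog's dialogues_text.txt


def parse_dialogue(line):
    """Split one line of dialogue text into a list of utterances."""
    return [u.strip() for u in line.split(EOU) if u.strip()]


def build_pairs(dialogues, topics, rng):
    """Return (first, second, label) triples.

    label 1: the two utterances are adjacent in the same dialogue.
    label 0: the second utterance is sampled from a dialogue with another topic.
    This sampling rule is an assumption, not the repository's exact procedure.
    """
    pairs = []
    for idx, utterances in enumerate(dialogues):
        # Candidate sources for negatives: non-empty dialogues on a different topic.
        others = [d for d, t in zip(dialogues, topics) if t != topics[idx] and d]
        for first, second in zip(utterances, utterances[1:]):
            pairs.append((first, second, 1))
            if others:
                negative = rng.choice(rng.choice(others))
                pairs.append((first, negative, 0))
    return pairs


# In-memory stand-ins for lines read from the text and topic files.
text_lines = [
    "Hi , how are you ? __eou__ Fine , thanks . __eou__ Glad to hear it . __eou__",
    "Is the report ready ? __eou__ Almost , give me an hour . __eou__",
]
topic_lines = ["1", "5"]

dialogues = [parse_dialogue(line) for line in text_lines]
topics = [int(t) for t in topic_lines]
pairs = build_pairs(dialogues, topics, random.Random(0))
```

Passing a seeded `random.Random` keeps the sampled negatives reproducible between runs. The act file is not used in this sketch.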
Please modify the paths in model.py to point to the location where you saved your data files. model.py saves the utterance-pair coherence scoring model, which test.py then uses for topic segmentation inference.
In the evaluation phase, three datasets are used for model testing:
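
For readers unfamiliar with how coherence scores can drive segmentation, the sketch below turns a list of scores for adjacent utterance pairs into topic boundaries. It uses a TextTiling-style depth-score rule with a cutoff of mean minus half a standard deviation. Both the rule and the cutoff are assumptions made here for illustration; consult test.py for the procedure this repository actually uses. The function names and the example scores are hypothetical.

```python
from statistics import mean, pstdev


def depth_scores(scores):
    """Depth of each coherence 'valley' relative to its neighbouring peaks.

    scores[i] is the coherence between utterance i and utterance i + 1.
    """
    depths = []
    for i, s in enumerate(scores):
        # Climb left while scores do not decrease, to find the left peak.
        left = s
        for j in range(i - 1, -1, -1):
            if scores[j] < left:
                break
            left = scores[j]
        # Climb right in the same way, to find the right peak.
        right = s
        for j in range(i + 1, len(scores)):
            if scores[j] < right:
                break
            right = scores[j]
        depths.append((left - s) + (right - s))
    return depths


def segment(scores):
    """Return indices i such that a topic boundary is placed after utterance i."""
    depths = depth_scores(scores)
    # Assumed cutoff: mean depth minus half the (population) standard deviation.
    threshold = mean(depths) - pstdev(depths) / 2
    return [i for i, d in enumerate(depths) if d > threshold]


# Hypothetical coherence scores for an 8-utterance dialogue (7 adjacent pairs).
scores = [0.9, 0.88, 0.2, 0.85, 0.9, 0.3, 0.8]
boundaries = segment(scores)
```

With these example scores, the two sharp drops in coherence (after utterances 2 and 5) are the only positions whose depth exceeds the cutoff.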