Web IR / NLP Group @ NUS's repositories
SG-Deep-Question-Generation
This repository contains code and models for the paper: Semantic Graphs for Generating Deep Questions (ACL 2020).
nus-sms-corpus
This is the distribution point for the NUS SMS Corpus, a corpus of SMS (Short Message Service) messages collected for research at the Department of Computer Science at the National University of Singapore. This dataset consists of 67,093 SMS messages taken from the corpus on Mar 9, 2015. The messages largely originate from Singaporeans, mostly students attending the University. They were collected from volunteers who were made aware that their contributions would be made publicly available. The data collectors opportunistically collected as much metadata about the messages and their senders as possible, to enable different types of analyses. This corpus was collected by Tao Chen and Min-Yen Kan. If you use this data, please cite the following paper; for more details, refer to the Citation field. Tao Chen and Min-Yen Kan (2013). Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus. Language Resources and Evaluation, 47(2), pages 299-355. URL: https://link.springer.com/article/10.1007%2Fs10579-012-9197-9
Summarization-Papers
Summarization Papers
FormatEval
[Preprint '24] LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs
lib4moocdata
Library for processing MOOC data dumps. Currently limited to Coursera data.
AutomaticKeyphraseExtraction
Data for Automatic Keyphrase Extraction Task
FormatBiasEval
[Preprint '24] LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs
SemanticTokenizer
Item Tokenization: the future of recommender systems
wing-website
Hugo Blox WING Website pilot
CoAnnotating
This is the official repository for "CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation"
ControllableLyricTranslation
Code for the paper "Songs Across Borders: Singable and Controllable Neural Lyric Translation"
DiSQ-Score
The dataset and official implementation for "Discursive Socratic Questioning: Evaluating the Faithfulness of Language Models’ Understanding of Discourse Relations" (ACL 2024)
LLM-Misinfo-QA
This repository contains the data and code used for the paper "On the Risk of Misinformation Pollution with Large Language Models" (to appear in Findings of EMNLP 2023).
nnose
Codebase for NNOSE: Nearest Neighbor Occupational Skill Extraction
RL-for-Question-Generation
This repository contains code and models for the paper: Exploring Question-Specific Rewards for Generating Deep Questions (COLING 2020).