NTU NLP Lab

NTU NLP Lab's repositories

traditional-chinese-alpaca

A Traditional-Chinese instruction-following model with datasets based on Alpaca.

Language:PythonApache-2.0129 5 3

LifeEventDialog

Life Event Dialog contains fine-grained personal life event annotations on DailyDialog.

A dialogue dataset is an indispensable resource for building a dialogue system. Additional information like emotions and interpersonal relationships labeled on conversations enables the system to capture the emotion flow of the participants in the dialogue. However, there is no publicly available Chinese dialogue dataset with emotion and relation labels. In this paper, we collect the conversions from TV series scripts, and annotate emotion and interpersonal relationship labels on each utterance. This dataset contains 25,548 utterances from 4,142 dialogues. We also set up some experiments to observe the effects of the responded utterance on the current utterance, and the correlation between emotion and relation types in emotion and relation classification tasks.

400

Finance-FinProLex

FinProLex provides 5,162 tokens in professional analysts' reports and the financial social media platform posts with expert-like scores. The expert-like scores are calculated based on the pointwise mutual information (PMI).

400

Finance-NTUSD-Fin

NTUSD-Fin provides various scoring methods including frequency, CFIDF, chi-squared value, market sentiment score and word vector for the tokens. Only the tokens appeared at least ten times and shown significantly difference between expected and observed frequency with chi-squared test are remained in our dictionary. The predetermined significance level is 0.05. The market sentiment score is calculated by substracting the bearish PMI from the bullish PMI. There are 8,331 words, 112 hashtags and 115 emojis in the constructed dictionary, NTUSD-Fin.

401

Chinese-Word-Ordering-Errors-Detection-and-Correction-Corpus

Word Ordering Errors (WOEs) are the most frequent type of grammatical errors at sentence level for non-native Chinese language learners. Learners taking Chinese as a foreign language often place character(s) in the wrong places in sentences, and that results in wrong word(s) or ungrammatical sentences. Besides, there are no clear word boundaries in Chinese sentences.

300

Finance-FinNum

Numeral is the crucial part of financial documents. In order to understand the detail of opinions in financial documents, we should not only analyze the text, but also need to assay the numeric information in depth. Because of the informal writing style, analyzing social media data is more challenging than analyzing news and official documents. FinNum is a dataset for fine-grained numeral understanding in financial social media data - to identify the category of a numeral.

300

Finance-Numeracy-600K

Numeral is the crucial part of in narrative, especially in financial documents. We should not only analyze the text, but also need to assay the numeric information in depth. Numeracy-600K is a dataset for testing the numeracy of machines.

300

NL2KB

A total of 7,139 Chinese relation patterns that cover 1,087 DBpedia properties are extracted and verified by human annotators. This resource can be used for knowledge base construction and knowledge base retrieval (e.g., question-answering).

300

NTU-Chinese-Causal-Corpus

A Chinese causal corpus containing 1,314 pairs of arguments based on the Chinese Discourse Treebank (CDTB) by Li et al. (2014).

300

NTUSD

Sentiment words are employed to compute the tendency of a sentence, and then a document. To detect sentiment words in Chinese documents, a Chinese sentiment dictionary is indispensable. However, a small dictionary may suffer from the problem of coverage. A method to learn sentiment words and their strengths from multiple resources is developed in this task.

MIT300

AMDRD

Analysis Model of Discourse Relations within a Document(AMDRD)

Language:PythonGPL-3.02 10

ICDA

Interactive Clinical Diagnostic Assistant for Medical Interview

Language:Python200

NTU-English-Tense-Predictor

A rule-based English tense predictor based on the output of the dependency parser like Stanford CoreNLP.

Language:Python200

NTU-Irony-Corpus

The NTU Irony Corpus consists of more than 1,000 microblog messages collected from the Plurk website. All the messages in the corpus are in Traditional Chinese and have been confirmed to be ironic. They are marked with three types of labels: (1) ironic word/phrase , (2) context, and (3) rhetoric element.

200