ROCLING 2023 Shared Task for Chinese Multi-genre Named Entity Recognition in the Healthcare Domain (MultiNER-Health)

The goal of the MultiNER-Health shared task is to develop and evaluate Chinese NER systems for healthcare texts written in different genres. The input is a sentence labeled as one of three genres, i.e., formal texts (FT), social media (SM), and Wikipedia articles (WA), that may contain named entities. The NER system should predict the boundaries and category of each named entity in the sentence. Following the settings of the ROCLING-2022 shared task (Lee et al., 2022), we use the common BIO format for our MultiNER-Health task. The B (Beginning) prefix before a tag indicates that the character is the beginning of a named entity, the I (Inside) prefix indicates that the character is inside a named entity, and O (Outside) indicates that the character belongs to no named entity. We use the same entity types defined in the Chinese HealthNER Corpus (Lee and Lu, 2021). A total of 10 types are defined for this Chinese healthcare NER task; descriptions and examples are provided in Table 1.

Entity Type (Tag)	Description	Examples
Body (BODY)	The whole physical structure that forms a person or animal, including biological cells, tissues, organs and systems.	“細胞核” (nucleus), “神經組織” (nerve tissue), “左心房” (left atrium), “脊髓” (spinal cord), “呼吸系統” (respiratory system)
Symptom (SYMP)	Any feeling of illness or physical or mental change that is caused by a particular disease.	“流鼻水” (rhinorrhea), “咳嗽” (cough), “貧血” (anemia), “失眠” (insomnia), “心悸” (palpitation), “耳鳴” (tinnitus)
Instrument (INST)	A tool or other device used for performing a particular medical task such as diagnosis and treatments.	“血壓計” (blood pressure meter), “達文西手臂” (DaVinci Robots), “體脂肪計” (body fat monitor), “雷射手術刀” (laser scalpel)
Examination (EXAM)	The act of looking at or checking something carefully in order to discover possible diseases.	“聽力檢查” (hearing test), “腦電波圖” (electroencephalography; EEG), “核磁共振造影” (magnetic resonance imaging; MRI)
Chemical (CHEM)	Any basic chemical element typically found in the human body.	“去氧核糖核酸” (deoxyribonucleic acid; DNA), “糖化血色素” (glycated hemoglobin), “膽固醇” (cholesterol), “尿酸” (uric acid)
Disease (DISE)	An illness of people or animals caused by infection or a failure of health rather than by an accident.	“小兒麻痺症” (poliomyelitis; polio), “帕金森氏症” (Parkinson’s disease), “青光眼” (glaucoma), “肺結核” (tuberculosis)
Drug (DRUG)	Any natural or artificially made chemical used as a medicine.	“阿斯匹靈” (aspirin), “普拿疼” (acetaminophen), “青黴素” (penicillin), “流感疫苗” (influenza vaccination)
Supplement (SUPP)	Something added to something else to improve human health.	“維他命” (vitamin), “膠原蛋白” (collagen), “益生菌” (probiotics), “葡萄糖胺” (glucosamine), “葉黃素” (lutein)
Treatment (TREAT)	A method of behavior used to treat diseases.	“藥物治療” (pharmacotherapy), “胃切除術” (gastrectomy), “標靶治療” (targeted therapy), “外科手術” (surgery)
Time (TIME)	Element of existence measured in minutes, days, years.	“嬰兒期” (infancy), “幼兒時期” (early childhood), “青春期” (adolescence), “生理期” (on one’s period), “孕期” (pregnancy)

Table 1: Named entity types with descriptions and examples.
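The tagging scheme can be made concrete with a short sketch. It assumes entities are given as (start, end, tag) character offsets with an exclusive end, which is a hypothetical representation chosen for illustration and not one prescribed by the task. Under that assumption, the ten tags in Table 1 expand into 21 BIO labels, and a sentence is encoded one character at a time. The sample sentence is Example 3 from Table 2.

```python
# Tags from Table 1. The span representation used below is an assumption.
ENTITY_TAGS = ["BODY", "SYMP", "INST", "EXAM", "CHEM",
               "DISE", "DRUG", "SUPP", "TREAT", "TIME"]

# One B- and one I- label per entity type, plus the single O label.
LABELS = ["O"] + [f"{prefix}-{tag}" for tag in ENTITY_TAGS for prefix in ("B", "I")]


def encode_bio(sentence, spans):
    """Return one BIO tag per character.

    spans: iterable of (start, end, tag) with end exclusive (hypothetical format).
    """
    tags = ["O"] * len(sentence)
    for start, end, tag in spans:
        if tag not in ENTITY_TAGS:
            raise ValueError(f"unknown entity type: {tag}")
        tags[start] = f"B-{tag}"          # first character of the entity
        for i in range(start + 1, end):   # remaining characters of the entity
            tags[i] = f"I-{tag}"
    return tags


sentence = "如何治療胃食道逆流症?"
tags = encode_bio(sentence, [(4, 10, "DISE")])
for char, tag in zip(sentence, tags):
    print(char, tag)
```

Keeping the label list explicit makes it easy to check that a model's output layer covers every B-/I- combination plus O.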

Examples

Example sentences are presented in Table 2. The input is a sentence consisting of a sequence of character-based tokens, including punctuation. The NER system returns the corresponding BIO tags aligned to each token as the output. In Example 1 from the FT genre, “老化” (aging) belongs to the Symptom (SYMP) entity type and “阿茲海默症” (Alzheimer’s disease) belongs to the Disease (DISE) type. “痤瘡” (acne) in Example 5 from the WA genre is also a disease (DISE), and is the formal term for “痘痘” in Example 2 from the SM genre. “燒心” in Example 6 from the WA genre is a colloquial term for the disease “胃食道逆流症” (gastroesophageal reflux disease) in Example 3 from the SM genre.

Genre	Examples	Input & Output
Formal Texts (FT)	Ex 1	Input: 早起也能預防老化，甚至降低阿茲海默症的風險 Output: O, O, O, O, O, O, B-SYMP, I-SYMP, O, O, O, O, O, B-DISE, I-DISE, I-DISE, I-DISE, I-DISE, O, O, O
Formal Texts (FT)	Ex 2	Input: 壓力、月經引起的痘痘患者 Output: B-SYMP, I-SYMP, O, B-TIME, I-TIME, O, O, O, B-DISE, I-DISE, O, O
Social Media (SM)	Ex 3	Input: 如何治療胃食道逆流症? Output: O, O, O, O, B-DISE, I-DISE, I-DISE, I-DISE, I-DISE, I-DISE, O
Social Media (SM)	Ex 4	Input: 請問長期打善思達針劑是不是會變胖? Output: O, O, O, O, O, B-DRUG, I-DRUG, I-DRUG, I-DRUG, I-DRUG, O, O, O, O, B-SYMP, I-SYMP, O
Wikipedia Articles (WA)	Ex 5	Input: 抗生素和維生素 A 酸可用於口服治療痤瘡 Output: B-DRUG, I-DRUG, I-DRUG, O, B-DRUG, I-DRUG, I-DRUG, I-DRUG, I-DRUG, O, O, O, O, O, O, O, B-DISE, I-DISE
Wikipedia Articles (WA)	Ex 6	Input: 抑酸劑，又稱抗酸劑，抑制胃酸分泌，緩解燒心 Output: B-CHEM, I-CHEM, I-CHEM, O, O, O, B-CHEM, I-CHEM, I-CHEM, O, O, O, B-CHEM, I-CHEM, O, O, O, O, O, B-DISE, I-DISE

Table 2: Examples of the MultiNER-Health task.
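Going the other way, the entities in a tagged sentence can be recovered by scanning the tags from left to right. The following is a minimal sketch using Example 1 from Table 2. The function name and the returned tuple layout are illustrative assumptions, and an I- tag that does not continue an open entity of the same type is simply ignored here, which is one of several possible conventions.

```python
def decode_bio(sentence, tags):
    """Collect (text, tag, start, end) for every entity; end is exclusive."""
    if len(sentence) != len(tags):
        raise ValueError("need exactly one tag per character")
    entities = []
    start, current = None, None
    # A sentinel "O" at the end closes an entity that runs to the last character.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and current == tag[2:]
        if not continues and current is not None:
            entities.append((sentence[start:i], current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return entities


sentence = "早起也能預防老化，甚至降低阿茲海默症的風險"
tags = (["O"] * 6 + ["B-SYMP", "I-SYMP"] + ["O"] * 5
        + ["B-DISE"] + ["I-DISE"] * 4 + ["O"] * 3)
for text, tag, start, end in decode_bio(sentence, tags):
    print(text, tag, start, end)
```

The length check at the top catches the most common formatting problem, a tag sequence that is not aligned with the characters of the sentence.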

Data Preparation

The training sets for this MultiNER-Health task consist of two parts: the Chinese HealthNER corpus (Lee and Lu, 2021) is used for both the FT and SM genres, and the ROCLING-2022 CHNER dataset (Lee et al., 2022) is used for the WA genre. For the FT genre, we have 23,008 sentences with a total of 1,109,918 characters, sourced from web-based health-related articles. The SM genre, collected from medical question/answer forums, includes 7,648 sentences with a total of 403,570 characters. In the Chinese HealthNER corpus, the FT genre thus contains about three times as many sentences as the SM genre. After manual annotation, this corpus consists of 68,460 named entities across the 10 defined entity types, of which 42,070 entities (about 61%) come from the FT genre and the remaining 26,390 entities belong to the SM genre. The training instances for the WA genre originate from the ROCLING 2022 CHNER dataset, which includes 3,205 sentences with a total of 118,116 characters and 13,369 named entities. We used the existing named entities in the Chinese HealthNER corpus as query terms to identify corresponding texts written in different genres. Our constructed test set includes 2,035/2,208/2,381 sentences for the FT/SM/WA genres, respectively, resulting in a total of 340,091 characters and 28,896 named entities.

Table 3 presents detailed statistics for the mutually exclusive training and test sets, which show similar entity type distributions. The most frequently occurring type was Body, followed by Symptom, Disease and Chemical regardless of genre. In the training sets, these 4 types collectively accounted for about 82.9% of all named entity instances, with the remaining 6 types accounting for 17.1%. In the test sets, these 4 types accounted for 81.2% of the total, with the other 6 types accounting for the remaining 18.8%.

In the training set, sentences in the FT and SM genres may or may not contain named entities, but every sentence in the WA genre contains at least one named entity. Each training sentence had an average of 48.19 characters and 2.42 named entities. For system performance evaluation, at least 2,000 sentences per genre were tested, each with an average of 51.34 characters and 4.36 named entities. The average sentence in the test set was slightly longer, and the named entity density was higher, than in the training set.

Datasets		Training Sets			Test Sets
Source		Chinese HealthNER Corpus		ROCLING 2022 CHNER Dataset	ROCLING 2023 MultiNER-Health Datasets
Genre		FT	SM	WA	FT	SM	WA
#Sentence		23,008	7,648	3,205	2,035	2,208	2,381
#Character		1,109,918	403,570	118,116	149,276	98,317	92,498
#Named Entity		42,070	26,390	13,369	10,845	8,292	9,761
Entity Type	Body	17,639	8,772	5,315	2,461	2,572	3,843
	Symptom	6,432	6,472	1,944	2,635	2,280	1,890
	Instrument	743	346	250	190	41	149
	Examination	444	2,178	207	223	511	180
	Chemical	5,716	1,118	1,718	1,124	321	748
	Disease	5,865	4,214	2,609	2,300	1,322	1,970
	Drug	1,165	1,060	481	932	746	451
	Supplement	1,338	187	183	47	92	56
	Treatment	2,031	1,077	468	512	363	308
	Time	697	966	194	421	44	166

Table 3: Detailed data statistics.
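The averages and percentages quoted above follow directly from the counts in Table 3. As a sketch of that arithmetic, the snippet below copies the table rows into plain lists and recomputes the per-sentence averages and the combined share of the four most frequent types. The variable and function names are illustrative only.

```python
# Rows copied from Table 3. Column order: train FT, SM, WA, then test FT, SM, WA.
SENTENCES = [23008, 7648, 3205, 2035, 2208, 2381]
CHARACTERS = [1109918, 403570, 118116, 149276, 98317, 92498]
ENTITIES = [42070, 26390, 13369, 10845, 8292, 9761]
BY_TYPE = {
    "Body": [17639, 8772, 5315, 2461, 2572, 3843],
    "Symptom": [6432, 6472, 1944, 2635, 2280, 1890],
    "Instrument": [743, 346, 250, 190, 41, 149],
    "Examination": [444, 2178, 207, 223, 511, 180],
    "Chemical": [5716, 1118, 1718, 1124, 321, 748],
    "Disease": [5865, 4214, 2609, 2300, 1322, 1970],
    "Drug": [1165, 1060, 481, 932, 746, 451],
    "Supplement": [1338, 187, 183, 47, 92, 56],
    "Treatment": [2031, 1077, 468, 512, 363, 308],
    "Time": [697, 966, 194, 421, 44, 166],
}
TOP4 = ("Body", "Symptom", "Disease", "Chemical")


def summarize(cols):
    """Aggregate the given column indices into the figures quoted in the text."""
    sentences = sum(SENTENCES[c] for c in cols)
    characters = sum(CHARACTERS[c] for c in cols)
    entities = sum(ENTITIES[c] for c in cols)
    top4 = sum(BY_TYPE[t][c] for t in TOP4 for c in cols)
    return {
        "chars_per_sentence": round(characters / sentences, 2),
        "entities_per_sentence": round(entities / sentences, 2),
        "top4_share_pct": round(100 * top4 / entities, 1),
    }


train_stats = summarize(range(0, 3))
test_stats = summarize(range(3, 6))
# Per-type counts in each column should add up to the #Named Entity row.
column_sums = [sum(row[c] for row in BY_TYPE.values()) for c in range(6)]
print(train_stats)
print(test_stats)
print(column_sums == ENTITIES)
```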

We hope the data sets collected and annotated for this shared task can facilitate and expedite future development of Chinese NER in the healthcare domain. Therefore, the gold standard test set and evaluation scripts are made publicly available in GitHub repositories as follows:

Chinese HealthNER Corpus (train, for FT/SM genre)
https://github.com/NCUEE-NLPLab/Chinese-HealthNER-Corpus
ROCLING-2022 Shared Task (train, for WA genre)
https://github.com/NCUEE-NLPLab/ROCLING-2022-ST-CHNER
ROCLING-2023 Shared Task (test, this repository)
https://github.com/NCUEE-NLPLab/ROCLING-2023-ST-MultiNERHealth

Evaluation

Performance is evaluated by examining the difference between the machine-predicted and human-annotated BIO tags. Standard precision, recall and F1-score, the most typical evaluation metrics for NER systems, are used here at the character level. A character in a test instance is regarded as correctly recognized if its predicted BIO tag is completely identical to the gold standard. Precision is defined as the percentage of named entities found by the NER system that are correct. Recall is the percentage of named entities present in the test set that are found by the NER system. The F1-score is the harmonic mean of precision and recall.
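To illustrate the three definitions, here is a minimal sketch that scores a set of predicted entities against a gold set by exact match. It is not the logic of the official scripts in this repository. The (start, end, tag) span format and the sample spans are hypothetical, and scoring a whole test set this way would also need a sentence identifier in each span.

```python
def prf1(gold, predicted):
    """Exact-match precision, recall and F1 over sets of (start, end, tag) spans."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)  # same boundaries and same type
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    # Harmonic mean of precision and recall; defined as 0 when nothing is correct.
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical spans for a single sentence: (start, end, tag), end exclusive.
gold = {(6, 8, "SYMP"), (13, 18, "DISE")}
predicted = {(6, 8, "SYMP"), (13, 17, "DISE"), (0, 2, "TIME")}
p, r, f = prf1(gold, predicted)
print(f"P={p:.2%} R={r:.2%} F1={f:.2%}")
```

In this toy case the second prediction has the right type but the wrong right-hand boundary, so it counts as both a false positive and a missed gold entity.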

#Place the prediction files in the Input folder. The prediction file should be named as "WA_result.txt", "SM_result.txt", and "FT_result.txt".
#It will generate the corresponding "WA_result_Eval.txt", "SM_result_Eval.txt", and "FT_result_Eval.txt" in the Eval folder.
python turn_to_eval.py

#It will generate the "WA_result_Score.txt", "SM_result_Score.txt", and "FT_result_Score.txt" in the Score folder.
python conlleval.py

#It will generate the "Overall_Score.txt" in the Score folder.
python socre.py

Results

The policy of this shared task is an open test. Participating systems are allowed to use other publicly available data for this shared task, but the usage should be specified in their system description paper. Each team was allowed to provide at most three submissions during the evaluation period. Among eight registered teams, six submitted their testing results, providing a total of 16 submissions, from which the submission with the best macro-averaging F1-score of each team was kept for official performance ranking.

Rank	Team	Formal texts F1-score (%)	Social media F1-score (%)	Wikipedia F1-score (%)	Macro-averaging F1-score (%)
1	crowNER (Wang et al., 2023)	65.49	69.54	73.63	69.55
2	YNU-HPCC (Pang et al., 2023)	61.96	71.11	72.13	68.40
3	ISLab (Wu et al., 2023)	62.52	71.42	71.19	68.38
4	SCU-MESCLab (Su et al., 2023)	62.51	71.33	70.57	68.14
5	YNU-ISE-ZXW (Zhang et al., 2023)	62.79	70.22	70.37	67.79
6	LingX (Wang and Yang, 2023)	51.23	59.28	60.54	57.02
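The macro-averaging F1-score used for ranking is the unweighted mean of the three per-genre F1-scores. As a small worked example, the snippet below recomputes it from the per-genre columns of the results table. The dictionary and function names are illustrative.

```python
# Per-genre F1-scores (%) copied from the results table, in the order (FT, SM, WA).
SCORES = {
    "crowNER": (65.49, 69.54, 73.63),
    "YNU-HPCC": (61.96, 71.11, 72.13),
    "ISLab": (62.52, 71.42, 71.19),
    "SCU-MESCLab": (62.51, 71.33, 70.57),
    "YNU-ISE-ZXW": (62.79, 70.22, 70.37),
    "LingX": (51.23, 59.28, 60.54),
}


def macro_f1(per_genre):
    """Unweighted mean over genres, so every genre counts equally."""
    return sum(per_genre) / len(per_genre)


# Rank on the unrounded mean; round only for display.
ranking = sorted(SCORES, key=lambda team: macro_f1(SCORES[team]), reverse=True)
for rank, team in enumerate(ranking, start=1):
    print(rank, team, f"{macro_f1(SCORES[team]):.2f}")
```

Because the mean is unweighted, a genre with fewer test sentences influences the ranking as much as a larger one.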

Citation

Please cite the following paper for ROCLING 2023 MultiNER-Health Datasets:

Lung-Hao Lee, Tzu-Mi Lin, and Chao-Yi Chen. 2023. Overview of the ROCLING 2023 Shared Task for Chinese Multi-genre Named Entity Recognition in the Healthcare Domain. In Proceedings of the 35th Conference on Computational Linguistics and Speech Processing, pp. 333-338.

@inproceedings{ROCLING2023,
author={Lung-Hao Lee and Tzu-Mi Lin and Chao-Yi Chen},
title={Overview of the ROCLING 2023 Shared Task for Chinese Multi-genre Named Entity Recognition in the Healthcare Domain},
year={2023},
booktitle={Proceedings of the 35th Conference on Computational Linguistics and Speech Processing},
pages={333--338}
}

References

Lung-Hao Lee and Yi Lu. 2021. Multiple embeddings enhanced multi-graph neural networks for Chinese healthcare named entity recognition. IEEE Journal of Biomedical and Health Informatics, 25(7): 2801-2810.

Lung-Hao Lee, Chao-Yi Chen, Liang-Chih Yu, and Yuen-Hsien Tseng. 2022. Overview of the ROCLING 2022 shared task for Chinese healthcare named entity recognition. In Proceedings of the 34th Conference on Computational Linguistics and Speech Processing, pp. 363-368.

Yin-Chieh Wang, Wen-Hong Wu, Feng-Yu Kuo, Han-Chun Wu, Te-Yu Chi, Te-Lun Yang, Sheh Chen, and Jyh-Shing Roger Jang. 2023. CrowNER at ROCLING 2023 MultiNER-Health Task: enhancing NER task with GPT paraphrase augmentation on sparsely labeled data. In Proceedings of the 35th Conference on Computational Linguistics and Speech Processing, pp. 339-349.

Chonglin Pang, You Zhang, and Xiaobing Zhou. 2023. YNU-HPCC at ROCLING 2023 MultiNER-Health Task: a transformer-based approach for Chinese healthcare NER. In Proceedings of the 35th Conference on Computational Linguistics and Speech Processing, pp. 317-324.

Jun-Jie Wu, Tao-Hsing Chang, and Fu-Yuan Hsu. 2023. ISLab at ROCLING 2023 MultiNER-Health Task: a three-stage NER model combining textual content and label semantics. In Proceedings of the 35th Conference on Computational Linguistics and Speech Processing, pp. 359-366.

Tzu-En Su, Ruei-Cyuan Su, Ming-Hsiang Su, and Tsung-Hsien Yang. 2023. SCU-MESCLab at ROCLING-2023 Shared Task: Named Entity Recognition Using Multiple Classifier Model. In Proceedings of the 35th Conference on Computational Linguistics and Speech Processing, pp. 311-316.

Xingwei Zhang, Jin Wang, and Xuejie Zhang. 2023. YNU-ISE-ZXW at ROCLING 2023 MultiNER-Health Task: a transformer-based model with LoRA for Chinese healthcare named entity recognition. In Proceedings of the 35th Conference on Computational Linguistics and Speech Processing, pp. 325-332.

Xuelin Wang and Qihao Yang. 2023. LingX at ROCLING 2023 MultiNER-Health Task: intelligent capture of Chinese medical named entities by LLMs. In Proceedings of the 35th Conference on Computational Linguistics and Speech Processing, pp. 350-358.
