Dutch medical NLP

This repository contains a collection of Dutch medical texts which were used for domain-adaptive pretraining to pretrain Dutch medical language models (see here and here).

Data sources

For creation of this dataset, we used medical texts from the Dutch College of General Practitioners' guidelines and monthly magazine, and from richtlijnendatabase.nl, a collection of medical guidelines from hospital specialties.

Pre-processing

Because the texts were mainly professional guidelines, they are scientifically oriented and therefore contain English passages and references to scientific articles. To make the model more suited to clinical applications, we used the FastText language identification model to detect and remove English sentences. We used pattern matching to remove article citations. Furthermore, preprocessing included removing whitespace and organizing the texts in a structure of one sentence per line.

Text statistics

Source	Number of words	Size
NHG standaarden	17.7M	127 MB
Richtlijnendatabase	43.6M	323 MB
Huisarts en wetenschap	23.7M	161 MB

Acknowledgement

This research was performed by Bas Arends as a student researcher at the Amsterdam University Medical Centers, location AMC, supervised by Miguel Rios Gaona.

terminalkitten / dutch_medical_nlp

Dutch medical NLP

Data sources

Pre-processing

Text statistics

Acknowledgement

About