This repository contains a collection of Dutch medical texts that were used for domain-adaptive pretraining of Dutch medical language models (see here and here).
To create this dataset, we used medical texts from the guidelines and monthly magazine of the Dutch College of General Practitioners (NHG), and from richtlijnendatabase.nl, a collection of medical guidelines from hospital specialties.
Because the texts are mainly professional guidelines, they are scientifically oriented and contain English passages and references to scientific articles. To make the models better suited to clinical applications, we used the FastText language identification model to detect and remove English sentences, and pattern matching to remove article citations. Preprocessing also included removing redundant whitespace and organizing the texts as one sentence per line.
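The preprocessing steps described above could look roughly like the following. This is a minimal sketch, not the code used to build the dataset: the citation patterns, the naive sentence splitter, and the function names (`strip_citations`, `normalize_whitespace`, `split_sentences`, `make_fasttext_predictor`, `preprocess`) are all assumptions made for illustration. The language detector is passed in as a function so the pipeline can be tried without downloading a model.

```python
import re
from typing import Callable, List

# Hypothetical citation patterns; the patterns actually used are not documented.
CITATION_PATTERNS = [
    re.compile(r"\[\d+(?:\s*[-,]\s*\d+)*\]"),        # numeric: [12], [3, 4], [5-7]
    re.compile(r"\([A-Z][^()\d]*,? \d{4}[a-z]?\)"),  # author-year: (Smith et al., 2015)
]


def strip_citations(text: str) -> str:
    """Remove substrings that match one of the citation patterns."""
    for pattern in CITATION_PATTERNS:
        text = pattern.sub("", text)
    return text


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs and drop spaces left before punctuation."""
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text.strip()


def split_sentences(text: str) -> List[str]:
    """Naive splitter (an assumption): break after . ! ? before a capital or digit."""
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text) if s]


def make_fasttext_predictor(model_path: str) -> Callable[[str], str]:
    """Wrap a FastText language identification model as a `sentence -> code` function.

    Requires the optional `fasttext` package and a downloaded model file
    such as lid.176.bin; the path is supplied by the caller.
    """
    import fasttext  # imported lazily so the rest of the sketch runs without it

    model = fasttext.load_model(model_path)

    def predict(sentence: str) -> str:
        # predict() rejects newlines and returns labels like "__label__en".
        labels, _ = model.predict(sentence.replace("\n", " "))
        return labels[0].replace("__label__", "")

    return predict


def preprocess(text: str, predict_lang: Callable[[str], str]) -> List[str]:
    """Return cleaned non-English sentences, one list item per output line."""
    cleaned = normalize_whitespace(strip_citations(text))
    return [s for s in split_sentences(cleaned) if predict_lang(s) != "en"]
```

With a real model, the call would be along the lines of `preprocess(raw_text, make_fasttext_predictor("lid.176.bin"))`, and the returned sentences can be joined with `"\n"` to get one sentence per line. A confidence threshold on the predicted label may be worth adding, since short sentences are easy to misclassify.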

| Source | Number of words | Size |
|---|---|---|
| NHG standaarden | 17.7M | 127 MB |
| Richtlijnendatabase | 43.6M | 323 MB |
| Huisarts en wetenschap | 23.7M | 161 MB |

This research was performed by Bas Arends as a student researcher at the Amsterdam University Medical Centers, location AMC, supervised by Miguel Rios Gaona.