vojtechhuser / CRI

CRI

Clinical Research Informatics

Methods

We created a pipeline of text-processing and NLP tools that takes a protocol or informed consent document as input. For PDF documents, we first extract the text and remove repeating header and footer text. The current pipeline uses MetaMap as the main NLP tool, but we are evaluating several others (e.g., NobleCoder, Apache cTAKES). To assess the quality of the NLP extraction of procedure terms, we created an evaluation reference standard for a random subset of protocol documents. Examples of protocol procedures targeted in our pilot are ‘whole blood count test’, ‘liver biopsy’, and ‘questionnaire administration’. We flag procedures that are traceable via the NIH CC data warehouse.
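
The header and footer removal step can be illustrated with a minimal Python sketch. It is not the authors' implementation (their scripts are written in R and are not shown here); it assumes the PDF text has already been extracted into one string per page, and the function name, the `edge` and `min_fraction` parameters, and the heuristic itself are hypothetical choices made for illustration.

```python
import re
from collections import Counter


def _normalize(line):
    # Collapse digit runs so "Page 3 of 12" and "Page 4 of 12" compare equal.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeating_lines(pages, edge=2, min_fraction=0.6):
    """Drop lines that recur at the top or bottom of most pages.

    pages: list of page texts (assumed to be already extracted from the PDF).
    edge: how many non-blank lines at each page edge are header/footer candidates.
    min_fraction: share of pages on which a line must recur to be removed.
    """
    counts = Counter()
    for page in pages:
        lines = [l for l in page.splitlines() if l.strip()]
        # A set, so a line is counted at most once per page.
        counts.update({_normalize(l) for l in lines[:edge] + lines[-edge:]})

    threshold = max(2, min_fraction * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        lines = [l for l in page.splitlines() if l.strip()]
        kept = [
            l for i, l in enumerate(lines)
            # Only lines at a page edge are eligible for removal.
            if not ((i < edge or i >= len(lines) - edge) and _normalize(l) in repeated)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

Restricting removal to page edges keeps body sentences that happen to repeat across pages, which matters for consent forms with recurring standard wording.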

Preliminary Results and Discussion

From the NIH CC, we obtained 2,013 informed consent documents (ICs; all PDF files) originating from 764 active research studies (some studies had multiple IC versions). In addition to ICs, we obtained 3 full protocols and 21 protocol synopses from the itntrialshare.org data sharing platform. The evaluation gold standard data are available at https://dx.doi.org/10.6084/m9.figshare.3100765.v2 (the link also contains additional results). We wrote R scripts that invoke the MetaMap remote API, parse the MetaMap text (or XML) output, and keep only the detected concepts that are procedures. Our comparison of multiple MetaMap configurations (restricting by semantic type [e.g., only procedures] or by UMLS terminology [e.g., only SNOMEDCT_US]) indicates that the best results come from the least restrictive NLP configuration combined with more intensive post-processing of MetaMap outputs. Our results also indicate the need for a better representation of protocol documents: the PDF format loses formal document sections (headings and subheadings), rich text formatting, and table formatting. We have created a proposed extension to the CDISC Operational Data Model (ODM) and plan to collaborate closely with the CDISC XML technology team to advance existing standards towards a computable representation.
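
The post-processing idea (run MetaMap unrestricted, then filter its output) might look like the following Python sketch. It is an illustration, not the authors' R code. It assumes MetaMap's pipe-delimited fielded MMI output, with the score, preferred name, CUI, and bracketed semantic types in fields 3 to 6; verify that layout against the MetaMap documentation for your version. The choice of semantic types treated as procedures, the function names, and the sample lines (including the placeholder CUIs and scores) are all hypothetical.

```python
# UMLS semantic type abbreviations treated as "procedures" in this sketch
# (an assumption; the authors' exact list is not given).
PROCEDURE_TYPES = {"diap", "lbpr", "topp", "hlca"}


def parse_mmi_line(line):
    """Parse one fielded MMI line into a dict, or return None if it is not one."""
    fields = line.rstrip("\n").split("|")
    if len(fields) < 6 or fields[1] != "MMI":
        return None
    return {
        "score": float(fields[2]),
        "name": fields[3],
        "cui": fields[4],
        "semtypes": set(fields[5].strip("[]").split(",")),
    }


def procedure_concepts(mmi_text, wanted=PROCEDURE_TYPES):
    """Keep only concepts that carry at least one wanted semantic type."""
    kept = []
    for line in mmi_text.splitlines():
        record = parse_mmi_line(line)
        if record and record["semtypes"] & wanted:
            kept.append(record)
    return kept


# Made-up sample output; CUIs and scores are placeholders, not real values.
SAMPLE_MMI = "\n".join([
    "doc1|MMI|14.64|Biopsy of liver|C0000001|[diap]|...|...|...",
    "doc1|MMI|3.61|Liver|C0000002|[bpoc]|...|...|...",
    "doc1|MMI|9.20|Complete blood count|C0000003|[lbpr]|...|...|...",
])

procedures = procedure_concepts(SAMPLE_MMI)
```

Filtering after the fact, rather than restricting MetaMap itself, leaves the full output available so the set of accepted semantic types can be changed without rerunning the NLP step.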
