This page collects the resources and tools, drawn from the biomedical literature, that are mentioned in Harnessing Electronic Health Records for Real World Evidence.
- Background and Flowchart
- Method
In this study, we outline an integrated pipeline that improves the resolution of EHR data so that researchers can perform robust analyses on high-quality EHR data for RWE generation. The pipeline has four modules: 1) creating meta-data for harmonization, 2) cohort construction, 3) variable curation, and 4) validation and robust modeling (Figure 1). The methods and resources integrated into the pipeline are listed under each module. The pipeline also contributes to the creation of digital twins.
Figure 1: The integrated data curation pipeline, designed to enable researchers to extract high-quality data from electronic health records (EHRs) for RWE.
The first step in our pipeline is data harmonization: mapping the clinical variables of interest to the relevant sources of data within EHRs. To make this mapping more efficient and transparent, we propose an automated NLP-based method for data harmonization. This approach can streamline the process and improve accuracy in identifying clinically relevant concepts.
Identify the medical concepts associated with the clinical variables from the RCT documents using existing clinical NLP software.
Use | Methods | Links | References |
---|---|---|---|
Identify medical concepts from RCT documents | MetaMap (Code, Ref) | Tools: MetaMap | Mapping Text to the UMLS Metathesaurus |
Identify medical concepts from RCT documents | HPO (Code, Ref) | The Human Phenotype Ontology | The Human Phenotype Ontology in 2021 |
Identify medical concepts from RCT documents | NILE (Code, Ref) | Narrative Information Linear Extraction (NILE) | NILE: Fast Natural Language Processing for Electronic Health Records |
Identify medical concepts from RCT documents | cTAKES (Code, Ref) | Apache cTAKES | Entity Extraction for Clinical Notes, a Comparison Between MetaMap and Amazon Comprehend Medical |
Match the identified medical concepts to both structured and unstructured EHR data elements.
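
The matching step can be sketched as a simple lookup that links each clinical variable to structured codes and to concepts detectable in notes. This is only an illustration under stated assumptions: the function `build_harmonization_metadata`, the inputs `variable_concepts`, `concept_to_codes`, and `nlp_dictionary`, and every concept and code string are hypothetical placeholders, not outputs of MetaMap, HPO, NILE, or cTAKES.

```python
def build_harmonization_metadata(variable_concepts, concept_to_codes, nlp_dictionary):
    """Link each clinical variable to structured and unstructured EHR elements."""
    metadata = {}
    for variable, concepts in variable_concepts.items():
        # Structured sources: every EHR code mapped to any of the variable's concepts.
        structured = sorted(
            {code for cid in concepts for code in concept_to_codes.get(cid, [])}
        )
        # Unstructured sources: concepts that the note-processing dictionary can detect.
        unstructured = sorted(cid for cid in concepts if cid in nlp_dictionary)
        metadata[variable] = {"structured": structured, "unstructured": unstructured}
    return metadata


# Hypothetical inputs: concept IDs found in RCT documents for each clinical variable.
variable_concepts = {"disease_activity": ["CONCEPT_1", "CONCEPT_2"]}
# Hypothetical mapping from concept IDs to structured EHR codes.
concept_to_codes = {"CONCEPT_1": ["DX_A", "DX_B"], "CONCEPT_2": ["LAB_X"]}
# Hypothetical set of concepts covered by the NLP dictionary used on clinical notes.
nlp_dictionary = {"CONCEPT_2", "CONCEPT_3"}

metadata = build_harmonization_metadata(variable_concepts, concept_to_codes, nlp_dictionary)
print(metadata)
```

Keeping the structured and unstructured sources side by side makes the resulting meta-data easy to review by hand before it is used for cohort construction.
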
The construction of the study cohort for RWE involves identifying the patients with the condition or disease of interest, the time window of their indication, and whether they underwent the interventions in the RCT. EHRs contain a large amount of data, of which only a subset is relevant to the study. To avoid bringing unnecessary personal health identifiers into the analysis data, we recommend a 3-phase cohort construction strategy that gradually extracts the minimally necessary data from the EHR, moving from an inclusive data mart to the disease cohort and then to the treatment arms.
The data mart is designed to include all patients with any indication of the disease or condition of interest. To achieve the desired inclusiveness, researchers should summarize a broad list of EHR variables with high sensitivity and construct the data mart to capture patients with at least one occurrence of the listed variables.
Use | Methods | Links | References |
---|---|---|---|
Filter patients with diagnosis codes relevant to disease of interest | PheWAS catalog (Code, Ref) | Phenome Wide Association Studies | PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations |
Filter patients with diagnosis codes relevant to disease of interest | HPO (Code, Ref) | The Human Phenotype Ontology | The Human Phenotype Ontology in 2021 |
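
The "at least one occurrence" rule for the data mart can be sketched as a set filter over long-format EHR records. This is a minimal illustration: the record layout, the function `build_data_mart`, and all patient IDs and code strings are hypothetical placeholders rather than entries from the PheWAS catalog or HPO.

```python
def build_data_mart(records, mart_codes):
    """Return patients with at least one occurrence of any listed EHR variable."""
    return {patient_id for patient_id, code in records if code in mart_codes}


# Hypothetical long-format EHR records: one (patient_id, code) pair per occurrence.
records = [
    ("p1", "DX_A"),
    ("p1", "RX_M"),
    ("p2", "DX_OTHER"),
    ("p3", "NLP_MENTION_A"),
    ("p4", "DX_UNRELATED"),
]

# Hypothetical broad, high-sensitivity variable list for the disease of interest.
MART_CODES = {"DX_A", "DX_B", "NLP_MENTION_A"}

data_mart = build_data_mart(records, MART_CODES)
print(sorted(data_mart))
```

A single matching occurrence is enough for inclusion, which favors sensitivity over specificity; the later phenotyping step is where specificity is recovered.
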
After the data mart is created, the next step is to identify the disease cohort, the subset of patients within the data mart who have the disease of interest. Commonly used phenotyping tools can be roughly classified as either rule-based or machine-learning-based. Machine learning approaches can be further classified as weakly supervised, semi-supervised, or supervised, depending on the availability of gold-standard labels for model training.
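
To make the rule-based category concrete, the sketch below applies a simple count threshold to diagnosis codes. The threshold of two codes, the function `rule_based_phenotype`, and all identifiers are assumptions chosen for illustration; they do not come from any specific tool listed on this page, and machine learning approaches would replace this rule with a model trained on whatever labels are available.

```python
from collections import Counter


def rule_based_phenotype(dx_events, target_codes, min_count=2):
    """Flag patients with at least `min_count` occurrences of the target codes."""
    counts = Counter(
        patient_id for patient_id, code in dx_events if code in target_codes
    )
    return {patient_id for patient_id, n in counts.items() if n >= min_count}


# Hypothetical diagnosis events for patients already in the data mart.
dx_events = [
    ("p1", "DX_A"),
    ("p1", "DX_A"),
    ("p2", "DX_A"),
    ("p3", "DX_B"),
    ("p3", "DX_A"),
    ("p3", "DX_UNRELATED"),
]

disease_cohort = rule_based_phenotype(dx_events, target_codes={"DX_A", "DX_B"})
print(sorted(disease_cohort))
```

Raising `min_count` trades sensitivity for specificity, which is the usual tuning knob for rules of this kind.
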
With a given disease cohort, one may proceed to identify patients who received the relevant treatments, which are typically medications or procedures.
Use | Methods | Links | References |
---|---|---|---|
Identify indication conditions before treatment | Phenotyping with temporal input (Code: MSMR, TSPM, AgeMatters, Ref) | MSMR, TSPM, AgeMatters | High-throughput phenotyping with temporal sequences |
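
The requirement that the indication precede the treatment can be sketched as a date comparison when assigning treatment arms. This is a simplified illustration, not the MSMR, TSPM, or AgeMatters method: the function `assign_treatment_arms`, the drug-to-arm mapping, and all dates and identifiers are hypothetical.

```python
from datetime import date


def assign_treatment_arms(first_dx, med_events, arm_of_drug):
    """Assign each patient to the arm of the first study drug given on or after diagnosis."""
    first_rx = {}
    for patient_id, drug, when in med_events:
        # Skip patients outside the disease cohort and drugs outside the study.
        if patient_id not in first_dx or drug not in arm_of_drug:
            continue
        # Skip exposures that occurred before the indication was established.
        if when < first_dx[patient_id]:
            continue
        # Keep only the earliest qualifying exposure per patient.
        if patient_id not in first_rx or when < first_rx[patient_id][0]:
            first_rx[patient_id] = (when, arm_of_drug[drug])
    return {patient_id: arm for patient_id, (_, arm) in first_rx.items()}


# Hypothetical first diagnosis dates for the disease cohort.
first_dx = {"p1": date(2020, 1, 10), "p2": date(2020, 3, 1)}
# Hypothetical medication events: (patient_id, drug, date).
med_events = [
    ("p1", "drug_a", date(2019, 12, 1)),
    ("p1", "drug_b", date(2020, 2, 1)),
    ("p1", "drug_a", date(2020, 5, 1)),
    ("p2", "drug_a", date(2020, 3, 1)),
    ("p3", "drug_a", date(2020, 1, 1)),
]
arm_of_drug = {"drug_a": "arm_A", "drug_b": "arm_B"}

arms = assign_treatment_arms(first_dx, med_events, arm_of_drug)
print(arms)
```
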
RCT emulation with EHR data generally requires three categories of data elements: 1) the endpoints measuring the treatment effect; 2) eligibility criteria to match the RCT population; 3) confounding factors to correct for the treatment-by-indication biases inherent in real-world data. In the following, we describe the classification and extraction of the first two types; confounding is addressed in Module 4.
Use | Methods | Links | References |
---|---|---|---|
Extraction of binary variables through phenotyping | Same as Identify patients with disease of interest through phenotyping | Same as Identify patients with disease of interest through phenotyping | Same as Identify patients with disease of interest through phenotyping |
Extraction of numerical variables through NLP | EXTEND (Code, Ref), NILE (Code, Ref) | EXTEND, NILE | EXTraction of EMR numerical data: an efficient and generalizable tool to EXTEND clinical research, Performance of a Machine Learning Algorithm Using Electronic Health Record Data to Identify and Estimate Survival in a Longitudinal Cohort of Patients With Lung Cancer |
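
As a rough illustration of extracting numerical variables from note text, the sketch below pairs a keyword with the nearest following number using a regular expression. It is not the EXTEND or NILE algorithm; the function `extract_numeric`, the 20-character window, and the sample note are assumptions made for this example.

```python
import re


def extract_numeric(note, keywords, max_gap=20):
    """Return numbers that follow any keyword within `max_gap` non-digit characters."""
    keyword_pattern = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(
        r"\b(?:" + keyword_pattern + r")\b"   # one of the keywords as a whole word
        r"\D{0," + str(max_gap) + r"}?"       # short gap containing no digits
        r"(\d+(?:\.\d+)?)",                   # integer or decimal value
        re.IGNORECASE,
    )
    return [float(match.group(1)) for match in pattern.finditer(note)]


# Hypothetical clinical note text.
note = "BMI: 27.3 today. Prior BMI was 26."
values = extract_numeric(note, keywords=["BMI", "body mass index"])
print(values)
```

A pattern this simple ignores units, negation, and dates, so any real use would need validation against chart review.
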
Use | Methods | Links | References |
---|---|---|---|
Extraction of event time through incidence phenotyping | Unsupervised: AC_TPC (Code, Ref) | AC_TPC (Code, Ref) | Disease progression modeling using Hidden Markov Models, Temporal Phenotyping using Deep Predictive Clustering of Disease Progression |
Extraction of event time through incidence phenotyping | Semi-supervised: SAMGEP (Code, Ref) | SAMGEP (Code, Ref) | SAMGEP: A novel method for prediction of phenotype event times using the electronic health record, Semi-supervised Approach to Event Time Annotation Using Longitudinal Electronic Health Records |
Extraction of event time through incidence phenotyping | Supervised | Determining the Time of Cancer Recurrence Using Claims or Electronic Medical Record Data, Detecting Lung and Colorectal Cancer Recurrence Using Structured Clinical/Administrative Data to Enable Outcomes Research and Population Health Management | |
Confounding factors, variables that affect both the treatment assignment and the outcome, must be properly adjusted for. To minimize bias, the pipeline should include 1) validation for optimizing the medical informatics tools in Modules 2 and 3; 2) analyses robust to remaining data error; and 3) comprehensive confounding adjustment.
Use | Methods | Links | References |
---|---|---|---|
Efficient and robust estimation of treatment effect with partially annotated noisy data | SMMAL(Code, Ref) | Efficient and Robust Semi-supervised Estimation of ATE with Partially Annotated Treatment and Response | |
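
To illustrate what confounding adjustment does, the sketch below estimates an average treatment effect by inverse probability weighting with a single discrete confounder. This is a textbook estimator swapped in for illustration, not SMMAL, and it assumes fully observed treatment and outcome; the function `ipw_ate` and the toy data are hypothetical.

```python
from collections import defaultdict


def ipw_ate(rows):
    """Estimate the ATE by inverse probability weighting within confounder strata.

    rows: iterable of (stratum, treated, outcome) with treated in {0, 1}.
    """
    rows = list(rows)
    n_by_stratum = defaultdict(int)
    treated_by_stratum = defaultdict(int)
    for stratum, treated, _ in rows:
        n_by_stratum[stratum] += 1
        treated_by_stratum[stratum] += treated

    # Propensity score: empirical probability of treatment within each stratum.
    propensity = {s: treated_by_stratum[s] / n_by_stratum[s] for s in n_by_stratum}
    for s, p in propensity.items():
        if p in (0.0, 1.0):
            raise ValueError(f"stratum {s!r} lacks treated or untreated patients")

    n = len(rows)
    # Weighted means of the outcome under treatment and under control.
    mean_treated = sum(a * y / propensity[s] for s, a, y in rows) / n
    mean_control = sum((1 - a) * y / (1 - propensity[s]) for s, a, y in rows) / n
    return mean_treated - mean_control


# Hypothetical data: (confounder stratum, treated, outcome).
rows = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 0, 1),
    ("B", 1, 1), ("B", 0, 1), ("B", 0, 1), ("B", 0, 1),
]

ate = ipw_ate(rows)
print(round(ate, 3))
```

In this toy data the unadjusted difference in outcome means between treated and untreated patients differs from the weighted estimate, because treatment is more common in one stratum than in the other.
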