Harnessing Electronic Health Records for Real World Evidence

This page provides the resources and tools mentioned in Harnessing Electronic Health Records for Real World Evidence.

Background and Flowchart

In this study, we outline an integrated pipeline that improves the resolution of EHR data so that researchers can perform robust analyses with high-quality EHR data for RWE generation. The pipeline has four modules: 1) creating meta-data for harmonization, 2) cohort construction, 3) variable curation, and 4) validation and robust modeling (Figure 1). The methods and resources integrated into the pipeline are listed under each module below. The pipeline also contributes to the creation of digital twins.

Figure 1: The integrated data curation pipeline designed to enable researchers to extract high-quality data from electronic health records (EHRs) for RWE.

Method

Module one: Creating Meta-Data for Harmonization

The first step in our pipeline is data harmonization: mapping the clinical variables of interest to the relevant sources of data within EHRs. To make this mapping more efficient and transparent, we propose an automated, NLP-based method for data harmonization. This approach can streamline the process and improve accuracy in identifying clinically relevant concepts.

Concept Identification

Identify the medical concepts associated with the clinical variables from the RCT documents using existing clinical NLP software.

Use	Methods	Links	References
Identify medical concepts from RCT documents	MetaMap(Code, Ref)	Tools: MetaMap	Mapping Text to the UMLS Metathesaurus
	HPO(Code, Ref)	The Human Phenotype Ontology	The Human Phenotype Ontology in 2021
	NILE(Code, Ref)	Narrative Information Linear Extraction (NILE)	NILE: Fast Natural Language Processing for Electronic Health Records
	cTAKES(Code, Ref)	Apache cTAKES	Entity Extraction for Clinical Notes, a Comparison Between MetaMap and Amazon Comprehend Medical
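
As a rough illustration of what concept identification does, the sketch below looks up terms from a small dictionary in a piece of eligibility text. It is not MetaMap, NILE, or cTAKES, and it does not use their APIs. The dictionary, the concept IDs, and the function name `identify_concepts` are hypothetical placeholders chosen for this example.

```python
import re

# Hypothetical term-to-concept dictionary; a real pipeline would load terms
# from a terminology such as the UMLS rather than hard-code them.
TERM_TO_CONCEPT = {
    "type 2 diabetes": "CONCEPT_T2D",
    "hypertension": "CONCEPT_HTN",
    "myocardial infarction": "CONCEPT_MI",
}


def identify_concepts(text):
    """Return the sorted concept IDs whose terms appear in the text."""
    # Lower-case and collapse whitespace so spacing differences do not matter.
    normalized = re.sub(r"\s+", " ", text.lower())
    found = set()
    for term, concept in TERM_TO_CONCEPT.items():
        # Word boundaries avoid matching a term inside a longer word.
        if re.search(r"\b" + re.escape(term) + r"\b", normalized):
            found.add(concept)
    return sorted(found)


criteria = "Adults with Type 2  diabetes and no prior myocardial infarction."
concepts = identify_concepts(criteria)
print(concepts)
```

Note that this simple lookup ignores negation ("no prior myocardial infarction" still yields the concept), which is one reason dedicated clinical NLP software is used in practice.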

Concept Matching

Match the identified medical concepts to both structured and unstructured EHR data elements.

Use	Methods	Links	References
Grouping of structured EHR	PheWAS catalog(Code,Ref)	Phenome Wide Association Studies	PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations
	CCS(Resource: ICD-9-CM, ICD-10-PCS, Ref), CPT-4/HCPCS(Resource), ICD-9-CM(Resource), ICD-10-PCS(Resource)	ICD-9-CM Diagnosis and Procedure Codes, 2023 ICD-10-PCS, List of CPT/HCPCS Codes, CLINICAL CLASSIFICATIONS SOFTWARE (CCS) FOR ICD-9-CM, CLINICAL CLASSIFICATIONS SOFTWARE (CCS) FOR ICD-10-PCS (BETA VERSION)	Clinical Classifications for Health Policy Research: Version 2: Software and User’s Guide. (U.S. Department of Health and Human Services, Public Health Service, Agency for Health Care Policy and Research
	RxNorm(Resource, Ref)	RxNorm Files	RxNorm: Prescription for Electronic Drug Information Exchange
	LOINC(Resource, Ref)	Download LOINC	LOINC, a universal standard for identifying laboratory observations: a 5-year update
Expansion and selection of relevant features using knowledge sources or co-occurrence	Expert curation	UMLS, Wikidata	The Unified Medical Language System (UMLS): integrating biomedical terminology, Freebase (database)
	Knowledge sources	Distributional Semantics Resources, PubMed, Merck Manual, Medscape	Exploring the application of deep learning techniques on medical text corpora
	Matching descriptions via language model	CODER++(Code, Ref)
	Embedding from co-occurrence in EHRs	KESER(Code, App, Ref)	Clinical Knowledge Extraction via Sparse Embedding Regression (KESER) with Multi-Center Large Scale Electronic Health Record Data
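
Grouping structured EHR codes can be pictured as rolling individual codes up to a shared group and counting per patient. The sketch below assumes a tiny hand-written mapping; the group labels, the `SYSTEM:code` key format, and the function name `roll_up` are hypothetical, and a real mapping would come from the published resources listed above.

```python
from collections import defaultdict

# Hypothetical code-to-group mapping for illustration only; in practice this
# would be loaded from a published grouping (e.g. PheWAS catalog or CCS files).
CODE_TO_GROUP = {
    "ICD9:250.00": "GROUP_DIABETES",
    "ICD10:E11.9": "GROUP_DIABETES",
    "ICD9:401.9": "GROUP_HYPERTENSION",
}


def roll_up(records):
    """Count occurrences per (patient, group); unmapped codes are skipped."""
    counts = defaultdict(int)
    for patient_id, code in records:
        group = CODE_TO_GROUP.get(code)
        if group is not None:
            counts[(patient_id, group)] += 1
    return dict(counts)


records = [
    ("p1", "ICD9:250.00"),
    ("p1", "ICD10:E11.9"),
    ("p2", "ICD9:401.9"),
    ("p2", "ICD9:999.99"),  # not in the mapping, so it is ignored
]
group_counts = roll_up(records)
print(group_counts)
```

Rolling codes up this way lets codes from different coding systems contribute to the same clinical variable.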

Module two: Cohort Construction

The construction of the study cohort for RWE involves identifying the patients with the condition/disease of interest, their time window for the indication, and whether they underwent the interventions in the RCT. EHRs contain a large amount of data, of which only a subset is relevant to the study. To avoid bringing unnecessary personal health identifiers into the analysis data, we recommend a 3-phase cohort construction strategy that gradually extracts the minimally necessary data from the EHR, moving from an inclusive data mart to the disease cohort and then to the treatment arms.

Data Mart

The data mart is designed to include all patients with any indication of the disease or condition of interest. To achieve the desired inclusiveness, researchers should summarize a broad list of EHR variables with high sensitivity and construct the data mart to capture patients with at least one occurrence of the listed variables.

Use	Methods	Links	References
Filter patients with diagnosis codes relevant to disease of interest	PheWAS catalog(Code, Ref)	Phenome Wide Association Studies	PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations
	HPO(Code, Ref)	The Human Phenotype Ontology	The Human Phenotype Ontology in 2021
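
The "at least one occurrence" rule for the data mart can be sketched in a few lines. The variable names and the record layout (`(patient_id, variable)` pairs) are assumptions made for this example, not part of any tool listed above.

```python
# Hypothetical high-sensitivity variable list for the disease of interest.
SENSITIVE_VARIABLES = {"GROUP_DIABETES", "LAB_HBA1C", "RX_METFORMIN"}


def build_data_mart(records, variables):
    """Return patients with at least one occurrence of any listed variable."""
    return {patient_id for patient_id, variable in records if variable in variables}


records = [
    ("p1", "GROUP_DIABETES"),
    ("p2", "LAB_HBA1C"),
    ("p3", "GROUP_HYPERTENSION"),  # not on the list, so p3 is excluded
    ("p2", "RX_METFORMIN"),
]
data_mart = build_data_mart(records, SENSITIVE_VARIABLES)
print(sorted(data_mart))
```

The rule is deliberately inclusive: a single matching record is enough, and specificity is left to the disease cohort step.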

Disease Cohort

After the data mart is created, the next step is to identify the disease cohort: the subset of patients within the data mart who have the disease of interest. Commonly used phenotyping tools can be roughly classified as rule-based or machine-learning-based. Machine learning approaches can be further classified as weakly supervised, semi-supervised, or supervised, depending on the availability of gold-standard labels for model training.

Use	Methods	Links	References
Identify patients with disease of interest through phenotyping	Unsupervised: Anchorexplorer(Code, Ref), Express(Code, Ref), Aphrodite(Code, Ref), PheNorm(Code, Ref), MAP(Code, Ref), sureLDA(Code, Ref)	Phenome Wide Association Studies, Anchorexplorer, Express, Aphrodite, PheNorm, MAP, sureLDA	Electronic medical record phenotyping using the anchor and learn framework, Learning statistical models of phenotypes using noisy labeled training data, Electronic medical record phenotyping using the anchor and learn framework, Enabling phenotypic big data with PheNorm, High-throughput multimodal automated phenotyping (MAP) with application to PheWAS, A multidisease automated phenotyping method for the electronic health record
	Semi-supervised: AFEP(Code, Ref), SAFE(Code, Ref), PSST(Code, Ref), Likelihood approach(Code, Ref), PheCAP(Code, Ref)	SAFE, PheCAP	Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources, Surrogate-assisted feature extraction for high-throughput phenotyping, Phenotyping through Semi-Supervised Tensor Factorization (PSST), A maximum likelihood approach to electronic health record phenotyping using positive and unlabeled patients., High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP)
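
For contrast with the machine-learning tools in the table, a rule-based phenotype can be as simple as a count threshold on a target variable. The sketch below is a generic illustration, not any of the listed methods; the threshold of two occurrences and the names used are assumptions for the example.

```python
from collections import Counter


def rule_based_phenotype(records, target_variable, min_count=2):
    """Return patients whose count of the target variable meets the threshold."""
    counts = Counter(
        patient_id for patient_id, variable in records if variable == target_variable
    )
    return {patient_id for patient_id, n in counts.items() if n >= min_count}


records = [
    ("p1", "GROUP_DIABETES"),
    ("p1", "GROUP_DIABETES"),
    ("p2", "GROUP_DIABETES"),  # only one occurrence, below the threshold
    ("p2", "LAB_HBA1C"),
    ("p3", "GROUP_DIABETES"),
    ("p3", "GROUP_DIABETES"),
    ("p3", "GROUP_DIABETES"),
]
disease_cohort = rule_based_phenotype(records, "GROUP_DIABETES")
print(sorted(disease_cohort))
```

A fixed threshold trades sensitivity for specificity in a single step, which is why it usually needs validation against chart review before use.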

Treatment Arms and Timing

With a given disease cohort, one may proceed to identify patients who received the relevant treatments, which are typically medications or procedures.

Use	Methods	Links	References
Identify indication conditions before treatment	Phenotyping with temporal input(Code: MSMR, TSPM, AgeMatters, Ref)	MSMR, TSPM, AgeMatters	High-throughput phenotyping with temporal sequences.
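
One simple way to picture treatment arm assignment with timing is to take each cohort patient's earliest treatment that starts on or after the first recorded indication. The sketch below assumes that rule; the record layout, drug names, and function name are hypothetical, and it is not the MSMR, TSPM, or AgeMatters method.

```python
from datetime import date


def assign_treatment_arms(cohort, treatments, first_diagnosis):
    """Map each cohort patient to (arm, start_date) for the earliest treatment
    that starts on or after the patient's first recorded diagnosis."""
    arms = {}
    # Visit treatments in chronological order so the first hit is the earliest.
    for patient_id, drug, start in sorted(treatments, key=lambda r: r[2]):
        if patient_id not in cohort or patient_id in arms:
            continue
        if patient_id not in first_diagnosis:
            continue
        if start >= first_diagnosis[patient_id]:
            arms[patient_id] = (drug, start)
    return arms


cohort = {"p1", "p2"}
first_diagnosis = {"p1": date(2020, 1, 10), "p2": date(2020, 3, 1)}
treatments = [
    ("p1", "drug_a", date(2020, 2, 1)),
    ("p1", "drug_b", date(2020, 5, 1)),
    ("p2", "drug_b", date(2020, 2, 15)),  # before the diagnosis, so skipped
    ("p2", "drug_b", date(2020, 4, 1)),
    ("p3", "drug_a", date(2020, 1, 5)),  # p3 is not in the disease cohort
]
arms = assign_treatment_arms(cohort, treatments, first_diagnosis)
print(arms)
```

The start date returned for each patient can then serve as the time zero for baseline variables and follow-up.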

Module three: Variable Extraction

RCT emulation with EHR data generally requires three categories of data elements: 1) the endpoints measuring the treatment effect; 2) the eligibility criteria used to match the RCT population; 3) the confounding factors needed to correct for the treatment-by-indication biases inherent in real-world data. Below, we describe the classification and extraction of the first two categories; confounding is addressed in Module 4.

Extraction of Baseline Variables or Endpoints

Use	Methods	Links	References
Extraction of binary variables through phenotyping	Same as Identify patients with disease of interest through phenotyping	Same as Identify patients with disease of interest through phenotyping	Same as Identify patients with disease of interest through phenotyping
Extraction of numerical variables through NLP	EXTEND (Code, Ref), NILE(Code, Ref)	EXTEND, NILE	EXTraction of EMR numerical data: an efficient and generalizable tool to EXTEND clinical research, Performance of a Machine Learning Algorithm Using Electronic Health Record Data to Identify and Estimate Survival in a Longitudinal Cohort of Patients With Lung Cancer
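
To give a feel for numerical extraction from free text, the sketch below pulls numbers that directly follow a variable name in a note. It is a minimal regular-expression illustration and not the EXTEND or NILE implementation; the pattern, the list of connecting words, and the function name are assumptions for this example.

```python
import re


def extract_numeric(note, variable_names):
    """Return float values that follow any of the variable names in the note."""
    names = "|".join(re.escape(name) for name in variable_names)
    # Variable name, an optional connector (is, was, of, ':' or '='), a number.
    pattern = re.compile(
        r"\b(?:" + names + r")\b\s*(?:is|was|of|:|=)?\s*(\d+(?:\.\d+)?)",
        flags=re.IGNORECASE,
    )
    return [float(value) for value in pattern.findall(note)]


note = "LVEF was 45% on echo. Prior EF: 50. Ejection fraction of 52.5 today."
values = extract_numeric(note, ["LVEF", "EF", "ejection fraction"])
print(values)
```

A pattern like this misses values separated from the variable name by other words and does not check units or plausible ranges, so its output would need validation.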

Extraction of Baseline Variables

Use	Methods	Links	References
Extraction of radiological characteristics through medical AI	Same as Identify patients with disease of interest through phenotyping	organs, blood vessel, neural system, CS-Net(Code, Ref), DeepLung(Code, Ref), nodule detection, cancer staging, fractional flow reserve	Abdominal multi-organ segmentation with organ-attention networks and statistical fusion, Blood vessel segmentation algorithms - Review of methods, datasets and evaluation metrics, Segmentation of Corneal Nerves Using a U-Net-Based Convolutional Neural Network, Channel and Spatial Attention Network for Curvilinear Structure Segmentation, Automated pulmonary nodule detection in CT images using deep convolutional neural networks, DeepLung: Deep 3D Dual Path Nets for Automated Pulmonary Nodule Detection and Classification, Diagnostic accuracy of a deep learning approach to calculate FFR from coronary CT angiography, Diagnostic accuracy of 3D deep-learning-based fully automated estimation of patient-level minimum fractional flow reserve from coronary computed tomography angiography

Extraction of Baseline Endpoints

Use	Methods	Links	References
Extraction of event time through incidence phenotyping	Unsupervised:AC_TPC(Code,Ref)	AC_TPC(Code,Ref)	Disease progression modeling using Hidden Markov Models, Temporal Phenotyping using Deep Predictive Clustering of Disease Progression
	Semi-supervised: SAMGEP(Code,Ref)	SAMGEP(Code,Ref)	Samgep: A novel method for prediction of phenotype event times using the electronic health record, Semi-supervised Approach to Event Time Annotation Using Longitudinal Electronic Health Records
	Supervised		Determining the Time of Cancer Recurrence Using Claims or Electronic Medical Record Data, Detecting Lung and Colorectal Cancer Recurrence Using Structured Clinical/Administrative Data to Enable Outcomes Research and Population Health Management

Module four: Validation and Robust Modelling

Confounding factors, which are variables that affect both the treatment assignment and the outcome, must be properly adjusted for. To minimize bias, the pipeline should include 1) validation for optimizing the medical informatics tools in Modules 2 and 3; 2) analyses that are robust to remaining data error; 3) comprehensive confounding adjustment.

Robust analysis and adjustment

Use	Methods	Links	References
Efficient and robust estimation of treatment effect with partially annotated noisy data	SMMAL(Code, Ref)		Efficient and Robust Semi-supervised Estimation of ATE with Partially Annotated Treatment and Response
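
To show why confounding adjustment matters, the sketch below estimates an average treatment effect (ATE) by stratifying on a single categorical confounder and standardizing over the strata. This is a textbook stratification estimator, not SMMAL, and it ignores annotation noise entirely; the data and names are made up for the example.

```python
from collections import defaultdict


def stratified_ate(rows):
    """Estimate the ATE by standardizing stratum-specific mean differences.

    rows: iterable of (stratum, treated, outcome) with treated in {0, 1}.
    """
    # stratum -> [[control_sum, control_n], [treated_sum, treated_n]]
    sums = defaultdict(lambda: [[0.0, 0], [0.0, 0]])
    total = 0
    for stratum, treated, outcome in rows:
        sums[stratum][treated][0] += outcome
        sums[stratum][treated][1] += 1
        total += 1
    ate = 0.0
    for stratum, ((sum0, n0), (sum1, n1)) in sums.items():
        if n0 == 0 or n1 == 0:
            raise ValueError(f"stratum {stratum!r} lacks one treatment arm")
        # Weight each stratum by its share of the whole population.
        ate += (n0 + n1) / total * (sum1 / n1 - sum0 / n0)
    return ate


def naive_difference(rows):
    """Unadjusted difference in mean outcome between treated and control."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Made-up data: "high" severity patients are treated more often and do worse.
rows = (
    [("low", 1, 1.0)]
    + [("low", 0, y) for y in (1.0, 1.0, 0.0, 0.0)]
    + [("high", 1, y) for y in (1.0, 1.0, 0.0, 0.0)]
    + [("high", 0, 0.0)]
)
adjusted = stratified_ate(rows)
naive = naive_difference(rows)
print(adjusted, naive)
```

In this toy data the unadjusted difference understates the within-stratum effect because treatment is concentrated in the sicker stratum, which is the treatment-by-indication pattern described above.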

