Serum Study of Liver Diseases

Link to repository: github.com/PHOENIXcenter/SerumStudyOfLiverDiseases

Summary of scripts used for feature selection and machine learning in the project
MS-based proteome data are from two independent cohort:

Cohort	HC (Healthy Control)	CHB (Chronic Hepatitis B)	LC (Liver Cirrhosis)	HCC (Hepatocellular Carcinoma)
Discovery cohort (n=125)	21	29	29	46
Validation cohort (n=75)	15	15	15	30

Script	Description
Feature_Importance.py	Contains feature selection part. we used the LASSO model to executing feature selection, randomly select 80% of the samples each time to build a total of 500 sub-models, and chose proteins selected in at least 10% of the sub-models as candidate protein features.
Classifier_Model.py	Contains machine learning modeling and performance evaluation part. We separately evaluated the performance of 6 classical machine learning models (SVM, Ridge, KNN, Naïve Bayes, Decision Tree and Random Forest) in each pairwise comparisons using corresponding candidate markers in both 5-fold discovery cohort and independent validation cohort. The main evaluation index are F1-score and AUROC.
Model_Visualization.py	Contains visualization of machine learning models, including confusion matrix and ROC curve of both 5-fold discovery cohort and validation cohort.

Data Availability

The complete serum proteomics data and clinical data are available from the authors upon reasonable request due to the need to maintain patient confidentiality.

The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the iProX partner repository with the dataset identifier PXD034201.

KaikunXu / Serum_Study_of_Liver_Diseases

Serum Study of Liver Diseases

Contents

Data Availability

Workflow of Machine Learning

About

Languages