comprna / riser

Biochemical-free enrichment or depletion of RNA classes in real-time during direct RNA sequencing

The performance on one sample from the HeLa dataset is far from what was reported. Any solutions?

charlesxu90 opened this issue

Dear @EduEyras @a-sneddon,

Thanks for this nice work. I'm trying to replicate the results on one sample, GSM6500715, from the HeLa dataset that you reported. However, I found the results confusing.

Below is the result. In this sample, the majority of the reads are from coding genes; I count a read as mapping to a coding/noncoding gene when its mapping quality score is greater than 5. BTW, I'm using the best CNN model that you provide with the GitHub code.

Test accuracy: 26.7%

Max. batch inference time: 11.37490177154541
Min. batch inference time: 0.0012841224670410156
Avg. batch inference time: 0.00510050101256548
Avg. inference time per signal: 0.00015939065664267126

Confusion matrix:
------------------
[[27337    58]
 [75593   161]]

TPR: 0.002125300314174829
FPR: 0.00211717466691002
Precision: 0.7351598173515982
#TP/#FP: 2.7758620689655173
AUC: 0.438

However, if I swap the positive and negative files, the model gives much better accuracy, but slightly poorer recall and much lower precision.

Test accuracy: 73.3%

Max. batch inference time: 15.053545713424683
Min. batch inference time: 0.0012507438659667969
Avg. batch inference time: 0.006252671145919535
Avg. inference time per signal: 0.00019539597330998546

Confusion matrix:
------------------
[[75593   161]
 [27337    58]]

TPR: 0.00211717466691002
FPR: 0.002125300314174829
Precision: 0.2648401826484018
#TP/#FP: 0.36024844720496896
AUC: 0.562

I checked the results: most predictions are 0 and only a few are 1. In principle, this sample should be enriched for coding genes, so nearly all reads should be predicted as 1, since most of them have 'True' labels from the mapping results.
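For reference, the reported numbers can be re-derived from the two confusion matrices. The sketch below is not part of the RISER code. It assumes the matrix layout is `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions), which is inferred from the fact that it reproduces the TPR, FPR, and precision values above. The function name `metrics` and the dictionary keys are hypothetical.

```python
def metrics(cm):
    """Derive summary metrics from a 2x2 confusion matrix.

    Assumed layout (inferred from the reported values, not from the tool's docs):
    cm = [[TN, FP], [FN, TP]], rows = true label, columns = predicted label.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "tpr": tp / (tp + fn),                    # recall
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "predicted_positive": (tp + fp) / total,  # share of reads predicted as 1
        "true_negative_share": (tn + fp) / total, # share of reads labeled negative
    }

original = metrics([[27337, 58], [75593, 161]])
swapped = metrics([[75593, 161], [27337, 58]])

for name, m in (("original", original), ("swapped", swapped)):
    print(name)
    for key, value in m.items():
        print(f"  {key}: {value:.4f}")
```

In both runs only about 0.2% of reads are predicted as 1, so the accuracy is essentially the share of reads labeled negative (about 27% originally, about 73% after swapping). That is why swapping the files changes the accuracy without the model separating the classes any better.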

So I guess it's due to the difference in data processing. I'm trying to process the data with the same approach you did. Let's see how it goes.

Solved. Using the latest mRNA model with signal processing, the model achieves an accuracy of 0.853.