xiaoyeye / CNNC

convolutional neural network based coexpression analysis

Confusion regarding inputs

hossam-zaki opened this issue

Hello!

I am trying to use this to build a gene regulatory network from my single-cell data. However, I am slightly confused by your inputs. In #3 of your README, I have to input gene_pair_list with the genes and the labels. What if I don't know the labels, since that is what I'm trying to find out? If we already input the labels into the model, then what is this used for? Let me know. Thanks!!!

Hi,
Thanks for your interest. Actually, it is a supervised method, so it is natural that we use labels to train the model. Once it is trained, the model should be able to predict unseen samples.
Let me know if you have any questions.
Best

Right, so is this used to train the model?

[Screenshot: Screen Shot 2020-09-04 at 5 02 38 PM]

[Screenshot: Screen Shot 2020-09-04 at 5 03 16 PM]

My concern is that I want to be able to predict the relationship between two genes. In that first picture it appears you already know the relationship between those genes, yet you want to predict it in STEP 3? I am very confused, as you can tell. Is the INPUT used to train the model or to evaluate it? If it is used to train the model, then how do we evaluate on our own data? If it is used to evaluate the model, then why would you input known labels in STEP 2?

Hi,
Section 7 is about how to use a trained model to predict unseen samples.
If you want to train a new model using your own data, please see section 8. For example, 8.3 uses a 3-fold cross-validation strategy to train and evaluate new models using your own data. Hope it helps.
Best
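
For readers unfamiliar with the idea, 3-fold cross-validation can be sketched as below. This is a generic illustration and not the repository's own script: the function name `three_fold_splits`, the toy gene pairs, and the round-robin fold assignment are all assumptions made for the example, and this thread does not say how section 8.3 actually forms its folds.

```python
import random

def three_fold_splits(pairs, n_folds=3, seed=0):
    """Yield (train, test) lists; each pair lands in a test set exactly once."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled pairs round-robin into n_folds groups.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [p for j, fold in enumerate(folds) if j != k for p in fold]
        yield train, test

# Hypothetical labeled gene pairs: (gene_a, gene_b, label).
pairs = [("g%d" % i, "g%d" % (i + 1), i % 2) for i in range(9)]
splits = list(three_fold_splits(pairs))
for fold_index, (train, test) in enumerate(splits):
    print(fold_index, len(train), len(test))
```

Each round trains on two thirds of the labeled pairs and evaluates on the held-out third, so every labeled pair is used as a test sample once.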

Thanks for the quick response!

What do you mean by unseen samples?

What if I am trying to figure out causality between two genes? I don't know what the label would be and want to figure it out using this, as mentioned by your paper. Can I use the pre-trained model to do this?

You are welcome.

Unseen samples are samples that the trained model has never seen during training. If unseen samples have labels, they can act as test samples. If they do not have labels, the trained model can still give them predicted values.

I believe that the pre-trained model should be able to predict causality between any two genes. The model provided was trained on a filtered KEGG dataset. I guess the edge direction is the causality you really want. Of course, you can also generate a gene pair label list as you like and train a new model yourself.

Ah, that makes a lot of sense. I have a lot of gene pairs with unknown labels. I have single-cell data for these gene pairs, and would like to figure out the causality between the two genes in each pair. Based on your previous response, I think I can use the pre-trained model to do this. Please correct me if I am wrong.

If I have this data, is all I have to do input the histogram of these two genes into the model? Or is there another step?

There are some problems. I thought you were trying to use the scRNA-seq data we provided. If so (and if you want to use the pre-trained model), what you need to do is 1) generate the histogram matrix for these two genes, and 2) use the pre-trained model to predict them. But please bear in mind that the expression data we provide contains hundreds of cell types, so it is an ideal dataset for KEGG prediction, which also contains regulation networks from different cell types. Please also note that when we do the TF target prediction, we use cell-type-specific expression data, with ChIP-seq as ground truth, which is also cell-type-specific.
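
To make "generate the histogram matrix for these two genes" concrete, here is a minimal sketch of a 2D co-expression histogram for one gene pair. It is not the repository's code: the function name `pair_histogram`, the `log1p` transform, the bin count, and the toy expression values are assumptions chosen for illustration, and the repository's own script defines the exact input format and matrix size the model expects. The example uses only the standard library; a NumPy routine such as `numpy.histogram2d` could replace the counting loop.

```python
import math

def pair_histogram(expr_a, expr_b, bins=8):
    """Return a bins x bins matrix of normalized co-occurrence counts."""
    if len(expr_a) != len(expr_b):
        raise ValueError("both genes must be measured in the same cells")
    # Assumed transform: log1p compresses the wide range of counts.
    xs = [math.log1p(v) for v in expr_a]
    ys = [math.log1p(v) for v in expr_b]

    def bin_index(value, lo, hi):
        if hi == lo:
            return 0
        idx = int((value - lo) / (hi - lo) * bins)
        return min(idx, bins - 1)  # the maximum value goes in the last bin

    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    counts = [[0] * bins for _ in range(bins)]
    # Each cell contributes one count at (bin of gene A, bin of gene B).
    for x, y in zip(xs, ys):
        counts[bin_index(x, x_lo, x_hi)][bin_index(y, y_lo, y_hi)] += 1
    total = len(xs)
    return [[c / total for c in row] for row in counts]

# Hypothetical expression values for two genes across the same 8 cells.
gene_a = [0, 1, 3, 7, 0, 2, 15, 4]
gene_b = [0, 0, 2, 9, 1, 2, 12, 3]
hist = pair_histogram(gene_a, gene_b, bins=4)
for row in hist:
    print(row)
```

The resulting matrix is the image-like input that a convolutional model can consume, with one matrix per gene pair.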

If you want to use your own scRNA-seq data, you need to 1) create a gene pair list with labels, which can be done using KEGG, Reactome, or other datasets (you can also use the list we provide); 2) generate histograms using the gene pair list; 3) train a new model; 4) do prediction for unknown gene pairs.
Hope it helps.
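
Step 1 above, building a labeled gene pair list from a reference network, could look roughly like the sketch below. Everything here is hypothetical: the gene names, the `label_pairs` helper, the tab-separated layout, and the labeling rule (1 for a known directed edge, 0 for every other ordered pair) are assumptions for illustration only. The README defines the actual `gene_pair_list` format and label meanings.

```python
import itertools

def label_pairs(genes, known_edges):
    """Return (gene_a, gene_b, label) rows for every ordered pair of distinct genes."""
    edges = set(known_edges)
    rows = []
    for a, b in itertools.permutations(genes, 2):
        # Assumed rule: label 1 if the reference network has a -> b, else 0.
        rows.append((a, b, 1 if (a, b) in edges else 0))
    return rows

# Hypothetical genes and directed edges taken from a reference database.
genes = ["geneA", "geneB", "geneC"]
known_edges = [("geneA", "geneB"), ("geneC", "geneA")]
rows = label_pairs(genes, known_edges)
lines = ["\t".join([a, b, str(label)]) for a, b, label in rows]
print("\n".join(lines))
```

Treating every pair absent from the reference as a negative is a simplification, since a missing edge may only mean the interaction has not been documented yet.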

Ah ok, that makes sense. Why do I need to retrain the model if I use my own scRNA-seq data, if I'll be training on the same dataset you trained it on?

Thank you so much for answering these questions. I am not experienced in either scRNA-seq or deep learning, but I am trying to learn as much as I can...

You are welcome.
For supervised learning, we should keep the inputs as consistent as possible; the ideal case is that all the inputs are i.i.d. (independent and identically distributed). If you want, you can also use the pre-trained model for your own scRNA-seq data, but I am afraid the results would be worse.

I see, so the KEGG data that I use to train the model should be consistent with the data that I input. And the result would be worse if I use the pre-trained model because the dataset used to train isn't optimized for my specific dataset?

Correct!

So sorry for the late response, but that makes a lot of sense! I will give this a go, and will circle back with any issues I get. Thanks again for your help!