Specifying targets in DeepSEA

Question

pjshort opened this issue 6 years ago · comments

Hello,

Thanks for all of your hard work on this! I see in the documentation here that you can specify a subset of targets (rather than the full 919 TFs): http://kipoi.org/models/DeepSEA/variantEffects/

If my understanding of this is correct, do you have any examples showing the format of the target file? Thank you!

krrome · Answer 1 · Sat Jul 14 2018 23:55:33 GMT+0800 (China Standard Time)

Hi Patrick, thanks a lot. Unfortunately, this is a case of the documentation lagging behind the implementation: the feature was implemented but eventually got removed, so at the moment there is no way to get variant effect predictions for only selected model outputs from the top-level command. I will soon add an alternative way to achieve that. It is good to hear that this is actually a requested feature, so I will give it higher priority.

Patrick Short · Answer 2 · Sun Jul 15 2018 00:08:27 GMT+0800 (China Standard Time)

The main reason for this request is that when running on a VCF file, the file size increases dramatically (the addition of 919 predictions to every line of the VCF) and the run-time is very long (~1000 predictions took 4 hours on the computing cluster I am using, although this is probably a function of using pytorch-cpu rather than a GPU as well?).

As I am only interested in a subset of the transcription factors/cell types, anyway, I thought I could improve run-time and decrease the file size by selecting a small subset.

If you are able to add an alternative way to do it that would be great - thank you so much!

krrome · Answer 3 · Sun Jul 15 2018 01:08:01 GMT+0800 (China Standard Time)

Sure, I will do that. By the way, if you do have GPUs available, you should use the --gpu flag when you create the environment with kipoi env ... Alternatively, as a simple fix for your existing environment, just run 'conda install pytorch torchvision -c pytorch' in that environment. If it is still very slow, then it is because of the file being written...
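
To confirm that an environment actually sees a GPU after reinstalling PyTorch, a quick check along these lines may help. This is only a sketch: the helper name gpu_status is made up for illustration, and it relies on nothing beyond PyTorch's torch.cuda.is_available().

```python
import importlib.util


def gpu_status():
    """Return 'no-torch', 'cpu-only', or 'gpu' for the current environment.

    gpu_status is a hypothetical helper, not part of kipoi.
    """
    # Avoid an ImportError if PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return "no-torch"
    import torch

    # True only when a CUDA-enabled PyTorch build finds a usable GPU.
    return "gpu" if torch.cuda.is_available() else "cpu-only"


print(gpu_status())
```

If this prints cpu-only on a machine that has a GPU, the environment most likely still contains the CPU build of PyTorch.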

Patrick Short · Answer 4 · Tue Jul 17 2018 20:36:51 GMT+0800 (China Standard Time)

Thank you! I'll give it a try using the --gpu flag and let you know how it works!

Žiga Avsec · Answer 5 · Thu Jul 19 2018 16:47:47 GMT+0800 (China Standard Time)

@krrome the output selection is now implemented in kipoi_veff, right?

krrome · Answer 6 · Thu Jul 19 2018 17:44:06 GMT+0800 (China Standard Time)

True, the requested functionality is now available in the CLI: "--model_outputs" takes the string identifiers of the model outputs, and "--model_outputs_i" takes their integer indices (0-based).
If you use the Python API, you can use the "output_filter" kwarg of score_variants in exactly the same way: single values, lists of values, and boolean output selection are all allowed.
To get an idea of what the string identifiers of the DeepSEA model are, take a look at "DeepSEA/variantEffects/predictor_names.txt". By the way, these are defined in model.yaml > schema > targets > column_labels.
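
The three selection styles mentioned above (string identifiers, 0-based integer indices, and a boolean mask) can be related to each other with a small sketch. This is not the kipoi_veff implementation. The helper resolve_output_selection and the placeholder labels are hypothetical; in practice the labels would come from the model's column_labels.

```python
def resolve_output_selection(labels, selection):
    """Map a selection (names, 0-based ints, or a boolean mask) to column indices.

    Hypothetical helper for illustration only; not a kipoi_veff function.
    """
    if isinstance(selection, bool):
        raise TypeError("a single boolean is ambiguous; pass a full mask instead")
    # Allow a single name or a single index as shorthand for a one-item list.
    if isinstance(selection, (str, int)):
        selection = [selection]
    selection = list(selection)

    # Boolean mask: one flag per model output, True means "keep this column".
    if selection and all(isinstance(s, bool) for s in selection):
        if len(selection) != len(labels):
            raise ValueError("boolean mask must have one entry per model output")
        return [i for i, keep in enumerate(selection) if keep]

    index_of = {name: i for i, name in enumerate(labels)}
    indices = []
    for s in selection:
        if isinstance(s, str):
            if s not in index_of:
                raise KeyError(f"unknown model output: {s}")
            indices.append(index_of[s])
        else:
            if not 0 <= s < len(labels):
                raise IndexError(f"output index out of range: {s}")
            indices.append(int(s))
    return indices


# Placeholder labels, NOT real DeepSEA output names.
labels = ["label_A", "label_B", "label_C"]
by_name = resolve_output_selection(labels, "label_B")
by_index = resolve_output_selection(labels, [2, 0])
by_mask = resolve_output_selection(labels, [True, False, True])
```

All three calls resolve to plain column positions, which is why names and indices can be used interchangeably once the label list is known.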

In order to use this, you will have to install the new version of kipoi and set the environment up with kipoi env create --vep --gpu DeepSEA/variantEffects. Let me know if you encounter any problems.

Patrick Short · Answer 7 · Thu Jul 19 2018 18:37:37 GMT+0800 (China Standard Time)

Hi @krrome and @Avsecz - thanks so much for all of your help!

I have this up and running now and the --gpu flag is way faster than running with pytorch-cpu. The --model_outputs and --model_outputs_i flags also work for me.

Interestingly, selecting a small number of model outputs doesn't seem to dramatically speed things up (at least not compared to switching from CPU to GPU).

Thanks again for all of your help - I think this can be closed!

krrome · Answer 8 · Thu Jul 19 2018 18:40:31 GMT+0800 (China Standard Time)

Selecting model outputs only reduces disk space and writing time. Multi-task models like DeepSEA always produce all the results, but we then choose to save only the selected ones. Happy to hear that it works.
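
A toy sketch of why output selection does not reduce compute time: the stand-in model below always produces every output, and the filter only decides which columns are kept for writing. Both toy_multitask_predict and score_and_filter are hypothetical, and the scores are fake. Only the count of 919 outputs comes from this thread.

```python
def toy_multitask_predict(sequence, n_outputs=919):
    """Stand-in for a multi-task model: always returns one score per output."""
    # Fake deterministic scores; a real model would run a full forward pass here.
    return [((len(sequence) + i) % 7) / 7.0 for i in range(n_outputs)]


def score_and_filter(sequences, keep_indices, n_outputs=919):
    """Compute all outputs per sequence, then keep only the selected columns."""
    rows = []
    computed = 0
    for seq in sequences:
        all_scores = toy_multitask_predict(seq, n_outputs)
        computed += len(all_scores)  # compute cost ignores the selection
        rows.append([all_scores[i] for i in keep_indices])
    return rows, computed


sequences = ["ACGT", "ACGTT"]
rows, computed = score_and_filter(sequences, keep_indices=[0, 5])
```

Here each row holds only two values, yet computed still counts 919 scores per sequence, which matches the observation that the filter saves disk space and writing time but not model run-time.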