kipoi / kipoiseq

Standard set of data-loaders for training and making predictions for DNA sequence-based models.

Home Page: https://kipoi.org/kipoiseq/

Design

Avsecz opened this issue · comments

Which dataloaders to use:

  • DNA sequence based
  • BigWig-based (useful for DNA + accessibility)
  • Efficient bed-file reader

API

  • use the torch.utils.data.Dataset API
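The torch.utils.data.Dataset contract is duck-typed: a map-style dataset only needs __len__ and __getitem__. A minimal sketch of what a kipoiseq dataloader following this API could look like (class and field names here are hypothetical, not the actual kipoiseq classes):

```python
class SeqDataset:
    """Map-style dataset: implements __len__ and __getitem__,
    which is all torch.utils.data.DataLoader requires."""

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        # Return a dict so inputs/targets can be batched by name
        return {"inputs": self.sequences[idx],
                "targets": self.labels[idx]}

ds = SeqDataset(["ACGT", "TTGA"], [0, 1])
```

In practice the class would subclass torch.utils.data.Dataset and pull sequences from a fasta extractor rather than an in-memory list.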

Nice to have

  • transforms
    • reverse-complementation
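A reverse-complementation transform is only a few lines; a minimal sketch (function name is illustrative):

```python
# Translation table for complementing DNA bases (upper- and lowercase)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]
```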

Integration with Kipoi

  • see how to expand default_dataloader

Two suggestions:

  • from our previous thread (#330 in kipoi/kipoi) re: DeepBind dataloaders, I would +1 pyfaidx for dealing with fasta indexing. It has fewer dependencies than GenomeLake (sp?).
  • the actual encoding from concise was good, but maybe could be factored out of the package? Installing the whole concise package just to encode DNA was a bit overkill

Agree. We should also add proper handling of variable widths etc.

I will indeed factor out the one-hot-encoder from concise to begin with, to keep things simple. It will not be as fast as genomelake, which uses cython to compile the one-hot-encode function.
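The factored-out encoder can be as simple as a lookup table. A pure-Python sketch (not the actual concise implementation; a real version would build a preallocated numpy array instead of nested lists):

```python
# One column per base in ACGT order; unknown bases (e.g. N) map to all zeros
ONEHOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
          "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) nested list."""
    return [ONEHOT.get(base, [0, 0, 0, 0]) for base in seq.upper()]
```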

Do you think one should install all the dependencies for all the dataloaders in the dataloaders package or should these be installed on the fly?

@lzamparo what was your issue once again with genomelake - pysam?

Migrating from pysam to pyfaidx: kundajelab/genomelake#19

@Avsecz my issue was that genomelake (via pysam) could not handle a fasta file containing a collection of intervals with their corresponding subsequences. It expected each record to be an entire chromosome from which the subsequences could be extracted.

I will indeed factor out the one-hot-encoder from concise to begin with, to keep things simple. It will not be as fast as genomelake, which uses cython to compile the one-hot-encode function.

Did you do profiling of cythonized versus pure python one-hot encoding? How much of a speedup did it provide?

Do you think one should install all the dependencies for all the dataloaders in the dataloaders package or should these be installed on the fly?

I'm not sure. I think for users, it would be better to have dependencies installed on an as-needed basis. However, this might be more work for you, so I understand if you'd want to bundle all dependencies together when they install a kipoi/dataloaders package (via conda I imagine?)

Did you do profiling of cythonized versus pure python one-hot encoding? How much of a speedup did it provide?

Yup. I did the benchmarks. For 1 kb it's 300 us vs 50 us, and for 10 kb it's 3 ms vs 0.5 ms. Hence a 6x speedup. I really want to make these dataloaders as fast as possible so that they become the default for people training models as well.
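This kind of measurement is easy to reproduce with timeit; a sketch (the one_hot here is a plain-Python stand-in, so absolute numbers will differ from the cython benchmark above):

```python
import random
import timeit

ONEHOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
          "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def one_hot(seq):
    return [ONEHOT[base] for base in seq]

# Random 1 kb sequence, timed over 100 calls
seq = "".join(random.choice("ACGT") for _ in range(1000))
per_call = timeit.timeit(lambda: one_hot(seq), number=100) / 100
print(f"1 kb one-hot: {per_call * 1e6:.0f} us per call")
```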

@Avsecz my issue was that genomelake (via pysam) could not handle a fasta file containing a collection of intervals with their corresponding subsequences. It expected each record to be an entire chromosome from which the subsequences could be extracted.

I migrated genomelake to pyfaidx. I need to fix the unit tests and then it should work. Could you please send me a slimmed-down version of your fasta file (containing a single sequence of length ~100) and the expected result so that I can add it to the unit tests? Migrating to pyfaidx will also solve the installation issues - if conda channels were in the wrong order, then the (old) pysam from anaconda was chosen.

I'm not sure. I think for users, it would be better to have dependencies installed on an as-needed basis. However, this might be more work for you, so I understand if you'd want to bundle all dependencies together when they install a kipoi/dataloaders package (via conda I imagine?)

Ok. I think that's a good idea, especially if the set of data-loaders and their dependencies starts growing. It's actually not more work: we need to annotate each dataloader in dataloader.yaml anyway.

Thanks for the discussion - I am currently implementing a first draft of the dataloaders repo and I think my approach agrees with many of your points. I have opted to:

  • write my own FastaExtractor-like class, called FastaReader to avoid confusion. This is necessary as some models (e.g. lsgkm-SVM) need string inputs, others one-hot encoded inputs, etc. It is therefore a wrapper around pyfaidx that operates on pybedtools.Interval objects.
  • the string sequence output can be converted by transformer methods, one of which is onehot_trafo, another will be char_list_trafo:
    • onehot_trafo is at the moment a three-liner which can at any point be replaced with the onehot method in genomelake. I didn't do that to avoid a genomelake dependency just for this one function.
    • char_list_trafo converts the sequence string returned by FastaReader into a list of characters, as required by some models.
  • For one-hot encoded sequences I define a Reshaper class that can add "dummy" dimensions and swap the order of the sequence and alphabet axes as required by the model.
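The transforms described above are small; a sketch of what they could look like (the names follow the comment, but the implementations are illustrative, using nested lists where the real code would use numpy arrays):

```python
ONEHOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
          "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def onehot_trafo(seq):
    """One-hot encode a DNA string as a (len, 4) nested list."""
    return [ONEHOT.get(base, [0, 0, 0, 0]) for base in seq.upper()]

def char_list_trafo(seq):
    """Split the sequence string into a list of characters."""
    return list(seq)

class Reshaper:
    """Swap (length, alphabet) axis order and/or add a dummy axis."""

    def __init__(self, swap_axes=False, add_dummy_axis=False):
        self.swap_axes = swap_axes
        self.add_dummy_axis = add_dummy_axis

    def __call__(self, encoded):
        if self.swap_axes:  # (len, 4) -> (4, len)
            encoded = [list(col) for col in zip(*encoded)]
        if self.add_dummy_axis:  # prepend a dummy dimension
            encoded = [encoded]
        return encoded
```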

Using these definitions, I get a neat and compact definition of dataloader classes that can cater for one-hot encoded models with all sorts of different input shapes and axis orders, string-sequence models, etc. See: https://github.com/kipoi/dataloaders/blob/first_draft/kipoi_dataloaders/sequence_based.py. Please feel free to take a look at the branch https://github.com/kipoi/dataloaders/tree/first_draft and give feedback :)

@lzamparo btw. I am more than happy to use a fasta reading package that has the functionality you requested. pyfaidx does not have it - e.g. requesting the sequence for the genomic interval chr15:98503515-98503517 given the fasta example file in your gist in kipoi/kipoi#330 fails, because pyfaidx interprets the entire string after the ">" as the sequence/chromosome name. If you can point me to a fasta reader that can handle this the way you specified, and that reader complies with the fasta standards, then I am more than happy to replace pyfaidx with it.

This is what happens when I use pyfaidx:

import pyfaidx
ff = pyfaidx.Fasta("tests/data/subseq_sample.fasta")
ff.get_seq("chr15", 98503515, 98503517)

Traceback (most recent call last):
  File ".../python3.6/site-packages/pyfaidx/__init__.py", line 628, in from_file
    i = self.index[rname]
KeyError: 'chr15'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../python3.6/site-packages/pyfaidx/__init__.py", line 1015, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File ".../python3.6/site-packages/pyfaidx/__init__.py", line 613, in fetch
    seq = self.from_file(name, start, end)
  File ".../python3.6/site-packages/pyfaidx/__init__.py", line 631, in from_file
    "Please check your FASTA file.".format(rname))
pyfaidx.FetchError: Requested rname chr15 does not exist! Please check your FASTA file.

on this fasta file:

>chr15:98503514-98503814
TGGCTCCTGCAAAGTCCTTACGCTGCTAGAAACCCCGCCCCCGTTCGTAACTATGGGAACGGCTGACACGCTGAGAGCCG

@Avsecz Here's a sample .fasta file in a gist, along with the test interval and expected output

@krrome I've not used get_seq before, but I used the dict interface of pyfaidx.Fasta to do slicing on fasta records from a file organized like the one in my original gist you reference.

There is one snag: you need to convert from the absolute indexing of the interval to relative indexing within the record, but that's pretty fast. Have a look at the TruncSeqDataset dataloader I made; you can see how I'm taking subsequences in __getitem__.
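The absolute-to-relative conversion can be sketched as below, for records named like ">chr15:98503514-98503814" (the helper name is hypothetical, and the exact offsets depend on whether the record names use 0- or 1-based coordinates):

```python
def to_relative(record_name, start, end):
    """Map absolute genomic coordinates onto a slice within a fasta
    record named like 'chr15:98503514-98503814'."""
    _, span = record_name.split(":")
    record_start = int(span.split("-")[0])
    return start - record_start, end - record_start

# e.g. slice the record's sequence with record[rel_start:rel_end]
rel_start, rel_end = to_relative("chr15:98503514-98503814",
                                 98503515, 98503517)
```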

@lzamparo thanks for the link. I could add your way of querying as a fallback in case the default get_seq throws a FetchError, but the problem I can see there is that it will only work if the bed query is matched by a fasta file entry...
In any case, if your suggestion solves your problem we can add it to the default dataloader, if @Avsecz agrees.

.. but the problem I can see there is that it will only work if the bed query is matched by a fasta file entry...

True, I ensured there would be correspondence by construction. Though I think it can handle the other common use case, where you have one fasta file per chromosome and the intervals can be taken as a subset without any conversion to relative coordinates. Up to you to decide how to handle the case where an interval has no match in the fasta provided.

Great! Thanks @lzamparo . I'll have a closer look at your implementation tomorrow and add the provided files to unit-tests.

Update

  • I'm almost done writing the transforms module that contains all the useful pre-processing functions like one-hot encoding, tokenizing or sequence trimming. I re-factored the one-hot-encoding functions from concise and also imported the fast one-hot encoding from genomelake (for the DNA case).
  • @krrome @lzamparo do you frequently use any other functions in addition to those in https://github.com/kipoi/kipoiseq/blob/master/kipoiseq/transforms/functional.py?
    • I was thinking of also adding the following two transforms:
      • k-mer or gapped k-mer counter
      • simple reverse complementation function
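A (contiguous) k-mer counter is a one-liner over a sliding window; a sketch (a gapped k-mer counter would additionally apply a positional mask to each window):

```python
from collections import Counter

def count_kmers(seq, k):
    """Count all contiguous k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```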

@Avsecz +1 for both transforms you suggest above, I can think of some handy use-cases for both. Also useful might be a random substring transform? I can imagine a scenario where you are working with longer sequences, but the property you're trying to predict is valid over the length of the sequence, and you want to use this for data augmentation.
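The random substring transform suggested here amounts to a random crop; a sketch (function name is hypothetical):

```python
import random

def random_substring(seq, length, rng=random):
    """Return a random contiguous substring of the given length,
    e.g. for data augmentation on long sequences."""
    if length >= len(seq):
        return seq
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]
```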

@lzamparo what's the motivation of storing chromosome subsets in the fasta file instead of using intervals to query the sequence directly?

>chr15:98503514-98503614
TGGCTCCTGCAAAGTCCTTACGCTGCTAGAAACCCCGCCCCCGTTCGTAACTATGGGAACGGCTGACACGCTGAGAGCCG
GAGATAACCCTGTTCCGCCC

I've started a new issue to keep track of the additional dataloaders: #8. The package is ready and on PyPI: pip install kipoiseq. I also created a small colab notebook: link.