FRED-2 / OptiType

Precision HLA typing from next-generation sequencing data

Question about hla_reference_dna.fasta

Atalasia opened this issue

Hello,

I was looking through the FASTA files provided in the data directory and noticed that some of the sequences in there combine the coding sequence of one allele with the non-coding sequence of another.

What puzzles me is that in some of them the two alleles differ in their first two digits (e.g. HLA07296_HLA00097 HLA-A_33:53 (introns from HLA-A_31:01:02)). It also seems that the exon/intron combinations that do share the first two digits are not listed exhaustively.

Is there a logic to how these alleles are combined? (And if I were to update this fasta to more recent alleles from the HLA db, what logic should I use to combine the alleles?)

Thanks.

@Atalasia,

As you might have noticed, HLA00097 is the accession for HLA-A_31:01:02 (just search for it in the file).
This just means that no intron sequence was available for the first allele (HLA07296), so it had to be imputed somehow.

Citing the paper:

We impute the missing sequence data by replacing it with its closest neighbor with respect to sequence similarity from among the complete allele sequences

So rather than you updating the database yourself, it would probably be easier if OptiType itself were updated (the authors should also have tested code for that). @andras86, what do you think?

I would also be very interested if that improves performance.

Thank you @messersc, this is exactly what happens: we lift over intron sequences from the most similar alleles. Updating them yourself would be difficult, as the process involves a lot of sanity checking and some manual work, but we're building a more-or-less automated pipeline to do it. We won't update the database for the current OptiType version, but the coming-soon OT2 (with Class II typing) will ship with the most recent one, and we'll keep updating it.
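For readers who want a feel for the idea, here is a minimal sketch of "borrow the introns of the most similar complete allele". It is not OptiType's actual procedure, which, as noted above, involves sanity checks and manual work. The similarity measure (`difflib.SequenceMatcher`), the dictionary layout, the function names and the toy sequences are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher

# Toy data, invented for this example: complete alleles with exon and intron sequence.
TOY_COMPLETE = {
    "HLA-A_31:01:02": {"exons": "ACGTACGTAC", "introns": ["gtaag", "ttcag"]},
    "HLA-A_02:01:01": {"exons": "TTGTCCGTAA", "introns": ["gtccg", "ctcag"]},
}


def closest_complete_allele(partial_exons, complete_alleles):
    """Return the name of the complete allele whose exons best match `partial_exons`."""
    def similarity(name):
        donor_exons = complete_alleles[name]["exons"]
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        return SequenceMatcher(None, partial_exons, donor_exons, autojunk=False).ratio()

    return max(complete_alleles, key=similarity)


def impute_introns(partial_exons, complete_alleles):
    """Pick the nearest complete allele and return (donor name, donor introns)."""
    donor = closest_complete_allele(partial_exons, complete_alleles)
    return donor, complete_alleles[donor]["introns"]


# An exon-only allele (again a made-up sequence) gets its introns from the nearest neighbor.
donor, introns = impute_introns("ACGTACGTAT", TOY_COMPLETE)
print(donor, introns)
```

Because the donor is chosen purely by sequence similarity, nothing forces it to share the first two digits with the exon-only allele, which is consistent with entries such as HLA-A_33:53 carrying introns from HLA-A_31:01:02.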

@andras86 Do let us know when OT2 is out; I'm definitely looking forward to it.

@andras86 Hello. I was just wondering how the automated pipeline for updating the OptiType database is coming along. I'm interested in using the updated HLA alleles with OptiType, but would rather not impute the introns on my own if a validated procedure already exists. Is any of your code related to this publicly available? Thank you.

@andras86 Hi. I have the same request for an updated HLA reference database, since I realized there are two things I cannot do by myself: 1. build genomic sequences for truncated alleles (from intron 1 to intron 3) by imputing the most similar introns from another allele; 2. format the "alleles.h5" file. Looking forward to the update.
Thank you.

@andras86 Looking forward to OT2. Thanks.

@AlfredShawn alleles.h5 is an HDF5 file, so you can explore it with any HDF5-capable package, e.g. the Python packages h5py or pandas.
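As a rough illustration of that kind of exploration, the sketch below walks an HDF5 file and lists its groups and datasets. The helper names `walk` and `describe_file` are hypothetical, and nothing here assumes a particular layout inside alleles.h5; the point is only to discover what the file contains. It needs the third-party h5py package for the actual file access.

```python
def walk(group, prefix=""):
    """Yield one description line per group or dataset below `group`.

    Works on any mapping-like node with .items(), which is the interface
    that h5py.File and h5py.Group objects expose.
    """
    for name, node in group.items():
        path = f"{prefix}/{name}"
        if hasattr(node, "items"):
            # A group: report it, then recurse into its children.
            yield f"{path}/ (group)"
            yield from walk(node, path)
        else:
            # A dataset (or other leaf): report shape and dtype when available.
            shape = getattr(node, "shape", None)
            dtype = getattr(node, "dtype", None)
            yield f"{path} shape={shape} dtype={dtype}"


def describe_file(path):
    """Print the layout of the HDF5 file at `path` (requires h5py)."""
    import h5py  # imported lazily so walk() stays usable without h5py installed

    with h5py.File(path, "r") as handle:
        for line in walk(handle):
            print(line)
```

Calling `describe_file("alleles.h5")` prints the tree. If the file turns out to have been written with pandas, `pandas.read_hdf(path, key)` can then load an individual table, using one of the top-level names printed above as `key`.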

All the best

@zimoun Could you please provide the automated pipeline for updating the OptiType database? I would also like to use the latest HLA database from IMGT.

I would appreciate any instructions for updating both hla_reference_dna.fasta and hla_reference_rna.fasta! Can anyone provide them, please?

@lidd77 @serge2016 My comment was more than 2 years ago. :-)
I do not use OptiType anymore, so I cannot provide anything.
Last, note that my comment did not suggest any "automated-pipeline for updating the OptiType database", only a way to explore alleles.h5 in order to work out its format.

All the best and Godspeed!

@zimoun and what do you use now instead of OptiType?