FRED-2 / OptiType

Precision HLA typing from next-generation sequencing data

Running OptiType for non-human samples

drtamermansour opened this issue

Is there a way to use our own set of MHC alleles?

Yes, there is, but it's a bit of work.

You have to provide your own reference FASTA file and overwrite the alleles HDF5 file, which contains meta-information about the MHC alleles in the reference.

See the relevant files in the data folder for reference.

Thanks for the response. How may I generate the alleles.h5 file? Is there a script somewhere that I can use to convert the alleles from FASTA format to HDF?

The HDF5 file stores multiple data frames created from the IMGT .dat file and the corresponding sequence file.

In hlatyper.py there is a function called create_allele_dataframes that consumes an IMGT .dat file and two FASTA files containing the HLA sequences as DNA and RNA. You can then store the data frames in an HDF5 file with the function store_dataframes, contained in the same file. Please check out the functions, adjust them accordingly, and make sure to give the data frames and columns the same names as in the original file; otherwise, you will run into problems when running the pipeline with your files.
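Because the pipeline depends on identical data frame and column names, it can help to compare the schema of a custom HDF5 file against the shipped one before running anything. The sketch below is not part of OptiType: `hdf_schema` and `schema_diff` are hypothetical helper names, and reading the file assumes pandas with PyTables is installed and that every stored object is a DataFrame.

```python
def hdf_schema(path):
    """Return {store key: [column names]} for every object in an HDF5 store."""
    import pandas as pd  # imported lazily so schema_diff works without pandas

    with pd.HDFStore(path, mode="r") as store:
        return {
            key: [str(col) for col in getattr(store[key], "columns", [])]
            for key in store.keys()
        }


def schema_diff(reference, custom):
    """List human-readable differences between two {key: columns} schemas."""
    problems = []
    for key, ref_cols in reference.items():
        if key not in custom:
            problems.append(f"missing frame: {key}")
            continue
        missing = [c for c in ref_cols if c not in custom[key]]
        extra = [c for c in custom[key] if c not in ref_cols]
        if missing:
            problems.append(f"{key}: missing columns {missing}")
        if extra:
            problems.append(f"{key}: unexpected columns {extra}")
    # Frames that exist only in the custom file are reported last.
    for key in custom:
        if key not in reference:
            problems.append(f"unexpected frame: {key}")
    return problems
```

A call such as `schema_diff(hdf_schema("data/alleles.h5"), hdf_schema("my_alleles.h5"))` (both paths are placeholders) returns an empty list when the names match. It only compares names, not dtypes or contents.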

I am also trying to mirror these file structures for canine. I was referencing the files in /data and noticed that hla_reference_rna.fasta does not start with "ATG". It looks like the starting "GCTCCCACT" motif in hla_reference_rna.fasta is the start of exon 2 for HLA00001, according to the .dat file from IMGT. Is this by design?

I'm not sure how the FASTA files relate to the .dat/.h5 files within the OptiType codebase, so I'm wondering whether not starting at exon 2 in the RNA FASTA for canine might mess things up.
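A quick way to repeat this observation on any reference FASTA is to print the first few bases of every record and flag whether it begins with a start codon. This is a minimal standard-library sketch, not OptiType code; `read_fasta` and `start_codon_report` are hypothetical helper names.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def start_codon_report(lines, prefix_len=9):
    """Map each record header to (starts_with_ATG, first prefix_len bases)."""
    return {
        header: (seq.startswith("ATG"), seq[:prefix_len])
        for header, seq in read_fasta(lines)
    }
```

Passing an open file handle, for example `start_codon_report(open("hla_reference_rna.fasta"))`, works because file objects iterate line by line.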

@b-schubert Could you provide more detailed construction rules, especially for intron sequences?

This is by design, as OptiType uses only exons 2 and 3 due to data availability at that time. It might be different for your organism.
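For anyone building a comparable reference for another organism, the exon restriction could be sketched as below. This is an illustration only and not how OptiType builds its own files: `extract_exons` is a hypothetical helper, and the coordinates are assumed to be 1-based and inclusive, taken by hand from the exon features of the corresponding .dat entry.

```python
def extract_exons(sequence, exon_coords, wanted=(2, 3)):
    """Concatenate the wanted exons of a sequence.

    exon_coords maps exon number -> (start, end), 1-based inclusive
    (an assumption about how the coordinates were collected).
    """
    parts = []
    for number in wanted:
        if number not in exon_coords:
            raise KeyError(f"no coordinates for exon {number}")
        start, end = exon_coords[number]
        # Convert 1-based inclusive coordinates to a Python slice.
        parts.append(sequence[start - 1:end])
    return "".join(parts)


# Toy example with made-up coordinates, not real allele data.
toy_seq = "ATGAAACCCGGGTTT"
toy_coords = {1: (1, 3), 2: (4, 9), 3: (10, 12), 4: (13, 15)}
print(extract_exons(toy_seq, toy_coords))  # AAACCCGGG
```

The `wanted` parameter is left configurable because, as noted above, the appropriate exons may differ for other organisms.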

Re introns: we imputed missing intronic information from the nearest-neighbour HLA allele that has intronic information. See the paper for more details.