infer_from_cds in UTRFetcher does not work correctly
Hoeze opened this issue
Florian R. Hölzlwimmer commented
Currently, inferring UTR regions from the CDS (infer_from_cds=True) works only for UTRs that are not spliced, i.e. that lie within a single exon:
kipoiseq/kipoiseq/extractors/gtf.py
Lines 348 to 368 in e67fab6
TODO:
- Generate test data with transcript that contains a spliced UTR, e.g.:
tabix /s/genomes/GenBank/hg38/annotation/hg38.ensGene.gtf.gz chr22 | grep -i ENST00000263207 > kipoiseq/tests/data/chr22_ENST00000263207.gtf
- Simulate some variants:
chr22_ENST00000263207_3UTR.vcf.gz
chr22_ENST00000263207_5UTR.vcf.gz
- Generate with infer_from_cds=False:
chr22_ENST00000263207_3UTR.alt_seqs.txt
chr22_ENST00000263207_3UTR.ref_seq.txt
chr22_ENST00000263207_5UTR.alt_seqs.txt
chr22_ENST00000263207_5UTR.ref_seq.txt
- Update tests:
kipoiseq/tests/extractors/test_protein.py
Lines 323 to 358 in 1d72daf
Florian R. Hölzlwimmer commented
The correct solution would be to use pyranges.intersect() to intersect the exons with the range transcript_start to CDS_start for the 5'UTR and with the range CDS_end to transcript_end for the 3'UTR.
Also, make sure that the last exon of the 5'UTR is shrunk to end at CDS_start and the first exon of the 3'UTR is shrunk to start at CDS_end.
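A minimal sketch of the clipping logic described above, in plain Python rather than pyranges so it can be run without dependencies. It is not the kipoiseq implementation. The function names (`clip`, `infer_utrs`), the half-open `(start, end)` interval convention, and the strand handling (on the minus strand the 5'UTR is the part at higher genomic coordinates) are assumptions made for illustration. Whether the stop codon counts as part of the CDS depends on the annotation and is not handled here.

```python
from typing import List, Tuple

# Hypothetical representation: half-open genomic intervals [start, end).
Interval = Tuple[int, int]


def clip(exons: List[Interval], lo: int, hi: int) -> List[Interval]:
    """Intersect every exon with [lo, hi) and drop empty results.

    Exons that straddle lo or hi are shrunk to the boundary, which is the
    "shrink the last / first exon" step from the comment above.
    """
    clipped = []
    for start, end in sorted(exons):
        new_start, new_end = max(start, lo), min(end, hi)
        if new_start < new_end:
            clipped.append((new_start, new_end))
    return clipped


def infer_utrs(
    exons: List[Interval], cds_start: int, cds_end: int, strand: str = "+"
) -> Tuple[List[Interval], List[Interval]]:
    """Return (five_prime_utr, three_prime_utr) as lists of exon pieces.

    cds_start / cds_end are the lowest and highest genomic coordinates
    of the CDS (assumption: same half-open convention as the exons).
    """
    transcript_start = min(start for start, _ in exons)
    transcript_end = max(end for _, end in exons)

    low_side = clip(exons, transcript_start, cds_start)
    high_side = clip(exons, cds_end, transcript_end)

    # Assumption: on the minus strand the roles of the two sides swap.
    if strand == "-":
        return high_side, low_side
    return low_side, high_side


# Toy transcript with a spliced UTR on each side of the CDS.
exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
five_utr, three_utr = infer_utrs(exons, cds_start=350, cds_end=550, strand="+")
print(five_utr)   # [(100, 200), (300, 350)]
print(three_utr)  # [(550, 600), (700, 800)]
```

With real annotations, the same intersection would presumably be done per transcript on the exon rows of the GTF, with the coordinates adjusted to whatever convention the extractor uses.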