NAL-i5K / GFF3toolkit

Python programs for processing GFF3 files

Add argument to generate non-unique CDS IDs for a given mRNA parent feature

mpoelchau opened this issue · comments

The GFF3 specification states that discontinuous features, such as CDS, need not have unique IDs. Instead, their lines can share an ID to indicate that they are all part of one discontinuous feature. Whether you want unique or shared IDs for the individual CDS lines of a given CDS feature usually depends on what you'll do with the GFF downstream. For example, for Tripal ingest, CDS lines corresponding to a single feature should share an ID. So it would be great if gff3_ID_generator.py had an option to not generate unique IDs for features that share a parent feature. For the user, I'd envision this as something like '-n'. The program would then generate only one ID for all CDS features that share a parent feature.

Example result for one gene with two isoforms using the proposed flag '-n CDS':

KZ848496.1      .       gene    715     17058   .       +       .       ID=LSTR000001;
KZ848496.1      .       mRNA    715     7345   .       +       .      Parent=LSTR000001;ID=LSTR000001-RA;
KZ848496.1      .       exon    715     899     .       +       .       ID=LSTR000001-RA-exon001;Parent=LSTR000001-RA
KZ848496.1      .       CDS     1418    1584    .       +       0       ID=LSTR000001-RA-CDS001;Parent=LSTR000001-RA
KZ848496.1      .       exon    7255    7345    .       +       .       ID=LSTR000001-RA-exon002;Parent=LSTR000001-RA
KZ848496.1      .       CDS     7255    7345    .       +       1       ID=LSTR000001-RA-CDS001;Parent=LSTR000001-RA
KZ848496.1      .       mRNA    13242     17058   .       +       .      Parent=LSTR000001;ID=LSTR000001-RB;
KZ848496.1      .       exon    13242   13331   .       +       .       ID=LSTR000001-RB-exon001;Parent=LSTR000001-RB;
KZ848496.1      .       CDS     13242   13331   .       +       1       ID=LSTR000001-RB-CDS001;Parent=LSTR000001-RB;
KZ848496.1      .       exon    15348   17058   .       +       .       ID=LSTR000001-RB-exon002;Parent=LSTR000001-RB;
KZ848496.1      .       CDS     15348   15540   .       +       1       ID=LSTR000001-RB-CDS001;Parent=LSTR000001-RB;
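
The naming pattern in the example above can be sketched in a few lines of Python. This is only an illustration of the requested behaviour, not the code in gff3_ID_generator.py: the function name `assign_child_ids`, the `shared_types` parameter, and the `(type, parent)` tuple layout are all assumptions made for this sketch.

```python
from collections import defaultdict


def assign_child_ids(children, shared_types=frozenset({"CDS"})):
    """Return one new ID per row of (feature_type, parent_id), in input order.

    Hypothetical helper: types listed in shared_types reuse a single ID
    per parent, all other types get a running number per parent.
    """
    counters = defaultdict(int)  # (parent, type) -> highest number issued
    ids = []
    for ftype, parent in children:
        key = (parent, ftype)
        if ftype in shared_types:
            # Discontinuous feature: every segment under this parent shares ID 001.
            counters[key] = 1
        else:
            counters[key] += 1
        ids.append(f"{parent}-{ftype}{counters[key]:03d}")
    return ids


rows = [
    ("exon", "LSTR000001-RA"),
    ("CDS", "LSTR000001-RA"),
    ("exon", "LSTR000001-RA"),
    ("CDS", "LSTR000001-RA"),
    ("CDS", "LSTR000001-RB"),
]
for (ftype, parent), new_id in zip(rows, assign_child_ids(rows)):
    print(ftype, parent, new_id)
```

Passing an empty `shared_types` set falls back to a unique, numbered ID for every CDS line, which corresponds to the behaviour without the proposed flag.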

I printed out all the dictionaries to help me understand the data processing,
but I could not get any information out of the loop over root.

screenshot

I am curious what this loop does.

@tony006469 here is the internal issue discussion for ID requirements: https://gitlab.com/i5k_Workspace/workspace_roadmap/issues/525

I divided it into two parts. In the first part, features of an unspecified type each get their own UUID. In the second part, features of the specified type share an ID. The first part was completed yesterday, and the second part is currently in progress.
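
A minimal sketch of these two parts, assuming features are available as simple dictionaries with `type` and `parent` keys. That layout, the function name `assign_uuid_ids`, and the `shared_type` parameter are hypothetical and do not reflect the data structures in gff3.py.

```python
import uuid


def assign_uuid_ids(features, shared_type=None):
    """Return one new ID per feature, in input order.

    Part 1: features whose type is not shared_type each get a fresh UUID.
    Part 2: features of shared_type reuse one UUID per parent.
    """
    shared = {}  # parent ID -> UUID shared by its children of shared_type
    ids = []
    for feat in features:
        parent = feat.get("parent")
        if shared_type is not None and feat["type"] == shared_type and parent:
            if parent not in shared:
                shared[parent] = str(uuid.uuid4())
            ids.append(shared[parent])
        else:
            ids.append(str(uuid.uuid4()))
    return ids


features = [
    {"type": "mRNA", "parent": "gene1"},
    {"type": "CDS", "parent": "mrna1"},
    {"type": "exon", "parent": "mrna1"},
    {"type": "CDS", "parent": "mrna1"},
    {"type": "CDS", "parent": "mrna2"},
]
new_ids = assign_uuid_ids(features, shared_type="CDS")
print(new_ids[1] == new_ids[3])  # both CDS lines under mrna1 share one ID
```

Calling the function with `shared_type=None` covers only the first part: every feature gets its own UUID.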

The screenshot shows that when I specify -t EXON, the other types still get UUIDs.
original

I'm digging into gff3.py to understand how a GFF3 file is processed and which data structures are used.

data.png

I tried to figure out which data columns are processed at each step of the generator.py, and compared the ID generator's input and output files to see the differences.

I made a draft of this feature. It adds an argument -t; passing -t CDS makes features of type CDS share an ID.
python gff3tool/lib/gff3_ID_generator.py -g merged.gff -og id.gff -uuid -r report.txt -t CDS
The screenshots of the output.gff and the report.txt are as follows.

out.gff
screenshot(draft).png

report.txt
screenshot(draft2).png

With the original command (without -t), a UUID is generated for every feature.
python gff3tool/lib/gff3_ID_generator.py -g merged.gff -og id.gff -uuid -r report.txt
uuid.png
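
One way to compare the two outputs without reading screenshots is to count the distinct IDs per parent for a given feature type. The helper names below (`parse_attributes`, `distinct_ids_per_parent`) are made up for this sketch, and it assumes tab-separated GFF3 lines with a single Parent value per feature. A count of 1 per parent means the CDS lines share an ID.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict, tolerating a trailing ';'."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def distinct_ids_per_parent(gff_lines, feature_type="CDS"):
    """Map each Parent to the number of distinct IDs used by its children."""
    ids = defaultdict(set)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and "Parent" in attrs:
            ids[attrs["Parent"]].add(attrs["ID"])
    return {parent: len(found) for parent, found in ids.items()}


sample = [
    "##gff-version 3",
    "KZ848496.1\t.\tCDS\t1418\t1584\t.\t+\t0\tID=LSTR000001-RA-CDS001;Parent=LSTR000001-RA",
    "KZ848496.1\t.\tCDS\t7255\t7345\t.\t+\t1\tID=LSTR000001-RA-CDS001;Parent=LSTR000001-RA",
    "KZ848496.1\t.\texon\t715\t899\t.\t+\t.\tID=LSTR000001-RA-exon001;Parent=LSTR000001-RA",
]
print(distinct_ids_per_parent(sample))
```

To check a real output such as id.gff, pass the open file object instead of the sample list, since the function only iterates over lines.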

https://github.com/NAL-i5K/GFF3toolkit/tree/uuid_cds

Pull Request: #90

gff3_ID.png