Add argument to generate non-unique CDS IDs for a given mRNA parent feature
mpoelchau opened this issue · comments
The GFF3 specification states that discontinuous features, such as CDS, need not have unique IDs. Instead, their lines can share an ID to indicate that they are all part of one discontinuous feature. Whether you want unique or shared IDs for the individual CDS lines of a given CDS feature usually depends on what you'll do with the GFF downstream. For example, for Tripal ingest, CDS lines corresponding to a single feature should share an ID. So it would be great if gff3_ID_generator.py had an option to not generate unique IDs for features that share a parent feature. For the user, I'd envision this as something like '-n'. The program would then generate only 1 ID for all CDS features that share a parent feature.
Example result for 1 gene with 2 isoforms using the proposed flag '-n CDS':
KZ848496.1 . gene 715 17058 . + . ID=LSTR000001;
KZ848496.1 . mRNA 715 7345 . + . Parent=LSTR000001;ID=LSTR000001-RA;
KZ848496.1 . exon 715 899 . + . ID=LSTR000001-RA-exon001;Parent=LSTR000001-RA
KZ848496.1 . CDS 1418 1584 . + 0 ID=LSTR000001-RA-CDS001;Parent=LSTR000001-RA
KZ848496.1 . exon 7255 7345 . + . ID=LSTR000001-RA-exon002;Parent=LSTR000001-RA
KZ848496.1 . CDS 7255 7345 . + 1 ID=LSTR000001-RA-CDS001;Parent=LSTR000001-RA
KZ848496.1 . mRNA 13242 17058 . + . Parent=LSTR000001;ID=LSTR000001-RB;
KZ848496.1 . exon 13242 13331 . + . ID=LSTR000001-RB-exon001;Parent=LSTR000001-RB;
KZ848496.1 . CDS 13242 13331 . + 1 ID=LSTR000001-RB-CDS001;Parent=LSTR000001-RB;
KZ848496.1 . exon 15348 17058 . + . ID=LSTR000001-RB-exon002;Parent=LSTR000001-RB;
KZ848496.1 . CDS 15348 15540 . + 1 ID=LSTR000001-RB-CDS001;Parent=LSTR000001-RB;
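The grouping rule requested above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in gff3_ID_generator.py: the function name `assign_ids`, the `shared_types` parameter, and the `<parent>-<type>NNN` naming are assumptions taken from the example output, and the input is simplified to (type, parent) pairs instead of full GFF3 lines.

```python
from collections import defaultdict


def assign_ids(features, shared_types=frozenset()):
    """Return one ID per feature, in input order.

    features: list of (feature_type, parent_id) tuples (hypothetical input shape).
    shared_types: feature types whose lines share one ID per parent.
    """
    counters = defaultdict(int)  # (parent, type) -> how many unique IDs were issued
    shared = {}                  # (parent, type) -> the single ID reused by that group
    ids = []
    for ftype, parent in features:
        key = (parent, ftype)
        if ftype in shared_types:
            # First line of the group creates the ID; later lines reuse it.
            if key not in shared:
                shared[key] = f"{parent}-{ftype}001"
            ids.append(shared[key])
        else:
            # Default behaviour: every line gets its own numbered ID.
            counters[key] += 1
            ids.append(f"{parent}-{ftype}{counters[key]:03d}")
    return ids


features = [
    ("exon", "LSTR000001-RA"),
    ("CDS", "LSTR000001-RA"),
    ("exon", "LSTR000001-RA"),
    ("CDS", "LSTR000001-RA"),
]
shared_ids = assign_ids(features, shared_types={"CDS"})
unique_ids = assign_ids(features)
```

With `shared_types={"CDS"}` both CDS lines of isoform RA receive `LSTR000001-RA-CDS001`, while the exons keep distinct IDs; without it, the second CDS line becomes `LSTR000001-RA-CDS002`.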
@tony006469 here is the internal issue discussion for ID requirements: https://gitlab.com/i5k_Workspace/workspace_roadmap/issues/525
I divided the work into two parts. In the first part, feature types that are not specified get a UUID. In the second part, features of the specified type share an ID. The first part was completed yesterday, and the second part is currently in progress.
The screenshot shows that if I specify EXON with -t, the other types still get a UUID.
I made a draft of this feature: adding the argument -t makes features of the given type (here CDS) share an ID.
python gff3tool/lib/gff3_ID_generator.py -g merged.gff -og id.gff -uuid -r report.txt -t CDS
The screenshots of the output.gff and the report.txt are as follows.
When we use the original command (without -t), it generates a UUID for each feature.
python gff3tool/lib/gff3_ID_generator.py -g merged.gff -og id.gff -uuid -r report.txt
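One way to compare the two runs without relying on screenshots is to group the CDS lines of the output by Parent and count the distinct IDs in each group. The sketch below is a hypothetical helper, not part of GFF3toolkit. It assumes tab-separated GFF3 with single-valued Parent attributes, and the sample lines are taken from the example in the issue description.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn 'ID=x;Parent=y;' into {'ID': 'x', 'Parent': 'y'}."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def ids_per_parent(gff_lines, feature_type="CDS"):
    """Map each Parent to the set of IDs used by its lines of feature_type."""
    groups = defaultdict(set)
    for line in gff_lines:
        # Skip blank lines, comments, and directives.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and "Parent" in attrs:
            groups[attrs["Parent"]].add(attrs["ID"])
    return dict(groups)


sample = [
    "KZ848496.1\t.\tCDS\t1418\t1584\t.\t+\t0\tID=LSTR000001-RA-CDS001;Parent=LSTR000001-RA",
    "KZ848496.1\t.\tCDS\t7255\t7345\t.\t+\t1\tID=LSTR000001-RA-CDS001;Parent=LSTR000001-RA",
    "KZ848496.1\t.\tCDS\t13242\t13331\t.\t+\t1\tID=LSTR000001-RB-CDS001;Parent=LSTR000001-RB;",
]
groups = ids_per_parent(sample)
# True when every parent has exactly one ID across its CDS lines.
all_shared = all(len(ids) == 1 for ids in groups.values())
```

To check a real output file, the same function could be called on the open file, for example `ids_per_parent(open("id.gff"))`. A run with -t CDS would be expected to give one ID per parent, and a run without it one ID per CDS line.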
https://github.com/NAL-i5K/GFF3toolkit/tree/uuid_cds
python gff3tool/lib/gff3_ID_generator.py -g merged.gff -og id.gff -uuid -r report.txt -t CDS