AIRR file

Question

snowylxx opened this issue 10 months ago · comments

Why is there no cdr3 column in the airr.tsv?

snowylxx · Answer 1 · Tue Sep 19 2023 10:52:39 GMT+0800 (China Standard Time)

Also, I have one more question: what is the difference between the report and the AIRR file in the output? What do the assemble numbers (assemble1_0, assemble1_1, ...) in the AIRR file mean? Are they all merged into assemble1 in the report file? If so, a line in the AIRR file cannot be counted as a single TCR or BCR sequence, right?

Li Song · Answer 2 · Tue Sep 19 2023 11:19:48 GMT+0800 (China Standard Time)

The CDR3 information is in the junction column, which holds the CDR3 sequence plus the conserved flanking motif amino acids (3 nucleotides on each side). The "junction" column is required in the AIRR format, whereas "CDR3" is optional. Therefore, we only put the junction column in the output.
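If a downstream tool expects the optional AIRR-style cdr3 field, it can be derived from the junction by dropping the conserved codon on each side. The following is only a minimal sketch under that assumption; the function names are hypothetical and are not part of the tool's output or API.

```python
def junction_to_cdr3(junction_nt: str) -> str:
    """Drop the conserved flanking codon (3 nt) from each end of a junction."""
    # A junction shorter than two codons has no CDR3 left after trimming.
    if len(junction_nt) < 6:
        return ""
    return junction_nt[3:-3]


def junction_aa_to_cdr3_aa(junction_aa: str) -> str:
    """Drop the conserved flanking residue from each end of junction_aa."""
    if len(junction_aa) < 2:
        return ""
    return junction_aa[1:-1]


# Hypothetical junction: TGT GCC AGC AGT TTT (C A S S F)
example_cdr3_nt = junction_to_cdr3("TGTGCCAGCAGTTTT")
example_cdr3_aa = junction_aa_to_cdr3_aa("CASSF")
```

Here the trimmed values are "GCCAGCAGT" and "ASS". Partial junctions (missing one of the conserved residues) would need extra handling that this sketch does not attempt.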

The report is a summary of CDR3 (more exactly, junction) sequences. The AIRR file also includes extra information such as the sequence alignment, the germline alignment, and other columns related to non-CDR3 regions. "assemble1" is the consensus contig id in the *_annot.fa file, and the suffix _0, _1 is the index of the minor CDR3s encoded in that consensus contig, e.g. low-abundance CDR3s arising from SHMs. The file contains both TCRs and BCRs, and I plan to add the "locus" column to the AIRR output in the next release to simplify parsing.
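To make the naming concrete, here is a small sketch that splits ids such as assemble1_0 into the consensus contig id and the CDR3 index, then groups AIRR rows by contig. It assumes the `sequence_id` column follows the `<contig>_<index>` pattern described above; the sample rows and helper names are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def split_sequence_id(sequence_id):
    """Split 'assemble1_0' into ('assemble1', 0) at the last underscore."""
    contig, _, index = sequence_id.rpartition("_")
    return contig, int(index)


def group_by_contig(tsv_handle):
    """Map each consensus contig id to its (index, junction_aa) entries."""
    groups = defaultdict(list)
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        contig, index = split_sequence_id(row["sequence_id"])
        groups[contig].append((index, row["junction_aa"]))
    # Sort so that _0, _1, _2, ... appear in order within each contig.
    return {contig: sorted(entries) for contig, entries in groups.items()}


# Hypothetical AIRR rows, reduced to the two columns used here.
SAMPLE = (
    "sequence_id\tjunction_aa\n"
    "assemble1_0\tCARDYW\n"
    "assemble1_1\tCARDFW\n"
    "assemble2_0\tCASSLF\n"
)
grouped = group_by_contig(io.StringIO(SAMPLE))
```

With this sample, `grouped["assemble1"]` holds two entries (indices 0 and 1), each of which is a separate junction derived from the same consensus contig.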

snowylxx · Answer 3 · Wed Sep 20 2023 11:24:11 GMT+0800 (China Standard Time)

So the junction_aa column is the same as the CDR3aa column in the report file (the junction column has one more amino acid at the head and tail of the sequence)?

When I use these results for downstream analysis, can all sub-consensus contigs (e.g. assemble1_0, _1, _2, ...) be used, or only the sequence with the highest abundance (perhaps assemble1_0)?

Li Song · Answer 4 · Wed Sep 20 2023 11:30:37 GMT+0800 (China Standard Time)

Yes, the junction_aa in AIRR is identical to the CDR3aa in the report. Sorry about the terminology confusion.

You should use all the sub-consensus contigs, i.e. including all of _0, _1, _2, .... In many cases, those are real.

snowylxx · Answer 5 · Wed Sep 20 2023 11:48:11 GMT+0800 (China Standard Time)

OK, thank you so much. You are so kind. ^_^

By the way, the "locus" column can be derived from the V(D)J gene calls (e.g. IGHV1-69*06, IGHD3-10*01, IGHJ4*02 indicate the IGH locus).

Li Song · Answer 6 · Wed Sep 20 2023 11:51:29 GMT+0800 (China Standard Time)

Right, I added the "locus" column explicitly earlier today, which might simplify some workflows and analyses. It's a bit tricky for TRA and TRD, where some of the V genes can be used in both chains.
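For output that lacks the "locus" column, a rough way to infer it from the gene calls might look like the sketch below. This is not how the tool itself assigns the locus; it assumes the AIRR `v_call` and `j_call` columns start with the usual three-letter locus prefix, and it uses the J gene as a tiebreaker for V genes shared between TRA and TRD. The function name is hypothetical.

```python
KNOWN_LOCI = {"IGH", "IGK", "IGL", "TRA", "TRB", "TRG", "TRD"}


def infer_locus(v_call, j_call=""):
    """Guess the locus from the gene-name prefixes of the V and J calls."""
    v_locus = v_call[:3].upper()
    j_locus = j_call[:3].upper()
    # Some V genes (e.g. TRAV14/DV4) can rearrange into alpha or delta
    # chains, so for TRA/TRD the J gene is treated as more informative.
    if v_locus in {"TRA", "TRD"} and j_locus in {"TRA", "TRD"}:
        return j_locus
    if v_locus in KNOWN_LOCI:
        return v_locus
    if j_locus in KNOWN_LOCI:
        return j_locus
    return None  # Neither call has a recognizable prefix.


heavy = infer_locus("IGHV1-69*06", "IGHJ4*02")
delta = infer_locus("TRAV14/DV4*01", "TRDJ1*01")
alpha = infer_locus("TRAV14/DV4*01", "TRAJ12*01")
```

In this sketch the same V gene call yields TRD or TRA depending on the J gene, which is the ambiguity mentioned above.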

snowylxx · Answer 7 · Wed Sep 20 2023 19:56:46 GMT+0800 (China Standard Time)

What do you mean by TRA and TRD V genes being used in both chains? Do you mean that some TRA genes can appear in both α and β chains, and some TRD genes can appear in both δ and γ chains?

Li Song · Answer 8 · Fri Sep 22 2023 10:03:56 GMT+0800 (China Standard Time)

I mean that some V genes can be recombined into either α or δ chains.