liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

AIRR file

snowylxx opened this issue · comments

Why is there no cdr3 column in the airr.tsv?

I also have one more question. What is the difference between the report and the AIRR file in the output? What do the assemble numbers (assemble1_0, assemble1_1, ...) in the AIRR file mean? Are they all assembled into assemble1 in the report file? If so, a single line in the AIRR file cannot be counted as one TCR or BCR sequence, right?

The CDR3 information is in the junction column, which holds the CDR3 sequence plus the flanking motif amino acids (one amino acid, i.e. 3 nucleotides, on each side). The "junction" column is required in the AIRR format, whereas "CDR3" is optional, so we only put the junction column in the output.
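For readers who need a CDR3 without the flanking residues, the relationship described above can be sketched in a few lines. This is a minimal illustration, not part of TRUST4: the function names and the example sequence are hypothetical, and it assumes the junction is in frame with exactly one flanking codon on each side.

```python
def junction_to_cdr3(junction: str) -> str:
    """Drop one flanking codon (3 nt) from each end of a junction sequence."""
    if len(junction) < 6:
        return ""  # too short to contain both flanking codons
    return junction[3:-3]


def junction_aa_to_cdr3_aa(junction_aa: str) -> str:
    """Drop one flanking amino acid from each end of a junction_aa sequence."""
    if len(junction_aa) < 2:
        return ""
    return junction_aa[1:-1]


# Hypothetical example: TGT GCC AGC AGT TTC translates to C A S S F.
example_junction = "TGTGCCAGCAGTTTC"
example_junction_aa = "CASSF"
print(junction_to_cdr3(example_junction))
print(junction_aa_to_cdr3_aa(example_junction_aa))
```

Whether to trim at all depends on the downstream tool, since some tools expect the sequence with the flanking residues included.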

The report is a summary of the CDR3 (more precisely, junction) sequences. The AIRR file also includes extra information such as the sequence alignment, germline alignment, and other columns related to non-CDR3 regions. "assemble1" is the consensus contig ID in the *_annot.fa file, and the suffix _0, _1 is the index of the minor CDR3s encoded in that consensus contig, e.g. low-abundance CDR3s arising from SHMs. The file contains both TCRs and BCRs, and I plan to add the "locus" column to the AIRR output in the next release to simplify parsing.
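The ID scheme described above can be unpacked as follows. This is a rough sketch under the assumption that IDs look like assemble1_0, with the part before the last underscore being the consensus contig ID and the part after it being the CDR3 index. The helper names are made up for illustration.

```python
from collections import defaultdict


def split_sequence_id(sequence_id: str) -> tuple[str, int]:
    """Split an ID such as 'assemble1_0' into ('assemble1', 0)."""
    contig, _, index = sequence_id.rpartition("_")
    if not contig or not index.isdigit():
        raise ValueError(f"unexpected ID format: {sequence_id!r}")
    return contig, int(index)


def group_by_contig(sequence_ids: list[str]) -> dict[str, list[int]]:
    """Map each consensus contig ID to the sorted CDR3 indices seen for it."""
    groups: dict[str, list[int]] = defaultdict(list)
    for sequence_id in sequence_ids:
        contig, index = split_sequence_id(sequence_id)
        groups[contig].append(index)
    return {contig: sorted(indices) for contig, indices in groups.items()}


# Hypothetical IDs, in the style mentioned in this thread.
ids = ["assemble1_0", "assemble1_1", "assemble2_0"]
print(group_by_contig(ids))
```

Grouping this way shows which AIRR rows share a consensus contig while each row remains its own CDR3.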

So the junction_aa column is the same as the CDR3aa column in the report file (the junction column has one more amino acid at the head and tail of the sequence)?

When I use these results for downstream analysis, can all the sub-consensus contigs (e.g. assemble1_0, _1, _2, ...) be used, or only the sequence with the highest abundance (possibly assemble1_0)?

Yes, the junction_aa in AIRR is identical to the CDR3aa in the report. Sorry about the confusing terminology.

You should use all the sub-consensus contigs, i.e. including all of _0, _1, _2, and so on. In many cases, those are real.

OK, thank you so much. You are so kind. ^_^

By the way, the "locus" column information can be obtained from the VDJ gene names (e.g. IGHV1-69*06, IGHD3-10*01, IGHJ4*02 indicate the IGH locus).

Right, I just added the "locus" column explicitly today, which might simplify some workflows and analyses. It's a bit tricky for TRA and TRD, where some of the V genes can be used in both chains.

What do you mean by TRA and TRD V genes being used in both chains? Do you mean that some TRA genes can appear in both α and β chains, and some TRD genes can appear in both δ and γ chains?

I mean that some V genes can be recombined into either α or δ chains.
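For output that lacks an explicit "locus" column, the gene-name approach discussed above could look roughly like the sketch below. It is not TRUST4's implementation. The function name is hypothetical, and it assumes IMGT-style gene names whose first three letters give the locus. Because some V genes can be recombined into either α or δ chains, the sketch checks the J call first, then D, and falls back to V only when neither is available.

```python
def infer_locus(v_call: str = "", d_call: str = "", j_call: str = "") -> str:
    """Guess the locus (e.g. 'IGH', 'TRD') from IMGT-style gene calls.

    J and D calls are checked before the V call, because a shared
    alpha/delta V gene does not identify the chain by itself.
    """
    for call in (j_call, d_call, v_call):
        # Use the first gene if several comma-separated calls are listed.
        first = call.split(",")[0].strip()
        if len(first) >= 3 and first[:2] in ("IG", "TR"):
            return first[:3]
    return ""  # no usable gene call


# The BCR example from this thread resolves to IGH.
print(infer_locus("IGHV1-69*06", "IGHD3-10*01", "IGHJ4*02"))
# A shared alpha/delta V gene paired with a delta J gene resolves to TRD.
print(infer_locus("TRAV29/DV5*01", "", "TRDJ1*01"))
```

When only a V call is present, the result for a shared α/δ V gene remains ambiguous, so an explicit "locus" column is preferable when available.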