liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data


Confusion with count and frequency in results.tsv

dgagler opened this issue · comments

Hi. I'm confused by parts of the results.tsv file. What are the definitions of the count and frequency columns? My understanding is that frequency is the number of reads mapped to a CDR3 region divided by the total BCR reads. When I sum the values in my frequency column, however, I get a value over 1. Furthermore, the values for the top clones in this example don't add up. See the screenshot of the top clones in a report.tsv sorted by frequency:

[Screenshot: top clones in report.tsv sorted by frequency]

clone 1 has 2630 reads and a frequency of .3
clone 2 has 315 reads and a frequency of .28

The only thing I can think of is that this is because clone 1 is a light chain (kappa) clone and clone 2 is heavy chain. How would this affect things?

Furthermore, do you have any insight into why most cells are light chain and not heavy chain? Is it possible to have analyzed reads of both heavy and light from the same cells?

Full report.tsv is attached for reference. Thanks.

01-076_report copy.csv

Yes, each chain is normalized separately to get the frequency value, with the IGK and IGL read counts combined into a single light-chain group. Therefore, the frequency column can sum to as much as 6, one for each group (IGH, IGK/IGL, TRB, TRA, TRG, TRD). Light chains are usually expressed at much higher levels than heavy chains, so the sensitivity for recovering light chains is higher.
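To make the per-chain normalization concrete, here is a minimal sketch in Python. It is not TRUST4's own code: the helper names (`chain_group`, `per_chain_frequency`) are hypothetical, the chain is assumed to be identifiable from the first three letters of a gene name, and the read counts are toy numbers, not values from the attached report.

```python
from collections import defaultdict


def chain_group(gene_name: str) -> str:
    """Map a gene name such as 'IGKV1-5*01' to its normalization group.

    Assumption: the locus is the first three letters of the gene name,
    and IGK and IGL share one group, as described in the reply above.
    """
    locus = gene_name[:3].upper()
    return "IGK/IGL" if locus in ("IGK", "IGL") else locus


def per_chain_frequency(clones):
    """Return one frequency per clone, normalized within its chain group.

    clones: list of (read_count, gene_name) tuples.
    """
    totals = defaultdict(int)
    for count, gene in clones:
        totals[chain_group(gene)] += count
    return [count / totals[chain_group(gene)] for count, gene in clones]


# Toy data (illustrative only): three light-chain clones, two heavy-chain clones.
clones = [
    (300, "IGKV1-5*01"),
    (500, "IGLV2-14*01"),
    (200, "IGKV3-20*01"),
    (30, "IGHV3-23*01"),
    (70, "IGHV1-69*01"),
]
freqs = per_chain_frequency(clones)

for (count, gene), freq in zip(clones, freqs):
    print(f"{gene}\t{count}\t{freq:.2f}")
print("sum of frequencies:", round(sum(freqs), 6))
```

In this toy table the 300-read kappa clone and the 30-read heavy clone both get a frequency of 0.3, because each is divided by the total for its own group, and the column sums to 2 because two groups are present.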

If you need to define clonotypes by paired chains and calculate their frequencies, you may need to write your own code or use single-cell TCR/BCR analysis tools. There are many variations, such as whether to keep single-chain cells. You may also want to calculate the diversity for different cell clusters (inferred from gene expression). Therefore, we do not provide comprehensive statistics for paired-chain comparisons. For single-cell downstream analysis, the barcode_report.tsv file or the barcode_airr.tsv file would be more appropriate.
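As a rough illustration of the "write your own code" route, the sketch below counts paired-chain clonotypes per cell and exposes the single-chain choice as a flag. Everything here is an assumption for illustration: the function name `paired_clonotype_frequency` is hypothetical, the per-cell heavy and light CDR3 values are assumed to have been extracted already (for example from barcode_report.tsv), the `*` placeholder for a missing chain is assumed, and the barcodes and sequences are made up.

```python
from collections import Counter

MISSING = "*"  # assumed placeholder for a chain that was not recovered


def paired_clonotype_frequency(cells, keep_single_chain=False):
    """Frequency of each (heavy CDR3, light CDR3) pair across cells.

    cells: iterable of (barcode, heavy_cdr3, light_cdr3) tuples.
    keep_single_chain: if False, cells missing either chain are dropped
    before counting; if True, they form their own clonotypes.
    """
    counts = Counter()
    for _barcode, heavy, light in cells:
        if not keep_single_chain and MISSING in (heavy, light):
            continue
        counts[(heavy, light)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {clone: n / total for clone, n in counts.items()}


# Made-up cells: two share a clonotype, one is unique, one lacks a heavy chain.
cells = [
    ("AAAC", "CARDYW", "CQQYNSYF"),
    ("AAAG", "CARDYW", "CQQYNSYF"),
    ("AACT", "CARGGW", "CQQSYSTF"),
    ("AAGT", "*", "CQQYNSYF"),
]
paired = paired_clonotype_frequency(cells)
with_single = paired_clonotype_frequency(cells, keep_single_chain=True)
```

Here each cell counts once, so frequencies are fractions of cells, not of reads. Whether to weight by reads, or to match on V/J genes as well as CDR3, is another of the variations mentioned above.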

Perfect. Thanks for clarifying.

Hi, reopening this to ask another question. Is it possible for a given cell to have both heavy and light chain calls?

Definitely. In the ideal case, a cell should have both chains called. Those should be in the barcode_report file.

Ahhh. This is because in the barcode_report file each row does not necessarily refer to a unique cell barcode, right? So for a cell that has both heavy and light chain calls there will be separate rows (with slightly different sequence_ids) for the heavy call and the light call?

Each row should be for a unique cell barcode. The heavy chain will be in the chain1 column, and the light chain will be in the chain2 column.

Oooo got it. I was looking at the barcode_airr.tsv, not the barcode_report.tsv. Thanks for clarifying and for the timely responses!