liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

Differences in counts compared to MIXCR results, and out-of-frame CDR3 handling

dcarbajo opened this issue

Hello again! This is a follow-up to issue #247, thank you so much for your insights there!

So I managed to run TRUST4 on my SMARTer data with the following command (which still took really too long, like half a day per sample):

run-trust4 --barcodeLevel molecule \
                   -f path_to/hg38_bcrtcr.fa \
                   --ref path_to/human_IMGT+C.fa \
                   -1 path_to/sample_fq1 \
                   -2 path_to/sample_fq2 \
                   --barcode path_to/sample_fq2 \
                   --readFormat bc:0:11,r2:20:-1,r1:28:-1 \
                   --repseq -o sample_name --od sample_output_dir -t 8

But now that I compare one sample results with the ones previously obtained with MIXCR for that sample, I observe some discrepancies I was hoping you could help me understand.

At a glance, what strikes me most is the number of clonotype entries in the MIXCR report compared to the TRUST4 one. While the MIXCR file has 4320 lines, the TRUST4 one has 82024; filtering down to TRA entries only brings that to 27637 lines (6423 after dropping singleton clonotypes with count=1, so still over 2000 more clonotypes than MIXCR).
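For anyone repeating this tally, here is a minimal sketch of the TRA and non-singleton counts. It assumes the TRUST4 report is tab-separated with the columns count, frequency, CDR3nt, CDR3aa, V, D, J, C, cid, cid_full_length and a header line starting with "#". The function name `count_clonotypes` is hypothetical and not part of TRUST4.

```python
def count_clonotypes(report_lines, chain_prefix="TRA"):
    """Return (rows for the chain, rows for the chain with count > 1).

    Assumed column layout (0-based): 0=count, 4=V gene, 6=J gene.
    """
    n_chain = 0
    n_non_singleton = 0
    for line in report_lines:
        if line.startswith("#"):
            continue  # skip the header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or truncated rows
        count, v_gene, j_gene = float(fields[0]), fields[4], fields[6]
        # A row belongs to the chain if either the V or the J gene matches.
        if v_gene.startswith(chain_prefix) or j_gene.startswith(chain_prefix):
            n_chain += 1
            if count > 1:
                n_non_singleton += 1
    return n_chain, n_non_singleton
```

Passing an open file handle for the report (for example `open("sample_name_report.tsv")`) as `report_lines` would give the two line counts quoted above, provided the column assumption holds.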

Then the counts seem quite different; see for example the top clonotype (TRAV-21 / TRAJ31) with a count of 361655 in MIXCR:

Screenshot 2024-02-20 at 13 26 14

The same clonotype in the TRUST4 report, although still at the top, has a count of just 7290, two orders of magnitude less:

Screenshot 2024-02-20 at 13 28 00

So there are a lot more clonotypes in the TRUST4 report compared to the MIXCR one, but I wanted to see which clonotypes found by MIXCR were not recovered by TRUST4.

What I observed is that most of these cases contain a CDR3 sequence with gap/s in MIXCR, which might be due to an out-of-frame CDR3. All these cases are one line in the MIXCR output, but several lines in the TRUST4 one...

I extracted the "V" and "J" from these clonotypes with gaps in MIXCR, and subsetted both outputs for a few examples. Check the example below:

while the subset is just one line in the MIXCR output:

Screenshot 2024-02-20 at 13 35 50

it becomes several different lines in the TRUST4 output:

Screenshot 2024-02-20 at 13 36 09

Strangely, all these CDR3 sequences are quite different, and there are some that aren't real ones (like FEASIRDENIIF above) which concerns me a bit. Most of these entries belong to singleton clonotypes, but not all (the top 4 lines have count>1).

I was wondering how to interpret this, and whether there is some aggregation or filtering that I should do downstream of TRUST4 to make the results more comprehensible (and comparable to the results I previously obtained with MIXCR).

Many thanks again!

When a barcode (more accurately, a UMI here) is provided, the count is the number of barcodes containing this clonotype. Therefore, the count column is more likely to correspond to the "uniqueMoleculeIdentifier" column. I think MiXCR may have some filters for UMIs with too few reads, which may be due to sequencing errors in the UMI or other sequencing artifacts (may need to double-check their documentation). The current TRUST4 does not have such filters, so I think it is expected to see more molecules in TRUST4 than in MiXCR.

If you want to impose a filter, I think you can first remove the barcodes with too few reads from the ${prefix}_barcode_report.tsv file to create another tsv file, e.g. filtered.tsv, using the column that holds the comma-separated fields:

V_gene,D_gene,J_gene,C_gene,cdr3_nt,cdr3_aa,read_cnt,consensus_id,CDR3_germline_similarity,consensus_full_length

where read_cnt is the number of reads supporting the CDR3.

Then you can run perl $WD/trust-simplerep.pl ${prefix}_cdr3.out --barcodeCnt --filterBarcoderep filtered.tsv > ${prefix}_report.tsv to create the report file summarized from the filtered barcodes.
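The step that produces filtered.tsv could be sketched as follows. This is only an illustration under stated assumptions: the barcode report is taken to be tab-separated, with the chain entries in the third and fourth columns, "*" for a missing chain, and read_cnt as the seventh comma-separated field of each chain entry. The function names are hypothetical. Rows are written back unchanged with plain string handling, so no quoting is introduced.

```python
READ_CNT_FIELD = 6  # assumed 0-based position of read_cnt within a chain entry


def filter_barcode_report(lines, min_reads=2, chain_columns=(2, 3)):
    """Yield header lines and rows where some chain has read_cnt >= min_reads."""
    for line in lines:
        row = line.rstrip("\n")
        if row.startswith("#"):
            yield row  # keep the header untouched
            continue
        cols = row.split("\t")
        best = 0.0
        for idx in chain_columns:
            if idx < len(cols) and cols[idx] != "*":
                fields = cols[idx].split(",")
                if len(fields) > READ_CNT_FIELD:
                    best = max(best, float(fields[READ_CNT_FIELD]))
        if best >= min_reads:
            yield row  # row is emitted exactly as read, without added quotes


def write_filtered(src_path, dst_path, min_reads=2):
    """Read a barcode report from src_path and write the filtered rows to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for row in filter_barcode_report(src, min_reads=min_reads):
            dst.write(row + "\n")
```

For example, `write_filtered("sample_name_barcode_report.tsv", "filtered.tsv")` would keep only barcodes with at least 2 supporting reads on one chain, assuming the column layout above matches the actual file.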

I did not add such filters because the parameter settings were designed for scRNA-seq gene expression data, where many cells have very few reads coming from the VDJ region. So filtering those cells might be too aggressive.
Hope this helps.

By the way, I have added the option "--clean" to the run-trust4 wrapper in the github repo to clean up the intermediate files

Thanks! On it, I'll let you know the outcome asap. Cheers!

Hello again, sorry for the delay in replying, I only got around to this now.

So I filtered down my barcode report file to only barcodes with a read count >1.

With the resulting file, I ran trust-simplerep.pl. The first thing I noticed is the endless warnings I get... I haven't tried to debug the script, but I hope they aren't a problem; did you experience problems with this script before? See a screenshot below:
Screenshot 2024-02-21 at 15 50 22

After this, the counts in my final report file are indeed lower now, but still far from both MIXCR's readCount and uniqueMoleculeCount... see below:

MIXCR:
Screenshot 2024-02-29 at 13 46 23

TRUST4:
Screenshot 2024-02-29 at 13 46 44

However, the uniqueMoleculeFraction seems more comparable to frequency.

As a side question, I still don't really understand the last cid_full_length field... what does a CID length of 0 mean? I seem to always get 0.

I then looked at those MIXCR clones that aren't retrieved by TRUST4, like I did above, namely those with a CDR3 containing gaps. This time around, the problem of finding multiple lines for those clones in the TRUST4 report (many with spurious CDR3s) seems almost solved. However, I still find that those entries with gapped CDR3s are not found by TRUST4...

See a couple examples (including the same as the one above).

Example 1 MIXCR:
Screenshot 2024-02-29 at 15 58 44

Example 1 TRUST4 (it doesn't find anything close to CAASKAFE_SSASKIIF, and the counts are still very different, although the frequencies are comparable):
Screenshot 2024-02-29 at 15 58 54

Example 2 MIXCR:
Screenshot 2024-02-29 at 16 02 24

Example 2 TRUST4 (it doesn't find anything close to CAVRAR_SRLMF or CAVSPE*_NARLMF, and here the frequencies are a bit more different):
Screenshot 2024-02-29 at 16 02 36

All in all, how can I compare the MIXCR and TRUST4 results, and know that TRUST4 is indeed producing results similar/identical to MIXCR's? Should I disregard the counts and just look at the identities of the clones retrieved?

Looking forward to using TRUST4 routinely moving forward, many thanks again!

A difference in absolute UMI counts between TRUST4 and MiXCR might be expected, as the filtering strategies could be different. I think the frequency/fraction is more meaningful, as diversity calculation is usually based on the normalized values.
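To illustrate why normalized values are easier to compare than raw counts, here is a small sketch with hypothetical toy numbers (not taken from the reports in this thread). The helper names are made up for the example.

```python
def to_fractions(counts):
    """Convert {clonotype: count} into {clonotype: fraction of the total}."""
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: value / total for key, value in counts.items()}


def compare_fractions(counts_a, counts_b):
    """Return {clonotype: (fraction_a, fraction_b)} for clonotypes found in both."""
    frac_a, frac_b = to_fractions(counts_a), to_fractions(counts_b)
    shared = sorted(set(frac_a) & set(frac_b))
    return {key: (frac_a[key], frac_b[key]) for key in shared}


# Hypothetical counts: the absolute values differ 50-fold between the tools,
# but the relative abundance of each clonotype is the same.
tool_a = {"CLONE_A": 900, "CLONE_B": 100}
tool_b = {"CLONE_A": 18, "CLONE_B": 2}
for clone, (fa, fb) in compare_fractions(tool_a, tool_b).items():
    print(clone, round(fa, 3), round(fb, 3))
```

In this toy case both tools report the same fractions even though the raw counts are far apart, which is the situation where comparing frequency to uniqueMoleculeFraction makes more sense than comparing counts.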

The cid_full_length field indicates whether the underlying contig is full length or not. So 0 means the corresponding contig is not full length (not from 5'V to 3'J). Your observation of almost all 0s could be due to the behavior of the --repseq option. Since TCR analysis does not need full-length assembly and VJ gene assignment is sufficient, the --repseq option aggressively discards many reads. This behavior may be changed in the next release (#241).

For the CDR3s with gaps, do you find their corresponding CDR3 nucleotide sequence in the _cdr3.out file?
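One way to run this check is sketched below. It assumes only that the _cdr3.out file is whitespace-separated with the CDR3 nucleotide sequence as one of the fields; the function name is hypothetical.

```python
def count_cdr3_hits(lines, cdr3_nt):
    """Return (exact, partial) line counts for a CDR3 nucleotide sequence.

    exact:   lines where some field equals the sequence
    partial: lines where the sequence only occurs inside a longer field
    """
    target = cdr3_nt.upper()
    exact = 0
    partial = 0
    for line in lines:
        fields = [field.upper() for field in line.split()]
        if target in fields:
            exact += 1
        elif any(target in field for field in fields):
            partial += 1
    return exact, partial
```

Calling it with an open handle on the _cdr3.out file and the nucleotide sequence copied from the MiXCR report returns how many lines carry that sequence exactly and how many only contain it as a substring.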

Could you please also share with me your filtered barcode_report file so I can look into the issue of the trust-simplerep?

Thank you!

So for the CDR3s with gaps, the associated nucleotide sequences in the MIXCR report indeed appear in the cdr3.out file.

See for example the CAASKAFE_SSASKIIF CDR3 above. This is the nucleotide sequence from the MIXCR report:
Screenshot 2024-03-01 at 12 46 46

And I can indeed see it multiple times in the cdr3.out file:
Screenshot 2024-03-01 at 12 47 12

Let me send you my cdr3.out and filtered barcode report (and final filtered report) to your mail @dartmouth (hope that is ok). Many thanks again!

Thank you for sharing the files. TRUST4 by default will suppress the out_of_frame cdr3s, as this might create false positive T or B cells in single-cell data. I have modified the trust-barcoderep.pl to keep those entries for the case of UMI-based TCR-seq data.
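As a rough illustration of what "out of frame" means here, a CDR3 nucleotide sequence can be classified by its length and codons. This sketch is not TRUST4's actual logic. It assumes the sequence starts at the first base of the CDR3 and uses the standard stop codons. In MiXCR's amino-acid notation, "_" marks a frameshift and "*" a stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_cdr3(cdr3_nt):
    """Classify a CDR3 nucleotide sequence as out_of_frame, stop_codon or in_frame."""
    seq = cdr3_nt.upper()
    if len(seq) % 3 != 0:
        return "out_of_frame"  # length is not a whole number of codons
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if any(codon in STOP_CODONS for codon in codons):
        return "stop_codon"  # in frame, but interrupted by a stop codon
    return "in_frame"
```

Applied to the cdr3_nt values of a report, this would show which entries a frame filter is likely to suppress, under the assumptions above.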

For the filtered barcode file, it seems quotes were added to the fields, possibly by the csv export function. I have added an option "--filterBarcoderepReadCnt" in trust-simplerep.pl to filter out barcodes/UMIs with read support below the specified value. So you can directly obtain the filtered UMI count with:

perl $WD/trust-simplerep.pl ${prefix}_cdr3.out --barcodeCnt --filterBarcoderep original_barcode_report.tsv --filterBarcoderepReadCnt 2 > ${prefix}_filtered_report.tsv

to exclude UMIs with fewer than 2 supporting CDR3 reads.

Please let me know how these work.

Hi @dcarbajo , thank you for the detailed exploration of TRUST4's output. I'm planning to release a new version to incorporate the recent updates. Have you found any other issues? I can try to look into them before creating the new version. Thank you!

Thanks for your help! I think so far it all works well on my side. Looking forward to the new version!