liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

UMI count question

lishuangshuang0616 opened this issue

Thank you for your software.
When analyzing single-cell 10x data, the output does not include UMI counts even though we provide UMI data.
How does TRUST4 use UMI data during the assembly process? In barcode_report.tsv, there is only "read_fragment_count".
Does it represent the number of reads used to assemble each sequence?

If you provide the UMIs for single-cell data, the read count is the number of UMIs supporting the corresponding CDR3.

I would like to know what kind of relationship exists between them, because in the cdr3.out file, I noticed that the sum of the read_fragment_count for all consensus IDs of a cell is much larger than the number of UMIs. For example, for a certain cell, the sum of read_fragment_count is 933, but the numbers of reads and UMIs are 1615 and 48, respectively. So, I am confused about how UMIs are allocated and whether only reads with the same UMIs are used for overlapping assembly.
[screenshot]

The assembly step does not use the UMI information for single-cell data. After the assembly, reads from the same UMI may be mapped to different contigs in the quantification step. As a result, some UMIs will be overcounted when you sum across contigs. But the count for a CDR3 of a specific contig from a cell is the number of unique UMIs that mapped to that CDR3.
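A toy illustration of the counting described above, using made-up read assignments. The contig names, UMI strings, and the `umi_counts_per_contig` helper are hypothetical and are not TRUST4 code. Each contig counts the distinct UMIs among its own reads, so a UMI whose reads land on two contigs is counted once for each, and the per-contig counts can add up to more than the number of distinct UMIs in the cell.

```python
from collections import defaultdict


def umi_counts_per_contig(assignments):
    """Count distinct UMIs per contig.

    assignments: iterable of (contig_id, umi) pairs, one pair per read.
    """
    umis_by_contig = defaultdict(set)
    for contig_id, umi in assignments:
        umis_by_contig[contig_id].add(umi)
    return {contig: len(umis) for contig, umis in umis_by_contig.items()}


# Hypothetical reads from a single cell: (contig, UMI).
reads = [
    ("contig_A", "AAAC"), ("contig_A", "AAAC"), ("contig_A", "GGTT"),
    ("contig_B", "GGTT"), ("contig_B", "CCTA"),
]

counts = umi_counts_per_contig(reads)
distinct_umis = {umi for _, umi in reads}

print(counts)                # per-contig distinct UMI counts
print(sum(counts.values()))  # sum over contigs (GGTT is counted twice)
print(len(distinct_umis))    # distinct UMIs in the cell
```

Here the UMI `GGTT` supports both contigs, so the summed count exceeds the number of distinct UMIs even though each per-contig count is correct.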

For the screenshot, do you mean that the count 461 (red square) is above the 48 (the UMIs for a cell), so the UMI count is wrong here?

Yes, why is the count much higher at this position? Is it an error, possibly stemming from my input mistake? If it's an error, I'll reanalyze it.

What was your running command?

$ fastq-extractor -t ${threads} -f ${coordinate} -o ${outdir}/tcrbcr -1 {R2.reads} --barcode {R1cb} --UMI {R1ub} --barcodeWhitelist {barcodewhitelist} --barcodeTranslate {barcodeTranslate}
$ trust4 -t ${threads} -f ${coordinate} -u ${outdir}/tcrbcr.fq --barcode ${outdir}/tcrbcr_bc.fa --UMI ${outdir}/tcrbcr_umi.fa
$ annotator -f ${coordinate} -a ${outdir}/tcrbcr_final.out -t ${threads} -o ${outdir}/tcrbcr --barcode --UMI  --readAssignment ${outdir}/tcrbcr_assign.out -r ${outdir}/tcrbcr/tcrbcr_assembled_reads.fa --airrAlignment > ${outdir}/tcrbcr_annot.fa

I split read1's cell barcode and unique molecular identifier (UMI) into two separate fastq files, and then I needed to convert the cell IDs, so I used barcodeTranslate. This shouldn't affect anything, right?

This should be fine. One minor thing is that the "-1" for fastq-extractor should be "-u". Otherwise, it will treat this as a paired-end data set and throw an error about an unequal number of reads.

How did you calculate that there were 48 UMIs for this cell?

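from collections import defaultdict

import pandas as pd
import pysam


def cell_summary(barcode_fa, umi_fa, report):
    """Write per-cell read counts and distinct-UMI counts to a CSV report."""
    read_count_dict = defaultdict(int)  # barcode -> number of reads
    umi_dict = defaultdict(set)         # barcode -> distinct UMI sequences
    # Assumes both files list the same reads in the same order.
    with pysam.FastxFile(barcode_fa) as f1, \
            pysam.FastxFile(umi_fa) as f2:
        for read_1, read_2 in zip(f1, f2):
            cb = read_1.sequence
            umi = read_2.sequence
            read_count_dict[cb] += 1
            umi_dict[cb].add(umi)

    barcode_list = list(read_count_dict.keys())
    df_count = pd.DataFrame({'cell': barcode_list,
                             'read_count': [read_count_dict[i] for i in barcode_list],
                             'UMI': [len(umi_dict[i]) for i in barcode_list]})
    # Put the cells with the most distinct UMIs first.
    df_count.sort_values(by='UMI', ascending=False, inplace=True)
    df_count.to_csv(report, sep=',', index=False)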

The script computes these statistics from tcrbcr_bc.fa and tcrbcr_umi.fa.

Yes, it's "-u" in my pipeline. I made a mistake in the issue description.

This looks right to me. I just checked my run with barcode+UMI and a quick peek did not find any discrepancies.

Could you please run
"grep barcode:XXXX tcrbcr_assembled_reads.fa | cut -f6 -d' ' | sort | uniq | wc -l", where XXXX is the barcode sequence of the cell with the read count of 461? This command returns the number of distinct UMIs found by TRUST4 in the assembled reads.

$grep '' tcrbcr_assembled_reads.fa |cut -d' ' -f6 |sort | uniq |wc -l
48
$grep '' tcrbcr_assembled_reads.fa |cut -d' ' -f6 |wc -l
1609

The numbers look similar to the statistics from my script.
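For anyone who prefers to do the same check in Python, here is a minimal sketch of the grep pipeline discussed above. It assumes assembled-read headers of the form `>read_id ... barcode:<barcode> umi:<id>`, with whitespace-separated fields. The function name `distinct_umis_for_barcode` and the sample headers are hypothetical. Unlike grep, which matches substrings, this version requires the barcode tag to match exactly.

```python
def distinct_umis_for_barcode(lines, barcode):
    """Count distinct umi: tags among FASTA headers tagged with the barcode."""
    wanted = f"barcode:{barcode}"
    umis = set()
    for line in lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line.split()
        if wanted not in fields:
            continue  # header belongs to another cell
        for field in fields:
            if field.startswith("umi:"):
                umis.add(field)
    return len(umis)


# Hypothetical headers in the assumed layout.
sample = [
    ">r1 -1 10 20 barcode:CELL1 umi:5", "ACGT",
    ">r2 -1 10 20 barcode:CELL1 umi:5", "ACGT",
    ">r3 -1 10 20 barcode:CELL1 umi:7", "ACGT",
    ">r4 -1 10 20 barcode:CELL2 umi:9", "ACGT",
]
n_cell1 = distinct_umis_for_barcode(sample, "CELL1")
n_cell2 = distinct_umis_for_barcode(sample, "CELL2")
print(n_cell1, n_cell2)

# On a real run, pass an open file handle instead of a list:
# with open("tcrbcr_assembled_reads.fa") as fh:
#     print(distinct_umis_for_barcode(fh, "XXXX"))
```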

There might be a bug in the program then. Could you please share the _assembled_reads.fa and _final.out file with me?

How about just the reads and final.out (6 lines per contig) from the cell where you found the issue? You can either send them by email as an attachment or share a Google Drive/Dropbox/Baidu Wangpan link. Thank you.

OK, what is your email address?

Thank you for sharing the file. I got a reasonable UMI count in the _cdr3.out file based on the files you provided:

CELL8384_N1_0 0 TRBV4-2*01  TRBD2*02  TRBJ2-7*01  TRBC2 CTGGGGCATAACGCT TACAACTTTAAAGAACAG  TGTGCCAGCAGCCCACTGGACGGGGGAGGGGGAAACGAGCAGTACTTC  1.00  11.00 100.00  1
CELL8384_N1_946 0 TRAV5*01  * TRAJ28*01 TRAC  GACAGCTCCTCCACCTAC  ATTTTTTCAAATATGGACATG TGTGCAGAGATGGCACCTGGGGCTGGGAGTTACCAACTCACTTTC 1.00  5.00  100.00  1
CELL8384_N1_12367 0 TRBV4-2*01  TRBD1*01  TRBJ2-7*01  * * * TGTGCCAGCAGCCCACTGGACGGGGGGAGGGGGAAACGAGCAGTACTTC 0.83  1.00  97.14 0
CELL8384_N1_13572 0 TRBV4-2*01,TRBV7-8*03 TRBD2*02  TRBJ2-7*01  * * * TGTGCCAGCAGCCCACTGGACGGGGGAGGGGGAAACGAGCAGTACTTC  1.00  1.00  100.00  0
CELL8384_N1_13641 0 TRBV4-2*01  TRBD1*01  TRBJ2-7*01  * * * TGTGCCAGCAGCCCACTGGACGGGGGAGGGGAAACGAGCAGTACTTC 1.00  2.00  97.06 0
CELL8384_N1_17754 0 TRBV11-2*01,TRBV11-3*01 TRBD1*01  TRBJ2-7*01  TRBC2 * * TGTGCCAGCAGCTTAGACTACAGGTTATATGGGGAGCAGTACTTC 0.83  2.00  100.00  0
CELL8384_N1_17959 0 TRBV4-2*01,TRBV4-3*01 * * * * * TGGTGCCCGGCCCGAAGTACTGCTCGTTTCCCCCTCCCCCGTCCAGTGGGC 0.00  1.00  0.00  0
CELL8384_N1_20036 0 TRAV5*01  * * * * * TCTGCAGAGACAGATGTTTATCCTTTTTATTCAATAGAACAGTGA 0.00  1.00  0.00  0
CELL8384_N1_20872 0 * * TRAJ49*01 TRAC  * * AGGGACACCGGTAACCAGTTCTATTTT 0.00  1.00  0.00  0

Which version of TRUST4 did you use?

TRUST4 v1.0.13-r473.
I'll try the latest version to see if this problem occurs.
Thanks.

Thank you for sharing the larger data set. I think I've found and fixed a bug that could assign a read to another barcode in the contig abundance estimation step. Could you please pull the GitHub repo again and give it a try? This is a pretty serious bug. If the fix works on your data set, I will draft a new release soon.

I tested a larger dataset and obtained results for several cell IDs. Now the UMIs are working properly.
Many thanks for your help, Dr. Li.

$ grep 'CELL1000_N3' tcrbcr_assign.out|head -n 10
E200004414L1C001R03004110063    CELL1000_N3_509
E200004414L1C002R02602089629    CELL1000_N3_509
E200004414L1C003R00300997369    CELL1000_N3_509
E200004414L1C002R03201437208    CELL1000_N3_509
E200004414L1C003R00604427789    CELL1000_N3_509
E200004414L1C002R01201985373    CELL1000_N3_509
E200004414L1C002R00603037193    CELL1000_N3_509
E200004414L1C003R02601430041    CELL1000_N3_509
E200004414L1C002R00904293832    CELL1000_N3_509
E200004414L1C002R01800009866    CELL1000_N3_509
$ grep 'CELL1000_N3' tcrbcr_assembled_reads.fa|head -n 10
>E200004414L1C001R03004110063 -1 58323 66418 barcode:CELL1000_N3 umi:3844
>E200004414L1C002R02602089629 -1 58323 66418 barcode:CELL1000_N3 umi:2320
>E200004414L1C003R00300997369 -1 58323 66418 barcode:CELL1000_N3 umi:3033
>E200004414L1C002R03201437208 -1 56473 66418 barcode:CELL1000_N3 umi:3941
>E200004414L1C003R00604427789 -1 56473 66418 barcode:CELL1000_N3 umi:328
>E200004414L1C002R01201985373 -1 56333 66418 barcode:CELL1000_N3 umi:3370
>E200004414L1C002R00603037193 -1 56231 66418 barcode:CELL1000_N3 umi:6312
>E200004414L1C003R02601430041 -1 55874 66418 barcode:CELL1000_N3 umi:7102
>E200004414L1C002R00904293832 -1 55874 66307 barcode:CELL1000_N3 umi:345
>E200004414L1C002R01800009866 -1 55874 66307 barcode:CELL1000_N3 umi:2299
$ grep -A 1 'E200004414L1C001R03004110063' /tcrbcr_umi.fa
>E200004414L1C001R03004110063
CATAACTCAG
$ grep -A 1 'E200004414L1C002R02602089629' tcrbcr_umi.fa
>E200004414L1C002R02602089629
CATAACTTAG
$ grep -A 1 'E200004414L1C003R00300997369' tcrbcr_umi.fa
>E200004414L1C003R00300997369
CATAACTCAG
$ grep -A 1 'E200004414L1C002R03201437208' tcrbcr_umi.fa
>E200004414L1C002R03201437208
CATAACTCAG

Why do reads with the same UMI sequence end up with different numeric UMI values after assembly?

They should be consistent. Do you see those issues from the cell barcode you shared with me?

I found that the issue was caused by my own mistake. I included 'missing_barcode' during the analysis, which caused the problem. Removing it fixes it.
Thank you.

It's still quite strange. "E200004414L1C001R03004110063" and "E200004414L1C003R00300997369" have the same barcode and UMI, but their converted numeric UMI values are different. Their numeric UMIs should not be affected by the "missing_barcode" issue. Or do the UMIs correspond to other reads?
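The inconsistency described above can be checked mechanically. The sketch below is a hypothetical helper, not part of TRUST4. It assumes assembled-read headers of the form `>read_id ... barcode:<barcode> umi:<id>` and a dictionary mapping read IDs to UMI sequences (for example, loaded from tcrbcr_umi.fa). It reports every (barcode, UMI sequence) pair that was given more than one numeric UMI value. The sample values are the ones quoted in this thread.

```python
from collections import defaultdict


def inconsistent_umis(header_lines, umi_seq_by_read):
    """Return {(barcode, umi_sequence): set of numeric IDs} where a pair has >1 ID."""
    ids = defaultdict(set)
    for line in header_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].split()
        read_id = fields[0]
        # Collect key:value tags such as barcode:... and umi:...
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if read_id not in umi_seq_by_read:
            continue
        if "barcode" not in tags or "umi" not in tags:
            continue
        ids[(tags["barcode"], umi_seq_by_read[read_id])].add(tags["umi"])
    return {key: vals for key, vals in ids.items() if len(vals) > 1}


# Values quoted in this thread.
headers = [
    ">E200004414L1C001R03004110063 -1 58323 66418 barcode:CELL1000_N3 umi:3844",
    ">E200004414L1C002R02602089629 -1 58323 66418 barcode:CELL1000_N3 umi:2320",
    ">E200004414L1C003R00300997369 -1 58323 66418 barcode:CELL1000_N3 umi:3033",
]
umi_seqs = {
    "E200004414L1C001R03004110063": "CATAACTCAG",
    "E200004414L1C002R02602089629": "CATAACTTAG",
    "E200004414L1C003R00300997369": "CATAACTCAG",
}
problems = inconsistent_umis(headers, umi_seqs)
print(problems)
```

An empty result means every UMI sequence within a cell maps to a single numeric value, which is the expected behavior.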

I think I've found the issue. Could you please pull the updated GitHub repo and give it a try? Please let me know whether it works when "missing_barcode" entries are present in the data. Thank you again for scrutinizing TRUST4's results.

After using the new repo, the results match those obtained after removing the missing_barcode.
The information in several files also corresponds correctly.
Thank you very much for your help.