UMI count question

lishuangshuang0616 opened this issue 4 months ago

Thank you for your software.
When analyzing single-cell 10x data, although we provide UMI data, the resulting output does not include UMI counts.
How does trust4 utilize UMI data during the assembly process? In barcode_report.tsv, there is only "read_fragment_count."
Does it represent the number of reads used to assemble each sequence?

Li Song · Answer 1 · Fri Mar 29 2024 02:53:21 GMT+0800 (China Standard Time)

If you provide UMIs for single-cell data, the reported read count is the number of UMIs supporting the corresponding CDR3.

Shuangshuang Li · Answer 2 · Fri Mar 29 2024 09:01:01 GMT+0800 (China Standard Time)

I would like to know what kind of relationship exists between them, because in the cdr3.out file, I noticed that the sum of the read_fragment_count for all consensus IDs of a cell is much larger than the number of UMIs. For example, for a certain cell, the sum of read_fragment_count is 933, but the numbers of reads and UMIs are 1615 and 48, respectively. So, I am confused about how UMIs are allocated and whether only reads with the same UMIs are used for overlapping assembly.

Li Song · Answer 3 · Fri Mar 29 2024 09:48:38 GMT+0800 (China Standard Time)

The assembly step does not use the UMI information for single-cell data. After the assembly, reads from the same UMI may be mapped to different contigs in the quantification step, so some UMIs will be overcounted when counts are summed across contigs. But the count for a CDR3 of a specific contig from a cell is the number of unique UMIs that mapped to that CDR3.
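To illustrate the counting rule described above, here is a minimal Python sketch. It is not TRUST4's implementation; the function name `umi_counts_per_contig` and the toy UMI/contig values are hypothetical. It only shows why summing per-contig UMI counts can exceed the number of distinct UMIs in a cell.

```python
from collections import defaultdict


def umi_counts_per_contig(assignments):
    """Count distinct UMIs per contig from (umi, contig_id) pairs of one cell."""
    umis_by_contig = defaultdict(set)
    for umi, contig in assignments:
        umis_by_contig[contig].add(umi)
    return {contig: len(umis) for contig, umis in umis_by_contig.items()}


# Hypothetical toy data: UMI "AAA" has reads assigned to two different contigs.
assignments = [
    ("AAA", "contig_0"), ("AAA", "contig_0"),
    ("AAA", "contig_1"),
    ("CCC", "contig_0"),
    ("GGG", "contig_1"),
]

per_contig = umi_counts_per_contig(assignments)
summed = sum(per_contig.values())                     # counts "AAA" once per contig
distinct_umis = len({umi for umi, _ in assignments})  # counts "AAA" once overall
print(per_contig, summed, distinct_umis)
```

In this toy case the per-contig counts are each correct, yet their sum is larger than the number of distinct UMIs because the shared UMI is counted once in each contig.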

For the screenshot, do you mean that the count 461 (red square) is above the 48 (the UMIs for a cell), so the UMI count is wrong here?

Shuangshuang Li · Answer 4 · Fri Mar 29 2024 10:06:47 GMT+0800 (China Standard Time)

Yes, why is the count much higher at this position? Is it an error, possibly stemming from my input mistake? If it's an error, I'll reanalyze it.

Li Song · Answer 5 · Fri Mar 29 2024 10:07:48 GMT+0800 (China Standard Time)

What was your running command?

Shuangshuang Li · Answer 6 · Fri Mar 29 2024 10:23:02 GMT+0800 (China Standard Time)

$ fastq-extractor -t ${threads} -f ${coordinate} -o ${outdir}/tcrbcr -1 {R2.reads} --barcode {R1cb} --UMI {R1ub} --barcodeWhitelist {barcodewhitelist} --barcodeTranslate {barcodeTranslate}
$ trust4 -t ${threads} -f ${coordinate} -u ${outdir}/tcrbcr.fq --barcode ${outdir}/tcrbcr_bc.fa --UMI ${outdir}/tcrbcr_umi.fa
$ annotator -f ${coordinate} -a ${outdir}/tcrbcr_final.out -t ${threads} -o ${outdir}/tcrbcr --barcode --UMI  --readAssignment ${outdir}/tcrbcr_assign.out -r ${outdir}/tcrbcr/tcrbcr_assembled_reads.fa --airrAlignment > ${outdir}/tcrbcr_annot.fa

I split read1's cell barcode and unique molecular identifier (UMI) into two separate fastq files, and then I needed to convert the cell IDs, so I used barcodeTranslate. This shouldn't affect anything, right?

Li Song · Answer 7 · Fri Mar 29 2024 10:32:48 GMT+0800 (China Standard Time)

This should be fine. One minor thing is that the "-1" for fastq-extractor should be "-u". Otherwise, it will treat the input as a paired-end data set and throw an error about an unequal number of reads.

How did you calculate that there were 48 UMIs for this cell?

Shuangshuang Li · Answer 8 · Fri Mar 29 2024 10:47:08 GMT+0800 (China Standard Time)

from collections import defaultdict

import pandas as pd
import pysam


def cell_summary(barcode_fa, umi_fa, report):
    """Write per-cell read counts and distinct-UMI counts to a CSV report."""
    read_count_dict = defaultdict(int)
    umi_dict = defaultdict(set)
    # Assumes both files list the same reads in the same order.
    with pysam.FastxFile(barcode_fa) as f1, \
            pysam.FastxFile(umi_fa) as f2:
        for read_1, read_2 in zip(f1, f2):
            cb = read_1.sequence    # cell barcode of this read
            umi = read_2.sequence   # UMI of the same read
            read_count_dict[cb] += 1
            umi_dict[cb].add(umi)

    barcode_list = list(read_count_dict.keys())
    df_count = pd.DataFrame({'cell': barcode_list,
                             'read_count': [read_count_dict[i] for i in barcode_list],
                             'UMI': [len(umi_dict[i]) for i in barcode_list]})
    # Cells with the most distinct UMIs come first.
    df_count.sort_values(by='UMI', ascending=False, inplace=True)
    df_count.to_csv(report, sep=',', index=False)

This script computes the statistics from tcrbcr_bc.fa and tcrbcr_umi.fa.

Shuangshuang Li · Answer 9 · Fri Mar 29 2024 10:53:09 GMT+0800 (China Standard Time)

Yes, it is "-u" in my pipeline; the "-1" was a mistake in my issue description.

Li Song · Answer 10 · Fri Mar 29 2024 11:31:58 GMT+0800 (China Standard Time)

The script looks right to me. I just checked my own run with barcode+UMI, and a quick look did not find any discrepancies.

Could you please run
"grep barcode:XXXX tcrbcr_assembled_reads.fa | cut -f6 -d' ' | sort | uniq | wc -l", where XXXX is the barcode sequence of the cell with the read count of 461? This command returns the number of distinct UMIs found by TRUST4 in the assembled reads.

Shuangshuang Li · Answer 11 · Fri Mar 29 2024 12:41:05 GMT+0800 (China Standard Time)

$grep '' tcrbcr_assembled_reads.fa |cut -d' ' -f6 |sort | uniq |wc -l
48
$grep '' tcrbcr_assembled_reads.fa |cut -d' ' -f6 |wc -l
1609

These numbers are close to the statistics from my script.
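For readers without a shell at hand, the grep/cut/sort/uniq/wc check used above can be approximated in Python. This is only a sketch: the function name `count_reads_and_umis` is hypothetical, and it assumes the header layout visible later in this thread, where each read header carries a `barcode:` field and a `umi:` field separated by spaces. The sample headers are copied from the thread.

```python
def count_reads_and_umis(header_lines, barcode):
    """Return (read count, distinct UMI count) for one cell barcode."""
    tag = f"barcode:{barcode}"
    n_reads = 0
    umis = set()
    for line in header_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line.split()
        if tag not in fields:
            continue  # exact field match, stricter than grep's substring match
        n_reads += 1
        umis.update(f for f in fields if f.startswith("umi:"))
    return n_reads, len(umis)


# Headers as they appear in tcrbcr_assembled_reads.fa in this thread.
headers = [
    ">E200004414L1C001R03004110063 -1 58323 66418 barcode:CELL1000_N3 umi:3844",
    ">E200004414L1C002R02602089629 -1 58323 66418 barcode:CELL1000_N3 umi:2320",
    ">E200004414L1C003R00300997369 -1 58323 66418 barcode:CELL1000_N3 umi:3033",
    ">E200004414L1C002R03201437208 -1 56473 66418 barcode:CELL1000_N3 umi:3941",
]
print(count_reads_and_umis(headers, "CELL1000_N3"))
```

To run it on a real file, pass an open file handle instead of the list, for example `with open("tcrbcr_assembled_reads.fa") as fh: count_reads_and_umis(fh, "CELL1000_N3")`.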

Li Song · Answer 12 · Fri Mar 29 2024 12:45:45 GMT+0800 (China Standard Time)

There might be a bug in the program then. Could you please share the _assembled_reads.fa and _final.out file with me?

Li Song · Answer 13 · Fri Mar 29 2024 13:00:43 GMT+0800 (China Standard Time)

How about just the reads and final.out (6 lines per contig) from the cell that you found had the issue? You can either send them as an email attachment or share a Google Drive/Dropbox/Baidu Wangpan link. Thank you.

Shuangshuang Li · Answer 14 · Fri Mar 29 2024 13:02:48 GMT+0800 (China Standard Time)

OK, what is your email address?

Li Song · Answer 15 · Fri Mar 29 2024 13:03:21 GMT+0800 (China Standard Time)

Li.Song@dartmouth.edu

Li Song · Answer 16 · Mon Apr 01 2024 01:39:25 GMT+0800 (China Standard Time)

Thank you for sharing the file. I got a reasonable UMI count in the _cdr3.out file based on the files you provided:

CELL8384_N1_0 0 TRBV4-2*01  TRBD2*02  TRBJ2-7*01  TRBC2 CTGGGGCATAACGCT TACAACTTTAAAGAACAG  TGTGCCAGCAGCCCACTGGACGGGGGAGGGGGAAACGAGCAGTACTTC  1.00  11.00 100.00  1
CELL8384_N1_946 0 TRAV5*01  * TRAJ28*01 TRAC  GACAGCTCCTCCACCTAC  ATTTTTTCAAATATGGACATG TGTGCAGAGATGGCACCTGGGGCTGGGAGTTACCAACTCACTTTC 1.00  5.00  100.00  1
CELL8384_N1_12367 0 TRBV4-2*01  TRBD1*01  TRBJ2-7*01  * * * TGTGCCAGCAGCCCACTGGACGGGGGGAGGGGGAAACGAGCAGTACTTC 0.83  1.00  97.14 0
CELL8384_N1_13572 0 TRBV4-2*01,TRBV7-8*03 TRBD2*02  TRBJ2-7*01  * * * TGTGCCAGCAGCCCACTGGACGGGGGAGGGGGAAACGAGCAGTACTTC  1.00  1.00  100.00  0
CELL8384_N1_13641 0 TRBV4-2*01  TRBD1*01  TRBJ2-7*01  * * * TGTGCCAGCAGCCCACTGGACGGGGGAGGGGAAACGAGCAGTACTTC 1.00  2.00  97.06 0
CELL8384_N1_17754 0 TRBV11-2*01,TRBV11-3*01 TRBD1*01  TRBJ2-7*01  TRBC2 * * TGTGCCAGCAGCTTAGACTACAGGTTATATGGGGAGCAGTACTTC 0.83  2.00  100.00  0
CELL8384_N1_17959 0 TRBV4-2*01,TRBV4-3*01 * * * * * TGGTGCCCGGCCCGAAGTACTGCTCGTTTCCCCCTCCCCCGTCCAGTGGGC 0.00  1.00  0.00  0
CELL8384_N1_20036 0 TRAV5*01  * * * * * TCTGCAGAGACAGATGTTTATCCTTTTTATTCAATAGAACAGTGA 0.00  1.00  0.00  0
CELL8384_N1_20872 0 * * TRAJ49*01 TRAC  * * AGGGACACCGGTAACCAGTTCTATTTT 0.00  1.00  0.00  0

Which version of TRUST4 did you use?

Shuangshuang Li · Answer 17 · Mon Apr 01 2024 08:39:17 GMT+0800 (China Standard Time)

TRUST4 v1.0.13-r473.
I'll try the latest version to see if this problem occurs.
Thanks.

Li Song · Answer 18 · Mon Apr 01 2024 13:25:33 GMT+0800 (China Standard Time)

Thank you for sharing the larger data set. I think I've found and fixed the bug, which could assign a read to another barcode in the contig abundance estimation step. Could you please pull the GitHub repo again and give it a try? This is a pretty serious bug. If the fix works on your data set, I will draft a new release soon.

Shuangshuang Li · Answer 19 · Mon Apr 01 2024 14:02:37 GMT+0800 (China Standard Time)

I tested a larger dataset and obtained results for several cell IDs. Now the UMIs are working properly.
Many thanks for your help, Dr. Li.

Shuangshuang Li · Answer 20 · Mon Apr 01 2024 15:00:19 GMT+0800 (China Standard Time)

grep 'CELL1000_N3' tcrbcr_assign.out|head -n 10
E200004414L1C001R03004110063    CELL1000_N3_509
E200004414L1C002R02602089629    CELL1000_N3_509
E200004414L1C003R00300997369    CELL1000_N3_509
E200004414L1C002R03201437208    CELL1000_N3_509
E200004414L1C003R00604427789    CELL1000_N3_509
E200004414L1C002R01201985373    CELL1000_N3_509
E200004414L1C002R00603037193    CELL1000_N3_509
E200004414L1C003R02601430041    CELL1000_N3_509
E200004414L1C002R00904293832    CELL1000_N3_509
E200004414L1C002R01800009866    CELL1000_N3_509

grep 'CELL1000_N3' tcrbcr_assembled_reads.fa|head -n 10
>E200004414L1C001R03004110063 -1 58323 66418 barcode:CELL1000_N3 umi:3844
>E200004414L1C002R02602089629 -1 58323 66418 barcode:CELL1000_N3 umi:2320
>E200004414L1C003R00300997369 -1 58323 66418 barcode:CELL1000_N3 umi:3033
>E200004414L1C002R03201437208 -1 56473 66418 barcode:CELL1000_N3 umi:3941
>E200004414L1C003R00604427789 -1 56473 66418 barcode:CELL1000_N3 umi:328
>E200004414L1C002R01201985373 -1 56333 66418 barcode:CELL1000_N3 umi:3370
>E200004414L1C002R00603037193 -1 56231 66418 barcode:CELL1000_N3 umi:6312
>E200004414L1C003R02601430041 -1 55874 66418 barcode:CELL1000_N3 umi:7102
>E200004414L1C002R00904293832 -1 55874 66307 barcode:CELL1000_N3 umi:345
>E200004414L1C002R01800009866 -1 55874 66307 barcode:CELL1000_N3 umi:2299

grep -A 1 'E200004414L1C001R03004110063' /tcrbcr_umi.fa
>E200004414L1C001R03004110063
CATAACTCAG
grep -A 1 'E200004414L1C002R02602089629' tcrbcr_umi.fa
>E200004414L1C002R02602089629
CATAACTTAG
grep -A 1 'E200004414L1C003R00300997369' tcrbcr_umi.fa
>E200004414L1C003R00300997369
CATAACTCAG
grep -A 1 'E200004414L1C002R03201437208' tcrbcr_umi.fa
>E200004414L1C002R03201437208
CATAACTCAG

Why do reads with the same UMI sequence get different numeric UMI values after assembly?

Li Song · Answer 21 · Mon Apr 01 2024 22:36:55 GMT+0800 (China Standard Time)

They should be consistent. Do you see those issues from the cell barcode you shared with me?

Shuangshuang Li · Answer 22 · Mon Apr 01 2024 23:28:54 GMT+0800 (China Standard Time)

I found the issue caused by my own mistake. I included 'missing_barcode' during the analysis, which caused the problem. Removing it will be OK.
Thank you.

Li Song · Answer 23 · Tue Apr 02 2024 03:57:03 GMT+0800 (China Standard Time)

It's still quite strange. "E200004414L1C001R03004110063" and "E200004414L1C003R00300997369" have the same barcode and UMI, but their converted numeric UMI values are different. Their numeric UMIs should not be affected by the "missing_barcode" issue. Or do the UMIs correspond to other reads?
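The inconsistency described here can be checked mechanically. The sketch below is not part of TRUST4 and the helper name `inconsistent_umis` is hypothetical; it assumes that reads sharing a barcode and a UMI sequence are expected to share one numeric UMI, as stated above. The records combine the UMI sequences from tcrbcr_umi.fa with the numeric values from tcrbcr_assembled_reads.fa shown earlier in this thread.

```python
from collections import defaultdict


def inconsistent_umis(records):
    """Find (barcode, UMI sequence) pairs that map to more than one numeric UMI.

    records: iterable of (barcode, umi_sequence, numeric_umi) tuples.
    """
    ids = defaultdict(set)
    for barcode, umi_seq, numeric_umi in records:
        ids[(barcode, umi_seq)].add(numeric_umi)
    return {key: sorted(vals) for key, vals in ids.items() if len(vals) > 1}


# Values taken from the four reads posted in this thread.
records = [
    ("CELL1000_N3", "CATAACTCAG", 3844),  # E200004414L1C001R03004110063
    ("CELL1000_N3", "CATAACTTAG", 2320),  # E200004414L1C002R02602089629
    ("CELL1000_N3", "CATAACTCAG", 3033),  # E200004414L1C003R00300997369
    ("CELL1000_N3", "CATAACTCAG", 3941),  # E200004414L1C002R03201437208
]
print(inconsistent_umis(records))
```

An empty result would mean every barcode/UMI-sequence pair has a single numeric UMI; any entry returned points to reads worth inspecting.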

Li Song · Answer 24 · Tue Apr 02 2024 09:30:55 GMT+0800 (China Standard Time)

I think I've found the issue. Could you please pull the updated GitHub repo and give it a try? Please let me know whether it works when "missing_barcode" entries are present in the data. Thank you again for scrutinizing TRUST4's results.

Shuangshuang Li · Answer 25 · Tue Apr 02 2024 12:37:09 GMT+0800 (China Standard Time)

After using the new repo, the results match those obtained after removing the missing_barcode.
The information in several files also corresponds correctly.
Thank you very much for your help.