AllonKleinLab / paper-data

Barcodes missing in raw data

william0701 opened this issue · comments

Hi,

I'm sorry to interrupt you here. This is fantastic work on cell differentiation, and we would like to use your data for further study.

We downloaded the raw FASTQ data for GSM4185642 from the GEO database, but we found that the reads from most runs do not carry barcode information in their seqname; only SRR10510898 and SRR10510899 do. The first 10 lines of some runs are shown below.

SRR10510898.fastq.gz

@SRR10510898.1 GGACTGGA-CAGTTTGC:GACCTC:NS500422_606_HJHGKBGX5_1_11101_15231_2223 length=59
CCCCTCATCTCCCTTCTCCACTGCTGGTTGTGTGGTGAGGAGGTACAGTTCCTAGCCCC
+SRR10510898.1 GGACTGGA-CAGTTTGC:GACCTC:NS500422_606_HJHGKBGX5_1_11101_15231_2223 length=59
AAAAAA/AEEEAEEEEEEEEEE/EEAE/E/AE/EEAE//EEEEEEE6EEEEEE/AEEEE
@SRR10510898.2 GGACTGGA-CAGTTTGC:CCATAA:NS500422_606_HJHGKBGX5_1_11101_13867_2769 length=60
GCTTTCGAAAGAAGAAAACTCACCTGTGTGAAGAAATGGTATCTGCTTTCAATAAAACTG
+SRR10510898.2 GGACTGGA-CAGTTTGC:CCATAA:NS500422_606_HJHGKBGX5_1_11101_13867_2769 length=60
AA6AAAEEEEEEEEEEEEEEEAEEEEEEEEEEEE6AEEEEAEEEEEEEEEE/EE//<EEE
@SRR10510898.3 GGACTGGA-CAGTTTGC:AAAACG:NS500422_606_HJHGKBGX5_1_11101_7560_6078 length=60
GGGTCATAGTAAACAAGAAAGGAGAGATGAAAGGCTCTGCTATCACAGGTCCAGTGGCAA
==============
SRR10510899.fastq.gz

@SRR10510899.1 GCTCGTAG-AAGGTAAT:ACCTTT:NS500422_606_HJHGKBGX5_1_11101_6791_7560 length=61
GTAGTGAGTAAATCTGGAGGGAGGTGCAGAGCCGAGGAGTCGGTGGGCAGAGGCTCTCCTG
+SRR10510899.1 GCTCGTAG-AAGGTAAT:ACCTTT:NS500422_606_HJHGKBGX5_1_11101_6791_7560 length=61
AA6AAEEEEEAEEEEEEEEEEEEE/EAAE/EEAEEEAEEEAEA/EEEEEEEEAEEEEEEEE
@SRR10510899.2 GCTCGTAG-AAGGTAAT:TGGAGT:NS500422_606_HJHGKBGX5_1_11101_7883_7977 length=61
GCCGTATTGTAGCTGATCGGGAAATGTTTGATATCTCAGCAATTTTGCATTTTTGTGTCTC
+SRR10510899.2 GCTCGTAG-AAGGTAAT:TGGAGT:NS500422_606_HJHGKBGX5_1_11101_7883_7977 length=61
AAAAA6EEEEEEEEEEEEEEAA/EEEEEEEEEEEEEEE/AAAEE/EAAEEEEAA/EAEEEE
@SRR10510899.3 GCTCGTAG-AAGGTAAT:TCAATG:NS500422_606_HJHGKBGX5_1_11101_13212_10780 length=61
TTCCGATCACCTTGCTGGCATCAGATGCACCTCAAGCATGCTGTACCACAACTGTCTGCCC
==============
SRR10510900.fastq.gz

@SRR10510900.1 1 length=61
GCTTCCTCATACAGTTATAGTAAGGCTGTCACTTGCTTCAGAACAATCATTCTTGAAATAT
+SRR10510900.1 1 length=61
AAAAAEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEE6AE6EEEEEEEE/EEEEEE
@SRR10510900.2 2 length=60
CTGGGGACCAAAGCTGCGGACCCATCATAGCTGACCAACCCTGTTGCCCTTGGACTCCTA
+SRR10510900.2 2 length=60
AAA/AEEEEEE6EEEE<EEEEEEAAEEEEEEEEEEE/EA<EEEEEEEEAEAEE/EEEEEA
@SRR10510900.3 3 length=61
CACCCTTCTCTTAACTATTTCTCTAACGCTCCCCTTCCTGCCTGCTCTGGGAGTAGGGAGG
==============
SRR10510901.fastq.gz

@SRR10510901.1 1 length=30
CCCTCCCTGCCCCAGCTGGCTGCCCTCCCC
+SRR10510901.1 1 length=30
AAAAAEEEEE/EE/AEEEE/EAEEEAEA/E
@SRR10510901.2 2 length=30
TTGTTGTTGTTGTTATATGTGTGTGTTTTA
+SRR10510901.2 2 length=30
A6AAAE/A/EEEEEEE/AEAA/E</E<EEA
@SRR10510901.3 3 length=61
GCTCCATGTAATTATTGGATCAACATTCCTTATTGTTTGCCTACTACGACAACTAAAATTT
==============

Do you have any idea how to obtain the barcode information for the reads in the other runs?
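For anyone who wants to check other runs the same way, here is a minimal Python sketch that tests whether a FASTQ file's read headers carry barcode and UMI fields. The header layout (`<barcode1>-<barcode2>:<UMI>:<instrument read name>`) is an assumption inferred only from the SRR10510898 and SRR10510899 excerpts above, and the function names `parse_header` and `barcode_fraction` are hypothetical, not part of any pipeline.

```python
import gzip
import re
from itertools import islice

# Assumed header layout, inferred from the excerpts in this thread:
#   @<run>.<n> <barcode1>-<barcode2>:<UMI>:<instrument read name> length=<n>
HEADER_RE = re.compile(
    r"^@\S+\s+(?P<barcode>[ACGTN]+-[ACGTN]+):(?P<umi>[ACGTN]+):(?P<read_name>\S+)"
)


def parse_header(header):
    """Return (barcode, umi) if the header carries them, else None."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    return match.group("barcode"), match.group("umi")


def barcode_fraction(fastq_path, max_reads=1000):
    """Fraction of the first max_reads headers that carry a barcode.

    Assumes plain four-line FASTQ records (header, sequence, '+', quality).
    """
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        # Every fourth line, starting from the first, is a record header.
        headers = list(islice(islice(handle, 0, None, 4), max_reads))
    if not headers:
        return 0.0
    hits = sum(parse_header(h) is not None for h in headers)
    return hits / len(headers)
```

On the excerpts above, a header from SRR10510898 would parse into a barcode and a UMI, while a header such as `@SRR10510900.1 1 length=61` would return `None`.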

Thank you!

Hmm good question. Could these files be useful instead? https://www.dropbox.com/sh/cd1o0rypmfzizn3/AADsTJl4qEkG6er-BA_pDilua?dl=0

Thanks a lot for your reply. But the content of the URL is inaccessible to me. The message says "Link temporarily disabled!
This can happen when the link has been shared or downloaded too many times in a day."
Would you mind changing the sharing permissions of the link?

P.S. I'm sorry, I'm not familiar with the issue feature, which led me to close and reopen this issue many times.

Maybe I can share it. What email should I use?

My e-mail address is xidianbill@gmail.com

Thanks a lot~

Try this link https://www.dropbox.com/scl/fo/efokj6rgmx5fhpvyxoxpa/h?rlkey=7yu8k93aftgazbpioulwlxhrg&dl=0

Yeah, it works! But these files appear to be intended for RNA velocity analysis. We want to study alternative splicing during cell differentiation, and we need the raw read information for each cell.

We found that in the SRA database, the raw data for SRR10510898 and SRR10510899 are marked as version 3 and were uploaded recently. Do you plan to update the other runs in the future?

Hi,

Hmm, I'm not sure why those two files differ from the rest. I have prepared a Dropbox link with raw data for one of the experimental replicates in the paper. Let me know if this is helpful.

https://www.dropbox.com/scl/fo/moqxw8m8pjs2o1qc1k6hz/h?rlkey=8s1ojf8q5dlwvh7qcgv9raycf&dl=0

Thanks a lot for your help. We are running the pipeline to check the seqname and barcode pairs. I will let you know as soon as the results are ready~

Hi @calebweinreb ,

It works!!! We ran the indrop pipeline with the StateFate2.yaml file and obtained read seqnames containing the barcode and UMI, which satisfies our needs. But there is another problem. We found that the new BAM file for d2_1 has fewer than 20 million aligned reads, while the BAM file for SRR10510898, which is also the d2_1 sample, has about 48 million aligned reads. Do you know what the difference between them is?
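A comparison like this depends on how aligned reads are counted. As a rough sketch, the following Python function counts mapped primary records in SAM-format text by checking the standard SAM flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name `count_primary_mapped` is hypothetical, and this is not necessarily how the counts quoted above were obtained.

```python
# Standard SAM flag bits (from the SAM format specification).
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def count_primary_mapped(sam_lines):
    """Count records that are mapped and are primary alignments.

    sam_lines is any iterable of SAM text lines; header lines
    (starting with '@') and malformed lines are skipped.
    """
    count = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        # Skip unmapped reads and non-primary alignment records.
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        count += 1
    return count
```

One way to use it would be to pipe the text output of `samtools view` for each BAM file into a small script that calls `count_primary_mapped(sys.stdin)`, so both files are counted under the same definition.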

Yes, the difference is that we resequenced these data and the Dropbox folder just has the first sequencing run. I can add the other runs.

Thank you so much~ That will be a huge help for us. We are looking forward to the other runs!

The files are in Dropbox.

Hi,

Everything is OK now. Thanks for all your help~
Best wishes