data-bundle-examples

About

A repository housing metadata and data files (or links to data files) that are being prepared as sample data bundles for various uses.

Hand-curated examples live in the root directory, each in its own directory named for the platform. Bulk-imported examples live in the import directory.

Since thousands of JSON files were produced in the import subdirectories, they have been removed from Git and included as a single compressed tarball import.tgz.

Extract Metadata Files

This command extracts the metadata JSON files associated with the import examples. It creates thousands of files, so please do not check them into Git:

tar zxf import/import.tgz
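Before unpacking, it can be handy to see how many JSON files the tarball holds without writing anything to disk. The following is a minimal sketch using Python's standard tarfile module. The function name count_json_by_dir and the grouping by directory depth are illustrative assumptions, not part of this repo's scripts, and the internal layout of import.tgz is not assumed beyond it containing .json files.

```python
import tarfile
from collections import Counter


def count_json_by_dir(tar_path, depth=2):
    """Count .json members per leading directory without extracting.

    depth is the number of leading path components used as the grouping
    key (hypothetical default; adjust it to the tarball's real layout).
    """
    counts = Counter()
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".json")):
                continue
            name = member.name
            if name.startswith("./"):
                name = name[2:]
            # Group by the directories that contain the file.
            directories = name.split("/")[:-1]
            counts["/".join(directories[:depth])] += 1
    return dict(counts)
```

Calling count_json_by_dir("import/import.tgz") from the repository root would return a dictionary mapping each directory prefix to the number of JSON files under it.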

Get Data Files

Download the fastq files associated with each hand-curated example:

bash bin/get_data.sh

For the import directory structure, first extract the metadata files (see above) and then run:

# submodule
git submodule update --init --recursive
git pull --recurse-submodules
# mac
brew install python3 # on a Mac, run brew upgrade first
# or ubuntu
sudo apt-get install python3.6
virtualenv -p python3.6 env
source env/bin/activate
pip install python-dateutil crcmod==1.7 boto boto3 jsonschema
python bin/get_import_data.py

Smartseq2

This is E-MTAB-5061. It is based on Jim's example; see his Google doc here.

Jim made several hundred sample bundles; we store only the first one in this repo, which corresponds to a single cell.

According to ArrayExpress: "Libraries were sequenced on an Illumina HiSeq 2000, generating 43 bp single-end reads." So I believe the single file is correct.

Drop-seq

This is GSE81904. It is based on Jim's example; see his Google doc here.

Jim made several bundles (one per sample); we store only the first one in this repo, which corresponds to a single sample and multiple cells.

10X

This is based on Jim's example; see his Google doc here.

See https://support.10xgenomics.com/single-cell-gene-expression/datasets/pbmc8k

Laura/EBI

Laura and others at EBI are working within the EBI framework, which will produce a JSON file via this link: https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&format=JSON&limit=0&fields=study_accession,sample_accession,experiment_alias,experiment_title,fastq_ftp,fastq_md5,fastq_bytes,last_updated&query=run_accession=ERR1630017
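To show how the link above is put together, here is a small sketch that rebuilds it from its query parameters with Python's standard library. The endpoint, parameter names, and field list are copied from the URL above. The function name ena_read_run_url is a hypothetical helper, and whether other run accessions return the same fields is an assumption.

```python
from urllib.parse import urlencode

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"

# Field list taken verbatim from the link above.
FIELDS = [
    "study_accession",
    "sample_accession",
    "experiment_alias",
    "experiment_title",
    "fastq_ftp",
    "fastq_md5",
    "fastq_bytes",
    "last_updated",
]


def ena_read_run_url(run_accession, fields=FIELDS):
    """Build the ENA portal search URL for one run accession."""
    params = {
        "result": "read_run",
        "format": "JSON",
        "limit": "0",
        "fields": ",".join(fields),
        "query": "run_accession=" + run_accession,
    }
    # Keep ',' and '=' unescaped so the URL matches the form shown above.
    return ENA_SEARCH + "?" + urlencode(params, safe=",=")
```

Calling ena_read_run_url("ERR1630017") reproduces the link above. Fetching it (for example with urllib.request.urlopen) is left out here, since it needs network access.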

Import

Larger-scale projects imported in full from ArrayExpress, GEO, etc. The JSON bundles for these are in import.tgz, which contains more than 30,000 files. Run tar -xf import.tgz to unpack, preferably on a RAM disk. These JSON files are generated from the tagStorm-format curated.tags file in sub-subdirectories of the import subdirectory.

See http://hgwdev.soe.ucsc.edu/~kent/hca/projects.html for a list of the projects involved.

TODO & Questions for the Group

Drop-seq

UMI offset, UMI size, cell barcode size, and cell barcode offset should not change for the Drop-seq assay (or the 10X version of the assay). It is possible to minimize the metadata here, given that together they are realistically one unit of information. It depends on the purpose of the JSON file, just FYI [from Tim].
- True. I could drop these, but as assay methods proliferate I thought it might be nice to have these spelled out. [from Jim].
Need to add 10X channel information on top of sequencer lane information for this [from Tim].
- I'm not quite sure what you mean by this. [from Jim]
Need to add lane information to this like you have in 10X [from Tim].
- OK. Everything has lanes now. [from Jim]
- Thanks, resolved [Tim]
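As a rough illustration of why the four layout values discussed above behave like one unit of information, the sketch below uses them together to cut a cell barcode and a UMI out of a barcode read. The names BarcodeLayout and split_barcode_read are hypothetical, and the Drop-seq numbers (a 12-base cell barcode followed by an 8-base UMI) are an assumption about the standard protocol rather than values read from this repo's metadata.

```python
from collections import namedtuple

# The four values that together describe where the barcodes sit in a read.
BarcodeLayout = namedtuple(
    "BarcodeLayout",
    ["cell_barcode_offset", "cell_barcode_size", "umi_offset", "umi_size"],
)

# Assumed standard Drop-seq layout: bases 0-11 cell barcode, 12-19 UMI.
DROPSEQ = BarcodeLayout(
    cell_barcode_offset=0, cell_barcode_size=12, umi_offset=12, umi_size=8
)


def split_barcode_read(sequence, layout):
    """Return (cell_barcode, umi) sliced out of one barcode read."""
    cb_end = layout.cell_barcode_offset + layout.cell_barcode_size
    umi_end = layout.umi_offset + layout.umi_size
    if len(sequence) < max(cb_end, umi_end):
        raise ValueError("read is shorter than the barcode layout requires")
    cell_barcode = sequence[layout.cell_barcode_offset:cb_end]
    umi = sequence[layout.umi_offset:umi_end]
    return cell_barcode, umi
```

Spelling the values out per assay, as Jim suggests, means a non-standard layout only needs a different BarcodeLayout, not different code.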

10X

This is documented as an example of v2 chemistry. Do we want v1 and V(D)J as well? If so, are you going to grab the data, or do we want to simulate from this data set [from Tim]?
- I can grab a 10x v1, which I think would be good to have. The V(D)J is sort of specialized. I'd like to skip it for now. [from Jim]
  - Agreed [Tim]
- There are files that are R1 that are type index, this may be misleading given there are sample indices in the I file, maybe call it barcodes (applies to other assays as well)? [Tim]
If you are going to have a type for fastq files to differentiate the file with the transcript, then this generalizes to Drop-seq and you should do that there too. You could also do this for Smartseq2 (both transcript) if you want to keep the pattern standard [from Tim].
- I've got it set up for both drop-seq and 10x_v2 to use type=index and type=reads [from Jim]
- There's a new set from Beijing that puts the index on the second rather than the first read. There's also sample sequence on the second read now. I added a new tag, assay.seq.umi_barcode_read, to help sort this out, and also assay.single_cell.cell_barcode_read.
- This is not standard, and I am not sure it can be called Drop-seq (unless their documentation is somehow messed up). We need to understand how much variation we will allow in an assay while still calling it a standard assay. [Tim]
Need to include the barcodes used for the 10X run (there are different library barcodes one can use) [from Tim].
- Hmm. There is an I.fastq.gz file (or is it I3.fastq.gz?) that has the observed sample library barcodes. I wonder if that's what you mean. Otherwise I'm not sure where to find it. I could parse it out of their matrix I suppose, but maybe it's somewhere pre-alignment. [from Jim]
- The barcode set is a part of the bcl2fastq command they have wrapped and called mkfastq, check the documentation on that command. [Tim]

Smart-seq2

Fluidigm C1 is more of a platform than an assay. Would probably be a good idea to record the protocol + Chip used for fluidigm, for example FluidigmC1(mRNA). [Tim]

General

The analysis.json files need to be redone to show an upload not an alignment.
- What do you mean by this? Analysis.json (now provenance.json) are generated after a green run not a purple upload [from Tim].
- It seems it was changed to manifest.json (which I like as a file). Resolved [Tim]
We need to check the fastq files. I don't think they are correct, since we expect multiple fastq files per data bundle.
- Smartseq2 I think is correct since it's a single-end experiment
  - this is not standard, Smartseq2 is expected to be paired sequencing [from Tim]
  - I've got both single and paired end examples now under smartseq2 [from Jim]
- Drop-seq I think is missing the fastq1 file since it was converted from BAM, so this is lost?
  - Agreed, this would happen if the bam was post alignment, pre-annotation [from Tim].
- This is corrected now. I couldn't find the fastqs in ArrayExpress, but the experiment is also in GEO. [from Jim].
  - I have some files for Smartseq2 and Drop-seq; where can I put them for get_data.sh to pull? I also have associated output files from pipelines that were run on the input data. It would be great to wget these files not into the data folder but into their respective bundles [from Tim].
- We've moved to a system where there's a manifest.json with a dir field that points to where the files live (it can include http://, https://, or ftp:// prefixes). So you should be able to put the data files anywhere that is web accessible.
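A minimal sketch of how a download script might resolve file locations from the dir field described above. Only the dir field and the three URL prefixes come from the discussion; the files list, its name key, and the function name file_locations are assumptions made for illustration and may not match the real manifest.json layout.

```python
import posixpath

REMOTE_PREFIXES = ("http://", "https://", "ftp://")


def file_locations(manifest):
    """Return one location per file listed in a manifest dictionary.

    Assumes a hypothetical shape: {"dir": ..., "files": [{"name": ...}]}.
    """
    base = manifest["dir"]
    names = [entry["name"] for entry in manifest.get("files", [])]
    if base.startswith(REMOTE_PREFIXES):
        # Remote directory: join with '/' regardless of the local OS.
        return [base.rstrip("/") + "/" + name for name in names]
    # Otherwise treat dir as a path relative to the bundle.
    return [posixpath.join(base, name) for name in names]
```

With this shape, moving the data files to another host would only require changing dir in the manifest, not the individual file entries.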
We have a cell.json and sample.json... do we need both? Laura and Tim think it's overlapping for sample and should just use sample.json.
- Agreed, moved to an attic space for now [from Tim].
- Cell I kept since it does contain unique info, but stuff unique to cell-at-a-time assays. I kind of expect it'll get reworked and for now the pipeline can just ignore it. The red box likely will want this later if it exists. [from Jim]
Where does quality control for a release go?
- Would we want the release to be a bundle that contains the products of the release process (in line with our handling of green runs)? I would like to see in the release a file manifest, indication of white/grey/black listing, information about the criteria to be in each listing (because this may change between releases), time/date info [from Tim].
- I would like to see the quality info go in a qc_stats.json file. [from Jim]
What about samples being run multiple times (multiple lanes)? Do they get individual data bundles or a single data bundle which has been combined?
- My current thinking is that I feel this is best served by updating a bundle but not making a new one [Tim].
What will be the input file format expected in the system? Are we going to start with fastq.gz file uploads or something else? [from Tim]
- At the moment I'm making them all fastq.gz. I'm converting the .sra format to this for GEO accessions. The 10x examples were tar'd fastq.gzs and I untarred them. For ArrayExpress so far they have had .fastq.gz already available from URL, so no conversion there. I hear they are wanting to convert to CRAM. The one I saw with CRAM for alignments did also have .fastq.gz though. [from Jim]
  - We should talk more about this but I will use the fastq.gz assumption as well. [Tim]
Do we want to store the expression matrices in a format more usable for sparse data [from Tim]?
- I am inclined to say yes [Tim].
- The 10x have a hca5 based format I'd sort of like to stay away from. [from Jim]
- What do you not like about the format?
