Irenexzwen / CHM13

Ultra-long reads for CHM13 genome assembly

Telomere-to-telomere consortium

Introduction

We have sequenced the CHM13hTERT human cell line on the Oxford Nanopore GridION. We have also generated approximately 50x coverage with 10X Genomics, as well as BioNano DLS and Arima Genomics HiC data. PacBio data for this cell line was previously generated by the Washington University School of Medicine and the University of Washington, and is available from NCBI SRA.

Human genomic DNA was extracted from the cultured cell line. As the DNA is native, modified bases will be preserved. We followed Josh Quick's ultra-long read (UL) protocol for library preparation and sequencing.

Data reuse and license

All data is released to the public domain (CC0) and we encourage its reuse. While not required, we would appreciate it if you acknowledged the "telomere-to-telomere" (T2T) consortium for the creation of this data, and we encourage you to join us if you would like to help finish the human reference genome. More information about our consortium can be found on the T2T homepage.

Draft Assembly

The current assembly draft (v0.4) was generated with Canu v1.7.1 from rel1 data up to 2018/11/15, together with the previously released PacBio data. Two gaps on the X chromosome, plus the centromere, were manually resolved. The assembly was polished with two rounds of nanopolish and two rounds of arrow. The estimated base accuracy is currently QV36, which we expect to improve once the 10X Genomics data is integrated. Structural variants on the X chromosome were identified using BioNano, and nanopore reads mapping locally to those regions were selected, reassembled, and used to patch the assembly. However, these patches have not yet been polished or validated against the BioNano data. The assembly has not been curated outside of the X chromosome.

The assembly is 2.94 Gbp in size with 657 contigs and an NG50 of 85.8 Mbp.

This should be considered a draft and likely has mis-assemblies, inaccurate consensus, and frame-shifted genes. It will be further validated, scaffolded with BioNano, and polished using the available data.
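For readers who want to reproduce a statistic like the NG50 quoted above, the sketch below shows one common way to compute it: sort contig lengths from longest to shortest and report the length at which the running total first reaches half of an assumed genome size. The function name ng50 and the input format (one contig length per line) are illustrative assumptions, and the genome size behind the figure above is not stated here, so it is left as a parameter.

```shell
# ng50 GENOME_SIZE
# Reads one contig length per line on stdin and prints the NG50,
# or "NA" if the contigs cover less than half of GENOME_SIZE.
ng50() {
  sort -nr | awk -v g="$1" '
    !found {
      sum += $1                      # running total of the longest contigs
      if (sum >= g / 2) { print $1; found = 1 }
    }
    END { if (!found) print "NA" }'
}

# Hypothetical usage (file name and genome size are assumptions):
#   cut -f2 assembly.fasta.fai | ng50 3100000000
```

Because NG50 is normalised by genome size rather than by assembly size, the result depends on the genome size supplied, which is worth recording alongside the number.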

Downloads

Sequencing Data

Oxford Nanopore Data

We sequenced approximately 100 flowcells of UL data for a total of 155 Gbp (50x coverage, 1.6 Gbp/flowcell). The read N50 is 70 kbp and there are 99 Gbp of data in reads >50 kbp (32x). The longest mapping read is 1.04 Mbp.
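The coverage figures above follow from dividing the sequenced bases by the genome size. The sketch below reproduces that arithmetic; the helper name fold_coverage and the 3.1 Gbp genome size are assumptions made for illustration, since the genome size used is not stated here.

```shell
# fold_coverage BASES_GBP GENOME_GBP
# Prints sequenced bases divided by genome size, rounded to the nearest integer.
fold_coverage() {
  awk -v bases="$1" -v genome="$2" 'BEGIN { printf "%.0f\n", bases / genome }'
}

# ASSUMPTION: 3.1 Gbp human genome size (not given in this document).
fold_coverage 155 3.1   # all UL reads, prints 50
fold_coverage 99 3.1    # reads >50 kbp, prints 32
```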

rel2 (genomic DNA)

rel2 is the same data as rel1 but basecalled again with the latest generation of callers (Guppy flip-flop 2.3.1). We provide mappings, generated with minimap2, to both our current draft assembly and GRCh38 with decoys, in CRAM format.

Downloads

rel1 (genomic DNA)

The full dataset as of 2019/01/09. These basecalls were generated on-instrument and use older versions of Guppy (depending on when the flowcell ran on the instrument).

Downloads

fast5 data

The raw fast5 data, without basecalls, is available for completeness. The data is grouped into 96 sets.

Downloads

10X Genomics Data

Raw fastq files

Approximately 50x of data was generated on a NovaSeq instrument. Based on the summary output of Supernova, there are 1.2 billion reads with 41x effective coverage. The mean molecule length is 130 kbp, with an N50 of 864 reads per barcode.

Downloads

BioNano DLS Data

Approximately 430x of data was generated using the Saphyr instrument and the DLE-1 enzyme. There are 15.2 M molecules with an N50 molecule length of 115.9 kbp and a max of 2.3 Mbp (2 M molecules > 150 kbp, N50 218 kbp). The assembly of the molecules is 2.97 Gbp in size with 255 contigs and an NG50 of 59.6 Mbp.

Downloads

  • BNX (md5: 59a7a5583e900e1e5cecb08a34b5b0dc)
  • CMAP (md5: cf1a6fbcf006a26673499b9297664fdb)
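The md5 checksums listed above can be used to confirm that a download completed intact. A minimal sketch follows, assuming GNU md5sum is installed (macOS ships md5 instead); the helper name verify_md5 and the file name in the usage comment are hypothetical.

```shell
# verify_md5 FILE EXPECTED_MD5
# Compares the md5 digest of FILE with EXPECTED_MD5; returns non-zero on mismatch.
verify_md5() {
  local file="$1" expected="$2" actual
  actual="$(md5sum "$file" | awk '{ print $1 }')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (got $actual)" >&2
    return 1
  fi
}

# Hypothetical usage (the local file name is an assumption):
#   verify_md5 chm13.bnx 59a7a5583e900e1e5cecb08a34b5b0dc
```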

HiC Data

The HiC raw data will be available soon.

Previously generated PacBio data

The PacBio data was previously generated and is available from the SRA.

Notes on downloading files

Files are generously hosted by Amazon Web Services. Although they are available as straightforward HTTP links, download performance is improved by using the Amazon Web Services command-line interface. To do so, change each link to the s3:// addressing scheme, that is, replace https://s3.amazonaws.com/nanopore-human-wgs/ with s3://nanopore-human-wgs/. For example, to download CHM13_prep5_S13_L002_I1_001.fastq.gz to the current working directory, use the following command.

aws s3 --no-sign-request cp s3://nanopore-human-wgs/chm13/10x/CHM13_prep5_S13_L002_I1_001.fastq.gz .

To download the full dataset, use the following command.

aws s3 --no-sign-request sync s3://nanopore-human-wgs/chm13/ .
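When working from a list of HTTP links, the rewrite to s3:// addresses can be scripted. The sketch below assumes path-style links of the form https://<host>/<bucket>/<key>; the helper name to_s3_uri is hypothetical.

```shell
# to_s3_uri URL
# Converts a path-style HTTPS link (https://<host>/<bucket>/<key>)
# into the equivalent s3://<bucket>/<key> address.
to_s3_uri() {
  local rest="${1#https://}"   # drop the scheme
  rest="${rest#*/}"            # drop the host name, keeping bucket/key
  printf 's3://%s\n' "$rest"
}

# Hypothetical usage, with $url holding one HTTP link:
#   aws s3 --no-sign-request cp "$(to_s3_uri "$url")" .
```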

The s3 command can also be used to get information on the dataset, for example reporting the size of every file in human-readable format.

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://nanopore-human-wgs/chm13/ 

To obtain technology-specific sizes, list the individual prefixes.

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://nanopore-human-wgs/chm13/nanopore/fast5
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://nanopore-human-wgs/chm13/nanopore/rel2
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://nanopore-human-wgs/chm13/assemblies

Adjusting settings such as max_concurrent_requests, as described in this guide, will improve download performance further.
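As a rough illustration of that tuning, the AWS CLI stores its S3 transfer settings per profile and they can be changed with aws configure set. The wrapper name tune_s3_transfers and the default values below are assumptions chosen for illustration, not values taken from the guide; suitable numbers depend on the available bandwidth and memory.

```shell
# tune_s3_transfers [REQUESTS] [QUEUE_SIZE]
# Raises S3 transfer parallelism for the default AWS CLI profile.
tune_s3_transfers() {
  local requests="${1:-20}"    # ASSUMPTION: illustrative value
  local queue="${2:-2000}"     # ASSUMPTION: illustrative value
  aws configure set default.s3.max_concurrent_requests "$requests"
  aws configure set default.s3.max_queue_size "$queue"
}

# Hypothetical usage:
#   tune_s3_transfers 32 4000
```

The settings are written to the AWS CLI configuration file, so they persist across later aws s3 cp and sync invocations.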

Contact

Please raise issues concerning this dataset on this GitHub repository.

History

* rel1 and rel2: 2nd March 2019. Initial release.
