flo-compbio / singlecell

SingleCell: A Python/Cython Package for Processing Single-Cell RNA-Seq Data.

SingleCell

SingleCell is a Python package for processing single-cell RNA-Seq data.

Requirements

  • Python 3 (tested with Python 3.5)
  • STAR (tested with version 2.5.3a)
  • samtools (tested with version 1.4.1)

The STAR and samtools executables must both be in the PATH. To test this, you can run the following commands, and check that they return the respective version identifiers:

$ STAR --version
STAR_2.5.3a

$ samtools --version
samtools 1.4.1
Using htslib 1.4.1
Copyright (C) 2017 Genome Research Ltd.
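
The same check can be scripted. The following is a minimal sketch, not part of SingleCell: the helper names `find_missing_tools` and `report_versions` are hypothetical, and it only assumes that both tools accept `--version`, as shown above.

```python
import shutil
import subprocess

# Executables that this README says must be on the PATH.
REQUIRED_TOOLS = ["STAR", "samtools"]


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def report_versions(tools):
    """Print the first line of `<tool> --version` for each tool found on the PATH."""
    for tool in tools:
        if shutil.which(tool) is None:
            continue  # missing tools are reported separately
        result = subprocess.run(
            [tool, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        lines = result.stdout.splitlines()
        print(tool, "->", lines[0] if lines else "(no output)")


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    report_versions(REQUIRED_TOOLS)
```

If either name is printed as missing, fix your PATH before continuing with the steps below.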

Installation

Install the package in editable mode from the root of your local copy of the repository:

$ cd singlecell
$ pip install -e .

Creating a STAR index (only once)

To run the inDrop pipeline on your data, you first need a STAR genome index for the species your data comes from. A STAR index is a directory containing multiple files; for the human genome, these files total about 25 GB. You only need to create the index once per species; all future runs of the inDrop pipeline reuse it.

To generate an index, download the genome (in FASTA format) and the genome annotations (in GTF format) for the species from the Ensembl FTP server, and decompress both with gunzip. For example, for human:

$ curl -O http://ftp.ensembl.org/pub/release-88/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
$ gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

$ curl -O http://ftp.ensembl.org/pub/release-88/gtf/homo_sapiens/Homo_sapiens.GRCh38.88.gtf.gz
$ gunzip -c Homo_sapiens.GRCh38.88.gtf.gz > Homo_sapiens.GRCh38.88.gtf

For the genome annotation (GTF) file, also keep the compressed version (this is why gunzip -c is used above), because the inDrop pipeline uses the compressed file later.
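
If you prefer to do the decompression from Python, the sketch below has the same effect as `gunzip -c file.gz > file`: it writes the decompressed copy and leaves the `.gz` file untouched. The function name `decompress_keep_original` is hypothetical and not part of SingleCell; only the standard library is used.

```python
import gzip
import os
import shutil


def decompress_keep_original(gz_path):
    """Decompress `gz_path` next to itself and leave the .gz file in place.

    Returns the path of the decompressed file.
    """
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a path ending in .gz: " + gz_path)
    out_path = gz_path[:-3]  # strip the ".gz" suffix
    # Stream the data in chunks so large files are never held in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


if __name__ == "__main__":
    gtf_gz = "Homo_sapiens.GRCh38.88.gtf.gz"
    if os.path.exists(gtf_gz):
        print("wrote", decompress_keep_original(gtf_gz))
```

Afterwards both the `.gtf` and the `.gtf.gz` file are present in the directory.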

Now that you have those files ready, you can run the following:

$ indrop_generate_star_index.py -g Homo_sapiens.GRCh38.dna.primary_assembly.fa \
        -n Homo_sapiens.GRCh38.88.gtf \
        -od star_index_human -os build_star_index_human.sh \
        -ol build_star_index_human_log.txt \
        -t 16

This writes the STAR index to the directory "star_index_human" (see the -od parameter) and uses 16 threads in parallel (-t), which makes the build significantly faster than a single-threaded run.
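
When the index build is part of a larger script, it can help to assemble the command programmatically. The sketch below is illustrative only: `build_index_command` is a hypothetical helper, it uses just the flags shown in the command above (-g, -n, -od, -os, -ol, -t), and defaulting the thread count to the number of CPUs is an assumption, not a recommendation from the package.

```python
import os


def build_index_command(genome_fasta, annotation_gtf, index_dir,
                        script_file, log_file, threads=None):
    """Return the argument list for indrop_generate_star_index.py.

    If `threads` is None, fall back to the CPU count reported by the OS
    (an assumption made for this sketch).
    """
    if threads is None:
        threads = os.cpu_count() or 1
    return [
        "indrop_generate_star_index.py",
        "-g", genome_fasta,
        "-n", annotation_gtf,
        "-od", index_dir,
        "-os", script_file,
        "-ol", log_file,
        "-t", str(threads),
    ]


if __name__ == "__main__":
    cmd = build_index_command(
        "Homo_sapiens.GRCh38.dna.primary_assembly.fa",
        "Homo_sapiens.GRCh38.88.gtf",
        "star_index_human",
        "build_star_index_human.sh",
        "build_star_index_human_log.txt",
        threads=16,
    )
    # Only print the command here; pass `cmd` to subprocess.run(cmd, check=True)
    # to actually start the build.
    print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems if any of the paths contain spaces.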

Running the inDrop pipeline

To run the inDrop pipeline, you first need to create a configuration file (in YAML format) that contains the locations (paths) of all the input files, specifies an output directory, and sets a few parameters (e.g., how many cells to include in the expression matrix). To generate a configuration file template that you can then modify for your setup, run the following:

$ indrop_create_config_file.py -o my_configuration.yaml

After adjusting the parameters in the configuration file, you can check if everything is configured correctly:

$ indrop_check_pipeline.py -o my_configuration.yaml

If there are no errors, you can run the pipeline:

$ indrop_pipeline.py -c my_configuration.yaml
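
To make sure the pipeline only starts when the check succeeds, the two steps can be chained. This is a minimal sketch under stated assumptions: `run_in_order` is a hypothetical helper, the commands in `PIPELINE_STEPS` are copied as written above, and it assumes that each script signals failure through a non-zero exit code.

```python
import subprocess

# Commands copied from the steps above, with their flags as written there.
PIPELINE_STEPS = [
    ["indrop_check_pipeline.py", "-o", "my_configuration.yaml"],
    ["indrop_pipeline.py", "-c", "my_configuration.yaml"],
]


def run_in_order(commands):
    """Run each command in turn and stop at the first failure.

    Returns the number of commands that completed with exit code 0.
    """
    completed = 0
    for cmd in commands:
        try:
            result = subprocess.run(cmd)
        except FileNotFoundError:
            print("stopped: executable not found ->", cmd[0])
            break
        if result.returncode != 0:
            print("stopped: exit code", result.returncode, "->", " ".join(cmd))
            break
        completed += 1
    return completed
```

Calling `run_in_order(PIPELINE_STEPS)` returns 2 only if both the check and the pipeline run finished without an error exit code.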

License: BSD 3-Clause "New" or "Revised" License

