Install the R package using devtools:
devtools::install_github("jakeyeung/sortchicAnalysis", ref = "main")
Example pipeline for processing fastq files, generating count tables, and running dimensionality reduction
The example_processing_pipeline/ directory contains a full example that takes fastq files, demultiplexes, trims, and maps them, tags the resulting bam files, generates count tables, and runs the count tables through dimensionality reduction.
The example scripts are numbered 1 to 6 and are run in ascending order. The SingleCellMultiOmics (SCMO) scripts are from v0.1.25: https://github.com/BuysDB/SingleCellMultiOmics/releases/tag/v0.1.25
1-run_demux.sh: runs demux.py from SCMO. Adapters, molecule barcodes, and cell barcodes are removed from reads and encoded into the fastq headers.
2-trim_fastq.sh: runs cutadapt to remove Illumina adapters
3-map-fastq.sh: uses bwa to map trimmed fastqs
4-sort_index_tag_bam.sh: uses SCMO to read fastq headers and mapping information to record PCR duplicates, cell barcodes, cut locations, and other meta information. See https://github.com/BuysDB/SingleCellMultiOmics/blob/master/TAGS.MD for a full description of the tags.
5-make_count_tables*.sh: reads tagged bams and outputs counts falling into genomic regions. Uses --filterXA to ignore reads with alternative hits, --minMQ to set a minimum mapping quality, --dedup to remove PCR duplicates, --r1only to count fragments rather than reads (there are often two reads per fragment), and -blacklist to ignore reads falling in bad regions.
6-run_LDA*.sh: runs latent Dirichlet allocation on cleaned count matrix (i.e. bad cells removed, bad regions removed).
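To make the effect of the step 5 filters concrete, here is a minimal sketch of the filtering logic in plain Python. It is not the SCMO implementation: the `Read` class, the `count_reads` function, its parameter names, and the use of fixed-width bins as the "genomic regions" are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical stand-in for one tagged bam record (not an SCMO or pysam class)."""
    cell: str                   # cell barcode
    chrom: str                  # chromosome name
    pos: int                    # 0-based position
    mapq: int                   # mapping quality
    is_read1: bool = True       # False for the second mate of a pair
    is_duplicate: bool = False  # True if flagged as a PCR duplicate
    has_alt_hits: bool = False  # True if the aligner reported alternative hits


def count_reads(reads, bin_size, min_mq=0, filter_xa=False,
                dedup=False, r1_only=False, blacklist=()):
    """Count reads per (cell, chrom, bin_start), mimicking the filters by name only.

    blacklist is an iterable of (chrom, start, end) half-open intervals.
    """
    counts = Counter()
    for r in reads:
        if r.mapq < min_mq:
            continue  # analogous to --minMQ
        if filter_xa and r.has_alt_hits:
            continue  # analogous to --filterXA
        if dedup and r.is_duplicate:
            continue  # analogous to --dedup
        if r1_only and not r.is_read1:
            continue  # analogous to --r1only: one count per fragment
        if any(c == r.chrom and s <= r.pos < e for c, s, e in blacklist):
            continue  # analogous to -blacklist
        bin_start = (r.pos // bin_size) * bin_size
        counts[(r.cell, r.chrom, bin_start)] += 1
    return counts


example_reads = [
    Read("cellA", "chr1", 100, 60),
    Read("cellA", "chr1", 150, 60, is_read1=False),    # mate of the read above
    Read("cellA", "chr1", 120, 60, is_duplicate=True),
    Read("cellA", "chr1", 5200, 10),                   # low mapping quality
    Read("cellB", "chr1", 5300, 60, has_alt_hits=True),
    Read("cellB", "chr2", 700, 60),                    # inside the blacklist below
    Read("cellB", "chr1", 5100, 60),
]

filtered = count_reads(example_reads, bin_size=5000, min_mq=40, filter_xa=True,
                       dedup=True, r1_only=True, blacklist=[("chr2", 500, 1000)])
unfiltered = count_reads(example_reads, bin_size=5000)
```

With every filter switched on, only one read per cell survives in this toy input, which is why unfiltered counts can overstate coverage.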
-
Explore HSPC dataset and get quantifications for log2 fold change from HSPCs to mature cell fates
-
Explore dual labeling outputs and plot the relationships between heterochromatin and cell types
For genome-wide analysis, it is better to look at bam files rather than count matrices (which are better for dimensionality reduction and fitting statistical models).
The bams below have already been split and merged by cell type.
H3K4me3: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5018603
H3K27me3: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5018601
H3K4me1: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5018604
H3K9me3: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5018607
The cell barcode of each read is stored in the bam tag "CB". The other tags are described at https://github.com/BuysDB/SingleCellMultiOmics/blob/master/TAGS.MD
To extract the cell barcode from a read in a bam, use pysam,
e.g. read.get_tag("CB")
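As a slightly fuller sketch of the pysam usage above, the snippet below counts reads per cell barcode. The function names and the example path are hypothetical. The counting helper accepts any iterable of objects with `has_tag` and `get_tag`, so it can be tried without a bam file; pysam is only imported when a bam is actually opened.

```python
from collections import Counter


def count_reads_per_cell(reads, tag="CB"):
    """Count reads per cell barcode from an iterable of pysam-like reads."""
    counts = Counter()
    for read in reads:
        # Skip reads that carry no cell barcode tag instead of raising KeyError.
        if read.has_tag(tag):
            counts[read.get_tag(tag)] += 1
    return counts


def count_reads_per_cell_in_bam(bam_path, tag="CB"):
    """Open a bam with pysam and count reads per cell barcode."""
    import pysam  # imported here so the helper above works without pysam installed

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # until_eof=True iterates over all reads and does not need a bam index.
        return count_reads_per_cell(bam.fetch(until_eof=True), tag=tag)


# Usage (hypothetical path):
# counts = count_reads_per_cell_in_bam("path/to/celltype.bam")
# print(counts.most_common(5))
```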
Avoid the qChIC bam files; they are old.