pfc-atlas-qtl

Files

ATAC caQTL blacklist files were prepared from 1) RefSeq TSS positions and 2) the Boyle Lab ENCODE blacklist, yielding TSS-blacklist.bed, boyle-blacklist.bed, and boyle-plus-TSS-blacklist.bed.

Fingerprinting

See GATK documentation

0. Prepare reference data

The tool needs a haplotype map file (VCF style) with the same chromosome order as your other inputs. I manually reordered the map with a series of grep commands, and prepared the reference and sequence dictionary by running prepare-refdata.sh.
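As a rough illustration of the reordering step, the sketch below copies the header lines and then emits the records chromosome by chromosome in a requested order. The function name `reorder_map` and its file arguments are hypothetical; the grep commands actually used for this project are not recorded here.

```shell
#!/usr/bin/env bash
# Sketch: reorder a VCF-style haplotype map so its records follow a given
# chromosome order. reorder_map and its arguments are hypothetical names,
# not part of this repository.
reorder_map() {
    local map_in="$1"      # VCF-style haplotype map (uncompressed)
    local order_file="$2"  # one chromosome name per line, in the desired order
    local map_out="$3"
    local chr

    # Header lines (starting with '#') go first, unchanged.
    grep '^#' "$map_in" > "$map_out"

    # Append the records for each chromosome in the requested order.
    # awk compares the first column exactly, so chr1 does not also match chr10.
    while read -r chr || [ -n "$chr" ]; do
        awk -F'\t' -v c="$chr" '$1 == c' "$map_in" >> "$map_out"
    done < "$order_file"
}
```

Records keep their original order within each chromosome. Any `##contig` header lines are copied as they are, so they may need reordering separately if the tool compares them against the sequence dictionary.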

1. Get RNA fingerprints from `BAM`

Generate a two-column file that contains SAMPLEID and FULL_BAMFILE_PATH (no header), in my case sample_bam.txt.
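One possible way to build that file is sketched below. It assumes each BAM is named `<SAMPLEID>.bam` and writes tab-separated columns; both are assumptions, as are the function name `make_sample_table` and the directory layout, so adjust the ID parsing to your own naming scheme.

```shell
#!/usr/bin/env bash
# Sketch: build a headerless two-column file of SAMPLEID and full BAM path.
# Assumes BAMs are named <SAMPLEID>.bam (hypothetical naming scheme).
make_sample_table() {
    local bam_dir="$1"    # directory holding the RNA BAM files
    local out_file="$2"   # e.g. sample_bam.txt
    local bam sample fullpath

    : > "$out_file"       # start from an empty file
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue                  # glob matched nothing
        sample=$(basename "$bam" .bam)             # SAMPLEID from the file name
        fullpath="$(cd "$(dirname "$bam")" && pwd)/$(basename "$bam")"
        printf '%s\t%s\n' "$sample" "$fullpath" >> "$out_file"
    done
}
```

For example, `make_sample_table /path/to/rna-bams sample_bam.txt`, where the BAM directory is a placeholder.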

Then submit a job array with one job per RNA BAM, each running get-rna-fingerprints.sh.

mkdir -p fingerprints/rna
samplefile='sample_bam.txt'
njobs=$(wc -l < "${samplefile}")   # read via stdin so wc prints only the count, not the file name
sbatch --array=1-${njobs}%50 scripts/get-rna-fingerprints.sh ${samplefile}

2. Combine RNA fingerprints into one file

Once all jobs complete and a separate fingerprint file exists for each sample, merge them all into a single file, rna-merged.vcf.gz.

cd fingerprints/rna
module load samtools
ls *.vcf.gz > files.txt
bcftools merge --file-list files.txt -o rna-merged.vcf
bgzip rna-merged.vcf 
tabix -p vcf rna-merged.vcf.gz
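Before running the merge, it can help to confirm that every sample in the sample table actually produced a fingerprint file. The sketch below assumes per-sample outputs are named `<SAMPLEID>.vcf.gz`, which may not match what get-rna-fingerprints.sh writes; the function name `missing_fingerprints` is also hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the sample IDs that have no (or an empty) fingerprint file.
# Assumes outputs are named <SAMPLEID>.vcf.gz (hypothetical naming).
missing_fingerprints() {
    local samplefile="$1"   # two-column file: SAMPLEID, BAM path
    local fp_dir="$2"       # e.g. fingerprints/rna
    local sample _rest

    while read -r sample _rest || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue                      # skip blank lines
        # -s is true only if the file exists and is non-empty
        [ -s "$fp_dir/$sample.vcf.gz" ] || printf '%s\n' "$sample"
    done < "$samplefile"
}
```

Run from the project root, `missing_fingerprints sample_bam.txt fingerprints/rna` prints nothing when all samples are accounted for.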

3. Get DNA fingerprints from plink `bed/bim/fam`

merge-nabec-hbcc-genotypes.sh generates fingerprints file dna-merged.vcf.gz

The script merge-nabec-hbcc-genotypes.sh is currently very specific to this dataset.

bash scripts/merge-nabec-hbcc-genotypes.sh

4. Calculate fingerprints on whole set

rna='fingerprints/rna/rna-merged.vcf.gz'
dna='fingerprints/dna/dna-merged.vcf.gz'
map='hg38_chr.reorder.map'
output='fingerprints/crosscheck/all-pairs.txt'
sbatch scripts/crosscheck.sh ${rna} ${dna} ${map} ${output}
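For a quick look at the result before plotting, the sketch below reports the highest-scoring partner for each sample. It assumes crosscheck.sh writes the standard tab-separated crosscheck metrics table with `LEFT_GROUP_VALUE`, `RIGHT_GROUP_VALUE`, and `LOD_SCORE` columns preceded by `#` comment lines; that layout is an assumption about the wrapper's output, and `best_matches` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: for each LEFT_GROUP_VALUE, print the RIGHT_GROUP_VALUE with the
# highest LOD_SCORE. Column names are assumed, not confirmed for crosscheck.sh.
best_matches() {
    local metrics="$1"   # e.g. fingerprints/crosscheck/all-pairs.txt
    awk -F'\t' '
        /^#/ || NF == 0 { next }        # skip metrics comments and blank lines
        !have_header {                  # first remaining line holds column names
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("LEFT_GROUP_VALUE" in col) || !("RIGHT_GROUP_VALUE" in col) || !("LOD_SCORE" in col)) {
                print "expected columns not found" > "/dev/stderr"
                exit 1
            }
            have_header = 1
            next
        }
        {
            left  = $(col["LEFT_GROUP_VALUE"])
            right = $(col["RIGHT_GROUP_VALUE"])
            lod   = $(col["LOD_SCORE"])
            # keep the partner with the largest numeric LOD seen so far
            if (!(left in best) || lod + 0 > best[left] + 0) {
                best[left] = lod
                mate[left] = right
            }
        }
        END { for (s in best) printf "%s\t%s\t%s\n", s, mate[s], best[s] }
    ' "$metrics" | sort
}
```

The output has three tab-separated columns: sample, best partner, and LOD score.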

Plot results

The script plot-crosschecks.R is very specific to the samples run in this project, so modify as needed.

module load R/4.3
Rscript scripts/plot-crosschecks.R

QTL Analysis

Generate table of samples along with batch and bam location

module load R/4.3 && \
Rscript scripts/finalize-samples.R

Pseudobulked counts are in /data/CARD_singlecell/brain_atlas_wnn/output/rna/

Prepare genotypes

See genotypes directory's README file.

Plot covariates along Principal Component axes

module load R/4.3 && \
Rscript scripts/plot-covariates.R

HBCC covariate plots

NABEC covariate plots

Based on separation along principal component #1 for cohort HBCC, the following 8 samples were excluded by running remove-hbcc-outliers.sh to generate genotypes/HBCC_polarized_nooutliers.{bed,bim,fam}.

FID	PC1
HBCC_1058	0.2968
HBCC_1331	0.3027
HBCC_1385	0.3055
HBCC_1431	0.2655
HBCC_1560	0.3043
HBCC_2429	0.3139
HBCC_2756	0.3283
HBCC_2781	0.2680
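The contents of remove-hbcc-outliers.sh are not shown here. One common approach is to pass plink a two-column FID/IID list via `--remove`; the sketch below builds such a list by looking up the outlier FIDs in the cohort's `.fam` file. The helper name `make_remove_list` and the file names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: turn a list of outlier FIDs into a FID/IID list for plink --remove.
# make_remove_list is a hypothetical helper, not part of this repository.
make_remove_list() {
    local outlier_ids="$1"   # one FID per line (the FID column above, no header)
    local fam="$2"           # plink .fam file for the cohort
    # First file: remember the outlier FIDs. Second file: print FID and IID
    # for every .fam row whose FID was remembered.
    awk 'NR == FNR { drop[$1]; next } $1 in drop { print $1, $2 }' "$outlier_ids" "$fam"
}

# Hypothetical usage (input prefix is a placeholder):
#   make_remove_list outlier-fids.txt genotypes/INPUT.fam > outliers.txt
#   plink --bfile genotypes/INPUT --remove outliers.txt --make-bed \
#         --out genotypes/HBCC_polarized_nooutliers
```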

Prepare final tensorQTL covariates files

See prep-covariates.R. The script generates a data frame for subsetting with particular cell type QTL runs.

module load R/4.3 && \
Rscript scripts/prep-covariates.R

Format expression counts as BED

See prep-QTL-bedfiles.R, which generates a file {mode}-{cohort}-{celltype}-counts.bed within the directory QTL-pseudobulk-counts.

module load R/4.3 && \
Rscript scripts/prep-QTL-bedfiles.R
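tensorQTL expects phenotype BED files with one header line, four leading columns (chromosome, start, end, phenotype ID) followed by one column per sample, and rows sorted by chromosome and start. The sketch below is a small sanity check along those lines; whether the files written by prep-QTL-bedfiles.R already satisfy it is an assumption, and `check_qtl_bed` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: verify a tab-separated counts BED has a consistent column count,
# contiguous chromosome blocks, and non-decreasing start positions.
check_qtl_bed() {
    local bed="$1"   # uncompressed BED with a single header line
    awk -F'\t' '
        NR == 1 {
            # 4 annotation columns plus at least one sample column
            if (NF < 5) { print "header has fewer than 5 columns" > "/dev/stderr"; exit 1 }
            ncol = NF
            next
        }
        NF != ncol {
            printf("line %d: expected %d columns, found %d\n", NR, ncol, NF) > "/dev/stderr"
            exit 1
        }
        $1 != chr {
            if ($1 in seen) {
                printf("line %d: chromosome %s is not contiguous\n", NR, $1) > "/dev/stderr"
                exit 1
            }
            seen[$1]; chr = $1; prev = -1
        }
        $2 + 0 < prev {
            printf("line %d: start positions are not sorted\n", NR) > "/dev/stderr"
            exit 1
        }
        { prev = $2 + 0 }
    ' "$bed"
}
```

The function returns a non-zero status and prints the first problem it finds, so it can gate a later step, for example `check_qtl_bed some-counts.bed && echo ok`.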

Manually add feature blacklists

The file data/array-params.tsv can include four columns. By default it will be generated with the first three. The fourth column specifying a feature blacklist can be added manually. I used awk to add data/TSS-blacklist.bed for the ATAC runs.

awk '{OFS="\t"; if ($0~/atac/) {print $1,$2,$3,"data/TSS-blacklist.bed"} else {print $0,""}}'  data/array-params.tsv \
> .tmpparams && \
mv .tmpparams data/array-params.tsv

Column	Value
1	Celltype, `{Astro,ExN,InN,MG,OPC,Oligo,VC}`
2	Mode of data, `{rna,atac}`
3	Cohort, `{HBCC,NABEC}`
4	Blacklist `bed` file with (at minimum) headers `{chr,start,stop}`
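After editing the file by hand, a quick validation against the column definitions above can catch typos before jobs are submitted. This is only a sketch: it assumes the file is tab-separated with no header row, and `check_array_params` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: check each row of the params file against the allowed values.
# Assumes tab-separated rows and no header line (not confirmed by the scripts).
check_array_params() {
    local params="$1"   # e.g. data/array-params.tsv
    awk -F'\t' '
        BEGIN {
            n = split("Astro ExN InN MG OPC Oligo VC", a, " ")
            for (i = 1; i <= n; i++) celltype[a[i]]
            mode["rna"]; mode["atac"]
            cohort["HBCC"]; cohort["NABEC"]
        }
        # flag rows with unknown values or more than four columns
        !($1 in celltype) || !($2 in mode) || !($3 in cohort) || NF > 4 {
            printf("row %d looks malformed: %s\n", NR, $0) > "/dev/stderr"
            bad = 1
        }
        END { exit bad }
    ' "$params"
}
```

The function reports every malformed row on stderr and returns a non-zero status if any were found.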

Run TensorQTL

See run-tensorQTL.sh. It takes a single argument, $BATCHNAME, which names the output folder that will be created, i.e. QTL-output/$BATCHNAME. The script must be submitted as a job array. Each job corresponds to a row from data/array-params.tsv.

Briefly, the script does the following:

Imports the Nth row (of job array, using $SLURM_ARRAY_TASK_ID) from data/array-params.tsv
Creates and uses a temporary lscratch working directory
Executes intersect-files.R to generate run-specific subset of pseudobulk counts, covariates, and interaction terms
Executes subset-plink.sh to generate run-specific subset of genotypes
Executes tensorqtl job using singularity container
Copies output from lscratch to $BATCHNAME within this project directory

sbatch --array=1-28%6 scripts/run-tensorQTL.sh scvicounts-polarized-interaction-noTSS
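The first step in the list above, picking the row that matches the array task, can be sketched as follows. This is not the code from run-tensorQTL.sh; the function `load_params` and the variable names it sets are hypothetical, and the file is assumed to be tab-separated with no header.

```shell
#!/usr/bin/env bash
# Sketch: load the Nth row of the params file into shell variables.
# load_params and the variable names are hypothetical, for illustration only.
load_params() {
    local params="$1"    # e.g. data/array-params.tsv
    local task_id="$2"   # e.g. $SLURM_ARRAY_TASK_ID
    local row

    row=$(sed -n "${task_id}p" "$params")   # print only line number task_id
    [ -n "$row" ] || { echo "no row $task_id in $params" >&2; return 1; }

    celltype=$(printf '%s\n' "$row" | cut -f1)
    mode=$(printf '%s\n' "$row" | cut -f2)
    cohort=$(printf '%s\n' "$row" | cut -f3)
    blacklist=$(printf '%s\n' "$row" | cut -f4)   # empty if no blacklist given
}

# Inside a job array task this might be called as:
#   load_params data/array-params.tsv "$SLURM_ARRAY_TASK_ID" || exit 1
```

Failing early when the row is missing avoids running a job with empty parameters if the array range exceeds the number of rows in the file.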

Combine results

See rbind-qtl-results.R which generates three files:

cis-caQTL.tsv
cis-eQTL.tsv
cis-QTL-combined.tsv

Rscript scripts/rbind-qtl-results.R

wget -O data/00-common_all.vcf.gz https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-common_all.vcf.gz
zgrep -v '^#' data/00-common_all.vcf.gz | awk '{print $1,$2,$3,$4,$5}' > rsids.txt
