cory-weller / pfc-atlas-qtl

pfc-atlas-qtl

Files

ATAC caQTL blacklist files were prepared from 1) RefSeq TSS positions and 2) the Boyle Lab ENCODE blacklist, yielding TSS-blacklist.bed, boyle-blacklist.bed, and boyle-plus-TSS-blacklist.bed.

Fingerprinting

See GATK documentation

0. Prepare reference data

The tool needs a haplotype map file (VCF style) whose chromosome order is identical to that of your other inputs. I manually reordered the map with a series of grep commands, and prepared the reference and sequence dictionary by running prepare-refdata.sh.
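The reordering step can be sketched roughly as below. This is a minimal illustration, not the exact commands I ran: the function name `reorder_map` and the input file names in the usage comment are placeholders, and it assumes a tab-delimited map whose header lines start with `#` or `@` and whose first column is the chromosome name. Header lines are passed through unchanged, so any contig lines in the header may need the same treatment.

```shell
# Hypothetical helper: print a map file with its records grouped in the
# chromosome order given (one name per line) in a second file.
# Assumes header lines start with '#' or '@' and column 1 is the chromosome.
reorder_map() {
    local map_file="$1" order_file="$2" chrom
    # Header lines first, unchanged
    grep '^[#@]' "$map_file"
    # Then the records for each chromosome, in the requested order
    while read -r chrom; do
        awk -F'\t' -v c="$chrom" '$1 == c' "$map_file"
    done < "$order_file"
}

# Example (input file names are placeholders):
#   cut -f1 reference.fa.fai > chrom-order.txt
#   reorder_map hg38_chr.map chrom-order.txt > hg38_chr.reorder.map
```

Matching on the whole first column (rather than a prefix) keeps chr1 from also pulling in chr10 through chr19.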

1. Get RNA fingerprints from BAM

Generate a two-column file that contains SAMPLEID and FULL_BAMFILE_PATH (no header), in my case sample_bam.txt.

Then submit a job array with one job per RNA BAM, each running get-rna-fingerprints.sh:

mkdir -p fingerprints/rna
samplefile='sample_bam.txt'
njobs=$(wc -l < "${samplefile}")  # read via stdin so wc prints only the count, not the filename
sbatch --array=1-${njobs}%50 scripts/get-rna-fingerprints.sh ${samplefile}
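Because each array task corresponds to one line of the sample file, a malformed line only surfaces as a failed job. A small pre-flight check along the following lines can catch that earlier. It is a sketch under assumptions: the function name `check_samplefile` is hypothetical, and it assumes that neither sample IDs nor BAM paths contain whitespace.

```shell
# Hypothetical pre-flight check for the two-column sample file
# (SAMPLEID <whitespace> FULL_BAMFILE_PATH, no header).
# Prints one message per problem and returns non-zero if any were found.
check_samplefile() {
    awk '
        NF != 2 {
            printf "line %d: expected 2 columns, found %d\n", NR, NF
            bad = 1
        }
        NF == 2 && seen[$1]++ {
            printf "line %d: duplicate sample ID %s\n", NR, $1
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example:
#   check_samplefile sample_bam.txt && echo "sample file looks OK"
```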

2. Combine RNA fingerprints into one file

Once all jobs complete and separate fingerprint files exist (one per sample), merge them all into a single file, rna-merged.vcf.gz:

cd fingerprints/rna
module load samtools
ls *.vcf.gz > files.txt
bcftools merge --file-list files.txt -o rna-merged.vcf
bgzip rna-merged.vcf 
tabix -p vcf rna-merged.vcf.gz
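Before running the merge, it may be worth confirming that every array job actually produced a fingerprint file. The sketch below lists the sample IDs that have none. The function name `missing_fingerprints` is hypothetical, and the `<SAMPLEID>.vcf.gz` naming is an assumption about what get-rna-fingerprints.sh writes, so adjust it to the real output names.

```shell
# Hypothetical check: list sample IDs from the sample file that have no
# non-empty fingerprint VCF in the given directory.
# Assumes each job writes <SAMPLEID>.vcf.gz (naming not confirmed).
missing_fingerprints() {
    local samplefile="$1" outdir="$2" sample_id bam_path
    while read -r sample_id bam_path; do
        [ -n "$sample_id" ] || continue          # skip blank lines
        [ -s "${outdir}/${sample_id}.vcf.gz" ] || echo "$sample_id"
    done < "$samplefile"
}

# Example, run from the project root:
#   missing_fingerprints sample_bam.txt fingerprints/rna
```

An empty result means every sample in the sample file has a file to merge.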

3. Get DNA fingerprints from plink bed/bim/fam

merge-nabec-hbcc-genotypes.sh generates the fingerprints file dna-merged.vcf.gz. The script is currently very specific to this dataset.

bash scripts/merge-nabec-hbcc-genotypes.sh

4. Calculate fingerprints on whole set

rna='fingerprints/rna/rna-merged.vcf.gz'
dna='fingerprints/dna/dna-merged.vcf.gz'
map='hg38_chr.reorder.map'
output='fingerprints/crosscheck/all-pairs.txt'
sbatch scripts/crosscheck.sh ${rna} ${dna} ${map} ${output}
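For a quick look at the result before plotting, the all-pairs table can be reduced to the best-scoring partner per sample. The sketch below assumes that crosscheck.sh wraps GATK's CrosscheckFingerprints and that the output is its tab-delimited metrics file with columns named LEFT_GROUP_VALUE, RIGHT_GROUP_VALUE and LOD_SCORE. Those column names and the function name `best_matches` are assumptions, so check them against the header of your own output.

```shell
# Hypothetical summary: for each LEFT_GROUP_VALUE, print the RIGHT_GROUP_VALUE
# with the highest LOD_SCORE. Columns are located by name from the header row.
best_matches() {
    awk -F'\t' '
        /^#/ || NF == 0 { next }              # skip metrics comments and blank lines
        !have_header {                        # first remaining line is the column header
            for (i = 1; i <= NF; i++) col[$i] = i
            have_header = 1
            next
        }
        {
            left = $(col["LEFT_GROUP_VALUE"])
            lod  = $(col["LOD_SCORE"])
            if (!(left in best) || lod + 0 > best[left] + 0) {
                best[left]    = lod
                partner[left] = $(col["RIGHT_GROUP_VALUE"])
            }
        }
        END {
            for (left in best) printf "%s\t%s\t%s\n", left, partner[left], best[left]
        }
    ' "$1" | sort
}

# Example:
#   best_matches fingerprints/crosscheck/all-pairs.txt
```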

Plot results

The script plot-crosschecks.R is very specific to the samples run in this project, so modify as needed.

module load R/4.3
Rscript scripts/plot-crosschecks.R

QTL Analysis

Generate table of samples along with batch and bam location

module load R/4.3 && \
Rscript scripts/finalize-samples.R

Pseudobulked counts are in /data/CARD_singlecell/brain_atlas_wnn/output/rna/

Prepare genotypes

See genotypes directory's README file.

Plot covariates along Principal Component axes

module load R/4.3 && \
Rscript scripts/plot-covariates.R

[Figure: HBCC covariate plots]

[Figure: NABEC covariate plots]

Based on their separation along principal component 1 within the HBCC cohort, the following 8 samples were excluded by running remove-hbcc-outliers.sh, which generates genotypes/HBCC_polarized_nooutliers.{bed,bim,fam}.

FID PC1
HBCC_1058 0.2968
HBCC_1331 0.3027
HBCC_1385 0.3055
HBCC_1431 0.2655
HBCC_1560 0.3043
HBCC_2429 0.3139
HBCC_2756 0.3283
HBCC_2781 0.2680
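A list like the one above can be pulled from a table of principal components with a simple threshold filter. The sketch below is only illustrative: the function name `pc1_outliers` is hypothetical, the input is assumed to be a whitespace-delimited table with a header row, the sample ID in column 1 and PC1 in column 2, and the cutoff of 0.25 in the example is an assumed value chosen because all eight samples listed exceed it. The criterion actually used by remove-hbcc-outliers.sh may differ.

```shell
# Hypothetical sketch: print the IDs of samples whose PC1 exceeds a cutoff.
# Input: whitespace-delimited table, header row, ID in column 1, PC1 in column 2.
pc1_outliers() {
    local pcs_file="$1" cutoff="$2"
    awk -v t="$cutoff" 'NR > 1 && ($2 + 0) > (t + 0) { print $1 }' "$pcs_file"
}

# Example (file name and cutoff are placeholders):
#   pc1_outliers hbcc-pcs.txt 0.25 > hbcc-outlier-ids.txt
```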

Prepare final tensorQTL covariates files

See prep-covariates.R. The script generates a data frame of covariates that is later subset for each cell-type-specific QTL run.

module load R/4.3 && \
Rscript scripts/prep-covariates.R

Format expression counts as BED

See prep-QTL-bedfiles.R, which generates a file named {mode}-{cohort}-{celltype}-counts.bed within the directory QTL-pseudobulk-counts.

module load R/4.3 && \
Rscript scripts/prep-QTL-bedfiles.R
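As far as I understand the tensorQTL documentation, phenotype BED files need a single header line beginning with `#`, the first four columns chr, start, end and phenotype_id, and records sorted by position within each chromosome. Whether prep-QTL-bedfiles.R already writes its output in that order is not stated here, so the following is only a sketch of how a file could be re-sorted while keeping its header on top. The function name `sort_bed_keep_header` and the example file names are hypothetical.

```shell
# Hypothetical helper: sort a tab-delimited BED file by chromosome (text order)
# and start position (numeric), leaving the first (header) line in place.
sort_bed_keep_header() {
    local bed="$1" tab
    tab=$(printf '\t')
    head -n 1 "$bed"
    tail -n +2 "$bed" | LC_ALL=C sort -t "$tab" -k1,1 -k2,2n
}

# Example (file names are placeholders):
#   sort_bed_keep_header counts.bed > counts.sorted.bed
```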

Manually add feature blacklists

The file data/array-params.tsv can include four columns. By default it will be generated with the first three. The fourth column specifying a feature blacklist can be added manually. I used awk to add data/TSS-blacklist.bed for the ATAC runs.

awk '{OFS="\t"; if ($0~/atac/) {print $1,$2,$3,"data/TSS-blacklist.bed"} else {print $0,""}}'  data/array-params.tsv \
> .tmpparams && \
mv .tmpparams data/array-params.tsv

Column  Value
1       Celltype, {Astro,ExN,InN,MG,OPC,Oligo,VC}
2       Mode of data, {rna,atac}
3       Cohort, {HBCC,NABEC}
4       Blacklist bed file with (at minimum) headers {chr,start,stop}

Run TensorQTL

See run-tensorQTL.sh. It takes a single argument, $BATCHNAME, which names the folder created for outputs, i.e. QTL-output/$BATCHNAME. The script must be submitted as a job array, with each job corresponding to one row of data/array-params.tsv.

Briefly, the script does the following:

  1. Imports the Nth row (of job array, using $SLURM_ARRAY_TASK_ID) from data/array-params.tsv
  2. Creates and uses a temporary lscratch working directory
  3. Executes intersect-files.R to generate run-specific subset of pseudobulk counts, covariates, and interaction terms
  4. Executes subset-plink.sh to generate run-specific subset of genotypes
  5. Executes tensorqtl job using singularity container
  6. Copies output from lscratch to $BATCHNAME within this project directory

sbatch --array=1-28%6 scripts/run-tensorQTL.sh scvicounts-polarized-interaction-noTSS
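Step 1 of the list above, reading the row that belongs to the current array task, can be sketched as follows. This is not copied from run-tensorQTL.sh: the function name `param_field` and the variable names in the usage comment are hypothetical, and it assumes data/array-params.tsv is tab-delimited with no header row, so that row N belongs to task N.

```shell
# Hypothetical helper: print one tab-delimited field from a given row of the
# parameter table. In a job array the row would be $SLURM_ARRAY_TASK_ID.
param_field() {
    local params_file="$1" row="$2" field="$3"
    awk -F'\t' -v r="$row" -v f="$field" 'NR == r { print $f }' "$params_file"
}

# Example use inside an array job (variable names are illustrative):
#   celltype=$(param_field data/array-params.tsv "$SLURM_ARRAY_TASK_ID" 1)
#   mode=$(param_field data/array-params.tsv "$SLURM_ARRAY_TASK_ID" 2)
#   cohort=$(param_field data/array-params.tsv "$SLURM_ARRAY_TASK_ID" 3)
#   blacklist=$(param_field data/array-params.tsv "$SLURM_ARRAY_TASK_ID" 4)  # may be empty
```

Splitting strictly on tabs keeps an empty fourth column empty, which matters for the rows that have no blacklist.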

Combine results

See rbind-qtl-results.R which generates three files:

  • cis-caQTL.tsv
  • cis-eQTL.tsv
  • cis-QTL-combined.tsv

Rscript scripts/rbind-qtl-results.R
wget -O data/00-common_all.vcf.gz https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-common_all.vcf.gz
zgrep -v -F '#' data/00-common_all.vcf.gz | awk '{print $1,$2,$3,$4,$5}' > rsids.txt
