qoldt / AIControl.jl

Implementation of the AIControl paper in Julia 1.0

AIControl.jl

AIControl makes ChIP-seq assays easier, cheaper, and more accurate by imputing background data from the large body of publicly available control data.

Here is an overview of the AIControl framework from our paper.

Figure 1: (a) Comparison of AIControl to other peak calling algorithms. (left) AIControl learns appropriate combinations of publicly available control ChIP-seq datasets to impute background noise distributions at a fine scale. (right) Other peak calling algorithms use only one control dataset, so they must use a broader region (typically within 5,000-10,000 bps) to estimate background distributions. (bottom) The learned fine-scale Poisson (background) distributions are then used to identify binding activities across the genome. (b) An overview of the AIControl approach. A single control dataset may not capture all sources of background noise. AIControl more rigorously removes background ChIP-seq noise by using a large number of publicly available control ChIP-seq datasets.

Major Updates

  • (12/14/2018) Cleared all deprecations. AIControl now works with Julia 1.0. Please delete the precompiled cache left by previous versions of AIControl. You may do so by deleting the .julia folder.
  • (12/15/2018) Updated some error messages to better direct users (12/13/2018).
  • (1/7/2019) Made AIControl Pkg3 compatible for Julia 1.0.3

Installation

AIControl can be used on any Linux or macOS machine. Although we have tested and validated that AIControl works on Windows machines, we believe the AIControl pipeline is easier to set up on Unix-based systems.

AIControl expects a sorted .bam file as input and outputs a .narrowpeak file. Typically, for a brand new ChIP-seq experiment, you start with a .fastq file, and you will need some external software to convert the .fastq file to a sorted .bam file. Thus, the whole AIControl pipeline needs the following programs and packages installed on your local machine. The sections below explain how to install them.

  • Julia (Julia 1.0 and above)
  • bowtie2: aligning a .fastq file to the hg38 genome
  • samtools: sorting an aligned bam file
  • bedtools: for converting a bam file back to a fastq file (OPTIONAL for Step 3.1)

1a. Installing Julia 1.0 for a Linux machine

The commands below will install Julia 1.0.3 on a Linux machine. Please change the URL accordingly if you need a different version. You can also download Julia here. We highly recommend avoiding the conda version of Julia, as it is currently known to have a problem locating libLLVM.so in many environments.

cd
wget https://julialang-s3.julialang.org/bin/linux/x64/1.0/julia-1.0.3-linux-x86_64.tar.gz
tar xvzf julia-1.0.3-linux-x86_64.tar.gz
echo  'export PATH=$PATH:~/julia-1.0.3/bin' >> ~/.bashrc
source ~/.bashrc
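After installation, it can help to confirm that the julia on your PATH is at least version 1.0. The following is a minimal sketch, not part of AIControl: the helper name version_ge is hypothetical, and it assumes julia --version prints a line of the form "julia version X.Y.Z".

```shell
# version_ge A B: succeed if version A >= version B.
# Compares up to three dot-separated numeric fields (hypothetical helper).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 3; i++) {
            if ((x[i] + 0) > (y[i] + 0)) exit 0
            if ((x[i] + 0) < (y[i] + 0)) exit 1
        }
        exit 0
    }'
}

if command -v julia >/dev/null 2>&1; then
    # Third word of "julia version X.Y.Z" is the version string.
    installed=$(julia --version | awk '{ print $3 }')
    if version_ge "$installed" 1.0; then
        echo "Julia $installed is recent enough"
    else
        echo "Julia $installed is older than 1.0" >&2
    fi
else
    echo "julia was not found on PATH" >&2
fi
```

If the check reports that julia is missing, open a new shell or re-source your ~/.bashrc so that the PATH change takes effect.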

1b. Installing Julia 1.0 for a macOS machine

To be filled.

2. Installing Julia Packages

The command below will install the required Julia packages and AIControl.

julia -e 'using Pkg; Pkg.add(["FileIO", "JLD2"]); Pkg.add(PackageSpec(url = "https://github.com/hiranumn/AIControl.jl"))'

3. Installing external software with miniconda

Please download and install miniconda from here. The command below will install the required external software using the conda package manager.

conda install -c bioconda bowtie2 samtools bedtools
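Once the installation steps are done, a quick check that every required program is reachable can save time later. This is a hedged sketch: missing_tools is a hypothetical helper, and the list of names simply mirrors the programs listed in this README.

```shell
# missing_tools NAME...: print each name that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools julia bowtie2 samtools bedtools)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing from PATH:" $missing >&2
else
    echo "All required tools were found"
fi
```

bedtools is only needed for Step 3.1, so a report that lists bedtools alone is not necessarily a problem.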

Data files required for AIControl

AIControl uses a massive amount of public control data for ChIP-seq (roughly 450 ChIP-seq runs). We have done our best to compress them so that you only need to download about 4.6GB. These files require approximately 13GB of free disk space once decompressed. The following commands will download and decompress the control data.

wget https://dada.cs.washington.edu/aicontrol/forward.data100.nodup.tar.bz2
tar xvjf forward.data100.nodup.tar.bz2
wget https://dada.cs.washington.edu/aicontrol/reverse.data100.nodup.tar.bz2
tar xvjf reverse.data100.nodup.tar.bz2
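Because the decompressed control data needs roughly 13GB, you may want to check free disk space before downloading. The sketch below is illustrative only: free_kb and has_free_gb are hypothetical helpers, the 13GB threshold is taken from the figure quoted above, and the archives themselves need additional space until you delete them.

```shell
# free_kb DIR: available space on the filesystem holding DIR, in KiB.
# Relies on the POSIX output format of df -P (4th column = available).
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_free_gb DIR N: succeed if at least N GiB are available under DIR.
has_free_gb() {
    [ "$(free_kb "$1")" -ge $(( $2 * 1024 * 1024 )) ]
}

if has_free_gb . 13; then
    echo "Enough free space for the decompressed control data"
else
    echo "Less than 13GB free in the current directory" >&2
fi
```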

You can also obtain the control files from our data repository or Google Drive.

Paper

We have an accompanying paper on bioRxiv that evaluates and compares the performance of AIControl against other peak callers across various metrics and settings: AIControl: Replacing matched control experiments with machine learning improves ChIP-seq peak identification (bioRxiv). You can find the supplementary data files and the peak files generated by the competing peak callers on Google Drive.

Running AIControl (step by step)

Step 0: Download a toy example.

The command below will download a .fastq file that you may use as a toy example.
It is also available at our data repository.

wget https://dada.cs.washington.edu/aicontrol/example.fastq

Step 1: Map your FASTQ file from ChIP-seq to the hg38 assembly from the UCSC database.

The following commands will a) download and untar the reference database file for bowtie2 and b) run bowtie2 to map a .fastq file to the UCSC hg38 genome, which is available at the UCSC repository.

wget https://dada.cs.washington.edu/aicontrol/bowtie2ref.tar.bz2
tar xvjf bowtie2ref.tar.bz2
bowtie2 -x bowtie2ref/hg38 -q -p 10 -U example.fastq -S example.sam

Unlike other peak callers, the core idea of AIControl is to leverage all available control datasets. This requires all data (your target and public control datasets) to be mapped to the exact same reference genome. Our control datasets are currently mapped to the hg38 assembly from the UCSC repository. So please make sure that your data is also mapped to the same assembly. Otherwise, our pipeline will report an error.

Step 2: Convert the resulting sam file into a bam format.

samtools view -Sb example.sam > example.bam

Step 3: Sort the bam file in lexicographical order.

samtools sort -o example.bam.sorted example.bam
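Steps 1 to 3 can be chained into one function so that a failure in an early step stops the later ones. This is a sketch under stated assumptions: run and map_and_sort are hypothetical helpers, the bowtie2 index is assumed to sit at bowtie2ref/hg38 as in Step 1, and the samtools view output is written with -o instead of shell redirection. Setting DRY_RUN=1 prints the commands without executing them.

```shell
# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# map_and_sort FASTQ PREFIX [THREADS]
# Produces PREFIX.sam, PREFIX.bam, and PREFIX.bam.sorted (Steps 1 to 3).
map_and_sort() {
    fastq=$1
    prefix=$2
    threads=${3:-10}
    run bowtie2 -x bowtie2ref/hg38 -q -p "$threads" -U "$fastq" -S "$prefix.sam" &&
    run samtools view -Sb -o "$prefix.bam" "$prefix.sam" &&
    run samtools sort -o "$prefix.bam.sorted" "$prefix.bam"
}

# Usage (toy example from Step 0):
#   map_and_sort example.fastq example
```

The && chaining means that samtools never runs on a partial .sam file if bowtie2 fails.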

Step 3.1: If AIControl reports an error for a mismatch of genome assembly.

You are likely here because the AIControl script raised an error. The error is most likely caused by a mismatch between the genome assembly that your dataset is mapped to and the one that the control datasets are mapped to. Our control datasets are mapped to hg38 from the UCSC repository. Your bam file, on the other hand, is probably mapped to a slightly different version of the hg38 assembly or uses a different (non-lexicographic) ordering of chromosomes. For instance, a .bam file downloaded directly from the ENCODE website is mapped to a slightly different chromosome ordering of hg38. The recommended way to resolve this issue is to extract a .fastq file from your .bam file, go back to Step 1, and remap it with bowtie2 using the UCSC hg38 assembly. bedtools provides a way to generate a .fastq file from your .bam file.

bedtools bamtofastq  -i example.bam -fq example.fastq

We will regularly update the control data when a new major version of the genome becomes available; however, covering every minor variant of an existing version is not realistic.
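To see whether chromosome ordering is a likely cause of the mismatch described in Step 3.1, you can inspect the BAM header. The sketch below is an assumption-laden diagnostic, not the check that AIControl itself performs: bam_chrom_names and names_sorted are hypothetical helpers, and "lexicographic" is taken to mean byte-wise sorted sequence names (chr1, chr10, chr11, ..., chr2, ...).

```shell
# bam_chrom_names BAM: print the @SQ sequence names in header order.
bam_chrom_names() {
    samtools view -H "$1" |
        awk -F '\t' '$1 == "@SQ" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/) print substr($i, 4)
        }'
}

# names_sorted: succeed if the lines on stdin are already in
# byte-wise (C locale) sorted order.
names_sorted() {
    LC_ALL=C sort -c 2>/dev/null
}

# Usage:
#   bam_chrom_names example.bam.sorted | names_sorted ||
#       echo "Chromosome order is not lexicographic; consider remapping" >&2
```

A passing check does not guarantee that the assembly matches, since the set of sequence names or their lengths may still differ from the control data.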

Step 4: Download the AIControl julia script.

The following command will download the AIControl Julia script. You can also find it within this GitHub repository.

wget https://github.com/hiranumn/AIControl.jl/raw/master/aicontrolScript.jl

Please also place the downloaded control data files in the same folder, or otherwise specify their location with the --ctrlfolder option.

Step 5: Run AIControl.

The command below will run AIControl.

julia aicontrolScript.jl example.bam.sorted --ctrlfolder=. --name=test

Run julia aicontrolScript.jl --help (or -h) for help.

We support the following flags.

  • --dup: using duplicate reads [default:false]
  • --reduced: using subsampled control datasets [default:false]
  • --ctrlfolder=[path]: path to a control folder [default:./data]
  • --name=[string]: prefix for output files [default:bamfile_prefix]
  • --p=[float]: p-value threshold [default:0.15]

If you would like to use the --dup or --reduced options, please download the corresponding versions of the compressed control data, indicated with .dup or .reduced.
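If you have several sorted bam files, the same command can be run in a loop, using each file name as the output prefix. This is a minimal sketch: output_prefix is a hypothetical helper, and it assumes your inputs follow the example.bam.sorted naming used in Step 3 and that the control data sits in the current folder.

```shell
# output_prefix FILE: strip the directory and the .bam.sorted suffix.
output_prefix() {
    basename "$1" .bam.sorted
}

for bam in ./*.bam.sorted; do
    # When nothing matches, the pattern stays literal; skip it.
    [ -e "$bam" ] || continue
    julia aicontrolScript.jl "$bam" --ctrlfolder=. --name="$(output_prefix "$bam")"
done
```

Additional flags from the list above, such as --p, can be appended to the julia command inside the loop.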

Simple troubleshooting

Make sure that:

  • You are using Julia 1.0.
  • You downloaded the necessary control files for --reduced or --dup if you are running with those flags.
  • You sorted the input bam files according to the UCSC hg38 assembly as specified in Step 1 (and 3.1).

We have tested our implementation on ...

  • macOS Sierra (2.5GHz Intel Core i5 & 8GB RAM)
  • Ubuntu 18.04
  • Windows 8.0

If you have any questions, please e-mail hiranumn at cs dot washington dot edu.

License:MIT License

