This repository contains Galaxy wrappers of dnpatterntools and a docker builder. A new addition to this repository is the Mapping_CC program.
The pat folder has previously derived nucleosome sequence positioning patterns. The seq folder has some sequences of nucleosomal DNA.
This Galaxy dnpatterntools instance, packaged as a docker container, can be pulled from the docker hub: https://hub.docker.com/r/erinija/dnpatterntools-galaxy
The dnpatterntools provide utilities to compute and analyze patterns of dinucleotide frequency distributions in a stack of nucleosome-wrapped DNA fasta sequences.
Computation of patterns of dinucleotide frequency distributions from nucleosome sequences consists of:
- computation of distribution of frequency of dinucleotide occurrences in a stack of aligned sequences;
- determination of a statistical dyad position in sequences;
- obtaining statistical patterns of dinucleotide frequency of occurrence in the sequence by applying symmetrization and computing composite dinucleotides WW/SS (W = A or T, S = C or G) and RR/YY (R = A or G, Y = C or T);
- normalization and smoothing of the patterns and computing their periodograms.
The sequences of mouse (mm9) nucleosomal DNA that will be used in this demo are available from Zenodo. We will use the sequences in controlm.fa.gz.
Start your dnpatterntools-galaxy instance locally or in docker. Log in and start a new history. Name the history Dnpatterntools demo.
- Copy the data link https://zenodo.org/record/3813510/files/controlm.fa.gz
- Open the Galaxy Upload Manager
- Choose Paste/Fetch data and paste the copied link into the placeholder
- Set Type to fasta.gz and Genome to mm9
- Press Start and Close.
The data file will appear in your history. It will be a compressed file. To extract the fasta file:
- Go to Edit attributes and select the Convert tab
- Select Convert compressed file to uncompressed and press the Convert datatype button.
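For readers who want to inspect the same kind of file outside Galaxy, the following is a minimal sketch of reading a gzip-compressed fasta file with the Python standard library. The function name `read_fasta` and the in-memory toy data are illustrative assumptions, not part of dnpatterntools.

```python
import gzip
import io


def read_fasta(handle):
    """Parse fasta text from an open text handle into {name: sequence}."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:]
            records[name] = []
        elif name is not None:
            # Sequence lines are collected and joined at the end.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Toy data compressed in memory; a real file would be opened by its path,
# for example gzip.open("controlm.fa.gz", "rt").
compressed = gzip.compress(b">seq1\nAATT\nCCGG\n>seq2\nGGCC\n")
with gzip.open(io.BytesIO(compressed), "rt") as handle:
    records = read_fasta(handle)

print(records)
```

Opening the file in text mode (`"rt"`) decompresses it on the fly, so no uncompressed copy needs to be written to disk.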
The new uncompressed fasta file will appear in your history. To rename the data:
- Go to Edit attributes and select the Attributes tab
- Change Name to controlm.fa and press Save. The renamed data file will reappear in your history.
In a stack of nucleosomal DNA sequences aligned by an experimental end, the frequency of occurrence of each dinucleotide is computed at each position. Given a binary matrix in which an occurrence of the selected dinucleotide in a sequence is coded as 1 and anything else as 0, the frequency profile is simply the sum of occurrences of that dinucleotide at every position along the sequence, normalized by the number of sequences. To compute dinucleotide frequencies:
- From the Tools panel, in the Dnpatterntools section, select the Dinucleotide frequencies tool
- In the field From Fasta select the controlm.fa data
- In the field Dinucleotides input all 16 dinucleotides separated by spaces (this is the default option)
- Press Execute
- Inspect the data
- Rename the dataset to Dinucleotide frequencies of control mouse
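The frequency computation described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the tool's implementation; the function name `dinucleotide_frequencies` and the toy sequences are assumptions made for the example.

```python
def dinucleotide_frequencies(sequences, dinucleotides):
    """Return {dinucleotide: [frequency at each position]} for aligned sequences.

    Position i counts the dinucleotide formed by bases i and i + 1.
    Assumption: sequences are aligned at their start; the shortest
    sequence sets the profile length.
    """
    n_positions = min(len(s) for s in sequences) - 1
    counts = {d: [0.0] * n_positions for d in dinucleotides}
    for seq in sequences:
        seq = seq.upper()
        for i in range(n_positions):
            pair = seq[i:i + 2]
            if pair in counts:
                counts[pair][i] += 1  # the "1" entry of the binary matrix
    n = len(sequences)
    # Normalize the column sums by the number of sequences.
    return {d: [c / n for c in column] for d, column in counts.items()}


seqs = ["AATTC", "AAGTC", "CATTG", "AATCC"]
freqs = dinucleotide_frequencies(seqs, ["AA", "TT", "TC"])
print(freqs["AA"])  # [0.75, 0.0, 0.0, 0.0]
print(freqs["TT"])  # [0.0, 0.0, 0.5, 0.0]
```

Three of the four toy sequences start with AA, so the AA frequency at the first position is 0.75.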
If there is no Visualize this data (Histogram) icon in your data section, you need to install the visualization tools yourself. Log in as administrator and install the charts tool of owner iuc.
All data generated by dnpatterntools are best visualized using the Line with focus (NVD3) visualization tool. Use it to visualize the computed frequency profiles of the AA, TT, CC, GG dinucleotides. The expected graph is shown in Figure 1. In the figure you see a sharp narrow peak on the left of the graph. This peak indicates a micrococcal nuclease cleavage site. It can serve as evidence that your fasta file indeed contains nucleosomal DNA sequences aligned by an experimental end. Don't forget to save your visualization in Saved visualizations of the User menu by pressing the Save button.
Figure 1. Frequency profiles of the AA, TT, CC, GG dinucleotides.
Symmetry is a hallmark of nucleosomal DNA (Luger et al., 1997), and statistically the distribution of peaks in the AA/TT, AT, GC, CC/GG, RR/YY dinucleotide frequency profiles along a nucleosomal sequence is expected to have a recognizable dyad symmetry. This dyad-symmetry property helps to determine the nucleosome's position in a stack of nucleosomal DNA sequences aligned by an experimental end.
At the nucleosome position centered on the dyad, the frequency profiles of dinucleotides on the forward and reverse complementary strands will have a maximum positive correlation. Therefore, a Pearson correlation coefficient between the frequency profiles on the forward and reverse complementary strands of each dinucleotide is computed at each position along the nucleosome sequence within a sliding window 146 bp long. To determine the position of the nucleosome in the mouse fasta sequences:
- From the Tools panel, in the Dnpatterntools section, select the Correlations tool
- In the field Dinucleotide frequency profiles select the Dinucleotide frequencies of control mouse data
- In the field Sliding window size input 146 (this is the default option; it is the length of the nucleosome, 146 base pairs)
- In the field Dinucleotides input all 16 dinucleotides separated by spaces (this is the default option)
- Press Execute
- Inspect the data
- Rename the new dataset to Correlations of Dinucleotide frequencies of control mouse
The Correlations tool outputs Pearson correlation coefficients for each dinucleotide at each position along the nucleosomal sequence, up to the sequence length minus 146. The info field contains the position and value of the maximum correlation for each dinucleotide. The maximum correlation among all dinucleotides is 0.669 at position 25 for AA, providing very strong support that a nucleosome starts at position 25 from the start of the fasta sequence. Figure 2 shows the Line with focus (NVD3) visualization of the correlations of the AA, CC and AC dinucleotides.
Figure 2. Correlations across a nucleosome sequence for frequency profiles of the CC, AA, AC dinucleotides.
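The sliding-window correlation can be illustrated with a small Python sketch. It rests on an assumption about how the two strands are compared: within each window, the profile of a dinucleotide is correlated with the reversed profile of its reverse complement (for example AA against reversed TT). The function names and toy profiles are hypothetical, and the Correlations tool may differ in detail.

```python
from math import sqrt

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dinucleotide):
    """Reverse complement of a dinucleotide, e.g. AA -> TT, AC -> GT."""
    return "".join(COMPLEMENT[base] for base in reversed(dinucleotide))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat window is reported as uncorrelated
    return cov / sqrt(var_x * var_y)


def dyad_symmetry_correlations(profiles, dinucleotide, window):
    """Correlation at every window start between the forward profile and
    the reversed profile of the reverse-complement dinucleotide."""
    forward = profiles[dinucleotide]
    reverse = profiles[reverse_complement(dinucleotide)]
    result = []
    for start in range(len(forward) - window + 1):
        f = forward[start:start + window]
        r = reverse[start:start + window][::-1]
        result.append(pearson(f, r))
    return result


# Toy profiles: the window starting at index 1 is perfectly dyad-symmetric.
profiles = {
    "AA": [0.1, 0.9, 0.2, 0.5, 0.3, 0.1],
    "TT": [0.4, 0.3, 0.5, 0.2, 0.9, 0.0],
}
corr = dyad_symmetry_correlations(profiles, "AA", window=4)
best = max(range(len(corr)), key=corr.__getitem__)
print(best, round(corr[best], 3))
```

In the toy data the maximum falls at window start 1, which plays the role of position 25 in the mouse example; with real data the window would be 146.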
In the previous step we chose 25 as the start position of the nucleosome. To further analyze patterns of dinucleotide occurrences in nucleosomal DNA, select only the interval of the nucleosome in the dinucleotide frequency profile data:
- From the Tools panel, in the Dnpatterntools section, select the Select interval tool
- In the field Table of profiles select the Dinucleotide frequencies of control mouse data
- In the field Start position input 25
- In the field for the interval length input 146 (this is the default option; it is the length of the nucleosome, 146 base pairs)
- In the field Dinucleotides input all 16 dinucleotides separated by spaces (this is the default option)
- Press Execute
- Inspect the data
- Rename the new dataset to Selected Dinucleotide frequencies of control mouse
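Conceptually, interval selection is a slice of every profile. The sketch below assumes a 0-based start index, which may not match the tool's own position convention; the function name `select_interval` and the toy values are illustrative.

```python
def select_interval(profiles, start, length):
    """Keep `length` positions of every profile, beginning at `start`.

    Assumption: `start` is a 0-based index into the profile.
    """
    for name, values in profiles.items():
        if start < 0 or start + length > len(values):
            raise ValueError(f"interval does not fit profile {name!r}")
    return {name: values[start:start + length]
            for name, values in profiles.items()}


profiles = {
    "AA": [0.1, 0.2, 0.3, 0.4, 0.5],
    "TT": [0.5, 0.4, 0.3, 0.2, 0.1],
}
selected = select_interval(profiles, start=1, length=3)
print(selected["AA"])  # [0.2, 0.3, 0.4]
```

For the mouse data the corresponding call would use a start of 25 and a length of 146.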
Symmetrization of the selected frequency profiles means superimposing the dinucleotide frequency profiles from the forward and complementary sequences with respect to the central dyad position of the frequency profile. To symmetrize:
- From the Tools panel, in the Dnpatterntools section, select the Symmetrize tool
- In the field input1 select the Selected Dinucleotide frequencies of control mouse data
- Press Execute
- Inspect the data
- Rename the new dataset to Symmetrized Dinucleotide frequencies of control mouse
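One way to picture symmetrization is as averaging each profile with the mirrored profile of its reverse complement. The sketch below makes exactly that assumption (a plain average about the central dyad); the Symmetrize tool may superimpose the profiles differently, and the names used here are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dinucleotide):
    """Reverse complement of a dinucleotide, e.g. AA -> TT, AT -> AT."""
    return "".join(COMPLEMENT[base] for base in reversed(dinucleotide))


def symmetrize(profiles):
    """Average each profile with the mirror image of the profile of its
    reverse-complement dinucleotide (assumed form of superimposition)."""
    result = {}
    for name, forward in profiles.items():
        mirrored = profiles[reverse_complement(name)][::-1]
        result[name] = [(f + m) / 2 for f, m in zip(forward, mirrored)]
    return result


profiles = {
    "AA": [0.1, 0.2, 0.3, 0.4],
    "TT": [0.4, 0.2, 0.0, 0.5],
}
sym = symmetrize(profiles)
print(sym["AA"])
print(sym["TT"][::-1])  # same values as sym["AA"]
```

After this operation the AA pattern read left to right equals the TT pattern read right to left, and a self-complementary dinucleotide such as AT becomes mirror-symmetric about the dyad.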
Statistically, composite profiles reveal the most prominent features of frequency patterns in nucleosomal DNA sequences. To compute composite dinucleotide profiles of Weak/Weak WW (W = A or T), Strong/Strong SS (S = C or G), Purine/Purine RR (R = A or G) and Pyrimidine/Pyrimidine YY (Y = C or T) from the symmetrized profiles:
- From the Tools panel, in the Dnpatterntools section, select the Composite profiles tool
- In the field input1 select the Symmetrized Dinucleotide frequencies of control mouse data
- Press Execute
- Rename the new dataset to Composite Symmetrized Dinucleotide frequencies of control mouse
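A composite profile can be thought of as the combined frequency of all dinucleotides whose two bases both belong to one class. The sketch below assumes the combination is a position-wise sum (WW = AA + AT + TA + TT, and so on); whether the Composite profiles tool sums or otherwise scales the members is not stated here, and the names and toy values are illustrative.

```python
from itertools import product

# Base classes: Weak, Strong, puRine, pYrimidine.
CLASSES = {"W": "AT", "S": "CG", "R": "AG", "Y": "CT"}


def composite_profile(profiles, symbol):
    """Position-wise sum of the four dinucleotides made of bases of one class."""
    members = ["".join(pair) for pair in product(CLASSES[symbol], repeat=2)]
    length = len(profiles[members[0]])
    return [sum(profiles[m][i] for m in members) for i in range(length)]


# Toy table with all 16 dinucleotides over two positions.
profiles = {a + b: [0.0, 0.0] for a, b in product("ACGT", repeat=2)}
profiles["AA"] = [0.25, 0.125]
profiles["AT"] = [0.0, 0.125]
profiles["TT"] = [0.25, 0.0]
profiles["CG"] = [0.5, 0.25]
profiles["GA"] = [0.0, 0.5]

composites = {s + s: composite_profile(profiles, s) for s in CLASSES}
print(composites["WW"])  # [0.5, 0.25]
print(composites["RR"])  # [0.25, 0.625]
```

Note that the classes overlap (AA counts toward both WW and RR), so the four composite profiles are not expected to add up to 1.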
At this step the Composite Symmetrized Dinucleotide frequencies of control mouse dataset contains noisy patterns of dinucleotide frequency distributions, as shown in Figure 3.
Figure 3. Symmetrized patterns of AA and TT dinucleotides in nucleosomal DNA sequences.