erinijapranckeviciene / galaxy-dnpatterntools

The dnpatterntools galaxy wrappers and docker builder

Galaxy dnpatterntools

This repository contains Galaxy wrappers for dnpatterntools and a Docker builder. A new addition to this repository is the Mapping_CC program. The pat folder contains previously derived nucleosome sequence positioning patterns. The seq folder contains some sequences of nucleosomal DNA.

This Galaxy dnpatterntools instance, packaged as a Docker container, can be pulled from Docker Hub: https://hub.docker.com/r/erinija/dnpatterntools-galaxy

How to use dnpatterntools in Galaxy

Introduction

The dnpatterntools provide utilities to compute and analyze patterns of dinucleotide frequency distributions in a stack of nucleosome-wrapped DNA sequences in fasta format.

Description of workflow

Computing patterns of dinucleotide frequency distributions from nucleosome sequences consists of:

  1. computation of distribution of frequency of dinucleotide occurrences in a stack of aligned sequences;
  2. determination of a statistical dyad position in sequences;
  3. obtaining statistical patterns of dinucleotide frequency of occurrence in the sequence by applying symmetrization and computing the composite dinucleotides WW/SS (W = A or T, S = C or G) and RR/YY (R = A or G, Y = C or T);
  4. normalization and smoothing of the patterns and computing their periodograms.

Data DOI

The sequences of mouse (mm9) nucleosomal DNA used in this demo are available from Zenodo. We will use the sequences in controlm.fa.gz.

Hands on guide

Start your dnpatterntools-galaxy instance locally or in Docker. Log in and start a new history. Name the history Dnpatterntools demo.

Upload fasta file to your history

  • Copy the data link https://zenodo.org/record/3813510/files/controlm.fa.gz
  • Open the Galaxy Upload Manager
  • Choose Paste/Fetch data and paste the copied link into the placeholder
  • Set Type to fasta.gz and Genome to mm9
  • Press Start and Close.

The data file will appear in your history as a compressed file. To extract the fasta file:

  • Go to Edit attributes and select the Convert tab
  • Select Convert compressed file to uncompressed and press the Convert datatype button.

The new uncompressed fasta file appears in your history. To rename the data:

  • Go to Edit attributes and select the Attributes tab
  • Change Name to controlm.fa and press Save. The renamed data file will reappear in your history.

Compute dinucleotide frequencies of occurrence in a fasta file

In a stack of nucleosomal DNA sequences aligned by an experimental end, the frequency of occurrence of each dinucleotide is computed at each position. Given a binary matrix in which an occurrence of the selected dinucleotide in a sequence is coded as 1 and anything else as 0, the frequency profile is simply the sum of occurrences of that dinucleotide at every position along the sequence, normalized by the number of sequences. To compute dinucleotide frequencies:

  • From the Tools panel, in the Dnpatterntools section, select the Dinucleotide frequencies tool
  • In the field From Fasta select the controlm.fa data
  • In the field Dinucleotides input all 16 dinucleotides separated by spaces (this is the default option)
  • Press Execute
  • Inspect the data
  • Rename the dataset to Dinucleotide frequencies of control mouse
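
The counting described above can be illustrated with a minimal Python sketch. This is not the code of the Dinucleotide frequencies tool; the function name `dinucleotide_frequencies` and the toy sequences are assumptions made for illustration, and positions are 0-based here.

```python
def dinucleotide_frequencies(sequences, dinucleotides=("AA", "TT", "CC", "GG")):
    """Return {dinucleotide: [frequency at each position]} for aligned sequences.

    Hypothetical helper: at every position, count the sequences that carry the
    dinucleotide there (the column sum of the 0/1 occurrence matrix) and
    divide by the number of sequences.
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = min(len(s) for s in sequences)  # use the common aligned length
    n = len(sequences)
    profiles = {}
    for dn in dinucleotides:
        counts = [0] * (length - 1)  # a dinucleotide spans positions i and i + 1
        for seq in sequences:
            seq = seq.upper()
            for i in range(length - 1):
                if seq[i:i + 2] == dn:
                    counts[i] += 1
        profiles[dn] = [c / n for c in counts]
    return profiles


# Toy stack of four aligned sequences (made-up data, not from controlm.fa).
stack = ["AATTC", "AAGGC", "CATTC", "AATTG"]
profiles = dinucleotide_frequencies(stack, ("AA", "TT"))
print(profiles["AA"])  # [0.75, 0.0, 0.0, 0.0]
print(profiles["TT"])  # [0.0, 0.0, 0.75, 0.0]
```

In the toy stack, three of the four sequences start with AA, so the AA profile is 0.75 at the first position and 0 elsewhere.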

If there is no Visualize this data (Histogram) icon in your dataset, you need to install the visualization tools yourself. Log in as administrator and install the charts tool owned by iuc.

All data generated by dnpatterntools are best visualized with the Line with focus (NVD3) visualization tool. Use it to visualize the computed frequency profiles of the AA, TT, CC and GG dinucleotides. The expected graph is shown in Figure 1. The figure shows a sharp, narrow peak on the left of the graph. This peak indicates a micrococcal nuclease cleavage site. It serves as evidence that your fasta file indeed contains nucleosomal DNA sequences aligned by an experimental end. Don't forget to save your visualization to the user's Saved visualizations by pressing the Save button.

Figure 1. Frequency profiles of the AA, TT, CC, GG dinucleotides.

Determine the dyad and the nucleosome's position in a stack of nucleosomal DNA sequences

Symmetry is a hallmark of nucleosomal DNA (Luger et al., 1997), and statistically the distribution of peaks in the AA/TT, AT, GC, CC/GG and RR/YY dinucleotide frequency profiles along a nucleosomal sequence is expected to have a recognizable dyad symmetry. This dyad-symmetry property helps to determine the nucleosome's position in a stack of nucleosomal DNA sequences aligned by an experimental end.

At the nucleosome position centered on the dyad, the frequency profiles of dinucleotides on the forward and reverse complementary strands will have a maximum positive correlation. Therefore, a Pearson correlation coefficient between the frequency profiles on the forward and reverse complementary strands of each dinucleotide is computed at each position along the nucleosome sequence within a sliding window 146 bp long. To determine the position of the nucleosome in the mouse fasta sequences:

  • From the Tools panel, in the Dnpatterntools section, select the Correlations tool
  • In the field Dinucleotide frequency profiles select the Dinucleotide frequencies of control mouse data
  • In the field Sliding window size input 146 (this is the default option, the nucleosome length of 146 base pairs)
  • In the field Dinucleotides input all 16 dinucleotides separated by spaces (this is the default option)
  • Press Execute
  • Inspect the data
  • Rename the new dataset to Correlations of Dinucleotide frequencies of control mouse

The Correlations tool outputs Pearson correlation coefficients for each dinucleotide at each position along the sequence, up to its length minus 146 (the window size). The info field contains the position and value of the maximum correlation for each dinucleotide. The maximum correlation among all dinucleotides is 0.669 at position 25 for AA, which strongly supports a nucleosome starting at position 25 from the start of the fasta sequence. Figure 2 shows the Line with focus (NVD3) visualization of the correlations of the AA, CC and AC dinucleotides.
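
The sliding-window correlation can be sketched in Python as follows. This is one possible reading of the description above, not the Correlations tool itself: it assumes that the reverse-strand profile of a dinucleotide inside a window is the profile of its reverse complement read backwards (for example TT for AA). The function names and the toy profiles are hypothetical, and positions are 0-based.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dn):
    """Reverse complement of a dinucleotide, e.g. AA -> TT, AC -> GT."""
    return "".join(COMPLEMENT[b] for b in reversed(dn))


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def symmetry_correlations(profiles, dn, window):
    """Correlate the forward profile with the mirrored partner profile per window."""
    forward = profiles[dn]
    partner = profiles[reverse_complement(dn)]
    scores = []
    for start in range(len(forward) - window + 1):
        f = forward[start:start + window]
        r = partner[start:start + window][::-1]  # mirror around the window centre
        scores.append(pearson(f, r))
    return scores


# Toy profiles (made-up numbers): AA mirrors TT exactly in the window starting at 2.
toy = {
    "AA": [0.5, 0.1, 0.2, 0.6, 0.3, 0.1, 0.4],
    "TT": [0.1, 0.4, 0.1, 0.3, 0.6, 0.2, 0.3],
}
scores = symmetry_correlations(toy, "AA", window=4)
best_start = max(range(len(scores)), key=scores.__getitem__)
print(best_start)  # 2
```

The start position with the highest score plays the role of position 25 in the demo data. The real tool uses a window of 146 and may index positions differently.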

Figure 2. Correlations across a nucleosome sequence for frequency profiles of the CC, AA, AC dinucleotides.

Select pattern interval

In the previous step we chose 25 as the start position of the nucleosome. To further analyze patterns of dinucleotide occurrences in nucleosomal DNA, select only the interval of the nucleosome in the dinucleotide frequency profile data:

  • From the Tools panel, in the Dnpatterntools section, select the Select interval tool
  • In the field Table of profiles select the Dinucleotide frequencies of control mouse data
  • In the field Start position input 25
  • In the field for the interval length input 146 (this is the default option, the nucleosome length of 146 base pairs)
  • In the field Dinucleotides input all 16 dinucleotides separated by spaces (this is the default option)
  • Press Execute
  • Inspect the data
  • Rename the new dataset to Selected Dinucleotide frequencies of control mouse
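
Conceptually, this step cuts the same window out of every profile. A minimal Python sketch, assuming 0-based positions and a hypothetical `select_interval` helper (the indexing convention of the Select interval tool may differ):

```python
def select_interval(profiles, start, length):
    """Keep only positions start .. start + length - 1 of every profile.

    Hypothetical helper; `profiles` maps a dinucleotide to its list of
    per-position frequencies.
    """
    selected = {}
    for dn, values in profiles.items():
        if start < 0 or start + length > len(values):
            raise ValueError(f"interval does not fit the {dn} profile")
        selected[dn] = values[start:start + length]
    return selected


# Toy profiles of length 6 (made-up numbers); keep 3 positions starting at 2.
toy = {"AA": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6], "TT": [0.6, 0.5, 0.4, 0.3, 0.2, 0.1]}
window = select_interval(toy, start=2, length=3)
print(window["AA"])  # [0.3, 0.4, 0.5]
print(window["TT"])  # [0.4, 0.3, 0.2]
```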

Symmetrize

Symmetrizing the selected frequency profiles means superimposing the dinucleotide frequency profiles from the forward and reverse complementary sequences with respect to the central dyad position of the frequency profile. To symmetrize:

  • From the Tools panel, in the Dnpatterntools section, select the Symmetrize tool
  • In the field input1 select the Selected Dinucleotide frequencies of control mouse data
  • Press Execute
  • Inspect the data
  • Rename the new dataset to Symmetrized Dinucleotide frequencies of control mouse
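
One way to picture symmetrization is the Python sketch below. It assumes that "superimposition" means averaging each profile with the mirrored profile of its reverse-complement dinucleotide; the Symmetrize tool may combine the two strands differently. The helper names and toy values are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dn):
    """Reverse complement of a dinucleotide, e.g. AA -> TT, AT -> AT."""
    return "".join(COMPLEMENT[b] for b in reversed(dn))


def symmetrize(profiles):
    """Average each profile with the mirrored profile of its reverse complement.

    Assumption: the dyad sits at the centre of the selected interval, so
    position i on one strand faces position n - 1 - i on the other.
    """
    result = {}
    for dn, forward in profiles.items():
        partner = profiles[reverse_complement(dn)]
        n = len(forward)
        result[dn] = [(forward[i] + partner[n - 1 - i]) / 2 for i in range(n)]
    return result


# Toy selected profiles (made-up numbers).
toy = {"AA": [0.2, 0.4, 0.9], "TT": [0.1, 0.3, 0.5]}
sym = symmetrize(toy)
# After symmetrization the AA profile is the mirror image of the TT profile.
print(sym["AA"] == sym["TT"][::-1])  # True
```

Under this assumption, the profile of a self-complementary dinucleotide such as AT becomes a palindrome around the dyad.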

Compute composite dinucleotide profiles

Statistically, composite profiles reveal the most prominent features of the frequency patterns in nucleosomal DNA sequences. To compute the composite dinucleotide profiles Weak/Weak WW (W = A or T), Strong/Strong SS (S = C or G), Purine/Purine RR (R = A or G) and Pyrimidine/Pyrimidine YY (Y = C or T) from the symmetrized profiles:

  • From the Tools panel, in the Dnpatterntools section, select the Composite profiles tool
  • In the field input1 select the Symmetrized Dinucleotide frequencies of control mouse data
  • Press Execute
  • Rename the new dataset to Composite Symmetrized Dinucleotide frequencies of control mouse
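
The composite classes can be sketched in Python by combining the member profiles of each class, for example WW = AA + AT + TA + TT. Summing the member profiles is an assumption made here; the Composite profiles tool may scale the result differently. The helper name and the toy data are hypothetical.

```python
from itertools import product

# Base classes as defined above: W = A or T, S = C or G, R = A or G, Y = C or T.
CLASSES = {"W": "AT", "S": "CG", "R": "AG", "Y": "CT"}


def composite_profiles(profiles):
    """Sum the four member dinucleotide profiles of WW, SS, RR and YY."""
    composites = {}
    for symbol, bases in CLASSES.items():
        members = ["".join(pair) for pair in product(bases, repeat=2)]
        length = len(profiles[members[0]])
        composites[symbol * 2] = [
            sum(profiles[m][i] for m in members) for i in range(length)
        ]
    return composites


# Toy input: all 16 dinucleotides equally frequent at each of 3 positions.
toy = {a + b: [0.0625] * 3 for a in "ACGT" for b in "ACGT"}
comp = composite_profiles(toy)
print(comp["WW"])  # [0.25, 0.25, 0.25]
print(sorted(comp))  # ['RR', 'SS', 'WW', 'YY']
```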

At this step the Composite Symmetrized Dinucleotide frequencies of control mouse dataset contains noisy patterns of dinucleotide frequency distributions, as shown in Figure 3.

Figure 3. Symmetrized patterns of AA and TT dinucleotides in nucleosomal DNA sequences.

Smooth and compute periodograms of the patterns

in progress, to be continued

License: MIT License