celphin/RepeatOBserverV1

RepeatOBserver

An R package to visualize chromosome scale repeat patterns and predict centromere locations.

Table of Contents

Getting Started
Citation
Contact and Questions
Usage Examples
Troubleshooting

Getting Started

RepeatOBserver is an R package that can be run on any chromosome-scale reference genome assembly (e.g. a fasta file). It returns many plots describing the tandem repeats and clusters of transposons found across each chromosome. It also returns a predicted centromere location for each chromosome, based on the repeat diversity across that chromosome.

You can learn more about the interpretations of the plots in our manuscript here: https://www.biorxiv.org/content/10.1101/2023.12.30.573697v1

Software needed

The following software is needed to run the automatic RepeatOBserver script:

seqkit/2.3.1 : https://bioinf.shenwei.me/seqkit/
r/4.1.2 : https://cran.r-project.org/bin/windows/base/old/
(optional to see isochores) emboss/6.6.0 : https://emboss.sourceforge.net/download/

Newer versions of this software may work, but the program has not yet been tested thoroughly with them. If you are unable to install any of the programs above, you can still run the RepeatOBserver code in R, but the automated bash script will not work for you (see Troubleshooting at the end of this page for details on how to run the code without this script).

Example software installation (using Compute Canada modules):

module load seqkit/2.3.1
module load StdEnv/2020 
module load emboss/6.6.0
module load r/4.1.2
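
Before starting a run, it can help to confirm that the required programs are actually visible in the current shell. The following is a minimal sketch, not part of RepeatOBserver itself: `missing_tools` is a hypothetical helper, and the executable names `seqkit`, `Rscript`, and `isochore` (the optional EMBOSS program) are assumptions about how the software above is exposed on your system.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the name of every listed executable
# that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed executable names: seqkit (SeqKit), Rscript (R),
# isochore (EMBOSS, only needed for the optional isochore plots).
missing=$(missing_tools seqkit Rscript isochore)
if [ -n "$missing" ]; then
  printf 'Not found on PATH:\n%s\n' "$missing"
fi
```

If nothing is printed, all three executables were found. The check says nothing about versions, so the tested versions listed above still apply.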

R Package Installation

To install the R package "RepeatOBserverV1", you will first need to install the package devtools in your version of R.

 install.packages("devtools")

 library(devtools)

 install_github("celphin/RepeatOBserverV1") #to install the package
  # Select 1:All to install all the required packages

 library(RepeatOBserverV1) # to load the package

Parameters for the automated bash script:

Parameter	Usage	Example Input
-i	Species Name	Fagopyrum (cannot contain an _ or space)
-f	Reference genome fasta file	Fagopyrum_Main.fasta
-h	Haplotype (string)	H0 (cannot contain an _ or space)
-c	cpus available (any integer value)	20
-m	memory available (MB)	128000
-g	FALSE to run for AT DNAwalk or TRUE to run for CG DNAwalk	FALSE
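
The constraints in the table above are easy to check before a long run is submitted. The sketch below is an illustration under stated assumptions: `valid_label` and `is_integer` are hypothetical helpers, not functions shipped with RepeatOBserver, and rejecting an empty name is an extra assumption beyond what the table says.

```bash
#!/usr/bin/env bash
# Hypothetical check for -i and -h: the value must not contain "_" or a space.
# (Rejecting the empty string is an added assumption.)
valid_label() {
  case "$1" in
    ""|*_*|*" "*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical check for -c and -m: digits only.
is_integer() {
  case "$1" in
    ""|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example inputs taken from the table above.
species="Fagopyrum"; haplotype="H0"; cpus=20; memory=128000; walk="FALSE"

valid_label "$species"   || echo "-i must not contain _ or a space"
valid_label "$haplotype" || echo "-h must not contain _ or a space"
is_integer "$cpus"       || echo "-c must be an integer"
is_integer "$memory"     || echo "-m must be an integer (MB)"
case "$walk" in
  TRUE|FALSE) ;;
  *) echo "-g must be TRUE or FALSE" ;;
esac
```

With the example inputs nothing is printed; a name such as `Fagopyrum_Main` would be flagged because of the underscore.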

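Output folders and summary files that you should find in the output directory if the whole program worked:

Main folders	Description (more details below)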
DNAwalks	1D and 2D DNAwalks
histograms	histogram centromere predictions and plots
output_data	Raw data files including Shannon diversity, DNAwalks, Fourier transforms
Shannon_div	Shannon diversity plots for each chromosome
spectra	Heat maps of the Fourier transform output
isochores	CG isochores plot made with the EMBOSS program, useful to see if centromere positions are associated with isochores

Summary files	Description
chromosome_renaming.txt	New chromosome names assigned to each chromosome in the program
Species_Haplotype_Histograms.png	All chromosomes histograms plotted in one figure
Species_Haplotype_Shannon_div.png	All chromosomes Shannon_div plotted in one figure
Species_Haplotype_rolling_mean_500Kbp_Shannon_div.png	All chromosomes Shannon_div in 500kbp rolling windows plotted in one figure
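
A quick way to see whether a run finished is to look for the summary files listed above. This sketch assumes that `Species` and `Haplotype` in the file names are replaced by the values given to `-i` and `-h`, and that the check is run from the directory holding the output; `expected_summary_files` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the summary file names expected for a run,
# assuming "Species" and "Haplotype" stand for the -i and -h values.
expected_summary_files() {
  prefix="${1}_${2}"
  printf '%s\n' \
    "chromosome_renaming.txt" \
    "${prefix}_Histograms.png" \
    "${prefix}_Shannon_div.png" \
    "${prefix}_rolling_mean_500Kbp_Shannon_div.png"
}

# Report any expected summary file that is absent from the current directory.
expected_summary_files Fagopyrum H0 | while IFS= read -r f; do
  [ -e "$f" ] || echo "missing: $f"
done
```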

DNAwalks contains:

Folder/file name	Description	Example file
1D	1D CG and AT DNAwalks, rainbow colours change every 10Kbp	Species_Haplotype_Chr1_DNAwalk1D_AT_total.png
2D	2D DNAwalks, the 1D walks plotted against each other	Species_Haplotype_Chr1_DNAwalk2D_total.png
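
To give a feel for what a 1D DNA walk is, here is a toy sketch. It assumes a simple convention in which one base of a pair moves the walk up by one, the other moves it down by one, and every other base leaves it unchanged. That convention is an assumption made for illustration; the manuscript linked above describes the walk that the package actually computes. `dna_walk` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical toy walk: $1 = base that steps up, $2 = base that steps down.
# Reads a sequence (FASTA header lines are skipped) from stdin and prints
# the cumulative position after every base.
dna_walk() {
  awk -v up="$1" -v down="$2" '
    /^>/ { next }                       # skip FASTA headers
    {
      bases = toupper($0)
      for (i = 1; i <= length(bases); i++) {
        b = substr(bases, i, 1)
        if (b == up) pos++
        else if (b == down) pos--
        printf "%s%d", (n++ ? " " : ""), pos
      }
    }
    END { print "" }
  '
}

printf '>example\nAATCGGTA\n' | dna_walk A T   # toy AT walk
printf '>example\nAATCGGTA\n' | dna_walk C G   # toy CG walk
```

Plotting the two resulting series against each other gives the idea behind the 2D walk described in the table.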

histograms contains:

Folder/file name	Description
Centromere_histograms_summary.txt	The predicted centromere positions for every chromosome based on the histogram output
Species_Haplotype_Chr1_histogram_....png	Histogram plots showing counts of where in the genome each repeat length is minimized

output_data contains:

Folder/file name	Description
Species_Haplotype_Chr1_Histogram_input_....txt	The Fourier window in which each repeat length is minimized; this can be used to build the histograms
Species_Haplotype_Chr1_Shannon_div.txt	The raw Shannon diversity data for each 5kbp window in the genome
Total_dnawalk_every50_Species_Haplotype_Chr1.txt	The raw DNAwalk data sampled every 50bp (see chromosome folders for the complete walk)
Total_Species_Haplotype_Chr1_All_spec_merged.txt	The Fourier transform output merged for up to 400Mbp (chromosomes >400Mbp will be in parts)
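
For readers unfamiliar with the Shannon diversity index, the sketch below computes the standard form, H = -sum(p * ln p), for a column of non-negative values. The input numbers are made up. Which spectrum values the package feeds into the index, and which logarithm base it uses, are not stated in this section, so treat this only as an illustration of the general formula. `shannon_diversity` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: Shannon diversity (natural log) of the positive
# values read from stdin, one value per line.
shannon_diversity() {
  awk '
    $1 > 0 { v[++n] = $1; total += $1 }
    END {
      h = 0
      for (i = 1; i <= n; i++) {
        p = v[i] / total      # proportion contributed by value i
        h -= p * log(p)
      }
      printf "%.4f\n", h
    }
  '
}

printf '5\n5\n5\n5\n'  | shannon_diversity   # four equal values: ln(4)
printf '97\n1\n1\n1\n' | shannon_diversity   # one dominant value: lower diversity
```

A window in which one value dominates gives a lower index than a window in which the values are evenly spread.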

Shannon_div contains:

Folder/file name	Description	Example file
Main folder	Centromere predictions from Shannon diversity with varying rolling mean window sizes; columns are: # of windows averaged, Centromere_prediction, Total_chr_length, Spp, Chr	Centromere_summary_Shannon_1000.txt
Shannon_div_5kbp	Raw Shannon diversity values (no averaging) for each 5kbp Fourier window	Species_Haplotype_Chr1_Shannon_plot_norm.png
Shannon_div_500kbp	Shannon diversity values averaged with rolling window across 100 windows (500kbp region)	Species_Haplotype_Chr1_roll_mean_Shannon_100.png
Shannon_div_1.25Mbp	Shannon diversity values averaged with rolling window across 250 windows (1.25Mbp region)	Species_Haplotype_Chr1_roll_mean_Shannon_250.png
Shannon_div_2.5Mbp	Shannon diversity values averaged with rolling window across 500 windows (2.5Mbp region)	Species_Haplotype_Chr1_roll_mean_Shannon_500.png
Shannon_div_5Mbp	Shannon diversity values averaged with rolling window across 1000 windows (5Mbp region)	Species_Haplotype_Chr1_roll_mean_Shannon_1000.png
Shannon_div_window	Rolling mean of Shannon diversity values with window size determined by genome size	Species_Haplotype_Chr1_Shannon_div_window210.png
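
The folder names above follow from simple arithmetic: averaging N consecutive 5 kbp windows covers N x 5 kbp, so 100 windows span 500 kbp and 1000 windows span 5 Mbp. The sketch below shows a rolling mean over a column of values. It assumes a trailing window (each mean covers the current value and the ones before it); the alignment the package uses is not stated here. `rolling_mean` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rolling mean of the values on stdin (one per line).
# $1 = number of consecutive values averaged. A mean is printed once the
# first full window is available (trailing window, an assumption).
rolling_mean() {
  awk -v k="$1" '
    {
      i = NR % k
      if (NR > k) sum -= buf[i]   # drop the value leaving the window
      buf[i] = $1
      sum += $1
      if (NR >= k) printf "%.4f\n", sum / k
    }
  '
}

# Region covered by 100 windows of 5 kbp each.
echo "$((100 * 5)) kbp"

# Toy data: five values averaged three at a time.
printf '1\n2\n3\n4\n5\n' | rolling_mean 3
```

Applied to a single column of per-window Shannon diversity values with a window of 100, this would give the kind of 500 kbp smoothing described for Shannon_div_500kbp.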

spectra contains:

Folder/file name	Description	Example file
spectra_total_merged	Heat maps of the Fourier transforms of the whole chromosomes (up to 400 Mbp) for long repeat lengths 35-2000 bp	Species_Haplotype_Chr1_All_spec_bp35_2000seq1_6510TRUE.png
spectra_parts_2-8	Heat maps of the Fourier transforms of 100 Mbp chromosome parts for short repeat lengths 2-8 bp	All_spec1_Species_Haplotype_Chr1part01_bp2_8seq2501_32542501TRUE.png
spectra_parts_15-35	Heat maps of the Fourier transforms of 100 Mbp chromosome parts for mid repeat lengths 15-35 bp	All_spec1_Species_Haplotype_Chr1part01_bp15_35seq2501_32542501TRUE.png
spectra_parts_35-2000	Heat maps of the Fourier transforms of 100 Mbp chromosome parts for long repeat lengths 35-2000 bp	All_spec1_Species_Haplotype_Chr1part01_bp35_2000seq2501_32542501TRUE.png
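
The repeat length range of a heat map is encoded in its file name as `bp<min>_<max>`, which makes it possible to sort or filter the images from the shell. The sketch below relies only on the naming pattern visible in the example files above; `repeat_range` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract "<min> <max>" repeat lengths (in bp) from a
# spectra file name that contains "_bp<min>_<max>seq".
repeat_range() {
  printf '%s\n' "$1" | sed -n 's/.*_bp\([0-9]*\)_\([0-9]*\)seq.*/\1 \2/p'
}

repeat_range "All_spec1_Species_Haplotype_Chr1part01_bp35_2000seq2501_32542501TRUE.png"
# prints "35 2000"
```

Names that do not match the pattern produce no output, so the helper can also be used to skip unrelated files.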
