## Install Gblocks


tar zxvf Gblocks_Linux_0.91b.tar.Z


Step 1: Download latest conda installer

$ wget

Step 2: Run the installer

$ bash

Step 4: Installing conda channels to make tools available

Different tools are packaged in what conda calls channels. We need to add some channels to make the bioinformatics and genomics tools available for installation:

# Install some conda channels
# A channel is where conda looks for packages
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge


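The order in which channels are added matters: each --add places the channel at the top of the list, so the channel added last (conda-forge) gets the highest priority. Run the commands in the order shown.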
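To confirm the resulting priority, the channel list can be read back from ~/.condarc. The following is a minimal sketch, assuming conda wrote the usual block-style YAML list; list_channels is a hypothetical helper, not a conda command:

```shell
# Print the channels from a .condarc file, highest priority first.
# Simple line-based parse (an assumption: block-style list, one "- name" per line).
list_channels() {
    awk '
        /^channels:/         { in_list = 1; next }   # start of the channels list
        in_list && /^[^ -]/  { in_list = 0 }         # next top-level key ends it
        in_list && /^ *- /   { sub(/^ *- */, ""); print }
    ' "${1:-$HOME/.condarc}"
}
```

Called as `list_channels` after the three commands above, starting from a configuration with no channels entry, it should print conda-forge, bioconda, and defaults, in that order. When conda is on the PATH, `conda config --show channels` reports the same list.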

PacBio HiFi assemblers


Update System

First, update the system with yum update:

$ sudo yum update

Install Perl

Once the system is fully updated, install Perl with yum install perl:

$ sudo yum install perl


$ perl -v

You should see output like this:

This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi
(with 44 registered patches, see perl -V for more detail)

Copyright 1987-2012, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at, the Perl Home Page.

Install Java

$ sudo yum install java-1.8.0-openjdk


$ java -version

You should see output like this:

openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
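The individual version checks above can be folded into one pass that reports which prerequisites are on the PATH. This is a minimal sketch; check_tools is a hypothetical helper and the tool list is an assumption drawn from the packages installed in these notes:

```shell
# Report which prerequisite commands are on PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "missing  $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools perl java gcc python3 || echo "install the missing tools before continuing"
```

It only checks that each command exists; it does not compare versions, so perl -v and java -version are still worth running once.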

The easiest way to install Canu is to download the binary release, which includes HiCanu (Canu's HiFi mode):

$ wget

Install from binary distribution

tar -xJf canu-2.1.1.Linux-amd64.tar.xz

Done! Canu is installed; the executable is in:

cd canu-2.1.1/bin



Running canu without arguments should print the usage message:

usage:   canu [-version] [-citation] \
              [-haplotype | -correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
              [other-options] \
              [-haplotype{NAME} illumina.fastq.gz] \
              [-corrected] \
              [-trimmed] \
              [-pacbio |
               -nanopore |
               -pacbio-hifi] file1 file2 ...

example: canu -d run1 -p godzilla genomeSize=1g -nanopore-raw reads/*.fasta.gz

  To restrict canu to only a specific stage, use:
    -haplotype     - generate haplotype-specific reads
    -correct       - generate corrected reads
    -trim          - generate trimmed reads
    -assemble      - generate an assembly
    -trim-assemble - generate trimmed reads and then assemble them

  The assembly is computed in the -d <assembly-directory>, with output files named
  using the -p <assembly-prefix>.  This directory is created if needed.  It is not
  possible to run multiple assemblies in the same directory.

  The genome size should be your best guess of the haploid genome size of what is being
  assembled.  It is used primarily to estimate coverage in reads, NOT as the desired
  assembly size.  Fractional values are allowed: '4.7m' equals '4700k' equals '4700000'

  Some common options:
      - Run under grid control (true), locally (false), or set up for grid control
        but don't submit any jobs (remote)
      - The allowed difference in an overlap between two raw uncorrected reads.  For lower
        quality reads, use a higher number.  The defaults are 0.300 for PacBio reads and
        0.500 for Nanopore reads.
      - The allowed difference in an overlap between two corrected reads.  Assemblies of
        low coverage or data with biological differences will benefit from a slight increase
        in this.  Defaults are 0.045 for PacBio reads and 0.144 for Nanopore reads.
      - Pass string to the command used to submit jobs to the grid.  Can be used to set
        maximum run time limits.  Should NOT be used to set memory limits; Canu will do
        that for you.
      - Ignore reads shorter than 'number' bases long.  Default: 1000.
      - Ignore read-to-read overlaps shorter than 'number' bases long.  Default: 500.
  A full list of options can be printed with '-options'.  All options can be supplied in
  an optional sepc file with the -s option.

  For TrioCanu, haplotypes are specified with the -haplotype{NAME} option, with any
  number of haplotype-specific Illumina read files after.  The {NAME} of each haplotype
  is free text (but only letters and numbers, please).  For example:
    -haplotypeNANNY nanny/*gz
    -haplotypeBILLY billy1.fasta.gz billy2.fasta.gz

  Reads can be either FASTA or FASTQ format, uncompressed, or compressed with gz, bz2 or xz.

  Reads are specified by the technology they were generated with, and any processing performed.


    -pacbio      <files>
    -nanopore    <files>
    -pacbio-hifi <files>

Complete documentation at
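Based on the usage text above, a HiFi assembly would be started roughly as follows. This is a sketch, not a command taken from these notes: run_canu_hifi is a hypothetical wrapper, and it assumes canu is on the PATH.

```shell
# Run canu on HiFi reads.
#   $1 output prefix, $2 assembly directory, $3 haploid genome size (e.g. 4.7m),
#   $4... read files (FASTA/FASTQ, optionally gz/bz2/xz compressed)
run_canu_hifi() {
    local prefix=$1 outdir=$2 genome_size=$3
    shift 3
    canu -p "$prefix" -d "$outdir" "genomeSize=$genome_size" -pacbio-hifi "$@"
}
```

For example, `run_canu_hifi asm canu_out 500m reads.hifi.fasta.gz`, where every value is a placeholder to be replaced with the real prefix, directory, genome size estimate, and read file.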


Install Flye

First, install Python 3:

sudo yum install python3-devel

Clone the repository and compile:

sudo git clone

cd Flye

sudo make


./flye -h

Should see:

usage: flye (--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw |
             --nano-corr | --subassemblies) file1 [file_2 ...]
             --out-dir PATH

             [--genome-size SIZE] [--threads int] [--iterations int]
             [--meta] [--plasmids] [--trestle] [--polish-target]
             [--keep-haplotypes] [--debug] [--version] [--help]
             [--resume] [--resume-from] [--stop-after]
             [--hifi-error] [--min-overlap SIZE]

Assembly of long reads with repeat graphs

optional arguments:
  -h, --help            show this help message and exit
  --pacbio-raw path [path ...]
                        PacBio raw reads
  --pacbio-corr path [path ...]
                        PacBio corrected reads
  --pacbio-hifi path [path ...]
                        PacBio HiFi reads
  --nano-raw path [path ...]
                        ONT raw reads
  --nano-corr path [path ...]
                        ONT corrected reads
  --subassemblies path [path ...]
                        high-quality contigs input
  -g size, --genome-size size
                        estimated genome size (for example, 5m or 2.6g)
  -o path, --out-dir path
                        Output directory
  -t int, --threads int
                        number of parallel threads [1]
  -i int, --iterations int
                        number of polishing iterations [1]
  -m int, --min-overlap int
                        minimum overlap between reads [auto]
  --asm-coverage int    reduced coverage for initial disjointig assembly [not
  --hifi-error float    expected HiFi reads error rate (e.g. 0.01 or 0.001)
  --plasmids            rescue short unassembled plasmids
  --meta                metagenome / uneven coverage mode
  --keep-haplotypes     do not collapse alternative haplotypes
  --trestle             enable Trestle [disabled]
  --polish-target path  run polisher on the target sequence
  --resume              resume from the last completed stage
  --resume-from stage_name
                        resume from a custom stage
  --stop-after stage_name
                        stop after the specified stage completed
  --debug               enable debug output
  -v, --version         show program's version number and exit

Input reads can be in FASTA or FASTQ format, uncompressed
or compressed with gz. Currently, PacBio (raw, corrected, HiFi)
and ONT reads (raw, corrected) are supported. Expected error rates are
<30% for raw, <3% for corrected, and <1% for HiFi. Note that Flye
was primarily developed to run on raw reads. Additionally, the
--subassemblies option performs a consensus assembly of multiple
sets of high-quality contigs. You may specify multiple
files with reads (separated by spaces). Mixing different read
types is not yet supported. The --meta option enables the mode
for metagenome/uneven coverage assembly.

Genome size estimate is no longer a required option. You
need to provide an estimate if using --asm-coverage option.

To reduce memory consumption for large genome assemblies,
you can use a subset of the longest reads for initial disjointig
assembly by specifying --asm-coverage and --genome-size options. Typically,
40x coverage is enough to produce good disjointigs.

You can run Flye polisher as a standalone tool using
--polish-target option.

Fine! Flye is installed.
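Going by the help text above, a HiFi run needs only the read files and an output directory. A minimal sketch, assuming flye is on the PATH; run_flye_hifi is a hypothetical wrapper and the thread count is up to the machine:

```shell
# Run Flye on HiFi reads.
#   $1 output directory, $2 number of threads, $3... read files
run_flye_hifi() {
    local outdir=$1 threads=$2
    shift 2
    flye --pacbio-hifi "$@" --out-dir "$outdir" --threads "$threads"
}
```

For example, `run_flye_hifi flye_out 16 reads.hifi.fasta.gz` (placeholder names). Per the help text, a genome size estimate is optional unless --asm-coverage is used.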


Install hifiasm

Clone the repository and build:

$ sudo git clone

$ cd hifiasm && make

Install the conda package manager

# download latest conda installer
$ wget

Now let's install conda:

# run the installer
$ bash

Accept the license

Do you accept the license terms? [yes|no]

Type yes and press Enter.

After you accept the license agreement, conda will be installed. At the end of the installation you will see the following:

Preparing transaction: done
Executing transaction: done
installation finished.
Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]

Close and re-open the terminal. The conda command should then be available:

$ conda update --yes conda

Installing conda channels to make tools available

# Install some conda channels
# A channel is where conda looks for packages
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge


Upgrade gcc

$ sudo yum -y install centos-release-scl
$ sudo yum -y install devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-binutils
$ scl enable devtoolset-7 bash


gcc -v

You should see:

Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-7/root/usr --mandir=/opt/rh/devtoolset-7/root/usr/share/man --infodir=/opt/rh/devtoolset-7/root/usr/share/info --with-bugurl= --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --enable-plugin --with-linker-hash-style=gnu --enable-initfini-array --with-default-libstdcxx-abi=gcc4-compatible --with-isl=/builddir/build/BUILD/gcc-7.3.1-20180303/obj-x86_64-redhat-linux/isl-install --enable-libmpx --enable-gnu-indirect-function --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 7.3.1 20180303 (Red Hat 7.3.1-5) (GCC)

Install Peregrine via conda

Clone the repository:

git clone

cd Peregrine


Note: I edited the install script, replacing ". ~/miniconda3/bin/activate" with ". ~/anaconda3/bin/activate", because conda is installed under ~/anaconda3 on this machine.

Improved Phased Assembler (IPA) HiFi Genome Assembler

conda create -n ipa
conda activate ipa
conda install pbipa

Run IPA on the data:

/usr/bin/time -o out.ram.txt -v ipa local --nthreads 16 --njobs 2 -i ~/datafile/001.i.bat.pacbio.hifi.reads/S_unknown.ccs.merged.fasta &> log &
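Because the run is wrapped in /usr/bin/time -v, out.ram.txt will contain a "Maximum resident set size (kbytes)" line once the job finishes. The sketch below pulls that value out and converts it to GiB; peak_rss_gib is a hypothetical helper name, and the sample report is made up for illustration rather than measured:

```shell
# Print the peak resident memory recorded by `/usr/bin/time -v`, in GiB.
peak_rss_gib() {
    awk -F': ' '/Maximum resident set size \(kbytes\)/ {
        printf "%.2f\n", $2 / (1024 * 1024)
    }' "$1"
}

# Illustrative input only: a one-line report with a made-up value.
report=$(mktemp)
printf '\tMaximum resident set size (kbytes): 2097152\n' > "$report"
peak_rss_gib "$report"
rm -f "$report"
```

On the real run this would be `peak_rss_gib out.ram.txt`, which makes it easy to compare memory use across the assemblers installed above.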

Docker installation

Step 1: Uninstall old versions

Older versions of Docker were called docker or docker-engine. If these are installed, uninstall them, along with associated dependencies.

$ sudo yum remove docker \
                  docker-client \
                  docker-client-latest \
                  docker-common \
                  docker-latest \
                  docker-latest-logrotate \
                  docker-logrotate \
                  docker-engine

Step 2: Install using the repository

$ sudo yum install -y yum-utils device-mapper-persistent-data lvm2

The -y switch tells yum to answer "yes" to any prompts that come up. The yum-utils package provides yum-config-manager. Docker uses a device mapper storage driver, and the device-mapper-persistent-data and lvm2 packages are required for it to run correctly.

Step 2.2: Add the Docker Repository to CentOS

$ sudo yum-config-manager \
    --add-repo \

Step 3: Install the latest version of Docker Engine and containerd

$ sudo yum install docker-ce docker-ce-cli containerd.io

Step 4: Start Docker

Although you have installed Docker on CentOS, the service is still not running.

Start the service, then enable it to run at startup. Run the following commands in the order listed.

Start Docker:

$ sudo systemctl start docker

Enable Docker:

$ sudo systemctl enable docker

Log info

Created symlink from /etc/systemd/system/ to /usr/lib/systemd/system/docker.service.

Verify that Docker Engine is installed correctly by running the hello-world image.

$ sudo docker run hello-world

Log info

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
0e03bdcc26d7: Pull complete
Digest: sha256:95ddb6c31407e84e91a986b004aee40975cb0bda14b5949f6faac5d2deadb4b9
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:

For more examples and ideas, visit:

Install Peregrine via Docker

$ sudo docker pull cschin/peregrine:latest

Prepare the input file list:

find /home/kplee/datafile/001.i.bat.pacbio.hifi.reads/ -name "S_unknown.ccs.merged.fasta" | sort > seqdata.lst
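Before starting the container, it can help to confirm that every file named in seqdata.lst exists and is not empty, since an empty find result would otherwise only surface later as an assembler error. A minimal sketch; check_read_list is a hypothetical helper and is not part of Peregrine:

```shell
# Check that every path in a read list exists and is non-empty.
# Returns non-zero if any listed file is missing or empty.
check_read_list() {
    local list=$1 path bad=0
    while IFS= read -r path; do
        [ -n "$path" ] || continue            # skip blank lines
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            bad=1
        fi
    done < "$list"
    return "$bad"
}
```

Usage: `check_read_list seqdata.lst && echo "all inputs present"`. Note that this checks the paths as seen from the host, not from inside a container.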

docker run -it -v /home/kplee/analysis/01.peregrine_assembly_sweetpotato_pacbio_hifi_data:/home/kplee/analysis/01.peregrine_assembly_sweetpotato_pacbio_hifi_data \
    cschin/peregrine:latest asm \
    /home/kplee/analysis/01.peregrine_assembly_sweetpotato_pacbio_hifi_data/seqdata.lst \
    --with-consensus --shimmer-r 3 --best_n_ovlp 8 \

/usr/bin/time -o out.txt -v sudo docker run -it \
    -v /home/kplee/analysis/01.peregrine_assembly_sweetpotato_pacbio_hifi_data:/wd \
    --user $(id -u):$(id -g) cschin/peregrine:latest asm \
    /home/kplee/analysis/01.peregrine_assembly_sweetpotato_pacbio_hifi_data/seqdata.lst 32 32 32 32 32 32 32 32 32 \
    --with-consensus --with-alt --shimmer-r 3 --best_n_ovlp 8 \
    --output /home/kplee/analysis/01.peregrine_assembly_sweetpotato_pacbio_hifi_data/
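One thing to watch with the second command: -v HOST_DIR:/wd makes the host directory appear as /wd inside the container, so paths handed to asm generally have to be written as the container sees them, whereas the command above still passes host-side paths. The sketch below shows that translation; to_container_path is a hypothetical helper and is not part of Docker or Peregrine:

```shell
# Map a host path to the path seen inside a container, given a bind mount
# HOST_DIR:MOUNT_POINT (as in `docker run -v HOST_DIR:MOUNT_POINT`).
to_container_path() {
    local host_dir="${1%/}" mount_point="${2%/}" path="$3"
    case "$path" in
        "$host_dir")    printf '%s\n' "$mount_point" ;;
        "$host_dir"/*)  printf '%s\n' "$mount_point/${path#"$host_dir"/}" ;;
        *)              echo "not under $host_dir: $path" >&2; return 1 ;;
    esac
}

host=/home/kplee/analysis/01.peregrine_assembly_sweetpotato_pacbio_hifi_data
to_container_path "$host" /wd "$host/seqdata.lst"
```

The same applies to the read files listed inside seqdata.lst: they are only readable in the container if their directory is also mounted and the list uses the container-side paths.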

Install gcc compiler version 9.3 using conda

$ conda install -c conda-forge gcc_linux-64

Today 31 March 2021

On a new CentOS installation

First install gcc

sudo yum group install "Development Tools"


gcc --version

Output:

[yedomon@localhost]$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO

Now install PAML:

$ wget
$ tar xf paml4.9j.tgz
$ cd paml4.9j
$ rm bin/*.exe
$ cd src
$ make -f Makefile
$ ls -lF
$ mv baseml basemlg codeml pamp evolver yn00 mcmctree chi2 ../bin

gbseqextractor installation

sudo yum install python3-pip.noarch

sudo pip3 install gbseqextractor

gbseqextractor --help


usage: gbseqextractor [-h] -f <STR> -prefix <STR> [-seqPrefix <STR>]
                      [-types {CDS,rRNA,tRNA,wholeseq,gene} [{CDS,rRNA,tRNA,wholeseq,gene} ...]]
                      [-cds_translation] [-gi] [-p] [-t] [-s] [-l] [-rv] [-F]

Extract any CDS or rNRA or tRNA DNA sequences of genes from Genbank file.

Seqid will be the value of '/gene=' or '/product=', if they both were not
present, the gene will not be output!

version 20201128:
    Now we can handle compounlocation (feature location with "join")!
    We can also output the translation for each CDS (retrived from '/translation=')

Please cite:
Guanliang Meng, Yiyuan Li, Chentao Yang, Shanlin Liu,
MitoZ: a toolkit for animal mitochondrial genome assembly, annotation
and visualization, Nucleic Acids Research,

optional arguments:
  -h, --help            show this help message and exit
  -f <STR>              Genbank file
  -prefix <STR>         prefix of output file. required.
  -seqPrefix <STR>      prefix of each seq id. default: None
  -types {CDS,rRNA,tRNA,wholeseq,gene} [{CDS,rRNA,tRNA,wholeseq,gene} ...]
                        what kind of genes you want to extract? wholeseq for
                        whole fasta seq. WARNING: Each sequence in the result
                        files corresponds to ONE feature in the GenBank file,
                        I will NOT combine multiple CDS of the same gene into
                        ONE! [CDS]
  -cds_translation      Also output translated CDS (required -types CDS). The
                        translations are retrived directly from the
                        '/translation=' key word. [False]
  -gi                   use gi number as sequence ID instead of accession
                        number when " gi number is present. (default:
                        accession number)
  -p                    output the position information on the ID line.
                        Warning: the position on ID line is 0 left-most!
  -t                    output the taxonomy lineage on ID line [False]
  -s                    output the species name on the ID line [False]
  -l                    output the seq length on the ID line [False]
  -rv                   reverse and complement the sequences if the gene is on
                        minus strand. Always True!
  -F                    only output full length genes,i.e., exclude the genes
                        with '>' or '<' in their location [False]

Server settings code

