jnarayan81 / SynNet-Pipeline

Workflow for Building Microsynteny Networks

Update June 7th 2019

-------------------------------

SynetBuild-X.sh

Improved the function that runs MCScanX in parallel.

The code is better organized and commented.

Four key parameters: k (number of top hits), s (number of anchors), m (number of gaps), and p (number of CPUs).

The -p setting controls parallelization for both DIAMOND and MCScanX.

Notes:

Change line 70 according to your genome list.

Options:

Line 127: duplicate_gene_classifier

Line 132: detect_collinear_tandem_arrays [intra-species]

Lines 180-183: detect_collinear_tandem_arrays [inter-species]

 

 

Update June 3rd 2019

-------------------------------

Phylogenomic Profiling

Once you have built a synteny network database for the genomes you are interested in, you can cluster the entire network (for example, with the Infomap algorithm), or first filter out subnetworks (for a particular gene family) and then cluster those (with Infomap, CFinder, etc.).

Next, we summarize the clusters by node composition. This lets us infer which clusters are conserved and which are specific (for example, shared only by a certain group of species). In outline: we first build a matrix in which rows are clusters, columns are species, and each value is the number of nodes from that species in that cluster. We then compute a distance matrix between all pairs of clusters, and finally apply hierarchical clustering to group clusters with similar patterns.

 

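|           | Species 1 | Species 2 | Species 3 | Species n |
|-----------|-----------|-----------|-----------|-----------|
| Cluster 1 | 1         | 2         | 1         | 1         |
| …..       | 0         | 0         | 1         | 1         |
| Cluster n | 0         | 1         | 1         | 0         |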

 

Now let’s start. Suppose you are using the Infomap script infomap.r; its result looks like this:

names   mem

aar_AA31G00673  1

aar_AA32G00725  2

aar_AA39G00041  1

aar_AA29G00273  3

ach_Achn050361  1

ach_Achn168171  1

ach_Achn330591  1

ach_Achn198651  1

ach_Achn060901  1

…….

Results in this format can be fed directly into Phylogenomic_Profiling.r. This R script analyzes cluster composition, calculates distances, and performs hierarchical clustering. Please read the notes in the code, and remember to change line 22 according to the genomes you are using.

Usage: Rscript Phylogenomic_Profiling.r infomap_clustering_result cluster_profiled cluster_profiled_clustered
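
The counting step behind the profile matrix can be sketched in a few lines of shell and awk. This is a minimal illustration, not the code of Phylogenomic_Profiling.r: the function name profile_matrix is hypothetical, and it assumes that the species code is the part of each gene name before the first underscore (as in aar_AA31G00673) and that the input has the two-column names/mem layout shown above. Distance calculation and hierarchical clustering are left to the R script.

```shell
# profile_matrix FILE  (hypothetical helper, not part of the pipeline)
# Prints a tab-separated matrix: one row per cluster, one column per species,
# each cell holding the number of nodes of that species in that cluster.
profile_matrix() {
    awk '
        NR == 1 { next }                    # skip the "names mem" header line
        NF >= 2 {
            split($1, parts, "_")           # assumption: species code precedes "_"
            sp = parts[1]; cl = $2
            # remember species and clusters in order of first appearance
            if (!(sp in seen_sp)) { seen_sp[sp] = 1; sp_list[++ns] = sp }
            if (!(cl in seen_cl)) { seen_cl[cl] = 1; cl_list[++nc] = cl }
            count[cl, sp]++
        }
        END {
            printf "cluster"
            for (j = 1; j <= ns; j++) printf "\t%s", sp_list[j]
            printf "\n"
            for (i = 1; i <= nc; i++) {
                printf "%s", cl_list[i]
                for (j = 1; j <= ns; j++) printf "\t%d", count[cl_list[i], sp_list[j]]
                printf "\n"
            }
        }
    ' "$1"
}
```

Calling profile_matrix infomap_clustering_result prints the matrix to standard output; species and clusters appear in the order in which they are first seen in the input.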

 

 

UPDATES 5 Jul 2018

-------------------------------

Here I attach two scripts that use DIAMOND (Buchfink et al., 2015) for faster genome comparisons in the first step.

SynetBuilding-Diamond.sh: Use this the first time, when you want to construct a synteny network for the genomes you are interested in.

SynetAdding-Diamond.sh: Use this when you want to add new genomes to existing results.

-Preparations

   - Whole genome protein files in fasta format.

   - GFF/BED file for each genome (Example: http://chibba.pgml.uga.edu/mcscan2/examples/at.gff)

   -Install DIAMOND and MCScanX (Wang et al., 2012) 

       -DIAMOND: https://github.com/bbuchfink/diamond

       -MCScanX:  http://chibba.pgml.uga.edu/mcscan2/

-Notes / Tips

   - Name your genome files with a 3-5 letter code. For example, for "Arabidopsis thaliana", rename the genome file and GFF file to "Ath.pep" and "Ath.bed", and for "Oryza sativa" use "Osa.pep" and "Osa.bed".

-To run

  - Put all pairs of "*.pep" and "*.bed" files for your genomes in one folder, and copy the script (SynetBuilding-Diamond.sh) to the same folder.

  - Change line 41 in the script (the contents of the array): enter the codes of the genomes you selected.

     - for example:      array=(ath osa oth Alyr)

  - The array can be of any length, depending on the genomes you want to compare.
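
Before launching a long run, it can help to confirm that every code in the array has both input files. The following is a minimal pre-flight sketch, not part of SynetBuilding-Diamond.sh; the function name check_genome_files is hypothetical, and it assumes the files are named exactly <code>.pep and <code>.bed, as described above.

```shell
# check_genome_files DIR CODE...  (hypothetical helper)
# Reports every CODE whose .pep or .bed file is missing or empty in DIR.
# Returns 0 when all files are present, 1 otherwise.
check_genome_files() {
    dir=$1; shift
    missing=0
    for code in "$@"; do
        for ext in pep bed; do
            if [ ! -s "$dir/$code.$ext" ]; then
                echo "missing or empty: $code.$ext"
                missing=1
            fi
        done
    done
    return "$missing"
}

# Example, run from the folder that holds the genome files:
#   check_genome_files . Ath Osa && echo "all inputs found"
```

The codes passed to the function should be spelled exactly as in the file names, including letter case.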

 

======================================

 

Plant Phylogenomic Synteny Network Construction and Analysis Pipeline

Synteny network construction consists of five primary steps: (1) Annotated genome data preparation, (2) pairwise whole-genome comparisons, (3) syntenic block detection and data merging, (4) sub-network extraction (optional), and (5) network data analysis and visualization.

For Step 1, plant genomes can be downloaded from Phytozome, NCBI, Plaza, CoGe, etc. For each genome two files are needed: peptide sequences for all annotated/predicted genes (primary transcripts only) and a BED/GFF file giving the genomic location of each gene. Users can prepare any number of genomes for synteny network construction; the more genomes, the longer the computation time.

>>> The fifty-one plant genomes used in Zhao et al., 2017b are listed and available for download below (Table 1).

For Steps 2 and 3, we provide a bash script (SynNet.sh) that automatically performs pairwise inter- and intra-species comparisons, trims the outputs for synteny detection, and merges the outputs containing all synteny blocks into a final network file. This network database contains four columns: Block_ID, Block_Score, Gene1, and Gene2 (Gene1 and Gene2 are a syntenic gene pair).

  • Users have to pre-install RAPSearch2 (BLAST-like program, but much faster) and MCScanX (for pairwise synteny block detection).
  • Put all required genome files and the bash script in the same directory. Then edit the first line of the script, a bracketed list of species abbreviations (tab separated, and consistent with the names used in the genome files).
  • Run the program and get the result file called “Final_Network”, which contains all pairwise synteny blocks of your input species.

>>> The synteny network of the fifty-one plant genomes used in Zhao et al., 2017b is available for download (“51_Genomes_Blocks”).

At Step 4, for specific gene family studies you may need to extract sub-networks from the database. To do this, you need to first identify all gene family members from the genomes and then query this gene candidate list against the synteny block database.

  • We use HMMER for gene family identification. HMMs (Hidden Markov Models) for specific gene families can be obtained from Pfam. Users can use Pfam Search or NCBI BLAST to help identify the feature domain(s) in the protein sequence. For example, a plant MADS-box protein is characterized by a core DNA binding domain (PF00319).

       Brief Guidelines of HMMER Usage:

  • Install HMMER following the instructions at: http://hmmer.org/documentation.html
  • Download the protein sequence alignment for PF00319 in Stockholm format (default name : “PF00319_seed.txt”): http://pfam.xfam.org/family/PF00319#tabview=tab3
  • Hmmbuild: to make a model from the alignment
    • Usage: hmmbuild [-options] <hmmfile output> <alignment file input>
    • Example: hmmbuild MADS.hmm PF00319_seed.txt
    • MADS.hmm is the output model for characterizing MADS-box genes.
  • Hmmsearch: to identify all candidate members from the peptide database.
    • Usage: hmmsearch [options] <query hmmfile> <target seqfile>
    • Example: hmmsearch MADS.hmm 51_Genomes_Peps > MADS_Results
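
To turn search results into a plain gene list, one option is HMMER's tabular output. The sketch below assumes hmmsearch was run with its --tblout option, which writes one line per hit with the target sequence name in the first column and comment lines starting with "#". The helper name tblout_to_list and the file name MADS.tbl are hypothetical, and no E-value filtering is applied, so this is not necessarily how the candidate list distributed with the pipeline was produced.

```shell
# tblout_to_list FILE  (hypothetical helper)
# Prints the unique target sequence names from an hmmsearch --tblout table.
tblout_to_list() {
    # skip comment lines ("#") and blank lines; column 1 is the target name
    awk '!/^#/ && NF { print $1 }' "$1" | LC_ALL=C sort -u
}

# Example, assuming HMMER is installed and MADS.tbl is a name of your choice:
#   hmmsearch --tblout MADS.tbl MADS.hmm 51_Genomes_Peps > MADS_Results
#   tblout_to_list MADS.tbl > MADS_list
```

Review the hits (for example, their E-values in MADS_Results) before treating every listed sequence as a family member.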

>>> Peptides for the 51 plant genomes are merged and available for download (“51_Genomes_Peps”), which makes it easier to identify gene family members across all 51 genomes.

>>> The gene list of candidate MADS-box genes from the 51 Genomes (“MADS_list”)

  • Extract subnetwork from the synteny network database as needed, using a list containing all HMMER-identified family members.
  • Command: grep -f MADS_list 51_Genomes_Blocks > MADS.SynNet
  • Now we obtain all syntenic relationships for all MADS-box genes.
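
One caveat with grep -f is that each line of the list is treated as a pattern and matched anywhere in a row, so a gene ID can also pull in rows for longer IDs that contain it. A stricter alternative, sketched here, keeps a row only when Gene1 or Gene2 equals a listed ID exactly. It assumes the four-column, whitespace-separated layout described above (Block_ID, Block_Score, Gene1, Gene2) and a non-empty list file; the function name extract_subnetwork is hypothetical.

```shell
# extract_subnetwork LIST NETWORK  (hypothetical helper)
# Prints the rows of NETWORK whose third or fourth field is an ID in LIST.
extract_subnetwork() {
    awk '
        NR == FNR { want[$1] = 1; next }    # first file: load the gene list
        ($3 in want) || ($4 in want)        # second file: exact ID match
    ' "$1" "$2"
}

# Example with the file names used above:
#   extract_subnetwork MADS_list 51_Genomes_Blocks > MADS.SynNet
```

Like the grep command, this keeps a row when either gene of the pair is in the list; require both conditions instead if only within-family pairs are wanted.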

>>> Synteny network of MADS-box genes across 51 genomes (MADS.SynNet)

Step 5:

The subnetwork file (MADS.SynNet) can be trimmed into several formats for clustering and visualization, which can be performed in different ways.

Clustering algorithms: K-clique percolation (e.g., CFinder, SNAP), Infomap, CNM, k-core decomposition, etc.

Visualization tools: Cytoscape, Gephi, etc.

>>> Example networks from Zhao et al., 2017b are available for download and visualization in Cytoscape (MADS.cys; Cytoscape version 3.4.0).
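
As one example of such trimming, the sketch below reduces a network file to a two-column edge list. It assumes the four-column layout described above, orders each gene pair alphabetically so that A-B and B-A collapse into one undirected edge, and drops duplicates. The function name synnet_to_edges and the output file name are hypothetical; check the input format that your clustering or visualization tool expects.

```shell
# synnet_to_edges NETWORK  (hypothetical helper)
# Prints unique undirected edges as "Gene1<TAB>Gene2".
synnet_to_edges() {
    LC_ALL=C awk '
        NF >= 4 {
            a = $3 ""; b = $4 ""            # force string comparison
            if (a > b) { t = a; a = b; b = t }
            print a "\t" b
        }
    ' "$1" | LC_ALL=C sort -u
}

# Example:
#   synnet_to_edges MADS.SynNet > MADS.edges
```

Block_ID and Block_Score are dropped here; keep them as extra columns if you want to use the block score as an edge attribute.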

 

Table 1: Genomes used in the study of Zhao et al., 2017b

| No | Species | Order | Peptides | BED/GFF | Version | #Genes | Reference |
|----|---------|-------|----------|---------|---------|--------|-----------|
| 1 | Phaseolus vulgaris (Common bean) | Rosids | pv.pep | pv.bed | Version 1.0 | 27082 | Schmutz et al., 2014 |
| 2 | Glycine max (Soybean) | Rosids | gm.pep | gm.bed | Wm82.a2.v1 | 56044 | Schmutz et al., 2010 |
| 3 | Cajanus cajan (Pigeonpea) | Rosids | cc.pep | cc.bed | Nov_2011 | 48680 | Varshney et al., 2012 |
| 4 | Medicago truncatula (Barrel medic) | Rosids | mt.pep | mt.bed | Mt4.0v1 | 50894 | Young et al., 2011 |
| 5 | Cicer arietinum (Chickpea) | Rosids | ca.pep | ca.bed | Version 1.0 | 28269 | Varshney et al., 2013 |
| 6 | Lotus japonicus (Lotus) | Rosids | lj.pep | lj.bed | Version 2.5 | 42399 | Sato et al., 2008 |
| 7 | Citrullus lanatus (Watermelon) | Rosids | cl.pep | cl.bed | Version 1.0 | 23440 | Guo et al., 2013 |
| 8 | Cucumis sativus (Cucumber) | Rosids | cs.pep | cs.bed | Version 1.0 | 21491 | Huang et al., 2009 |
| 9 | Populus trichocarpa (Western poplar) | Rosids | pt.pep | pt.bed | Version 3.0 | 41335 | Tuskan et al., 2006 |
| 10 | Ricinus communis (Castor bean) | Rosids | rc.pep | rc.bed | Version 0.1 | 38613 | Chan et al., 2010 |
| 11 | Malus x domestica (Apple) | Rosids | md.pep | md.bed | Version 1.0 | 63514 | Velasco et al., 2010 |
| 12 | Pyrus x bretschneideri (Pear) | Rosids | pb.pep | pb.bed | Version 1.0 | 42812 | Wu et al., 2013 |
| 13 | Prunus persica (Peach) | Rosids | pp.pep | pp.bed | Version 1.0 | 28689 | International Peach Genome et al., 2013 |
| 14 | Prunus mume (Mei) | Rosids | pm.pep | pm.bed | Version 1.0 | 31390 | Zhang et al., 2012 |
| 15 | Fragaria vesca (Strawberry) | Rosids | fv.pep | fv.bed | Version 1.1 | 32831 | Shulaev et al., 2011 |
| 16 | Arabidopsis thaliana (Arabidopsis) | Rosids | at.pep | at.bed | TAIR10 | 27416 | Arabidopsis Genome, 2000 |
| 17 | Arabidopsis lyrata (Lyrate rockcress) | Rosids | al.pep | al.bed | Version 1.0 | 32670 | Hu et al., 2011 |
| 18 | Capsella rubella (Capsella) | Rosids | cb.pep | cb.bed | Version 1.0 | 26521 | Slotte et al., 2013 |
| 19 | Brassica oleracea (Kale) | Rosids | bo.pep | bo.bed | Version 2.1 | 59225 | Liu et al., 2014 |
| 20 | Brassica rapa (Chinese cabbage) | Rosids | br.pep | br.bed | Version 1.3 | 40492 | Wang et al., 2011 |
| 21 | Aethionema | Rosids | aeth.pep | aeth.bed | Version 2.5 | 22230 | Haudry et al., 2013 |
| 22 | Tarenaya | Rosids | tare.pep | tare.bed | Version 5 | 31580 | Cheng et al., 2013 |
| 23 | Carica papaya (Papaya) | Rosids | cp.pep | cp.bed | ASGPBv0.4 | 24782 | Ming et al., 2008 |
| 24 | Gossypium raimondii (Cotton) | Rosids | gr.pep | gr.bed | Version 2.1 | 37505 | Paterson et al., 2012 |
| 25 | Theobroma cacao (Cacao) | Rosids | ta.pep | ta.bed | Version 1.1 | 29452 | Argout et al., 2011 |
| 26 | Citrus sinensis (Sweet orange) | Rosids | ci.pep | ci.bed | Version 1.1 | 25379 | Xu et al., 2013 |
| 27 | Eucalyptus grandis (Eucalyptus) | Rosids | eg.pep | eg.bed | Version 1.1 | 36376 | Myburg et al., 2014 |
| 28 | Vitis vinifera (Grape vine) | Rosids | vv.pep | vv.bed | Genoscope (Aug 2007) | 26346 | Jaillon et al., 2007 |
| 29 | Solanum tuberosum (Potato) | Solanace | st.pep | st.bed | Version 3.4 | 39031 | Potato Genome Sequencing et al., 2011 |
| 30 | Solanum lycopersicum (Tomato) | Solanace | sl.pep | sl.bed | Version 2.4 | 34727 | Tomato Genome, 2012 |
| 31 | Capsicum annuum (Hot pepper) | Solanace | cu.pep | cu.bed | Version 1.55 | 34899 | Kim et al., 2014 |
| 32 | Utricularia gibba (Humped bladderwort) | Solanace | ug.pep | ug.bed | CoGe (Jun 2013) | 28494 | Ibarra-Laclette et al., 2013 |
| 33 | Actinidia chinensis (Kiwifruit) | Solanace | ah.pep | ah.bed | May_2013 | 32670 | Huang et al., 2013 |
| 34 | Beta vulgaris (Sugar beet) | Eudicots | bv.pep | bv.bed | RefBeet-1.1 | 27421 | Dohm et al., 2014 |
| 35 | Nelumbo nucifera (Sacred lotus) | Eudicots | nn.pep | nn.bed | Version 1.0 | 26685 | Ming et al., 2013 |
| 36 | Triticum urartu (Wheat A-genome) | Monocots | tu.pep | tu.bed | Version 1.0 | 34879 | Ling et al., 2013 |
| 37 | Hordeum vulgare (Barley) | Monocots | hv.pep | hv.bed | Version 1.0 | 16598 | International Barley Genome Sequencing et al., 2012 |
| 38 | Brachypodium distachyon (Purple false brome) | Monocots | bd.pep | bd.bed | Version 2.1 | 31694 | International Brachypodium, 2010 |
| 39 | Oryza sativa (Rice) | Monocots | os.pep | os.bed | Version 7.0 | 39049 | International Rice Genome Sequencing, 2005 |
| 40 | Zea mays (Maize) | Monocots | zm.pep | zm.bed | Version 6a | 63480 | Schnable et al., 2009 |
| 41 | Sorghum bicolor (Sorghum) | Monocots | sb.pep | sb.bed | Version 2.1 | 33032 | Paterson et al., 2009 |
| 42 | Setaria italica | Monocots | si.pep | si.bed | Version 2.1 | 35471 | Bennetzen et al., 2012 |
| 43 | Elaeis guineensis (Oil palm) | Monocots | el.pep | el.bed | Version 2.0 | 30752 | Singh et al., 2013 |
| 44 | Musa acuminata (Banana) | Monocots | ma.pep | ma.bed | July_2012 | 36542 | D'Hont et al., 2012 |
| 45 | Phalaenopsis equestris (Orchid) | Monocots | pe.pep | pe.bed | Version 5.0 | 42293 | Cai et al., 2015 |
| 46 | Zostera muelleri (Seagrass) | Monocots | zo.pep | zo.bed | Version 1.0 | 33245 | Golicz et al., 2015 |
| 47 | Amborella trichopoda (Amborella) | Basal Angiosperm | ar.pep | ar.bed | Version 1.0 | 26846 | Chamala et al., 2013 |
| 48 | Picea abies (Norway spruce) | Gymnosperm | pa.pep | pa.bed | Version 1.0 | 66632 | Nystedt et al., 2013 |
| 49 | Selaginella moellendorffii (Selaginella) | Moss | sm.pep | sm.bed | Version 1.0 | 22273 | Banks et al., 2011 |
| 50 | Physcomitrella patens (Moss) | Moss | ph.pep | ph.bed | Version 3.0 | 26610 | Rensing et al., 2008 |
| 51 | Chlamydomonas reinhardtii (Green algae) | Green algae | cr.pep | cr.bed | Version 5.5 | 17741 | Merchant et al., 2007 |

 

 

Citations:

Zhao, T., and Schranz, M.E. (2019). Network-based microsynteny analysis identifies major differences and genomic outliers in mammalian and angiosperm genomes. Proceedings of the National Academy of Sciences 116(6), 2165-2174.

Zhao, T., Holmer, R., de Bruijn, S., Angenent, G.C., van den Burg, H.A., and Schranz, M.E. (2017b). Phylogenomic synteny network analysis of MADS-box transcription factor genes reveals lineage-specific transpositions, ancient tandem duplications, and deep positional conservation. The Plant Cell 29, 1278-1292.

Zhao, T., and Schranz, M.E. (2017a). Network Approaches for Plant Phylogenomic Synteny Analysis. Current Opinion in Plant Biology 36, 129-134.

 
