elsayed-lab / coex-network-optimization-construction

Code for submitting co-expression network construction jobs to a cluster and summarizing the results

Repository from Github https://github.com/elsayed-lab/coex-network-optimization-construction

---
title: "Co-expression Network Analysis Parameter Optimization (v6)"
author: V. Keith Hughitt
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
  pdf_document:
    toc: true
    latex_engine: xelatex
---

```{r echo=FALSE}
# Clean up any existing variables
rm(list=ls())
```

Introduction
============

Overview
--------

The purpose of this analysis is to test a range of parameters for
co-expression network construction in order to find reasonable values to use
for downstream network analysis.

The basic approach is to construct multiple networks with varying parameters
(typically fewer than ~1000 combinations or so, to avoid taxing the cluster)
and, for each network, determine how many enriched GO terms are observed for
each resulting module and what the average p-value of those terms is. Networks
which result in significantly more enrichment with lower p-values may then be
considered more "optimal" or more realistic. One must keep in mind, however,
that when testing large numbers of networks it is always possible to generate
seemingly real modules by chance, and that even if one set of parameters is
optimal relative to the others, there is no guarantee that it is the global
optimum.
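
As a rough illustration of the comparison described above, the sketch below
ranks a handful of networks by the number of enriched GO terms, breaking ties
by the lower mean p-value. The data frame, its column names
(`num_enriched_terms`, `mean_pval`) and its values are hypothetical and are not
taken from the actual results files; base R is used so the example has no
package dependencies.

```r
# Toy per-network summary; all names and values here are made up for
# illustration and do not come from the real output files
results <- data.frame(
    network_id         = c('net_a', 'net_b', 'net_c', 'net_d'),
    num_enriched_terms = c(120, 340, 340, 15),
    mean_pval          = c(1e-3, 5e-4, 1e-5, 2e-2),
    stringsAsFactors   = FALSE
)

# More enriched terms first; ties broken by the lower mean p-value
ranked <- results[order(-results$num_enriched_terms, results$mean_pval), ]

ranked$network_id
# [1] "net_c" "net_b" "net_a" "net_d"
```

A ranking like this only orders the networks that were actually tested, so the
caveats above about chance enrichment and local optima still apply.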

Parameters Tested
-----------------

For each dataset, all possible combinations of the following parameters will be
tested:

- Counts-per-million (CPM) transformation (TRUE|FALSE)
- Log2 transformation (TRUE|FALSE)
- Quantile normalization (TRUE|FALSE)
- Batch adjustment, when available (limma|combat|none)
- Similarity measure (Pearson correlation, Spearman correlation, biweight
  mid-correlation, cor-dist)
- Adjacency power (1-14)

This will result in a total of 1344 networks per dataset where batch
information is available, and 448 networks otherwise.
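
The totals above can be checked by enumerating the parameter grid. This is a
minimal sketch using `expand.grid`; the column names and the string labels for
each option are assumptions chosen for readability and may not match the names
used by the actual job submission scripts.

```r
# Enumerate every combination of the parameters listed above
# (column names and option labels are illustrative assumptions)
param_grid <- expand.grid(
    cpm                = c(TRUE, FALSE),
    log2               = c(TRUE, FALSE),
    quantile_normalize = c(TRUE, FALSE),
    batch_adjust       = c('limma', 'combat', 'none'),
    similarity_measure = c('pearson', 'spearman', 'bicor', 'cor-dist'),
    adjacency_power    = 1:14,
    stringsAsFactors   = FALSE
)

# Datasets with batch information: 2 * 2 * 2 * 3 * 4 * 14 combinations
nrow(param_grid)
# [1] 1344

# Datasets without batch information can only use batch_adjust == 'none'
nrow(param_grid[param_grid$batch_adjust == 'none', ])
# [1] 448
```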

Further, for a single dataset (HsLm), additional parameters which have
previously been compared and ruled out are included for demonstrative
purposes.

Changelog
---------

### v6

- Including GO ancestor terms in enrichment analysis
- Updated human reference annotations from Ensembl 83 -> 88
- Updated TriTrypDB from 27 -> 32
- Updated Bioconductor from 3.2 -> 3.5

### v5

- Fixed an issue with HTSeq causing many reads to be double-counted.
- Improved computation of gene lengths for Human/Mouse.
- Updated human reference annotations from Ensembl 76 -> 83
- Updated TriTrypDB from 25 -> 27
- Updated Bioconductor from 3.1 -> 3.2

### v4

- Added differential-expression-based filtering of genes for host and parasite
  and multicopy gene family filtering for *T. cruzi* and *L. major*.

### v3

- Updated TriTrypDB to version 25
- Quantile normalization performed before log2 transformation
- When `similarity_measure='dist'`, Euclidean distance is used directly for
  dissimilarity matrix instead of converting to an intermediate similarity
  matrix first.

### v2

- Updated TriTrypDB to version 24
- Gene/module mapping saved for each network generated
- No longer testing voom, topological_overlap, low_count_threshold, and
  merge_cor; the latter two can be separately optimized after all other
  parameters have been optimized.

Results
=======

Setup
-----

```{r load_libraries}
library('DT')
library('dplyr')
library('ggplot2')
library('knitr')
library('reshape2')
library('readr')
source('../00-shared/R/util.R')

# knitr preferences
opts_chunk$set(fig.width=768/96,
               fig.height=480/96,
               dpi=96)
options(digits=4)
options(stringsAsFactors=FALSE)
options(knitr.duplicate.label='allow')

theme_set(theme_grey(base_size=16))

# number of high-scoring entries to display in table output
num_rows_to_display <- 15 
```

*H. sapiens* All samples
------------------------

```{r load_data}
df <- tbl_df(read.csv('output/normal/hsapiens_all-v6.1.csv'))
```

```{r child='child/load-human-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-host.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

*H. sapiens* Body Map multi-tissue network
------------------------------------------

```{r load_data}
df <- tbl_df(read.csv('output/normal/illumina_bodymap-v6.1.csv'))
```

```{r child='child/load-human-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-host.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

*H. sapiens* infected with *T. cruzi* (intracellular)
-----------------------------------------------------

```{r load_data}
df <- read_csv('output/normal/hsapiens_infected_with_tcruzi-v6.1.csv')
```

```{r child='child/load-human-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-host.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

*H. sapiens* infected with *T. cruzi* (uninfected samples)
----------------------------------------------------------

```{r load_data}
df <- read_csv('output/normal/hsapiens_infected_with_tcruzi-uninf-v6.1.csv') %>%
    filter(network_type == 'signed')
```

```{r child='child/load-human-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-host.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

*H. sapiens* infected with *L. braziliensis*
--------------------------------------------

```{r load_data}
df <- tbl_df(read.csv('output/normal/hsapiens_infected_with_lbraziliensis-v6.1.csv'))
```

```{r child='child/load-human-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-host.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

*H. sapiens* infected with *L. major* (intracellular)
-----------------------------------------------------

### General

```{r load_data}
df <- read_csv('output/normal/hsapiens_infected_with_lmajor-v6.1.csv')
```

```{r child='child/load-human-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-host.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

*H. sapiens* infected with *L. major* (including alt parameters)
----------------------------------------------------------------

### General

```{r load_data}
df <- read_csv('output/normal/hsapiens_infected_with_lmajor-full-v6.1.csv')
```

```{r child='child/load-human-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-host.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

*H. sapiens* infected with *L. major* (uninfected samples)
----------------------------------------------------------

### General

```{r load_data}
df <- read_csv('output/normal/hsapiens_infected_with_lmajor-uninf-v6.1.csv') %>%
    filter(network_type == 'signed')
```

```{r child='child/load-human-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-host.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

*M. musculus* infected with *L. major* (intracellular)
------------------------------------------------------

### General

```{r load_data}
df <- read_csv('output/normal/mmusculus_infected_with_lmajor-v6.1.csv') %>%
    filter(network_type == 'signed')
```

```{r child='child/load-mouse-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-host.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```


*L. major* infecting *M. musculus* (intracellular)
--------------------------------------------------

### General

```{r load_data}
df <- read_csv('output/normal/lmajor_infecting_mmusculus-v6.1.csv') %>% 
    filter(network_type == 'signed')
```

```{r child='child/load-other-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-other.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

*L. major* all hosts
--------------------

### General

```{r load_data}
df <- read_csv('output/normal/lmajor_all_samples-v6.1.csv') %>%
    filter(network_type == 'signed')
```

```{r child='child/load-other-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-other.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

*L. major* all hosts (Incl. multicopy genes)
--------------------------------------------

### General

```{r load_data}
df <- read_csv('output/normal/lmajor_all_samples-multicopy-v6.1.csv') %>%
    filter(network_type == 'signed')
```

```{r child='child/load-other-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-other.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

*L. major* infecting *H. sapiens* (intracellular)
--------------------------------------------------

### General

```{r load_data}
df <- read_csv('output/normal/lmajor_infecting_hsapiens-v6.1.csv') %>% 
    filter(network_type == 'signed')
```

```{r child='child/load-other-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-other.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

*T. cruzi* infecting *H. sapiens* (intracellular)
-------------------------------------------------

### General

```{r load_data}
df <- read_csv('output/normal/tcruzi_infecting_hsapiens-v6.1.csv') %>%
    filter(network_type == 'signed')
```

```{r child='child/load-other-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-other.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

*T. cruzi* infecting *H. sapiens* (all stages)
----------------------------------------------

### General

```{r load_data, eval=FALSE}
df <- read_csv('output/normal/tcruzi_all_samples-v6.1.csv') %>%
    filter(network_type == 'signed')
```

```{r child='child/load-other-data.Rmd', eval=FALSE}
```

### All networks

```{r child='child/results-summary-tables-other.Rmd', eval=FALSE}
```

```{r child='child/results-plots.Rmd', eval=FALSE}
```

*T. cruzi* infecting *H. sapiens* (all stages; including multicopy genes)
-------------------------------------------------------------------------

### General

```{r load_data}
df <- read_csv('output/normal/tcruzi_all_samples-multicopy-v6.1.csv') %>%
    filter(network_type == 'signed')
```

```{r child='child/load-other-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-other.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

ModENCODE - Fly
---------------

### General

```{r load_data}
df <- tbl_df(read.csv('output/normal/modencode_fly-v6.1.csv'))
```

```{r child='child/load-other-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-other.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

ModENCODE - Worm
----------------

### General

```{r load_data}
df <- tbl_df(read.csv('output/normal/modencode_worm-v6.1.csv'))
```

```{r child='child/load-other-data.Rmd'}
```

### All networks

```{r child='child/results-summary-tables-other.Rmd'}
```

```{r child='child/results-plots.Rmd'}
```

System Information
------------------

```{r sysinfo}
sessionInfo()
date()
```
