stheroux / SCCWRP_Networks

Co-occurrence networks on DNA data

File descriptions:

  1. SCCWRP_NetworkAnalysis.R: Script for the SpiecEasi network analysis, run on the cluster. For reference, I requested 36 CPUs and 960 GB of memory. A detailed write-up of the methods is included below. This script outputs two files: network_hubscores.csv and network_topology.csv.
  2. SCCWRP_ModelSelection_POLR.R: Script run locally that determines which network topology variables are the best predictors of site status (i.e., reference, intermediate, stressed). This script takes network_topology.csv and outputs two files: topology_summary.csv and prediction_summary.csv.
  3. SCCWRP_rename_hubscore_otuid.R: Script run locally that matches OTU IDs from network_hubscores.csv with the appropriate taxonomic name. This script takes network_hubscores.csv and outputs renamed_network_hubscores.csv.
  4. network_hubscores.csv: Reports the top ten OTU/ASV IDs with the highest hub score for each site status (reference, intermediate, or stressed), taxonomic level (family, genus, or species), and primer set (16S, 18S, or rbcL).
  5. renamed_network_hubscores.csv: Same as network_hubscores.csv but with an additional column that lists the taxon name matching each OTU ID at the given taxonomic level.
  6. network_topology.csv: Reports stats for each network created (more context on this is included in the Methods write-up found below).
  7. topology_summary.csv: Reports the average values for each network topology variable recorded (Total.Nodes = total nodes, Total.Edges = total edges, Positive.Edges = number of positive edges, Negative.Edges = number of negative edges, Pos.Neg = ratio of positive to negative edges, Pos.Total = ratio of positive to total edges, Neg.Total = ratio of negative to total edges, Avg.Path.Length = average path length, Modularity = modularity, Avg.Degree = average degree, Heterogeneity = heterogeneity, Clustering.Coefficient = clustering coefficient) for each site status, level, and primer set. 
  8. prediction_summary.csv: Reports the output of ordered logistic regression model selection. The model treats site status as a function of network topology variables and outputs an odds ratio value for each variable. Since this can be challenging to interpret, I included a "Statement" column that reports how much more or less likely a site is to be more stressed with each one-unit increase in the respective network topology factor.
  9. num_samples_vs_taxa.csv: Reports number of samples versus number of taxa used to construct networks.
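
As a rough illustration of how topology_summary.csv relates to network_topology.csv, the sketch below averages topology variables within each site status, level, and primer set. It uses a small simulated table rather than the real file, and the grouping column names (Status, Level, Primer) are assumptions; only the topology variable names come from the descriptions above.

```r
# Toy stand-in for network_topology.csv (simulated values, not real results).
# The grouping column names Status, Level, and Primer are assumptions.
set.seed(1)
topology <- data.frame(
  Status = rep(c("reference", "intermediate", "stressed"), each = 4),
  Level  = "genus",
  Primer = "16S",
  Total.Nodes    = sample(80:120, 12, replace = TRUE),
  Total.Edges    = sample(150:300, 12, replace = TRUE),
  Positive.Edges = sample(100:140, 12, replace = TRUE)
)
topology$Negative.Edges <- topology$Total.Edges - topology$Positive.Edges
topology$Pos.Total <- topology$Positive.Edges / topology$Total.Edges

# Average every topology variable within each status/level/primer group.
topology_summary <- aggregate(
  cbind(Total.Nodes, Total.Edges, Positive.Edges, Negative.Edges, Pos.Total) ~
    Status + Level + Primer,
  data = topology,
  FUN = mean
)
print(topology_summary)
```

With the real data, `topology` would instead be read from network_topology.csv, and the remaining topology columns would be added to the `cbind()` call.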

Methods

This analysis was run on three datasets, each generated with one of the following primers: 16S, 18S, or rbcL. Replicate samples within each dataset (labeled ‘eDNA’, ‘MB’, or ‘LB’) were removed. Prior to analysis, basic filtering was performed: taxa with singleton and doubleton reads were removed, and samples with fewer than 2,000 reads were removed. Any taxa that were classified as ‘Unassigned’ were removed. The analysis described below was run with taxa classified at the family, genus, and species taxonomic levels. For species-level analysis, taxa with less than 0.1% relative abundance were removed.

Samples were divided into three groups based on site status (reference, intermediate, and stressed). For each site status, we randomly selected 50 samples 100 times and constructed networks using the R package ‘SpiecEasi’ (version 1.1.0). The following parameters were set to create networks that neared an optimal lambda value of 0.05, while balancing efficient use of available compute power: rep.num = 50, ncores = 36, nlambda = 100, sel.criterion = bstars. Neighborhood selection (the Meinshausen and Bühlmann or “MB” method) was used.

For each network, the following network topology features were recorded: total nodes, total edges, number of positive edges, number of negative edges, ratio of positive to negative edges, average path length, heterogeneity, modularity, average degree per node, clustering coefficient, and hub score. Within each network, nodes represent unique OTU/ASVs and edges represent the significant co-occurrences between them. Positive edges indicate that connected OTU/ASVs tend to be present together, and negative edges indicate the opposite (i.e., if one is present in a community, the other tends to be absent). The average path length considers the shortest edge path connecting each pair of nodes. Heterogeneity, the distribution of degrees or connections from each node, was calculated as described in Jacob et al. (2017).
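
The filtering and resampling steps described above can be sketched in base R as follows. This is a minimal illustration on a simulated count matrix, not the code in SCCWRP_NetworkAnalysis.R. It assumes that "singleton and doubleton" means taxa with two or fewer total reads, that each draw of 50 samples is made without replacement, and that the filters are applied in the order shown. The species-level 0.1% relative abundance filter is left out. The `fit_network` wrapper is a hypothetical helper; it is defined but not called, since a full fit is expensive and requires SpiecEasi to be installed. In SpiecEasi, the core count is passed as `ncores` inside `pulsar.params`.

```r
set.seed(42)

# Simulated count table: rows are samples, columns are taxa.
n_samples <- 60
n_taxa <- 30
counts <- matrix(rpois(n_samples * n_taxa, lambda = 100),
                 nrow = n_samples,
                 dimnames = list(paste0("S", seq_len(n_samples)),
                                 paste0("OTU", seq_len(n_taxa))))
counts["S2", ] <- 1          # a low-depth sample
counts[, "OTU1"] <- 0        # a doubleton taxon: two reads in total
counts["S1", "OTU1"] <- 2
taxon_name <- setNames(rep("Assigned", n_taxa), colnames(counts))
taxon_name["OTU3"] <- "Unassigned"

# Assumed reading of the filters: drop taxa with <= 2 total reads and
# 'Unassigned' taxa, then drop samples with fewer than 2,000 reads.
keep_taxa <- colSums(counts) > 2 & taxon_name != "Unassigned"
filtered <- counts[, keep_taxa, drop = FALSE]
filtered <- filtered[rowSums(filtered) >= 2000, , drop = FALSE]

# Draw 50 samples (without replacement, an assumption) 100 times;
# each draw would feed one network.
n_draws <- 100
draws <- replicate(n_draws, sample(rownames(filtered), 50), simplify = FALSE)

# Hypothetical wrapper using the settings listed in the Methods.
# Defined only; calling it needs the SpiecEasi package and real compute.
fit_network <- function(count_matrix) {
  SpiecEasi::spiec.easi(
    count_matrix,
    method = "mb",
    nlambda = 100,
    sel.criterion = "bstars",
    pulsar.params = list(rep.num = 50, ncores = 36)
  )
}
# networks <- lapply(draws, function(ids) fit_network(filtered[ids, ]))
```

In the actual analysis this would be repeated for each site status, taxonomic level, and primer set.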
Modularity, the density of node connections compared to a randomly structured network, was measured with the Louvain method, which maximizes the score for each community (Blondel et al., 2008). Hub score was calculated for the whole network without subsampling using Kleinberg’s centrality score, which ranges from 0 to 1 (Kleinberg, 1999). This analysis was done separately for each primer set (16S, 18S, and rbcL) and taxonomic level (family, genus, and species).

An ordered logistic regression model was estimated using the ‘polr’ command from the ‘MASS’ package (version 7.3.53) in R (Venables & Ripley, 2002). The model was first run using all non-multicollinear factors: total nodes, total edges, positive to negative edge ratio, average path length, modularity, average degree, heterogeneity, and clustering coefficient. Using the ‘regsubsets’ command from the ‘leaps’ package (version 3.1) in R, we determined the best predictors for site status. The best-fit factors were additionally confirmed using the ‘stepAIC’ command from MASS. Log odds (the model coefficients) were converted to odds ratio values by exponentiation and recorded. Odds ratio values with positive exponents are interpreted as that much “more likely” to have increased stress with each one-unit increase in the corresponding network topology factor. For odds ratio values with negative exponents, the reciprocal value is reported to represent how much “less likely” increased stress is with each one-unit increase in the corresponding network topology factor.
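
To make the odds ratio step concrete, here is a minimal sketch that fits an ordered logistic regression with `polr` and turns its coefficients into "more likely" or "less likely" statements. The data are simulated purely for illustration (modularity is made to fall and average degree to rise with stress), so the numbers say nothing about the real networks. The column names and the wording of the Statement text are assumptions, and the model selection with regsubsets and stepAIC is not reproduced here.

```r
library(MASS)
set.seed(7)

# Simulated topology table (not real results).
n <- 300
status <- factor(
  sample(c("reference", "intermediate", "stressed"), n, replace = TRUE),
  levels = c("reference", "intermediate", "stressed"),
  ordered = TRUE
)
stress_rank <- as.integer(status)
networks <- data.frame(
  Status = status,
  Modularity = rnorm(n, mean = 0.6 - 0.1 * stress_rank, sd = 0.15),
  Avg.Degree = rnorm(n, mean = 3 + 0.5 * stress_rank, sd = 1)
)

# Ordered logistic regression of site status on topology variables.
fit <- polr(Status ~ Modularity + Avg.Degree, data = networks, Hess = TRUE)

# Coefficients are log odds; exponentiate to get odds ratios.
log_odds <- coef(fit)
odds_ratio <- exp(log_odds)

# Negative log odds give an odds ratio below 1, so report the reciprocal
# as a "less likely" factor; otherwise report the ratio as "more likely".
summary_table <- data.frame(
  Factor = names(odds_ratio),
  Odds.Ratio = unname(odds_ratio),
  Statement = unname(ifelse(
    log_odds > 0,
    sprintf("%.2f times more likely to be more stressed per one-unit increase",
            odds_ratio),
    sprintf("%.2f times less likely to be more stressed per one-unit increase",
            1 / odds_ratio)
  ))
)
print(summary_table)
```

Note that a one-unit increase is large for a variable such as modularity, which is usually well below 1, so the reciprocal for such a factor can be very large.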

Compute power

This analysis takes ~1 week with the following resources: --ntasks=1 --cpus-per-task=36 --mem=960gb

Package versions (PackageName_version)

R_4.0.3 (base R) dplyr_1.1.2 xlsx_0.6.5 tidyverse_1.3.0 phyloseq_1.34.0 SpiecEasi_1.1.0 igraph_1.2.6
