The microrms R package

Lars Snipen

Installation
Tutorial 1 - microbial community composition
Tutorial 2 - subspecies resolution

Installation

To install directly from GitHub you first need to install the package devtools. Start R and run:

install.packages("devtools")

or, if you use RStudio, use the menu Tools - Install Packages… You may also need the package githubinstall. If so, install it the same way as above.

Next, you should be able to install this package by

devtools::install_github("larssnip/microrms")

The R functions RMSobject() and readMapper() call upon the external software vsearch, which must be installed and available on the system, see VSEARCH on GitHub. You need to know how to start vsearch on your system, and these functions have the argument vsearch.exe where you specify the exact text that will run vsearch. As an example, in Windows the command to start vsearch is simply "vsearch", but if you run it as a singularity container the corresponding command would perhaps be "singularity exec <container_name> vsearch" or something along those lines.
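Before calling these functions it can be useful to confirm that the command is actually found. Here is a minimal sketch in base R. The helper commandAvailable() is hypothetical and not part of microrms. It only looks at the first word of the command, so for a container call it checks that singularity, not vsearch itself, is on the search path:

```r
# Hypothetical helper (not part of microrms): TRUE if the first word of
# a command line resolves to an executable on the search path.
commandAvailable <- function(cmd){
  exe <- strsplit(trimws(cmd), "[[:space:]]+")[[1]][1]
  unname(nzchar(Sys.which(exe)))
}

vsearch.exe <- "vsearch"   # edit to match your system
if(!commandAvailable(vsearch.exe)){
  message("Command '", vsearch.exe, "' not found; RMSobject() and readMapper() will fail")
}
```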

Below are two short tutorials to get you going. The first shows a basic analysis on a very small data set. The second digs more into the resolution of this method.

Tutorial 1 - microbial community composition

This is a short step-by-step tutorial on a small toy example to illustrate a typical RMS study. We want to estimate the abundance of various taxa in a microbial community. Just as for shotgun metagenome data, this is a ‘closed reference’ type of assignment. This means we have a set of already sequenced genomes, and estimate the relative abundances of these in the sequenced samples. Instead of full shotgun sequencing, we have used the Reduced Metagenome Sequencing approach, i.e. sequencing of amplicons obtained by restriction enzyme cutting of the genomes. The use of RMS to discover new taxa is out of the scope of this tutorial.

You need to install the tidyverse and ggdendro R packages as these are used in the example code below, in addition to the microrms package and its dependencies.

Download the archive RMS_tutorial.tar.gz, and unpack it to some folder named RMS_tutorial. It should contain the folders gnm, fastq, fasta, frg and tmp, where the last three are empty. There should also be a file gold_standard.txt. One more folder, ecoli_frg, is used in the second tutorial below.

Start R/RStudio and make the RMS_tutorial your working directory for this R session, e.g.

setwd("C:/my_tutorials/RMS_tutorial")   # edit this

where you need to replace the path above with the correct one depending on where you unzipped the archive.

Create a new R-script, copy the code chunks below into it, and save it in the RMS_tutorial folder. Step through this, and inspect the results.

After some code chunks below there are some output lines, usually starting with ##. Do not copy these into your R-script.

Creating the RMS object

First, we need a collection of sequenced genomes that cover the community we want to study. For this tutorial, we have collected 10 genomes in the gnm/ folder. From every genome, we collect all RMS fragments, cluster these, and create the fragment cluster copy number matrix which is the central data structure of this method. In real life your genome collection would be larger, hundreds or thousands of genomes, depending on the environment you study.

All information is stored in an rms object, which is simply a list with several tables and matrices. It is convenient to have all data assembled into one object like this, and since this is a simple list you have full access to all components.

Collecting RMS fragments

First, we collect the RMS fragments from each genome, and store these as fasta-formatted files in a separate folder. This is something you do once for each genome. You also find the file gnm/genome_table.txt among the genomes, containing a small table with one row for each genome. It is required you have such a metadata table with information about the genomes. This table must contain the columns genome_id and genome_file, but may contain any other genome metadata columns in addition to these.

The genome_id should be a text without spaces, and unique to each genome. It is added to the header-lines of the fragment fasta files, to indicate the genome of origin for all fragments. Here we used the prefix of the genome fasta filenames as genome_id.

The genome_file column specifies the genome fasta files, but without the path to where they are located. Keep it this way. The reason is that we re-use these filenames for the fragment fasta files, but in another folder. Thus, we supply the path as a separate argument when needed.
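The requirements on the genome table can be checked before any heavy computation starts. The following is a small sketch in base R. The function checkGenomeTable() and the toy table are made up for illustration and are not part of microrms:

```r
# Hypothetical checker (not part of microrms) for the requirements above
checkGenomeTable <- function(tbl){
  stopifnot(all(c("genome_id", "genome_file") %in% colnames(tbl)))
  problems <- character(0)
  if(any(duplicated(tbl$genome_id)))
    problems <- c(problems, "genome_id is not unique")
  if(any(grepl("[[:space:]]", tbl$genome_id)))
    problems <- c(problems, "genome_id contains spaces")
  if(any(basename(tbl$genome_file) != tbl$genome_file))
    problems <- c(problems, "genome_file contains a path")
  problems
}

# Made-up toy table with two deliberate mistakes
toy.tbl <- data.frame(genome_id   = c("gnmA", "gnm B"),
                      genome_file = c("gnmA.fna.gz", "gnm/gnmB.fna.gz"),
                      stringsAsFactors = FALSE)
print(checkGenomeTable(toy.tbl))
```

An empty result means the table passed these three checks.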

The first code chunk reads the genomes, collects their RMS fragments, and stores these as fasta files in the frg/ folder:

library(tidyverse)
library(microrms)
gnm.dir <- "gnm"   # genome fasta files are found here
frg.dir <- "frg"   # fragment fasta files end up here
genome.tbl <- suppressMessages(read_delim("gnm/genome_table.txt", delim = "\t"))
for(i in 1:nrow(genome.tbl)){
  readFasta(file.path(gnm.dir, genome.tbl$genome_file[i])) %>% 
    getRMSfragments(genome.id = genome.tbl$genome_id[i]) %>% 
    writeFasta(out.file = file.path(frg.dir, genome.tbl$genome_file[i]))
}

## Genome GCA_003471545.1 : found 1511 RMS-fragments
## Genome GCA_003475135.1 : found 2033 RMS-fragments
## Genome GCA_003463625.1 : found 1621 RMS-fragments
## Genome GCA_003466465.1 : found 2265 RMS-fragments
## Genome GCA_003469365.1 : found 1959 RMS-fragments
## Genome GCA_003474775.1 : found 794 RMS-fragments
## Genome GCA_003470035.1 : found 1283 RMS-fragments
## Genome GCA_003464445.1 : found 2366 RMS-fragments
## Genome GCA_003470645.1 : found 2781 RMS-fragments
## Genome GCA_003464645.1 : found 1893 RMS-fragments

We notice all these genomes have hundreds or even thousands of RMS fragments. If one of these genomes (or a very close relative) is present in a sample, these fragments will be amplified and sequenced in the RMS method.

This job is done once for each genome, and the fragment files are stored (genome files are no longer needed). Thus, if you later want to add more genomes, you only run this for the new genomes. Notice that the fragment files have names identical to the genome files, and must be in a separate folder. Thus, the genome_file column in genome.tbl is used to name both genome and fragment files.

Also, note that the readFasta()/writeFasta() functions from the microseq package handle gzipped files. Since the genome_file names have the extension .gz, the fragment files will also be compressed.

The RMS object

Select the genomes you want to be able to recognize later. This means you may slice() or filter() your genome.tbl to only contain the rows (=genomes) you are interested in. Here we use all genomes.

The creation of the RMS object requires the use of the vsearch software for clustering. The function RMSobject() will not work unless vsearch is a valid command in your system/computer.

rms.obj <- RMSobject(genome.tbl, frg.dir, vsearch.exe = "vsearch", tmp = "tmp")

## VSEARCH clustering of RMS fragments...
## ...produced 17648 clusters
## ...the cluster table...done
## ...the copy number matrix
## genome 1 / 10 genome 2 / 10 genome 3 / 10 genome 4 / 10 genome 5 / 10 genome 6 / 10 genome 7 / 10 genome 8 / 10 genome 9 / 10 genome 10 / 10 
## ...the genome table...done

Note that you need to specify some folder for temporary output ("tmp"). You may delete its content after the computations are finished.

The clustering of these fragments results in more than 17000 clusters, using the default identity, see ?RMSobject. The resulting object rms.obj is a list with two tables and a matrix.

The rms.obj$Genome.tbl is a copy of the genome.tbl with two new columns, and should be inspected right away. The column N_clusters lists the number of fragment clusters in each genome, and N_unique how many of these are unique to each genome:

print(rms.obj$Genome.tbl)

## # A tibble: 10 × 6
##    genome_id       genome_file            tax_id organism_name   N_clu…¹ N_uni…²
##    <chr>           <chr>                   <dbl> <chr>             <int>   <int>
##  1 GCA_003471545.1 GCA_003471545.1.fna.gz    820 Bacteroides un…    1504    1479
##  2 GCA_003475135.1 GCA_003475135.1.fna.gz    821 Bacteroides vu…    2016    1720
##  3 GCA_003463625.1 GCA_003463625.1.fna.gz    823 Parabacteroide…    1600    1576
##  4 GCA_003466465.1 GCA_003466465.1.fna.gz 357276 Bacteroides do…    2250    1946
##  5 GCA_003469365.1 GCA_003469365.1.fna.gz    818 Bacteroides th…    1943    1895
##  6 GCA_003474775.1 GCA_003474775.1.fna.gz  39491 [Eubacterium] …     778     754
##  7 GCA_003470035.1 GCA_003470035.1.fna.gz 360807 Roseburia inul…    1269    1249
##  8 GCA_003464445.1 GCA_003464445.1.fna.gz 371601 Bacteroides xy…    2345    2070
##  9 GCA_003470645.1 GCA_003470645.1.fna.gz  28116 Bacteroides ov…    2754    2480
## 10 GCA_003464645.1 GCA_003464645.1.fna.gz  40520 Blautia obeum …    1870    1847
## # … with abbreviated variable names ¹N_clusters, ²N_unique

In this case there are plenty of unique clusters for all genomes. If two or more genomes are very closely related, this number will shrink towards zero, making the recognition of each impossible.
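To make the two columns concrete, here is a toy sketch in base R of how such numbers can be derived from a copy number matrix. The matrix is made up, and counting a cluster once per genome regardless of its copy number is an assumption about how N_clusters and N_unique are defined:

```r
# Made-up copy number matrix: 5 fragment clusters (rows) x 3 genomes (columns)
cpn <- matrix(c(1, 0, 0,
                1, 1, 0,
                0, 2, 0,
                0, 0, 1,
                0, 1, 1),
              ncol = 3, byrow = TRUE,
              dimnames = list(paste0("CLST", 1:5), c("gnmA", "gnmB", "gnmC")))

present    <- cpn > 0              # is the cluster found in the genome?
n.genomes  <- rowSums(present)     # number of genomes sharing each cluster
N_clusters <- colSums(present)     # clusters per genome
N_unique   <- colSums(present[n.genomes == 1, , drop = FALSE])  # clusters found in one genome only
print(rbind(N_clusters, N_unique))
```

If two columns of the matrix become nearly identical, almost no rows will have n.genomes equal to 1 for those genomes, and N_unique shrinks accordingly.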

The rms.obj$Cluster.tbl is a table listing information about all fragment clusters, one row for each cluster. Here is a listing of the first entries:

head(rms.obj$Cluster.tbl)

## # A tibble: 6 × 7
##   Cluster   Length    GC N.genomes Members                 Header        Seque…¹
##   <chr>      <int> <dbl>     <int> <chr>                   <chr>         <chr>  
## 1 CLST1        498 0.484         1 GCA_003463625.1_RMS774  CLST1;size=1… AATTCC…
## 2 CLST10       497 0.497         1 GCA_003463625.1_RMS1033 CLST10;size=… AATTCT…
## 3 CLST100      491 0.466         1 GCA_003466465.1_RMS149  CLST100;size… AATTCC…
## 4 CLST1000     432 0.442         1 GCA_003470645.1_RMS211  CLST1000;siz… AATTCC…
## 5 CLST10000    141 0.340         1 GCA_003470035.1_RMS606  CLST10000;si… AATTCG…
## 6 CLST10001    141 0.305         1 GCA_003470035.1_RMS632  CLST10001;si… AATTCA…
## # … with abbreviated variable name ¹Sequence

Very similar fragments (identity above the threshold set by the identity argument to RMSobject) will belong to the same cluster. Closely related genomes will typically share several fragments. The Cluster.tbl indicates how many genomes contain each fragment (N.genomes), and also which genomes (Members). Note that this table has a Header and a Sequence column, making it possible to write it to a fasta file using writeFasta().

The rms.obj$Cpn.mat is the copy number matrix. This is a central data structure of the RMS method. It has one row for each fragment cluster, and one column for each genome. Here are the first few rows and columns:

print(rms.obj$Cpn.mat[1:10,1:4])

## 10 x 4 sparse Matrix of class "dgCMatrix"
##           GCA_003471545.1 GCA_003475135.1 GCA_003463625.1 GCA_003466465.1
## CLST1                   .               .               1               .
## CLST10                  .               .               1               .
## CLST100                 .               .               .               1
## CLST1000                .               .               .               .
## CLST10000               .               .               .               .
## CLST10001               .               .               .               .
## CLST10002               .               .               .               .
## CLST10003               .               .               .               .
## CLST10004               .               .               .               .
## CLST10005               .               .               .               .

The copy numbers indicate how many copies of a given fragment cluster are found in a given genome. By far most copy numbers are zero, i.e. a fragment is usually found in one or a few genomes only. For this reason, the Cpn.mat is represented as a sparse matrix in R, using the Matrix package. This saves a lot of memory, at the cost of slightly slower computations. The zero elements are displayed as dots above. Most nonzero elements are 1, but some fragment clusters may occur several times in some genomes.
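The idea behind the copy number matrix can be sketched with a cross-tabulation in base R. The fragment names and cluster assignments below are made up, and a dense matrix is used for brevity where the package uses a sparse one:

```r
# Made-up assignment of fragments to clusters; the genome of origin is the
# part of the fragment name before "_RMS", as in the Members column above
frg <- data.frame(fragment = c("gnmA_RMS1", "gnmA_RMS2", "gnmA_RMS3",
                               "gnmB_RMS1", "gnmB_RMS2"),
                  cluster  = c("CLST1", "CLST2", "CLST2", "CLST2", "CLST3"),
                  stringsAsFactors = FALSE)
frg$genome <- sub("_RMS[0-9]+$", "", frg$fragment)

# Cross-tabulate: rows are clusters, columns are genomes, cells are copy numbers
cpn <- unclass(table(frg$cluster, frg$genome))
print(cpn)
cat("Fraction of zero cells:", mean(cpn == 0), "\n")
```

Here gnmA has two fragments in CLST2, giving a copy number of 2 in that cell.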

Genome similarity and resolution

When estimating the relative abundances of taxa, there is always a limit to the resolution, i.e. we cannot separate genomes that are too similar. This applies to all methods, not only RMS. For 16S sequencing, the resolution is usually at the genus rank, i.e. we cannot in general separate species from each other. With full shotgun sequencing, the resolution is higher, and we can separate species and even some strains within species if the sequencing is deep enough and the strains are different enough. With RMS we may also separate within species.

In this toy example, we have chosen genomes that are all of different species, even if six of them are from the same genus (Bacteroides). Thus, the problem of too similar genomes is probably not a big one here. In the next tutorial below, we dig deeper into this problem.

The similarity between genomes is, in the RMS context, simply the similarity between their respective columns in the copy number matrix (rms.obj$Cpn.mat). If two genomes have very similar columns here, it means they have more or less the same fragments, and by sequencing these we cannot tell whether we are actually seeing one or the other of these two genomes.

One way of quantifying the pairwise difference between the genomes is to compute the correlation distance between all genomes based on the copy number matrix. A distance of 0.0 means identical genomes (identical columns in the copy number matrix). Genomes sharing no fragment clusters get a distance of 1.0 or slightly more, and the largest possible distance of 2.0 corresponds to perfectly negatively correlated copy number vectors. From such distances we can compute a hierarchical clustering and display the genomes in a dendrogram:

library(ggdendro)
D <- corrDist(rms.obj$Cpn.mat)
tree <- hclust(as.dist(D), method = "single")
ggd <- ggdendrogram(dendro_data(tree),
                    rotate = TRUE,
                    theme_dendro = FALSE) +
  labs(x = "", y = "Correlation distance")
print(ggd)

All branches look deep and nice here, with correlation distances above 0.9. If branches get too shallow (correlation distance close to zero) in this tree, the genomes in that clade will be difficult or impossible to separate since they share too many fragments. The solution to this is to cluster the genomes, which is illustrated in the next tutorial below.
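To get a feeling for these distances, here is a toy sketch in base R. It assumes the correlation distance is one minus the Pearson correlation between columns, which is the usual definition but has not been checked against the corrDist() source. The matrix is made up:

```r
# Made-up copy number matrix: 8 fragment clusters (rows) x 4 genomes (columns)
cpn <- cbind(gnmA = c(1, 1, 1, 1, 0, 0, 0, 0),
             gnmB = c(1, 1, 1, 1, 0, 0, 0, 0),   # identical to gnmA
             gnmC = c(1, 1, 1, 0, 1, 0, 0, 0),   # shares 3 of 4 clusters with gnmA
             gnmD = c(0, 0, 0, 0, 0, 1, 1, 0))   # shares nothing with gnmA

D <- 1 - cor(cpn)    # assumed definition of the correlation distance
print(round(D, 2))

# Same clustering as above; the first merge joins the closest pair
tree <- hclust(as.dist(D), method = "single")
print(tree$merge)
```

The identical pair gnmA and gnmB has distance zero and forms the shallowest branch, while gnmD, which shares nothing with the others, ends up at a distance above 1.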

Processing reads

The read processing means essentially taking the fastq files from sequencing as input and producing a fasta file as output, for each sample. This step is independent of what we did above. You only do this once for each sample, and it is the resulting fasta files we use in the downstream analysis.

The data from sequencing are the paired fastq files in the folder fastq/. Again, it is recommended to have a table like the one in fastq/sample_table.txt with metadata about each sample. There should always be a column sample_id with a unique text for each sample. Also, this sample.tbl has two columns R1_file and R2_file specifying the corresponding fastq file names.

In addition, we create the required column fasta_file below, containing the names of the resulting fasta files with reads for each sample. Here we create this from the sample_id, which means this text must be usable as a file name prefix (e.g. no / or spaces inside). If the table already contains this column, there is no need to create it. Again, paths should not be part of any file names; we supply them as separate inputs. Below we output these files to the fasta folder.
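A quick check of the sample IDs can save trouble later. This is a sketch in base R with a made-up table. The set of allowed characters (letters, digits, dot, underscore and hyphen) is an assumption chosen to be safe on most file systems, not a rule imposed by microrms:

```r
# Made-up sample table; only sample_id is needed for this step
sample.tbl <- data.frame(sample_id = c("Sample_1", "Sample_2"),
                         stringsAsFactors = FALSE)

# IDs must be unique and usable as file name prefixes
id.pattern <- "^[A-Za-z0-9._-]+$"
ok <- !duplicated(sample.tbl$sample_id) & grepl(id.pattern, sample.tbl$sample_id)
stopifnot(all(ok))

# File names only, no path; the folder is supplied separately when needed
sample.tbl$fasta_file <- paste0(sample.tbl$sample_id, ".fasta")
print(file.path("fasta", sample.tbl$fasta_file))
```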

There is no R function for doing the read processing, since this may be done in many different ways. Here is an R script with some suggested code for doing this processing using the vsearch software. Note the explicit decompression of the fastq files. This is only added here in case you run this on a Windows 10 computer, on which we have found vsearch is not capable of reading gzipped files. On all other systems these code chunks should be deleted, and vsearch will read compressed files directly. The (long) screen output from this is hidden in this document:

#################
### The settings
vsearch.exe <- "vsearch"  # command to start vsearch
fq.dir <- "fastq"         # path to folder with (input) fastq files
fa.dir <- "fasta"         # path to folder with (output) fasta files
tmp.dir <- "tmp"          # temporary files, delete in the end
PCR.forward.primer <- "GACTGCGTACCAATTC"
PCR.reverse.primer <- "GATGAGTCCTGAGTAA"
min.read.length <- 30
maxee <- 0.02

#####################
### The sample table
sample.tbl <- suppressMessages(read_delim("fastq/sample_table.txt",
                                          delim = "\t")) %>% 
  mutate(fasta_file = str_c(sample_id, ".fasta"))

############################################################
### Looping over all samples
### 1) Filtering by maxee, discarding read-pairs
### 2) Merging read-pairs
### 3) Trimming primers from merged reads
### 4) Trimming primers from un-merged reads
### 5) Writing all reads to fasta-file
### 6) De-replicating and saving one fasta-file per sample
###########################################################
Nf <- str_length(PCR.forward.primer)
Nr <- str_length(PCR.reverse.primer)
for(i in 1:nrow(sample.tbl)){
  ##-- code chunk only needed for Windows 10 computers
  R.utils::gunzip(file.path(fq.dir, sample.tbl$R1_file[i]))
  sample.tbl$R1_file[i] <- str_remove(sample.tbl$R1_file[i], "\\.gz$")
  R.utils::gunzip(file.path(fq.dir, sample.tbl$R2_file[i]))
  sample.tbl$R2_file[i] <- str_remove(sample.tbl$R2_file[i], "\\.gz$")
  ##-- end code chunk for Windows 10
  
  cat("\n\n##### VSEARCH quality filtering sample", sample.tbl$sample_id[i], "...\n")
  cmd <- paste(vsearch.exe,
               "--fastq_filter", file.path(fq.dir, sample.tbl$R1_file[i]),
               "--reverse",      file.path(fq.dir, sample.tbl$R2_file[i]),
               "--fastq_maxee_rate", maxee,
               "--fastqout", file.path(tmp.dir, "filtered_R1.fq"),
               "--fastqout_rev", file.path(tmp.dir, "filtered_R2.fq"))
  system(cmd)
  
  cat("\n\n##### VSEARCH merging read-pairs...\n")
  cmd <- paste(vsearch.exe,
               "--fastq_mergepairs", file.path(tmp.dir, "filtered_R1.fq"),
               "--reverse",          file.path(tmp.dir, "filtered_R2.fq"),
               "--fastq_allowmergestagger",
               "--fastq_minmergelen", min.read.length,
               "--fastaout", file.path(tmp.dir, "merged.fa"),
               "--fastqout_notmerged_fwd", file.path(tmp.dir, "notmerged_R1.fq"),
               "--fastqout_notmerged_rev", file.path(tmp.dir, "notmerged_R2.fq"))
  system(cmd)
  
  cat("\n\n##### VSEARCH trimming primers from merged reads...\n")
  cmd <- paste(vsearch.exe,
               "--fastx_filter", file.path(tmp.dir, "merged.fa"),
               "--fastq_stripleft",  Nf,
               "--fastq_stripright", Nr,
               "--fastq_minlen", min.read.length,
               "--relabel", "'size=2;pair'",
               "--fastaout", file.path(tmp.dir, "merged_filt.fa"))
  system(cmd)
  
  cat("\n\n##### VSEARCH trimming primers from un-merged reads...\n")
  cmd <- paste(vsearch.exe,
               "--fastq_filter", file.path(tmp.dir, "notmerged_R1.fq"),
               "--fastq_stripleft", Nf,
               "--fastq_minlen", min.read.length,
               "--relabel", str_c("'size=1;notmerged_R1_'"),
               "--fastaout", file.path(tmp.dir, "notmerged_R1_filt.fa"))
  system(cmd)
  cmd <- paste(vsearch.exe,
               "--fastq_filter", file.path(tmp.dir, "notmerged_R2.fq"),
               "--fastq_stripleft", Nr,
               "--fastq_minlen", min.read.length,
               "--relabel", str_c("'size=1;notmerged_R2_'"),
               "--fastaout", file.path(tmp.dir, "notmerged_R2_filt.fa"))
  system(cmd)
  
  cat("\n\n##### VSEARCH adding all reads to one fasta-file...\n")
  ok <- file.append(file1 = file.path(tmp.dir, "merged_filt.fa"),
                    file2 = file.path(tmp.dir, "notmerged_R1_filt.fa"))
  cmd <- paste(vsearch.exe,
               "--fastx_revcomp", file.path(tmp.dir, "notmerged_R2_filt.fa"),
               "--fastaout", file.path(tmp.dir, "notmerged_R2_filt_rc.fa"))
  system(cmd)
  ok <- file.append(file1 = file.path(tmp.dir, "merged_filt.fa"),
                    file2 = file.path(tmp.dir, "notmerged_R2_filt_rc.fa"))
 
  cat("\n\n##### VSEARCH de-replicating sample", sample.tbl$sample_id[i], "...\n")
  cmd <- paste(vsearch.exe,
               "--derep_fulllength", file.path(tmp.dir, "merged_filt.fa"),
               "--minuniquesize", 1,
               "--minseqlength", min.read.length,
               "--sizein --sizeout",
               "--relabel", str_c(sample.tbl$sample_id[i], ":uread_"),
               "--output", file.path(fa.dir, sample.tbl$fasta_file[i]))
  system(cmd)
  
  ##-- code chunk only needed for Windows 10 computers
  R.utils::gzip(file.path(fq.dir, sample.tbl$R1_file[i]))
  sample.tbl$R1_file[i] <- str_c(sample.tbl$R1_file[i], ".gz")
  R.utils::gzip(file.path(fq.dir, sample.tbl$R2_file[i]))
  sample.tbl$R2_file[i] <- str_c(sample.tbl$R2_file[i], ".gz")
  ##-- end code chunk for Windows 10
}

Note that you should have created the folders fasta and tmp before running this script. The first is where the resulting fasta files appear. The second holds temporary files. They may be nice to have for debugging, in case something goes wrong, but should be deleted in the end.

In this script we remove primers from the reads, but in many cases this has already been done as part of de-multiplexing, and then these steps should be omitted. We also set the minimum read length to 30 and the maximum error rate to 0.02. These may be edited, depending on your data. The fastq files in fastq/ are compressed (.gz), which is very common. However, we have found that vsearch has some problems reading gzipped files under Windows 10. If you run this on a Windows 10 computer, we recommend decompressing the files used as input to vsearch, to be on the safe side, e.g. by using gunzip() and gzip() from the R.utils package as in the script above.
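The maximum error rate refers to the expected number of errors per base, computed from the quality scores of a read. Here is a sketch in base R of that quantity, assuming the common Phred+33 quality encoding. The function eeRate() and the quality strings are made up for illustration:

```r
# Expected errors per base for one read, from its quality string.
# Assumes Phred+33 encoding; the function name is made up for this sketch.
eeRate <- function(quality){
  Q <- utf8ToInt(quality) - 33     # Phred scores
  p <- 10^(-Q / 10)                # error probability of each base
  sum(p) / length(p)               # expected errors per base
}

maxee <- 0.02
qual  <- c(good = "IIIIIIIIII", poor = "IIIII#####")
rates <- sapply(qual, eeRate)
print(rates)
print(rates <= maxee)   # TRUE means the read would pass this threshold
```

A read with high scores throughout passes easily, while a read with a low-quality tail exceeds the threshold. A stricter value of maxee discards more reads.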

Before we are done with this step, we add the sample.tbl to our rms.obj. Note that sample.tbl must have at least the two columns sample_id and fasta_file for the downstream analysis (the R1_file and R2_file may still be present but are no longer needed):

rms.obj <- addSampleTable(rms.obj, sample.tbl)

In this way we have the information about our samples in the same object as all other information. Inspect the rms.obj to verify the Sample.tbl is now another element in this list. Note that the function used above will replace an existing Sample.tbl inside the rms.obj if the latter already contains such a table. This means you may re-use the genome collection (and the corresponding Genome.tbl, Cluster.tbl and Cpn.mat) with other samples (reads) by adding a different Sample.tbl.

Mapping reads to clusters

The next step is to map reads from each sample to the fragment clusters, and obtain a readcount matrix. This matrix has one column for each sample, and one row for each fragment cluster, similar to an OTU or ASV matrix for 16S amplicon data.
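The structure of such a matrix can be illustrated with a toy tabulation in base R. The table of hits below is made up and does not reflect the actual output format of vsearch or the internals of readMapper():

```r
# Made-up mapping result: one row per unique read that hit a cluster,
# with 'size' being the number of identical reads it represents
hits <- data.frame(sample_id = c("Sample_1", "Sample_1", "Sample_1", "Sample_2"),
                   cluster   = c("CLST1", "CLST1", "CLST3", "CLST2"),
                   size      = c(2, 5, 1, 4),
                   stringsAsFactors = FALSE)
clusters <- c("CLST1", "CLST2", "CLST3")
samples  <- c("Sample_1", "Sample_2")

# Sum read sizes per cluster and sample; factor levels keep empty cells as zero
tab <- xtabs(size ~ factor(cluster, levels = clusters) +
               factor(sample_id, levels = samples), data = hits)
readcount <- matrix(tab, nrow = length(clusters),
                    dimnames = list(clusters, samples))
print(readcount)
```

Clusters without any hits in a sample get a readcount of zero, so every sample has a value for every cluster.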

The readcount matrix

The readMapper() function needs the rms.obj with information about fragment clusters (rms.obj$Cluster.tbl) and samples (rms.obj$Sample.tbl), and uses vsearch to search with reads against the fragment cluster centroids. The path to the fasta files with processed reads is also required, since the rms.obj$Sample.tbl has the filenames, but not their path.

rms.obj <- readMapper(rms.obj, fa.dir, vsearch.exe = "vsearch", tmp.dir = "tmp")

## Mapping reads from sample Sample_1 ...
## Mapping reads from sample Sample_2 ...
## Mapping reads from sample Sample_3 ...
## Mapping reads from sample Sample_4 ...

The matrix Readcount.mat is added as a new component to the returned rms.obj. There should be one column for each sample and one row for each fragment cluster. Here are the first rows:

print(rms.obj$Readcount.mat[1:10,1:4])

##           Sample_1 Sample_2 Sample_3 Sample_4
## CLST1            2        0        0        0
## CLST10           0        0        0        0
## CLST100          0       10        6       16
## CLST1000         0        0        0        0
## CLST10000        0        0        8        4
## CLST10001        4        0       27       24
## CLST10002       15        5       84       82
## CLST10003       28        4       19        2
## CLST10004       66       11       35        5
## CLST10005       18        2        4        0

The readMapper() also adds information about how many reads are in each sample, and how many mapped to some RMS fragments. This appears as two new columns in the rms.obj$Sample.tbl:

print(rms.obj$Sample.tbl)

## # A tibble: 4 × 6
##   sample_id R1_file              R2_file              fasta_file reads…¹ reads…²
##   <chr>     <chr>                <chr>                <chr>        <dbl>   <dbl>
## 1 Sample_1  Sample_1_R1.fastq.gz Sample_1_R2.fastq.gz Sample_1.…  123977  121281
## 2 Sample_2  Sample_2_R1.fastq.gz Sample_2_R2.fastq.gz Sample_2.…  123454  120746
## 3 Sample_3  Sample_3_R1.fastq.gz Sample_3_R2.fastq.gz Sample_3.…  122017  119723
## 4 Sample_4  Sample_4_R1.fastq.gz Sample_4_R2.fastq.gz Sample_4.…  122144  119896
## # … with abbreviated variable names ¹reads_total, ²reads_mapped

Length normalization

We must expect the readcounts from RMS amplicons to have some length bias, due to the PCR amplification. We may plot and see if this is indeed the case:

rms.obj$Cluster.tbl %>% 
  select(Length) %>% 
  bind_cols(as_tibble(rms.obj$Readcount.mat)) %>% 
  pivot_longer(cols = -Length, names_to = "Sample", values_to = "Readcounts") %>%  
  ggplot() +
  geom_point(aes(x = Length, y = Readcounts), alpha = 0.3) +
  scale_y_log10() +
  facet_wrap(~Sample) -> plt
print(plt)

## Warning: Transformation introduced infinite values in continuous y-axis

Note the log-transformed y-axes. Since we have limited fragments to lengths of 30 to 500 only, the bias is not severe, but the shortest and longest fragments have slightly lower readcounts. Let us try to normalize:

rms.obj.norm <- normLength(rms.obj)

We make the same plot with the normalized data:

rms.obj.norm$Cluster.tbl %>% 
  select(Length) %>% 
  bind_cols(as_tibble(rms.obj.norm$Readcount.mat)) %>% 
  pivot_longer(cols = -Length, names_to = "Sample", values_to = "Readcounts") -> tbl
plt %+% tbl %>% print

## Warning: Transformation introduced infinite values in continuous y-axis

There is some effect on the most extreme lengths, as usual. Note the log-transformed y-axes, giving the illusion of large distortions in the smallest readcounts. They actually change little compared to the more normal readcount values.

If we decide to stick to the normalized data, we do not need both rms.obj and rms.obj.norm:

rms.obj <- rms.obj.norm
rm(rms.obj.norm)

Estimating abundances

Finally, we estimate the abundances of the various genomes in our samples.

The Constrained Ordinary Least Squares (COLS)

Abundance estimation is done by the rmscols() function. The supplied rms.obj must contain the cluster copy number matrix rms.obj$Cpn.mat and the readcounts rms.obj$Readcount.mat. The idea is to look for a linear combination of genome abundances that, given cluster copy numbers, best explains the observed readcounts in a sample:

abd.mat <- rmscols(rms.obj)

## De-convolving sample Sample_1 having 163245.1 reads mapped to 11091 clusters...
##    initial estimate...
##    constrained optimization...
## final  value 3194067.591337 
## converged
## De-convolving sample Sample_2 having 154517.2 reads mapped to 11204 clusters...
##    initial estimate...
##    constrained optimization...
## final  value 2342174.199609 
## converged
## De-convolving sample Sample_3 having 160493.8 reads mapped to 8996 clusters...
##    initial estimate...
##    constrained optimization...
## iter    1 value 3428712.642983
## final  value 3428712.642983 
## converged
## De-convolving sample Sample_4 having 178379.8 reads mapped to 7851 clusters...
##    initial estimate...
##    constrained optimization...
## iter    1 value 5677208.081404
## final  value 5677208.081404 
## converged

The abd.mat is a matrix with one row for each genome, and one column for each sample. The numbers in a column are the relative abundances for each genome in the corresponding sample.
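The idea of a constrained least squares fit can be sketched in base R on made-up numbers. This is not the rmscols() implementation, only an illustration of the principle: find non-negative genome abundances whose linear combination of copy number columns best matches the readcounts:

```r
# Made-up copy number matrix: 8 fragment clusters (rows) x 3 genomes (columns)
C <- cbind(gnmA = c(1, 1, 1, 0, 0, 0, 0, 1),
           gnmB = c(0, 0, 1, 1, 1, 0, 0, 1),
           gnmC = c(0, 0, 0, 0, 1, 1, 2, 0))
# Made-up readcounts: roughly 70% gnmA, 30% gnmB, no gnmC, plus a little noise
y <- c(705, 695, 1003, 297, 296, 0, 0, 996)

rss  <- function(b) sum((y - C %*% b)^2)                    # residual sum of squares
grad <- function(b) as.vector(-2 * t(C) %*% (y - C %*% b))  # its gradient

# Initial estimate: ordinary least squares, with negative values set to zero
b0 <- pmax(lm.fit(C, y)$coefficients, 0)

# Constrained optimization: abundances are not allowed below zero
fit <- optim(par = b0, fn = rss, gr = grad, method = "L-BFGS-B", lower = 0)

abd <- fit$par / sum(fit$par)    # scale to relative abundances
print(round(abd, 3))
```

In this toy case the unconstrained least squares estimate for gnmC is slightly negative, which makes no sense for an abundance. The constraint keeps it at zero instead.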

We may plot the results as stacked bar charts:

abd.mat %>% 
  as_tibble(rownames = "genome_id") %>% 
  pivot_longer(cols = -genome_id, names_to = "sample_id", values_to = "Estimated") -> long.tbl

ggplot(long.tbl) +
  geom_col(aes(x = sample_id, y = Estimated, fill = genome_id), color = "black")

Comparing to gold standard

The sequencing fastq files in fastq/ are simulated data. The art software (https://www.ncbi.nlm.nih.gov/pubmed/22199392) was used to simulate Illumina HiSeq reads. Biases typical for RMS data (i.e. due to fragment length and fragment-specific amplification bias) were added as described in Snipen et al, 2020.

The file gold_standard.txt contains the actual relative abundances of all genomes in all samples. Let us compare the estimated abundances from above to this, and also replace genome_id with organism_name in the figure legend:

suppressMessages(read_delim("gold_standard.txt", delim = "\t")) %>%
  left_join(rms.obj$Genome.tbl, by = "genome_id") %>% 
  select(genome_id, organism_name, starts_with("Sample")) %>% 
  pivot_longer(cols = c(-genome_id, -organism_name), names_to = "sample_id", values_to = "Gold.standard") %>% 
  full_join(long.tbl, by = c("genome_id", "sample_id")) %>% 
  pivot_longer(cols = c(-genome_id, -sample_id, -organism_name), names_to = "Type", values_to = "Abundance") %>% 
  ggplot() +
  geom_col(aes(x = Type, y = Abundance, fill = organism_name), color = "black") +
  facet_wrap(~sample_id)

New data

This forms a code template, and by replacing the genomes in gnm/ by your genomes, and the sequencing data in fastq/ by your own data, it should be possible to run an analysis.

Beware that for large collections of genomes (thousands), computations will be much slower than in this tutorial. You may also need a computer with a lot of memory, even if the copy number matrix has been implemented as a sparse matrix here.

Tutorial 2 - subspecies resolution

As mentioned above, the RMS method has the potential for estimating abundances below the species rank, i.e. separating strains. This is in fact one of the strengths of RMS. The reason for this sensitivity is the fact that we know a priori which fragments belong to which genomes.

If two genomes are very similar, they will also share many RMS fragments. As with all methods, there will always be a lower resolution, i.e. it is impossible to ‘see’ the difference between two identical genomes! How close can two genomes be, and still be separated by RMS?

The key to the resolution by RMS is the copy number matrix mentioned above. When estimating the abundances, its covariance matrix must be inverted. If two or more genomes are too similar, the corresponding columns of the copy number matrix are also very similar. This leads to a singular, or close to singular, covariance matrix that cannot be inverted. Even if the matrix can be inverted, similar columns mean the results will be very unstable, leading to poor estimates. This effect can be measured by the condition number of the covariance matrix of the copy number matrix. If this is very large, the results will be unstable.
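The effect of similar columns on the condition number can be seen in a toy sketch in base R. The matrices are made up, and the cross-product matrix t(C) %*% C from the least squares normal equations is used here as the matrix to be inverted, which is an assumption about the exact matrix involved:

```r
# Made-up copy number matrices: 6 fragment clusters (rows) x 2 genomes (columns)
C.distinct <- cbind(gnmA = c(1, 1, 1, 0, 0, 0),
                    gnmB = c(0, 0, 0, 1, 1, 1))   # no shared clusters
C.similar  <- cbind(gnmA = c(1, 1, 1, 1, 1, 0),
                    gnmB = c(1, 1, 1, 1, 1, 1))   # differ in a single cluster

# Condition number (largest / smallest singular value) of t(C) %*% C
condNumber <- function(C) kappa(crossprod(C), exact = TRUE)
print(c(distinct = condNumber(C.distinct), similar = condNumber(C.similar)))
```

The two genomes that differ in a single cluster give a much larger condition number than the two that share nothing. With identical columns the matrix is singular and the condition number is infinite or numerically huge.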

In the folder ecoli_frg we have the RMS fragment files for a random selection of 50 Escherichia coli genomes. It also contains the genome table, and we use this to create an RMS object:

library(tidyverse)
library(microrms)
ecoli.tbl <- read_delim("ecoli_frg/genome_table.txt", delim = "\t")
ecoli.rms.obj <- RMSobject(ecoli.tbl, frg.dir = "ecoli_frg", vsearch.exe = "vsearch", tmp = "tmp")

## VSEARCH clustering of RMS fragments...
## ...produced 8602 clusters
## ...the cluster table...done
## ...the copy number matrix
## genome 1 / 50 genome 2 / 50 genome 3 / 50 genome 4 / 50 genome 5 / 50 genome 6 / 50 genome 7 / 50 genome 8 / 50 genome 9 / 50 genome 10 / 50 genome 11 / 50 genome 12 / 50 genome 13 / 50 genome 14 / 50 genome 15 / 50 genome 16 / 50 genome 17 / 50 genome 18 / 50 genome 19 / 50 genome 20 / 50 genome 21 / 50 genome 22 / 50 genome 23 / 50 genome 24 / 50 genome 25 / 50 genome 26 / 50 genome 27 / 50 genome 28 / 50 genome 29 / 50 genome 30 / 50 genome 31 / 50 genome 32 / 50 genome 33 / 50 genome 34 / 50 genome 35 / 50 genome 36 / 50 genome 37 / 50 genome 38 / 50 genome 39 / 50 genome 40 / 50 genome 41 / 50 genome 42 / 50 genome 43 / 50 genome 44 / 50 genome 45 / 50 genome 46 / 50 genome 47 / 50 genome 48 / 50 genome 49 / 50 genome 50 / 50 
## ...the genome table...done

Notice that these 50 genomes produce ‘only’ 8602 fragment clusters, which is half of what the 10 genomes in tutorial 1 produced. This is of course because here all genomes are from the same species, and they share many fragments. By inspecting the Genome.tbl we can see how many unique clusters each genome has:

print(ecoli.rms.obj$Genome.tbl)

## # A tibble: 50 × 5
##    genome_id                   genome_file               organ…¹ N_clu…² N_uni…³
##    <chr>                       <chr>                     <chr>     <int>   <int>
##  1 GCA_001651945.2_ASM165194v2 GCA_001651945.2_ASM16519… Escher…    1093       6
##  2 GCA_002007705.1_ASM200770v1 GCA_002007705.1_ASM20077… Escher…    1073      81
##  3 GCA_001612495.1_ASM161249v1 GCA_001612495.1_ASM16124… Escher…    1101      20
##  4 GCA_000599645.1_ASM59964v1  GCA_000599645.1_ASM59964… Escher…     939      78
##  5 GCA_003956305.1_ASM395630v1 GCA_003956305.1_ASM39563… Escher…     985      87
##  6 GCA_009909465.1_ASM990946v1 GCA_009909465.1_ASM99094… Escher…    1018     118
##  7 GCA_004299805.1_ASM429980v1 GCA_004299805.1_ASM42998… Escher…    1084      12
##  8 GCA_001683435.1_ASM168343v1 GCA_001683435.1_ASM16834… Escher…    1053     182
##  9 GCA_000813165.1_ASM81316v1  GCA_000813165.1_ASM81316… Escher…    1010      34
## 10 GCA_009762415.1_ASM976241v1 GCA_009762415.1_ASM97624… Escher…    1085     156
## # … with 40 more rows, and abbreviated variable names ¹organism_name,
## #   ²N_clusters, ³N_unique

The last two columns tell us there are around 1000 fragments in an E. coli genome, but the number of unique fragments is much smaller, even down to zero in a couple of cases. Thus, we cannot expect to be able to separate all these genomes by RMS.

Let us compute correlation distances, and make a dendrogram, like we did in tutorial 1:

library(ggdendro)
d <- corrDist(ecoli.rms.obj$Cpn.mat)
tree <- hclust(as.dist(d), method = "single")
ggd <- ggdendrogram(dendro_data(tree),
                    rotate = TRUE,
                    theme_dendro = FALSE) +
  labs(x = "", y = "Correlation distance", title = "All E. coli genomes")
print(ggd)

It is clear that some distances are close to zero. We may also compute the condition value from the copy number matrix:

print(conditionValue(ecoli.rms.obj$Cpn.mat))

## [1] 9056.452
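To see which genomes are responsible for the smallest distances, one could list the closest pairs. The sketch below uses a made-up distance matrix with hypothetical genome names g1 to g4, and assumes the distances are stored as a symmetric matrix with genome identifiers as row and column names, which may differ from what corrDist() actually returns.

```r
# Toy symmetric distance matrix between four genomes (values are made up)
d.toy <- matrix(c(0.00, 0.02, 0.60, 0.75,
                  0.02, 0.00, 0.58, 0.70,
                  0.60, 0.58, 0.00, 0.10,
                  0.75, 0.70, 0.10, 0.00),
                nrow = 4,
                dimnames = list(paste0("g", 1:4), paste0("g", 1:4)))

# Row/column positions of the upper triangle, so each pair is listed once
idx <- which(upper.tri(d.toy), arr.ind = TRUE)
pairs <- data.frame(genome_a = rownames(d.toy)[idx[, "row"]],
                    genome_b = colnames(d.toy)[idx[, "col"]],
                    distance = d.toy[idx])

# Sort so that the most similar pairs come first
pairs <- pairs[order(pairs$distance), ]
head(pairs, 3)
```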

How large a condition value can we tolerate? The perfect value is 1, but this is never achieved with real data. A value below 10 is extremely good. The condition value for the copy number matrix in tutorial 1 is around 3.7, and we saw how all those genomes separated nicely in the dendrogram. Condition values below 100 or even 1000 are still quite acceptable, and even 10 000 is not all that bad. Going above this, we must expect substantial errors in some abundance estimates. When later running rmscols() to estimate abundances, the number of iterations is also an indication of how separable the genomes are. A larger condition value means more iterations are needed, and less precise estimates.

Let us ‘prune’ the genome collection to get a smaller set of genomes that we are better able to separate. This is done by genome clustering. Genomes that are too similar are clustered into a group, and only the centroid genome of this group is used in the analysis, as a representative for the group. We use the function genomeClustering() for this:

ecoli.rms.obj.cls100 <- genomeClustering(ecoli.rms.obj, max.cond = 100, verbose = TRUE)

## genomeClustering:
## ...starts with 50 genomes...
## ...computing correlation distances...
## ...finding maximum condition value = 9056.452 ...
## ...finding minimum condition value = 1.938526 ...
## ...searching for optimal clustering...
## ... 19 clusters...
## ... 33 clusters...
## ... 26 clusters...
## ... 28 clusters...
## ... 27 clusters...
## ... 28 clusters...
## ... 28 clusters...
## ...finding cluster members and medoides...
## ...pruning the data structure...
## ...updating N_unique...
## ...done!

Here we specified that we tolerate a maximum condition value of 100. You would typically choose values of $10^1$, $10^2$, $10^3$ or $10^4$ here, although any value above 1 is allowed. The resulting ecoli.rms.obj.cls100 has a new Genome.tbl that we should inspect:

print(ecoli.rms.obj.cls100$Genome.tbl)

## # A tibble: 27 × 6
##    genome_id                   members_genome_id genom…¹ organ…² N_clu…³ N_uni…⁴
##    <chr>                       <chr>             <chr>   <chr>     <int>   <int>
##  1 GCA_001651945.2_ASM165194v2 GCA_001651945.2_… GCA_00… Escher…    1093     380
##  2 GCA_002899535.1_ASM289953v1 GCA_002007705.1_… GCA_00… Escher…     908      52
##  3 GCA_000599645.1_ASM59964v1  GCA_000599645.1_… GCA_00… Escher…     939      85
##  4 GCA_009909465.1_ASM990946v1 GCA_009909465.1_… GCA_00… Escher…    1018     121
##  5 GCA_009577985.1_ASM957798v1 GCA_004299805.1_… GCA_00… Escher…    1074     323
##  6 GCA_001683435.1_ASM168343v1 GCA_001683435.1_… GCA_00… Escher…    1053     202
##  7 GCA_001021615.1_APECO18     GCA_000813165.1_… GCA_00… Escher…    1001     143
##  8 GCA_009762415.1_ASM976241v1 GCA_009762415.1_… GCA_00… Escher…    1085     165
##  9 GCA_003355175.1_ASM335517v1 GCA_002899475.1_… GCA_00… Escher…     954      98
## 10 GCA_002165115.2_ASM216511v2 GCA_002165115.2_… GCA_00… Escher…    1031     120
## # … with 17 more rows, and abbreviated variable names ¹genome_file,
## #   ²organism_name, ³N_clusters, ⁴N_unique

It has 27 rows instead of the original 50, i.e. 23 genomes fewer. Each of these ‘lost’ genomes is now represented by another genome, and the new column members_genome_id lists the original genome_id of all members of each cluster (comma separated). Also, notice there are now many more unique fragments for each cluster centroid genome.
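The idea behind this pruning can be sketched in base R on a toy example. This is only an illustration under stated assumptions, not the algorithm inside genomeClustering(): it assumes correlation distance is one minus the Pearson correlation, uses single linkage clustering, scans over the number of clusters from largest to smallest, and relies on the hypothetical helpers cond_number() and medoids().

```r
set.seed(42)
# Toy copy number matrix: 300 fragment clusters x 6 genomes,
# where genomes g4-g6 are near-copies of g1-g3 (5 entries flipped in each)
base <- matrix(rbinom(900, size = 1, prob = 0.3), nrow = 300)
noisy <- base
flip <- cbind(sample(300, 15), rep(1:3, 5))
noisy[flip] <- 1 - noisy[flip]
X <- cbind(base, noisy)
colnames(X) <- paste0("g", 1:6)

# Hypothetical helper: ratio of largest to smallest singular value of M'M
cond_number <- function(M) {
  s <- svd(crossprod(M))$d
  max(s) / min(s)
}

# Assumed correlation distance, and a tree over the genomes
D <- 1 - cor(X)
tree <- hclust(as.dist(D), method = "single")

# Hypothetical helper: for k clusters, pick the member of each cluster
# with the smallest summed distance to the other members (the medoid)
medoids <- function(k) {
  grp <- cutree(tree, k = k)
  sapply(split(seq_along(grp), grp), function(idx) {
    idx[which.min(rowSums(D[idx, idx, drop = FALSE]))]
  })
}

# Keep the largest number of clusters that meets the tolerance
max.cond <- 10
for (k in ncol(X):1) {
  keep <- medoids(k)
  if (cond_number(X[, keep, drop = FALSE]) <= max.cond) break
}
colnames(X)[keep]
```

In this toy case each near-duplicate pair collapses to a single representative, mirroring how the 50 E. coli genomes were reduced to 27 clusters above.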

We may now re-compute the condition value and the dendrogram:

print(conditionValue(ecoli.rms.obj.cls100$Cpn.mat))

## [1] 96.16339

d <- corrDist(ecoli.rms.obj.cls100$Cpn.mat)
tree <- hclust(as.dist(d), method = "single")
ggd <- ggdendrogram(dendro_data(tree),
                    rotate = TRUE,
                    theme_dendro = FALSE) +
  labs(x = "", y = "Correlation distance", title = "Clustered E. coli genomes")
print(ggd)

Note that no correlation distances are now below 0.20, and we expect to be able to separate these genomes fairly well. In Snipen et al. (2021) we demonstrate this on a much larger collection of E. coli genomes. Note also that this is something you do in silico prior to any experimental efforts, since it only involves the sequenced genomes. Having sequenced some samples, you proceed as in tutorial 1 to estimate how abundant these clusters are in the samples.
