nanxstats / protr

🧬 Toolkit for generating various numerical features of protein sequences

Home Page:https://nanx.me/protr/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Efficient way to compare 2 sets of multiple sequences

seb-mueller opened this issue · comments

Really nice package, I've one question though.

Is there are parallel version of to compare to sets of sequences?
I know there is twoSeqSim to compare 2 sequences and parSeqSim which takes one set of sequences and compares them pairwise. However if I have a 2 sets, say set A with seq1 and seq2, as well as set B with seq3 and seq4. I want to compare each member of set A with each member of set B, but don't want to compare within the 2 sets.
Is there an efficient (also using parallel execution) way to do so without having to loop using twoSeqSim?

@seb-mueller Great question. You can probably reuse protr:::.seqPairSim() which is the backbone of parSeqSim(). This is a function designed to compute the similarity between two protein sequences. It takes in the indices of two sequences, the list of sequences, and parameters for alignment. The output is a similarity score.

Here is a simple code sketch - I haven't validated its correctness but you can easily verify if it works as expected:

crossSetSim <- function(
    set1, set2,
    type = "local",
    submat = "BLOSUM62",
    gap.opening = 10,
    gap.extension = 4) {
  combinations <- expand.grid(seq_along(set1), seq_along(set2))

  results <- foreach(
    i = seq_len(nrow(combinations)),
    .combine = c,
    .packages = c("Biostrings")
  ) %dopar% {
    idx1 <- combinations[i, 1]
    idx2 <- combinations[i, 2]

    protr:::.seqPairSim(
      c(idx1, idx2),
      c(set1, set2),
      type,
      submat,
      gap.opening,
      gap.extension
    )
  }

  matrix(
    results,
    nrow = length(set1),
    ncol = length(set2),
    byrow = TRUE
  )
}

Example:

library(protr)
library(foreach)
library(doParallel)

doParallel::registerDoParallel(2)

s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]]
s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]]
s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]]
s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]]
s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]]

set.seed(42)
plist1 <- as.list(c(s1, s2, s3, s4, s5)[sample(1:5, 10, replace = TRUE)])
plist2 <- as.list(c(s1, s2, s3, s4, s5)[sample(1:5, 20, replace = TRUE)])

psimmat <- crossSetSim(plist1, plist2)

Hi @nanxstats , thanks so much for you quick response and the great suggestion.
I've tested your solution, and whilst it did return numbers, there was something off. After some debugging, I suppose there was only one minor mistake:
Instead of
c(idx1, idx2),
it needs to be:
c(idx1, idx2 + length(set1)),

The index of set2 needs to be offset by the length of set1 to make it work in the combined vector. I'll have to run some more test, but it should be ok like this.
Maybe this can be even added into the package since I believe this is a common usecase? Happy to create a PR if so.

@seb-mueller Perfect. I agree that this will be a very useful feature. Please send a PR when you are ready. (Also, feel free to add yourself as ctb to DESCRIPTION.)

This feature has been shipped with protr 1.7-0, now on CRAN: https://cran.r-project.org/package=protr