sequence encoding feature

Question

arronlacey opened this issue 7 years ago

Hi - in the CRAN vignette for the protr package there is a line: "For example, for a given sequence MTEITAAMVKELRESTGAGA, it will be encoded as 32132223311311222222 according to its hydrophobicity."

However, I don't see a function in protr that does this. It would be helpful to be able to obtain the raw encoding of a sequence across all CTD descriptors.

Thanks.

Nan Xiao · Answer 1 · Tue Mar 20 2018 01:27:16 GMT+0800 (China Standard Time)

@arronlacey - sounds reasonable. I'm not sure why there wasn't such a function back then (probably because six years ago we did not yet have the best atomic functional interface design, and I feel sorry for that). The following function might not be the prettiest implementation, but it seems to work:

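# Amino acids grouped into the three hydrophobicity classes
hydro = list(
  '1' = c('R', 'K', 'E', 'D', 'Q', 'N'),
  '2' = c('G', 'A', 'S', 'T', 'P', 'H', 'Y'),
  '3' = c('C', 'L', 'V', 'I', 'M', 'F', 'W'))
seq = 'MTEITAAMVKELRESTGAGA'

# Encode a sequence as a string of class indices for one property
encode = function (seq, prop) {
  # For each class, find the sequence positions whose residue belongs to it
  lst = lapply(prop, function(x, y) which(y %in% x), strsplit(seq, '')[[1]])
  vec = rep(NA, nchar(seq))
  # Residues that belong to no class stay NA
  for (i in seq_along(lst)) vec[lst[[i]]] = i
  paste(vec, collapse = '')
}

encode(seq, hydro)  # "32132223311311222222"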
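
Since the question asks for the raw encoding across all CTD descriptors, here is a rough sketch of how the same idea could be applied to several properties at once. The helper name encode_all, the '?' placeholder for unrecognized residues, and the charge grouping are my assumptions for illustration and are not part of protr; please check each grouping against the table in the vignette before relying on it.

```r
# Hydrophobicity grouping, as in the answer above
hydro = list(
  '1' = c('R', 'K', 'E', 'D', 'Q', 'N'),
  '2' = c('G', 'A', 'S', 'T', 'P', 'H', 'Y'),
  '3' = c('C', 'L', 'V', 'I', 'M', 'F', 'W'))

# Assumed charge grouping (illustrative; verify against the vignette)
charge = list(
  '1' = c('K', 'R'),
  '2' = c('A', 'N', 'C', 'Q', 'G', 'H', 'I', 'L', 'M', 'F',
          'P', 'S', 'T', 'W', 'Y', 'V'),
  '3' = c('D', 'E'))

# Hypothetical helper: encode one sequence under every grouping in `props`.
# Returns a named character vector with one encoded string per property.
encode_all = function (seq, props) {
  residues = strsplit(seq, '')[[1]]
  sapply(props, function (prop) {
    # Residue -> class label lookup, e.g. c(R = '1', K = '1', ..., W = '3')
    lookup = setNames(rep(names(prop), lengths(prop)),
                      unlist(prop, use.names = FALSE))
    cls = unname(lookup[residues])
    # Mark residues outside every class with '?' instead of NA
    cls[is.na(cls)] = '?'
    paste(cls, collapse = '')
  })
}

res = encode_all('MTEITAAMVKELRESTGAGA',
                 list(hydrophobicity = hydro, charge = charge))
res[['hydrophobicity']]  # "32132223311311222222"
```

The named-vector lookup avoids the explicit loop, and adding another descriptor only requires adding another named list to props.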

Nan Xiao · Answer 2 · Tue Mar 20 2018 01:30:16 GMT+0800 (China Standard Time)

Maybe it's time to consider a protr2 project, with a better interface and low-level architecture.
Now I can't even look at the code I wrote this many years ago. 😂

Arron Lacey · Answer 3 · Tue Mar 20 2018 18:36:00 GMT+0800 (China Standard Time)

Thanks very much for your reply, and for generously sharing some code to help with my problem. I think protein annotators such as the protr package are great, especially for looking at differences in mutation data, so maybe a version 2 is worth a shot!