nanxstats / protr

🧬 Toolkit for generating various numerical features of protein sequences

Home Page: https://nanx.me/protr/

sequence encoding feature

arronlacey opened this issue

Hi - in the CRAN vignette for the protr package there is a line: "For example, for a given sequence MTEITAAMVKELRESTGAGA, it will be encoded as 32132223311311222222 according to its hydrophobicity."

However, I don't see a function in protr that does this. It would be helpful to be able to obtain the raw encoding of a sequence across all CTD descriptors.

Thanks.

@arronlacey - sounds reasonable. I'm not sure why there wasn't such a function back then (six years ago we probably did not have the best atomic functional interface design yet, and I feel sorry for that). The following function might not be the prettiest implementation, but it seems to work:

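# CTD hydrophobicity classes: 1 = polar, 2 = neutral, 3 = hydrophobic
hydro = list(
  '1' = c('R', 'K', 'E', 'D', 'Q', 'N'),
  '2' = c('G', 'A', 'S', 'T', 'P', 'H', 'Y'),
  '3' = c('C', 'L', 'V', 'I', 'M', 'F', 'W'))
seq = 'MTEITAAMVKELRESTGAGA'

# Encode each residue of `seq` as the index of the class in `prop` that contains it
encode = function (seq, prop) {
  # For each class, find the positions of its residues in the sequence
  lst = lapply(prop, function(x, y) which(y %in% x), strsplit(seq, '')[[1]])
  vec = rep(NA, nchar(seq))
  for (i in seq_along(lst)) vec[lst[[i]]] = i
  # Residues that belong to no class stay NA and are pasted as "NA"
  paste(vec, collapse = '')
}

encode(seq, hydro)  # "32132223311311222222"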
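To get raw encodings for more than one CTD property, the same idea can be written as a lookup table and applied over a list of groupings. This is only a sketch: `encode_lookup` is a hypothetical helper, not a protr function, and the `charge` grouping below is assumed from the CTD attribute table, so check it against the vignette before relying on it.

```r
# Hypothetical helper, not part of protr: encode a sequence with a
# residue -> class lookup vector instead of a loop.
encode_lookup = function(seq, prop) {
  # Named vector mapping each residue to its class label, e.g. R -> "1"
  lookup = setNames(rep(names(prop), lengths(prop)),
                    unlist(prop, use.names = FALSE))
  residues = strsplit(seq, '')[[1]]
  codes = unname(lookup[residues])
  codes[is.na(codes)] = '-'  # placeholder for residues missing from prop
  paste(codes, collapse = '')
}

hydro = list(
  '1' = c('R', 'K', 'E', 'D', 'Q', 'N'),
  '2' = c('G', 'A', 'S', 'T', 'P', 'H', 'Y'),
  '3' = c('C', 'L', 'V', 'I', 'M', 'F', 'W'))

# Assumed grouping for charge (1 = positive, 2 = neutral, 3 = negative);
# verify against the CTD table in the vignette.
charge = list(
  '1' = c('K', 'R'),
  '2' = c('A', 'N', 'C', 'Q', 'G', 'H', 'I', 'L',
          'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'),
  '3' = c('D', 'E'))

# One encoded string per property, named by property
props = list(hydrophobicity = hydro, charge = charge)
encoded = sapply(props, function(p) encode_lookup('MTEITAAMVKELRESTGAGA', p))
print(encoded)
```

Unlike pasting `NA`, the single-character `-` placeholder keeps the encoded string the same length as the input when a residue such as `X` is not covered by a grouping.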

Maybe it's time to consider a protr2 project, with a better interface and low-level architecture.
Now I can't even look at the code I wrote this many years ago. 😂

Thanks very much for your reply, and for generously sharing some code to help with my problem. I think protein annotators such as the protr package are great, especially for looking at differences in mutation data, so maybe a version 2 is worth a shot!