sequence encoding feature
arronlacey opened this issue · comments
Hi - in the CRAN vignette for the protr package there is a line: "For example, for a given sequence MTEITAAMVKELRESTGAGA, it will be encoded as 32132223311311222222 according to its hydrophobicity."
However, I don't see a function in protr that does this. It would be helpful to be able to obtain the raw encoding for a sequence across all CTD descriptors.
Thanks.
@arronlacey - sounds reasonable. I'm not sure why there wasn't such a function back then (six years ago the package probably did not yet have a well-designed interface of small, atomic functions, and I'm sorry about that). The following function might not be the prettiest implementation, but it seems to work:
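# Hydrophobicity classes: 1 = polar, 2 = neutral, 3 = hydrophobic
hydro = list(
  '1' = c('R', 'K', 'E', 'D', 'Q', 'N'),
  '2' = c('G', 'A', 'S', 'T', 'P', 'H', 'Y'),
  '3' = c('C', 'L', 'V', 'I', 'M', 'F', 'W'))

seq = 'MTEITAAMVKELRESTGAGA'

encode = function (seq, prop) {
  # For each class, find the positions of the residues that belong to it
  lst = lapply(prop, function(x, y) which(y %in% x), strsplit(seq, '')[[1]])
  vec = rep(NA, nchar(seq))
  # Write the class index at those positions
  for (i in seq_along(lst)) vec[lst[[i]]] = i
  paste(vec, collapse = '')
}

encode(seq, hydro)
# [1] "32132223311311222222"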
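If you want the raw encoding for several properties at once, something along these lines might work. This is only a sketch: `encode_lookup` and `props` are names made up for this comment, not part of protr, and the `charge` grouping is my own illustration that should be checked against the vignette before you rely on it. Unlike `encode()` above, which would paste the literal text "NA" for any residue that belongs to no class, this version stops with an error instead.

```r
# Hypothetical helper (not a protr function): build a residue -> class
# lookup table once, then index into it with the split sequence.
encode_lookup = function(seq, prop) {
  # Named character vector: names are residues, values are class labels
  lookup = setNames(rep(names(prop), lengths(prop)),
                    unlist(prop, use.names = FALSE))
  residues = strsplit(seq, '')[[1]]
  codes = unname(lookup[residues])
  # Fail loudly on residues that belong to no class (e.g. 'X')
  if (anyNA(codes)) {
    stop('unclassified residue(s): ',
         paste(unique(residues[is.na(codes)]), collapse = ', '))
  }
  paste(codes, collapse = '')
}

# One three-class grouping per property
props = list(
  hydrophobicity = list(
    '1' = c('R', 'K', 'E', 'D', 'Q', 'N'),
    '2' = c('G', 'A', 'S', 'T', 'P', 'H', 'Y'),
    '3' = c('C', 'L', 'V', 'I', 'M', 'F', 'W')),
  # Assumed grouping, for illustration only
  charge = list(
    '1' = c('K', 'R'),
    '2' = c('A', 'N', 'C', 'Q', 'G', 'H', 'I', 'L',
            'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'),
    '3' = c('D', 'E')))

# Named character vector with one encoded string per property
encoded = sapply(props, function(p) encode_lookup('MTEITAAMVKELRESTGAGA', p))
```

Adding another property is then just another entry in `props`; the result keeps the property names, so `encoded[['hydrophobicity']]` gives the string from the vignette.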
Maybe it's time to consider a protr2 project, with a better interface and low-level architecture.
Now I can't even look at the code I wrote this many years ago. 😂
Thanks very much for your reply, and for generously sharing some code to help with my problem. I think protein annotators such as the protr package are great, especially for looking at differences in mutation data, so maybe a version 2 is worth a shot!