bjmt / universalmotif

Motif manipulation functions for R.

`create_motif` makes incorrect motif for amino acid sequences

MattBrauer opened this issue

What happens

Given a set of sequences (as an AAStringSet), create_motif returns an obviously wrong PWM and consensus. The amino acid row labels appear to be ordered inconsistently between steps.

What I suspect might be causing the problem

create_motif creates a matrix with row labels taken from Biostrings' AA_STANDARD, a character vector of single-letter amino acid codes in this order:
"A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y" "V"

Later manipulations of this matrix seem to expect the order to be alphabetical.
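
To illustrate the suspected failure mode, here is a minimal base-R sketch. It does not use universalmotif's internals, which I have not traced; the toy counts and the `top_letter` helper are hypothetical, and the relabeling step is only my assumption about what kind of bug would produce this symptom. The row order is copied from the AA_STANDARD listing above.

```r
# Row order as reported above for Biostrings' AA_STANDARD.
aa_standard <- c("A", "R", "N", "D", "C", "Q", "E", "G", "H", "I",
                 "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Toy one-column count matrix: 5 valines, 3 leucines, all else 0.
counts <- matrix(0, nrow = 20, ncol = 1, dimnames = list(aa_standard, NULL))
counts["V", 1] <- 5
counts["L", 1] <- 3

# Assumed bug: later code treats the rows as alphabetical and relabels them
# without moving the values.
mislabeled <- counts
rownames(mislabeled) <- sort(aa_standard)

# Safe alternative: reorder rows by name so values travel with their labels.
reordered <- counts[sort(aa_standard), , drop = FALSE]

# Hypothetical helper: letter with the highest count in column 1.
top_letter <- function(m) rownames(m)[which.max(m[, 1])]

top_letter(mislabeled)  # "Y": the valine counts now carry the wrong label
top_letter(reordered)   # "V"
```

If something like this is happening, the consensus would be built from correct counts attached to the wrong letters, which matches what I see.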

How to reproduce

library(universalmotif)
library(Biostrings)

sequences <- AAStringSet(c("VTTDLQVKV", "STSDLLTLR", "TSLHLLVLR", "QALELLPRL", "LTDTLVSKL", 
  "TSLHLVLRL", "TSLRLLTSL", "LSTPVLRFT", "APEEHPVLL", "GSSDFLVKL", 
  "VTFLLPAGW", "LTSELLTHL", "TSSSLLLLR", "LSTEVNPKL", "QSLPTKETL", 
  "LLDPHVVLL", "SGLVLKVLL", "LTAHVEPLL", "STVKVLLRL", "FLDTVLLSW", 
  "LSKALVAYY", "KASSLVPKL", "LTADLARVL", "SGTDRQVTL", "TFDVALSPR", 
  "EDFTLLVNL", "FDDVAVVTF", "SGAYLKVSL", "LWDLSLLTR", "LTTKALYRN", 
  "GVAPLQVVK", "FFDPVTLHL", "LVSALQLLL", "TESKYYVTL", "LFDLFRFGF", 
  "LSVPLFKQF", "KRTLLDVVY", "KSFEAPLLK", "TTTPQQTKL", "SAADLPLNL", 
  "VSSKLLLVL", "QSLPTKETL", "VTLFKVAAP", "LTAHVEPLL", "LDVRYLLDL", 
  "TTGTLLKTL", "MLLDVYLTL", "SGLVVLKLL", "KSTDVFTTF", "LTAQHKLMA", 
  "HFDLLLRVN", "KALDSSKTF", "YNDEALLLR", "KSLTLTPQL", "YTRYGPKAF", 
  "MVAKKPNLL", "YQPDFYFEF", "KDLLMVPTF", "FSLPWRSST", "LPDSSPRTL", 
  "SMAALFVLL", "PELEVKVTV", "KTPVKVPVL", "LKLLLGLLL", "VLTTKLLVL", 
  "LSQRKSTSL", "KTTPDVLFV", "LEELSKYLF", "QSLPLFVQL", "KDTKTLVLL", 
  "HGFFLPEKL", "KLYYQEFKK", "HSLTEDVTL", "ASSTNLLHL", "NDAYLVQGL", 
  "STLLKFEAA", "HSAELLAEL", "PDLLTKLTF", "TFTKTQETL", "LSGRLLTVL", 
  "KPEVVFLLL", "KGFVGSFLV", "KAVDTSKTF", "FDDTTFGTF", "VQVVLMLLL", 
  "VALAKSLYY", "TAHDLLAEL", "KAAKKAPLN", "SYVKLLLSY", "LPLFVSLDL", 
  "VNFLVLVRT", "FLKAPLLFL", "TLPHLSESF", "TPHDPTVPL", "VDGKTLVNV", 
  "KLTSEVLNL", "EPFVLPLTW", "VTDLHKTSL", "KQKWLALLK", "MVAKKPNLL", 
  "SLRNVKVTL", "QNTLAVPEL", "PSPFAALVH", "HTFWGVVFF", "KSDVFLTEL", 
  "FTDARAYTT", "LTERFTLVF", "KGTSTTHLL", "VGNLRALVR", "KEASLQLVL", 
  "PVTTKPVTL", "KNASLYLLV", "HTAELVLVL", "TYDLQESNV", "TTATQVLLL", 
  "KDGLFWVLV", "VSTGLVKLR", "KANEKLAVK", "LVSVQVVLV", "VAKVNAYTF", 
  "KAAELQTGL", "QVKFAGVKL", "FDDDSKLFW", "AVRMVGLQL", "LSNVAYPVL", 
  "MFDDTELLF", "KLTLTEVEP", "YRSLGPALR", "LLARASLLL", "LTNSSTVTL", 
  "GTDLASFNL", "LCNAKLYLF", "KLEDFAFTF", "HTNALQTLL", "KVDSVYYLF", 
  "VVKAKVNAP", "TTLLKEVEP", "ASLPRSVLF", "WLAWSTFGE", "LTDGYKLTL", 
  "QGYEKLVEV", "KDGFTLFYF", "LWPLLAVAL", "LDPLSVKTF", "KVQQYAVKL", 
  "TDELSPHLL", "VDFLLATWF", "CYGRSVLNY", "GSTERNVTL", "KDEVYYVKL", 
  "KNKAAVLQL", "TDDYMELLF", "TTTLAKVEV", "FLGKALFFL", "LPDMSQPLW", 
  "KTRTEVSQY", "TEPEYLTEY", "HATTQNVLL", "KSDVFTLEL"))

Compare the output of create_motif with that of consensusMatrix:

create_motif(sequences, alphabet="AA", type="PWM")
consensusMatrix(sequences)
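
As an independent cross-check that needs neither package, the per-position letter counts can also be tallied in base R. This is only a sketch on the first three sequences from the list above; `seqs`, `chars`, and `tally` are names I made up for illustration. I would expect the counts from consensusMatrix to agree with it for the same input.

```r
# First three sequences from the report, as plain character strings.
seqs <- c("VTTDLQVKV", "STSDLLTLR", "TSLHLLVLR")

# Split into a sequences-by-positions character matrix.
chars <- do.call(rbind, strsplit(seqs, ""))

# Count each letter at each position, with rows in alphabetical order.
aa <- sort(unique(as.vector(chars)))
tally <- sapply(seq_len(ncol(chars)),
                function(j) table(factor(chars[, j], levels = aa)))
rownames(tally) <- aa

tally
```

Every column should sum to the number of sequences, and position 5 should show three leucines for this subset.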

Note that I'm using v1.4.0 of universalmotif, but I don't think this issue has been addressed subsequently.

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] Biostrings_2.54.0 XVector_0.26.0 IRanges_2.20.0 S4Vectors_0.24.0 BiocGenerics_0.32.0 universalmotif_1.4.0

loaded via a namespace (and not attached):
[1] treeio_1.10.0 tidyselect_0.2.5 remotes_2.1.0 purrr_0.3.3 lattice_0.20-38 colorspace_1.4-1
[7] vctrs_0.2.0 yaml_2.2.0 rlang_0.4.1 pkgbuild_1.0.6 pillar_1.4.2 glue_1.3.1
[13] withr_2.1.2 rvcheck_0.1.6 lifecycle_0.1.0 stringr_1.4.0 ggseqlogo_0.1 zlibbioc_1.32.0
[19] munsell_0.5.0 gtable_0.3.0 callr_3.3.2 ps_1.3.0 gbRd_0.4-11 curl_4.2
[25] Rcpp_1.0.2 scales_1.0.0 backports_1.1.5 BiocManager_1.30.9 jsonlite_1.6 ggplot2_3.2.1
[31] packrat_0.5.0 stringi_1.4.3 processx_3.4.1 dplyr_0.8.3 grid_3.6.1 rprojroot_1.3-2
[37] bibtex_0.4.2 ggtree_2.0.0 Rdpack_0.11-0 cli_1.1.0 tools_3.6.1 magrittr_1.5
[43] lazyeval_0.2.2 tibble_2.1.3 crayon_1.3.4 ape_5.3 tidyr_1.0.0 pkgconfig_2.0.3
[49] zeallot_0.1.0 MASS_7.3-51.4 tidytree_0.2.9 prettyunits_1.0.2 assertthat_0.2.1 rstudioapi_0.10
[55] R6_2.4.0 nlme_3.1-141 compiler_3.6.1

Thank you for the very comprehensive report! I believe you are quite right as to what was going wrong. Many functions do indeed expect everything to be sorted alphabetically.

I applied a quick fix which at least makes your example work now. It should reach the Bioconductor release version either tomorrow or the day after, should you still be interested.