yaozhong / ns_profiling

modeling nucleotide sequencing

Problem

  • Motif finding
  • Sequence patterns

For example, we want a better way to describe the patterns in the following figure.

MAQC_sampleA_contextGenomic

How can we profile these sequences, and how can we uncover potential sequence patterns in a list of sequences?

The basic tools for profiling these sequences are k-mer-based counting and n-order Markov models. From the foreground and background sequences we calculate a k-mer position matrix (kmer.postion.matrix, pt.mat) as the model; each element of the matrix is the list of k-mer frequencies for a given k and position.
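As a minimal sketch of the position-specific k-mer counting described above, the R function below builds one frequency matrix for a single k, with one row per possible k-mer and one column per start position. The function name `kmer_position_freq`, its arguments, and the toy sequences are hypothetical and are not taken from the repository; it also assumes all sequences have the same length and use only the letters A, C, G, T. Something like pt.mat could then be assembled by calling it for several values of k, though the repository's actual layout may differ.

```r
# Frequency of every k-mer at every start position across a set of
# equal-length sequences. Rows = all 4^k k-mers, columns = start positions.
# NOTE: hypothetical helper for illustration, not the repository's code.
kmer_position_freq <- function(seqs, k, pseudocount = 0) {
  stopifnot(length(unique(nchar(seqs))) == 1, k >= 1, k <= nchar(seqs[1]))
  bases <- c("A", "C", "G", "T")
  # Enumerate every possible k-mer so that unseen k-mers still get a row.
  grid <- expand.grid(rep(list(bases), k), stringsAsFactors = FALSE)
  kmers <- sort(do.call(paste0, grid))
  n_pos <- nchar(seqs[1]) - k + 1
  freq <- matrix(0, nrow = length(kmers), ncol = n_pos,
                 dimnames = list(kmers, as.character(seq_len(n_pos))))
  for (p in seq_len(n_pos)) {
    observed <- substr(seqs, p, p + k - 1)
    # k-mers containing other letters (e.g. N) become NA and are not counted.
    counts <- as.numeric(table(factor(observed, levels = kmers)))
    freq[, p] <- (counts + pseudocount) /
      (sum(counts) + pseudocount * length(kmers))
  }
  freq
}

# Toy foreground set (made-up sequences).
fore <- c("ACGTAC", "ACGTTT", "ACCTAC", "TCGTAC")
m <- kmer_position_freq(fore, k = 2)
m["AC", "1"]   # fraction of sequences whose first 2-mer is "AC"
```

Each column of the result sums to 1, so a column can be read as the empirical k-mer distribution at that position. Setting `pseudocount` above zero keeps unseen k-mers from getting a frequency of exactly zero.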

We examined how the foreground and background models profile a randomly selected foreground/background sequence (trained) and (?test)

A foreground sequence

foreseq_foreModel_backModel

A background sequence

backseq_foreModel_backModel
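One way to read the two figures above is that a single sequence is scored position by position under both models. The sketch below illustrates that idea under stated assumptions: each model is a k-mer by position frequency matrix with k-mers as row names, and the score at a position is simply the frequency the model assigns to the k-mer observed there. The function `profile_sequence` and the toy matrices are hypothetical, and the repository's actual scoring may differ.

```r
# For each start position, look up the frequency a model assigns to the
# k-mer actually observed in `seq`.
# NOTE: hypothetical helper and model layout, for illustration only.
profile_sequence <- function(seq, model) {
  k <- nchar(rownames(model)[1])
  n_pos <- min(ncol(model), nchar(seq) - k + 1)
  starts <- seq_len(n_pos)
  observed <- substring(seq, starts, starts + k - 1)
  # Index the matrix by (row, column) pairs; unknown k-mers give NA.
  model[cbind(match(observed, rownames(model)), starts)]
}

# Toy 1-mer models over three positions (made-up numbers).
bases <- c("A", "C", "G", "T")
pos <- c("1", "2", "3")
fore_model <- matrix(c(0.70, 0.10, 0.10, 0.10,   # position 1 favours A
                       0.10, 0.70, 0.10, 0.10,   # position 2 favours C
                       0.25, 0.25, 0.25, 0.25),  # position 3 is uniform
                     nrow = 4, dimnames = list(bases, pos))
back_model <- matrix(0.25, nrow = 4, ncol = 3, dimnames = list(bases, pos))

fore_p <- profile_sequence("ACG", fore_model)
back_p <- profile_sequence("ACG", back_model)
# Per-position log-odds: positive where the foreground model fits better.
log_odds <- log2(fore_p / back_p)
```

Plotting `fore_p` and `back_p` against position for one sequence gives a profile that can be compared between a foreground and a background sequence.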

KL divergence of different k-mer models across positions

KL divergences

From this graph we can observe that the higher-order k-mer models find larger differences between the sequences. However, we expect the difference over the first 10 positions not to be that big, so higher-order models carry a risk of overfitting.
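The per-position comparison discussed above can be sketched as a column-wise KL divergence between a foreground and a background k-mer frequency matrix. This is an illustration under assumptions, not the repository's code: the function name `kl_by_position`, the pseudocount value, and the toy matrices are all hypothetical, and both inputs are assumed to be k-mer by position matrices with matching dimensions.

```r
# KL divergence D(P || Q), in bits, computed separately for each position
# (column) of two k-mer x position frequency matrices.
# NOTE: hypothetical helper; the pseudocount is an assumed smoothing choice.
kl_by_position <- function(p_mat, q_mat, pseudocount = 1e-6) {
  stopifnot(all(dim(p_mat) == dim(q_mat)))
  smooth <- function(m) {
    m <- m + pseudocount            # avoid log(0) and division by zero
    sweep(m, 2, colSums(m), "/")    # renormalise each position to sum to 1
  }
  p <- smooth(p_mat)
  q <- smooth(q_mat)
  colSums(p * log2(p / q))
}

# Toy 1-mer matrices over two positions (made-up numbers).
bases <- c("A", "C", "G", "T")
fore_mat <- matrix(c(0.70, 0.10, 0.10, 0.10,
                     0.25, 0.25, 0.25, 0.25),
                   nrow = 4, dimnames = list(bases, c("1", "2")))
back_mat <- matrix(0.25, nrow = 4, ncol = 2,
                   dimnames = list(bases, c("1", "2")))

kl <- kl_by_position(fore_mat, back_mat)
kl   # larger at position 1, zero where the two distributions agree
```

A k-mer model has 4^k possible k-mers per position, so for larger k many of them are observed rarely or never in a finite sample. Divergence estimates from such sparse counts tend to be inflated, which is consistent with the overfitting concern raised above.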

working list

  • basic analysis of k-mer-based Markov models
  • highly efficient sampling of RNA reads
  • RNN R-based implementation



Languages

Language:R 100.0%