On the application of BERT models for nanopore methylation detection

MethBERT explores a non-recurrent modeling approach for nanopore methylation detection based on the bidirectional encoder representations from transformers (BERT). Compared with the state-of-the-art model using bi-directional recurrent neural networks (RNN), BERT can provide a faster model inference solution without the limit of computation in sequential order. We proviede two types of BERTs: the basic one [Devlin et al.] and the refined one. The refined BERT is refined according to the task-specific features, including:

learnable postional embedding
self-attetion with realtive postion representation [Shaw et al.]
center postitions concatenation for the output layer

The model structures are shown in the above figure.

Installation

git clone https://github.com/yaozhong/methBERT.git
cd methBERT
pip install .

Docker enviroment

We also provide a docker image for running this source code

docker pull yaozhong/ont_methylation:0.6

ubuntu 16.04.10
Python 3.5.2
Pytorch 1.5.1+cu101

nvidia-docker run -it --shm-size=64G -v LOCAL_DATA_PATH:MOUNT_DATA_PATH yaozhong/ont_methylation:0.6

Training

Data sampling and split

We use reads from complete methylation data and amplicon data (Control) for training models. We provide the following two choice for sampling reads.

random and balanced selection of reads
region-based selection

N_EPOCH=50
W_LEN=21
LR=1e-4
MODEL="biRNN_basic"
MOTIF="CG"
NUCLEOTIDE_LOC_IN_MOTIF=0
POSITIVE_SAMPLE_PATH=<methylated fast5 path>
NEGATIVE_SAMPLE_PATH=<unmethylated fast5 path>
MODEL_SAVE_FILE=<model saved path>

# training biRNN model
python3 train_biRNN.py --model ${MODEL}  --model_dir ${MODEL_SAVE_FILE} --gpu cuda:0 --epoch ${N_EPOCH} \
 --positive_control_dataPath ${POSITIVE_SAMPLE_PATH}   --negative_control_dataPath ${NEGATIVE_SAMPLE_PATH} \
 --motif ${MOTIF} --m_shift ${NUCLEOTIDE_LOC_IN_MOTIF} --w_len ${W_LEN} --lr $LR --data_balance_adjust

# training bert models
MODEL="BERT_plus" (option: "BERT", "BERT_plus")
python3 train_bert.py --model ${MODEL}  --model_dir ${MODEL_SAVE_FILE} --gpu cuda:0 --epoch ${N_EPOCH} \
 --positive_control_dataPath ${POSITIVE_SAMPLE_PATH}   --negative_control_dataPath ${NEGATIVE_SAMPLE_PATH} \
 --motif ${MOTIF} --m_shift ${NUCLEOTIDE_LOC_IN_MOTIF} --w_len ${W_LEN} --lr $LR  --data_balance_adjust

Detection

We provided independent trained models on each 5mC and 6mA datasets of different motifs and methyltransferases in the ./trained_model fold.

MODEL="BERT_plus" 
MODEL_SAVE_PATH=<model saved path>
REF=<reference genome fasta file>
FAST5_FOLD=<fast5 files to be analyzed>
OUTPUT=<output file>

time python detect.py --model ${MODEL} --model_dir ${MODEL_SAVE_PATH} \
--gpu cuda:0  --fast5_fold ${FAST5_FOLD} --num_worker 12 \
--motif ${MOTIF} --m_shift ${NUCLEOTIDE_LOC_IN_MOTIF} --evalMode test_mode --w_len ${W_LEN} --ref_genome ${REF} --output_file ${OUTPUT}

We generate the same output format as the deepSignal (https://github.com/bioinfomaticsCSU/deepsignal).

# output example
NC_000913.3     4581829 +       4581829 43ea7b03-8d2b-4df3-b395-536b41872137    t       1.0     3.0398369e-06   0       TGCGGGTCTTCGCCATACACG
NC_000913.3     4581838 +       4581838 43ea7b03-8d2b-4df3-b395-536b41872137    t       0.9999996       0.00013372302   0       TCGCCATACACGCGCTCAAAC
NC_000913.3     4581840 +       4581840 43ea7b03-8d2b-4df3-b395-536b41872137    t       1.0     0.0     0       GCCATACACGCGCTCAAACGG
NC_000913.3     4581848 +       4581848 43ea7b03-8d2b-4df3-b395-536b41872137    t       1.0     0.0     0       CGCGCTCAAACGGCTGCAAAT
NC_000913.3     4581862 +       4581862 43ea7b03-8d2b-4df3-b395-536b41872137    t       1.0     0.0     0       TGCAAATGCTCGTCGGTAAAC

Available benchmark dataset

We test models on 5mC and 6mA dataset sequenced with Nanopore R9 flow cells, which is commonly used as the benchmark data in the previous work.

The stoiber's dataset, https://www.biorxiv.org/content/10.1101/094672v2
Simpson's dataset, https://www.nature.com/articles/nmeth.4184

The fast5 reads are supposed to be pre-processed with re-squggle (Tombo). More detailed information on data pre-prcessing, please refer to methBERT.readthedoc

Reference genome

E.coli: K-12 sub-strand MG1655
H.sapiens: GRCh38

Reference

This source code is referring to the follow github projects.

yaozhong / methBERT