yaozhong/bert_investigation

BERT analysis


Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings

In this study, we used a non-standard pre-training approach, incorporating randomness at both the data and the model level, to investigate a BERT model pre-trained on nucleotide sequences.
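
For reference, DNABERT-style models tokenize a nucleotide sequence into overlapping k-mers before embedding. A minimal sketch (the helper name is ours, not part of this repo):

def seq_to_kmers(seq, k=5):
    """Split a nucleotide sequence into overlapping k-mer tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(seq_to_kmers("ACGTACGT", k=5))
# ['ACGTA', 'CGTAC', 'GTACG', 'TACGT']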

Data

Pre-training data

  • data/ptData: source code for generating random sequences (see the sketch below).
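
A hypothetical sketch of what uniform random sequence generation could look like; the actual generator is in data/ptData, and the names here are assumptions:

import random

def random_sequences(n, length, seed=0):
    """Sample n sequences uniformly over the alphabet A/C/G/T."""
    rng = random.Random(seed)
    return ["".join(rng.choices("ACGT", k=length)) for _ in range(n)]

seqs = random_sequences(n=3, length=100)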

Fine-tuning data

Source code

  • ft_tasks: source code for using different k-mer embeddings in the downstream tasks of TATA prediction and TFBS prediction.

Environments and required packages

Evaluated k-mer embeddings

k-mer embedding | Description                                             | Required files
--------------- | ------------------------------------------------------- | --------------------------------------
dnabert         | k-mer embedding from DNABERT pre-trained on hg38        | pre-trained model provided by DNABERT
dnabert         | k-mer embedding from DNABERT pre-trained on random data | DNABERT model pre-trained on random data
onehot          | one-hot embedding                                       | None
dna2vec         | k-mer embedding from dna2vec                            | pretrained model
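
As a reference for the onehot baseline, a minimal sketch that one-hot encodes each k-mer over the 4^k possible k-mers; whether the repo encodes per k-mer or per base is not stated here, so this index layout is an assumption:

import itertools
import numpy as np

def onehot_kmer(kmer):
    """One-hot encode a k-mer over the 4**k possible k-mers."""
    k = len(kmer)
    index = {"".join(p): i
             for i, p in enumerate(itertools.product("ACGT", repeat=k))}
    vec = np.zeros(4 ** k, dtype=np.float32)
    vec[index[kmer]] = 1.0
    return vec

v = onehot_kmer("ACGTA")  # a 1024-dimensional vector for k = 5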

TATA prediction task

KMER=5
SPECIES="human"            # or "mouse"
MODEL="deepPromoterNet"
MODEL_SAVE_PATH="model/"
DATA_PATH="ftData/TATA/TATA_${SPECIES}/overall"
EMBEDDING="dnabert"        # or "onehot", "dna2vec"
embed_file=FOLDER_PATH_OF_THE_PRETRAINED_MODEL   # or NONE
KERNEL="5,5,5"
LR=1e-4
EPOCH=20
BS=64
DROPOUT=0.1

CODE="ft_tasks/TATA/tata_train.py"
python $CODE --kmer $KMER --cnn_kernel_size $KERNEL --model $MODEL --model_dir $MODEL_SAVE_PATH \
    --data_dir $DATA_PATH --embedding $EMBEDDING --embedding_file $embed_file \
    --lr $LR --epoch $EPOCH --batch_size $BS --dropout $DROPOUT --device "cuda:0"
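
The --embedding_file argument supplies the pretrained k-mer vectors. A hedged sketch of how such a matrix could be loaded into a frozen torch embedding layer; the .npy format, the (vocab_size, dim) shape, and the function name are assumptions, not the repo's actual loader:

import numpy as np
import torch
import torch.nn as nn

def load_kmer_embedding(path, freeze=True):
    """Load a (vocab_size, dim) matrix and wrap it as an embedding layer."""
    weights = np.load(path)  # assumed shape: (vocab_size, dim)
    return nn.Embedding.from_pretrained(
        torch.tensor(weights, dtype=torch.float32), freeze=freeze)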

TFBS prediction task

KMER=5
MODEL="zeng_CNN"
KERNEL="24"
MODEL_SAVE_PATH="model/"
DATA_PATH="TBFS/motif_discovery/"   # or "TBFS/motif_occupancy/"
EMBEDDING="dnabert"                 # or "onehot", "dna2vec"
embed_file=FOLDER_PATH_OF_THE_PRETRAINED_MODEL   # or NONE
LR=0.001
EPOCH=10
BS=64
DROPOUT=0.1

CODE="ft_tasks/TFBS/TBFS_all_run.py"
python $CODE --kmer $KMER --cnn_kernel_size $KERNEL --model $MODEL --model_dir $MODEL_SAVE_PATH \
    --data_dir $DATA_PATH --embedding $EMBEDDING --embedding_file $embed_file \
    --lr $LR --epoch $EPOCH --batch_size $BS --dropout $DROPOUT --device "cuda:0"

Pre-trained models

Experiment results

  • results: detailed results for each dataset of the TFBS tasks.

License

MIT License