solislemuslab / dna-nn-theory

Here, we investigate the robustness, generalization potential and prediction accuracy of widely used convolutional neural network and natural language processing models on a variety of heterogeneous genomic datasets.

Towards a robust out-of-the-box neural network model for genomic data

Zhaoyi Zhang, Songyang Cheng, Claudia Solis-Lemus

Citation

This repository contains the scripts for the Zhang et al. (2022) manuscript:

@article{Zhang_Cheng_Solis-Lemus_2022, 
title={Towards a robust out-of-the-box neural network model for genomic data}, 
volume={23}, 
DOI={10.1186/s12859-022-04660-8},  
number={1}, 
journal={BMC Bioinformatics}, 
author={Zhang, Zhaoyi and Cheng, Songyang and Solis-Lemus, Claudia}, 
year={2022}, 
pages={125}
}

Joint first authors with equal contribution (Zhang, Zhaoyi and Cheng, Songyang). Order determined randomly by scripts/order-authors.jl.

Data

We used publicly available data from the following manuscripts:

  • Zeng, H., Edwards, M.D., Gifford, D.K. (2016) "Convolutional Neural Network Architectures for Predicting DNA-Protein Binding". Proceedings of Intelligent Systems for Molecular Biology (ISMB) 2016, Bioinformatics, 32(12):i121-i127. doi: 10.1093/bioinformatics/btw255. Motif data link, Paper link

  • Nguyen, N.G., Tran, V.A., Ngo, D.L., Phan, D., Lumbanraja, F.R., Faisal, M.R., Abapihi, B., Kubo, M., Satou, K. (2016) "DNA sequence classification by convolutional neural network". JBiSE, 09(05), 280–286. Splice data link, Histone data link, Paper link

Scripts

Pre-processing data

Python functions to download, clean and reformat the input data:
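
As a minimal sketch of what the reformatting step might look like, assume each DNA sequence is one-hot encoded before being passed to a CNN. This encoding is an assumption for illustration, and the function name `one_hot_encode` and its handling of ambiguous bases are hypothetical rather than taken from the repository's scripts:

```python
def one_hot_encode(sequence, alphabet="ACGT"):
    """Return one row per base, with a 1 in the column of that base.

    Hypothetical helper: bases outside `alphabet` (for example N)
    are encoded as an all-zero row.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


print(one_hot_encode("ACN"))
# prints [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
```

Encoding unknown bases as zero rows keeps every sequence the same width without assigning them to an arbitrary nucleotide; other conventions (such as a fifth column) are equally plausible.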

CNN models

All scripts and output files corresponding to the CNN models are in the cnn folder.

NLP models

All the scripts and output files corresponding to the NLP models are in the nlp folder. Jupyter notebooks contain the reproducible steps to run the analyses on each of the datasets.
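
NLP-style models need a sequence to be split into "words" first. A common way to do this for DNA is to cut the sequence into overlapping k-mers; the sketch below assumes that approach, and the function name `kmer_tokens` and the defaults for `k` and `stride` are hypothetical, not values from the notebooks:

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a DNA sequence into k-mers, moving `stride` bases at a time.

    Hypothetical helper: a sequence shorter than k yields no tokens.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


print(kmer_tokens("ATGCGT"))
# prints ['ATG', 'TGC', 'GCG', 'CGT']
print(kmer_tokens("ATGCGT", stride=3))
# prints ['ATG', 'CGT']
```

The resulting token lists can be treated like sentences by a text-embedding model; a stride of 1 gives overlapping tokens, while a stride equal to `k` gives non-overlapping ones.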

LSTM-layer

doc2vec+NN

LSTM-AE+NN

Figures

All figures in the manuscript were created with the R Markdown script plots/final-plots.Rmd.

License

Our code is licensed under the MIT License © Solis-Lemus lab projects (2021).

Feedback, issues and questions

Please use the GitHub issue tracker to report any issues or difficulties with the current code.
