hybrid-protein-predictor

本仓库用于存放AI for Science比赛"CG2302-基于机器学习的蛋白质适应度预测任务"赛题解决方案。我们设计了一个包含局部特征提取和全局特征学习功能的混合模型，其中局部特征提取使用卷积神经网络CNN实现，蛋白质序列的全局特征学习由预训练模型ProtBert实现。

我们的代码基于PEER_Benchmark框架运行，关于PEER_Benchmark框架的详细信息请点击此处。

安装

运行代码前需要配置PEER_Benchmark的相关环境。安装要求使用Python 3.7/3.8且PyTorch版本不低于1.8.0。

conda create -n protein python=3.7
conda activate protein

conda install pytorch==1.8.0 cudatoolkit=10.2 -c pytorch
conda install scikit-learn pandas decorator ipython networkx tqdm matplotlib -y
conda install pytorch-scatter pytorch-cluster -c pyg -c conda-forge
pip install fair-esm transformers easydict pyyaml lmdb

cd ./torchdrug
pip install -e .

运行

单GPU运行

运行前需要对模型结构、任务类别、训练参数等参数进行设置，完成设置后执行以下命令：

cd PEER_Benchmark

# 单任务学习
python script/run_single.py -c config/single_task/$model/$yaml_config --seed 0

# 多任务学习
python script/run_multi.py -c config/multi_task/$model/$yaml_config --seed 0

多GPU运行

多GPU运行首先需要在yaml配置文件中指定运行GPU的编号（gpus = [0,1,2,3]表示使用4块GPU进行数据并行训练，注意此处的GPU编号数量需要与命令中--nproc_per_node指定的GPU数量相同），然后执行以下命令：

cd PEER_Benchmark

# 单任务学习
python -m torch.distributed.launch --nproc_per_node=4 script/run_single.py -c config/single_task/$model/$yaml_config --seed 0

# 多任务学习
python -m torch.distributed.launch --nproc_per_node=4 script/run_multi.py -c config/multi_task/$model/$yaml_config --seed 0

实验结果复现

我们的最佳模型表现由混合模型在fluorescence_ss和stability_ss两个多任务学习环境下测试得到，训练配置在./PEER_Benchmark/config/multi_task/Hybrid_best中。复现前需要先将模型参数（模型参数下载链接请点击此处）下载至本地并解压（大小约为10GB），并配置文件中的checkpoint属性修改为./best_params中模型参数的绝对路径，再执行以下命令：

cd PEER_Benchmark

# 荧光性预测任务
python -m torch.distributed.launch --nproc_per_node=2 script/run_multi_test.py -c config/multi_task/Hybrid_best/fluorescence_ss.yaml --seed 0

# 稳定性预测任务
python -m torch.distributed.launch --nproc_per_node=2 script/run_multi_test.py -c config/multi_task/Hybrid_best/stability_ss.yaml --seed 0

预测结果如下：

蛋白质荧光性预测结果

蛋白质稳定性预测结果

About

Source code and model parameters of AI for Science competition

Apache License 2.0

Languages

Language:Python 89.1%Language:C++ 5.5%Language:Cuda 5.0%Language:HTML 0.3%Language:Dockerfile 0.1%

Gemini321 / hybrid-protein-predictor