skphy / Computational-Material-Band-Gap-Prediction

License: MIT

computational-material-project

The project has two main parts. The first extracts structural information from CIF files and compiles it into a pandas DataFrame. The second uses machine learning to analyze the relationship between crystal structure and band gap. A user interface is also provided to output the predicted band gap value.

I. CIF Processing

The first component of the computational material project preprocesses .cif files. CIF (Crystallographic Information File) is an internationally standardized data format that stores crystal and molecular structure information. Our motivation is to provide a tool for bulk reading and data extraction as a first step towards any statistical or computational analysis.

Building on the Materials Project and its API, we also designed a download function that lets people without a computer science background select materials within a band gap range of interest, without learning the more complex syntax of the underlying pymatgen package.

The CIF processing folder contains CIF_process.py, which can be executed directly from the command line by calling python CIF_process.py. Follow the sequence of prompts to provide your information, and you will obtain a csv file built from either the downloaded data or your local files.
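As a rough illustration of the extraction step, the sketch below reads the six unit cell parameters from CIF text using only the standard library. It is not the code in CIF_process.py or df_CIF.py; the function name parse_cell_parameters and the sample CIF content are hypothetical. The data names (_cell_length_a and so on) are the standard CIF tags for cell lengths and angles.

```python
import re

# The six standard CIF data names that describe the unit cell.
CELL_TAGS = (
    "_cell_length_a", "_cell_length_b", "_cell_length_c",
    "_cell_angle_alpha", "_cell_angle_beta", "_cell_angle_gamma",
)


def parse_cell_parameters(cif_text):
    """Return a dict mapping each cell tag found in cif_text to a float.

    Hypothetical helper for illustration only. A trailing standard
    uncertainty such as "(2)" in "5.4310(2)" is ignored.
    """
    params = {}
    for tag in CELL_TAGS:
        pattern = rf"^{re.escape(tag)}\s+([-+]?\d*\.?\d+)"
        match = re.search(pattern, cif_text, re.MULTILINE)
        if match:
            params[tag] = float(match.group(1))
    return params


# Made-up CIF fragment, not taken from the project's dataset.
example_cif = """data_example
_cell_length_a      5.4310(2)
_cell_length_b      5.4310(2)
_cell_length_c      5.4310(2)
_cell_angle_alpha   90
_cell_angle_beta    90
_cell_angle_gamma   90
"""

cell = parse_cell_parameters(example_cif)
print(cell)
```

Each returned dict can serve as one row of a table, for example by collecting the dicts from many files into a list and passing that list to pandas.DataFrame.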

A special note regarding the test file of the CIF processing program:

The test files are named manualtest* rather than following the conventional test naming. This is because the tests evaluate the returned raw data and ask you to provide confidential information. See the prompts in the script for more information.

II. Band Gap Prediction with Machine Learning

The second part of the project is a practical, exploratory study that uses the output of this data processing program. We applied several machine learning models to a data frame, generated by CIF_process.py, that contains all materials from the Materials Project with a band gap in the visible light spectrum. The generated data file is provided in doc/dataset in this repo.

Hypothesis: the structure of substances is strongly correlated with the band gap values.

Method: To predict the band gap from structural information, the following 6 parameters extracted from the cif dataframe are used:

  • Lengths of the three cell edges (a, b, c)
  • The three cell angles between those edges (alpha, beta, gamma)

Several neural network models were trained on these parameters.

Results:

  • Deep learning for regression task: MSE of 0.14
  • K-means clustering (with 3 clusters): 10.8% accuracy
  • Deep learning classification with one-hot encoding: 36.7% accuracy
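For readers unfamiliar with one-hot encoding, the sketch below shows how a continuous band gap could be binned into classes and encoded as a target vector for a classifier. The bin edges and function names are hypothetical, since this README does not state how the classes were defined in the project.

```python
from bisect import bisect_right

# Hypothetical bin edges in eV; the project's real class boundaries
# are not documented in this README.
BIN_EDGES = [2.0, 2.6]


def band_gap_class(gap_ev):
    """Map a band gap to a class index.

    With the edges above: 0 for gap < 2.0, 1 for 2.0 <= gap < 2.6,
    2 for gap >= 2.6.
    """
    return bisect_right(BIN_EDGES, gap_ev)


def one_hot(index, n_classes):
    """Return a list of zeros with a single 1 at position index."""
    vector = [0] * n_classes
    vector[index] = 1
    return vector


# Made-up band gap values, for illustration only.
gaps = [1.8, 2.3, 3.0]
n_classes = len(BIN_EDGES) + 1
labels = [one_hot(band_gap_class(g), n_classes) for g in gaps]
print(labels)
```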

We also applied other machine learning models, such as decision trees and random forests, to explore the same hypothesis.

Interpretation: Despite reasonable training curves (the training cost decreases and converges over the epochs), both the supervised and unsupervised classification tasks produce very low accuracy. The evaluation metric for the regression task requires further assessment.

These largely negative results suggest a weak correlation between the structural information and the band gap in the current dataset. The hypothesis requires further refinement, since the data set was selected only by band gap, without controlling for other factors (such as the types of atoms). It is therefore very likely that the dataset contains a lot of noise.
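One possible way to assess a regression MSE is to compare it with a trivial baseline that always predicts the mean target value, since an MSE figure on its own depends on how widely the targets are spread. The sketch below illustrates that comparison. The numbers are made up and are not results from this project, and the function names are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def baseline_mse(y_true):
    """MSE of a predictor that always outputs the mean of y_true.

    This equals the population variance of the targets.
    """
    mean = sum(y_true) / len(y_true)
    return mean_squared_error(y_true, [mean] * len(y_true))


# Made-up band gaps and predictions, for illustration only.
y_true = [1.8, 2.2, 2.6, 3.0]
y_pred = [2.0, 2.0, 2.8, 2.8]

model = mean_squared_error(y_true, y_pred)
baseline = baseline_mse(y_true)
print(f"model MSE = {model:.3f}, baseline MSE = {baseline:.3f}")
```

If the model MSE is not clearly below the baseline MSE on held-out data, the model has learned little beyond the average band gap.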

III. Cell Parameter Prediction with Neural Network

The pair plot of all the features in the extracted dataframe shows strong correlation among the cell parameters. We therefore revised our hypothesis based on this finding.
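A pair plot gives a visual impression of correlation. A numeric counterpart is the Pearson correlation coefficient between two columns, sketched below with the standard library. The values are made up and are not taken from bg_struct.csv, and the function name pearson is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up cell edge lengths, for illustration only.
length_a = [3.9, 4.2, 5.4, 5.6, 6.1]
length_b = [4.0, 4.1, 5.5, 5.6, 6.0]

r = pearson(length_a, length_b)
print(f"Pearson r between a and b: {r:.3f}")
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.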

Hypothesis: For the band gap of a substance to fall in the visible light spectrum, the cell parameters must satisfy a specific relationship, which constrains the size of the unit cell.

Method: Neural Network

Results: R2 score = 0.60

User Interface to predict the band gap

On the right-hand side, users enter the crystal lattice constants and choose the machine learning method. The left-hand side then outputs the predicted band gap value.

Installation

Install and activate the environment with finalProject.yml by:

    conda env create -f finalProject.yml
    conda activate comma_env

In the console, execute the following command, where package_path is the path to the folder containing this README (computational-material-project):

    pip install package_path

The package can then be imported in the installed environment as comma.

Repo structure

computational-material-project
-----
setup.py
finalProject.yml
CIF processing/
|-CIF_process.py
|-df_CIF.py
|-manualtest_df_CIF.ipynb
|-manualtest_df_CIF.py
computational-materials/
|-tests/
| |-NN_metrics.py
| |-Neural_Network.py
| |-test_bandgap_dt_rf.py
| |-test_cif_conversion.py
| |-test_nn.py
|-models/
| |-Neural_Network.py
| |-band_gap_prediction.ipynb
| |-bandgap_dt_rf.py
| |-comp_material.ipynb
| |-user_interface_dt_rf.ipynb
|-quality_test/
| |-NN_metrics.py
|-optimization/
| |-hypersearch_nn.py
| |-hyper_tuning_DT_RF.ipynb
examples/
|-NN_demo.ipynb
|-DTandRF_demo.ipynb
doc/
|-DIRECT_finproj.pptx
|-Use case and component specification.md
|-dataset/
| |-bg_struct.csv
|-Image/
| |-pairplot.png
| |-actual vs predicted.png
| |-optimal nn model loss.png
| |-User Interface to predict band gap.png

Examples

See the examples folder for more demonstrations of predicting features with the available tools.

Next step and lessons learned

  • Add more flexible and customizable components to the CIF processing program, so that users can select properties other than band gap for downloading.
  • The current CIFconvert() only extracts the cell lengths and cell angles from the cif files. We hope to let users select which parameters to extract from the cif files.
  • Once the functionality above is improved, we are also interested in designing a UI for this component.
  • The exploration of machine learning and structure-property relationships indicates that the current hypothesis of a correlation between crystalline cell parameters and the band gap needs further refinement, as the example dataset bg_struct.csv does not control for non-structural factors that could affect the band gap.
