Supplementary Materials for "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature (2019).
Tshitoyan, V., Dagdelen, J., Weston, L., Dunn, A., Rong, Z., Kononova, O., Persson, K. A., Ceder, G. and Jain, A.
doi: 10.1038/s41586-019-1335-8
- Make sure you have
python3.6
and thepip
module installed. We recommend using conda environments. - Navigate to the root folder of this repository (the same folder that contains this README file)
and run
pip install -r requirements.txt
. Note: If you are using a conda env and any packages fail to compile during this step, you may need to first install those packages separately withconda install package_name
. - Wait for all the requirements to be downloaded and installed.
- Run
python setup.py install
to install this module. This will also download the Word2vec model files. If the download fails, manually download the model, word embeddings and output embeddings and put them in mat2vec/training/models. - Finalize your chemdataextractor installation by executing
cde data download
(You may need to restart your virtual environment for the cde command line interface to be found). - You are ready to go!
Example python usage:
from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
text_processor.process("LiCoO2 is a battery cathode material.")
(['CoLiO2', 'is', 'a', 'battery', 'cathode', 'material', '.'], [('LiCoO2', 'CoLiO2')])
For the various methods and options see the docstrings in the code.
Load and query for similar words and phrases:
from gensim.models import Word2Vec
w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
w2v_model.wv.most_similar("thermoelectric")
[('thermoelectrics', 0.8435688018798828), ('thermoelectric_properties', 0.8339033126831055), ('thermoelectric_power_generation', 0.7931368350982666), ('thermoelectric_figure_of_merit', 0.7916493415832 52), ('seebeck_coefficient', 0.7753845453262329), ('thermoelectric_generators', 0.7641351819038391), ('figure_of_merit_ZT', 0.7587921023368835), ('thermoelectricity', 0.7515754699707031), ('Bi2Te3', 0 .7480161190032959), ('thermoelectric_modules', 0.7434879541397095)]
Phrases can be queried with underscores:
w2v_model.wv.most_similar("band_gap", topn=5)
[('bandgap', 0.934801459312439), ('band_-_gap', 0.933477520942688), ('band_gaps', 0.8606899380683899), ('direct_band_gap', 0.8511275053024292), ('bandgaps', 0.818678617477417)]
Analogies:
# helium is to He as ___ is to Fe?
w2v_model.wv.most_similar(
positive=["helium", "Fe"],
negative=["He"], topn=1)
[('iron', 0.7700884938240051)]
Material formulae need to be normalized before analogies:
# "GaAs" is not normalized
w2v_model.wv.most_similar(
positive=["cubic", "CdSe"],
negative=["GaAs"], topn=1)
KeyError: "word 'GaAs' not in vocabulary"
from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
w2v_model.wv.most_similar(
positive=["cubic", text_processor.normalized_formula("CdSe")],
negative=[text_processor.normalized_formula("GaAs")], topn=1)
[('hexagonal', 0.6162797212600708)]
Keep in mind that words should also be processed before queries.
Most of the time this is as simple as lowercasing, however, it is the safest
to use the process()
method of mat2vec.processing.MaterialsTextProcessor
.
To run an example training, navigate to mat2vec/training/ and run
python phrase2vec.py --corpus=data/corpus_example --model_name=model_example
from the terminal. It should run an example training and save the files in models and tmp folders. It should take a few seconds since the example corpus has only 5 abstracts.
For more options, run
python phrase2vec.py --help
You can either report an issue on github or contact one of us directly. Try vahe.tshitoyan@gmail.com, jdagdelen@berkeley.edu, lweston@lbl.gov or ajain@lbl.gov.