Code, data, and results described in the paper "Mining experimental data from materials science literature with large language models: an evaluation study", https://www.tandfonline.com/doi/full/10.1080/27660400.2024.2356506
```
@article{foppiano2024mining,
  author = {Luca Foppiano and Guillaume Lambard and Toshiyuki Amagasa and Masashi Ishii},
  title = {Mining experimental data from materials science literature with large language models: an evaluation study},
  journal = {Science and Technology of Advanced Materials: Methods},
  volume = {0},
  number = {ja},
  pages = {2356506},
  year = {2024},
  publisher = {Taylor \& Francis},
  doi = {10.1080/27660400.2024.2356506},
  URL = {https://doi.org/10.1080/27660400.2024.2356506},
  eprint = {https://doi.org/10.1080/27660400.2024.2356506}
}
```
| Information | Task | Dataset | Link | Evaluation results | Evaluation data |
|---|---|---|---|---|---|
| Material expressions | NER | SuperMat | Github | Results | predicted, expected |
| Properties | NER | MeasEval | Github | Results | predicted, expected |
| Materials -> properties extraction | RE | SuperMat | Github | Results | predicted, expected |
The fine-tuning training data are also stored in this repository.
```
conda create --name lumen python=3.9
conda activate lumen
pip install -r requirements.txt
```
The algorithm requires the material-parser project. Scripts must be run as Python modules, using the `-m` flag followed by the package path.
Formula matching evaluation
- Script: formula_matching-eval.py
- Description: Evaluates formula matching, reporting the F1 gain and the new matches obtained compared with strict matching
- Usage:
```
usage: formula_matching-eval.py [-h] --predicted PREDICTED --expected EXPECTED
                                [--verbose] [--base-url BASE_URL]

Evaluation of the formula matching, as compared with the strict matching: how
many elements that do not match with strict matching actually match with
formula matching?

optional arguments:
  -h, --help            show this help message and exit
  --predicted PREDICTED
                        Predicted dataset
  --expected EXPECTED   Expected dataset
  --verbose             Enable tons of prints
  --base-url BASE_URL   Formula matcher base url
```
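The intuition behind formula matching can be sketched in a few lines: two formulas match when they describe the same composition, regardless of element order. This is not the repository's matcher (which runs as a separate service behind `--base-url`); it is a minimal, self-contained illustration.

```python
import re
from collections import Counter

def parse_formula(formula: str) -> Counter:
    """Parse a chemical formula into element -> amount (decimals supported)."""
    counts = Counter()
    for element, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[element] += float(amount) if amount else 1.0
    return counts

def formulas_match(a: str, b: str) -> bool:
    """Formulas match if they contain the same elements in the same amounts."""
    return parse_formula(a) == parse_formula(b)

print(formulas_match("MgB2", "B2Mg"))    # True: same composition, reordered
print(formulas_match("MgB2", "MgB4"))    # False: different stoichiometry
```

Under this idea, pairs that strict string matching misses (e.g. reordered element symbols) are recovered, which is where the reported F1 gain comes from.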
NER:
- Script: process_openai_ner_materials.py
- Description: NER with LLMs on materials
- Usage:
```
usage: process_openai_ner_materials.py [-h] --input-text INPUT_TEXT
                                       [--model {chatgpt,chatgpt-ft-re,chatgpt-ft_shuffled-re,chatgpt-ft_shuffled-augmented-re,chatgpt-ft-ner-materials,chatgpt-ft-ner-quantities,gpt4,gpt4-turbo}]
                                       --output OUTPUT

Data preparation for the materials extraction using OpenAI LLMs

optional arguments:
  -h, --help            show this help message and exit
  --input-text INPUT_TEXT
                        Input CSV/TSV file containing text
  --model {chatgpt,chatgpt-ft-re,chatgpt-ft_shuffled-re,chatgpt-ft_shuffled-augmented-re,chatgpt-ft-ner-materials,chatgpt-ft-ner-quantities,gpt4,gpt4-turbo}
  --output OUTPUT       Output CSV file or directory
```
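The general shape of LLM-based NER is a prompt plus robust parsing of the reply. The prompt text below is a made-up placeholder (the actual prompts live in the `process_openai_*` scripts), and no API call is made; the sketch assumes the model is asked to return a JSON list of entity strings.

```python
import json

# Hypothetical prompt template, for illustration only.
PROMPT = (
    "Extract all material expressions from the passage below and return them "
    "as a JSON list of strings.\n\nPassage: {text}"
)

def build_prompt(text: str) -> str:
    return PROMPT.format(text=text)

def parse_response(raw: str) -> list:
    """Parse the model's reply, tolerating malformed (non-JSON) output."""
    try:
        entities = json.loads(raw)
        return [e for e in entities if isinstance(e, str)]
    except (json.JSONDecodeError, TypeError):
        return []

# Simulated model reply (no network call):
reply = '["MgB2", "LaFeAsO"]'
print(parse_response(reply))  # ['MgB2', 'LaFeAsO']
```

Tolerant parsing matters in practice: LLMs occasionally return prose or broken JSON, and the evaluation should count that as an empty prediction rather than crash.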
- Script: process_openai_few_shot_ner_materials.py
- Description: NER with LLMs on materials, using few-shot prompting
- Usage:
```
usage: process_openai_few_shot_ner_materials.py [-h] --input-text INPUT_TEXT
                                                [--model {chatgpt,chatgpt-ft-re,chatgpt-ft_shuffled-re,chatgpt-ft_shuffled-augmented-re,chatgpt-ft-ner-materials,chatgpt-ft-ner-quantities,gpt4,gpt4-turbo}]
                                                [--config CONFIG]
                                                --output OUTPUT

Data preparation for materials extraction using OpenAI LLMs

optional arguments:
  -h, --help            show this help message and exit
  --input-text INPUT_TEXT
                        Input CSV/TSV file containing text
  --model {chatgpt,chatgpt-ft-re,chatgpt-ft_shuffled-re,chatgpt-ft_shuffled-augmented-re,chatgpt-ft-ner-materials,chatgpt-ft-ner-quantities,gpt4,gpt4-turbo}
  --config CONFIG       Configuration file
  --output OUTPUT       Output CSV/TSV file
```
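Few-shot prompting prepends worked examples before the target passage. The examples below are invented for illustration; in the script they would come from the configuration file passed via `--config`.

```python
# Hypothetical few-shot examples (passage, gold entities); placeholders only.
EXAMPLES = [
    ("MgB2 becomes superconducting at 39 K.", ["MgB2"]),
    ("We doped LaFeAsO with fluorine.", ["LaFeAsO"]),
]

def few_shot_prompt(text: str) -> str:
    """Assemble an instruction, the demonstrations, and the target passage."""
    parts = ["Extract all material expressions from each passage."]
    for passage, entities in EXAMPLES:
        parts.append(f"Passage: {passage}\nMaterials: {entities}")
    parts.append(f"Passage: {text}\nMaterials:")
    return "\n\n".join(parts)

prompt = few_shot_prompt("Thin films of YBa2Cu3O7 were grown by sputtering.")
print(prompt)
```

The trailing `Materials:` cue invites the model to complete the pattern established by the demonstrations.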
- Script: process_openai_ner_properties.py
- Description: NER with LLMs on properties
- Usage:
```
usage: process_openai_ner_properties.py [-h] --input INPUT --output OUTPUT
                                        [--config CONFIG]
                                        [--model {chatgpt,chatgpt-ft-re,chatgpt-ft_shuffled-re,chatgpt-ft_shuffled-augmented-re,chatgpt-ft-ner-materials,chatgpt-ft-ner-quantities,gpt4,gpt4-turbo}]

Data preparation for the properties extraction using OpenAI LLMs

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         Input CSV/TSV file
  --output OUTPUT       Output file; supports JSON, CSV, or TSV
  --config CONFIG       Configuration file
  --model {chatgpt,chatgpt-ft-re,chatgpt-ft_shuffled-re,chatgpt-ft_shuffled-augmented-re,chatgpt-ft-ner-materials,chatgpt-ft-ner-quantities,gpt4,gpt4-turbo}
```
- Script: process_openai_few_shot_ner_properties.py
- Description: NER with LLMs on properties, using few-shot prompting
- Usage:
```
usage: process_openai_few_shot_ner_properties.py [-h] --input INPUT --output OUTPUT
                                                 [--config CONFIG]
                                                 [--model {chatgpt,chatgpt-ft-re,chatgpt-ft_shuffled-re,chatgpt-ft_shuffled-augmented-re,chatgpt-ft-ner-materials,chatgpt-ft-ner-quantities,gpt4,gpt4-turbo}]

Data preparation for the properties extraction using OpenAI LLMs

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         Input CSV/TSV file
  --output OUTPUT       Output file; supports JSON, CSV, or TSV
  --config CONFIG       Configuration file
  --model {chatgpt,chatgpt-ft-re,chatgpt-ft_shuffled-re,chatgpt-ft_shuffled-augmented-re,chatgpt-ft-ner-materials,chatgpt-ft-ner-quantities,gpt4,gpt4-turbo}
```
RE:
- Script: process_openai_re_supermat.py
- Description: Relation extraction with LLMs using the SuperMat dataset
- Usage:
```
usage: process_openai_re_supermat.py [-h] --input INPUT
                                     [--model {chatgpt,chatgpt-ft-re,chatgpt-ft_shuffled-re,chatgpt-ft_shuffled-augmented-re,chatgpt-ft-ner-materials,chatgpt-ft-ner-quantities,gpt4,gpt4-turbo}]
                                     --output OUTPUT [--shuffle]

Extract relations using the SuperMat dataset

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         Input CSV/TSV file containing text
  --model {chatgpt,chatgpt-ft-re,chatgpt-ft_shuffled-re,chatgpt-ft_shuffled-augmented-re,chatgpt-ft-ner-materials,chatgpt-ft-ner-quantities,gpt4,gpt4-turbo}
  --output OUTPUT       Output CSV file or directory
  --shuffle             Shuffle entities before passing to the LLM
```
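The idea behind `--shuffle` is to randomize the order of the extracted entities before they are given to the LLM, so the model cannot simply pair materials with properties by their position in the text. A minimal sketch (the fixed seed is our addition for reproducibility, not necessarily what the script does):

```python
import random

def shuffle_entities(entities, seed=42):
    """Return a shuffled copy of the entity list.

    Deterministic given the seed, so runs are repeatable."""
    shuffled = list(entities)
    random.Random(seed).shuffle(shuffled)
    return shuffled

entities = ["MgB2", "39 K", "LaFeAsO", "26 K"]
print(shuffle_entities(entities))
```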
NER:
- Script: eval_formulas.py
- Description: Evaluation of extracted entities for materials and properties using the novel formula matching
- Usage:
```
usage: eval_formulas.py [-h] --predicted PREDICTED --expected EXPECTED
                        [--verbose] [--base-url BASE_URL]

Evaluation of extracted entities for materials and properties using the novel
formula matching.

optional arguments:
  -h, --help            show this help message and exit
  --predicted PREDICTED
                        Predicted dataset
  --expected EXPECTED   Expected dataset
  --verbose             Enable tons of prints
  --base-url BASE_URL   Formula matcher base url
```
- Script: eval_ner.py
- Description: Evaluation of extracted entities for materials and properties using the standard matching approaches (strict, soft, sbert_cross)
- Usage:
```
usage: eval_ner.py [-h] --predicted PREDICTED --expected EXPECTED
                   --entity-type {material,property}
                   [--matching-type {all,strict,soft,sbert_cross}]
                   [--threshold THRESHOLD] [--verbose]

Evaluation of extracted entities for materials and properties using the
standard approaches.

optional arguments:
  -h, --help            show this help message and exit
  --predicted PREDICTED
                        Predicted dataset
  --expected EXPECTED   Expected dataset
  --entity-type {material,property}
                        Types of entities to evaluate
  --matching-type {all,strict,soft,sbert_cross}
                        Type of matching
  --threshold THRESHOLD
                        Matching threshold
  --verbose             Enable tons of prints
```
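The difference between strict and soft matching can be sketched as follows: strict matching requires exact string equality, while soft matching normalizes surface variation (case, whitespace, hyphens) before comparing. This is a simplified illustration, not the repository's exact scoring code.

```python
def normalize(s: str) -> str:
    """Soft matching: ignore case, whitespace, and hyphens."""
    return "".join(c for c in s.lower() if c not in " -\t")

def evaluate(predicted, expected, matching="strict"):
    """Micro precision/recall/F1 over two lists of entity strings."""
    norm = (lambda s: s) if matching == "strict" else normalize
    exp = {norm(e) for e in expected}
    tp = sum(1 for p in predicted if norm(p) in exp)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

pred = ["MgB2", "critical temperature", "La Fe As O"]
gold = ["MgB2", "critical-temperature", "LaFeAsO"]
print(evaluate(pred, gold, "strict"))  # only "MgB2" matches exactly
print(evaluate(pred, gold, "soft"))    # all three match after normalization
```

The `sbert_cross` option in the script goes further, scoring pairs with a SentenceBERT cross-encoder against the `--threshold`.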
RE:
- Script: eval_re_supermat.py
- Description: Evaluation script for RE using the SuperMat dataset.
- Usage:
```
usage: eval_re_supermat.py [-h] --predicted PREDICTED --expected EXPECTED
                           [--matching-type {all,strict,soft}]
                           [--threshold THRESHOLD] [--verbose]

Evaluation of extracted data

optional arguments:
  -h, --help            show this help message and exit
  --predicted PREDICTED
                        Input dataset
  --expected EXPECTED   Expected dataset
  --matching-type {all,strict,soft}
                        Type of matching
  --threshold THRESHOLD
                        Matching threshold
  --verbose             Enable tons of prints
```
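For relation extraction, scoring operates on linked pairs rather than single entities: a prediction counts only if both ends of the relation are correct. A simplified sketch of strict pair-level scoring on SuperMat-style (material, value) tuples, not the repository's exact implementation:

```python
def evaluate_relations(predicted, expected):
    """Precision/recall/F1 over (material, value) pairs, strict match."""
    pred, exp = set(predicted), set(expected)
    tp = len(pred & exp)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(exp) if exp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

pred = [("MgB2", "39 K"), ("LaFeAsO", "41 K")]
gold = [("MgB2", "39 K"), ("LaFeAsO", "26 K")]
print(evaluate_relations(pred, gold))  # (0.5, 0.5, 0.5)
```

Note that a correct material linked to a wrong value scores zero for that pair, which is why RE scores are typically lower than the corresponding NER scores.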