sophiaalthammer / patent-lim

Code for the paper "Linguistically Informed Masking for Representation Learning in the Patent Domain" https://arxiv.org/abs/2106.05768

This code belongs to the paper "Linguistically informed masking for representation learning in the patent domain".

It contains separate folders for the different parts of the paper, such as the linguistic analysis and the domain and downstream fine-tuning. The patent-domain fine-tuned checkpoints of the BERT and SciBERT models trained with the MLM and the LIM method are available here. The Semantic Scholar data we used and the WikiText-2 raw data can be found in the data folder. There is also a sample of the patent data we used; the whole USPTO13M dataset can be extracted from Google BigQuery using this query. An illustrative sketch of such an extraction is shown below.
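
As a rough illustration only (not the exact query linked above), a BigQuery extraction of English patent text could look like the following sketch. It assumes the public patents-public-data.patents.publications table and the google-cloud-bigquery Python client; the field names and filters are assumptions.

    # Illustrative sketch only -- not the query used for USPTO13M.
    # Assumes the public BigQuery table patents-public-data.patents.publications
    # and the google-cloud-bigquery client (pip install google-cloud-bigquery pandas pyarrow).
    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
    SELECT
      publication_number,
      (SELECT text FROM UNNEST(title_localized)       WHERE language = 'en' LIMIT 1) AS title,
      (SELECT text FROM UNNEST(abstract_localized)    WHERE language = 'en' LIMIT 1) AS abstract,
      (SELECT text FROM UNNEST(claims_localized)      WHERE language = 'en' LIMIT 1) AS claim,
      (SELECT text FROM UNNEST(description_localized) WHERE language = 'en' LIMIT 1) AS description
    FROM `patents-public-data.patents.publications`
    WHERE country_code = 'US'
      AND kind_code LIKE 'A%'          -- application publications (assumed filter)
      AND filing_date >= 20000101      -- filed since 2000 (assumed filter)
    LIMIT 1000
    """

    df = client.query(sql).result().to_dataframe()
    df.to_csv('data/patent/bigquery_sample.csv', index=False)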

Please cite our work as follows:

@inproceedings{althammer2021patentlim,
      title={Linguistically Informed Masking for Representation Learning in the Patent Domain}, 
      author={Sophia Althammer and Mark Buckley and Sebastian Hofstätter and Allan Hanbury},
      year={2021},
      booktitle={Proceedings of the 2nd Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech) 2021 co-located with the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021)},
}

Structure

Requirements

Two different Python environments are needed to run the scripts: training the BERT model requires Python 2.7, while all other parts run with Python 3. The detailed requirements can be found in requirements_python2.txt for the Python 2.7 environment and in requirements_python3.txt for the Python 3 environment.

Python2.7 environment:

  • Tensorflow==1.15

Python3 environment:

  • Tensorflow > 0.12
  • pandas==0.25.3
  • sentencepiece==0.1.83
  • spacy==2.2.2
  • numpy==1.17.3

Execution

Preprocess patent and Semantic Scholar text for the linguistic analysis

  • Run in ling_ana with Python3 environment for patent text:

    python preprocess_patent.py file_name

    where file_name is the path to the csv file containing the patent text with the columns 'title', 'abstract', 'claim' and 'description'. For example: file_name = 'data/patent/part-000000000674.csv'

  • Run in ling_ana for Semantic Scholar text:

    python preprocess_semscho.py file_name

    where file_name is the path to the Semantic Scholar file from the Semantic Scholar Research Corpus containing the columns 'title' and 'abstract'. For example: file_name = 'data/semanticscholar/s2-corpus-000'
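
Several of the analysis scripts below expect these preprocessed files. As a quick sanity check, the output can be inspected with pandas; this is a minimal sketch that assumes the preprocessed file is a pickled DataFrame and reuses an example path from later in this README.

    import pandas as pd

    # Hypothetical example path; adjust to your own preprocessing output.
    df = pd.read_pickle('data/patent/part-000000000674_preprocessed_wo_claims.pkl')
    print(df.columns)  # expected: 'title', 'abstract', 'claim', 'description'
    print(df.head())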

Run in ling_ana with Python3 environment:

  • Explore IPC and CPC class occurrences

    python ipc_cpc_class_occurences.py file_name

    where file_name is the path to the csv file containing the patent text with the columns 'ipc' and 'cpc'. For example: file_name = 'data/patent/part-000000000674.csv'

  • Analyze sentence length of patents

    python sent_length_patent.py file_name

    where file_name is the path to the csv file containing the patent text with the columns 'title', 'abstract', 'claim' and 'description'. For example: file_name = 'data/patent/part-000000000674.csv'

  • Analyze noun chunks of patents (needs preprocessed patent text)

    python noun_chunks_patent.py file_name

    where file_name is the path to the pickle file containing the preprocessed patent text with the columns 'title', 'abstract', 'claim' and 'description'. For example: file_name = 'data/patent/part-000000000674_preprocessed_wo_claims.pkl'

  • Count words in the different segments of patents (needs preprocessed patent text)

    python count_words_patent.py file_name

    where file_name is the path to the pickle file containing the preprocessed patent text with the columns 'title', 'abstract', 'claim' and 'description'. For example: file_name = 'data/patent/part-000000000674_preprocessed_wo_claims.pkl'

  • Analyze hyphen expressions (needs preprocessed patent text)

    python hyphen_exp.py file_name

    where file_name is the path to the pickle file containing the preprocessed patent text with the columns 'title', 'abstract', 'claim' and 'description'. For example: file_name = 'data/patent/part-000000000674_preprocessed_wo_claims.pkl'

  • Analyze the Semantic Scholar text (needs preprocessed semantic scholar text)

    python ling_ana_semscho.py file_name

    where file_name is the path to the pickle file containing the preprocessed Semantic Scholar text with the columns 'title' and 'abstract'. For example: file_name = 'data/semanticscholar/s2-corpus-000_clean.pkl'

  • Analyze the Wikipedia dataset

    python ling_ana_wiki.py location_input location_output

    where location_input is the path to the input WikiText-2 raw file and location_output is the path where the preprocessed file will be saved. For example: location_input = 'data/wikitext/wikitext-2-raw/wiki.test.raw', location_output = 'data/wikitext/wikitext-2-raw/wiki.train.wo.captions.txt'

  • Compare the noun chunk distributions of the patent, Semantic Scholar and Wikipedia data (needs preprocessed patent and Semantic Scholar text)

    python compare_noun_chunks.py patent_file semscho_file wiki_file

    where patent_file and semscho_file are the paths to the preprocessed patent and Semantic Scholar pickle files, and wiki_file is the path to the Wikipedia file. For example: patent_file = 'data/en_since_2000_a_unzip/part-000000000674_preprocessed_wo_claims.pkl', semscho_file = 'data/semanticscholar/s2-corpus-000_clean.pkl', wiki_file = 'data/wikitext/wikitext-2-raw/wiki.train.raw'

  • Training a patent vocabulary

    python patent_vocab_train.py patent_file model_dir

    where patent_file is the path to the csv file containing the patent text and model_dir is the output path of the SentencePiece model. For example: patent_file = 'data/patent/part-000000000674.csv', model_dir = 'models/sentencepiece/patents_part-000000000674_sp_preprocessed/patent_wdescr_50k_sent_30k_vocab'. A minimal sketch of SentencePiece vocabulary training is shown after this list.

  • Comparing the different vocabularies for encoding patent text

    python compare_vocab_encodings.py patent_file model_dir bert_tokenizer scibert_tokenizer

    where patent_file is the path to the preprocessed pickle file containing the patent text, model_dir is the path of the SentencePiece model, bert_tokenizer is the path of the BERT vocabulary and scibert_tokenizer is the path of the SciBERT vocabulary. For example: patent_file = 'data/patent/part-000000000674_preprocessed.pkl', model_dir = 'models/sentencepiece/patents_part-000000000674_sp_preprocessed/patent_wdescr_5m_sent_30k_vocab', bert_tokenizer = 'bert-base-cased', scibert_tokenizer = 'models/scibert_scivocab_cased'
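
For illustration, training a SentencePiece vocabulary on patent text could look roughly like the sketch below; patent_vocab_train.py is the authoritative implementation, and the input selection and training options here are assumptions.

    # Rough sketch of SentencePiece vocabulary training (sentencepiece==0.1.83).
    # The input file, output prefix and training options are assumptions;
    # see patent_vocab_train.py for the actual settings.
    import pandas as pd
    import sentencepiece as spm

    df = pd.read_csv('data/patent/part-000000000674.csv')

    # Write one patent text segment per line as SentencePiece training input.
    with open('patent_sp_input.txt', 'w') as f:
        for text in pd.concat([df['title'], df['abstract']]).dropna():
            f.write(str(text).replace('\n', ' ') + '\n')

    spm.SentencePieceTrainer.Train(
        '--input=patent_sp_input.txt '
        '--model_prefix=patent_sp_30k '
        '--vocab_size=30000'
    )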

Run in format_text_pretrain in Python2.7 environment:

  • Create the format for the domain fine-tuning with BERT

    python format_pretrain_data.py file_loc file_name

    where file_loc is the directory of the files and file_name is the name of the csv.gz file containing the patent text. For example: file_loc = '/home/ubuntu/Documents/thesis/data/patent/', file_name = 'part-000000000674.csv.gz'

  • Create the file with the noun chunk positions

     python create_np_pretrain.py input_file output_file 

    where input_file is the text file produced by format_pretrain_data and output_file is the name of the output file in which the noun chunk positions are denoted with a value > 1. For example: input_file = 'bert/data/npvector_test_text.txt', output_file = 'bert/data/npvector_test_vectors.txt'. A rough illustration of the noun chunk marking is shown below.
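
For intuition only, marking noun chunk positions with spaCy could look like the following sketch; the exact output format of create_np_pretrain.py may differ, and the example sentence and chunk numbering are assumptions.

    # Toy illustration of noun chunk position vectors with spaCy (spacy==2.2.2).
    # Requires the English model: python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp('A semiconductor device comprises a gate electrode.')

    # 0 for tokens outside a noun chunk, a chunk id for tokens inside one.
    vector = [0] * len(doc)
    for chunk_id, chunk in enumerate(doc.noun_chunks, start=1):
        for i in range(chunk.start, chunk.end):
            vector[i] = chunk_id

    print(list(zip([token.text for token in doc], vector)))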

Run in bert in Python2.7 environment:

  • Create pretraining data for training BERT with the LIM method:

    export BERT_BASE_DIR=/path/to/bert/cased_L-12_H-768_A-12
    python create_pretraining_data_lim.py \
      --input_file=./data/part-000000000674.txt \
      --input_file_np=./data/np_vectors_part-000000000674.txt \
      --output_file=./data/tfrecord_128_lim_1_part-000000000674.tfrecord \
      --lim_prob=1.0 \
      --vocab_file=$BERT_BASE_DIR/vocab.txt \
      --do_lower_case=False \
      --max_seq_length=128 \
      --max_predictions_per_seq=20 \
      --masked_lm_prob=0.15 \
      --random_seed=12345 \
      --dupe_factor=5

    where lim_prob is the noun chunk masking probability and input_file_np is the txt file with the noun chunk masking positions.

For domain fine-tuning with MLM and for fine-tuning on the downstream tasks, we use the same commands as in the Google BERT repository; they can be found in the bash scripts.

Run in ipc_classifcation in Python3 environment:

  • Preprocess data for IPC classification

     python preprocess_ipc.py file_location new_file_location train_test start_index end_index 

    where file_location is the directory of the csv.gz files with the patent text, new_file_location is the directory where the new files are stored, train_test determines whether the files belong to the train or the test set and needs to be either 'train' or 'test', and start_index and end_index are the start and end numbers of the files, where the number is part of the file name, for example 'part-00000000674.csv.gz'. For example: file_location = '/home/ubuntu/Documents/thesis/data/patent_contents_en_since_2000_application_kind_a', new_file_location = '/home/ubuntu/PycharmProjects/patent/data/ipc', train_test = 'train', start_index = 670, end_index = 674

Run in citation_precition in Python3 environment:

  • Preprocess data

     python preprocess_data.py

    preprocesses the claim data of the positive citation pairs and creates the file 'patent-contents-for-citations_en_claim_wclaims_all.pkl'.

  • Create negative citation pairs by random permutation of the positive citation pairs (a toy sketch follows after this list)

     python negative_samples.py df_location positive_pairs_loc

    where df_location is the location of the preprocessed positive citation pairs created above and positive_pairs_loc is the location of the csv file containing the claims. For example: df_location = 'data/citations/patent-contents-for-citations-wclaims/patent-contents-for-citations_en_claim_wclaims_all.pkl', positive_pairs_loc = 'data/citations/patent-contents-for-citations-wclaims/citations-only-type-x-with_claims.csv'

  • Preprocess citation pairs for fine-tuning BERT

     python preprocess_cit_data.py file_name start_index end_index

    where file_name is the name of the csv.gz file containing the patent text, start_index is the start number of the patent documents and end_index is the end number of the documents. For example: file_name = 'data/citations/patent-contents-for-citations-wclaims/patent-contents-for-citations_en_claim_wclaims000000000000.csv.gz', start_index = 0, end_index = 10
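
The negative sampling idea from above can be illustrated with a toy sketch; the column names and data here are made up, and negative_samples.py is the authoritative implementation.

    # Toy illustration of negative citation pairs by random permutation.
    # Column names are hypothetical; see negative_samples.py for the real logic.
    import pandas as pd

    positive = pd.DataFrame({
        'citing_claim': ['claim A', 'claim B', 'claim C'],
        'cited_claim':  ['claim X', 'claim Y', 'claim Z'],
    })
    positive['label'] = 1

    # Shuffle the cited side to break the true citation links.
    # (A full implementation would also guard against rows that map to themselves.)
    negative = positive.copy()
    negative['cited_claim'] = negative['cited_claim'].sample(frac=1, random_state=42).values
    negative['label'] = 0

    pairs = pd.concat([positive, negative], ignore_index=True)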

Run in cnn_baseline with Python3 environment:

  • Preprocess the IPC classification data

     python preprocess_ipc.py input_dir output_dir start_index end_index

    where input_dir is the input directory containing the tsv files which are used for the IPC classification downstream task in BERT fine-tuning, output_dir is the output directory where a folder structure is built with each sample being a text file in the folder of its IPC tag, and start_index and end_index are the start and end numbers of the files. For example: input_dir = '/data/ipc_classification/', output_dir = '/data/cnn_baseline/ipc/', start_index = 0, end_index = 10

  • Preprocess the citation data

     python preprocess_cit.py file_name sample_number output_dir

    where file_name is the name of the pickle file containing all citation pairs, sample_number is the number of samples taken from the file with the citation pairs, and output_dir is the output directory in which a file with the positive pairs and a file with the negative pairs are created. For example: file_name = 'data/citations/patent-contents-for-citations-wclaims/citations-only-type-x-with_claims_train_data.pkl', sample_number = 16000, output_dir = 'data/cnn_baseline/citation/'

  • Train the CNN

    ./train.py
  • Evaluate the CNN

    ./eval.py --eval_train --checkpoint_dir="./runs/1459637919/checkpoints/"

    Replace the checkpoint dir with the output from the training. To use your own data, change the eval.py script to load your data.

Run in ipc_citation_dependency with Python3 environment:

  • Preprocess data for training the linear classifier in preprocess_train_test.py
  • Train a linear classifier using scikit-learn in train_classifier.py (a minimal sketch is shown below)
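
As a minimal sketch of the linear classifier training (with synthetic features; preprocess_train_test.py produces the real inputs, and the classifier choice here is an assumption):

    # Minimal scikit-learn sketch with synthetic data; the actual features and
    # labels come from preprocess_train_test.py.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X = np.random.rand(200, 8)                # placeholder feature matrix
    y = np.random.randint(0, 2, size=200)     # placeholder binary labels

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print('test accuracy:', clf.score(X_test, y_test))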

License and copyright

Copyright (c) Siemens AG, 2020

All contributions by Siemens AG in this repository are licensed under Apache-2.0.

Third-party software

We include the following third-party software in this repository:

BERT

Copyright 2018 The Google AI Language Team Authors.

License: Apache-2.0

CNN text classification tf

License: Apache-2.0
