rpo19 / pozzi_aixia_2023

NER, NEL, NIL prediction pipeline for the paper Pozzi et al. @ AIxIA 2023

Named Entity Recognition and Linking for Entity Extraction from Italian Civil Judgements

Riccardo Pozzi, Riccardo Rubini, Christian Bernasconi, Matteo Palmonari

University of Milano-Bicocca, Milan, Italy

Set up and run the pipeline

  • Create the .env file from the sample
cp env-sample.txt .env
  • Make sure the repository folder looks like this
> ls
models
models/spacy_it_trf_wikiner                 # spaCy model
models/faiss_hnsw_ita_index.pkl             # dense retrieval index for entities
                                            # (see BLINK paper)
models/nilp_bi_max_secondiff_model.pickle   # NIL prediction model (features:
                                            # top linking score, top score - second
models/blink_biencoder_base_wikipedia_ita   # BLINK biencoder Italian
models/pg_dump_aixia_2023.dump              # postgres db dump
                                            # contains entities (it.wikipedia.org)
models/bert-base-italian-xxl-uncased        # BERT Italian model
biencoder
docker-compose.yml
env-sample.txt
indexer
nilpredictor
postgres
Readme.md
spacyner
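
Before starting the containers it can help to confirm that every model file from the listing is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_models` is hypothetical, and the expected names are copied from the listing above.

```python
from pathlib import Path

# Names taken from the folder listing above; adjust if your layout differs.
EXPECTED_MODELS = [
    "spacy_it_trf_wikiner",
    "faiss_hnsw_ita_index.pkl",
    "nilp_bi_max_secondiff_model.pickle",
    "blink_biencoder_base_wikipedia_ita",
    "pg_dump_aixia_2023.dump",
    "bert-base-italian-xxl-uncased",
]


def missing_models(models_dir="models"):
    """Return the expected entries that are absent from models_dir (hypothetical helper)."""
    root = Path(models_dir)
    return [name for name in EXPECTED_MODELS if not (root / name).exists()]
```

Calling `missing_models()` from the repository root returns an empty list when nothing is missing.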
  • Run (you may need sudo)
docker-compose up -d
  • Import data (you may need sudo)
# executes pg_restore inside the postgres docker container
docker-compose exec -T postgres pg_restore -U postgres -a -Fc -d postgres < models/pg_dump_aixia_2023.dump
  • Install the requirements
# we suggest running this inside a virtualenv
pip install -r requirements.txt
  • Start Jupyter and run the notebook
jupyter-notebook --port 9010
  • Open the link provided by Jupyter

  • Try the pipeline
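
The model listing describes the NIL prediction model as using the top linking score and a "top score - second" difference as features (the comment is truncated at that point). As a rough illustration only, and assuming "second" means the second-best candidate score, the two features could be derived from a mention's candidate scores as below. The function name `nil_features` is hypothetical and is not taken from the repository.

```python
def nil_features(scores):
    """Return (top_score, top_minus_second) for one mention's candidate scores.

    Assumption: "second" is the second-highest linking score.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two candidate scores")
    top, second = ranked[0], ranked[1]
    return top, top - second
```

A small gap between the two best candidates suggests an ambiguous link, which is presumably why the difference is useful next to the raw top score.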

Example NER training

  • Prepare the training data in spaCy's binary format, e.g., as train.spacy, dev.spacy, and test.spacy in the folder data.
  • Prepare the config.cfg (see the example in this repo)
  • Train the spaCy model
spacy train config.cfg --output spacyner --gpu-id 0 --paths.train ./data/train.spacy --paths.dev ./data/dev.spacy
  • Evaluate the model (spacy train saves the best checkpoint as model-best in the output folder)
spacy evaluate --gpu-id 0 spacyner/model-best ./data/test.spacy
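
The .spacy files used above are serialized `DocBin` objects. The sketch below shows one way to build such a file from character-offset annotations. The example sentence, its labels, and the helper names `find_bad_spans` and `write_docbin` are illustrative assumptions, not part of this repository. The span check runs first because spaCy rejects overlapping entities, and `Doc.char_span` returns `None` for offsets that do not align with token boundaries.

```python
from pathlib import Path

# Hypothetical annotated example: (text, [(start_char, end_char, label), ...])
EXAMPLES = [
    ("Mario Rossi ricorre contro il Comune di Milano.",
     [(0, 11, "PER"), (30, 46, "ORG")]),
]


def find_bad_spans(text, spans):
    """Return spans that are empty, out of range, or overlap an earlier span."""
    bad, last_end = [], 0
    for start, end, label in sorted(spans):
        if not (0 <= start < end <= len(text)) or start < last_end:
            bad.append((start, end, label))
        else:
            last_end = end
    return bad


def write_docbin(examples, path, lang="it"):
    """Serialize examples to a .spacy file; spaCy is imported only when called."""
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)  # tokenizer only, no trained components needed
    doc_bin = DocBin()
    for text, spans in examples:
        if find_bad_spans(text, spans):
            raise ValueError(f"invalid spans for text: {text!r}")
        doc = nlp.make_doc(text)
        ents = [doc.char_span(s, e, label=l) for s, e, l in spans]
        # Drop spans that do not align with token boundaries (char_span gives None).
        doc.ents = [ent for ent in ents if ent is not None]
        doc_bin.add(doc)
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    doc_bin.to_disk(path)
```

With spaCy installed, `write_docbin(EXAMPLES, "data/train.spacy")` writes a file that can be passed to `--paths.train`.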

License: MIT License