HOSMEL: A Hot-Swappable Modularized Entity Linking Toolkit for Chinese

Usage

First, clone the repository:

git clone git@github.com:THUDM/HOSMEL.git

Our toolkit supports three levels of usage:

Ready-to-Use

If you do not want to change our default setup, follow the setup steps for Mention Filtering, Mention Detection, Disambiguation By Subtitle, and Disambiguation By Relation. A live demonstration using the same structure is also available.

Partial

Some users may prefer to design their own entity disambiguation framework or have other needs, but still require high-quality candidate entity retrieval. We therefore support partial usage. To do this, complete the setup (which only involves downloading and extracting zip files), import the corresponding part of the toolkit, and use it in your preferred manner. For illustration, a sample usage of the complete pipeline can be found at

https://drive.google.com/drive/folders/1eh-dJnKWJulPuZGsORii4fPW-zCmWS5k?usp=sharing

See Setting Up and Usage of current Module for more information.

Easy-to-Change

To train your own module, we recommend copying NewModule, a template module we created. Ideally, for training you only need to make sure your training data has the following form:

{
    "sentence": "The input text",
    "Label": int(k), # Label id: target{k} is the correct candidate
    "mention": "the mention of the entity", # Note: for mention detection, leave the mention empty and use your candidate mentions as the targets
    "target0": "A", # The four candidate values
    "target1": "B",
    "target2": "C",
    "target3": "D"
}

Then reimplement the generatePairs method in the apply{feature}.py file for inference.
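Before training, a record in the format above can be sanity-checked with a few lines of Python. This is only a minimal sketch: the validate_record helper is hypothetical (it is not part of the toolkit), and it assumes that Label indexes one of the four targets, so that valid values are 0 to 3.

```python
import json

# Keys taken from the record format shown above.
REQUIRED_KEYS = ("sentence", "Label", "mention",
                 "target0", "target1", "target2", "target3")


def validate_record(record):
    """Return a list of problems found in one training record (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    label = record.get("Label")
    # bool is a subclass of int in Python, so exclude it explicitly.
    # Assumption: Label is the index k of the correct target{k}, so 0 <= k <= 3.
    if not isinstance(label, int) or isinstance(label, bool) or not 0 <= label <= 3:
        problems.append("Label must be an integer in 0..3")
    return problems


example = {
    "sentence": "The input text",
    "Label": 1,  # target1 is the correct candidate
    "mention": "the mention of the entity",
    "target0": "A",
    "target1": "B",
    "target2": "C",
    "target3": "D",
}

print(validate_record(example))  # prints []
print(json.dumps(example, ensure_ascii=False))
```

Running such a check over every record before preprocessing can surface malformed entries early, before a training run is started.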

Links to the model Checkpoints

https://drive.google.com/file/d/12w12GH5XEVGKYoaWm_sXVFHGFOSFJHnu/view?usp=sharing
https://drive.google.com/file/d/1BZphOj8rS7qHZA3wWz0vcY3H_qbCjTGK/view?usp=sharing
https://drive.google.com/file/d/1pMqN63yy9S9NZJWRV41bc-dASRndLwtr/view?usp=sharing
https://drive.google.com/file/d/1xKvPx0LY6XgVXY7wtSmUwk2iMfBm-9qw/view?usp=sharing

Setting Up

Dependencies

Our method requires a few Python-based dependencies:

pip install flask torch tqdm pyahocorasick datasets transformers

Make sure you have all the dependencies installed to access all of our methods.
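One way to confirm the installation is to ask Python whether each module can be located. The sketch below is a hypothetical helper, not part of the toolkit; the only non-obvious detail is that the pyahocorasick package is imported under the module name ahocorasick.

```python
import importlib.util

# pip package name -> importable module name.
# pyahocorasick is imported as "ahocorasick"; the others share both names.
DEPENDENCIES = {
    "flask": "flask",
    "torch": "torch",
    "tqdm": "tqdm",
    "pyahocorasick": "ahocorasick",
    "datasets": "datasets",
    "transformers": "transformers",
}


def missing_dependencies(dependencies=DEPENDENCIES):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in dependencies.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    print("Missing:", ", ".join(missing) if missing else "none")
```

The check only locates the modules without importing them, so it stays fast even for heavy packages such as torch.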

Mention Filtering

First, download TriMention.zip to the TriMention directory and extract it. Your directory should look like this:

TriMention/
├── bdi2relation.pkl
├── mention.py
├── nameTri
├── subList.json
└── web.py

The TriMention folder includes not only the basic trie but also the subtitle and relation data used in later sections. Because loading this data takes a long time, we separate data loading from processing for a better development experience. To load the data, run

python mention.py
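For readers unfamiliar with trie-based mention matching, the idea can be illustrated in pure Python. This is a minimal sketch and not the toolkit's implementation: build_trie and find_mentions are hypothetical names, the three entity names are made up for the example, and the shipped nameTri data is not used here.

```python
def build_trie(names):
    """Build a nested-dict trie; the key None marks the end of a name."""
    root = {}
    for name in names:
        node = root
        for char in name:
            node = node.setdefault(char, {})
        node[None] = name
    return root


def find_mentions(text, trie):
    """Return (start, end, name) for every known name occurring in text."""
    found = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:  # no known name continues with this character
                break
            if None in node:  # a complete name ends here
                found.append((start, end + 1, node[None]))
    return found


trie = build_trie(["清华", "清华大学", "北京"])
print(find_mentions("清华大学在北京", trie))
# prints [(0, 2, '清华'), (0, 4, '清华大学'), (5, 7, '北京')]
```

Note that overlapping names are all reported; deciding which candidate mention is correct is left to the later stages.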

Mention Detection

Download the MD_checkpoint.zip file and extract it to the MCMention/model folder. It should look like

MCMention/
├── applyMention.py
├── model
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── vocab.txt
├── preprocessData.py
└── train.py

Disambiguation By Subtitle

Download the SD_checkpoint.zip to the model directory under MCSubtitle. Then unzip it. The final directory should look like

MCSubtitle/
├── applySubtitle.py
├── model
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── vocab.txt
├── preprocessData.py
└── train.py

Disambiguation By Relation

Download the RD_checkpoint.zip to MCRelation/model/, and unzip to get

MCRelation/
├── applyRelation.py
├── model
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── vocab.txt
├── preprocessData.py
└── train.py
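After extracting the three checkpoints, the model folders can be compared against the directory trees above. The following is a minimal sketch with a hypothetical missing_files helper; the expected file names are copied from those trees, and the repository root is assumed to be passed in by the caller.

```python
from pathlib import Path

COMMON_FILES = ["config.json", "pytorch_model.bin", "special_tokens_map.json",
                "tokenizer_config.json", "vocab.txt"]

# Expected contents of each model folder, copied from the directory trees above.
EXPECTED = {
    "MCMention/model": COMMON_FILES,
    "MCSubtitle/model": COMMON_FILES + ["tokenizer.json"],
    "MCRelation/model": COMMON_FILES + ["tokenizer.json"],
}


def missing_files(repo_root, expected=EXPECTED):
    """Return the relative paths of expected checkpoint files that are absent."""
    root = Path(repo_root)
    return [f"{folder}/{name}"
            for folder, names in expected.items()
            for name in names
            if not (root / folder / name).is_file()]


if __name__ == "__main__":
    # Assumption: the script is run from the repository root.
    for path in missing_files("."):
        print("missing:", path)
```

An empty result means every file listed in the trees is present; it does not verify that the checkpoint contents are valid.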

Launching HOSMEL

To launch the complete HOSMEL pipeline, we provide a Flask-based backend. Run it with

python backend.py

Data Release

We release our training data here

Our test data is also available here

Usage of current Module

Our provided modules are simple to use. After setup, most modules implement their method in an apply{$Module}.py file, which contains a topk{$Module} method. This method usually takes three parameters:

Parameter	Usage
q	The input text of the entity linking framework.
mentions/entities	The result output from the previous step.
K	The number of top-scoring results the current module passes on to the next one. The default is 3.

The mention filtering stage is different: it is the first step of entity linking, and the candidate set before this step can be viewed as the entire domain, so we deploy it separately in the TriMention/mention.py file. To use it, import parse_mentions from TriMention/web.py and call

from TriMention.web import parse_mentions
mentions = parse_mentions(text)

where text is the input text to the toolkit.

The other modules follow in order:

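# Import paths below are inferred from the module layout in Setting Up
from MCMention.applyMention import topkMention
from MCSubtitle.applySubtitle import topkSubtitle
from MCRelation.applyRelation import topkRelation
entities = topkMention(text, mentions, K=3)
entities = topkSubtitle(text, entities, K=3)
entities = topkRelation(text, entities, K=1)
print(entities)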
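The effect of K on such a chain can be seen in a toy example. The sketch below does not use the toolkit's models: top_k and run_pipeline are hypothetical helpers, and the candidates and scores are invented solely to show how each stage prunes the list handed to the next.

```python
def top_k(candidates, score, k=3):
    """Keep the k candidates with the highest score, best first."""
    return sorted(candidates, key=score, reverse=True)[:k]


def run_pipeline(text, candidates, stages):
    """Apply each (score_fn, k) stage in order, pruning to k candidates each time."""
    for score_fn, k in stages:
        candidates = top_k(candidates, lambda c: score_fn(text, c), k)
    return candidates


# Invented scores standing in for two disambiguation stages.
stage1 = {"Apple Inc.": 0.9, "apple (fruit)": 0.8, "Apple Records": 0.4, "Apple (film)": 0.1}
stage2 = {"Apple Inc.": 0.7, "apple (fruit)": 0.2, "Apple Records": 0.6, "Apple (film)": 0.9}

stages = [
    (lambda text, c: stage1[c], 3),  # first stage keeps the top 3
    (lambda text, c: stage2[c], 1),  # last stage keeps the single best
]

result = run_pipeline("Apple released a new phone", list(stage1), stages)
print(result)  # prints ['Apple Inc.']
```

"Apple (film)" has the highest second-stage score but was already dropped by the first stage, which is the trade-off a small K implies: later stages run on fewer candidates, but they cannot recover an entity that an earlier stage pruned.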

Additional/New Module

Training

To train a new module, move the training data to the corresponding folder and run

python preprocessData.py

Make sure the file name is right; for example, the training data in the MCSubtitle folder must be named subtitleData.json. This should produce a processedData.json file in the same directory. Then run

python train.py

The model's checkpoint should be saved in the model folder.

Usage after training new Module

Ideally, if you have selected your checkpoint and replaced the model folder with it, you do not need to change anything other than the generatePairs method. If you do want to use a different model directory, open the applyNew.py file and change

model_location = os.path.join(os.path.dirname(__file__),"model")

to

model_location = "New checkpoint location"

To use the new module for inference, you must reimplement the generatePairs method. The generatePairs method takes the input entities, i.e., the output of the previous module, and produces a list of "mention|attribute value" pairs. A bdi_list variable, containing the same number of items as the pairs list, with the i-th item being the id of the i-th pair's entity, is required to add the scores back to the corresponding entity.
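The relationship between the pairs list and bdi_list can be sketched as follows. This is an illustration under stated assumptions, not the toolkit's code: the entity dictionaries with id, mention, and score keys, the attributes lookup, the generate_pairs and add_scores names, and the choice to sum scores per entity are all hypothetical.

```python
def generate_pairs(entities, attributes):
    """Build "mention|attribute value" pairs plus a parallel list of entity ids."""
    pairs, bdi_list = [], []
    for entity in entities:
        for value in attributes.get(entity["id"], []):
            pairs.append(f'{entity["mention"]}|{value}')
            bdi_list.append(entity["id"])  # i-th id belongs to the i-th pair
    return pairs, bdi_list


def add_scores(entities, bdi_list, pair_scores):
    """Add each pair's score back to the entity that produced it (summing is an assumption)."""
    totals = {entity["id"]: entity.get("score", 0.0) for entity in entities}
    for entity_id, score in zip(bdi_list, pair_scores):
        totals[entity_id] += score
    return totals


# Hypothetical output of a previous module and a hypothetical attribute table.
entities = [
    {"id": "e1", "mention": "苹果", "score": 1.0},
    {"id": "e2", "mention": "苹果", "score": 0.5},
]
attributes = {"e1": ["水果"], "e2": ["公司", "手机"]}

pairs, bdi_list = generate_pairs(entities, attributes)
print(pairs)     # prints ['苹果|水果', '苹果|公司', '苹果|手机']
print(bdi_list)  # prints ['e1', 'e2', 'e2']
print(add_scores(entities, bdi_list, [0.25, 0.5, 1.0]))  # prints {'e1': 1.25, 'e2': 2.0}
```

Because an entity can contribute several pairs, the parallel id list is what lets per-pair model scores be attributed to the right entity afterwards.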

To test your newly implemented module, import the topkNew method and use

from TriMention.web import parse_mentions as mentionFiltering
from ... import ... as DisambiguationBy...
......
from NewModule.applyNew import topkNew as DisambiguationByNew
text = "A test text"
entities = mentionFiltering(text)
entities = DisambiguationBy...(text,entities,K=3)
......
entities = DisambiguationByNew(text,entities,K=3)
print(entities[0])

Live Demonstration

We provide a live demonstration at https://www.aminer.cn/el

Citation

If you find our project helpful, please cite our paper:

@inproceedings{zhangli2022hosmel,
  title={HOSMEL: A Hot-Swappable Modularized Entity Linking Toolkit for Chinese},
  author={Zhang-Li, Daniel and Zhang, Jing and Yu, Jifan and Zhang, Xiaokang and Zhang, Peng and Tang, Jie and Li, Juanzi},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
  year={2022}
}
