Capture the domain knowledge embedded in an existing rule-based system for legal case pseudo-anonymization and enhance it through machine learning algorithms.
Build a Named Entity Recognition (NER) training dataset and learn a model dedicated to French legal case anonymization by leveraging the existing rule-based system and adding noise and synthetic data.
The project goes beyond the scope covered by the rule-based system, which was limited to addresses and natural person names.
This model can be used in a pseudo-anonymization system.
The input format is the one produced by the rule-based system skill cartridges.
Measures computed over manually annotated data show strong performance, in particular on the names of natural persons and legal professionals.
The only French legal cases massively acquired by Lefebvre Sarrut that are not pseudo-anonymized are those from appeal courts (the Jurica database).
The input data are XML files from Jurica, as generated by skill cartridges, covering the period 2008-2019.
The project is focused on finding mentions of entities and guessing their types.
It doesn't manage the pseudo-anonymization step itself, i.e. replacing the entities found in the previous step by another representation.
Many SOTA algorithms are available as open source projects.
Therefore, developing a NER algorithm is not in the scope of this project.
The main focus of this work is to generate a large, high-quality training set able to leverage all the knowledge encoded in the existing rule-based system.
Learning on a dataset made only from rules may build a weak model that merely repeats those rules.
Therefore, we have included as many tricks as needed to catch or create more complex patterns.
With them, we have been able to produce a robust model, able to find many more entities than the initial rules!
The strategies used are listed below:

- Skill cartridges
  - leveraging the extractions performed by skill cartridges, which embed the many customizations and the domain knowledge of Lefebvre Sarrut teams
- Rules
  - using some easy-to-describe patterns to catch entities (with regex)
  - finding some other entities using dictionaries (e.g. city names)
- Name extension
  - extending any discovered entity to the neighboring words when it makes sense
  - done carefully, otherwise there is a risk of lowering the quality of the training set
- Finding all occurrences of caught entities
  - looking for all occurrences of each entity already found in a document (2-pass process)
  - building dictionaries of frequent names over all documents and looking for them in each document (2-pass process)
- Dataset augmentation
  - creating variations of the discovered entities and searching for them
    - by removing the first or last name, changing the case of one or more words in the entity, removing keywords (M., Mme, la société, ...), etc.
    - transformations are randomly applied (20% of entities are transformed)
    - makes the model more robust to errors in the text
    - these variations cannot easily be discovered with patterns
    - e.g. changing the case is an easy way to work around writing patterns to catch entities written in lower case
- Miscellaneous tricks
  - removing from the training set all paragraphs containing no entity
    - paragraphs with no entity may be due to too-simplistic patterns
  - applying priority rules over the source of the entity offsets when there is a conflict of type
    - some candidate generators are safer than others
    - a `_1` is added to the end of the tag label when it is safe, and it is removed during the offset normalization step
  - looking for doubtful MWE (multi-word expression) candidates and declaring them as doubtful
    - doubtful MWE candidates are any sequence of words starting with an upper-case letter
    - a filter is then applied to keep only those containing a first name (based on a dictionary)
    - no loss is computed on these entities, meaning they don't influence the model during training
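The "find all occurrences" pass can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name and the example sentence are invented.

```python
import re

def find_all_occurrences(text, entities):
    """Second pass: given entity strings already caught in a document
    (mapping surface form -> entity type), locate every occurrence of
    each string, not only the one the rules originally matched."""
    offsets = []
    for surface, entity_type in entities.items():
        # re.escape() guards regex metacharacters, \b keeps whole-word matches
        for match in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            offsets.append((match.start(), match.end(), entity_type))
    return sorted(offsets)

text = "Mme Durand a saisi la cour. La demande de Durand est rejetée."
print(find_all_occurrences(text, {"Durand": "PERS"}))
# -> [(4, 10, 'PERS'), (42, 48, 'PERS')]
```

The second occurrence of "Durand", which the rules alone might have missed, is labeled for free.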
The purpose of ML is to smooth the rules and the other tricks, making the whole system much more robust to hard-to-catch entities. Data augmentation in particular has proved to be very efficient.
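As an illustration, the augmentation step could look like the following sketch. The keyword list, function names, and branch logic are invented for this example; only the transformation types and the 20% rate come from the description above.

```python
import random

# Illustrative keyword list; the project's actual list is longer
KEYWORDS = ("M. ", "Mme ", "la société ")

def augment(entity, rng):
    """Return one randomly transformed variant of an entity surface form,
    mimicking the augmentations described above (sketch, not project code)."""
    choice = rng.choice(["lowercase_word", "drop_keyword", "drop_token"])
    tokens = entity.split()
    if choice == "lowercase_word" and tokens:
        i = rng.randrange(len(tokens))
        tokens[i] = tokens[i].lower()
        return " ".join(tokens)
    if choice == "drop_keyword":
        for keyword in KEYWORDS:
            if entity.startswith(keyword):
                return entity[len(keyword):]
    if choice == "drop_token" and len(tokens) > 1:
        # drop the first or the last token (e.g. first name or last name)
        return " ".join(tokens[1:] if rng.random() < 0.5 else tokens[:-1])
    return entity

def maybe_augment(entity, rng, rate=0.2):
    # as in the project, only ~20% of entities are transformed
    return augment(entity, rng) if rng.random() < rate else entity
```

Each variant is then searched for in the text like any other entity, so the model sees noisy forms it would never get from the rules alone.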
Our rule-based system only managed the `PERS`, `ADDRESS` and `RG` types.

- Persons:
  - `PERS`: natural persons (first names included, unlike skill cartridges), source: skill cartridges + name extension + other occurrences
  - `ORGANIZATION`: organizations, source: skill cartridges + rules + extension + other occurrences
  - `PHONE_NUMBER`: phone numbers, source: rules
  - `LICENCE_PLATE`: licence plate numbers, source: rules
- Lawyers:
  - `LAWYER`: lawyers, source: rules + other occurrences
  - `BAR`: bar where lawyers are registered (not done by skill cartridges), source: rules + other occurrences
- Courts:
  - `COURT`: names of French courts, source: rules + other occurrences
  - `JUDGE_CLERK`: judges and court clerks, source: rules + other occurrences
- Miscellaneous:
  - `ADDRESS`: addresses (badly done by skill cartridges), source: rules + other occurrences + dictionary
    - there is no way to always guess whether the address owner is a `PERS` or an `ORGANIZATION`, so this aspect is not managed
  - `DATE`: any date, in figures or in letters, source: rules + other occurrences
  - `RG`: ID of the legal case, source: skill cartridges + rules
  - `UNKNOWN`: only for the training set; indicates that no loss should be applied to the word, whatever the prediction is, source: rules + dictionary
Dataset augmentation and the miscellaneous tricks have been applied to each type.
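For illustration, here is how such annotations map onto spaCy-style training examples. The sentence, character offsets, and variable name are invented for this sketch; only the labels come from the list above.

```python
# Illustrative spaCy-style training examples using the project's labels;
# the sentence and character offsets are invented for this sketch.
TRAIN_DATA = [
    (
        "Mme Jeanne Durand, représentée par Me Paul Martin, avocat au barreau de Lyon",
        {"entities": [(4, 17, "PERS"), (38, 49, "LAWYER"), (61, 76, "BAR")]},
    ),
]

text, annotations = TRAIN_DATA[0]
for start, end, label in annotations["entities"]:
    print(label, "->", text[start:end])
# prints:
# PERS -> Jeanne Durand
# LAWYER -> Paul Martin
# BAR -> barreau de Lyon
```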
The main NER model is from the spaCy library and is best described in this video.
Basically, it is a CNN + hashing trick / Bloom filter, with an L2S (learning to search) approach on top.
The L2S part is very similar to a classical dependency parser algorithm (stack + actions).
Advantages of the spaCy approach:

- no manual feature extraction (done by spaCy: suffix and prefix, 3 letters each, and the word shape)
- quite fast on CPU (eases deployment)
- low memory footprint (eases deployment)
- off-the-shelf algorithm (documented, maintained, large community, etc.)
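The hashing trick / Bloom filter idea can be illustrated with a toy version: instead of one embedding row per vocabulary word, each word is hashed a few times into a small fixed table, and its vector is the sum of those rows. Memory stays constant however large the vocabulary grows, and unseen words still get a representation. All sizes and details below are illustrative, not spaCy's real implementation.

```python
import random

# Toy Bloom-style embedding table; sizes are illustrative
ROWS, DIM, K = 1000, 4, 2
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(ROWS)]

def embed(word):
    """Sum the K table rows the word hashes into."""
    vec = [0.0] * DIM
    for seed in range(K):
        row = hash((seed, word)) % ROWS  # K quasi-independent buckets
        vec = [v + t for v, t in zip(vec, table[row])]
    return vec
```

Two words collide only if all K hashes agree, which keeps accidental sharing rare while bounding memory.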
The project is fully written in Python and can't be rewritten in another language, because spaCy only exists for Python.
No language-related resources are used.
A few open data dictionaries are used:

- a dictionary of French first names (open data)
- a dictionary of the postal codes and cities of France (open data)

Both resources are stored in the Git repository (`resources/` folder).
Neither is strategic to the success of the learning, but they provide a little help.
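As an illustration of how the first-name dictionary supports the doubtful-MWE filter described earlier, consider the following simplified sketch; the names, regex, and function are invented for this example.

```python
import re

# Tiny stand-in for the open data dictionary of French first names
FIRST_NAMES = {"Jeanne", "Paul", "Marie"}

def doubtful_candidates(text):
    """Flag doubtful MWE candidates: any sequence of capitalized words,
    kept only when one token is a known first name (simplified sketch)."""
    pattern = r"(?:[A-ZÀ-Ý][\w'-]*)(?:\s+[A-ZÀ-Ý][\w'-]*)+"
    candidates = re.findall(pattern, text)
    return [c for c in candidates if any(tok in FIRST_NAMES for tok in c.split())]

print(doubtful_candidates("Le dossier de Paul Bernard est clos"))  # -> ['Paul Bernard']
print(doubtful_candidates("La Cour Suprême a statué"))             # -> []
```

Candidates that pass the filter are excluded from the loss, so they neither confirm nor contradict the model during training.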
The paths listed below can be modified in the config file `resources/config.ini`.

- Cases have to be provided as XML in the format used by skill cartridges (an example is provided in the `resources` folder).
- One XML file represents one week of legal cases.
- XML files should be put in the folder `resources/training_data/`.
- The case used for inference has to be placed in `resources/dev_data/`.
- The folder `resources/test/` contains an XML file used for unit tests.
- Resources are to be put in the folders `resources/courts`, `resources/postal_codes` and `resources/first_names`.
- The folder `resources/model/` will contain the spaCy model.
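Reading such a config file with Python's `configparser` could look like the sketch below. The section and key names shown are assumptions, not necessarily those the project actually uses.

```python
import configparser

# Hypothetical content for resources/config.ini; section and key names
# are assumptions, not necessarily the project's actual ones.
SAMPLE = """
[resources]
training_data = resources/training_data/
dev_data = resources/dev_data/
model = resources/model/
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)  # the project would call config.read("resources/config.ini")
print(config["resources"]["training_data"])  # -> resources/training_data/
```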
This project uses a Python virtual environment to manage dependencies without interfering with those installed on the machine.
`pip3` and `python3` are the only requirements.
To set up a virtual environment on the machine, install `virtualenv` from `pip3` and install the project dependencies (from the `requirements.txt` file).
These steps are scripted in the `Makefile` (tested only on Ubuntu) and can be performed with the command `make setup`.
The variable `VIRT_ENV_FOLDER` can be changed in the `Makefile` to change where the Python dependencies are installed.
... then you can use the project by running one of the following actions:

- train a model: `make train`
- find and export frequent entities (these entities are caught across all documents during the training set creation): `make export_frequent_entities`
- view spaCy results on a local web page (http://localhost:5000): `make show_spacy_entities`
- view skill cartridges results on a local web page (http://localhost:5000): `make show_rule_based_entities`
- view differences between spaCy and skill cartridges (only for shared entity types): `make list_differences`
- run unit tests: `make test`
Most of the project configuration is done in the `resources/config.ini` file.
To run tests from PyCharm, you need to create a Pytest run configuration.
By default, the working directory is (implicitly) the test folder; it has to be explicitly set to the project root folder.
This project is licensed under Apache 2.0 License (found in the LICENSE file in the root directory).