Capture the domain knowledge embedded in an existing rule-based system for legal case pseudo-anonymization and enhance it through machine learning algorithms.
Build a Named Entity Recognition (NER) training dataset and learn a model dedicated to French legal case anonymization by leveraging the existing rule-based system and adding noise and synthetic data.
The project goes beyond the scope covered by the rule-based system, which was limited to addresses and natural person names.
This model can be used in a pseudo-anonymization system.
The input format is the one produced by the rule-based system skill cartridges.
Measures computed over manually annotated data show strong performance, in particular on the names of natural persons and legal professionals.
The only French legal cases massively acquired by Lefebvre Sarrut that are not pseudo-anonymized are those from appeal courts (the Jurica database).
The input data are XML files from Jurica, as generated by skill cartridges, covering the period 2008-2019.
The project is focused on finding mentions of entities and guessing their types.
It doesn't manage the pseudo-anonymization step itself, i.e. replacing the entities found in the previous step by another representation.
Many SOTA algorithms are available as open source projects.
Therefore, developing a NER algorithm is not in the scope of this project.
The main focus of this work is to generate a large, high-quality training set able to leverage all the knowledge encoded in the existing rule-based system.
Learning on a dataset made only from rules may build a weak model that merely repeats those rules.
Therefore, we have included as many tricks as needed to catch or create more complex patterns.
With them, we have been able to produce a robust model, able to find many more entities than the initial rules!
The strategies used are listed below:

- Skill cartridges
  - leveraging the extractions performed by skill cartridges, which embed the many customizations and the domain knowledge of Lefebvre Sarrut teams
- Rules
  - using some easy-to-describe patterns to catch entities (with regex)
  - finding some other entities using dictionaries (e.g. city names)
- Name extension
  - extending any discovered entity to the neighboring words when it makes sense
  - done carefully, otherwise there is a risk of lowering the quality of the training set
- Finding all occurrences of caught entities
  - looking for all occurrences of each entity already found in a document (2-pass process)
  - building dictionaries of frequent names over all documents and looking for them in each document (2-pass process)
- Dataset augmentation
  - creating variations of the discovered entities and searching for them
    - by removing the first or last name, changing the case of one or more words in the entity, removing keywords (M., Mme, la société, ...), etc.
    - transformations are randomly applied (20% of entities are transformed)
    - makes the model more robust to errors in the text
    - these variations cannot easily be discovered with patterns
    - e.g. changing the case is an easy way to work around writing patterns to catch entities written in lower case
- Miscellaneous tricks
  - removing from the training set all paragraphs containing no entity
    - paragraphs with no entity may be due to too-simplistic patterns
  - applying priority rules over the source of the entity offsets when there is a conflict of type
    - some candidate generators are safer than others
    - a `_1` is added to the end of the tag label when it is safe, and it is removed during the offset normalization step
  - looking for doubtful MWE (multi-word expression) candidates and declaring them as doubtful
    - doubtful MWE candidates are any sequence of words starting with an upper-case letter
    - a filter is then applied to keep only those containing a first name (based on a dictionary)
    - no loss is computed on these entities, meaning they don't influence the model during training
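The "find all occurrences" pass can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name and the example sentence are invented.

```python
import re

def find_all_occurrences(text, entities):
    """Second pass: given entity strings already caught in a document
    (mapping surface form -> entity type), locate every occurrence of
    each string, not only the one the rules originally matched."""
    offsets = []
    for surface, entity_type in entities.items():
        # re.escape() guards regex metacharacters, \b keeps whole-word matches
        for match in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            offsets.append((match.start(), match.end(), entity_type))
    return sorted(offsets)

text = "Mme Durand a saisi la cour. La demande de Durand est rejetée."
print(find_all_occurrences(text, {"Durand": "PERS"}))
# -> [(4, 10, 'PERS'), (42, 48, 'PERS')]
```

The second occurrence of "Durand", which the rules alone might have missed, is labeled for free.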
The purpose of ML is to smooth the rules and the other tricks, making the whole system much more robust to hard-to-catch entities. Data augmentation in particular has proved to be very efficient.
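As an illustration, the augmentation step could look like the following sketch. The keyword list, function names, and branch logic are invented for this example; only the transformation types and the 20% rate come from the description above.

```python
import random

# Illustrative keyword list; the project's actual list is longer
KEYWORDS = ("M. ", "Mme ", "la société ")

def augment(entity, rng):
    """Return one randomly transformed variant of an entity surface form,
    mimicking the augmentations described above (sketch, not project code)."""
    choice = rng.choice(["lowercase_word", "drop_keyword", "drop_token"])
    tokens = entity.split()
    if choice == "lowercase_word" and tokens:
        i = rng.randrange(len(tokens))
        tokens[i] = tokens[i].lower()
        return " ".join(tokens)
    if choice == "drop_keyword":
        for keyword in KEYWORDS:
            if entity.startswith(keyword):
                return entity[len(keyword):]
    if choice == "drop_token" and len(tokens) > 1:
        # drop the first or the last token (e.g. first name or last name)
        return " ".join(tokens[1:] if rng.random() < 0.5 else tokens[:-1])
    return entity

def maybe_augment(entity, rng, rate=0.2):
    # as in the project, only ~20% of entities are transformed
    return augment(entity, rng) if rng.random() < rate else entity
```

Each variant is then searched for in the text like any other entity, so the model sees noisy forms it would never get from the rules alone.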
Our rule-based system only managed the `PERS`, `ADDRESS` and `RG` types.

- Persons:
  - `PERS`: natural persons (first names included, unlike skill cartridges), source: skill cartridges + name extension + other occurrences
  - `ORGANIZATION`: organizations, source: skill cartridges + rules + extension + other occurrences
  - `PHONE_NUMBER`: phone numbers, source: rules
  - `LICENCE_PLATE`: licence plate numbers, source: rules
- Lawyers:
  - `LAWYER`: lawyers, source: rules + other occurrences
  - `BAR`: bar where lawyers are registered (not done by skill cartridges), source: rules + other occurrences
- Courts:
  - `COURT`: names of French courts, source: rules + other occurrences
  - `JUDGE_CLERK`: judges and court clerks, source: rules + other occurrences
- Miscellaneous:
  - `ADDRESS`: addresses (badly done by skill cartridges), source: rules + other occurrences + dictionary
    - there is no way to always guess whether the address owner is a `PERS` or an `ORGANIZATION`, so this aspect is not managed
  - `DATE`: any date, in figures or in letters, source: rules + other occurrences
  - `RG`: ID of the legal case, source: skill cartridges + rules
  - `UNKNOWN`: only for the training set; indicates that no loss should be applied to the word, whatever the prediction is, source: rules + dictionary
Dataset augmentation and the miscellaneous tricks have been applied to each type.
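For illustration, here is how such annotations map onto spaCy-style training examples. The sentence, character offsets, and variable name are invented for this sketch; only the labels come from the list above.

```python
# Illustrative spaCy-style training examples using the project's labels;
# the sentence and character offsets are invented for this sketch.
TRAIN_DATA = [
    (
        "Mme Jeanne Durand, représentée par Me Paul Martin, avocat au barreau de Lyon",
        {"entities": [(4, 17, "PERS"), (38, 49, "LAWYER"), (61, 76, "BAR")]},
    ),
]

text, annotations = TRAIN_DATA[0]
for start, end, label in annotations["entities"]:
    print(label, "->", text[start:end])
# prints:
# PERS -> Jeanne Durand
# LAWYER -> Paul Martin
# BAR -> barreau de Lyon
```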
The main NER model is from the spaCy library and is best described in this video.
Basically, it is a CNN + hashing trick / Bloom filter, with an L2S (learning to search) approach on top.
The L2S part is very similar to a classical dependency parser algorithm (stack + actions).
Advantages of the spaCy approach:

- no manual feature extraction (done by spaCy: suffix and prefix, 3 letters each, and the word shape)
- quite fast on CPU (eases deployment)
- low memory footprint (eases deployment)
- off-the-shelf algorithm (documented, maintained, large community, etc.)
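The hashing trick / Bloom filter idea can be illustrated with a toy version: instead of one embedding row per vocabulary word, each word is hashed a few times into a small fixed table, and its vector is the sum of those rows. Memory stays constant however large the vocabulary grows, and unseen words still get a representation. All sizes and details below are illustrative, not spaCy's real implementation.

```python
import random

# Toy Bloom-style embedding table; sizes are illustrative
ROWS, DIM, K = 1000, 4, 2
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(ROWS)]

def embed(word):
    """Sum the K table rows the word hashes into."""
    vec = [0.0] * DIM
    for seed in range(K):
        row = hash((seed, word)) % ROWS  # K quasi-independent buckets
        vec = [v + t for v, t in zip(vec, table[row])]
    return vec
```

Two words collide only if all K hashes agree, which keeps accidental sharing rare while bounding memory.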
The project is fully written in Python and can't be rewritten in another language, because spaCy only exists for Python.
No language-related resources are used.
A few open data dictionaries are used:

- a dictionary of French first names (open data)
- a dictionary of the postal codes and cities of France (open data)

Both resources are stored in the Git repository (`resources/` folder).
Neither is strategic to the success of the learning, but they provide a little help.
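As an illustration of how the first-name dictionary supports the doubtful-MWE filter described earlier, consider the following simplified sketch; the names, regex, and function are invented for this example.

```python
import re

# Tiny stand-in for the open data dictionary of French first names
FIRST_NAMES = {"Jeanne", "Paul", "Marie"}

def doubtful_candidates(text):
    """Flag doubtful MWE candidates: any sequence of capitalized words,
    kept only when one token is a known first name (simplified sketch)."""
    pattern = r"(?:[A-ZÀ-Ý][\w'-]*)(?:\s+[A-ZÀ-Ý][\w'-]*)+"
    candidates = re.findall(pattern, text)
    return [c for c in candidates if any(tok in FIRST_NAMES for tok in c.split())]

print(doubtful_candidates("Le dossier de Paul Bernard est clos"))  # -> ['Paul Bernard']
print(doubtful_candidates("La Cour Suprême a statué"))             # -> []
```

Candidates that pass the filter are excluded from the loss, so they neither confirm nor contradict the model during training.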
The paths listed below can be modified in the config file `resources/config.ini`.

- Cases have to be provided as XML in the format used by skill cartridges (an example is provided in the `resources` folder).
- One XML file represents one week of legal cases.
- XML files should be put in the folder `resources/training_data/`.
- The case used for inference has to be placed in `resources/dev_data/`.
- The folder `resources/test/` contains an XML file used for unit tests.
- Resources are to be put in the folders `resources/courts`, `resources/postal_codes` and `resources/first_names`.
- The folder `resources/model/` will contain the spaCy model.
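Reading such a config file with Python's `configparser` could look like the sketch below. The section and key names shown are assumptions, not necessarily those the project actually uses.

```python
import configparser

# Hypothetical content for resources/config.ini; section and key names
# are assumptions, not necessarily the project's actual ones.
SAMPLE = """
[resources]
training_data = resources/training_data/
dev_data = resources/dev_data/
model = resources/model/
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)  # the project would call config.read("resources/config.ini")
print(config["resources"]["training_data"])  # -> resources/training_data/
```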
This project uses a Python virtual environment to manage dependencies without interfering with those installed on the machine.
`pip3` and `python3` are the only requirements.
To set up a virtual environment on the machine, install `virtualenv` from `pip3` and install the project dependencies (from the `requirements.txt` file).
These steps are scripted in the `Makefile` (tested only on Ubuntu) and can be performed with the command `make setup`.
The variable `VIRT_ENV_FOLDER` can be changed in the `Makefile` to change where the Python dependencies are installed.
... then you can use the project by running one of the following actions:

- train a model: `make train`
- find and export frequent entities (these entities are caught across all documents during the training set creation): `make export_frequent_entities`
- view spaCy results on a local web page (http://localhost:5000): `make show_spacy_entities`
- view skill cartridges results on a local web page (http://localhost:5000): `make show_rule_based_entities`
- view differences between spaCy and skill cartridges (only for shared entity types): `make list_differences`
- run unit tests: `make test`
Most of the project configuration is done in the `resources/config.ini` file.
To run tests from PyCharm, you need to create a Pytest run configuration.
By default, the working directory is (implicitly) the test folder; it has to be explicitly set to the project root folder.
This project is licensed under Apache 2.0 License (found in the LICENSE file in the root directory).