coastalcph / danish_legal_lms

Danish Legal Language Models

Available Language Models for Danish

| Model Name | Layers / Units / Heads | Vocab. | Parameters | Legal |
|---|---|---|---|---|
| Maltehb/danish-bert-botxo | 12 / 768 / 12 | 32K | 110M | No |
| xlm-roberta-base | 12 / 768 / 12 | 256K | 278M | No |
| coastalcph/danish-legal-lm-base | 12 / 768 / 12 | 32K | 110M | Yes |
| coastalcph/danish-legal-bert-base | 12 / 768 / 12 | 32K | 110M | Yes |
| coastalcph/danish-legal-longformer-base | 12 / 768 / 12 | 32K | 134M | Yes |
| coastalcph/danish-legal-xlm-base | 12 / 768 / 12 | 32K | 110M | Yes |
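
The parameter counts above follow largely from the layer count, hidden size, and vocabulary size. The back-of-the-envelope sketch below assumes a standard BERT-style encoder (feed-forward width of four times the hidden size, a pooler layer, no separate LM head). The exact vocabulary sizes (32,000 and 250,002), the position counts, and the helper name `estimate_encoder_params` are assumptions for illustration, not values taken from this repository.

```python
def estimate_encoder_params(vocab_size, hidden=768, layers=12,
                            max_positions=512, type_vocab=2):
    """Rough parameter count (weights + biases) for a BERT-style encoder."""
    ffn = 4 * hidden          # assumed feed-forward width
    layer_norm = 2 * hidden   # scale + bias
    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, followed by one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden), followed by one LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


# Assumed exact sizes: 32,000 tokens for the 32K models, 250,002 for XLM-R.
small_vocab = estimate_encoder_params(vocab_size=32_000)
large_vocab = estimate_encoder_params(vocab_size=250_002,
                                      max_positions=514, type_vocab=1)
print(f"{small_vocab / 1e6:.1f}M")  # 110.6M
print(f"{large_vocab / 1e6:.1f}M")  # 278.0M
```

Under these assumptions, almost all of the gap between the 110M and 278M models comes from the token-embedding matrix, since the twelve Transformer layers are the same size in both.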

Danish Legal Pile

The legal language models are pre-trained on a combination of the Danish part of the MultiEURLEX dataset (Chalkidis et al., 2021), comprising 65k EU laws, and two subsets (retsinformationdk, retspraksis) of the Danish Gigaword Corpus (Derczynski et al., 2021), comprising legal proceedings. The models achieve the following results on the evaluation set:

| Model Name | Accuracy (%) | Loss |
|---|---|---|
| Maltehb/danish-bert-botxo | 22.3 | 7.038 |
| coastalcph/danish-legal-lm-base | 84.8 | 0.651 |
| coastalcph/danish-legal-bert-base | 80.1 | 0.878 |
| coastalcph/danish-legal-bert-base | 82.5 | 0.768 |
| coastalcph/danish-legal-xlm-base | 83.1 | 0.727 |
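
Loss values are easier to compare when converted to perplexity. The sketch below assumes the loss values above (0.651 to 7.038) are mean masked-token cross-entropies measured in nats, which is how masked language modelling loss is commonly reported. The README does not state this, so treat it as an assumption. The helper name `perplexity` is illustrative.

```python
import math


def perplexity(cross_entropy_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)


# Loss values copied from the table above.
losses = {
    "Maltehb/danish-bert-botxo": 7.038,
    "coastalcph/danish-legal-lm-base": 0.651,
}
for name, loss in losses.items():
    print(f"{name}: loss={loss:.3f}, perplexity={perplexity(loss):.2f}")
```

Read this way, a loss of 0.651 corresponds to a perplexity just under 2, meaning the model is on average about as uncertain as a choice between two equally likely tokens at each masked position.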

Benchmarking

| Model Name | EURLEX Val. | EURLEX Test |
|---|---|---|
| Maltehb/danish-bert-botxo | 73.7 / 42.8 | 67.6 / 38.2 |
| coastalcph/danish-legal-lm-base | 75.1 / 46.5 | 69.1 / 41.9 |
| coastalcph/danish-legal-bert-base | 75.0 / 50.4 | 68.9 / 44.3 |
| coastalcph/danish-legal-xlm-base | TBA | TBA |
| coastalcph/danish-legal-longformer-base | 75.7 / 52.9 | 69.6 / 47.0 |
| coastalcph/danish-legal-longformer-base + SD Penalty (Pezeshki et al., 2020) | 76.1 / 52.9 | 69.9 / 47.0 |

The two best models (coastalcph/danish-legal-longformer-base, coastalcph/danish-legal-longformer-base-sd) are available on the HuggingFace Hub, with instructions on how they can be used as text classifiers or feature extractors.
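
As a rough illustration of the feature-extractor use case, the sketch below mean-pools the final hidden states into one vector per text. It assumes the checkpoint name given above loads through the generic `AutoTokenizer` and `AutoModel` classes of the `transformers` library and that `torch` is installed. The function name `extract_features` and the pooling choice are illustrative, not the instructions published on the Hub.

```python
def extract_features(texts, model_name="coastalcph/danish-legal-longformer-base"):
    """Return one mean-pooled embedding per input text (assumed usage)."""
    # Imported lazily so that defining the function does not require
    # the heavy dependencies to be installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, hidden)

    # Average over real tokens only, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `extract_features(["Denne lov træder i kraft den 1. januar."])` downloads the checkpoint on first use and should return a tensor with one row per input text. For classification, the same checkpoint name could be passed to `AutoModelForSequenceClassification` instead, subject to the same assumptions.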

Code Base

Train new RoBERTa LM

sh train_mlm_gpu.sh

Modify pre-trained XLM-R

export PYTHONPATH=.
python src/mod_teacher_model.py --teacher_model_path coastalcph/danish-legal-lm-base --student_model_path coastalcph/danish-legal-lm-base

Longformerize pre-trained RoBERTa LM

export PYTHONPATH=.
python src/longformerize_model.py --roberta_model_path coastalcph/danish-legal-lm-base --max_length 2048 --attention_window 128
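
The conversion script itself is not reproduced here. One common way to stretch a RoBERTa checkpoint to longer inputs, described by the Longformer authors, is to initialise the longer position-embedding table by repeating the pre-trained one. Whether `src/longformerize_model.py` does exactly this is an assumption. The helper name `extend_position_embeddings`, the two reserved rows (as in RoBERTa checkpoints), and the toy table are illustrative.

```python
def extend_position_embeddings(table, new_max_length, offset=2):
    """Tile the learned position rows until `new_max_length` positions exist.

    `table` is a list of embedding rows; the first `offset` rows are reserved
    (RoBERTa-style) and are copied once without being repeated.
    """
    learned = table[offset:]
    if not learned:
        raise ValueError("table has no learned position rows to copy")
    extended = [list(row) for row in table[:offset]]
    for position in range(new_max_length):
        # Reuse the pre-trained rows cyclically.
        extended.append(list(learned[position % len(learned)]))
    return extended


# Toy table: 2 reserved rows + 3 learned positions, 1-dimensional embeddings.
toy_table = [[0.0], [1.0], [10.0], [11.0], [12.0]]
longer = extend_position_embeddings(toy_table, new_max_length=7)
print(len(longer))  # 9 rows: 2 reserved + 7 positions
print(longer[2:])   # [[10.0], [11.0], [12.0], [10.0], [11.0], [12.0], [10.0]]
```

Under the same assumptions, a 514-row RoBERTa table extended with `--max_length 2048` would yield 2050 rows, containing four copies of the 512 learned positions.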
