There are many achievements in Natural Language Processing for English and even other languages, but for Persian, as you can see, there are not many. The lack of data resources, the non-disclosure of sources by research groups, and a decentralized research community are perhaps the main reasons for Persian's current state.
Of course, it should be noted that some research groups do care about this matter and share their results, thoughts, and resources with others, but we still need more.
NLI (also known as recognizing textual entailment) resources for Persian are vital for every semantic, extraction, and inference system. I was very excited when FarsTail, the first NLI dataset for Persian, was released. I used this dataset to train a Sentence-Transformer model (using ParsBERT) as a basis for other applications like Semantic Search, Clustering, Information Extraction, Summarization, and Topic Modeling. Although the model achieved remarkable results on recognizing entailment (81.71%, compared to the 78.13% reported in the FarsTail paper), it is still not adequate for NLI applications.
I dug into the official Sentence-Transformer paper (Reimers and Gurevych, 2019) and found that it uses the Wikipedia-Triplet-Sections dataset, introduced by Dor et al., 2018, to train SBERT on the recognizing-entailment task. Dor et al., 2018 presume that sentences in the same section are thematically closer than sentences in different sections. They draw the anchor and the positive example from the same section, while the negative example comes from a separate section of the same article. They designed the following steps to generate this sentence-triplet dataset (for each rule, I specify whether I keep the original or not):
- Exploit Wikipedia partitioning into sections and paragraphs, using OpenNLP for sentence extraction.
- Apply the following rules and filters to reduce noise and to create a high-quality dataset, ‘triplets-sen’:
- The maximal distance between the intra-section sentences is limited to three paragraphs. (Change this rule into two terms, an inner and an outer part: sentences from the outer part must be at least two sections apart, and sentences from the inner part must be at most two paragraphs apart.)
- Sentences with fewer than 5 or more than 50 tokens are filtered out. (Change this range to 10 < number of word tokens < 130.)
- The first and the "Background" sections are removed due to their general nature. (Keep this rule.)
- The following sections are removed: "External links", "Further reading", "References", "See also", "Notes", "Citations", and "Authored books". These sections usually list a set of items rather than discuss a specific subtopic of the article's title. (Add the Persian equivalents as extra filters: محتویات-پانویس-منابع-منابع و پانویس-جستارهای وابسته-پیوند به بیرون-یادداشتها-یادداشت ها-جوایز-نگارخانه-روابطخارجی-روابط خارجی-کتابشناسی-کتاب شناسی-فیلمشناسی-فیلم شناسی-دستاندرکاران-دستاندر کاران-دست اندر کاران-فروشهای برگزیدهٔ آلبوم-فروش های برگزیدهٔ آلبوم-فروش های برگزیده آلبوم-نمودارهای فروش-نمودار های فروش-فهرست آهنگها-فهرست آهنگ ها-اعضا-ترانهشناسی-ترانه شناسی-نگارخانه-بازیگران-پروژههای مشابه-پروژه های مشابه)
- Only articles with at least five remaining sections are considered to focus on articles with rich enough content. (Skip this rule.)
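The mutated length rule can be sketched as follows (a minimal sketch; word tokens are approximated here by whitespace splitting, whereas the actual pipeline would use a proper tokenizer):

```python
def keep_sentence(sentence, min_words=10, max_words=130):
    """Mutated length rule: keep only sentences whose word-token
    count is strictly between min_words and max_words."""
    n_words = len(sentence.split())
    return min_words < n_words < max_words
```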
Reimers and Gurevych, 2019 use the dataset with a Triplet Objective to train the SBERT.
Eq 1: Triplet objective function: max(||s_a - s_p|| - ||s_a - s_n|| + ε, 0), where s_a, s_p, and s_n are the sentence embeddings of the anchor, positive, and negative examples, ||·|| is a distance metric, and ε is the margin; training tries to minimize this loss.
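A minimal NumPy sketch of the triplet objective, using Euclidean distance (the margin default of 1 follows Reimers and Gurevych, 2019):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet objective: push the anchor closer to the positive
    than to the negative by at least `margin` (Euclidean distance)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)
```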
Tips: SBERT adds a pooling operation to the output of BERT / RoBERTa to derive a fixed-sized sentence embedding. They experimented with three pooling strategies:
- Using the output of the CLS-token.
- Computing the mean of all output vectors (Mean-Strategy).
- Computing a max-over-time of the output vectors (Max-Strategy).
In this case, I use Mean-Strategy.
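The Mean-Strategy can be sketched as follows (a minimal NumPy version; the attention-mask handling, which excludes padding tokens from the average, is an assumption about the batching setup):

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average the token embeddings, counting only non-padding tokens.
    token_embeddings: (seq_len, hidden); attention_mask: (seq_len,) of 0/1."""
    mask = attention_mask[:, None].astype(float)      # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)    # sum over real tokens only
    counts = np.clip(mask.sum(), 1e-9, None)          # avoid division by zero
    return summed / counts
```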
In the following parts, I will show you how to apply these rules step by step. Before going any further, I noticed that some Wikipedia articles are entirely in English or other languages than Persian, like "اف_شارپ" (F#), "سی_شارپ" (C#), and some others, which must be removed. So, I added a bunch of preprocessing steps on top of the above rules.
The preprocessing steps are as follows:
- Remove or filter some special characters frequently used by Wikipedia editors or Persian users (_, «, [[, [ [, separated domains, ه ی, هٔ, أ).
- Remove user tag, hashtag, and underscore but keep the text.
- Remove emojis in every mode.
- Preprocess and normalize text at the low level using the clean-text and hazm packages.
- Fix Unicode.
- Filter emails, URLs, numbers, phone numbers, digits, currency symbols, and punctuation.
- Make text lower case.
- Clean HTML tags.
- Normalize text into Persian characters.
- Remove weird Unicode.
- Remove redundant spaces (keep the newlines).
- Remove articles that consist mostly of non-Persian characters (using a character-ratio threshold of 0.7).
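The last filter can be sketched with a character-ratio check. This is a minimal sketch: the Unicode ranges and the interpretation of the 0.7 threshold as the minimum share of Persian letters among all letters are my assumptions.

```python
import re

# Arabic/Persian Unicode blocks (an approximation; covers Persian letters)
PERSIAN_CHARS = re.compile(r'[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF]')

def is_mostly_persian(text, threshold=0.7):
    """Keep an article only if the share of Persian letters among
    all alphabetic characters is at least `threshold`."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    persian = sum(1 for ch in letters if PERSIAN_CHARS.match(ch))
    return persian / len(letters) >= threshold
```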
A Wikipedia article sample is shown in Fig 1. The red boxes mark content removed according to the Dor et al., 2018 rules.
Fig 1: Wikipedia Article Sample "جان میلینگتون سینگ"
The following figure (Fig 2) shows the article after passing the mutated Dor et al., 2018 rules and preprocessing steps known as the Wikipedia-Section-Paragraphs.
Fig 2: Wikipedia-Section-Paragraphs.
Then, we need to convert the section-paragraphs into section-sentences in order to have a recognizing-entailment dataset. The following steps replace some of the rules defined by Dor et al., 2018.
- Remove these trivial sections (محتویات-پانویس-منابع-منابع و پانویس-جستارهای وابسته-پیوند به بیرون-یادداشتها-یادداشت ها-جوایز-نگارخانه-روابطخارجی-روابط خارجی-کتابشناسی-کتاب شناسی-فیلمشناسی-فیلم شناسی-دستاندرکاران-دستاندر کاران-دست اندر کاران-فروشهای برگزیدهٔ آلبوم-فروش های برگزیدهٔ آلبوم-فروش های برگزیده آلبوم-نمودارهای فروش-نمودار های فروش-فهرست آهنگها-فهرست آهنگ ها-اعضا-ترانهشناسی-ترانه شناسی-نگارخانه-بازیگران-پروژههای مشابه-پروژه های مشابه).
- For each remaining section, split it into paragraphs. If the section contains more than two paragraphs, move forward.
- For each paragraph, tokenize the text into sentences. If the paragraph contains more than two sentences, move forward.
- For each sentence in a paragraph, tokenize the text into words. If the word count is greater than 10, pick the sentence as-is; otherwise, merge it with the following sentences until the merged word count is greater than 10.
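The sentence-level step above can be sketched as follows (a minimal sketch; `paragraph_sentences` stands in for the output of a sentence tokenizer such as hazm's, and words are approximated by whitespace splitting):

```python
def extract_sentences(paragraph_sentences, min_words=10):
    """Walk a paragraph's sentences; emit each sentence whose word count
    already exceeds `min_words`, otherwise keep merging the following
    sentences until the merged text crosses that size."""
    results, buffer = [], []
    for sentence in paragraph_sentences:
        buffer.append(sentence)
        merged = ' '.join(buffer)
        if len(merged.split()) > min_words:
            results.append(merged)
            buffer = []
    return results  # a trailing too-short buffer is dropped
```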
The following figure (Fig 3) presents the article after passing the mutated rules known as the Wikipedia-Section-Sentences.
Fig 3: Wikipedia-Section-Sentences.
Then, I compose pairs of sections within an article that are at least two sections apart, respecting their order. Suppose we have an article with four sections; the outcome of this composition is shown as follows:
- Article Sections
sections = ['Section 1', 'Section 2', 'Section 3', 'Section 4']
- Composition Sections
composition = [['Section 1', 'Section 4'], ['Section 1', 'Section 3'], ['Section 2', 'Section 4']]
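A minimal sketch that generates this composition, pairing sections whose index distance is at least two (the pair order may differ from the listing above):

```python
def compose_sections(sections, min_distance=2):
    """Pair each section with every later section that is at least
    `min_distance` positions away, preserving the article order."""
    return [[sections[i], sections[j]]
            for i in range(len(sections))
            for j in range(i + min_distance, len(sections))]
```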
Each section pair shows the order of sentence extraction. For example, the pair ['Section 1', 'Section 4'] specifies that the anchor and positive examples must be chosen from Section 1 and the negative example from Section 4. Also, note that the anchor and positive examples selected from Section 1 should come from paragraphs at most two apart within that section, as shown in Fig 4.
Figure 4: Wikipedia-Triplet-Sentences.
Examples
Since this mutated method captures thematic similarity, we can use a similar procedure to extract the D/Similar dataset, shown in Fig 5.
Figure 5: Wikipedia-D/Similar.
Examples
Sentence1 | Sentence2 | Label |
---|---|---|
در جریان انقلاب آلمان در سال های ۱۹۱۸ و ۱۹۱۹ او به برپایی تشکیلات فرایکورپس که سازمانی شبه نظامی برای سرکوب تحرکات انقلابی کمونیستی در اروپای مرکزی بود ، کمک کرد . | کاناریس بعد از جنگ در ارتش باقی ماند ، اول به عنوان عضو فرایکورپس و سپس در نیروی دریایی رایش.در ۱۹۳۱ به درجه سروانی رسیده بود . | similar |
در جریان انقلاب آلمان در سال های ۱۹۱۸ و ۱۹۱۹ او به برپایی تشکیلات فرایکورپس که سازمانی شبه نظامی برای سرکوب تحرکات انقلابی کمونیستی در اروپای مرکزی بود ، کمک کرد . | پسر سرهنگ وسل فرییتاگ لورینگوون به نام نیکی در مورد ارتباط کاناریس با بهم خوردن توطئه هیتلر برای اجرای آدمربایی و ترور پاپ پیوس دوازدهم در ایتالیا در ۱۹۷۲ در مونیخ شهادت داده است . | dissimilar |
شهر شیراز در بین سال های ۱۳۴۷ تا ۱۳۵۷ محل برگزاری جشن هنر شیراز بود . | جشنواره ای از هنر نمایشی و موسیقی بود که از سال ۱۳۴۶ تا ۱۳۵۶ در پایان تابستان هر سال در شهر شیراز و تخت جمشید برگزار می شد . | similar |
شهر شیراز در بین سال های ۱۳۴۷ تا ۱۳۵۷ محل برگزاری جشن هنر شیراز بود . | ورزشگاه پارس با ظرفیت ۵۰ هزار تن که در جنوب شیراز واقع شده است . | dissimilar |
- It is crucial to mention that the whole process was done on only 21,515 articles due to the lack of computational resources. I believe the model could achieve excellent results if it were trained on all Wikipedia articles.
- What do you think? Let me know in the repository issues.
Version 1.0.0
Version | Examples | Titles | Sections |
---|---|---|---|
1.0.0 | 205,768 | 21,515 | 34,298 |
Version | Train | Dev | Test |
---|---|---|---|
1.0.0 | 180,585 | 5,586 | 5,758 |
Version | Train | Dev | Test |
---|---|---|---|
1.0.0 | 126,628 | 5,277 | 5,497 |
The following table summarizes the scores obtained by each dataset and model.
Model | Dataset | Metrics (%) |
---|---|---|
parsbert-base-wikinli-mean-tokens | wiki-d-similar | Accuracy: 76.20 |
parsbert-base-wikinli | wiki-d-similar | F1: 77.84, Accuracy: 77.84 |
parsbert-base-wikitriplet-mean-tokens | wikitriplet | Accuracy Cosine: 93.33, Accuracy Manhattan: 94.40, Accuracy Euclidean: 93.31 |
parsbert-base-uncased-farstail | farstail | F1: 81.65, Accuracy: 81.71 |
bert-fa-base-uncased-farstail-mean-tokens | farstail | Accuracy: 56.45 |
Application | Notebook |
---|---|
Semantic Search | |
Clustering | |
Text Summarization | |
Information Retrieval | |
Topic Modeling |
2.0.0: New Version 🆕 !
- m3hrdadfi/bert-zwnj-wnli-mean-tokens
- m3hrdadfi/distilbert-zwnj-wnli-mean-tokens
- m3hrdadfi/roberta-zwnj-wnli-mean-tokens
- m3hrdadfi/albert-zwnj-wnli-mean-tokens
1.0.0: Hello World!
- m3hrdadfi/bert-fa-base-uncased-wikinli-mean-tokens
- m3hrdadfi/bert-fa-base-uncased-wikinli
- m3hrdadfi/bert-fa-base-uncased-wikitriplet-mean-tokens
- m3hrdadfi/bert-fa-base-uncased-farstail
- m3hrdadfi/bert-fa-base-uncased-farstail-mean-tokens
Please cite this repository in publications as follows:
@misc{PersianSentenceTransformers,
author = {Mehrdad Farahani},
title = {Persian - Sentence Transformers},
month = dec,
year = 2020,
publisher = {Zenodo},
version = {v1.0.0},
doi = {10.5281/zenodo.4850057},
url = {https://doi.org/10.5281/zenodo.4850057}
}