Lavine24/ArbEngVec

ArbEngVec is an open-source project that provides several Arabic-English cross-lingual word embedding models. To train the bilingual models, we use a large dataset of more than 93 million Arabic-English parallel sentence pairs, mainly extracted from the Open Parallel Corpus Project (OPUS) (Tiedemann, 2012).

These scripts were used to train the ArbEngVec variants. The Arabic preprocessing code is provided alongside the different alignment methods.

To change the alignment method before training, change the alignment function that is called when sentences are appended to the list of training documents.
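The idea of swapping the alignment function can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names (`align_parallel`, `align_interleaved`, `align_shuffled`, `build_documents`) and the three merging strategies shown are assumptions made for the example, and whitespace tokenization stands in for the real Arabic preprocessing.

```python
import random
from itertools import zip_longest


def align_parallel(ar_tokens, en_tokens):
    """Hypothetical strategy: Arabic sentence followed by its English translation."""
    return list(ar_tokens) + list(en_tokens)


def align_interleaved(ar_tokens, en_tokens):
    """Hypothetical strategy: alternate tokens; leftovers of the longer sentence follow."""
    merged = []
    for ar, en in zip_longest(ar_tokens, en_tokens):
        if ar is not None:
            merged.append(ar)
        if en is not None:
            merged.append(en)
    return merged


def align_shuffled(ar_tokens, en_tokens, seed=0):
    """Hypothetical strategy: mix the tokens of both sentences in random order."""
    merged = list(ar_tokens) + list(en_tokens)
    random.Random(seed).shuffle(merged)
    return merged


def build_documents(pairs, align=align_parallel):
    """Build the list of training documents from (Arabic, English) sentence pairs.

    The `align` argument is the single place where the alignment method changes.
    """
    documents = []
    for ar_sentence, en_sentence in pairs:
        # Whitespace split is a placeholder for the real preprocessing step.
        documents.append(align(ar_sentence.split(), en_sentence.split()))
    return documents


pairs = [("قطة صغيرة", "a small cat")]
documents = build_documents(pairs, align=align_interleaved)
```

Under these assumptions, the resulting `documents` list (one merged token list per sentence pair) is what would be handed to the embedding trainer; choosing a different method means passing a different function as `align`.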

For further details, see the full paper: https://hal.archives-ouvertes.fr/hal-02150003/file/Lachraf-el-al-WANLP.pdf

If you use these scripts in your research, please cite:

@inproceedings{lachraf:hal-02150003,
  TITLE = {{ArbEngVec : Arabic-English Cross-Lingual Word Embedding Model}},
  AUTHOR = {Lachraf, Raki and Nagoudi, El Moatez Billah and Ayachi, Youcef and Abdelali, Ahmed and Schwab, Didier},
  URL = {https://hal.archives-ouvertes.fr/hal-02150003},
  BOOKTITLE = {{The Fourth Arabic Natural Language Processing Workshop, co-located with ACL}},
  ADDRESS = {Florence, Italy},
  YEAR = {2019},
  MONTH = Jul,
  PDF = {https://hal.archives-ouvertes.fr/hal-02150003/file/Lachraf-el-al-WANLP.pdf},
  HAL_ID = {hal-02150003},
  HAL_VERSION = {v1},
}
