Hebrew NLP Resources

This repository collects resources for NLP in Hebrew, as part of the NLPH project, which you can read more about here. Resources are divided into folders by type. If you have a resource you can contribute, to be released under an open license, please submit a pull request, or contact us at contact@nlph.org.il. See here for a list of companies operating in the field.

This specific document is meant to be a list of Hebrew NLP resources, both for general use and to be used as reference when discussing what existing tools can be opened, adapted or integrated to help create a good open source foundation for NLP in Hebrew, as part of the NLPH Project.

When contributing to the list, please add a link to the license for all non-paper resources, e.g. {AGPL-3.0}, {?} for an unknown license or {X} for unreleased/closed/copyrighted resources. For code resources, please also add the main language in which the tool is written, e.g. [Python] or [?] for an unknown programming language. Please add hosting mirrors with pointy brackets, e.g. <Zenodo mirror>.
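The marker conventions above lend themselves to simple tooling, for example when checking that new entries carry a license. The following is a minimal sketch only: the helper name `parse_entry_markers` and the returned dictionary layout are illustrative assumptions, not part of this repository. Note that square brackets are also used elsewhere in this list for link labels, so the first bracketed group is not always a programming language.

```python
import re

# Hypothetical helper: the patterns assume each marker sits on a single line
# and contains no nested brackets, as in the examples above.
LICENSE_RE = re.compile(r"\{([^{}]*)\}")
LANGUAGE_RE = re.compile(r"\[([^\[\]]*)\]")
MIRROR_RE = re.compile(r"<([^<>]*)>")


def parse_entry_markers(line):
    """Extract the license, programming languages and mirrors of one entry."""
    license_match = LICENSE_RE.search(line)
    language_match = LANGUAGE_RE.search(line)
    languages = []
    if language_match:
        languages = [part.strip() for part in language_match.group(1).split(",")]
    return {
        "license": license_match.group(1) if license_match else None,
        "languages": languages,
        "mirrors": MIRROR_RE.findall(line),
    }


# A made-up entry that follows the conventions described above.
entry = "Some Tool [Python, Java] {AGPL-3.0} - An example entry. <Zenodo mirror>"
print(parse_entry_markers(entry))
```

Entries whose "license" comes back as None, or as "?", are natural candidates for follow-up.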

Contents

1 Corpora
- 1.1 Structured Corpora
- 1.2 Corpora Sources
2 Linguistic Resources
- 2.1 Lexicons
- 2.2 Dictionaries & Word Lists
- 2.3 Treebanks
- 2.4 Embeddings
- 2.5 Other
3 Code
- 3.1 Tokenization
- 3.2 Morphological and Syntactic Analysis
- 3.3 Tagging Tools
- 3.4 Models
- 3.5 Other
- 3.6 Commercial services
4 Labs & Researchers
- 4.1 Academia
- 4.2 Non-Profit
- 4.3 Industry
5 Papers
6 Courses, presentations and meetups

1 Corpora

1.1 Structured Corpora

The MILA corpora collection {GPLv3} - The MILA center has 20 different corpora available for free for non-commercial use. All are available in plain text format, and most have tokenized, morphologically-analyzed, and morphologically-disambiguated versions available too.
Hebrew Wikipedia dumps {CC-BY-SA 3.0} - Wikipedia, the free encyclopedia, publishes dumps of its content as XML files on a monthly basis.
שתי שקל {?} - Wikiproject for correcting grammar mistakes. (Heuristic) positive annotations can be derived from a query.
Hebrew Speech Databases (HSD) {?} - The HSD contains several Hebrew Speech Databases designed for the development and evaluation of Hebrew Speech Recognition Systems.
CoSIH - The Corpus of Spoken Hebrew {?} - The Corpus of Spoken Israeli Hebrew (CoSIH) is a database of recordings of spoken Israeli Hebrew.
hebrew corpus {?} - HebrewCorpus is a new corpus with 150 million words from NMELRC.
The Haifa Corpus of Spoken Hebrew {X} - A corpus of recorded spoken Hebrew and transcripts. Protected under rights reserved to Prof. Yael Maschler.
Eran Tomer's Digital Vocalized Text Corpus {Apache License 2.0} - A corpus of digital vocalized Hebrew texts created by Eran Tomer as part of his Master's thesis. The corpus is found in the resources folder.
The SVLM Hebrew Wikipedia Corpus {CC-BY-SA 3.0} - A corpus of 50K sentences from Hebrew Wikipedia chosen to ensure phoneme coverage for the purpose of a sentence recording project.
Knesset 2004-2005 {Public Domain} - A corpus of transcriptions of Knesset (Israeli parliament) meetings between January 2004 and November 2005. Includes tokenized and morphologically tagged versions of most of the documents in the corpus. <MILA> <Zenodo>
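Because the Wikipedia dumps listed above are large XML files, they are usually read as a stream rather than loaded whole. The following is a minimal sketch, assuming the standard MediaWiki export layout (`page` elements containing a `title` and a `revision` with a `text` child); the function names and the tiny inline sample are illustrative only.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, turning '{http://...}page' into 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_file):
    """Yield (title, text) for each <page> element while streaming the file."""
    title, text = None, None
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the memory held by the finished page


# A tiny stand-in for a real dump, for illustration only.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>עברית</title>
    <revision><text>טקסט לדוגמה</text></revision>
  </page>
</mediawiki>""".encode("utf-8")

for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, len(text))
```

Real dumps are typically distributed compressed; a file object returned by `bz2.open(path, "rb")` can be passed to `iter_pages` in place of the in-memory sample.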

1.2 Corpora Sources

JPress {Custom Terms of Use} - The National Library offers a collection of Jewish newspapers published in various countries, languages, and time periods, including digital versions and full-text search. The texts are published under a custom Terms of Use document that prohibits commercial use, and additionally requires checking the copyright status and receiving permission from the copyright-holder of the work for any use requiring such permission according to the Copyright Law.
DICTA {?} - Analytical tools for Jewish texts. They also have a GitHub organization.
Sefaria {Various} - A Living Library of Jewish Texts. 3,000 years of Jewish texts in Hebrew and English translation.
HaArchion {?} - Recordings of various Hebrew prose and poetry being read.
Project Ben Yehuda public dumps {Public Domain} - A repository containing dumps of thousands of public domain works in Hebrew, from Project Ben-Yehuda, in plaintext UTF-8 files, with and without diacritics (nikkud), and in HTML files.
ThinkIL {CC-BY-SA 3.0} - An archive of the writings of Zvi Yanai.
"Ha'Olam Ha'Ze" Newspaper Archive {?} - An online archive of issues of "Ha'Olam Ha'Ze" ("This World") Israeli newspaper.
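Several of the sources above, such as the Project Ben Yehuda dumps, contain text with diacritics (nikkud). When only vocalized text is at hand, the marks can be removed programmatically. The following is a minimal sketch that drops every combining mark in the Unicode Hebrew block, which also removes cantillation marks; the function name `strip_nikkud` is an illustrative assumption, and the result is not guaranteed to match the spelling used in a source's own unvocalized files.

```python
import unicodedata

# Code points of the Unicode Hebrew block (U+0590 to U+05FF).
HEBREW_BLOCK = range(0x0590, 0x0600)


def strip_nikkud(text):
    """Remove Hebrew combining marks (vowel points and cantillation marks)."""
    return "".join(
        ch for ch in text
        if not (ord(ch) in HEBREW_BLOCK and unicodedata.category(ch) == "Mn")
    )


# The word "shalom" written with nikkud, given as explicit code points.
vocalized = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
print(strip_nikkud(vocalized))
```

Filtering on the `Mn` category rather than on a fixed range keeps punctuation such as the maqaf (U+05BE) in place.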

2 Linguistic Resources

2.1 Lexicons

The BGU morphological lexicon {?} - Is it released?
The morphological lexicon of the Israeli National Institute for Testing and Evaluation - Unreleased.
The MILA lexicon of Hebrew words {GPLv3} - The lexicon was designed mainly for usage by morphological analyzers, but is being constantly extended to facilitate other applications as well. The lexicon contains about 25,000 lexicon items and is extended regularly. Free for non-commercial use.
Hebrew WordNet {GPLv3} - Hebrew WordNet uses the MultiWordNet methodology and is aligned with the one developed at IRST (and therefore is aligned with English, Italian and Spanish). Free for non-commercial use.
MILA's Verb Complements Lexicon {GPLv3} - NLPH backup here.

2.2 Dictionaries & Word Lists

MILA's Hebrew Stopwords List - An Excel XLSX file containing 23,327 Hebrew tokens in descending order of frequency. [NLPH backup]
Uniform {?} - An inflection dictionary. Missing details: Creating organization? Is it released?
Hebrew verb lists {CC-BY 4.0} - Created by Eran Tomer (erantom@gmail.com).
Hebrew name lists {CC-BY 4.0} - Lists of street, company, given and last names. Created by Guy Laybovitz.
1000 most frequent words in Hebrew tweets during (roughly) 2018.
KIMA - the Historical Hebrew Gazetteer - Place Names in the Hebrew Script. An open, attestation based, historical database. Kima currently holds 27,239 Places, with 94,650 alternate variants of their names and 236,744 attestations of these variants.

2.3 Treebanks

The Hebrew Treebank {GPLv3} - The Hebrew Treebank Version 2.0 contains 6500 hand-annotated sentences of news items from the MILA HaAretz Corpus, with full word segmentation and morpho-syntactic analysis. Morphological features that are not directly relevant for syntactic structures, like roots, templates and patterns, are not analyzed. This resource can be used freely for research purposes only.
UD Hebrew Treebank {CC BY-NC-SA 4.0} - The Hebrew Universal Dependencies Treebank.
Modern Hebrew Dependency Treebank v.1 {GPLv3} - This is the Modern Hebrew Dependency Treebank which was created and used in Yoav Goldberg's PhD thesis.

2.4 Embeddings

fastText pre-trained word vectors for Hebrew {CC-BY-SA 3.0} - Trained on Wikipedia using fastText. Comes in both the binary and text default formats of fastText: binary+text, text. In the text format, each line contains a word followed by its embedding; each value is space separated; words are ordered by frequency in descending order.
hebrew-word2vec pre-trained word vectors {Apache License 2.0} - Trained on data from Twitter. Developed by Ron Shemesh in Bar-Ilan University's NLP lab under the instruction of Dr. Yoav Goldberg. Contains vectors for over 1.4M words (as of January 2018). Comes in a zip with two files: a text file with a word list and a NumPy array file (npy file).
LASER Language-Agnostic SEntence Representations {CC BY-NC 4.0} - LASER is a library to calculate and use multilingual sentence embeddings.
NLPL word embeddings - Trained on the Hebrew CoNLL17 corpus using Word2Vec continuous skipgram, with a vector dimension of 100 and a window size of 10. The vocabulary includes 672,384 words.
Hebrew word embeddings by Dr. Oren Glickman {?} - Trained on Twitter. Unreleased. Presented in his lecture in yearly conference of The Israel Statistical Association for 2018 (presentation file).
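The fastText text format described above (one word per line, followed by space-separated values, most frequent words first) can be read without any third-party library. The following is a minimal sketch; the function name `load_text_vectors` and the inline sample are illustrative, and the handling of a leading "vocabulary size, dimension" header line is an assumption about the file rather than something stated in this list.

```python
import io


def load_text_vectors(lines, limit=None):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dictionary."""
    vectors = {}
    for index, line in enumerate(lines):
        parts = line.rstrip("\n").split(" ")
        # Assumption: an optional first line holds "<vocab size> <dimension>";
        # skip it if present.
        if index == 0 and len(parts) == 2:
            continue
        word = parts[0]
        values = [float(value) for value in parts[1:] if value]
        vectors[word] = values
        # Words are ordered by descending frequency, so stopping early keeps
        # only the most frequent ones.
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


# A made-up three-word file with two-dimensional vectors.
SAMPLE = "3 2\nשל 0.1 0.2\nאת 0.3 0.4\nעל 0.5 0.6\n"
vectors = load_text_vectors(io.StringIO(SAMPLE), limit=2)
print(list(vectors))
```

For a real file, an object returned by `open(path, encoding="utf-8")` can be passed instead of the `io.StringIO` sample.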

2.5 Other

Hebrew SimLex-999 - A Hebrew version of the Simlex-999 resource for the evaluation of models that learn the meaning of words and concepts. A copy can also be found in the Attract-Repel repository. Another copy is found in this repository.

3 Code

Also see here: https://github.com/iddoberger/awesome-hebrew-nlp

3.1 Tokenization

Yoav Goldberg's Hebrew Tokenizer
Jonathan Laserson's sentence separator [Python] {?} - Not a tokenizer per se, but an important part in the tokenization of documents. Jonathan is kindly checking the possibility of open sourcing this tool.
The MILA Hebrew Tokenization Tool [?] {GPLv3} - Free for non-commercial use.

3.2 Morphological and Syntactic Analysis

Morphological and Syntactic Analysis of Hebrew Texts by ONLP
yap morpho-syntactic parser [Go] {Apache License 2.0} - Morphological Analysis, disambiguation and dependency Parser. Morphological Analyzer relies on the BGU Lexicon.
Yoav Goldberg's syntactic parsers [Python, Java] {GPLv3} - Two syntactic parsers for Hebrew, one is dependency-based and the other is constituency-based. Distributed under the GPLv3 license, free for academic use only.
- Yoav Goldberg's Hebrew Dependency Parsing [Python, Java] {GPLv3}
- Yoav Goldberg's Hebrew Constituency Parsing [Python, Java] {GPLv3}
The MILA Morphological Analysis Tool [?] {GPLv3} - Takes as input undotted Hebrew text (formatted either as plain text or as tokenized XML following MILA's standards). The Analyzer then returns, for each token, all the possible morphological analyses of the token, reflecting part of speech, transliteration, gender, number, definiteness, and possessive suffix. Free for non-commercial use.
The MILA Morphological Disambiguation Tool [?] {GPLv3} - Takes as input morphologically-analyzed text and uses a Hidden Markov Model (HMM) to assign scores for each analysis, considering contextual information from the rest of the sentence. For a given token, all analyses deemed impossible are given scores of 0; all n analyses deemed possible are given positive scores. Free for non-commercial use.
Hspell [?] {AGPL-3.0} - Free Hebrew linguistic project including spell checker and morphological analyzer.
- HspellPy [Python] {AGPL-3.0} - Python wrapper for hspell.
BGU Tagger: Morphological Tagging of Hebrew [Java] {?} - Morphological Analysis, Disambiguation.
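The scoring scheme described for the MILA Morphological Disambiguation Tool above (a score of 0 for impossible analyses, positive scores for possible ones) suggests a simple way to pick one analysis per token. The following is a minimal sketch over a hypothetical in-memory structure; it does not reflect MILA's actual output format, and the analysis labels and scores are made up for illustration.

```python
def best_analyses(scored_tokens):
    """Pick, for each token, the analysis with the highest positive score.

    `scored_tokens` is a hypothetical structure: a list of
    (token, [(analysis, score), ...]) pairs.
    """
    result = []
    for token, analyses in scored_tokens:
        # Analyses scored 0 are treated as impossible and discarded.
        possible = [(analysis, score) for analysis, score in analyses if score > 0]
        if possible:
            best = max(possible, key=lambda pair: pair[1])[0]
        else:
            best = None  # no analysis was deemed possible
        result.append((token, best))
    return result


# Made-up scores for illustration only.
sample = [
    ("ספר", [("noun", 0.7), ("verb", 0.3), ("proper noun", 0.0)]),
    ("xyz", [("unknown", 0.0)]),
]
print(best_analyses(sample))
```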

3.3 Tagging Tools

LightTag [?] {not open source} - A tool for managing annotation projects. Handles right-to-left and part-of-word marking. Tutorial video here.
Recogito [Scala, JavaScript, HTML] {Apache License 2.0} - A tool for linked data annotation.
CATMA [HTML, Java] {unclear} - A web-based tool for research and collaboration over text data. Handles right-to-left and part-of-word marking.
- See the system itself here: http://portal.catma.de/catma/
- And the code here: https://github.com/mpetris/catma
WebAnno [Java] {Apache License 2.0} - Web-based. Supports RTL and project management.
- Repository: https://github.com/webanno/webanno
Arethusa: Annotation Environment [JavaScript] {MIT} - A backend-independent client-side annotation framework. Repository here.
rasa-nlu-trainer [JavaScript] {MIT} - A tool to edit training examples for rasa NLU. Handles right-to-left and part-of-word marking.
brat [Python, JavaScript] {MIT} - An online environment for collaborative text annotation. Does not support right-to-left. Repository here.
openNLP [Java] {Apache License 2.0} - OpenNLP has a tagging tool.
opeNER [Ruby, HTML, Java, Python] - opeNER has a tagging tool.
pybossa [Python] {AGPL-3.0} - A framework for crowdsourcing of data analysis and enrichment tasks. GitHub.
TextThrasher [JavaScript, Python] - A crowdsourced text annotator. Built with React and Redux (possibly also with pybossa).
SHEBANQ - System for HEBrew Text: ANnotations for Queries and Markup. SHEBANQ is an online environment for studying the Hebrew Bible.

3.4 Models

Neural Sentiment Analyzer for Modern Hebrew [?] {MIT} - This code and dataset provide an established benchmark for neural sentiment analysis for Modern Hebrew.
hebrew-word2vec [C, Python] {Apache License 2.0} - Developed by Ron Shemesh in Bar-Ilan University's NLP lab under the instruction of Dr. Yoav Goldberg. Contains pre-trained vectors and an online demo.
Universal Language Model Fine-tuning for Text Classification (ULMFiT) in Hebrew - The weights (i.e. a trained model) for a Hebrew version of Howard and Ruder's ULMFiT model. Trained on the Hebrew Wikipedia corpus.
BERT's multilingual model - Trained (also) on Hebrew.

3.5 Other

Verb Inflector [Java] {Apache License 2.0} - A generation mechanism, created as part of Eran Tomer's (erantom@gmail.com) Master's thesis, which produces vocalized and morphologically tagged Hebrew verbs given a non-vocalized verb in base form and an indication of which pattern the verb follows.
HebMorph [Lucene] {AGPL-3.0} - An open-source effort to make Hebrew properly searchable by various IR software libraries. Includes Hebrew Analyzer for Lucene.
Hebrew OCR with Nikud [Python] {?} - A program to convert Hebrew text files (without Nikud) to text files with the correct Nikud. Developed by Adi Oz and Vered Shani.
Text-Fabric [Python] {CC BY-NC 4.0} - A Python package for browsing and processing ancient corpora, focused on the Hebrew Bible Database.
Nakdan - Automatic Nikud for Hebrew texts.
The Automatic Hebrew Transcriber - Automatically transcribes text from Hebrew audio and video files.
word2word {Apache License 2.0} - Easy-to-use word-to-word translations for 3,564 language pairs. Hebrew is one of the 62 supported languages, and thus word-to-word translation to/from Hebrew is supported for 61 languages.

3.6 Commercial services

Eyfo - A commercial engine for search and entity tagging in Hebrew.
Melingo's ICA (Intelligent Content Analysis) - A text analysis and textual categorized entity extraction API for Hebrew, Arabic and Farsi texts.
Genius - Automatic analysis of free text in Hebrew.
AlmaReader - Online text-to-speech service for Hebrew.

4 Labs & Researchers

This list is meant to cover both researchers in the field of natural language processing, and in various related fields, including neurolinguistics and speech science. It also aims to cover researchers in both academia and industry.

4.1 Academia

The Open University of Israel:
- The ONLP Lab [Twitter]:
  - Dr. Reut Tsarfaty - Head of the ONLP Lab.
  - Dan Bareket - Research assistant.
- The Open Media and Information Lab (OMILab) at the Open University of Israel - An interdisciplinary center for research and for teaching in new media and related areas, such as big data, information science, network cultures and digital sociology.
  - Dr. Vered Silber-Varod - Director of the Open Media and Information Lab (OMILab). Research interests and publications focus on various aspects of speech sciences, with expertise in speech prosody, acoustic phonetics, and speech communication and text analytics.
- Dr. Anat Lerner, Senior Lecturer - Interested in speech prosody analyses, combinatorial auctions and computer networks (especially ad hoc networks, mobile and cellular networks).
Bar Ilan University:
Ben-Gurion University:
- Natural Language Processing Lab at Ben Gurion University
- Dr. Oren Tzur
University of Haifa:
- Prof. Shuly Wintner
- Dr. Einat Minkov - My main interests are in Information Extraction and Semantics, as well as in other Natural Language Processing applications. I am also interested in Machine Learning - and the application of learning to NLP problems.
Tel Aviv University:
- Dr. Jonathan Berant
The Technion:
- Prof. Alon Itai (retired)
- Dr. Roi Reichart - An Assistant Professor at the faculty of Industrial Engineering and Management of the Technion. Working on Natural Language Processing (NLP). Interested in language learning in its context and in designing models that integrate domain and world knowledge with data-driven methods.
The Hebrew University of Jerusalem:
- Prof. Ronen Feldman - Feldman's main areas of research are natural language processing, entity extraction and text relations, text sentiment analysis, and language processing for algorithmic trading. He is one of the founders of the discipline of text mining.
- Prof. Ari Rappoport - With his main contribution in the area of Neuroscience, where he developed a comprehensive theory of the brain, Prof. Rappoport's Computer Science area of interest is language (Computational Linguistics, Natural Language Processing (NLP)), from cognitive science and machine learning perspectives.
- Dr. Omri Abend - My fields of interest are Computational Linguistics and Natural Language Processing. Specifically, I conduct research on semantic (meaning) representation from a computational perspective. My research is tightly linked to statistical learning, language technology (such as Machine Translation and Information Extraction), and computational modeling of child language acquisition.
- Dr. Dafna Shahaf - Dr. Shahaf's research focuses on helping people make sense of the world. She designs algorithms that help people understand the underlying structure of complex topics, and connect the dots between different pieces. She also likes to formalize intuitive notions; see recent work on Computational Humor.
- The Neurolinguistics Laboratory at the Edmond and Lily Safra Center for Brain Sciences (ELSC):
  - Prof. Yosef Grodzinsky - Research fields: functional anatomy of language, linguistic theory (syntax, semantics), language acquisition, aphasia, individual variation.

4.2 Non-Profit

Allen Institute for AI - Israel
- Prof. Yoav Goldberg
- Dr. Jonathan Berant

5 Papers

Hebrew Dependency Parsing: Initial Results, IWPT-2009 (Short Paper), Yoav Goldberg and Michael Elhadad.
Itai, A., S. Wintner, and S. Yona: 2006, ‘A Computational Lexicon of Contemporary Hebrew’. In: Proceedings of The fifth international conference on Language Resources and Evaluation (LREC-2006). Genoa, Italy.
Alon Itai and Shuly Wintner. "Language Resources for Hebrew." Language Resources and Evaluation 42(1):75-98, March 2008.
Noam Ordan and Shuly Wintner. "Hebrew WordNet: A Test Case of Aligning Lexical Databases Across Languages." International Journal of Translation 19(1):39-58, 2007.
Noam Ordan and Shuly Wintner. "Representing Natural Gender in Multilingual Lexical Databases." International Journal of Lexicography 18(3):357-370, September 2005.
Khalil Sima'an, Alon Itai, Yoad Winter, Alon Altman and Noa Nativ. "Building a Tree-Bank of Modern Hebrew Text." Traitement Automatique des Langues, 42, 347-380. 2001.

5.2 Morphological Analysis & Disambiguation

Shlomo Yona and Shuly Wintner. "A Finite-State Morphological Grammar of Hebrew." Natural Language Engineering 14(2):173-190, April 2008.
Meni Adler. Hebrew Morphological Disambiguation: An Unsupervised Stochastic Word-based Approach. Ph.D. Thesis, Ben-Gurion University of the Negev, 2007.
Roy Bar-Haim, Khalil Sima'an and Yoad Winter. Part-of-Speech Tagging of Modern Hebrew Text. Natural Language Engineering 14 (2):223-251. Copyright Cambridge University Press, 2008.
Amir More and Reut Tsarfaty. Data-Driven Morphological Analysis and Disambiguation for Morphologically Rich Languages and Universal Dependencies. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. December 2016.

5.3 Word Embeddings

Oded Avraham and Yoav Goldberg. The Interplay of Semantics and Morphology in Word Embeddings. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017).

5.4 Methodology

Named Entities Tagging Guidelines for Hebrew {Apache License 2.0} - Written during M.Sc. research by Naama Ben-Mordecai advised by Dr. Michael Elhadad at the Department of Computer Science, Ben-Gurion University.

5.5 Other

Eran Tomer. Automatic Hebrew Text Vocalization. Thesis submitted as part of the requirements for the M.Sc. degree of Ben-Gurion University of the Negev, 2012.
