yonatanbitton / NLPH_Resources

A comprehensive list of Hebrew NLP resources.

Hebrew NLP Resources

This repository collects resources for NLP in Hebrew, as part of the NLPH project, which you can read more about here. Resources are divided into folders by type. If you have a resource to contribute under an open license, please submit a pull request, or contact us at contact@nlph.org.il. See here for a list of companies operating in the field.

This document lists Hebrew NLP resources, both for general use and as a reference when discussing which existing tools can be opened, adapted, or integrated to help create a good open-source foundation for NLP in Hebrew, as part of the NLPH Project.

When contributing to the list, please add a link to the license for all non-paper resources, e.g. {AGPL-3.0}, {?} for an unknown license, or {X} for unreleased/closed/copyrighted resources. For code resources, please also add the main language in which the tool is written, e.g. [Python], or [?] for an unknown programming language. Please add hosting mirrors in angle brackets, e.g. <Zenodo mirror>.
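
The tagging convention above is regular enough to be read by a script. The following is a minimal sketch, not part of the repository, that pulls the license, language, and mirror tags out of a single list entry. The names `Entry` and `parse_entry` are hypothetical, and the sketch assumes the license tag always follows the resource name and the optional language tag.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# One pattern per tag style described in the contribution guidelines.
LICENSE_RE = re.compile(r"\{([^{}]+)\}")      # e.g. {AGPL-3.0}, {?}, {X}
LANGUAGE_RE = re.compile(r"\[([^\[\]]+)\]")   # e.g. [Python], [?]
MIRROR_RE = re.compile(r"<([^<>]+)>")         # e.g. <Zenodo mirror>


@dataclass
class Entry:
    """Hypothetical container for one parsed list entry."""
    name: str
    license: Optional[str] = None
    language: Optional[str] = None
    mirrors: List[str] = field(default_factory=list)
    description: str = ""


def parse_entry(line: str) -> Entry:
    """Split one bullet line into name, tags, and description."""
    text = line.strip().lstrip("•").strip()
    lic = LICENSE_RE.search(text)
    if lic:
        # The license tag closes the entry head, so names that contain
        # " - " themselves (e.g. "CoSIH - The Corpus ...") stay intact.
        head, rest = text[:lic.start()], text[lic.end():]
    else:
        # Entries without a license tag: split at the first " - ".
        head, _, rest = text.partition(" - ")
    lang = LANGUAGE_RE.search(head)
    name = LANGUAGE_RE.sub("", head).strip()
    mirrors = MIRROR_RE.findall(rest)
    description = MIRROR_RE.sub("", rest).strip().lstrip("-").strip()
    return Entry(
        name=name,
        license=lic.group(1) if lic else None,
        language=lang.group(1) if lang else None,
        mirrors=mirrors,
        description=description,
    )


if __name__ == "__main__":
    print(parse_entry("  • HebMorph [Lucene] {AGPL-3.0} - Hebrew search for Lucene."))
```

A parser like this could be used to check that new entries carry a license tag before a pull request is merged; the `?` and `X` markers are returned unchanged so the caller can decide how to treat them.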

  • The MILA corpora collection {GPLv3} - The MILA center has 20 different corpora available for free for non-commercial use. All are available in plain text format, and most have tokenized, morphologically-analyzed, and morphologically-disambiguated versions available too.
  • Hebrew Wikipedia dumps {CC-BY-SA 3.0} - Wikipedia, the free encyclopedia, publishes dumps of its content as XML files on a monthly basis.
  • שתי שקל {?} - Wikiproject for correcting grammar mistakes. (Heuristic) positive annotations can be derived from a query.
  • Hebrew Speech Databases (HSD) {?} - The HSD contains several Hebrew Speech Databases designed for the development and evaluation of Hebrew Speech Recognition Systems.
  • CoSIH - The Corpus of Spoken Hebrew {?} - The Corpus of Spoken Israeli Hebrew (CoSIH) is a database of recordings of spoken Israeli Hebrew.
  • hebrew corpus {?} - HebrewCorpus is a new corpus with 150 million words from NMELRC.
  • The Haifa Corpus of Spoken Hebrew {X} - A corpus of recorded spoken Hebrew and transcripts. Protected under rights reserved to Prof. Yael Maschler.
  • Eran Tomer's Digital Vocalized Text Corpus {Apache License 2.0} - A corpus of digital vocalized Hebrew texts created by Eran Tomer as part of his Master's thesis. The corpus is found in the resources folder.
  • The SVLM Hebrew Wikipedia Corpus {CC-BY-SA 3.0} - A corpus of 50K sentences from Hebrew Wikipedia chosen to ensure phoneme coverage for the purpose of a sentence recording project.
  • Knesset 2004-2005 {Public Domain} - A corpus of transcriptions of Knesset (Israeli parliament) meetings between January 2004 and November 2005. Includes tokenized and morphologically tagged versions of most of the documents in the corpus. <MILA> <Zenodo>
  • JPress {Custom Terms of Use} - The National Library offers a collection of Jewish newspapers published in various countries, languages, and time periods, including digital versions and full-text search. The texts are published under a custom Terms of Use document that prohibits commercial use, and additionally requires checking the copyright status and receiving permission from the copyright-holder of the work for any use requiring such permission according to the Copyright Law.
  • DICTA {?} - Analytical tools for Jewish texts. They also have a GitHub organization.
  • Sefaria {Various} - A Living Library of Jewish Texts. 3,000 years of Jewish texts in Hebrew and English translation.
  • HaArchion {?} - Recordings of various Hebrew prose and poetry being read.
  • Project Ben Yehuda public dumps {Public Domain} - A repository containing dumps of thousands of public domain works in Hebrew, from Project Ben-Yehuda, in plaintext UTF-8 files, with and without diacritics (nikkud), and in HTML files.
  • ThinkIL {CC-BY-SA 3.0} - An archive of the writings of Zvi Yanai.
  • "Ha'Olam Ha'Ze" Newspaper Archive {?} - An online archive of issues of "Ha'Olam Ha'Ze" ("This World") Israeli newspaper.
  • The BGU morphological lexicon {?} - Is it released?
  • The morphological lexicon of the Israeli National Institute for Testing and Evaluation - Unreleased.
  • The MILA lexicon of Hebrew words {GPLv3} - The lexicon was designed mainly for usage by morphological analyzers, but is being constantly extended to facilitate other applications as well. The lexicon contains about 25,000 lexicon items and is extended regularly. Free for non-commercial use.
  • Hebrew WordNet {GPLv3} - Hebrew WordNet uses the MultiWordNet methodology and is aligned with the one developed at IRST (and therefore is aligned with English, Italian and Spanish). Free for non-commercial use.
  • MILA's Verb Complements Lexicon {GPLv3} - NLPH backup here.
  • The Hebrew Treebank {GPLv3} - The Hebrew Treebank Version 2.0 contains 6500 hand-annotated sentences of news items from the MILA HaAretz Corpus, with full word segmentation and morpho-syntactic analysis. Morphological features that are not directly relevant for syntactic structures, like roots, templates and patterns, are not analyzed. This resource can be used freely for research purposes only.
  • UD Hebrew Treebank {CC BY-NC-SA 4.0} - The Hebrew Universal Dependencies Treebank.
  • Modern Hebrew Dependency Treebank v.1 {GPLv3} - This is the Modern Hebrew Dependency Treebank which was created and used in Yoav Goldberg's PhD thesis.
  • fastText pre-trained word vectors for Hebrew {CC-BY-SA 3.0} - Trained on Wikipedia using fastText. Comes in both the binary and text default formats of fastText: binary+text, text. In the text format, each line contains a word followed by its embedding, with space-separated values; words are ordered by descending frequency.
  • hebrew-word2vec pre-trained word vectors {Apache License 2.0} - Trained on data from Twitter. Developed by Ron Shemesh in Bar-Ilan University's NLP lab under the instruction of Dr. Yoav Goldberg. Contains vectors for over 1.4M words (as of January 2018). Comes in a zip with two files: a text file with a word list and a NumPy array file (npy file).
  • LASER Language-Agnostic SEntence Representations {CC BY-NC 4.0} - LASER is a library to calculate and use multilingual sentence embeddings.
  • NLPL word embeddings - Trained on the Hebrew CoNLL17 corpus using Word2Vec continuous skipgram, with a vector dimension of 100 and a window size of 10. The vocabulary includes 672,384 words.
  • Hebrew word embeddings by Dr. Oren Glickman {?} - Trained on Twitter. Unreleased. Presented in his lecture at the 2018 yearly conference of The Israel Statistical Association (presentation file).
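
The fastText text format mentioned in the list above (one word per line, followed by its space-separated embedding values, most frequent words first) can be read with the standard library alone. The sketch below is illustrative only: the function name `iter_vectors` is hypothetical, and it assumes the file may start with a header line holding the vocabulary size and the vector dimension, which it skips when present.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def iter_vectors(
    handle: TextIO, limit: Optional[int] = None
) -> Iterator[Tuple[str, List[float]]]:
    """Yield (word, vector) pairs from a fastText-style text file.

    Because words are ordered by descending frequency, passing `limit`
    keeps only the most frequent words without reading the whole file.
    """
    count = 0
    for i, line in enumerate(handle):
        if not line.strip():
            continue
        parts = line.rstrip().split(" ")
        # Assumed header: two integers, "<vocabulary size> <dimension>".
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if limit is not None and count >= limit:
            break
        yield parts[0], [float(v) for v in parts[1:]]
        count += 1


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real vectors file.
    sample = io.StringIO("2 3\nשלום 0.1 0.2 0.3 \nעולם 0.4 0.5 0.6 \n")
    for word, vector in iter_vectors(sample):
        print(word, vector)
```

For a real file, open it with `open(path, encoding="utf-8")` and pass the handle to `iter_vectors`. Streaming the file this way avoids loading every vector into memory when only the most frequent words are needed.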

Also see here: https://github.com/iddoberger/awesome-hebrew-nlp

  • Verb Inflector [Java] {Apache License 2.0} - A generation mechanism, created as part of Eran Tomer's (erantom@gmail.com) Master's thesis, which produces vocalized and morphologically tagged Hebrew verbs given a non-vocalized verb in base form and an indication of which pattern the verb follows.
  • HebMorph [Lucene] {AGPL-3.0} - An open-source effort to make Hebrew properly searchable by various IR software libraries. Includes Hebrew Analyzer for Lucene.
  • Hebrew OCR with Nikud [Python] {?} - A program to convert Hebrew text files (without Nikud) to text files with the correct Nikud. Developed by Adi Oz and Vered Shani.
  • Text-Fabric [Python] {CC BY-NC 4.0} - A Python package for browsing and processing ancient corpora, focused on the Hebrew Bible Database.
  • Nakdan - Automatic Nikud for Hebrew texts.
  • The Automatic Hebrew Transcriber - Automatically transcribes text from Hebrew audio and video files.
  • word2word {Apache License 2.0} - Easy-to-use word-to-word translations for 3,564 language pairs. Hebrew is one of the 62 supported languages, and thus word-to-word translation to/from Hebrew is supported for 61 languages.
  • Eyfo - A commercial engine for search and entity tagging in Hebrew.
  • Melingo's ICA (Intelligent Content Analysis) - A text analysis and textual categorized entity extraction API for Hebrew, Arabic and Farsi texts.
  • Genius - Automatic analysis of free text in Hebrew.
  • AlmaReader - Online text-to-speech service for Hebrew.

This list is meant to cover researchers in natural language processing and in related fields, including neurolinguistics and speech science, in both academia and industry.

  • Allen Institute for AI - Israel
    • Prof. Yoav Goldberg
    • Dr. Jonathan Berant

Researching natural language processing in the industry? Open a pull request and add yourself here now!
