A Python package with methods to handle the complexities of Hebrew text, calculate Gematria, and more.
Documentation: https://hebrew.aviperl.me/
Repository: https://github.com/avi-perl/hebrew
$ pip install hebrew
Hebrew assists in working with Hebrew text by providing methods to handle the text according to user-perceived characteristics. Additionally, methods for common Hebrew text processing are provided.
from hebrew import Hebrew
from hebrew.chars import HebrewChar, ALEPH
hs = Hebrew('בְּרֵאשִׁ֖ית')
print(list(hs.graphemes)) # ['בְּ', 'רֵ', 'א', 'שִׁ֖', 'י', 'ת']
print(hs.text_only()) # בראשית
print(ALEPH) # HebrewChar(char='א', name='Aleph', hebrew_name='אָלֶף', name_alts=['Alef'], hebrew_name_alts=None, final_letter=False)
print(HebrewChar.search('bet')) # HebrewChar(char='בּ', name='Bet', hebrew_name='בֵּית', name_alts=None, hebrew_name_alts=None, final_letter=False)
The Hebrew class includes a gematria function that can return a value for 23 different variations of Gematria!
from hebrew import Hebrew
from hebrew import GematriaTypes
hs = Hebrew('בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃')
print(hs.gematria()) # 2701
print(hs.gematria(GematriaTypes.MISPAR_GADOL)) # 4631
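To make the two numbers above concrete, here is a minimal standalone sketch of the standard method and Mispar Gadol using only plain Python. It is not the library's implementation; the names LETTERS, STANDARD, GADOL, and simple_gematria are made up for this illustration, and the verse is written without vowel points for readability.

```python
# The 22 letters in alphabetical order, paired with their standard values.
LETTERS = "אבגדהוזחטיכלמנסעפצקרשת"
VALUES = [1, 2, 3, 4, 5, 6, 7, 8, 9,
          10, 20, 30, 40, 50, 60, 70, 80, 90,
          100, 200, 300, 400]

# Standard method: final forms share the value of their regular form.
STANDARD = dict(zip(LETTERS, VALUES))
STANDARD.update({"ך": 20, "ם": 40, "ן": 50, "ף": 80, "ץ": 90})

# Mispar Gadol differs only in the final forms, which continue from 500 to 900.
GADOL = {**STANDARD, "ך": 500, "ם": 600, "ן": 700, "ף": 800, "ץ": 900}


def simple_gematria(text, table=STANDARD):
    """Sum the values of known letters; any other character counts as zero."""
    return sum(table.get(ch, 0) for ch in text)


verse = "בראשית ברא אלהים את השמים ואת הארץ"
print(simple_gematria(verse))         # 2701
print(simple_gematria(verse, GADOL))  # 4631
```

Because unknown characters contribute nothing, vowel points, punctuation, and Latin text in the input do not change the result in this sketch.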
Messy inputs, such as strings with English text mixed in, are supported. However, do be careful to work with sanitized strings as much as possible.
from hebrew import Hebrew
hs1 = Hebrew(
'''
Text: "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃"
Translation: "When God began to create heaven and earth"
'''
)
hs2 = Hebrew('בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃')
assert hs1.gematria() == hs2.gematria() # 2701
Major kudos go to TorahCalc, whose calculator and explanations were critical to the development of this feature.
Hebrew text comes in different forms, depending on the context. It may appear with Niqqudot, "a system of diacritical signs used to represent vowels or distinguish between alternative pronunciations of letters of the Hebrew alphabet" [1]. It may also appear with extensive punctuation characters that connect or separate words, and with cantillation marks "used as a guide for chanting the text, either from the printed text or, in the case of the public reading of the Torah" [2].
Because of the above, from the perspective of a Hebrew reader, the following 3 words are the same:
- בְּרֵאשִׁ֖ית
- בְּרֵאשִׁית
- בראשית
However, as Unicode strings, they are entirely different because of the additional characters.
assert len("בְּרֵאשִׁ֖ית") == 12
assert len("בְּרֵאשִׁית") == 11
assert len("בראשית") == 6
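To see where the extra length comes from, the standard-library sketch below lists every code point in the fully marked word and separates base letters from combining marks. The word is spelled with escapes so that each code point is explicit; the mark order shown is an assumption about how the example string is encoded.

```python
import unicodedata

# The fully marked word from the examples, one escape per code point.
word = "\u05d1\u05b0\u05bc\u05e8\u05b5\u05d0\u05e9\u05b4\u05c1\u0596\u05d9\u05ea"

for ch in word:
    # Category "Lo" is a letter; "Mn" is a non-spacing (combining) mark.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

letters = [ch for ch in word if unicodedata.category(ch) == "Lo"]
marks = [ch for ch in word if unicodedata.category(ch) == "Mn"]
print(len(word), len(letters), len(marks))  # 12 6 6
```

Half of the code points are vowel points, dots, or cantillation marks, which is why len() reports twice as many characters as a reader perceives.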
This impacts the user in a number of other ways. For example, if I want to get the root of this Hebrew word using a slice:
Expected: רֵאשִׁ֖ית
he = "בְּרֵאשִׁ֖ית"
assert he[-5:] == 'ִׁ֖ית'
The solution is to handle the Unicode string as a list of grapheme [3] characters, where each letter and its accompanying characters are treated as a single unit.
Using the grapheme library for Python, we can work with the grapheme characters as units. This allows us to get the right number of characters, slice the string correctly, and more.
import grapheme
assert grapheme.length("בְּרֵאשִׁ֖ית") == 6
assert grapheme.slice("בְּרֵאשִׁ֖ית", start=1, end=6) == 'רֵאשִׁ֖ית'
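The idea of treating a letter and its marks as one unit can be approximated without any dependency. The sketch below is a deliberate simplification that attaches every combining mark to the character before it; it is not the Unicode segmentation algorithm that the grapheme library follows, and the function name clusters is made up for this illustration.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        if groups and unicodedata.category(ch).startswith("M"):
            groups[-1] += ch  # attach the mark to the previous base character
        else:
            groups.append(ch)
    return groups


# The fully marked word from the examples, one escape per code point.
word = "\u05d1\u05b0\u05bc\u05e8\u05b5\u05d0\u05e9\u05b4\u05c1\u0596\u05d9\u05ea"
parts = clusters(word)
print(len(parts))           # 6
print("".join(parts[1:]))   # the word without its first letter and that letter's marks
```

For pointed Hebrew this simple rule gives the same unit count as the reader-perceived letters, but it will not handle every script or edge case the way a full grapheme implementation does.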
This library includes 2 classes. GraphemeString is a class that supports all the functions made available by grapheme. The 2nd class, Hebrew, subclasses GraphemeString and adds methods for handling Hebrew text. This allows us to interact with the text like so:
from hebrew import Hebrew
v2 = Hebrew("וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃")
print(v2.no_taamim()) # "וְהָאָרֶץ הָיְתָה תֹהוּ וָבֹהוּ וְחֹשֶׁךְ עַל־פְּנֵי תְהוֹם וְרוּחַ אֱלֹהִים מְרַחֶפֶת עַל־פְּנֵי הַמָּיִם׃"
print(v2.text_only()) # והארץ היתה תהו ובהו וחשך על־פני תהום ורוח אלהים מרחפת על־פני המים
assert v2.length == 66
print(v2.words(split_maqaf=True)) # [וְהָאָ֗רֶץ, הָיְתָ֥ה, תֹ֙הוּ֙, וָבֹ֔הוּ, וְחֹ֖שֶׁךְ, עַל, פְּנֵ֣י, תְה֑וֹם, וְר֣וּחַ, אֱלֹהִ֔ים, מְרַחֶ֖פֶת, עַל, פְּנֵ֥י, הַמָּֽיִם׃]
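As a rough illustration of what removing cantillation marks or all marks involves, the sketch below filters characters by their position in the Unicode Hebrew block. It is an assumption-laden stand-in, not how no_taamim or text_only are implemented; the helper names are hypothetical, and the library may classify individual marks (for example meteg) or punctuation differently.

```python
import unicodedata


def is_taam(ch):
    # Hebrew accents (cantillation marks) occupy U+0591..U+05AF.
    return 0x0591 <= ord(ch) <= 0x05AF


def is_point(ch):
    # Combining marks in U+05B0..U+05C7; the category check skips
    # punctuation in that range such as maqaf and sof pasuq.
    return 0x05B0 <= ord(ch) <= 0x05C7 and unicodedata.category(ch) == "Mn"


def strip_taamim(text):
    """Drop cantillation marks, keep vowel points and letters."""
    return "".join(ch for ch in text if not is_taam(ch))


def strip_marks(text):
    """Drop cantillation marks and points, keep letters and punctuation."""
    return "".join(ch for ch in text if not (is_taam(ch) or is_point(ch)))


# A pointed word with one cantillation mark, one escape per code point.
word = "\u05d1\u05b0\u05bc\u05e8\u05b5\u05d0\u05e9\u05b4\u05c1\u0596\u05d9\u05ea"
print(len(strip_taamim(word)))  # 11
print(strip_marks(word))        # בראשית
```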
The text used in these examples and in testing was sourced from Sefaria.
hebrew.chars contains constants for every letter as well as lists grouped by character category. Each value is an instance of a class that represents a character in the Hebrew character set with relevant properties. Since this library seeks to support the Hebrew language in the way it is actually used, characters such as "בּ" can be located (BET) even though, strictly speaking, "בּ" is not part of the Hebrew alphabet; it is a Hebrew letter plus a dot.
from hebrew.chars import FINAL_LETTERS, YIDDISH_CHARS, TSADI
print(TSADI) # HebrewChar(char='צ', name='Tsadi', hebrew_name='צַדִי', name_alts=['Tzadik'], hebrew_name_alts=['צדיק'], final_letter=False)
assert {c.name: c.char for c in FINAL_LETTERS} == {'Chaf Sofit': 'ך', 'Mem Sofit': 'ם', 'Nun Sofit': 'ן', 'Fe Sofit': 'ף', 'Tsadi Sofit': 'ץ'}
assert [c.char for c in YIDDISH_CHARS] == ['ײ', 'װ', 'ױ']
A letter can be retrieved using the CHARS dict, a dict of all instances of all supported Char types where the key is the char and the value is an instance of BaseHebrewChar.
from hebrew.chars import CHARS
print(CHARS.get('בּ')) # HebrewChar(char='בּ', name='Bet', hebrew_name='בֵּית', name_alts=None, hebrew_name_alts=None, final_letter=False)
Search is also supported so that letters can be retrieved by their name.
from hebrew.chars import HebrewChar
print(HebrewChar.search('bet')) # HebrewChar(char='בּ', name='Bet', hebrew_name='בֵּית', name_alts=None, hebrew_name_alts=None, final_letter=False)
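For readers curious how lookup by character and search by name can fit together, here is a small standalone sketch. The Letter class, BY_CHAR dict, and search function are hypothetical stand-ins written for this illustration; they are not the library's HebrewChar, CHARS, or search implementation, and only three letters are included.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Letter:
    """Hypothetical stand-in for a character record."""
    char: str
    name: str
    name_alts: Tuple[str, ...] = ()


LETTERS = [
    Letter("\u05d0", "Aleph", ("Alef",)),
    Letter("\u05d1\u05bc", "Bet"),  # bet plus dagesh, two code points
    Letter("\u05e6", "Tsadi", ("Tzadik",)),
]

# Lookup by character: the key is the (possibly multi code point) char.
BY_CHAR = {letter.char: letter for letter in LETTERS}


def search(name: str) -> Optional[Letter]:
    """Case-insensitive match against the name and alternative names."""
    wanted = name.casefold()
    for letter in LETTERS:
        known = {n.casefold() for n in (letter.name, *letter.name_alts)}
        if wanted in known:
            return letter
    return None


print(search("bet"))
print(BY_CHAR.get("\u05d0"))
```

Keying the dict on the full string rather than a single code point is what lets an entry such as bet with a dagesh be found directly.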
Contributions in the form of pull requests are very welcome! I'm sure many more useful methods related to Hebrew text could be added. More information and instructions for contributing can be found here.