irregular-chars is a library for cleaning text, such as removing zero-width characters or converting full-width characters to half-width.
You can install the package via pip:
pip install irregular_chars
from irregular_chars import remove_zero_width_spaces
text = "Hello\u200BWorld"
clean_text = remove_zero_width_spaces(text)
print(clean_text) # Outputs: HelloWorld
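For readers curious what this kind of cleanup involves, here is a minimal standard-library sketch. It is not the library's implementation: `strip_zero_width` is a hypothetical helper, and the set of code points it removes (ZWSP, ZWNJ, ZWJ, word joiner, BOM) is an assumption about what counts as zero-width.

```python
import re

# Hypothetical helper, not part of irregular_chars.
# Assumed zero-width set: U+200B (zero width space), U+200C (non-joiner),
# U+200D (joiner), U+2060 (word joiner), U+FEFF (BOM / zero width no-break space).
_ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


def strip_zero_width(text: str) -> str:
    """Return text with the assumed zero-width code points removed."""
    return _ZERO_WIDTH.sub("", text)


print(strip_zero_width("Hello\u200bWorld"))  # HelloWorld
```

Note that removing U+200D also breaks emoji joined with it, so a narrower set may be preferable for text that contains emoji.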
- convert full-width alphanumerics to half-width
from irregular_chars import full_to_small_width_alphanumerics
assert full_to_small_width_alphanumerics("０") == "0" # True (input is full-width U+FF10)
- convert half-width kana to full-width
from irregular_chars import half_to_full_width_kanas
assert half_to_full_width_kanas("ｱ") == "ア" # True (input is half-width U+FF71)
- normalize the width of both kana (to full-width) and alphanumerics (to half-width)
from irregular_chars import normalize_width_all
assert normalize_width_all("ｱ０") == "ア0" # True (half-width kana and full-width digit in, full-width kana and ASCII digit out)
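As a point of comparison, Unicode's NFKC normalization from the standard library performs a similar width folding. The sketch below is not how irregular_chars works internally, and `normalize_width_nfkc` is a hypothetical name. NFKC is also broader than a pure width conversion: it rewrites ligatures, superscripts, and other compatibility characters too.

```python
import unicodedata


def normalize_width_nfkc(text: str) -> str:
    """Hypothetical helper: fold width variants using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


# Half-width katakana A (U+FF71) and full-width digit zero (U+FF10).
mixed = "\uff71\uff10"
result = normalize_width_nfkc(mixed)

# Full-width katakana A (U+30A2) followed by ASCII "0".
assert result == "\u30a2" + "0"
```

If only width should change, a hand-written mapping restricted to the Halfwidth and Fullwidth Forms block (U+FF00-U+FFEF) is the safer choice.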
- combine voiced sound marks with the preceding kana
from irregular_chars import combine_sound_symbols
assert combine_sound_symbols("カ\u3099キ\u3099ク\u3099ケ\u3099コ\u3099") == "ガギグゲゴ" # True (input assumed to use the combining mark U+3099)
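To see what "combining" means at the code point level, here is a small sketch using NFC normalization from the standard library. It is an illustration under the assumption that the sound symbols are the combining marks U+3099 (dakuten) and U+309A (handakuten). `compose_voiced_marks` is a hypothetical name, not the library's function.

```python
import unicodedata


def compose_voiced_marks(text: str) -> str:
    """Hypothetical helper: merge base kana + combining mark into one code point (NFC)."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u30ab\u3099"  # KA (U+30AB) followed by combining dakuten (U+3099)
composed = compose_voiced_marks(decomposed)

assert composed == "\u30ac"  # GA as a single code point
assert (len(decomposed), len(composed)) == (2, 1)
```

The two forms render identically, which is why the difference usually only shows up in string comparison, length checks, or search.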
- detect IVS (Unicode IVS): the code point of the character falls in the variation selector range (U+E0100-U+E01EF). Variation selectors of this kind can simply be ignored.
from irregular_chars.ivs import is_unicode_ivs
assert is_unicode_ivs(0xE0100) # True
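The range check described above can be sketched in plain Python. `in_variation_selector_range` is a hypothetical stand-in used for illustration, and it only tests the range quoted in this README.

```python
def in_variation_selector_range(code_point: int) -> bool:
    """Hypothetical helper: True if code_point lies in U+E0100-U+E01EF."""
    return 0xE0100 <= code_point <= 0xE01EF


# An ideograph (U+845B) followed by the first selector in the range (U+E0100).
sequence = "\u845b\U000E0100"
flags = [in_variation_selector_range(ord(ch)) for ch in sequence]
print(flags)  # [False, True]
```

Because the selector is a separate code point that follows the base character, it can be dropped without touching the base character itself.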
- detect IVS (CJK or supplementary IVS): the code point falls in the range of CJK Unified Ideographs Extensions B-F and the Supplementary Ideographic Plane (U+20000-U+2FA1F).
These characters are strongly tied to the preceding character, so you cannot remove or replace them on their own.
from irregular_chars.ivs import is_cjk_or_supplementary_ivs
assert is_cjk_or_supplementary_ivs(0x20000) # True
- remove Unicode IVS and raise an error if a CJK or supplementary IVS is found
from irregular_chars.ivs import remove_ivs
assert remove_ivs("test\U000E0100") == "test" # True
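The remove-or-raise policy described above could look roughly like the following sketch. It is written under assumptions rather than taken from the library: `strip_selectors_or_raise` is a hypothetical name, and `ValueError` is an assumed exception type, since this README does not say what `remove_ivs` raises.

```python
def strip_selectors_or_raise(text: str) -> str:
    """Hypothetical helper: drop U+E0100-U+E01EF, reject U+20000-U+2FA1F."""
    kept = []
    for ch in text:
        cp = ord(ch)
        if 0xE0100 <= cp <= 0xE01EF:
            # Variation selector: safe to drop on its own.
            continue
        if 0x20000 <= cp <= 0x2FA1F:
            # Assumed error type; the real library may raise something else.
            raise ValueError(f"cannot safely remove U+{cp:05X}")
        kept.append(ch)
    return "".join(kept)


print(strip_selectors_or_raise("test\U000E0100"))  # test
```

Raising instead of silently deleting keeps the caller in control of characters that cannot be removed safely.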