ternaus / ternaus-cleantext

Cleans text as in the CLIP model

Cleantextclip

A library that prepares text for machine learning and NLP tasks. It originated from the text cleaning used for the CLIP model, with a few more rules added.

Installation

pip install -U ternaus_cleantext

Cleans text similarly to the CLIP model, but more strictly:

  1. Escapes HTML characters
  2. Removes HTML tags
  3. Removes URLs
  4. Removes extra whitespace
  5. Converts text to lowercase

from ternaus_cleantext.ternaus_cleantext import clean_text
print(clean_text("This is a test https://ternaus.com <b>bold</b>"))

This prints: this is a test bold
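To make the cleaning steps concrete, here is a minimal sketch of a comparable pipeline using only the Python standard library. It is not the library's implementation: the function name `clean_text_sketch` and the regular expressions are hypothetical, and the first step decodes HTML entities with `html.unescape` (as CLIP's own basic cleaning does), which is an assumption about what the library does with HTML characters.

```python
import html
import re

# Hypothetical patterns, chosen for illustration only; the library's real rules may differ.
TAG_RE = re.compile(r"<[^>]+>")                   # anything that looks like an HTML tag
URL_RE = re.compile(r"https?://\S+|www\.\S+")     # http(s) links and bare www. links
WHITESPACE_RE = re.compile(r"\s+")                # runs of spaces, tabs, newlines


def clean_text_sketch(text: str) -> str:
    """Illustrative stand-in for the cleaning steps listed above."""
    text = html.unescape(text)         # assumption: decode entities such as &amp;
    text = TAG_RE.sub(" ", text)       # drop HTML tags
    text = URL_RE.sub(" ", text)       # drop URLs
    text = WHITESPACE_RE.sub(" ", text).strip()  # collapse extra whitespace
    return text.lower()                # lowercase the result


if __name__ == "__main__":
    print(clean_text_sketch("This is a test https://ternaus.com <b>bold</b>"))
```

Tags and URLs are replaced with a space rather than an empty string so that neighbouring words do not fuse together; the whitespace step then collapses the leftover gaps.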

License:MIT License

