Guangyi-Z / text-cleaner

simple text preprocessing tool

text-cleaner, simple text preprocessing tool

Introduction

  • Supports Python 2.7, 3.3, 3.4, and 3.5.
  • Simple interfaces.
  • Easy to extend.

Install

pip install text-cleaner

WARNING FOR PYTHON 2.7 USERS: only UCS-4 builds (--enable-unicode=ucs4) are supported. UCS-2 builds are NOT SUPPORTED in the latest version.
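
To check which kind of build you are running, you can inspect sys.maxunicode from the standard library. This is a minimal sketch; the helper name is_ucs4_build is ours and is not part of text-cleaner.

```python
import sys


def is_ucs4_build():
    """Return True when the interpreter can represent all Unicode code points."""
    # Narrow (UCS-2) builds of Python 2.7 report 0xFFFF here.
    # Wide (UCS-4) builds and every Python 3.3+ interpreter report 0x10FFFF.
    return sys.maxunicode > 0xFFFF


print(is_ucs4_build())
```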

Usage

from text_cleaner import remove, keep

from text_cleaner.processor.common import ASCII
from text_cleaner.processor.chinese import CHINESE, CHINESE_SYMBOLS_AND_PUNCTUATION
from text_cleaner.processor.misc import RESTRICT_URL

# remove url and ascii characters.
# return: u'点击  查看 '
remove(
    '点击http://t.cn/RtU0mZ1 查看,123456,test',
    [RESTRICT_URL, ASCII],
)

# remove only Chinese punctuation (ASCII commas are left untouched).
# return: u'点击 http://t.cn/RtU0mZ1  查看,123456,test '
remove(
    '点击:http://t.cn/RtU0mZ1, 查看,123456,test。!?',
    [CHINESE_SYMBOLS_AND_PUNCTUATION],
)

# keep chinese characters and url.
# return: u'点击 http://t.cn/RtU0mZ1 查看'
keep(
    '点击http://t.cn/RtU0mZ1 查看,123456,test',
    [CHINESE, RESTRICT_URL],
)

# use processor directly.
# return: u'点击  查看'
RESTRICT_URL.remove('点击http://t.cn/RtU0mZ1 查看')
# return: u'点击<URL> 查看'
RESTRICT_URL.replace('<URL>').remove('点击http://t.cn/RtU0mZ1 查看')

Interfaces

text_cleaner.remove(text, processors):

  • text: str or bytes (unicode or str for Python 2).
  • processors: iterable of processors. remove calls the remove method of each processor on text.

text_cleaner.keep(text, processors):

  • same as remove, but calls the keep method of each processor instead.
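
As a rough illustration of how remove could chain processors, here is a sketch that applies each one in order. It is only one plausible reading of the interface above: make_remover and remove_sketch are hypothetical names, and the library's actual implementation may differ.

```python
import re


def make_remover(pattern, replace_text=' '):
    """Build a hypothetical stand-in for a processor's remove method."""
    regex = re.compile(pattern)
    return lambda text: regex.sub(replace_text, text)


def remove_sketch(text, removers):
    """Apply each remover to the text in order and return the result."""
    for remover in removers:
        text = remover(text)
    return text


# '*' is used as the replacement text so the substitutions are easy to see.
strip_digits = make_remover(r'[0-9]+', '*')
strip_lowercase = make_remover(r'[a-z]+', '*')

result = remove_sketch('abc123 def', [strip_digits, strip_lowercase])
print(result)
```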

Processors

DEFAULT_REPLACE_TEXT: ' ', single space.

RegexProcessor(regex, replace_text=DEFAULT_REPLACE_TEXT)

  • construct a processor for regex; components that are removed are replaced with replace_text.
  • replace(self, new_replace_text): return a new processor that uses new_replace_text as its replace_text.
  • remove(self, text): remove all occurrences of regex from text.
  • keep(self, text): keep only the occurrences of regex and remove all unmatched components from text.
  • verify(self, text): return True if text matches regex, otherwise False.
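
The following sketch shows one way the four methods above could behave, built only on the standard re module. RegexProcessorSketch is a hypothetical stand-in, not the library's class. It assumes that keep replaces each unmatched gap with a single replace_text and that verify requires the whole text to match; the real RegexProcessor may behave differently on both points.

```python
import re


class RegexProcessorSketch(object):
    """Illustrative stand-in; not the library's RegexProcessor."""

    def __init__(self, regex, replace_text=' '):
        self._pattern = regex
        self._regex = re.compile(regex)
        # Anchored variant used by verify (assumption: whole-text match).
        self._full = re.compile(r'(?:%s)\Z' % regex)
        self._replace_text = replace_text

    def replace(self, new_replace_text):
        # Return a new object instead of mutating this one.
        return RegexProcessorSketch(self._pattern, new_replace_text)

    def remove(self, text):
        # Substitute every match with the replacement text.
        return self._regex.sub(self._replace_text, text)

    def keep(self, text):
        # Keep the matches; substitute each non-empty gap around them.
        pieces = []
        last_end = 0
        for match in self._regex.finditer(text):
            if match.start() > last_end:
                pieces.append(self._replace_text)
            pieces.append(match.group(0))
            last_end = match.end()
        if last_end < len(text):
            pieces.append(self._replace_text)
        return ''.join(pieces)

    def verify(self, text):
        return self._full.match(text) is not None


digits = RegexProcessorSketch(r'[0-9]+')
print(repr(digits.remove('a1b22c')))
print(repr(digits.keep('a1b22c')))
print(repr(digits.replace('#').remove('a1b22c')))
print(digits.verify('123'), digits.verify('12a'))
```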

UnicodeRange(begin, end):

  • begin: int, the first code point of the Unicode range.
  • end: int, the last code point of the Unicode range.

UnicodeRangeProcessor(ranges, replace_text=DEFAULT_REPLACE_TEXT)

  • subclass of RegexProcessor.
  • ranges: iterable of instances of UnicodeRange.
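
To show how a list of code point ranges can become a regex, here is a small sketch for Python 3. RangeSketch and ranges_to_pattern are hypothetical names, and the sketch assumes that end is inclusive. The library may build its pattern differently.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for UnicodeRange; end is assumed to be inclusive.
RangeSketch = namedtuple('RangeSketch', ['begin', 'end'])


def ranges_to_pattern(ranges):
    """Build a character-class pattern matching runs of code points in ranges."""
    # \UXXXXXXXX escapes are understood by the re module in Python 3.3+.
    parts = ['\\U%08x-\\U%08x' % (r.begin, r.end) for r in ranges]
    return '[%s]+' % ''.join(parts)


# The CJK Unified Ideographs block spans U+4E00 to U+9FFF.
cjk = re.compile(ranges_to_pattern([RangeSketch(0x4E00, 0x9FFF)]))
digits_and_cjk = re.compile(ranges_to_pattern([
    RangeSketch(0x30, 0x39),      # ASCII digits
    RangeSketch(0x4E00, 0x9FFF),  # CJK Unified Ideographs
]))

print(repr(cjk.sub(' ', '点击abc查看')))
print(repr(digits_and_cjk.sub(' ', '点击123abc')))
```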

Built-in Processors

The following processors are defined with UnicodeRange and regex. Read the source code if you are not sure about what's going on.

text_cleaner.processor.common, for common usage:

  • ALPHA
  • DIGIT
  • SYMBOLS_AND_PUNCTUATION
  • ASCII
  • ALPHA_EXTENSION
  • DIGIT_EXTENSION
  • SYMBOLS_AND_PUNCTUATION_EXTENSION
  • GENERAL_PUNCTUATION

text_cleaner.processor.misc, miscellaneous processors:

  • URL
  • RESTRICT_URL
  • ESCAPED_WHITESPACE
  • WECHAT_EMOJI_EN
  • WECHAT_EMOJI_ZHCN
  • WECHAT_EMOJI

text_cleaner.processor.chinese, Chinese processing:

  • CHINESE_CHARACTER: only common characters.
  • CHINESE: common characters + symbols and punctuation.
  • CHINESE_ALL: all CJK characters.
  • CHINESE_EXTENSION
  • CHINESE_COMPATIBILITY
  • CHINESE_SYMBOLS_AND_PUNCTUATION

URL vs. RESTRICT_URL

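Deciding where a URL ends is a hard problem, so two processors are provided.

  • URL: a URL extends until the next whitespace character.
  • RESTRICT_URL: a URL extends only over non-whitespace ASCII characters ([!-~] in the ASCII table) and ends at the first character outside that range.

For Chinese text, we recommend RESTRICT_URL, because a URL is often followed directly by Chinese characters with no whitespace in between.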

from text_cleaner.processor.misc import RESTRICT_URL, URL

URL.remove('点击http://t.cn/RtU0mZ1 查看')
# '点击  查看'

URL.remove('点击http://t.cn/RtU0mZ1查看')
# '点击 '

RESTRICT_URL.remove('点击http://t.cn/RtU0mZ1查看')
# '点击 查看'
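
The difference between the two rules can be reproduced with plain regular expressions. The patterns below are hypothetical approximations written for this illustration (Python 3), not the regexes the library actually uses.

```python
import re

# Loose rule: the URL runs until the next whitespace character.
# In Python 3, \S also matches Chinese characters.
LOOSE_URL = re.compile(r'https?://\S+')

# Strict rule: the URL may only contain non-whitespace ASCII, i.e. [!-~].
STRICT_URL = re.compile(r'https?://[!-~]+')

text = '点击http://t.cn/RtU0mZ1查看'

loose_result = LOOSE_URL.sub(' ', text)
strict_result = STRICT_URL.sub(' ', text)

print(repr(loose_result))   # the trailing Chinese text is swallowed
print(repr(strict_result))  # the trailing Chinese text survives
```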
