rhybroy / py-confilter

Confilter is a keyword matching service module based on gevent and ahocorasick

Author: TroyCheng
Email: frostmourn716@gmail.com

1. Introduction:

    Confilter is a keyword matching service module based on gevent and
    ahocorasick. It receives text content in an HTTP POST request, matches it
    against the keyword lists in its dictionaries, and returns a JSON result
    that contains the words that were hit.

    For example:

    HTTP POST request body:
        # g: dict group name, here 'FORBID', which contains two dictionaries:
        #    forbid & anti_forbid
        # t: content to be filtered (URL-encoded), here '**功(falungong) is forbidden in China'
        g=FORBID&t=%E6%B3%95%E8%BD%AE%E5%A4%A7%E6%B3%95falun%20is%20forbidden%20in%20China

    HTTP response body:
        # two words were hit in the forbid dict: ** & falun; no words were hit
        # in the anti_forbid dict.
        {"forbid": ["\u6cd5\u8f6e", "falun"], "anti_forbid": []}
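
    The exchange above can be reproduced from a client with a few lines of
    Python 3. This is only a sketch: the helper names (build_body, parse_hits,
    query) are hypothetical, and the URL passed to query() is an assumption,
    e.g. the host and port set in confilter.cfg.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def build_body(group, text):
    """Encode the g/t form fields the way the example request shows them."""
    # quote_via=quote encodes spaces as %20 (urlencode's default would use '+').
    return urlencode({"g": group, "t": text}, quote_via=quote)


def parse_hits(response_body):
    """Decode the JSON reply into {dictionary name: [hit words]}."""
    return json.loads(response_body)


def query(url, group, text):
    """POST one filtering request to a running service (url is an assumption)."""
    request = Request(url, data=build_body(group, text).encode("ascii"))
    with urlopen(request) as response:
        return parse_hits(response.read().decode("utf-8"))
```

    With the service running locally, query("http://127.0.0.1:9000/", "FORBID",
    text) would return a dict keyed by dictionary name, assuming the service
    answers on the root path.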

    The dictionaries are located in confilter/data, one keyword per line. In
    this example there are two dictionaries, forbid and anti_forbid, which
    belong to the FORBID group. Users can define different dictionaries and
    groups for different purposes.
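
    To illustrate the dictionary layout, the sketch below reads keywords one
    per line and reports the hits per dictionary. It is not Confilter's
    implementation: a naive substring scan stands in for the Aho-Corasick
    automaton, and the function names are hypothetical.

```python
def load_keywords(lines):
    """Return the non-empty keywords from an iterable of lines (one per line)."""
    return [line.strip() for line in lines if line.strip()]


def match_group(text, dictionaries):
    """Map each dictionary name to the keywords that occur in text.

    Naive stand-in: checks every keyword with `in` instead of building an
    Aho-Corasick automaton, so it is only suitable for small dictionaries.
    """
    return {
        name: [word for word in words if word in text]
        for name, words in dictionaries.items()
    }


# A group is a mapping of dictionary name -> keyword list. In a real setup
# each list would come from a file, e.g. load_keywords(open(path)).
forbid_group = {
    "forbid": load_keywords(["falun\n", "\n", "spam\n"]),
    "anti_forbid": load_keywords(["allowed\n"]),
}
hits = match_group("falun is forbidden in China", forbid_group)
```

    Here hits has the same shape as the response body shown earlier: one list
    of hit words per dictionary in the group.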

2. Installation:
    
    Confilter depends on gevent and ahocorasick, which must be installed
    first:
        gevent: http://www.gevent.org/
        ahocorasick: https://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
    Then use the commands below to run it:
        tar zxvf confilter.tar.gz
        cd confilter
        sudo python ./bin/confilterd.py start|stop|restart
    
3. Configuration:

    A typical configuration file looks like this:
        # confilter.cfg
        # define common information
        [info]
        host = 127.0.0.1
        port = 9000
        poolSize = 3000

        [dict_groups]
        keys = FORBID

        [dict_group_FORBID]
        forbid = "../data/forbid.dict"
        anti_forbid = "../data/anti_forbid.dict"

    In the 'info' section, you can define the bind address, the port, and the
    poolSize, which is used by gevent.WSGIServer.

    In the 'dict_groups' section, you can add your own dictionary group names,
    using ',' as the separator, e.g. keys=FORBID,YOUR_GROUP,THIRD_GROUP. Then
    provide each dictionary's name and path in a 'dict_group_GROUPNAME'
    section, where 'dict_group_' is a fixed prefix and GROUPNAME is a name you
    listed in the keys of the dict_groups section. For example:
        
        [dict_groups]
        keys = custom1, CUSTOM2

        [dict_group_custom1]
        dict1 = '../data/dict1.dict'

        [dict_group_CUSTOM2]
        dict1 = '../data/dict1.dict'
        dict2 = '../data/dict2.dict'
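
    The group lookup described above can be sketched with Python 3's standard
    configparser module. This is an illustration of the file layout, not
    Confilter's own loader; load_groups and SAMPLE_CFG are hypothetical names,
    and stripping the quotes around paths is an assumption about how the
    values are meant to be read.

```python
import configparser

# Same layout as the custom example above, written flush-left because
# configparser treats indented lines as continuations of the previous value.
SAMPLE_CFG = """\
[info]
host = 127.0.0.1
port = 9000
poolSize = 3000

[dict_groups]
keys = custom1, CUSTOM2

[dict_group_custom1]
dict1 = '../data/dict1.dict'

[dict_group_CUSTOM2]
dict1 = '../data/dict1.dict'
dict2 = '../data/dict2.dict'
"""


def load_groups(cfg_text):
    """Return {group name: {dictionary name: path}} from the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Group names are comma separated; surrounding spaces are ignored.
    raw_keys = parser.get("dict_groups", "keys")
    names = [name.strip() for name in raw_keys.split(",") if name.strip()]
    groups = {}
    for name in names:
        section = parser["dict_group_" + name]  # section names are case sensitive
        groups[name] = {key: path.strip("'\"") for key, path in section.items()}
    return groups


groups = load_groups(SAMPLE_CFG)
```

    Note that configparser lowercases option names by default, so dictionary
    names come back in lowercase, while group names keep their case.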

    logger.cfg is the configuration file for the logger; refer to the Python
    logging module documentation if you want to modify it.

4. Advanced:

    Confilter runs as a daemon process, using gevent.WSGIServer as the default
    server. You can also use gunicorn instead:
    
        cd confilter/bin
        gunicorn --workers=4 confilter:confilterApp

    See gunicorn (http://gunicorn.org/) for more information.
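
    Swapping servers works because both gevent and gunicorn serve a standard
    WSGI callable. The real confilterApp is not shown in this README; the
    sketch below is a hypothetical stand-in (sketch_app, with an in-memory
    DICT_GROUPS and naive substring matching) that only illustrates the
    request and response shape described in the introduction.

```python
import io
import json
from urllib.parse import parse_qs

# Hypothetical in-memory data; the real service loads dictionaries from files.
DICT_GROUPS = {"FORBID": {"forbid": ["falun"], "anti_forbid": []}}


def sketch_app(environ, start_response):
    """Minimal WSGI callable: read g/t from the POST body, reply with JSON."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    form = parse_qs(environ["wsgi.input"].read(length).decode("utf-8"))
    group = form.get("g", [""])[0]
    text = form.get("t", [""])[0]
    dictionaries = DICT_GROUPS.get(group, {})
    # Naive substring scan as a stand-in for Aho-Corasick matching.
    hits = {
        name: [word for word in words if word in text]
        for name, words in dictionaries.items()
    }
    body = json.dumps(hits).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, raw_body):
    """Invoke a WSGI app directly with a POST body (no server needed)."""
    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(raw_body)),
        "wsgi.input": io.BytesIO(raw_body),
    }
    status = []
    chunks = app(environ, lambda code, headers: status.append(code))
    return status[0], json.loads(b"".join(chunks).decode("utf-8"))
```

    Saved as a module, such a callable could be served with gunicorn's
    MODULE:VARIABLE syntax in the same way as confilter:confilterApp above.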
