AtmaHou / atma

Light NLP Tool: atma-0.4.1, commonly-used & tested NLP tools: sentence level bleu, tokenizer, proxy crawler included

Light NLP Tool: atma-0.4.2

Author: Atma (Yutai Hou) | Modified: 5/7/2017 8:29 PM

Introduction

Commonly used NLP tools that have been verified and run fast.
Functions included:
BLEU score, proxy crawler, tokenizer, massive keyword matcher, and more.

Install

pip install atma

Quick Start

  • Calculate BLEU
       Note: This is BLEU for a single sentence, not for a corpus. The result matches that of the most popular Perl script.
    e.g.:
    from atma.bleu import bleu
    # One weight per n-gram order (1-gram to 4-gram)
    weight = [0.25, 0.25, 0.25, 0.25]
    # The candidate and each reference are lowercased token lists
    can = 'It is a guide to action which ensures that the military always obeys the commands of the party'.lower().split()
    ref1 = 'It is a guide to action that ensures that the military will forever heed Party commands'.lower().split()
    ref2 = 'It is the guiding principle which guarantees the military forces always being under the command of the Party'.lower().split()
    ref = [ref1, ref2]
    # The parenthesized form works on both Python 2 and Python 3
    print(bleu(can, ref, weight))
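
For readers curious about what a sentence-level BLEU score involves, the following is a minimal sketch in Python 3: clipped n-gram precisions are combined by a weighted geometric mean and scaled by a brevity penalty. It is not atma's implementation. The names `ngram_counts` and `sentence_bleu_sketch` are hypothetical, and the sketch applies no smoothing, so its numbers are not guaranteed to match `atma.bleu` or the Perl script.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu_sketch(candidate, references, weights):
    """Unsmoothed sentence-level BLEU (illustrative only, not atma's code)."""
    log_precision_sum = 0.0
    for n, weight in enumerate(weights, start=1):
        cand_counts = ngram_counts(candidate, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precision_sum += weight * math.log(clipped / total)

    # Brevity penalty uses the reference length closest to the candidate length
    # (ties are broken in favor of the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_precision_sum)
```

Clipping is what stops a degenerate candidate such as `the the the the the the the` from scoring well: each n-gram is credited at most as many times as it appears in a single reference.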

Content & Description

  • ./bleu.py
    Sentence-level BLEU score tool, used as a labeling tool.
    NLTK's BLEU tool did not give correct results, so I wrote this.
    The code is verified by comparing its results with the commonly used Perl BLEU tool.

  • ./tool.py
    Contains many frequently used, verified small utility functions,
    such as converting a sentence to a word list, checking whether a token is a number, removing punctuation...

  • ./crawling/*
    A proxy class and proxy check tool written by me and Jinpeng.
    I rewrote Jinpeng's code to enable the proxy to crawl SSL websites.
    The proxy has proved to be stable.

  • ./AcoraMatcher.py
    A multi-keyword matching tool based on the acora package.
    It uses an index to speed up matching.
    When you need to match many predefined keywords in a long text, it
    will be a great help.

  • ./sampler.py
    Quick-and-dirty code for splitting and sampling data.

  • ./Metrics.py
    A neat framework for evaluating multiple candidates against multiple candidates. It cannot be executed directly for now.
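
To give a feel for the kind of small helpers described under ./tool.py, here is a minimal sketch in Python 3. The function names (`remove_punctuation`, `sentence_to_words`, `is_number`) are hypothetical and are not taken from the package; the actual functions in tool.py may be named and behave differently.

```python
import string


def remove_punctuation(text):
    """Delete ASCII punctuation characters from a string."""
    return text.translate(str.maketrans("", "", string.punctuation))


def sentence_to_words(sentence):
    """Lowercase a sentence, drop punctuation, and split on whitespace."""
    return remove_punctuation(sentence.lower()).split()


def is_number(token):
    """Return True if the token parses as a float, e.g. '3', '-2.5', '1e3'.

    Note that float() also accepts strings such as 'nan' and 'inf'.
    """
    try:
        float(token)
        return True
    except ValueError:
        return False
```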
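
The speed-up mentioned under ./AcoraMatcher.py comes from building a search structure over all keywords once and then scanning the text in a single pass. The sketch below illustrates that idea with a tiny pure-Python Aho-Corasick automaton, the algorithm that the acora package is built around. It is a stand-in for illustration only: `KeywordMatcherSketch` is a hypothetical name, it does not use acora, and it is not the AcoraMatcher API.

```python
from collections import deque


class KeywordMatcherSketch:
    """Tiny Aho-Corasick automaton (illustrative; not the AcoraMatcher API)."""

    def __init__(self, keywords):
        # Per node: outgoing edges, failure link, keywords that end here.
        self.goto = [{}]
        self.fail = [0]
        self.out = [[]]
        for word in dict.fromkeys(keywords):  # drop duplicate keywords
            if word:                          # ignore empty keywords
                self._add(word)
        self._build_failure_links()

    def _add(self, word):
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(word)

    def _build_failure_links(self):
        # Breadth-first, so shallower nodes are complete before deeper ones.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                fallback = self.fail[node]
                while fallback and ch not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[child] = self.goto[fallback].get(ch, 0)
                # Inherit keywords that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def findall(self, text):
        """Return (keyword, start_index) pairs for every match, in one pass."""
        matches = []
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word in self.out[node]:
                matches.append((word, i - len(word) + 1))
        return matches
```

A possible usage, assuming the class above: `KeywordMatcherSketch(["he", "she", "his", "hers"]).findall("ushers")` reports overlapping matches too, because the failure links carry over keywords that end inside a longer match.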
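
Splitting and sampling data, as ./sampler.py is described as doing, can be sketched in a few lines of Python 3. The names `split_data` and `sample_data`, the default ratio, and the seed parameter are assumptions made for this example and do not come from sampler.py.

```python
import random


def split_data(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def sample_data(items, k, seed=0):
    """Draw k distinct items without replacement (k is capped at len(items))."""
    pool = list(items)
    return random.Random(seed).sample(pool, min(k, len(pool)))
```

Using a private `random.Random(seed)` instance keeps the split reproducible without touching the global random state.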

About

License:MIT License


Languages

Language:Python 100.0%