convergence-lab / opus-fast-mosestokenizer

c++ mosestokenizer (OPUS fork)

opus-fast-mosestokenizer is a fork of fast-mosestokenizer created to ensure compatibility of the package with current Python environments. It also fixes a compilation problem with absl.

git clone https://github.com/convergence-lab/opus-fast-mosestokenizer
cd opus-fast-mosestokenizer
make build
make download-build-static-deps
pip install "pybind11[global]"
python setup.py build_ext install
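
After the build finishes, it can be worth confirming that the extension is importable from the active Python environment. The following is a minimal sketch, assuming the installed module is named `mosestokenizer` (the name used in the Python usage example in this README). `is_installed` is a hypothetical helper written for illustration, not part of the package.

```python
import importlib.util


def is_installed(module_name: str = "mosestokenizer") -> bool:
    """Return True if `module_name` can be found in the current environment."""
    # find_spec returns None for a missing top-level module instead of raising.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    status = "found" if is_installed() else "not found"
    print(f"mosestokenizer: {status}")
```

If the module is reported as not found, the install step most likely targeted a different interpreter or virtual environment than the one running the check.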

fast-mosestokenizer is a C++ implementation of the Moses tokenizer, which is a favourite among NLP researchers.

The main reason to use this package over the original Perl implementation is portability. Because the source is C++, the library can be used from practically any language.

The C++ code was adapted from contrib/c++tokenizer in the mosesdecoder repository.

Benchmark

fast-mosestokenizer is also fast. On English text, it is about 6x faster than tokenizer.perl and 15x faster than sacremoses.

See ./bench/README.md for more information.
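
To get a rough feel for tokenizer throughput on your own machine, a timing loop along the following lines may help. This is only a sketch and is not the script under ./bench; `lines_per_second` is a hypothetical helper, the sample sentence is made up, and `str.split` is used as a stand-in when the package is not installed.

```python
import time
from typing import Callable, Iterable, List


def lines_per_second(tokenize: Callable[[str], List[str]],
                     lines: Iterable[str],
                     repeats: int = 3) -> float:
    """Return the best observed throughput (lines/second) over `repeats` runs."""
    corpus = list(lines)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in corpus:
            tokenize(line)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a very coarse clock.
    return len(corpus) / max(best, 1e-9)


if __name__ == "__main__":
    sample = ["The quick brown fox jumps over the lazy dog."] * 10_000
    try:
        from mosestokenizer import MosesTokenizer
        tokenize = MosesTokenizer("en").tokenize
    except ImportError:
        tokenize = str.split  # stand-in so the sketch runs without the package
    print(f"{lines_per_second(tokenize, sample):,.0f} lines/s")
```

Taking the best of several runs reduces noise from other processes. Passing a different tokenizer callable (for example one from sacremoses) to the same function gives a like-for-like comparison on identical input.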

Installation

Python users on Linux or macOS (>= 10.15) can install directly from PyPI.

pip install fast-mosestokenizer

See ./INSTALL.md for more information.

Usage (Command-line tool)

# Piping is the standard way to set the input and output streams.
# mosestokenizer tokenizes each line of the input stream.
mosestokenizer en < infile > outfile

# For a full list of options, refer to the help message.
mosestokenizer -h

Usage (Python)

# Usage patterns are mostly the same as sacremoses.
>>> from mosestokenizer import MosesTokenizer

>>> tokenizer = MosesTokenizer('en')
>>> tokenizer.tokenize("""
The English name of Singapore is an anglicisation of the native Malay name for
the country, Singapura, which was in turn derived from the Sanskrit word for
lion city (romanised: Siṃhapura; Brahmi: 𑀲𑀺𑀁𑀳𑀧𑀼𑀭; literally "lion city"; siṃha
means "lion", pura means "city" or "fortress").[8]
""")
[
  'The', 'English', 'name', 'of', 'Singapore', 'is', 'an', 'anglicisation',
  'of', 'the', 'native', 'Malay', 'name', 'for', 'the', 'country', ',',
  'Singapura', ',', 'which', 'was', 'in', 'turn', 'derived', 'from', 'the',
  'Sanskrit', 'word', 'for', 'lion', 'city', '(', 'romanised', ':',
  'Siṃhapura', ';', 'Brahmi', ':', '𑀲𑀺𑀁𑀳𑀧𑀼𑀭', ';', 'literally', '"', 'lion',
  'city', '"', ';', 'siṃha', 'means', '"', 'lion', '"', ',', 'pura', 'means',
  '"', 'city', '"', 'or', '"', 'fortress', '"', ')', '.', '[', '8', ']'
]
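
The command-line tool works line by line, and the same pattern can be sketched in Python using only the `tokenize` method shown above. `tokenize_lines` is a hypothetical helper, and joining tokens with single spaces is an assumption about the desired output format rather than a documented guarantee of this package. `str.split` serves as a stand-in when the package is not installed.

```python
from typing import Callable, Iterable, Iterator, List


def tokenize_lines(lines: Iterable[str],
                   tokenize: Callable[[str], List[str]]) -> Iterator[str]:
    """Yield each input line as a single string of space-joined tokens."""
    for line in lines:
        # Drop the trailing newline so it is not passed to the tokenizer.
        yield " ".join(tokenize(line.rstrip("\n")))


if __name__ == "__main__":
    try:
        from mosestokenizer import MosesTokenizer
        tokenize = MosesTokenizer("en").tokenize
    except ImportError:
        tokenize = str.split  # stand-in so the sketch runs without the package
    sample = ["Hello, world!\n", "Tokenize me, please.\n"]
    for out in tokenize_lines(sample, tokenize):
        print(out)
```

Because the function is a generator, it can be fed an open file handle and will process large inputs without loading them into memory.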

License: GNU Lesser General Public License v2.1