Python Language Detector

This version targets Python 3.5.

Trigram-based language detection, implemented in C++ and wrapped as a Python module.

Speed and reliability are the primary design concerns, at the cost of some small accuracy sacrifices compared with similar algorithms. For the purpose it was built for, accuracy remains very high, even for short sample texts.

Identification is performed through basic trigram lookups paired with Unicode character set recognition. Custom hash tables keyed with MurmurHash keep those lookups fast.
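
The following is a minimal pure-Python sketch of the general idea described above, not the module's actual implementation or API. The tiny trigram profiles, the script-to-language table, and the function names (dominant_script, trigrams, detect) are all hypothetical and chosen only for illustration. A plain Python set stands in for the custom MurmurHash-based tables.

```python
import unicodedata
from collections import Counter

# Hypothetical toy profiles: a handful of frequent trigrams per language.
# A real detector would use much larger tables built from training text.
PROFILES = {
    "en": {" th", "the", "he ", "ing", "and", " an", "nd "},
    "fr": {" le", "les", "es ", " de", "de ", "ent", "que"},
    "de": {"der", "die", "ich", "sch", "ein", "en ", "und"},
}

# Assumption for this sketch: these scripts identify a language on their own.
SCRIPT_LANGUAGES = {
    "HIRAGANA": "ja",
    "KATAKANA": "ja",
    "HANGUL": "ko",
    "GREEK": "el",
}


def dominant_script(text):
    """Return the most common script among the letters of text, or None."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode character names start with the script, e.g. "LATIN".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def trigrams(text):
    """Split text into overlapping three-character windows."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def detect(text):
    """Guess a language code, or return None when nothing matches."""
    # Step 1: character set recognition settles unambiguous scripts.
    script = dominant_script(text)
    if script in SCRIPT_LANGUAGES:
        return SCRIPT_LANGUAGES[script]
    # Step 2: count how many trigrams appear in each language profile.
    grams = trigrams(text)
    scores = {
        lang: sum(1 for g in grams if g in profile)
        for lang, profile in PROFILES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

In this sketch, detect("the cat and the dog") falls through to the trigram step, while a Hiragana string is resolved by the script check alone. Checking the script first is a design choice of the sketch: it avoids trigram work entirely for scripts that map to one language.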

It is currently used in production in a large commercial database application and has proven to be stable, reliable, and fast.

About

http://www.decultured.com

MIT License

Languages

Python 40.7%, C++ 30.2%, C 28.4%, Objective-C 0.7%