goncharovi / langdetectpp

C++ library for (human) language detection based on n-grams

langdetectpp

C++ port of the Java language-detection library.

It analyzes UTF-8-encoded text and returns the most likely human language of its contents.

It uses the same language profiles as the original library, which are based on character n-grams of length 1 to 3. These profiles cover 55 languages.
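To make "character n-grams of length 1 to 3" concrete, here is a minimal sketch of extracting them from a UTF-8 string. The function names `splitCodePoints` and `characterNgrams` are hypothetical and are not part of langdetectpp; the library's own extraction may normalize characters or treat word boundaries differently. The sketch assumes valid UTF-8 input.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split a UTF-8 string into characters, each kept as its own byte sequence.
// Assumption: the input is valid UTF-8; malformed input is not handled.
std::vector<std::string> splitCodePoints(const std::string& text) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < text.size();) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 1;                       // ASCII byte
        if ((lead & 0xE0) == 0xC0) len = 2;        // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) len = 3;   // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) len = 4;   // 11110xxx
        chars.push_back(text.substr(i, len));
        i += len;
    }
    return chars;
}

// Collect every n-gram of length 1..maxN, counted in characters, not bytes.
// Hypothetical helper for illustration only.
std::vector<std::string> characterNgrams(const std::string& text,
                                         std::size_t maxN = 3) {
    const std::vector<std::string> chars = splitCodePoints(text);
    std::vector<std::string> ngrams;
    for (std::size_t start = 0; start < chars.size(); ++start) {
        std::string gram;
        for (std::size_t n = 1; n <= maxN && start + n <= chars.size(); ++n) {
            gram += chars[start + n - 1];  // grow the n-gram by one character
            ngrams.push_back(gram);
        }
    }
    return ngrams;
}
```

For the input "abc" this yields a, ab, abc, b, bc, c. A detector of this kind then compares how often such n-grams occur in the text against the frequencies stored in each language profile.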

#include <string>
#include <iostream>
#include <langdetectpp/langdetectpp.h>

using langdetectpp::Detector;
using langdetectpp::Language;
using std::string;
using std::cout;

int main() {
    auto detector = Detector::create();
    string someEnglishText = "Some english text to analyze.";
    Language lang = detector->detect(someEnglishText);
    cout << langdetectpp::stringOfLanguage(lang) << std::endl;

    string someGermanText = "Im Rahmen der Trainingskontrollen können etwa 8.650 Kaderathleten geprüft werden, die in drei Testpools aufgeteilt sind und an nationalen und internationalen Wettkämpfen teilnehmen.";
    lang = detector->detect(someGermanText);
    cout << langdetectpp::stringOfLanguage(lang) << std::endl;
}

Output:

EN
DE

building

mkdir build
cd build
cmake ../
make

usage

The main public-facing part of this library is the Detector class. It is instantiated through the static method Detector::create(), which returns a shared_ptr. Initializing a Detector is relatively expensive because it has to build the initial n-gram vs. language score matrix, so a Detector instance should be kept around and reused.

Detector is thread-safe and has no mutable state, so a single instance can be shared across the whole program.
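One way to follow this advice is to create the instance lazily and hand out the same shared_ptr everywhere. The sketch below is an assumption about how you might structure your own code, not something the library provides: `sharedInstance` is a hypothetical helper, and `StubDetector` is a stand-in type that only exists so the sketch compiles without langdetectpp installed. It assumes the real Detector::create() returns a std::shared_ptr to the detector type.

```cpp
#include <memory>

// Return one lazily created instance of any type that offers a static
// create() returning std::shared_ptr<T>. Function-local statics are
// initialized exactly once, even if several threads call this at the
// same time (guaranteed since C++11).
template <class DetectorT>
std::shared_ptr<DetectorT> sharedInstance() {
    static const std::shared_ptr<DetectorT> instance = DetectorT::create();
    return instance;
}

// Hypothetical stand-in for langdetectpp::Detector, used only to show
// that the expensive construction runs a single time.
struct StubDetector {
    static int created;  // how many times create() has run
    static std::shared_ptr<StubDetector> create() {
        ++created;
        return std::make_shared<StubDetector>();
    }
};
int StubDetector::created = 0;
```

With the real library, the call would presumably look like `sharedInstance<langdetectpp::Detector>()->detect(text)`, and every caller, on any thread, would share the same detector.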

The most likely language for a given string is returned as a langdetectpp::Language, which is an enum class. There is a utility function for getting the language code as a string:

std::string stringOfLanguage(langdetectpp::Language);
    auto lang = langdetectpp::Language::EN;
    string langName = langdetectpp::stringOfLanguage(lang);

There is also a utility function for getting the English name of a language:

std::string englishNameOfLanguage(langdetectpp::Language);
    auto lang = langdetectpp::Language::AR;
    string langName = langdetectpp::englishNameOfLanguage(lang);
    cout << langName << std::endl;
    // "Arabic"

license

Apache License version 2.0 (commercial-friendly) -- see the LICENSE file for the formal version.

Language profiles are taken from the original Java language-detection project. These profiles are (c) 2010-2014 Cybozu Labs, Inc., and are likewise licensed under Apache 2.0 (and are also commercial-friendly). The LICENSE file contains the text of the original license for the profiles.
