language-detection n-grams ruby web-scraping wikipedia

Language Detector

This is a simple implementation of a language detector in Ruby. It uses n-grams to build language models, then estimates how compatible the input is with each model to predict the language. You can read about n-grams here.
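To make the idea concrete, here is a minimal sketch of counting character n-grams in Ruby. It assumes the models are built over characters rather than words, and the helper name `ngram_counts` and the lowercasing and whitespace normalization are illustrative choices, not taken from buildModel.rb.

```ruby
# Hypothetical helper: count overlapping character n-grams in a string.
# The normalization (lowercase, collapsed whitespace) is an assumption.
def ngram_counts(text, n = 3)
  counts = Hash.new(0)
  normalized = text.downcase.gsub(/\s+/, " ").strip
  # each_cons slides a window of n characters over the text.
  normalized.chars.each_cons(n) { |gram| counts[gram.join] += 1 }
  counts
end

counts = ngram_counts("banana", 3)
# "banana" yields the 3-grams ban, ana, nan, ana, so "ana" is counted twice.
puts counts["ana"]
```

A model in this sense is just a table of n-gram counts for one language; texts shorter than n characters produce an empty table.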

Project

models - contains models of languages
trainData - contains corpora of training data
testData - folder with available input files
buildModel.rb - program that builds the model
detectLanguage.rb - program that detects language of input
demo.sh - program that runs a short demo

Usage

First, you have to build language models.

ruby buildModel.rb

The program automatically builds 3-gram models for every language in the trainData folder (each file should be named after its language). You can also pass an argument to build models for any n. For example, to build 4-grams:

ruby buildModel.rb 4
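The build step described above could be sketched roughly as follows. This is not the code in buildModel.rb: the helper names (`parse_n`, `language_name`, `build_models`) are hypothetical, and the sketch assumes character n-grams, UTF-8 input, and that the language name is the file name without its extension.

```ruby
# Hypothetical helper: read n from the command-line arguments, defaulting to 3.
def parse_n(argv, default = 3)
  n = Integer(argv.first || default)
  raise ArgumentError, "n must be positive" unless n.positive?
  n
end

# Assumption: the language name is the file name without any extension.
def language_name(path)
  File.basename(path, ".*")
end

# Build one n-gram count table per file found in train_dir.
def build_models(train_dir, n)
  files = Dir.glob(File.join(train_dir, "*")).select { |path| File.file?(path) }
  files.to_h do |path|
    # scrub replaces invalid bytes so downcase does not fail on bad input.
    text = File.read(path, encoding: "UTF-8").scrub
    counts = Hash.new(0)
    text.downcase.chars.each_cons(n) { |gram| counts[gram.join] += 1 }
    [language_name(path), counts]
  end
end
```

With these helpers, `build_models("trainData", parse_n(ARGV))` would return a hash that maps each language name to its n-gram counts. How the real program stores the result in the models folder is not shown here.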

To detect the language of an input text (e.g. testData/english.txt):

ruby detectLanguage.rb english.txt

Your input text should be located in the testData folder. Without an extra parameter, the program uses 3-grams. To run it with a custom n, use:

ruby detectLanguage.rb english.txt 4
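One way to compare an input against each model is cosine similarity between n-gram count tables. The sketch below uses that measure as an assumption; the README does not say which compatibility measure detectLanguage.rb actually uses, and all function names here are hypothetical.

```ruby
# Hypothetical helper: character n-gram counts for a text.
def ngram_counts(text, n)
  counts = Hash.new(0)
  text.downcase.chars.each_cons(n) { |gram| counts[gram.join] += 1 }
  counts
end

# Cosine similarity between two count tables (0.0 when either is empty).
def cosine_similarity(a, b)
  dot = a.sum { |gram, count| count * b.fetch(gram, 0) }
  norm_a = Math.sqrt(a.values.sum { |c| c * c })
  norm_b = Math.sqrt(b.values.sum { |c| c * c })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

# Return the name of the model most similar to the input, or nil if
# there are no models. models maps language names to count tables.
def detect_language(text, models, n = 3)
  input = ngram_counts(text, n)
  best = models.max_by { |_language, model| cosine_similarity(input, model) }
  best&.first
end
```

The value of n used for detection has to match the n the models were built with, otherwise the input and the models share no n-grams and every score is zero.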

WikiScraper

WikiScraper is a program that collects text corpora scraped from a Wikipedia article, using Nokogiri and httparty. By default it is set to https://en.wikipedia.org/wiki/Earth, but you can change it to any Wikipedia page (it should be an English one). To start the scraper:

ruby wikiScraper.rb

The program runs in an infinite loop and accepts the following commands:

exit - exits the program
languages - lists the languages available for the Wikipedia page
language name - type the language whose corpus you would like to add, e.g. 'polish'. The language name should be typed in English. The program will scrape the text for you and save it to a file.
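The command loop described above might look roughly like this. It is a sketch with the scraping step stubbed out: `run_commands` and the `available` list are hypothetical, and the real wikiScraper.rb fetches pages with httparty and parses them with Nokogiri where this version only records the request.

```ruby
# Hypothetical command loop. input and output are IO-like objects
# (e.g. $stdin and $stdout); available is the list of language names
# the page offers. Returns the languages that were requested.
def run_commands(input, output, available)
  saved = []
  loop do
    output.print "> "
    line = input.gets
    break if line.nil? # end of input

    command = line.strip.downcase
    case command
    when "exit"
      break
    when "languages"
      output.puts available.sort.join(", ")
    when ""
      next
    else
      if available.include?(command)
        # Stub: the real program would scrape and save the corpus here.
        saved << command
        output.puts "Saved corpus for #{command}"
      else
        output.puts "Unknown language: #{command}"
      end
    end
  end
  saved
end
```

Called as `run_commands($stdin, $stdout, %w[polish russian spanish])`, it reads commands until `exit` or the end of input. Passing the streams as arguments keeps the loop easy to exercise with `StringIO` in tests.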

Note: Scraper supports ISO-8859-1 encoding.

Demo

The file "demo.sh" contains a script that runs a demonstration. If you want to see how all the programs work without running each one manually, run the script and follow the instructions. The demo goes through these steps:

Build models from the pre-defined corpora
Detect the language of the English and Spanish example input files
Scrape additional language corpora from Wikipedia
Build models from the scraped corpora
Detect the language of the English, Spanish, Russian and Polish example input files
