
Language Classifier

Tries to predict which language a word belongs to, using LSTMs.

The dataset contains 100K words, 50K from each language: Turkish and English. I haven't included Turkish-specific letters like ü, ö, ı, ç, and ş, because most Turkish words contain at least one of them, which would make the prediction a lot easier. Instead, I've used the closest standard Latin letter, like o for ö.

The English letters x, q, and w (which the Turkish alphabet doesn't have) were left untouched, since their frequency among English words is low.
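
For reference, here's a minimal sketch of that normalization step, assuming a simple character-translation table (the mapping below is my own illustration; the actual preprocessing script isn't included in the repo):

# Hypothetical preprocessing: map Turkish-specific letters to the
# closest standard Latin letters before building the dataset.
TR_TO_LATIN = str.maketrans({
    'ç': 'c', 'ğ': 'g', 'ı': 'i', 'ö': 'o', 'ş': 's', 'ü': 'u',
})

def normalize(word):
    # Assumes the input is already lower-case.
    return word.translate(TR_TO_LATIN)

print(normalize('imkansız'))  # -> imkansiz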

Installing

  • Clone the repository.
  • cd language-classifier
  • pipenv install -r requirements.txt && pipenv shell
  • Then run python classifier.py

Requires Python 3.6+

Explanation

Imports

  • Keras: Deep learning framework
  • Numpy: Data storing and manipulation
  • Random: Just to shuffle the dataset.

import numpy as np
import random
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, LSTM
from keras.optimizers import RMSprop

Exploring and Preparing the Data

Importing the dataset.

data = []

# Collect (word, label) pairs: 0 = Turkish, 1 = English.
with open('turkish.txt') as textfile:
    for word in textfile:
        data.append((word.replace('\n', ''), 0))

with open('english.txt') as textfile:
    for word in textfile:
        data.append((word.replace('\n', ''), 1))

# Shuffle so the two languages are mixed before training.
random.shuffle(data)

Exploring

words = [record[0] for record in data]
labels = [record[1] for record in data]

char_pool = sorted(set(''.join(words)))
longest = sorted(words, key=len)[-1]
maxlen = len(longest)
word_count = len(data)

n_classes = 2

print('Character pool: {}'.format(", ".join(char_pool)))
print('Longest word: {}'.format(longest))
print('Length of the longest word: {}'.format(maxlen))
print('Data size: {} words.'.format(word_count))

So the output is:

Character pool: a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z
Longest word: trinitrophenylmethylnitramine
Length of the longest word: 29
Data size: 99957 words.

Tokenizing char-wise.

char_indices = dict((c, i) for i, c in enumerate(char_pool))
indices_char = dict((i, c) for i, c in enumerate(char_pool))

Preparing the training data. Basically, I create a zero-filled tensor of the full size, then fill it in, so each word becomes a sequence of one-hot character vectors. Makes it easier for me.

# One row per word; each timestep is a one-hot vector over the character pool.
x_data = np.zeros((word_count, maxlen, len(char_pool)), dtype=bool)
y_data = np.zeros((word_count, n_classes))

for i_word, word in enumerate(words):
    for i_char, char in enumerate(word):
        x_data[i_word, i_char, char_indices[char]] = 1

# One-hot labels: [1, 0] = Turkish, [0, 1] = English.
for i_label, label in enumerate(labels):
    y_data[i_label, label] = 1

The Predictive Model

model = Sequential()
# A single small LSTM reads the word character by character,
# then a softmax layer picks between the two languages.
model.add(LSTM(16, input_shape=(maxlen, len(char_pool))))
model.add(Dense(n_classes))
model.add(Activation('softmax'))

optimizer = RMSprop(learning_rate=0.01)

model.compile(loss='categorical_crossentropy', optimizer=optimizer)
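
As a quick sanity check, model.summary() should report on the order of 2.8K trainable parameters, assuming the 26-letter character pool above and the standard LSTM parameter count 4 × (units × (units + input_dim) + units):

model.summary()
# Expected, with len(char_pool) == 26:
#   LSTM:  4 * (16 * (16 + 26) + 16) = 2752 parameters
#   Dense: 16 * 2 + 2                =   34 parameters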

Training

for iteration in range(3):
    model.fit(x_data, y_data, batch_size=128, epochs=1)
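
The loop above is just three one-epoch fits back to back. If you'd rather watch accuracy on held-out data while training, something like this works too (the metrics and validation_split arguments are my additions, not part of the original script):

model.compile(loss='categorical_crossentropy', optimizer=optimizer,
              metrics=['accuracy'])
model.fit(x_data, y_data, batch_size=128, epochs=3, validation_split=0.1)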

Launch!

def predict(word):
    # Encode the word exactly like the training data.
    processed_word = np.zeros((1, maxlen, len(char_pool)))
    for i_char, char in enumerate(word):
        processed_word[0, i_char, char_indices[char]] = 1
    prediction = model.predict(processed_word, verbose=0)[0]

    # Softmax output: index 0 = Turkish, index 1 = English.
    result = {'Turkish': prediction[0], 'English': prediction[1]}

    return result

Throw any word you want into this list. It'll be our playground dataset.

# [!] be sure they are all lower-case.
word_list = [
    # supposed to be Turkish
    'altinvarak',
    'bulutsuzluk',
    'farmakoloji',
    'toprak',
    'hanimeli',
    'imkansiz',

    # supposed to be English
    'tensorflow',
    'jabba',
    'magsafe',
    'pharmacology',
    'parallax',
    'wabby',
    'querein',

    # curiosity
    'terminal', # an actual word in both languages
    'ahahahah',
    'ahahahahahahahah',
    'rawr',
]

for word in word_list:
    prediction = predict(word)
    print('{}: {}'.format(word, prediction))

Results

altinvarak:        TUR: 0.98    ENG: 0.02
bulutsuzluk:       TUR: 0.99    ENG: 0.01
farmakoloji:       TUR: 0.97    ENG: 0.03
toprak:            TUR: 0.90    ENG: 0.10
hanimeli:          TUR: 0.97    ENG: 0.03
imkansiz:          TUR: 0.99    ENG: 0.01
tensorflow:        TUR: 0.00    ENG: 1.00
jabba:             TUR: 0.75    ENG: 0.25
magsafe:           TUR: 0.59    ENG: 0.41
pharmacology:      TUR: 0.00    ENG: 1.00
parallax:          TUR: 0.00    ENG: 1.00
wabby:             TUR: 0.00    ENG: 1.00
querein:           TUR: 0.00    ENG: 1.00
terminal:          TUR: 0.20    ENG: 0.80
ahahahah:          TUR: 0.83    ENG: 0.17
ahahahahahahahah:  TUR: 0.80    ENG: 0.20
rawr:              TUR: 0.00    ENG: 1.00

Overall Accuracy: 457/500 (91.4%)
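
The 457/500 figure implies an evaluation on 500 held-out words. The evaluation code isn't shown here, but a minimal sketch, assuming hypothetical x_test and y_test arrays encoded the same way as the training data, would be:

# Hypothetical evaluation on a held-out set of 500 labeled words.
pred = model.predict(x_test, verbose=0)
correct = int((pred.argmax(axis=1) == y_test.argmax(axis=1)).sum())
print('Overall Accuracy: {}/{} ({:.1f}%)'.format(
    correct, len(y_test), 100 * correct / len(y_test)))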
