robert-d-schultz / encoding-recurrent

LSTM neural networks that classify byte sequences by their encoding.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

PyTorch LSTM neural networks that classify byte sequences as either UTF-8 or Windows-1252.

A training and an evaluation set was created from the News Crawl 2009 corpus (available http://www.statmt.org/wmt11/translation-task.html). This corpus is created from English-language newswire data.

Chardet (version 4.0.0), an existing Python character encoding detector, was used for comparison.

Select results:

32 hidden unit model accuracy: 0.9770

Chardet accuracy: 0.7340

About

LSTM neural networks that classify byte sequences by their encoding.


Languages

Language:Python 100.0%