Heavy vs normal and spelling mistakes
thewilkybarkid opened this issue
We're using tinyld to detect the language of some external content, which generally works great. I'm looking to move to the heavy version to fix a detection error, but I noticed that one of our test cases started to fail when I did. It's based on the title of https://www.scienceopen.com/hosted-document?doi=10.14293/S2199-1006.1.SOR-.PPL3VEC.v2, which contains a couple of spelling mistakes.
The normal version correctly identifies it as Spanish:
tinyld --verbose "Notas para una aproximacin al punk espaol y sus contextos"
Analize chunks [ 'notas para una aproximacin al punk espaol y sus contextos' ]
[Pass 2] DetectPotentialGrams notas [
'n', 'o', 't', 'a',
's', ' n', 'no', 'ot',
'ta', 'as', 's ', ' no',
'not', 'ota', 'tas', 'as ',
' not', 'nota', 'otas', 'tas ',
' nota', 'notas', 'otas '
]
Result notas [
{ lang: 'af', accuracy: 0, score: 0 },
{ lang: 'am', accuracy: 0, score: 0 },
{ lang: 'ber', accuracy: 0, score: 0 },
{ lang: 'rn', accuracy: 0, score: 0 },
{ lang: 'ja', accuracy: 0, score: 0 },
{ lang: 'zh', accuracy: 0, score: 0 },
{ lang: 'ko', accuracy: 0, score: 0 },
{ lang: 'my', accuracy: 0, score: 0 }
]
[Pass 2] DetectPotentialGrams para [
'p', 'a', 'r',
'a', ' p', 'pa',
'ar', 'ra', 'a ',
' pa', 'par', 'ara',
'ra ', ' par', 'para',
'ara ', ' para', 'para '
]
Gram 'para ' [
'ind = 4.78515625%',
'tgl = 13.4765625%',
'spa = 38.37890625%',
'por = 54.78515625%',
'ita = 0.390625%',
'lat = 0.29296875%',
'srp = 1.7578125%',
'swe = 0.29296875%',
'pol = 0.1953125%',
'tur = 4.98046875%'
]
Result para [
{ lang: 'pt', accuracy: 1, score: 701.25 },
{ lang: 'es', accuracy: 0.7005, score: 491.25 },
{ lang: 'tl', accuracy: 0.246, score: 172.5 },
{ lang: 'tr', accuracy: 0.09089999999999998, score: 63.75 },
{ lang: 'id', accuracy: 0.08730000000000004, score: 61.25 },
{ lang: 'sr', accuracy: 0.03210000000000002, score: 22.5 },
{ lang: 'it', accuracy: 0.007099999999999995, score: 5 },
{ lang: 'la', accuracy: 0.005299999999999971, score: 3.75 }
]
[Pass 2] DetectPotentialGrams una [
'u', 'n', 'a',
' u', 'un', 'na',
'a ', ' un', 'una',
'na ', ' una', 'una ',
' una '
]
Gram ' una ' [
'tgl = 0.390625%',
'spa = 57.32421875%',
'ita = 39.35546875%',
'lat = 2.24609375%',
'ron = 3.41796875%'
]
Result una [
{ lang: 'es', accuracy: 1, score: 733.75 },
{ lang: 'it', accuracy: 0.6865, score: 503.75 },
{ lang: 'ro', accuracy: 0.059599999999999986, score: 43.75 },
{ lang: 'la', accuracy: 0.03920000000000001, score: 28.75 },
{ lang: 'tl', accuracy: 0.006800000000000028, score: 5 },
{ lang: 'af', accuracy: 0, score: 0 },
{ lang: 'am', accuracy: 0, score: 0 },
{ lang: 'ber', accuracy: 0, score: 0 }
]
[Pass 2] DetectPotentialGrams aproximacin [
'a', 'p', 'r', 'o', 'x',
'i', 'm', 'a', 'c', 'i',
'n', ' a', 'ap', 'pr', 'ro',
'ox', 'xi', 'im', 'ma', 'ac',
'ci', 'in', 'n ', ' ap', 'apr',
'pro', 'rox', 'oxi', 'xim', 'ima',
'mac', 'aci', 'cin', 'in ', ' apr',
'apro', 'prox', 'roxi', 'oxim', 'xima',
'imac', 'maci', 'acin', 'cin ', ' apro',
'aprox', 'proxi', 'roxim', 'oxima', 'ximac',
'imaci', 'macin', 'acin '
]
Gram 'oxi' [
'fra = 0.1953125%',
'eng = 0.1953125%',
'spa = 0.390625%',
'por = 0.68359375%',
'lat = 1.66015625%',
'ron = 0.48828125%'
]
Gram 'xim' [
'fra = 0.1953125%',
'spa = 1.26953125%',
'por = 2.63671875%',
'lat = 3.3203125%',
'ron = 0.5859375%'
]
Result aproximacin [
{ lang: 'la', accuracy: 1, score: 38.25 },
{ lang: 'pt', accuracy: 0.6667000000000001, score: 25.5 },
{ lang: 'es', accuracy: 0.33330000000000004, score: 12.75 },
{ lang: 'ro', accuracy: 0.2157, score: 8.25 },
{ lang: 'fr', accuracy: 0.07840000000000003, score: 3 },
{ lang: 'en', accuracy: 0.03920000000000001, score: 1.5 },
{ lang: 'af', accuracy: 0, score: 0 },
{ lang: 'am', accuracy: 0, score: 0 }
]
[Pass 2] DetectPotentialGrams al [
'a', 'l',
' a', 'al',
'l ', ' al',
'al ', ' al '
]
Result al [
{ lang: 'af', accuracy: 0, score: 0 },
{ lang: 'am', accuracy: 0, score: 0 },
{ lang: 'ber', accuracy: 0, score: 0 },
{ lang: 'rn', accuracy: 0, score: 0 },
{ lang: 'ja', accuracy: 0, score: 0 },
{ lang: 'zh', accuracy: 0, score: 0 },
{ lang: 'ko', accuracy: 0, score: 0 },
{ lang: 'my', accuracy: 0, score: 0 }
]
[Pass 2] DetectPotentialGrams punk [
'p', 'u', 'n',
'k', ' p', 'pu',
'un', 'nk', 'k ',
' pu', 'pun', 'unk',
'nk ', ' pun', 'punk',
'unk ', ' punk', 'punk '
]
Gram 'unk ' [ 'eng = 0.78125%', 'hun = 33.10546875%', 'lit = 0.29296875%' ]
Result punk [
{ lang: 'hu', accuracy: 1, score: 339 },
{ lang: 'en', accuracy: 0.023599999999999954, score: 8 },
{ lang: 'lt', accuracy: 0.00880000000000003, score: 3 },
{ lang: 'af', accuracy: 0, score: 0 },
{ lang: 'am', accuracy: 0, score: 0 },
{ lang: 'ber', accuracy: 0, score: 0 },
{ lang: 'rn', accuracy: 0, score: 0 },
{ lang: 'ja', accuracy: 0, score: 0 }
]
[Pass 2] DetectPotentialGrams espaol [
'e', 's', 'p', 'a',
'o', 'l', ' e', 'es',
'sp', 'pa', 'ao', 'ol',
'l ', ' es', 'esp', 'spa',
'pao', 'aol', 'ol ', ' esp',
'espa', 'spao', 'paol', 'aol ',
' espa', 'espao', 'spaol', 'paol '
]
Result espaol [
{ lang: 'af', accuracy: 0, score: 0 },
{ lang: 'am', accuracy: 0, score: 0 },
{ lang: 'ber', accuracy: 0, score: 0 },
{ lang: 'rn', accuracy: 0, score: 0 },
{ lang: 'ja', accuracy: 0, score: 0 },
{ lang: 'zh', accuracy: 0, score: 0 },
{ lang: 'ko', accuracy: 0, score: 0 },
{ lang: 'my', accuracy: 0, score: 0 }
]
[Pass 2] DetectPotentialGrams y [ 'y', ' y', 'y ', ' y ' ]
Gram ' y ' [ 'fra = 8.88671875%', 'spa = 30.2734375%', 'lat = 0.1953125%' ]
Result y [
{ lang: 'es', accuracy: 1, score: 232.5 },
{ lang: 'fr', accuracy: 0.2935, score: 68.25 },
{ lang: 'la', accuracy: 0.00649999999999995, score: 1.5 },
{ lang: 'af', accuracy: 0, score: 0 },
{ lang: 'am', accuracy: 0, score: 0 },
{ lang: 'ber', accuracy: 0, score: 0 },
{ lang: 'rn', accuracy: 0, score: 0 },
{ lang: 'ja', accuracy: 0, score: 0 }
]
[Pass 2] DetectPotentialGrams sus [
's', 'u', 's',
' s', 'su', 'us',
's ', ' su', 'sus',
'us ', ' sus', 'sus ',
' sus '
]
Gram ' sus ' [ 'spa = 12.59765625%', 'ron = 1.26953125%', 'tlh = 1.953125%' ]
Result sus [
{ lang: 'es', accuracy: 1, score: 161.25 },
{ lang: 'tlh', accuracy: 0.15500000000000003, score: 25 },
{ lang: 'ro', accuracy: 0.1008, score: 16.25 },
{ lang: 'af', accuracy: 0, score: 0 },
{ lang: 'am', accuracy: 0, score: 0 },
{ lang: 'ber', accuracy: 0, score: 0 },
{ lang: 'rn', accuracy: 0, score: 0 },
{ lang: 'ja', accuracy: 0, score: 0 }
]
[Pass 2] DetectPotentialGrams contextos [
'c', 'o', 'n', 't', 'e',
'x', 't', 'o', 's', ' c',
'co', 'on', 'nt', 'te', 'ex',
'xt', 'to', 'os', 's ', ' co',
'con', 'ont', 'nte', 'tex', 'ext',
'xto', 'tos', 'os ', ' con', 'cont',
'onte', 'ntex', 'text', 'exto', 'xtos',
'tos ', ' cont', 'conte', 'ontex', 'ntext',
'texto', 'extos', 'xtos '
]
Gram 'cont' [
'fra = 14.6484375%',
'eng = 2.83203125%',
'spa = 18.359375%',
'por = 25.48828125%',
'ita = 14.94140625%',
'nld = 1.953125%',
'lat = 5.6640625%',
'ron = 9.27734375%'
]
Gram ' cont' [
'fra = 17.1875%',
'eng = 3.80859375%',
'spa = 16.50390625%',
'por = 17.3828125%',
'ita = 11.03515625%',
'nld = 2.34375%',
'lat = 6.0546875%',
'ron = 11.328125%'
]
Result contextos [
{ lang: 'pt', accuracy: 1, score: 483.5 },
{ lang: 'es', accuracy: 0.8257, score: 399.25 },
{ lang: 'fr', accuracy: 0.7653, score: 370 },
{ lang: 'it', accuracy: 0.6086, score: 294.25 },
{ lang: 'ro', accuracy: 0.49639999999999995, score: 240 },
{ lang: 'la', accuracy: 0.2802, score: 135.5 },
{ lang: 'en', accuracy: 0.16080000000000005, score: 77.75 },
{ lang: 'nl', accuracy: 0.10340000000000005, score: 50 }
]
Merge Results [
{ lang: 'es', accuracy: 0.10123958333333333 },
{ lang: 'pt', accuracy: 0.05555625 },
{ lang: 'la', accuracy: 0.027733333333333332 },
{ lang: 'it', accuracy: 0.027129166666666666 },
{ lang: 'fr', accuracy: 0.023691666666666666 },
{ lang: 'hu', accuracy: 0.020833333333333332 },
{ lang: 'ro', accuracy: 0.018177083333333333 },
{ lang: 'tl', accuracy: 0.005266666666666667 },
{ lang: 'en', accuracy: 0.004658333333333334 },
{ lang: 'tlh', accuracy: 0.003229166666666667 },
{ lang: 'nl', accuracy: 0.0021541666666666675 },
{ lang: 'tr', accuracy: 0.0018937499999999996 },
{ lang: 'id', accuracy: 0.001818750000000001 },
{ lang: 'sr', accuracy: 0.0006687500000000004 },
{ lang: 'lt', accuracy: 0.00018333333333333396 }
]
[
{ lang: 'es', accuracy: 0.10123958333333333 },
{ lang: 'pt', accuracy: 0.05555625 },
{ lang: 'la', accuracy: 0.027733333333333332 },
{ lang: 'it', accuracy: 0.027129166666666666 },
{ lang: 'fr', accuracy: 0.023691666666666666 },
{ lang: 'hu', accuracy: 0.020833333333333332 },
{ lang: 'ro', accuracy: 0.018177083333333333 },
{ lang: 'tl', accuracy: 0.005266666666666667 },
{ lang: 'en', accuracy: 0.004658333333333334 },
{ lang: 'tlh', accuracy: 0.003229166666666667 },
{ lang: 'nl', accuracy: 0.0021541666666666675 },
{ lang: 'tr', accuracy: 0.0018937499999999996 },
{ lang: 'id', accuracy: 0.001818750000000001 },
{ lang: 'sr', accuracy: 0.0006687500000000004 },
{ lang: 'lt', accuracy: 0.00018333333333333396 }
]
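As a sanity check on the merged figures, they appear to be the sum of the per-word accuracies divided by 48, which happens to be the number of letters in the input. The sketch below only reproduces that arithmetic from the log; the divisor is inferred from the numbers rather than taken from tinyld's source, and `perWord`, `letters` and `merged` are names made up for the illustration.

```typescript
// Per-word accuracies copied (rounded to 4 places) from the "Result <word>"
// blocks above. Words that scored 0 for a language are left out.
const perWord: Record<string, number[]> = {
  es: [0.7005, 1, 0.3333, 1, 1, 0.8257], // para, una, aproximacin, y, sus, contextos
  pt: [1, 0.6667, 1],                    // para, aproximacin, contextos
  hu: [1],                               // punk
};

const text = "notas para una aproximacin al punk espaol y sus contextos";

// Assumption: the divisor is the letter count of the input (48 here).
// This is inferred from the log output only.
const letters = text.replace(/ /g, "").length;

// Sum of the per-word accuracies for one language, divided by the letter count.
function merged(lang: string): number {
  const sum = (perWord[lang] ?? []).reduce((a, b) => a + b, 0);
  return sum / letters;
}

for (const lang of Object.keys(perWord)) {
  console.log(lang, merged(lang));
}
```

The values come out at roughly 0.1012 for es, 0.0556 for pt and 0.0208 for hu, in line with the Merge Results block, so Spanish wins here on the strength of "una", "y" and "sus" despite the two misspelled words contributing little or nothing.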
But the heavy version seems to decide it's Irish (ga is the ISO 639-1 code for Irish; Galician would be gl) pretty quickly:
tinyld-heavy --verbose "Notas para una aproximacin al punk espaol y sus contextos"
Analize chunks [ 'notas para una aproximacin al punk espaol y sus contextos' ]
[Pass 1] detectUniqueGrams 4-grams - match 'aol ' to ga
Merge Results [ { lang: 'ga', accuracy: 1 } ]
[ { lang: 'ga', accuracy: 1 } ]
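To see where 'aol ' comes from, here is a toy model of what the two logs suggest. The gram lists look like every substring of length 1 to 5, with the word padded by spaces for lengths of 2 and up, and Pass 1 seems to return as soon as one gram is found in a table of grams unique to a language. This is a sketch under those assumptions, not tinyld's actual code: `potentialGrams`, `uniqueGrams` and `firstUniqueMatch` are hypothetical names, and the table holds only the one entry shown in the log.

```typescript
// Hypothetical helper: every substring of length 1..maxLen. For lengths >= 2
// the word is padded with one space on each side, which matches the lists
// printed after "DetectPotentialGrams" in the log.
function potentialGrams(word: string, maxLen = 5): string[] {
  const grams: string[] = [];
  for (let n = 1; n <= maxLen; n++) {
    const src = n === 1 ? word : ` ${word} `;
    for (let i = 0; i + n <= src.length; i++) {
      grams.push(src.slice(i, i + n));
    }
  }
  return grams;
}

// Toy stand-in for a unique-gram table. The single entry is the match
// reported by the heavy version; the real table is not reproduced here.
const uniqueGrams = new Map<string, string>([["aol ", "ga"]]);

// Returns the language of the first gram found in the table, if any.
function firstUniqueMatch(word: string): string | undefined {
  const hit = potentialGrams(word).find((g) => uniqueGrams.has(g));
  return hit === undefined ? undefined : uniqueGrams.get(hit);
}

console.log(firstUniqueMatch("espaol"));       // "ga"
console.log(firstUniqueMatch("espa\u00f1ol")); // undefined (correctly spelled "español")
```

In this model the gram 'aol ' only exists because the ñ is missing from "espaol"; with the correct spelling the 4-grams are ' esp', 'espa', 'spañ', 'paño', 'añol' and 'ñol ', so nothing in the toy table matches and detection would fall through to the scoring pass.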
I need to learn more about the library internals, and spelling mistakes will always be a problem, but is there anything worth tweaking?