src-d / ml

sourced.ml is a library and command line tools to build and apply machine learning models on top of Universal Abstract Syntax Trees

Improve TokenParser in cases containing abbreviations

irinakhismatullina opened this issue

While using TokenParser to correct typos in identifiers, I constantly run into mistakes like
HTMLElement -> htmle lement.

To me it looks like in that case (several uppercase letters in a row followed by lowercase letters) it would be better to attach the last uppercase letter to the next token. I've seen many cases where this would be the right call, and almost none where it would break the logic.
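
To make the proposed rule concrete, here is a minimal sketch in plain Python. It is not the actual TokenParser code: `split_identifier`, the `attach_last_upper` flag, and both regular expressions are hypothetical names and patterns chosen for illustration. The naive pattern is only an assumption that happens to reproduce the splits reported in this issue.

```python
import re

# Naive split: a run of two or more capitals is kept together as one token.
# Assumption: this only mimics the splits reported in this issue; it is not
# taken from the TokenParser source.
NAIVE = re.compile(r"[A-Z]{2,}|[A-Z]?[a-z]+|[A-Z]|[0-9]+")

# Proposed split: a run of capitals gives up its last letter when that letter
# is followed by a lowercase letter, so the letter starts the next token.
ACRONYM_AWARE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(identifier, attach_last_upper=True):
    """Split an identifier into lowercase tokens joined by spaces.

    attach_last_upper is a hypothetical switch: True applies the proposed
    rule, False falls back to the naive behaviour.
    """
    pattern = ACRONYM_AWARE if attach_last_upper else NAIVE
    return " ".join(m.group(0).lower() for m in pattern.finditer(identifier))


for name in ("HTMLElement", "NSURLResponse", "getElementById"):
    # Print the naive split next to the proposed one for comparison.
    print(name, "|", split_identifier(name, False), "|", split_identifier(name))
```

With this sketch, HTMLElement becomes "html element" instead of "htmle lement", while camelCase names such as getElementById are split the same way under both patterns. One case where the rule hurts is a pluralized abbreviation: IDs comes out as "i ds".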

For example, the token 'lement' is one of the most frequent typo-like tokens that gets split out. Here is where it comes from (top examples):

data[data.token_split.str.contains(" lement")]
pos    num_occ    num_repos    identifier    token_split    num_files
3993    66995    4764    HTMLElement    htmle lement    13079
14139    16425    103    NSXMLElement    nsxmle lement    1741
47404    4496    85    JAXBElement    jaxbe lement    453
64825    3276    16    HTMLElementEventMap    htmle lement event map    42
66583    3182    41    IHTMLElement    ihtmle lement    209
86788    2389    471    SVGSVGElement    svgsvge lement    784
107285    1895    653    HTMLLIElement    htmllie lement    967
123871    1618    548    HTMLHRElement    htmlhre lement    811
126724    1579    551    HTMLBRElement    htmlbre lement    825
128322    1556    418    SVGGElement    svgge lement    718
144583    1365    19    BSONElement    bsone lement    198
150084    1309    33    IXMLDOMElement    ixmldome lement    178

And here are the correct parses for comparison:

data[data.token_split.str.contains(" element")]
pos    num_occ    num_repos    identifier    token_split    num_files
194    1608035    27484    createElement    create element    185424
458    740521    22    as_fusion_element    as fusion element    628
604    568326    19962    documentElement    document element    90360
618    555927    20933    getElementsByTagName    get elements by tag name    91772
794    407035    22788    getElementById    get element by id    97313
1888    155867    12876    getElementsByClassName    get elements by class name    29477
2182    131254    13040    activeElement    active element    37437
2538    111209    3936    getElement    get element    19493
3153    87404    137    FieldElement    field element    449
3221    85380    1091    _currentElement    current element    2370
3306    83096    6811    parentElement    parent element    18498
3550    76270    1698    domElement    dom element    5811
3765    71809    1496    buttonElement    button element    2572
3843    69912    12145    srcElement    src element    35165

TL;DR: Can I add this case to the TokenParser? It would be possible to switch it off at first, and I would like to try it with typo correction.

And here is the same with another very frequent typo-like token, 'esponse':

data[data.token_split.str.contains(" esponse")]

num_occ    num_repos    identifier    token_split    num_files
20298    1406    NSHTTPURLResponse    nshttpurlr esponse    6433
19973    1641    NSURLResponse    nsurlr esponse    6328
14019    661    HTTPResponse    httpr esponse    3612
5661    345    AFHTTPResponseSerializer    afhttpr esponse serializer    2079
4456    291    AFURLResponseSerialization    afurlr esponse serialization    2919
3386    289    HTTPURLResponse    httpurlr esponse    963

data[data.token_split.str.contains(" response")]

num_occ    num_repos    identifier    token_split    num_files
88423    12188    getResponseHeader    get response header    27731
74625    564    ServerResponse    server response    2165
42935    11165    getAllResponseHeaders    get all response headers    23173
34716    2640    HttpResponse    http response    11372
32443    130    CheckResponse    check response    1453
30016    2845    getResponse    get response    11978

Yes, this is a bug.