Improve TokenParser in cases containing abbreviations
irinakhismatullina opened this issue · comments
While using TokenParser to correct typos in identifiers, I constantly run into mistakes like HTMLElement -> htmle lement.
It seems to me that in this case (several uppercase letters in a row, followed by lowercase letters) it would be better to attach the last uppercase letter to the next token. I've seen many cases where this would be the right call, and almost none where it would break the logic.
E.g. the token 'lement' is one of the most frequent typo tokens produced by the splitting. Here is where it comes from (top examples):
data[data.token_split.str.contains(" lement")]
pos | num_occ | num_repos | identifier | token_split | num_files |
---|---|---|---|---|---|
3993 | 66995 | 4764 | HTMLElement | htmle lement | 13079 |
14139 | 16425 | 103 | NSXMLElement | nsxmle lement | 1741 |
47404 | 4496 | 85 | JAXBElement | jaxbe lement | 453 |
64825 | 3276 | 16 | HTMLElementEventMap | htmle lement event map | 42 |
66583 | 3182 | 41 | IHTMLElement | ihtmle lement | 209 |
86788 | 2389 | 471 | SVGSVGElement | svgsvge lement | 784 |
107285 | 1895 | 653 | HTMLLIElement | htmllie lement | 967 |
123871 | 1618 | 548 | HTMLHRElement | htmlhre lement | 811 |
126724 | 1579 | 551 | HTMLBRElement | htmlbre lement | 825 |
128322 | 1556 | 418 | SVGGElement | svgge lement | 718 |
144583 | 1365 | 19 | BSONElement | bsone lement | 198 |
150084 | 1309 | 33 | IXMLDOMElement | ixmldome lement | 178 |
And here are the correct parses for comparison:
data[data.token_split.str.contains(" element")]
pos | num_occ | num_repos | identifier | token_split | num_files |
---|---|---|---|---|---|
194 | 1608035 | 27484 | createElement | create element | 185424 |
458 | 740521 | 22 | as_fusion_element | as fusion element | 628 |
604 | 568326 | 19962 | documentElement | document element | 90360 |
618 | 555927 | 20933 | getElementsByTagName | get elements by tag name | 91772 |
794 | 407035 | 22788 | getElementById | get element by id | 97313 |
1888 | 155867 | 12876 | getElementsByClassName | get elements by class name | 29477 |
2182 | 131254 | 13040 | activeElement | active element | 37437 |
2538 | 111209 | 3936 | getElement | get element | 19493 |
3153 | 87404 | 137 | FieldElement | field element | 449 |
3221 | 85380 | 1091 | _currentElement | current element | 2370 |
3306 | 83096 | 6811 | parentElement | parent element | 18498 |
3550 | 76270 | 1698 | domElement | dom element | 5811 |
3765 | 71809 | 1496 | buttonElement | button element | 2572 |
3843 | 69912 | 12145 | srcElement | src element | 35165 |
TL;DR: Can I add this case to TokenParser? The new behavior could be optional and switched off at first, and I would like to try it out on typo correction.
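To make the proposal concrete, here is a minimal sketch of the rule in plain Python. It is not the actual TokenParser code: the function `split_identifier` and the flag `split_acronym_tail` are hypothetical names invented for illustration, and the regexes only approximate the current behavior described in this issue.

```python
import re

# Hypothetical sketch, not the real TokenParser implementation.

# Proposed rule: an uppercase run keeps all letters except the last one when
# the run is followed by a lowercase letter ("HTMLElement" -> "HTML", "Element").
_ACRONYM_AWARE = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|[0-9]+")

# Approximation of the behavior reported above: a run of two or more uppercase
# letters is kept whole ("HTMLElement" -> "HTMLE", "lement").
_GREEDY_UPPER = re.compile(r"[A-Z]{2,}|[A-Z][a-z]+|[a-z]+|[0-9]+|[A-Z]")


def split_identifier(identifier: str, split_acronym_tail: bool = False) -> str:
    """Split an identifier into lowercase tokens joined by spaces.

    split_acronym_tail is an assumed, illustrative switch: it is off by
    default so that the existing splits are unchanged unless requested.
    """
    pattern = _ACRONYM_AWARE if split_acronym_tail else _GREEDY_UPPER
    return " ".join(part.lower() for part in pattern.findall(identifier))


if __name__ == "__main__":
    for name in ("HTMLElement", "NSHTTPURLResponse", "createElement"):
        print(name, "|", split_identifier(name), "|",
              split_identifier(name, split_acronym_tail=True))
```

One caveat of this sketch: identifiers with a plural acronym, such as getIDs, would be split as "get i ds" with the flag on, which is the kind of rare case where the rule breaks the logic.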
And here's the same for another very frequent typo, 'esponse':
data[data.token_split.str.contains(" esponse")]
num_occ | num_repos | identifier | token_split | num_files |
---|---|---|---|---|
20298 | 1406 | NSHTTPURLResponse | nshttpurlr esponse | 6433 |
19973 | 1641 | NSURLResponse | nsurlr esponse | 6328 |
14019 | 661 | HTTPResponse | httpr esponse | 3612 |
5661 | 345 | AFHTTPResponseSerializer | afhttpr esponse serializer | 2079 |
4456 | 291 | AFURLResponseSerialization | afurlr esponse serialization | 2919 |
3386 | 289 | HTTPURLResponse | httpurlr esponse | 963 |
data[data.token_split.str.contains(" response")]
num_occ | num_repos | identifier | token_split | num_files |
---|---|---|---|---|
88423 | 12188 | getResponseHeader | get response header | 27731 |
74625 | 564 | ServerResponse | server response | 2165 |
42935 | 11165 | getAllResponseHeaders | get all response headers | 23173 |
34716 | 2640 | HttpResponse | http response | 11372 |
32443 | 130 | CheckResponse | check response | 1453 |
30016 | 2845 | getResponse | get response | 11978 |
Yes, this is a bug.