Improve TokenParser in cases containing abbreviations

irinakhismatullina opened this issue 5 years ago

Irina Khismatullina commented 5 years ago

While using TokenParser to correct typos in identifiers, I constantly run into bad splits like
HTMLElement -> htmle lement.

It seems to me that in this case (several uppercase letters in a row, followed by lowercase ones) the last uppercase letter should go to the next token. I have seen many cases where this would be the right call, and almost none where it would break the logic.

For example, 'lement' is one of the most frequent misspelled tokens that the splitting produces. Here is where it comes from (top examples by number of occurrences):

data[data.token_split.str.contains(" lement")]
pos  num_occ    num_repos    identifier    token_split    num_files
3993    66995    4764    HTMLElement    htmle lement    13079
14139    16425    103    NSXMLElement    nsxmle lement    1741
47404    4496    85    JAXBElement    jaxbe lement    453
64825    3276    16    HTMLElementEventMap    htmle lement event map    42
66583    3182    41    IHTMLElement    ihtmle lement    209
86788    2389    471    SVGSVGElement    svgsvge lement    784
107285    1895    653    HTMLLIElement    htmllie lement    967
123871    1618    548    HTMLHRElement    htmlhre lement    811
126724    1579    551    HTMLBRElement    htmlbre lement    825
128322    1556    418    SVGGElement    svgge lement    718
144583    1365    19    BSONElement    bsone lement    198
150084    1309    33    IXMLDOMElement    ixmldome lement    178

And here are the correct parses for comparison:

data[data.token_split.str.contains(" element")]
pos    num_occ    num_repos    identifier    token_split    num_files
194    1608035    27484    createElement    create element    185424
458    740521    22    as_fusion_element    as fusion element    628
604    568326    19962    documentElement    document element    90360
618    555927    20933    getElementsByTagName    get elements by tag name    91772
794    407035    22788    getElementById    get element by id    97313
1888    155867    12876    getElementsByClassName    get elements by class name    29477
2182    131254    13040    activeElement    active element    37437
2538    111209    3936    getElement    get element    19493
3153    87404    137    FieldElement    field element    449
3221    85380    1091    _currentElement    current element    2370
3306    83096    6811    parentElement    parent element    18498
3550    76270    1698    domElement    dom element    5811
3765    71809    1496    buttonElement    button element    2572
3843    69912    12145    srcElement    src element    35165

TL;DR: Can I add this case to TokenParser? It would be possible to switch it off at first, and I would like to try it with typo correction.
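To make the proposal concrete, here is a minimal sketch of the rule in plain Python. It is not the TokenParser code: `split_identifier`, the `attach_last_upper` flag, and both regular expressions are hypothetical names and patterns chosen for illustration. The "reported" pattern is only a stand-in that reproduces the splits shown in the tables above.

```python
import re

# Proposed rule (assumption, not the library's code): an uppercase run keeps
# every letter except the last one when a lowercase letter follows, so the
# last capital starts the next token.
_PROPOSED = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")

# Stand-in for the behaviour reported in this issue: a run of two or more
# capitals stays together, so the following word loses its first letter.
_REPORTED = re.compile(r"[A-Z]{2,}|[A-Z]?[a-z]+|[A-Z]")


def split_identifier(identifier: str, attach_last_upper: bool = True) -> str:
    """Split an identifier into lowercase tokens joined by spaces.

    Characters that are not ASCII letters act as separators and are dropped.
    With attach_last_upper=False the reported behaviour is reproduced.
    """
    pattern = _PROPOSED if attach_last_upper else _REPORTED
    return " ".join(token.lower() for token in pattern.findall(identifier))


if __name__ == "__main__":
    for name in ("HTMLElement", "NSHTTPURLResponse", "getElementsByTagName"):
        print(name, "->", split_identifier(name), "|",
              split_identifier(name, attach_last_upper=False))
```

With the flag on, `HTMLElement` becomes `html element`; with it off, the sketch gives `htmle lement` as in the table. One case where the rule would break the logic is a plural abbreviation: `URLs` would be split as `ur ls`.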

Irina Khismatullina · Answer 1 · Fri Apr 12 2019 18:54:21 GMT+0800 (China Standard Time)

And here is the same for another very frequent misspelled token, 'esponse':

data[data.token_split.str.contains(" esponse")]

num_occ	num_repos	identifier	token_split	num_files
20298	1406	NSHTTPURLResponse	nshttpurlr esponse	6433
19973	1641	NSURLResponse	nsurlr esponse	6328
14019	661	HTTPResponse	httpr esponse	3612
5661	345	AFHTTPResponseSerializer	afhttpr esponse serializer	2079
4456	291	AFURLResponseSerialization	afurlr esponse serialization	2919
3386	289	HTTPURLResponse	httpurlr esponse	963

data[data.token_split.str.contains(" response")]

num_occ	num_repos	identifier	token_split	num_files
88423	12188	getResponseHeader	get response header	27731
74625	564	ServerResponse	server response	2165
42935	11165	getAllResponseHeaders	get all response headers	23173
34716	2640	HttpResponse	http response	11372
32443	130	CheckResponse	check response	1453
30016	2845	getResponse	get response	11978

Vadim Markovtsev · Answer 2 · Fri Apr 12 2019 21:16:14 GMT+0800 (China Standard Time)

Yes, this is a bug.