Question about camel and lower case handling in raw data processing

JoeQiao666 opened this issue 5 years ago

In method name and token extraction, camel-case identifiers are split into lowercase words.
I am wondering whether API sequence and description extraction should do the same: split camel case (and snake case where it occurs) and convert the pieces to lowercase. The Java code we crawled contains method calls such as 'setIncludingFilterTopNode' and 'getOriginalSavedSearch'. If these are not split, each one ends up in the vocab file as a single long word, which seems too specific to generalize well. I am not sure whether performance would improve if we split them.
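
For concreteness, here is a minimal sketch of the kind of splitting being asked about. It is not taken from the repository's preprocessing code; `split_identifier` and the regular expressions are hypothetical and only show one way camel case and snake case could be broken into lowercase words.

```python
import re

# Hypothetical helper, not the project's own preprocessing code.
# Zero-width boundary between a lowercase letter or digit and an uppercase
# letter ("getOriginal" -> "get|Original"), or between the end of an acronym
# and the next capitalized word ("HTTPResponse" -> "HTTP|Response").
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_identifier(name):
    """Split a camel-case or snake-case identifier into lowercase words."""
    tokens = []
    # First cut on underscores and any other non-word characters (e.g. dots).
    for part in re.split(r"[_\W]+", name):
        if part:
            # Then mark camel-case boundaries with a space and lowercase.
            tokens.extend(_CAMEL_BOUNDARY.sub(" ", part).lower().split())
    return tokens


print(split_identifier("setIncludingFilterTopNode"))
# ['set', 'including', 'filter', 'top', 'node']
print(split_identifier("getOriginalSavedSearch"))
# ['get', 'original', 'saved', 'search']
```

With this, the two long method names above would contribute nine short words to the vocab file instead of two long ones.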

Xiaodong Gu · Answer 1 · Tue Jul 16 2019 16:59:29 GMT+0800 (China Standard Time)

That's definitely worth a try, and I think it might improve performance.
My concern is that different API methods can have totally different meanings even though they share common tokens. In that sense, sharing embeddings across split tokens could affect the representation of the APIs.
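
A small sketch of that concern, under assumptions: the two Java API names below are picked only for illustration, and `split_api` and `build_vocab` are hypothetical helpers, not the project's vocabulary code. Once the names are split, both APIs look up the same vocabulary entry (and hence the same embedding row) for "start", although starting a thread and asking a regex matcher for a match offset are unrelated operations.

```python
import re


def split_api(name):
    """Hypothetical splitter: cut on non-alphanumerics, then on camel case."""
    words = []
    for part in re.split(r"[^A-Za-z0-9]+", name):
        words.extend(re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", part).lower().split())
    return words


def build_vocab(token_lists):
    """Assign each distinct token the next free integer id."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


apis = ["Thread.start", "Matcher.start"]

# Unsplit: each API is its own vocabulary entry with its own embedding.
whole_vocab = build_vocab([[api] for api in apis])

# Split: the word "start" is one entry used by both APIs.
split_vocab = build_vocab([split_api(api) for api in apis])
shared = set(split_api(apis[0])) & set(split_api(apis[1]))

print(whole_vocab)  # {'Thread.start': 0, 'Matcher.start': 1}
print(split_vocab)  # {'thread': 0, 'start': 1, 'matcher': 2}
print(shared)       # {'start'}
```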