guxd / deep-code-search

DeepCS: Deep Code Search

Question about camel and lower case handling in raw data processing

JoeQiao666 opened this issue

For method name and token extraction, we split camel-case identifiers into lowercase words.
I am wondering whether we should do the same for API sequence and description extraction: split camel case (and some snake case) and convert the parts to lowercase. The Java code we crawled contains method calls such as 'setIncludingFilterTopNode' and 'getOriginalSavedSearch'. If we do not split them, each long name ends up in the vocab file as a single word, which seems too specific to generalize well. I am not sure whether performance would improve if we split them.
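
A minimal sketch of what that splitting could look like, in Python. The function name `split_identifier` and the regular expression are assumptions made for illustration; this is not the repository's actual preprocessing code.

```python
import re

# Hypothetical helper, not taken from the DeepCS preprocessing scripts.
# The pattern picks out, in order of preference:
#   - an acronym that is followed by a capitalized word ("XML" in "XMLDocument")
#   - an optionally capitalized lowercase word ("set", "Filter")
#   - a run of capitals ("MAX")
#   - a run of digits
# Underscores and other separators match nothing and are simply skipped.
_SUBTOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    return [token.lower() for token in _SUBTOKEN.findall(name)]


print(split_identifier("setIncludingFilterTopNode"))
# ['set', 'including', 'filter', 'top', 'node']
print(split_identifier("getOriginalSavedSearch"))
# ['get', 'original', 'saved', 'search']
print(split_identifier("max_buffer_size"))
# ['max', 'buffer', 'size']
```

With this kind of split, the vocab would hold short, reusable words such as 'filter' and 'search' instead of one entry per long method name.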

That's definitely worth a try, and I think it might improve performance.
My concern is that different API methods can have totally different meanings even though they share common tokens. In that case, sharing embeddings across the split tokens could distort the representation of each API.
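
To make the concern concrete, here is a small Python sketch. The method names `openConnection` and `closeConnection` are made up for illustration, and `subtokens` is a hypothetical helper rather than anything in the repository. It shows two calls with opposite meanings that would share an embedding once split.

```python
import re


def subtokens(name):
    """Hypothetical helper: lowercase subtokens of a camelCase or snake_case name."""
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", name)
    return {part.lower() for part in parts}


def shared_subtokens(first, second):
    """Subtokens that two API names would have in common after splitting."""
    return subtokens(first) & subtokens(second)


# Made-up API names with opposite meanings.
print(sorted(shared_subtokens("openConnection", "closeConnection")))
# ['connection']
```

Unsplit, the two names are separate vocab entries with independent embeddings. After splitting, they differ only in 'open' versus 'close', so the model has to recover the distinction from a single token.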