Implement multiple algorithms to choose from besides Karp Rabin

Question

olleharstedt opened this issue 3 years ago

Is your feature request related to a problem? Please describe.
The Karp-Rabin Greedy String Tiling algorithm might not be the best or most up-to-date algorithm for detecting clones. It actually dates from 1987 (according to Wikipedia), and a lot of research has happened since.
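
For context, the hashing half of the approach can be sketched as below. This is a minimal illustration only: it covers the Karp-Rabin rolling hash over fixed-size token windows, not Greedy String Tiling, and it is not how jscpd or phpcpd are actually implemented. The class name, the `BASE` and `MOD` constants, and the convention that tokens are non-negative integer type IDs are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RollingHashSketch {
    // Arbitrary illustrative constants; token IDs are assumed to be non-negative.
    private static final long BASE = 257L;
    private static final long MOD = 1_000_000_007L;

    /** Returns pairs {i, j} with i < j whose windows of length w hold identical tokens. */
    static List<int[]> duplicateWindows(int[] tokens, int w) {
        List<int[]> matches = new ArrayList<>();
        if (w <= 0 || tokens.length < w) {
            return matches;
        }
        long high = 1; // BASE^(w-1) mod MOD, the weight of the token leaving the window
        for (int k = 1; k < w; k++) {
            high = high * BASE % MOD;
        }
        long hash = 0;
        for (int k = 0; k < w; k++) {
            hash = (hash * BASE + tokens[k]) % MOD;
        }
        Map<Long, List<Integer>> seen = new HashMap<>();
        for (int start = 0; ; start++) {
            List<Integer> bucket = seen.computeIfAbsent(hash, h -> new ArrayList<>());
            for (int earlier : bucket) {
                // Equal hashes can collide, so confirm token by token.
                if (sameWindow(tokens, earlier, start, w)) {
                    matches.add(new int[] {earlier, start});
                }
            }
            bucket.add(start);
            int next = start + w;
            if (next >= tokens.length) {
                break;
            }
            // Slide the window: drop tokens[start], append tokens[next].
            hash = (hash - tokens[start] * high % MOD + MOD) % MOD;
            hash = (hash * BASE + tokens[next]) % MOD;
        }
        return matches;
    }

    private static boolean sameWindow(int[] t, int a, int b, int w) {
        for (int k = 0; k < w; k++) {
            if (t[a + k] != t[b + k]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `RollingHashSketch.duplicateWindows(tokens, 3)` on a token stream reports every pair of identical 3-token windows. Because only exact window matches are found, a single changed token inside a window hides the clone, which is the limitation this request is about.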

Describe the solution you'd like
Discussion about implementing other algorithms, and how that could be done. Which would be chosen as a candidate, and why? Risk, benefit, effort.

Describe alternatives you've considered
Seems like basically all available open-source clone detection tools use the Karp Rabin algo. Maybe we can improve this situation?

Andrey Kucherenko · Answer 1 · Fri Jun 04 2021 22:34:07 GMT+0800 (China Standard Time)

Hi Olle

Thank you for the request. I have some information to share about the algorithms: I've made a few PoCs with other algorithms and combinations of algorithms (for example, I tried Bloom filters), and as a result I can say that the released version of jscpd already has the best algorithm of those I tried. If there is an algorithm you are interested in, I can try to create a PoC based on it; just let me know.

I also have one question: what are you trying to achieve with a new algorithm?

Olle Härstedt · Answer 2 · Fri Jun 04 2021 23:23:27 GMT+0800 (China Standard Time)

Hi!

Yeah? Did you upload any of those, @kucherenko ?

What I want to achieve:

~~No false positives on comments~~ Sorry, already possible.
No false positives on function calls spread out on multiple lines
No false positives on classes with similar properties

Olle Härstedt · Answer 3 · Fri Jun 04 2021 23:28:37 GMT+0800 (China Standard Time)

Different scenarios and how they are handled by different algorithms:

https://reader.elsevier.com/reader/sd/pii/S0167642309000367?token=4CEC8ED633248046C8ECF054461C1077C4931D0142A7EA14087C2C27E8764BA65CB89CE01575246C4DDE06E64302B9D1&originRegion=eu-west-1&originCreation=20210604152708

Andrey Kucherenko · Answer 4 · Sun Jun 06 2021 22:54:06 GMT+0800 (China Standard Time)

Thank you for the docs. In jscpd you can try to customize the detection process with modes: in a mode you can filter or modify all tokens before detection starts.
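
The idea of filtering or modifying tokens before detection can be sketched roughly as follows. This is not jscpd's mode API (jscpd is not written in Java, and its real token type is not shown in this thread); the `Token` class, the type names such as `"comment"` and `"identifier"`, and both helper methods are hypothetical and only illustrate the concept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class TokenModeSketch {
    /** Hypothetical token shape; not jscpd's actual token type. */
    static final class Token {
        final String type;
        final String value;

        Token(String type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    /** A "mode" is modelled here as a predicate deciding which tokens reach the detector. */
    static final Predicate<Token> SKIP_COMMENTS_AND_WHITESPACE =
        t -> !t.type.equals("comment") && !t.type.equals("whitespace");

    /** Filtering step: keep only the tokens the mode accepts. */
    static List<Token> applyMode(List<Token> tokens, Predicate<Token> mode) {
        List<Token> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (mode.test(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    /** Modification step: replace identifier text with a placeholder so renamed variables still match. */
    static List<Token> normalizeIdentifiers(List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(t.type.equals("identifier") ? new Token(t.type, "ID") : t);
        }
        return out;
    }
}
```

A detector would then run on the output of `normalizeIdentifiers(applyMode(tokens, SKIP_COMMENTS_AND_WHITESPACE))`. Preprocessing of this kind can remove false positives caused by comments, but it cannot by itself tolerate inserted or deleted tokens inside a clone.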

Olle Härstedt · Answer 5 · Mon Jun 07 2021 00:56:13 GMT+0800 (China Standard Time)

If you wanna support multiple languages, it's easiest with text or token based approaches, I guess, and not graph based. I'm looking a bit at the algorithm here, which is available as open-source (2015, Java): https://www.cqse.eu/fileadmin/content/news/publications/2009-do-code-clones-matter.pdf

Guess I'll close the issue now, and open a PR if I ever see some progress. :) Thanks for the feedback!

Olle Härstedt · Answer 6 · Sat Jun 12 2021 22:57:17 GMT+0800 (China Standard Time)

I think I made some progress.

Have a look at the clone in this file:

https://github.com/LimeSurvey/LimeSurvey/blob/master/application/models/QuestionTheme.php#L363

https://github.com/LimeSurvey/LimeSurvey/blob/master/application/models/QuestionTheme.php#L836

if (\PHP_VERSION_ID < 80000) {
    $bOldEntityLoaderState = libxml_disable_entity_loader(true);
}
$sQuestionConfigFilePath = App()->getConfig('rootdir') . DIRECTORY_SEPARATOR . $pathToXML . DIRECTORY_SEPARATOR . 'config.xml';
if (!file_exists($sQuestionConfigFilePath)) {
    throw new Exception(gT('Extension configuration file is not valid or missing.'));
}
$sQuestionConfigFile = file_get_contents($sQuestionConfigFilePath);  // @see: Now that entity loader is disabled, we can't use simplexml_load_file; so we must read the file with file_get_contents and convert it as a string

and

if (\PHP_VERSION_ID < 80000) {
    $bOldEntityLoaderState = libxml_disable_entity_loader(true);
}
$sQuestionConfigFilePath = App()->getConfig('rootdir') . DIRECTORY_SEPARATOR . $sConfigPath;
if (!file_exists($sQuestionConfigFilePath)) {
    throw new Exception('Found no config.xml file at ' . $sQuestionConfigFilePath);
}
$sQuestionConfigFile = file_get_contents($sQuestionConfigFilePath);  // @see: Now that entity loader is disabled, we can't use simplexml_load_file; so we must read the file with file_get_contents and convert it as a string

As you can see, it's slightly edited. The new algorithm I'm copying has a parameter called edit distance, which makes it possible to detect clones with slight differences like these, even on token level (already disregarding variable and function names etc). I did not manage to detect the whole 8-line clone with either jscpd or phpcpd. When I configured phpcpd with higher sensitivity, I also got nonsense reports for clones at the end of function calls. Have a go if you like.
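
To make the edit-distance idea concrete, here is a minimal sketch that computes a plain Levenshtein distance between two token sequences. This is an assumption-laden illustration of the metric only: the suffix-tree code in the repo applies its error budget while searching, which this sketch does not attempt to reproduce, and the class and method names here are made up for the example.

```java
final class TokenEditDistanceSketch {
    /** Levenshtein distance counted in whole tokens (insert, delete, substitute each cost 1). */
    static int editDistance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i; // turning a[0..i) into an empty prefix takes i deletions
            for (int j = 1; j <= b.length; j++) {
                int subst = prev[j - 1] + (a[i - 1].equals(b[j - 1]) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(subst, Math.min(delete, insert));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    /** Two fragments count as a near-clone if they differ by at most maxErrors token edits. */
    static boolean isNearClone(String[] a, String[] b, int maxErrors) {
        return editDistance(a, b) <= maxErrors;
    }
}
```

As a hand-simplified example (not real PHP tokenizer output), the two path assignments above could be normalized to `VAR = CALL ( ) . CONST . VAR . CONST . STRING ;` and `VAR = CALL ( ) . CONST . VAR ;`. These differ by four deleted tokens, so with `maxErrors` of 4 or more `isNearClone` accepts them, while an exact-match detector would split the clone at that line.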

My repo is here: https://github.com/olleharstedt/suffixtree

Main entry point:

/**
 * Finds all clones in the string (List) used in the constructor.
 * @param minLength the minimal length of a clone
 * @param maxErrors the maximal number of errors/gaps allowed
 * @param headEquality the number of elements which have to be the same at the beginning of a clone
 */
public void findClones(int minLength, int maxErrors, int headEquality) throws ConQATException {

Ping me if you want instructions on how to run it. It's not tidy at all, I just bashed on it until it worked, but the core of the algorithm is not changed. I'd like to convert it to PHP and get it merged into phpcpd, but would be nice if jscpd can benefit from it also. :)

Andrey Kucherenko · Answer 7 · Sun Jun 13 2021 03:21:09 GMT+0800 (China Standard Time)

Thank you, it looks interesting. One disadvantage of the solution is that I would have to implement the algorithm for each of the supported languages. I can also suggest you investigate the jsinspect project (https://github.com/danielstjules/jsinspect); jsinspect identifies code with a similar structure.

Olle Härstedt · Answer 8 · Sun Jun 13 2021 04:00:39 GMT+0800 (China Standard Time)

> as disadvantage of the solution I can say that I should implement the algorithm for each of supported languages

No, the solution is still token-based! That's the good thing. :) But instead of comparing a hash, they build suffix trees and compare. I don't know the details so much, but it was no problem to feed it PHP tokens instead of whatever they used (their internal tests used characters instead of tokens). Can I assume you have a token class you convert all languages to?

Here's the PHP token class I use: https://github.com/olleharstedt/suffixtree/blob/phptoken/PhpToken.java

Will check the link, thanks!
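
As a rough picture of why a suffix-based search stays language-independent, the sketch below finds the longest repeated run in a stream of integer token IDs. It deliberately swaps in a naively sorted suffix array instead of a suffix tree, handles exact repeats only (no error budget), and is not the code from the repo; the class name and the integer token encoding are assumptions for the example.

```java
import java.util.Arrays;

final class SuffixRepeatSketch {
    /** Returns {startA, startB, length} of the longest token run occurring twice, or null if none. */
    static int[] longestRepeat(int[] tokens) {
        int n = tokens.length;
        Integer[] suffixes = new Integer[n];
        for (int i = 0; i < n; i++) {
            suffixes[i] = i;
        }
        // Naive suffix array: sort start positions by the token sequence that follows them.
        Arrays.sort(suffixes, (x, y) -> compareSuffixes(tokens, x, y));
        int[] best = null;
        // The longest repeat is always a common prefix of two neighbours in sorted order.
        for (int k = 1; k < n; k++) {
            int a = suffixes[k - 1];
            int b = suffixes[k];
            int len = commonPrefix(tokens, a, b);
            if (len > 0 && (best == null || len > best[2])) {
                best = new int[] {Math.min(a, b), Math.max(a, b), len};
            }
        }
        return best;
    }

    private static int compareSuffixes(int[] t, int x, int y) {
        while (x < t.length && y < t.length) {
            if (t[x] != t[y]) {
                return Integer.compare(t[x], t[y]);
            }
            x++;
            y++;
        }
        // One suffix is a prefix of the other: the shorter remainder sorts first.
        return Integer.compare(t.length - x, t.length - y);
    }

    private static int commonPrefix(int[] t, int a, int b) {
        int len = 0;
        while (a + len < t.length && b + len < t.length && t[a + len] == t[b + len]) {
            len++;
        }
        return len;
    }
}
```

Nothing in this code knows about PHP, JavaScript or any grammar: any tokenizer that maps source to a sequence of comparable token IDs can feed it. Note that the two occurrences it reports may overlap, and the naive sort is quadratic in the worst case, so this is for illustration rather than production use.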

Andrey Kucherenko · Answer 9 · Sun Jun 13 2021 04:04:09 GMT+0800 (China Standard Time)

thank you, will try

Olle Härstedt · Answer 10 · Mon Jun 14 2021 19:22:52 GMT+0800 (China Standard Time)

This topic might be moot due to license issues. The license for the new algorithm is Apache 2.0, but you're using MIT. IIRC, you'd have to re-license to Apache 2.0 too if you'd want to include it. :( Or GPL v3

https://en.wikipedia.org/wiki/Apache_License