spdx / Spdx-Java-Library

Java library which implements the Java object model for SPDX and provides useful helper functions

Official GPL-2.0 license text not recognized

sdheh opened this issue

For the license text https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt I get the following:

System.out.println(Arrays.toString(LicenseCompareHelper.matchingStandardLicenseIds(licenseText)));
System.out.println(LicenseCompareHelper.matchingStandardLicenseIdsWithinText(licenseText));

outputs

[]
[GPL-2.0, GPL-2.0-or-later, GPL-2.0-only]

The two outputs should be the same since the GPL-2.0 license spans the whole file.
Tested with version 1.1.11.
This problem is similar to #217.

PR #236 includes unit tests that reproduce this problem, albeit with other license texts - the issue is not limited to just GPL-2.0, or indeed even just GPL family licenses.

See also #234.

I found a problem that could explain this case: I think the tokenization does not work properly.
Example:

String license1 = "<one";
String template1 = "<<beginOptional>><<<endOptional>>one";
String license2 = "< one";
System.out.println("template1, license1: " + LicenseCompareHelper.isTextMatchingTemplate(template1, license1).getDifferenceMessage());
System.out.println("template1, license2: " + LicenseCompareHelper.isTextMatchingTemplate(template1, license2).getDifferenceMessage());

Returns

template1, license1: Normal text of license does not match at end of text when comparing to template text "one
".  Last optional text was not found due to the optional difference: 
	Normal text of license does not match at end of text when comparing to template text "<"
template1, license2: No difference found

When I debug the first case, I see that the matchTokens parameter in org.spdx.utility.compare.CompareTemplateOutputHandler.compareText is ["<one"]. I think it should instead be ["<", "one"], as in the second case.
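To illustrate the expected behaviour, here is a minimal sketch of a tokenizer that always emits < and > as their own tokens, so that "<one" and "< one" produce the same token list. This is not the library's tokenizer; the class name AngleBracketTokenizerSketch and the method tokenize are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT the Spdx-Java-Library tokenizer.
// It splits on whitespace and additionally treats '<' and '>' as standalone tokens.
final class AngleBracketTokenizerSketch {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace ends the current token and is not kept.
                flush(current, tokens);
            } else if (c == '<' || c == '>') {
                // An angle bracket ends the current token and becomes a token itself.
                flush(current, tokens);
                tokens.add(String.valueOf(c));
            } else {
                current.append(c);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    // Moves the pending characters, if any, into the token list.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<one"));  // prints [<, one]
        System.out.println(tokenize("< one")); // prints [<, one]
    }
}
```

With a rule like this, both license1 and license2 from the example would yield the tokens "<" and "one". Whether the real fix belongs in the tokenizer or elsewhere in the comparison code is not something this sketch settles.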

Also, if I remove all < and > from the https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt text (gpl-2.0-removed-angle-brackets.txt), or if I add a space before and after every < and > (gpl-2.0-spaces-between-angle-brackets-and-text.txt), I get the following result for the code in the issue description:

[GPL-2.0, GPL-2.0-only]
[GPL-2.0, GPL-2.0-or-later, GPL-2.0-only]
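
The second experiment (a space before and after every < and >) can be reproduced with a small preprocessing step on the license text. This is only a sketch of a diagnostic workaround, not a fix: the class name AngleBracketPaddingSketch and the method padAngleBrackets are hypothetical. It should be applied to the license text only, never to a template, since templates use << and >> as markup.

```java
// Hypothetical helper for reproducing the experiment above; not part of the library.
final class AngleBracketPaddingSketch {

    // Surrounds every '<' and '>' with single spaces so that a whitespace-based
    // tokenizer sees each bracket as a separate token.
    static String padAngleBrackets(String licenseText) {
        return licenseText.replaceAll("([<>])", " $1 ");
    }

    public static void main(String[] args) {
        System.out.println(padAngleBrackets("<one"));          // prints " < one"
        System.out.println(padAngleBrackets("<name of author>")); // prints " < name of author > "
    }
}
```

The padded string can then be passed to LicenseCompareHelper.matchingStandardLicenseIds in place of the original text. The result above was observed for GPL-2.0 only, so other licenses may behave differently.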

Thanks @sdheh for the analysis! I agree, the tokenization is the issue. I'm still working on the 3.0 update, so I won't have much time over the next week or so to look for a fix, but if you want to create a pull request I can review / merge.