Perl-Toolchain-Gang / Software-License

perl representation of common software licenses

Try word trigrams or something like that for identifying licenses

neilb opened this issue · comments

Reading one of Ishigaki-san's PRs prompted an idea.

I wonder whether word trigram (or possibly even just bigram) analysis of the snippets could be used to guess the license more reliably, even if things like the title were left out. But a lot of the terminology in licenses is similar, so it might not work at all.
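For illustration, here is a minimal Python sketch of what word-trigram comparison could look like. It is not code from Software-License: the function names, the Jaccard scoring, and the `SNIPPETS` entries are all assumptions made up for this example, and the snippet texts are placeholders rather than real license wording.

```python
import re

def word_ngrams(text, n=3):
    """Return the set of word n-grams in text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_licenses(text, snippets, n=3):
    """Score text against every reference snippet, best match first."""
    found = word_ngrams(text, n)
    scores = [(name, jaccard(found, word_ngrams(snippet, n)))
              for name, snippet in snippets.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical placeholder snippets, not real license wording.
SNIPPETS = {
    "Example-A": "you may copy and share this work freely",
    "Example-B": "you may not copy this work without permission",
}

ranking = rank_licenses("You may copy and share this work", SNIPPETS)
print(ranking[0][0])  # the best-scoring snippet name
```

Passing `n=2` switches to bigrams. Whether either size separates real licenses well enough is exactly the open question raised above.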

Wouldn't take too long to try this out, but I resisted the yak luring me in to give it a go right now, and put it onto my play-time backlog.

Recording the idea here, in case someone has more experience with this sort of thing and knows it won't work.

It is not clear to me what issue you would try to fix with your idea.
If this is about detecting licences, "trying" an algorithm based on statistics does not seem like a good idea. Licences are legal matters, and in this area a false detection (reporting the wrong licence) is worse than a failed detection (reporting nothing).

The current approach is pattern matching, looking for specific strings. Tiny changes in the text can result in a licence not being recognised, even though it's clearly that licence.

The suggested idea could allow an approach where the most likely licence is identified, while also pointing out that the text doesn't exactly match the expected snippet.
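A rough sketch of that "most likely licence, plus a warning" behaviour, again in Python and again purely hypothetical: `guess_license`, the coverage score, and the `0.5` threshold are assumptions for illustration, not anything the module provides. Returning `None` below the threshold is one way to prefer a failed detection over a false one.

```python
import re

def trigrams(text):
    """Set of word trigrams, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def guess_license(text, snippets, threshold=0.5):
    """Return (name, coverage, complete) for the best snippet, or None.

    coverage is the fraction of the snippet's trigrams found in text;
    complete is True only when every one of them was found.
    """
    found = trigrams(text)
    best_name, best_cov = None, 0.0
    for name, snippet in snippets.items():
        expected = trigrams(snippet)
        if not expected:
            continue  # snippet shorter than three words: nothing to compare
        coverage = len(found & expected) / len(expected)
        if coverage > best_cov:
            best_name, best_cov = name, coverage
    if best_name is None or best_cov < threshold:
        return None  # abstain rather than report a weak guess
    return best_name, best_cov, best_cov == 1.0

# Hypothetical placeholder snippets, not real license wording.
snippets = {
    "A": "one two three four five",
    "B": "six seven eight nine ten",
}

full = guess_license("one two three four FIVE!", snippets)
partial = guess_license("one two three four six", snippets)
nothing = guess_license("nothing alike here at all", snippets)
```

Here `full` reports a complete match, `partial` still names "A" but with `complete` set to `False`, and `nothing` is `None`. The threshold is arbitrary and would need tuning against real license texts.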

I'm also dubious, because in some cases all that matters is a single character of the version number in a short license name.
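That concern is easy to demonstrate. In the hypothetical Python sketch below, two made-up sentences (not real license wording) differ only in the version digit, yet share most of their word trigrams, so a similarity score alone would not tell them apart reliably.

```python
import re

def trigrams(text):
    """Set of word trigrams, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

# Made-up sentences that differ only in the version number.
v2 = ("licensed under the example license version 2 or any later "
      "version published by the example foundation")
v3 = v2.replace("version 2", "version 3")

a, b = trigrams(v2), trigrams(v3)
similarity = len(a & b) / len(a | b)  # Jaccard similarity
differing = a ^ b                     # trigrams present in only one text
```

Only the trigrams that contain the digit end up in `differing`, so a detector would presumably need an explicit version check on top of any n-gram score.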

N-grams may be useful for improving the current matching patterns, but maintaining patterns generated by n-gram analysis seems a terrible idea, because they probably wouldn't make much sense to us humans.

I wasn't suggesting we'd maintain patterns generated by n-gram analysis. That said, I'm going to close this ticket here and just leave it on my todo list as an idea to play with. Maybe.