Try word trigrams or something like that for identifying licenses

neilb opened this issue 8 years ago

Reading one of Ishigaki-san's PRs prompted an idea.

I wonder whether word trigram (or possibly even just bigram) analysis of the snippets could be used to guess the license more reliably, even if things like the title were left out. But a lot of the terminology in licenses is similar, so it might not work at all.
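To make the idea concrete, here is a minimal sketch of word-trigram comparison. Python is used purely for illustration. The function names `word_ngrams` and `jaccard`, the choice of Jaccard similarity as the score, and both snippets are assumptions made for this example and are not taken from the project.

```python
import re

def word_ngrams(text, n=3):
    """Return the set of word n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Illustrative snippets only (assumption): the second differs from the first
# in punctuation, capitalisation, and one word ("can" vs "may").
reference = "this library is free software you can redistribute it and or modify it"
candidate = "This library is free software; you may redistribute it and/or modify it."

score = jaccard(word_ngrams(reference), word_ngrams(candidate))
print(f"trigram similarity: {score:.3f}")
```

Punctuation and case differences disappear during tokenisation, while the single changed word removes only the three trigrams that contain it, so the score drops but stays well above zero.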

Wouldn't take too long to try this out, but I resisted the yak luring me in to give it a go right now, and put it onto my play-time backlog.

Recording the idea here, in case someone has more experience with this sort of thing and knows it won't work.

Olivier Mengué · Answer 1 · Thu Mar 03 2016 18:22:28 GMT+0800 (China Standard Time)

It is not clear to me what issue you would try to fix with your idea.
If this is about detecting licences, "trying" with an algorithm based on statistics does not seem like a good idea. Licences are about legal matters, and in this area a false detection (reporting an incorrect licence) is worse than a failed detection (not reporting anything).

Neil Bowers · Answer 2 · Sat Mar 05 2016 18:20:59 GMT+0800 (China Standard Time)

The current approach is pattern matching, looking for specific strings. Tiny changes in the text can result in a licence not being recognised, even though it's clearly that licence.

The suggested idea could allow an approach where the most likely licence is identified, but where we also point out that the text doesn't exactly match the expected snippet.
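A rough sketch of that approach, again in Python for illustration only: pick the best-scoring reference snippet, refuse to answer below a threshold (so that a weak match is reported as nothing rather than as a wrong licence), and flag whether the text matches the reference exactly. The function `guess_licence`, the threshold of 0.5, the `licence_a`/`licence_b` keys, and the reference strings are all hypothetical placeholders.

```python
import re

def trigrams(text):
    """Set of word trigrams, lowercased with punctuation dropped."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def guess_licence(snippet, references, threshold=0.5):
    """Return (name, score, exact) for the best reference, or None if below threshold."""
    grams = trigrams(snippet)
    best_name, best_score = None, 0.0
    for name, ref_text in references.items():
        ref = trigrams(ref_text)
        union = grams | ref
        score = len(grams & ref) / len(union) if union else 0.0
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None or best_score < threshold:
        return None  # prefer "no answer" over a doubtful guess
    # "exact" ignores whitespace differences only; everything else must match.
    exact = " ".join(snippet.split()) == " ".join(references[best_name].split())
    return best_name, best_score, exact

# Placeholder reference snippets (assumption), keyed by made-up names.
REFERENCES = {
    "licence_a": "permission is hereby granted free of charge to any person "
                 "obtaining a copy of this software",
    "licence_b": "this program is free software you can redistribute it and or "
                 "modify it under the same terms",
}

snippet = ("This program is free software; you may redistribute it "
           "and/or modify it under the same terms.")
result = guess_licence(snippet, REFERENCES)
print(result)
```

Here the altered snippet is still attributed to `licence_b`, with `exact` set to `False` so that a caller can report the mismatch. Where to set the threshold is a policy decision that this sketch does not try to settle.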

Kenichi Ishigaki · Answer 3 · Sat Apr 23 2016 22:04:20 GMT+0800 (China Standard Time)

I'm also dubious, because in some cases all that matters is a single character of the version number in a short license name.
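A small Python illustration of this concern, under assumed inputs: two strings that differ only in a version digit still share almost all of their word trigrams, so a similarity score alone would barely separate them. The strings are shortened illustrative notices rather than full licence texts, and `version_tokens` is a hypothetical helper showing one way the version could be compared explicitly.

```python
import re

def trigrams(text):
    """Set of word trigrams, lowercased with punctuation dropped."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def version_tokens(text):
    """Extract version-like tokens such as '2' or '2.0' (hypothetical helper)."""
    return re.findall(r"\d+(?:\.\d+)*", text)

# Illustrative notices (assumption) that differ by a single character.
base = ("this program is free software you can redistribute it and or modify "
        "it under the terms of the gnu general public license version ")
notice_v2 = base + "2"
notice_v3 = base + "3"

a, b = trigrams(notice_v2), trigrams(notice_v3)
similarity = len(a & b) / len(a | b)

print(f"similarity between the two notices: {similarity:.3f}")
print("version tokens:", version_tokens(notice_v2), version_tokens(notice_v3))
```

Only the final trigram differs between the two notices, so the score stays high even though they name different licences. Any scoring scheme along these lines would presumably need a separate, exact check on the version.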

N-grams may be useful for improving the current matching patterns, but maintaining patterns generated by n-gram analysis seems like a terrible idea, because they probably wouldn't make much sense to us humans.

Neil Bowers · Answer 4 · Sun Apr 24 2016 00:58:44 GMT+0800 (China Standard Time)

I wasn't suggesting we'd maintain patterns generated by n-gram analysis. That said, I'm going to close this ticket here and just leave it on my todo list as an idea to play with. Maybe.