h2non / filetype

Fast, dependency-free Go package to infer binary file types based on the magic numbers header signature

Home Page: https://pkg.go.dev/github.com/h2non/filetype?tab=doc


switching filetype to use Ragel

jtarchie opened this issue · comments

commented

Thanks for the library.

The benchmark was of particular interest to me. When matching the contents of a file, there are more efficient ways to detect binary patterns.

I did a proof of concept using Ragel. It is an external dependency, but it generates the final Go code as an efficient state machine.

At the time of writing this issue, I was able to support your benchmarks for images, zip, and tar. The XML-based document formats are skipped for now because I cannot discern their patterns as easily as the others.
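To make the idea concrete, here is a rough hand-written Go sketch of the kind of matching a generated state machine performs: dispatch on the first byte, then confirm the rest of the signature. It is not the proof-of-concept code and not Ragel output; the function name `matchMagic` and the returned labels are made up for illustration. The signatures themselves (JPEG, PNG, GIF, ZIP local file header, and the tar `ustar` marker at offset 257) are the well-known ones.

```go
package main

import (
	"bytes"
	"fmt"
)

// matchMagic is a hypothetical helper, not part of filetype or Ragel.
// It branches on the first byte and only then checks the remaining
// signature bytes, instead of trying every matcher in turn.
// It returns "" when nothing matches.
func matchMagic(buf []byte) string {
	if len(buf) == 0 {
		return ""
	}
	switch buf[0] {
	case 0xFF:
		if bytes.HasPrefix(buf, []byte{0xFF, 0xD8, 0xFF}) {
			return "jpg"
		}
	case 0x89:
		if bytes.HasPrefix(buf, []byte("\x89PNG\r\n\x1a\n")) {
			return "png"
		}
	case 'G':
		if bytes.HasPrefix(buf, []byte("GIF87a")) || bytes.HasPrefix(buf, []byte("GIF89a")) {
			return "gif"
		}
	case 'P':
		if bytes.HasPrefix(buf, []byte("PK\x03\x04")) {
			return "zip"
		}
	}
	// tar has no leading signature; the "ustar" marker sits at offset 257.
	if len(buf) >= 262 && bytes.Equal(buf[257:262], []byte("ustar")) {
		return "tar"
	}
	return ""
}

func main() {
	fmt.Println(matchMagic([]byte("GIF89a")))
}
```

A real Ragel machine would encode all of these transitions in one table or goto-driven function, so the sketch only approximates the control flow.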

The benchmarks were run with the same fixtures.

These are the results. A test was used to validate that the correct file types were being returned, too.

goos: darwin
goarch: amd64
BenchmarkMatchTar-4    	50000000	       183 ns/op
BenchmarkMatchZip-4    	1000000000	         6.23 ns/op
BenchmarkMatchJpeg-4   	2000000000	         4.98 ns/op
BenchmarkMatchGif-4    	2000000000	         4.47 ns/op
BenchmarkMatchPng-4    	1000000000	         6.80 ns/op

These were run on a 2014 MacBook Air with a 1.7 GHz Intel Core i7.

I'd like to contribute the work back. It seems that we can get this to be really fast.

Ragel machines can be language agnostic, so the same machine could be used for C-Python.

Some thoughts:

  1. Without comparative benchmarks, these numbers don't mean a whole lot.
  2. Ragel is super cool, but it almost seems like you should start your own library. Unless I missed something, converting this library to use Ragel would be a major design shift and could potentially orphan contributors and even the owner if they don't know Ragel.
  3. Similarly, unless you can stage your changes, you can't really submit this work back to the library until you achieve full parity or beyond. Another good reason to start your own lib, maybe?
commented
  1. Comparative benchmarks from the same machine? Or just numbers in general, as those exist on the main README.md. These are the same benchmarks with the same data fixtures.

  2. I'm not sure I understand your concern here. Is it more that the API of the library could change? APIs change, we've all used libraries that have. Or that contributors may need extra tooling? This can be solved by documentation and a go generate. No one will be orphaned.

  3. I introduced this issue as a proof of concept, to show that it is possible. If more people are interested in the work, I'm happy to spend some more time on it. At the moment, though, investing in discussion is worth more than making a PR that could easily get rejected.

Creating my own library feels counterproductive to the community as a whole. There is already a widely used library. Why create a competing one?

  1. Yeah, just to show how badass Ragel is. What I'm saying is that if you show them side by side from the same machine you get that "wow" factor. I'd be really interested to see how well Ragel does.
  2. More like using Ragel would completely change the library and how it's built. But I guess it's not a major concern overall.
  3. Exactly! Hence the discussion.

I think the most important question is: what would switching to Ragel bring more than just speed? Speed is nice but, IMHO, file magic isn't the most speed-sensitive thing, unless you're writing a service that spends a lot of its time matching and sorting files. Speed is nice, but implementing a major change like using Ragel should be driven by other concerns, and I'd like to hear more about that.

I'm only passingly familiar with Ragel, so I'd be interested in seeing what it can provide in terms of easily adding new patterns and efficiently matching other types of files that might be unmatchable given the constraints of the current implementation. For example, filetype cannot uniquely identify docx, xlsx, etc. because it only reads the first 262 bytes of any file (#13). If Ragel can make it trivial to add more types and also easier to specify and match deeper locations in a file, that's a much bigger win than speed.

Identifying xlsx, zip, etc. must be done by checking the contents, not just looking for a binary signature.
For example, a Java .jar file looks like a zip file, but if you just look through the names of the contents, and look for...

IF IT HAS                                           IT IS

META-INF/APPLICATION.XML                            a J2EE Application
META-INF/EJB-JAR.XML                                an EAR
META-INF/*.(DSA|RSA|SF)                             digitally signed
PORTLET-INF/, PORTLET.XML, or WEB-INF/PORTLET.XML   a PAR
WEB-INF/                                            a WAR
META-INF/MANIFEST.MF                                a Jar

If it looks like a zip file, but has a CATALOG.XML file in it, it is a Shockwave Flash Component.

Etc., etc., etc.

commented

@FerdieBerfle, understood that a binary signature may not always be possible. That does not devalue switching the scanner to Ragel for cases when a binary signature can be used. There can be “if binary signature not detected via Ragel, perform custom signature lookups.”
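A minimal Go sketch of that two-stage flow, assuming a `Matcher` function type that returns an empty string when it does not recognise the input. All names here (`Matcher`, `detect`, `signatureMatch`, `looksLikeXML`) are invented for illustration; `signatureMatch` merely stands in for a generated matcher.

```go
package main

import (
	"bytes"
	"fmt"
)

// Matcher reports a type name, or "" when it does not recognise the input.
type Matcher func(buf []byte) string

// detect tries the fast signature matcher first and only runs the
// slower custom lookups, in order, when no signature was found.
func detect(buf []byte, fast Matcher, fallbacks ...Matcher) string {
	if kind := fast(buf); kind != "" {
		return kind
	}
	for _, m := range fallbacks {
		if kind := m(buf); kind != "" {
			return kind
		}
	}
	return "unknown"
}

// signatureMatch is a stand-in for a Ragel-generated matcher.
func signatureMatch(buf []byte) string {
	if bytes.HasPrefix(buf, []byte("GIF8")) {
		return "gif"
	}
	return ""
}

// looksLikeXML is a hypothetical content-based fallback.
func looksLikeXML(buf []byte) string {
	if bytes.HasPrefix(bytes.TrimSpace(buf), []byte("<?xml")) {
		return "xml"
	}
	return ""
}

func main() {
	fmt.Println(detect([]byte("GIF89a"), signatureMatch, looksLikeXML))
	fmt.Println(detect([]byte("  <?xml version=\"1.0\"?>"), signatureMatch, looksLikeXML))
	fmt.Println(detect([]byte("plain text"), signatureMatch, looksLikeXML))
}
```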

It seems that no one with authority has actually commented on this issue.
I’m closing until they choose to reopen, if/when there’s interest.
Thanks.