jshttp / etag

Create simple HTTP ETags

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

weak etags can collide (for body <= 1KB)

stuartpb opened this issue · comments

See expressjs/express#2667 - weak hashing is completely inappropriate for etags.

I altered your issue's title because it's wrong; there is no spec violation in using weak hashing for weak ETags. We consider the bytes of the body to be metadata about the request.

There is absolutely a spec violation in using weak hashing for weak etags! What part of "semantically equivalent" do you not understand? Two colliding hashes are not semantically equivalent, and that's all a "weak hash" is - something more prone to hash collisions!

The "weak" part of "weak etags" is in that they might change when they don't have to, not that they might not change when they do have to. ETags have to change when the content changes - that's how caching works.

(This comment has been edited from its original format to look slightly less haggard and insane.)

Clearly we disagree on this point. If you email me, we can always speak in person, which may help; otherwise you may have to use a different eco system that agrees. That entire section you keep quoting is just informational. A speci violation only occurs when you don't do something that is a MUST or do something that is a MUST NOT; everything in that section was just description or SHOULD.

Okay, that's true: the RFC notes around Etag don't say anything about how weak Etag "MUST" work. You could literally reply to every request with Etag: W/"xyzzy" and technically not be in violation of any MUST directives in section 2. That doesn't mean it's fine to ignore the implications in light of everything else the RFC "just informational"ly says.

That doesn't mean ignoring everything else the RFC says about the implications as "just informational" is good practice.

I agree, but at the level of Express, it's the best we can do, as we do not know what is symantically equiv. or not. Our only choice is to not offer ANY Etags by default in Express (and so nothing can be cached by default) or try to provide something out of the box that is fast and helps. This is why we choose the out-of-the-box ETags to use weak hashing, and since they can collide, we mark them as weak (and leave it up to the browser how it's going to treat that). Clearly, we can do better at making the weak ETags collide less.

I still am willing to talk about this with you if you are interested (just email me). It's not always easy to convey both sides through text and you never know what may come out of a phone conversation, so I would be interested if you are.

Ok, so I've done a bit of looking so far and here is what I'm thinking as the first steps:

  1. Migrate from CRC32 (which is just a checksum) to something like xxHash (which is a cryptographically-weak hash); the difference here is that xxHash is actually a hashing function, unlike CRC32. After extensive searching, the lowest bar we can really go is MD5.
  2. Start expanding on #4 , specifically including at least Content-* headers as inputs to the weak ETag; this will at least ensure a change of the type is considered to be semantically different.
  3. Remove "strong ETags" from stats objects all together.

And from Express's point of view:

  1. Provide a way in the express <-> view engine interface for the view engine to calculate ETags in the first place, which can come up with much better semantically-defined tags (like perhaps just using the locals).

These are just initial thoughts right now. The main thing is that different levels see semantics differently; for example the current interface of this module only sees body bytes; expanding our interface to capture res object could help expand our low-level semantic view (though it's still low-level). Absolutely true "semantically weak ETags" can really only be determined by the programmer who is programming the response; Express itself does not stop you from making that decision and setting your own calculated ETag response header. Rather, libraries like these are here to at least give you something.

I don't disagree, seeing the example posted to the Express issue, that it is way too easy for this library to provide a colliding ETag--I mean, adjusting 4 bytes in the right location seems to be a pretty low bar. Now granted, this can be a really annoying issue where a web browser (or an intermediary proxy) may deny you the view of the current page due to sending a conditional request to the server and it matching.

I have updated #11 (comment) to note that we really just have to move to at least MD5, no matter what the slow down is, because it's the right thing to do. I also added a note that we should start throwing if you try and create a "strong ETag" off a stats object, since we cannot guarantee anything about the bytes.

I am struggling really hard on if this should be a major version bump or a minor version bump :/

It's going to be minor; I have decided mainly based on the documented public API: you get an ETags and it's marked weak or not based on an option; the specific structure of the ETags is supposed to be opaque to the consumer of this lib. There is certainly the possibility of tests where the resulting ETag is hard-coded, but I guess eh. We have previously made minor releases where the ETag structure changed and I never heard anything back (though there seems to be more dependent modules now).

Mainly, I do really want the dependents to pick up this valuable change (could really be classified as a "bug fix").