whatwg / mimesniff

MIME Sniffing Standard

Home Page:https://mimesniff.spec.whatwg.org/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

MP3 without ID3 algorithm has various problems

dd8 opened this issue · comments

The algorithm looks like it's based on mp3sniff.c
https://dxr.mozilla.org/mozilla-central/source/toolkit/components/mediasniffer/mp3sniff.c
which was added to fix this issue:
https://bugzilla.mozilla.org/show_bug.cgi?id=862088

The algorithm is described in
https://mimesniff.spec.whatwg.org/#signature-for-mp3-without-id3

Bitfield operator precedence

The bitfield extraction operations like sequence[s + 2] & 0x0c >> 2 should have brackets (sequence[s + 2] & 0x0c) >> 2 (like mp3sniff.c) since in C/C++/C#/Java/JavaScript >> has higher precedence than & so 0x0c >> 2 is evaluated before the &

Step 2 in match-an-mp3-header should read 'or' instead of 'and'

This describes is_mp3() in mp3sniff.c. Step 2 reads "If sequence[s] is not equal to 0xff and sequence[s + 1] & 0xe0 is not equal to 0xe0, return false." It should read "If sequence[s] is not equal to 0xff or sequence[s + 1] & 0xe0 is not equal to 0xe0, return false."

Step 7 looks wrong because s - length is always negative and skipped-bytes >= 0 so the algorithm always returns false at this step.

Step 7 reads "If skipped-bytes is less than 4, or skipped-bytes is greater than s - length, return false."

Confusion over naming

Raw MP3s are structured like this:

[ 4-byte frame header ]
[ variable length frame data ]
[ 4-byte frame header ]
[ variable length frame data ]
…

The "match an mp3 header" and "parse an mp3 frame" computations both operate on the same data structure - the 4-byte MP3 frame header. This would be much clearer if they were named "match an mp3 frame header" and "parse an mp3 frame header". There are also variables named sample-rate which is the same as samplerate-index and samplerate which is not the same as sample-rate.

Confusion over return values

The "match an mp3 header" is equivalent to is_mp3() in mp3sniff.c which returns true/false and "parse an mp3 frame" is equivalent to mp3_parse() which returns an mp3_header struct. All the values used by the compute-an-mp3-frame-size calculation are computed in parse-an-mp3-frame except freq which is computed in match-an-mp3-header (freq was missing from the original commit 998b959 of this algorithm, and was added later in commit d954dab).

Note: Chrome doesn't implement any of this and detects MP3s using two signatures:
"ID3" (for MP3 in ID3)
FF E0 / FF E0 (match raw MP3 frame header - equivalent to step 2 in match-an-mp3-header)

The precision of this algorithm looks odd compared to the other heuristics - it detects MP3s very accurately - and requires 1.5K of input data as a result. The rest of the sniffing specification is expected to produce less accurate results - including MP3 with ID3 (see #69). As mentioned above Chrome just uses a 2-byte signature match for this - Chrome uses a 1024 byte sniffing buffer which was chosen to fit in 1 ethernet packet along with headers: https://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?revision=HEAD&view=markup#l680

Thanks for your comments, I'll have a look and fix the various issues you've outlined.

We need a bit more precision here, because AudioContext.decodeAudioData explicitly say that UA should sniff the bitstream, determine the type, and decode the buffer into PCM. No content-type can be set to help the API.

This reasoning applies to WebM as well.

@padenot - that makes sense. Does that mean there's a problem with MP3 with ID3 detection in the spec? Currently a text file starting "ID3 is easy to detect" is detected as "audio/mpeg" (mp3sniff.c does it much more accurately). Also if it's important to detect MP3s as accurately as this should MP3 detection happen earlier in the algorithm to prevent any of the other heuristics matching? It looks like an MP3 without ID3 will be detected by the FF FE pattern for UTF-16LE BOM in identifying-a-resource-with-an-unknown-mime-type.

I filed WebAudio/web-audio-api#1563 to get more clarity on that method.

Similar to the concerns I mentioned with WebM in #93, I also wonder if this algorithm is too complex for the purpose it is intending to serve. The mimesniff spec is not meant to be replicating the logic on how to process or play an MP3 file; it is merely supposed to be identifying that a file is indeed an MP3 file.

Neither #4 nor the documents it links to give much insight into whether this algorithm is as minimal as it needs to be. In fact, the initial Firefox implementation had multiple hiccups that suggest how complex this algorithm is.

For reference, here are the implementation bugs for Firefox:
https://bugzilla.mozilla.org/show_bug.cgi?id=862088
https://bugzilla.mozilla.org/show_bug.cgi?id=865553

And the current implementation:
https://hg.mozilla.org/mozilla-central/file/tip/toolkit/components/mediasniffer/mp3sniff.c

This document suggests that there are very few pieces of information that signal "MPEG Audio Layer III":
http://www.mp3-tech.org/programmer/frame_header.html

How much parsing of "MPEG" do we really have to do to identify that "Audio Layer III" exists?

Come to think of it, the resulting MIME type is "audio/mpeg". We don't actually have to identify the "Layer III" part at all.

Come to think of it, the resulting MIME type is "audio/mpeg". We don't actually have to identify the "Layer III" part at all.

Well, hmm...

This kind of comes down to what the point of the MIME sniffing spec is. It can't be purely to turn byte streams into MIME types, can it? Because the MIME types aren't directly what anyone cares about.

Instead I think the MIME types are being used as a proxy for various decisions. Probably most notably, "what decoder to use" when trying to actually present the user with this content.

So I was kind of under the impression you'd use a different decoder for MP3 than for other MPEG audio formats, and so it might be worth differentiating?

Come to think of it, the resulting MIME type is "audio/mpeg". We don't actually have to identify the "Layer III" part at all.

Well, hmm...

This kind of comes down to what the point of the MIME sniffing spec is. It can't be purely to turn byte streams into MIME types, can it? Because the MIME types aren't directly what anyone cares about.

Instead I think the MIME types are being used as a proxy for various decisions. Probably most notably, "what decoder to use" when trying to actually present the user with this content.

The introduction (mostly written by @abarth) suggests otherwise.

My opinion is that once you identify what type of resource something is, how to actually deal with that resource is outside the scope of this spec (usually because another spec already exists). It's why, for example, we documented the idiosyncrasies of GIF on the wiki instead of here.

If this point is contentious, though, let's spin off a separate issue for discussion.

So I was kind of under the impression you'd use a different decoder for MP3 than for other MPEG audio formats, and so it might be worth differentiating?

For MPEG audio in particular, this point may be moot. My naive understanding is that MP1, MP2, and MP3 have significant enough similarities that they could be decoded with the same decoder.

There's two places where we have a need for sniffing in the web platform as I understand it:

  1. Navigation. Here the point is to determine whether it's HTML, plain text, media, PDF, or a download. For media there is of course some decision about what decoder to use at some point, but the most important decision here is whether it will be send to the decoder at all.
  2. Individual endpoints. Here it's mostly about documenting what decoders accept and should reject. We don't really care whether an implementation uses a single decoder for all of media or has separate pipelines for various media formats. That's indeed the realm of other specifications that define what a byte stream results in. (And note that often there is a step before invoking this in the individual endpoints where the MIME type is checked first. E.g., image/svg+xml is handled in a special manner in the image endpoint pipeline.)

And there will be a third: https://github.com/annevk/orb.

That seems sufficient enough to spin off #152.

I filed WebAudio/web-audio-api#1563 to get more clarity on that method.

Looking through that issue and what became of it, I've concluded that the current "MP3 without ID3" algorithm is misplaced, confusing, and potentially harmful. We should not be sniffing MP3 data so deeply in this context.