MP3 without ID3 algorithm has various problems

dd8 opened this issue 6 years ago

The algorithm looks like it's based on mp3sniff.c
https://dxr.mozilla.org/mozilla-central/source/toolkit/components/mediasniffer/mp3sniff.c
which was added to fix this issue:
https://bugzilla.mozilla.org/show_bug.cgi?id=862088

The algorithm is described in
https://mimesniff.spec.whatwg.org/#signature-for-mp3-without-id3

Bitfield operator precedence

Bitfield extraction operations such as sequence[s + 2] & 0x0c >> 2 need brackets: (sequence[s + 2] & 0x0c) >> 2, as in mp3sniff.c. In C/C++/C#/Java/JavaScript, >> has higher precedence than &, so without brackets 0x0c >> 2 is evaluated before the &.
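
To make the precedence point concrete, here is a minimal Python sketch. Python is used only for illustration; its >> also binds tighter than &, as in the languages listed above. The function names are hypothetical and do not come from the spec or from mp3sniff.c.

```python
def field_unbracketed(byte: int) -> int:
    # Parsed as byte & (0x0c >> 2), i.e. byte & 0x03:
    # this reads the two lowest bits, not bits 3-2.
    return byte & 0x0c >> 2


def field_bracketed(byte: int) -> int:
    # Mask bits 3-2 first, then shift them down to bits 1-0.
    return (byte & 0x0c) >> 2


# 0x08 has only bit 3 set, so the intended 2-bit field is 0b10 == 2.
print(field_unbracketed(0x08))  # 0
print(field_bracketed(0x08))    # 2
```

The two forms agree only by coincidence, for bytes where bits 3-2 happen to equal bits 1-0.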

Step 2 in match-an-mp3-header should read 'or' instead of 'and'

This describes is_mp3() in mp3sniff.c. Step 2 reads "If sequence[s] is not equal to 0xff and sequence[s + 1] & 0xe0 is not equal to 0xe0, return false." It should read "If sequence[s] is not equal to 0xff or sequence[s + 1] & 0xe0 is not equal to 0xe0, return false."
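
A small Python sketch of the difference between the two wordings. It models only the step 2 check, not the rest of match-an-mp3-header; the function names are hypothetical.

```python
def step2_as_written(seq: bytes, s: int = 0) -> bool:
    # "and" wording: rejects only when BOTH comparisons fail.
    if seq[s] != 0xFF and (seq[s + 1] & 0xE0) != 0xE0:
        return False
    return True


def step2_corrected(seq: bytes, s: int = 0) -> bool:
    # "or" wording: rejects when EITHER comparison fails.
    if seq[s] != 0xFF or (seq[s + 1] & 0xE0) != 0xE0:
        return False
    return True


no_sync_bits = bytes([0xFF, 0x00])   # first byte fine, sync bits missing
no_ff_byte = bytes([0x00, 0xE0])     # sync bits present, first byte wrong

print(step2_as_written(no_sync_bits), step2_corrected(no_sync_bits))  # True False
print(step2_as_written(no_ff_byte), step2_corrected(no_ff_byte))      # True False
```

With the "and" wording, a sequence passes as long as one of the two comparisons succeeds, which is much weaker than requiring the full sync pattern.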

Step 7 looks wrong: s - length is always negative and skipped-bytes >= 0, so the algorithm always returns false at this step.

Step 7 reads "If skipped-bytes is less than 4, or skipped-bytes is greater than s - length, return false."
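
The condition can be checked mechanically. The sketch below assumes, as the issue does, that s is an offset inside the sequence (so 0 <= s < length) and that skipped-bytes is never negative; the function name is hypothetical.

```python
def step7_returns_false(skipped_bytes: int, s: int, length: int) -> bool:
    # Literal transcription of step 7 as quoted above.
    return skipped_bytes < 4 or skipped_bytes > s - length


length = 16
# Every offset inside the sequence, every non-negative skip count up to 63.
always_rejects = all(
    step7_returns_false(skipped, s, length)
    for s in range(length)
    for skipped in range(64)
)
print(always_rejects)  # True
```

Because s - length is at most -1 under these assumptions, the second comparison holds for every non-negative skipped-bytes, so the step can never fall through.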

Confusion over naming

Raw MP3s are structured like this:

[ 4-byte frame header ]
[ variable length frame data ]
[ 4-byte frame header ]
[ variable length frame data ]
…

The "match an mp3 header" and "parse an mp3 frame" computations both operate on the same data structure: the 4-byte MP3 frame header. This would be much clearer if they were named "match an mp3 frame header" and "parse an mp3 frame header". The variable names are also confusing: sample-rate is the same as samplerate-index, while samplerate is not the same as sample-rate.

Confusion over return values

The "match an mp3 header" is equivalent to is_mp3() in mp3sniff.c which returns true/false and "parse an mp3 frame" is equivalent to mp3_parse() which returns an mp3_header struct. All the values used by the compute-an-mp3-frame-size calculation are computed in parse-an-mp3-frame except freq which is computed in match-an-mp3-header (freq was missing from the original commit 998b959 of this algorithm, and was added later in commit d954dab).
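
As an illustration of a single parse step that returns every header field in one value, here is a hedged Python sketch. The bit positions follow the commonly documented MPEG audio frame header layout; the class and function names are hypothetical and are not taken from the spec or from mp3sniff.c, and lookup tables for bitrate and sample rate are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mp3FrameHeader:
    version: int           # 2-bit MPEG version ID
    layer: int             # 2-bit layer description
    bitrate_index: int     # 4-bit index into a bitrate table
    samplerate_index: int  # 2-bit index into a sample-rate table
    padding: int           # 1-bit padding flag


def parse_mp3_frame_header(seq: bytes, s: int = 0) -> Mp3FrameHeader:
    # Assumes seq[s:s + 4] has already passed the sync check.
    return Mp3FrameHeader(
        version=(seq[s + 1] & 0x18) >> 3,
        layer=(seq[s + 1] & 0x06) >> 1,
        bitrate_index=(seq[s + 2] & 0xF0) >> 4,
        samplerate_index=(seq[s + 2] & 0x0C) >> 2,
        padding=(seq[s + 2] & 0x02) >> 1,
    )


header = parse_mp3_frame_header(bytes([0xFF, 0xFB, 0x90, 0x64]))
print(header.version, header.layer, header.bitrate_index,
      header.samplerate_index, header.padding)  # 3 1 9 0 0
```

Returning one structure means a frame-size computation would not need to pull any value, such as freq, from a different algorithm.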

Note: Chrome doesn't implement any of this and detects MP3s using two signatures:
"ID3" (for MP3 in ID3)
FF E0 / FF E0 (match raw MP3 frame header - equivalent to step 2 in match-an-mp3-header)
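
A rough sketch of this kind of signature matching, reading "FF E0 / FF E0" as a byte pattern plus a mask of the same length (an assumption). This is not Chrome's code; the helper names are hypothetical.

```python
def matches_masked(data: bytes, pattern: bytes, mask: bytes) -> bool:
    # True if the first len(pattern) bytes equal the pattern under the mask.
    if len(data) < len(pattern):
        return False
    return all((d & m) == p for d, p, m in zip(data, pattern, mask))


def looks_like_mp3(data: bytes) -> bool:
    # Signature 1: the literal ASCII tag "ID3".
    if data.startswith(b"ID3"):
        return True
    # Signature 2: 0xFF followed by a byte whose top three bits are set.
    return matches_masked(data, b"\xff\xe0", b"\xff\xe0")


print(looks_like_mp3(b"ID3\x03\x00"))                    # True
print(looks_like_mp3(bytes([0xFF, 0xFB, 0x90, 0x64])))   # True
print(looks_like_mp3(bytes([0xFF, 0x00])))               # False
```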

Anne van Kesteren · Answer 1 · Thu Apr 12 2018 17:05:11 GMT+0800 (China Standard Time)

cc @padenot

Mark Rogers · Answer 2 · Thu Apr 12 2018 17:53:24 GMT+0800 (China Standard Time)

The precision of this algorithm looks odd compared to the other heuristics - it detects MP3s very accurately - and requires 1.5K of input data as a result. The rest of the sniffing specification is expected to produce less accurate results - including MP3 with ID3 (see #69). As mentioned above Chrome just uses a 2-byte signature match for this - Chrome uses a 1024 byte sniffing buffer which was chosen to fit in 1 ethernet packet along with headers: https://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?revision=HEAD&view=markup#l680

Paul Adenot · Answer 3 · Thu Apr 12 2018 19:07:16 GMT+0800 (China Standard Time)

Thanks for your comments, I'll have a look and fix the various issues you've outlined.

We need a bit more precision here, because AudioContext.decodeAudioData explicitly says that the UA should sniff the bitstream, determine the type, and decode the buffer into PCM. No content-type can be set to help the API.

This reasoning applies to WebM as well.

Mark Rogers · Answer 4 · Thu Apr 12 2018 19:14:04 GMT+0800 (China Standard Time)

@padenot - that makes sense. Does that mean there's a problem with MP3 with ID3 detection in the spec? Currently a text file starting with "ID3 is easy to detect" is detected as "audio/mpeg" (mp3sniff.c does it much more accurately). Also, if it's important to detect MP3s as accurately as this, should MP3 detection happen earlier in the algorithm to prevent any of the other heuristics from matching? It looks like an MP3 without ID3 can be caught first by the FF FE pattern for the UTF-16LE BOM in identifying-a-resource-with-an-unknown-mime-type.
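
Both overlaps can be reproduced with prefix checks alone. The sketch below is not the spec's algorithm: the three helpers are hypothetical and each models a single signature in isolation.

```python
UTF16LE_BOM = b"\xff\xfe"


def starts_with_id3(data: bytes) -> bool:
    # Prefix-only check for the "ID3" tag.
    return data.startswith(b"ID3")


def starts_with_utf16le_bom(data: bytes) -> bool:
    return data.startswith(UTF16LE_BOM)


def has_mp3_sync(data: bytes) -> bool:
    # 0xFF followed by a byte whose top three bits are set.
    return len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0


text = b"ID3 is easy to detect"
frame = bytes([0xFF, 0xFE, 0x90, 0x64])  # sync pattern that also begins with FF FE

print(starts_with_id3(text))                                # True
print(starts_with_utf16le_bom(frame), has_mp3_sync(frame))  # True True
```

Whichever check runs first decides the outcome for such inputs, which is why the ordering of the heuristics matters.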

Anne van Kesteren · Answer 5 · Thu Apr 12 2018 19:14:37 GMT+0800 (China Standard Time)

I filed WebAudio/web-audio-api#1563 to get more clarity on that method.

Gordon P. Hemsley · Answer 6 · Mon Aug 30 2021 06:20:47 GMT+0800 (China Standard Time)

Similar to the concerns I mentioned with WebM in #93, I also wonder if this algorithm is too complex for the purpose it is intending to serve. The mimesniff spec is not meant to be replicating the logic on how to process or play an MP3 file; it is merely supposed to be identifying that a file is indeed an MP3 file.

Neither #4 nor the documents it links to give much insight into whether this algorithm is as minimal as it needs to be. In fact, the initial Firefox implementation had multiple hiccups that suggest how complex this algorithm is.

For reference, here are the implementation bugs for Firefox:
https://bugzilla.mozilla.org/show_bug.cgi?id=862088
https://bugzilla.mozilla.org/show_bug.cgi?id=865553

And the current implementation:
https://hg.mozilla.org/mozilla-central/file/tip/toolkit/components/mediasniffer/mp3sniff.c

This document suggests that there are very few pieces of information that signal "MPEG Audio Layer III":
http://www.mp3-tech.org/programmer/frame_header.html

How much parsing of "MPEG" do we really have to do to identify that "Audio Layer III" exists?

Gordon P. Hemsley · Answer 7 · Mon Aug 30 2021 06:35:13 GMT+0800 (China Standard Time)

Come to think of it, the resulting MIME type is "audio/mpeg". We don't actually have to identify the "Layer III" part at all.

Domenic Denicola · Answer 8 · Mon Aug 30 2021 23:47:19 GMT+0800 (China Standard Time)

> Come to think of it, the resulting MIME type is "audio/mpeg". We don't actually have to identify the "Layer III" part at all.

Well, hmm...

This kind of comes down to what the point of the MIME sniffing spec is. It can't be purely to turn byte streams into MIME types, can it? Because the MIME types aren't directly what anyone cares about.

Instead I think the MIME types are being used as a proxy for various decisions. Probably most notably, "what decoder to use" when trying to actually present the user with this content.

So I was kind of under the impression you'd use a different decoder for MP3 than for other MPEG audio formats, and so it might be worth differentiating?

Gordon P. Hemsley · Answer 9 · Tue Aug 31 2021 10:44:44 GMT+0800 (China Standard Time)

> > Come to think of it, the resulting MIME type is "audio/mpeg". We don't actually have to identify the "Layer III" part at all.

> Well, hmm...

> This kind of comes down to what the point of the MIME sniffing spec is. It can't be purely to turn byte streams into MIME types, can it? Because the MIME types aren't directly what anyone cares about.

> Instead I think the MIME types are being used as a proxy for various decisions. Probably most notably, "what decoder to use" when trying to actually present the user with this content.

The introduction (mostly written by @abarth) suggests otherwise.

My opinion is that once you identify what type of resource something is, how to actually deal with that resource is outside the scope of this spec (usually because another spec already exists). It's why, for example, we documented the idiosyncrasies of GIF on the wiki instead of here.

If this point is contentious, though, let's spin off a separate issue for discussion.

> So I was kind of under the impression you'd use a different decoder for MP3 than for other MPEG audio formats, and so it might be worth differentiating?

For MPEG audio in particular, this point may be moot. My naive understanding is that MP1, MP2, and MP3 have significant enough similarities that they could be decoded with the same decoder.

Anne van Kesteren · Answer 10 · Fri Sep 03 2021 16:26:24 GMT+0800 (China Standard Time)

There are two places where we have a need for sniffing in the web platform as I understand it:

Navigation. Here the point is to determine whether it's HTML, plain text, media, PDF, or a download. For media there is of course some decision about what decoder to use at some point, but the most important decision here is whether it will be sent to the decoder at all.
Individual endpoints. Here it's mostly about documenting what decoders accept and should reject. We don't really care whether an implementation uses a single decoder for all of media or has separate pipelines for various media formats. That's indeed the realm of other specifications that define what a byte stream results in. (And note that often there is a step before invoking this in the individual endpoints where the MIME type is checked first. E.g., image/svg+xml is handled in a special manner in the image endpoint pipeline.)

And there will be a third: https://github.com/annevk/orb.

Gordon P. Hemsley · Answer 11 · Fri Sep 03 2021 19:54:02 GMT+0800 (China Standard Time)

That seems sufficient to spin off #152.

Gordon P. Hemsley · Answer 12 · Fri Sep 10 2021 05:55:39 GMT+0800 (China Standard Time)

> I filed WebAudio/web-audio-api#1563 to get more clarity on that method.

Looking through that issue and what became of it, I've concluded that the current "MP3 without ID3" algorithm is misplaced, confusing, and potentially harmful. We should not be sniffing MP3 data so deeply in this context.