Serial-ATA / lofty-rs

Audio metadata library

ID3v2.4: Parsing multi-valued UTF-16 text fields fails

uklotzde opened this issue · comments

Reproducer

Parsing fails with BadFrameId errors.

Summary

gerbera/gerbera#759

According to http://id3.org/id3v2.4.0-frames, ID3v2.4 uses the null character to separate multiple values, which are allowed for all text information frames (frame IDs beginning with T, like TPE1 and TCOM).
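To make the separator rule concrete, here is a minimal sketch that splits already-decoded frame text on the null character. The function name `split_values` is hypothetical and not part of lofty's API, and the sketch assumes the frame content has already been decoded into a `&str`.

```rust
/// Hypothetical helper (not lofty's API): splits the decoded content of an
/// ID3v2.4 text information frame into its individual values.
fn split_values(decoded: &str) -> Vec<&str> {
    decoded
        // Drop trailing terminators so they do not produce empty trailing values.
        .trim_end_matches('\0')
        // The null character separates the values.
        .split('\0')
        .collect()
}
```

With this sketch, `split_values("Artist A\0Artist B")` yields the two values `"Artist A"` and `"Artist B"`. An empty input yields a single empty value.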

Expected behavior

t.b.d.

Might require API changes to handle those multiple values properly.

A lossy implementation that is compatible with the current API could only read the first, non-empty string and silently discard all subsequent strings. While this might be suitable for an application (Example: Mixxx), it is inappropriate for a general purpose library.
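For comparison, the lossy approach could look roughly like the following sketch. `first_non_empty` is a hypothetical name, and the input is again assumed to be already-decoded text.

```rust
/// Hypothetical helper (not lofty's API): returns only the first non-empty
/// value of a null-separated text frame and ignores everything after it.
fn first_non_empty(decoded: &str) -> Option<&str> {
    decoded.split('\0').find(|value| !value.is_empty())
}
```

The caller has no way to tell whether further values were present, which is why this is only acceptable for a specific application and not for a general purpose library.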

Assets

ID3v2.4 example:
txxx_utf16_multi_value_id3v24.zip

@Serial-ATA v0.16.1 could be released before fixing this bug. It is a known issue that affects all previous versions.

I've looked into it, and the issue is that we simply don't parse UTF-16 values correctly.

The first issue is that it stops on a null terminator no matter what:

[0, 0] => None,

And secondly, the tag you had actually encoded the strings properly, which I've never seen before. Normally a UTF-16 encoded frame has its BOM specified in only one of the values and the rest are just meant to be inferred. Your tag actually has a BOM for every value, which simply isn't handled.

When handling multiple values, we retain all of the null separators, treating the frame content as one big string, and simply split/replace the separators in the background. This means that there shouldn't have to be any API changes; we just have to strip the BOM(s) and (of course) stop halting the reader at the null terminator.
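A rough sketch of that idea, assuming the payload is the raw UTF-16 text that follows the frame's encoding byte. This is not lofty's actual implementation: `decode_utf16_values` is a hypothetical name, and falling back to little-endian when no BOM has been seen at all is an assumption made only for this example.

```rust
/// Hypothetical helper (not lofty's implementation): decodes a UTF-16 text
/// frame payload into its values, accepting a BOM on any (or every) value.
fn decode_utf16_values(payload: &[u8]) -> Option<Vec<String>> {
    // UTF-16 data must consist of whole 16-bit code units.
    if payload.len() % 2 != 0 {
        return None;
    }

    // Work on 2-byte units so a single 0x00 byte inside a unit
    // (e.g. "A" = 41 00 in little-endian) is never taken for a separator.
    let units: Vec<[u8; 2]> = payload.chunks_exact(2).map(|c| [c[0], c[1]]).collect();

    // Assumption for this sketch: little-endian until a BOM says otherwise.
    let mut big_endian = false;
    let mut values = Vec::new();

    // Split on the 16-bit null separator instead of stopping at the first one.
    for segment in units.split(|unit| *unit == [0u8, 0u8]) {
        // Strip a BOM if this value has one; otherwise keep the byte order
        // of the previous value (the "inferred" case).
        let body = match segment.first() {
            Some([0xFF, 0xFE]) => {
                big_endian = false;
                &segment[1..]
            }
            Some([0xFE, 0xFF]) => {
                big_endian = true;
                &segment[1..]
            }
            _ => segment,
        };

        let code_units: Vec<u16> = body
            .iter()
            .map(|&unit| {
                if big_endian {
                    u16::from_be_bytes(unit)
                } else {
                    u16::from_le_bytes(unit)
                }
            })
            .collect();

        // Invalid UTF-16 (e.g. an unpaired surrogate) makes the whole decode fail.
        values.push(String::from_utf16(&code_units).ok()?);
    }

    // A trailing terminator leaves an empty final segment; drop it.
    if values.last().map_or(false, |value| value.is_empty()) {
        values.pop();
    }

    Some(values)
}
```

The same function handles both layouts discussed here: a BOM on the first value only, and a BOM repeated in front of every value.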

The repeated BOM at the start of each substring in the attached example is indeed uncommon and could be considered an error. But an application must have created it somehow, and it is not unlikely that others will stumble over it if lofty is adopted more widely. Unfortunately, I am not aware of the actual origin of this file.