regosen / get_cover_art

Batch cover art downloader and embedder for audio files

Improving search results

classicjazz opened this issue · comments

I ran get_cover_art on my entire music collection and noted the following by comparing error logs with manual searches on Apple Music's web site:

Searches are very literal (and, thus, subject to failure):

  • Punctuation differences in either the artist or album fields cause get_cover_art searches to fail: for example, colons instead of dashes (or vice versa), the presence or absence of periods or commas, or # signs.
  • "and" instead of "&" and vice versa
  • Articles in artist or album tags like "a" or "the" may differ, causing searches to fail. For example, "The Art of Noise", "The Vitamin String Quartet", "The Fugees", "The Raspberries", "The Thompson Twins", "The Carpenters".
  • Roman vs. Arabic numerals
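
To make the first three points concrete, here is a minimal sketch of the kind of normalization that would let such tags compare equal. It is not get_cover_art's actual code: the function name `normalize_for_search` and the article list are assumptions made up for illustration.

```python
import re

# Hypothetical article list; a real tool might need more (or per-language) entries.
_ARTICLES = {"a", "an", "the"}


def normalize_for_search(text: str) -> str:
    """Return a loose comparison key for an artist or album tag.

    Illustrative only: lowercases, treats "&" as "and", replaces
    punctuation with spaces, and drops a single leading article.
    """
    text = text.lower().replace("&", " and ")
    # Anything that is not a word character or whitespace becomes a space,
    # so "Vol. 2: Hits" and "Vol 2 - Hits" end up with the same words.
    text = re.sub(r"[^\w\s]", " ", text)
    words = text.split()
    # Only strip the article when something else remains after it.
    if len(words) > 1 and words[0] in _ARTICLES:
        words = words[1:]
    return " ".join(words)


print(normalize_for_search("The Art of Noise"))   # art of noise
print(normalize_for_search("Simon & Garfunkel"))  # simon and garfunkel
print(normalize_for_search("Vol. 2: Hits"))       # vol 2 hits
```

Comparing normalized keys on both sides (the local tag and the search result) keeps the original tags untouched while making the match less literal.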

Thanks for summarizing your findings. This looks like several different issues / requests rolled into one, so I would recommend the following:

  • splitting this up into one issue for each ask (e.g. one issue for more powerful exclusion strings, one issue to handle inverted names, and so on).
  • providing examples in the errors (e.g. the artist / album you had where the "a" or "the" didn't match up)

That will help me and others investigate and address these more easily.

Thank you. I split it into four issues.

Ok, I just pushed a new version of get_cover_art (v 1.4.4) that should address all these, except roman numerals. That one would be really risky (could cause unintentional matches). I did add a TODO comment in the code about revisiting roman numerals in the future. Let me know what you think.

Thank you. With your changes, I re-ran get_cover_art against my iTunes collection and it found over 300 incremental matches.

Note: I believe you need to delete skip_artwork.txt to enable get_cover_art to find new artwork following your code changes.

In the meantime, I took a stab at roman numerals vs regular numbers in this PR (#13), but it's actually causing previous matches to fail, because the word "I" changes to "1" everywhere, "mix" changes to "1009", etc. It needs to be rethought such that we can search correctly and efficiently.
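
For anyone curious why that happens, here is a small sketch that reproduces the side effect. It is not the code from the PR; `roman_to_int` and `naive_convert` are hypothetical names, and the converter does no validation of whether a word is a well-formed numeral.

```python
import re

_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(word: str) -> int:
    """Convert a lowercase string of Roman numeral letters to an integer."""
    total = 0
    # Pair each letter with the one after it (a space after the last letter).
    for ch, nxt in zip(word, word[1:] + " "):
        value = _VALUES[ch]
        # A smaller value before a larger one is subtracted (e.g. "iv" is 4).
        total += -value if _VALUES.get(nxt, 0) > value else value
    return total


def naive_convert(text: str) -> str:
    """Replace every word made up only of Roman numeral letters."""
    return re.sub(
        r"\b[ivxlcdm]+\b",
        lambda m: str(roman_to_int(m.group())),
        text.lower(),
    )


print(naive_convert("Led Zeppelin IV"))  # led zeppelin 4
print(naive_convert("I Love the Mix"))   # 1 love the 1009
```

The first title converts as intended, but in the second the ordinary words "I" and "mix" are rewritten too, so a title that used to match no longer does.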

A. Could search iTunes for each variation, but that could result in excessive searches, and I'm not sure how to correctly choose from both result lists.
B. Could simply strip out the numerals when searching iTunes altogether, but then you could have too many unrelated matches.

Open to other suggestions.

Are you normalizing all artists and album names regardless of whether they match or not?

Assuming so, wouldn't at least part of the issue with Roman numerals (and likely artist names with commas) be solved by only swapping Arabic for Roman numerals or inverting the names if there wasn't a prior match?

Admittedly, all of this is tricky because (1) there is so much variability in naming to begin with; (2) there is so much bad data from music databases over the years; (3) Apple's naming isn't necessarily the correct one; and (4) users have their own preferences about music naming, regardless.

Ah, that's a good idea: matching without the conversion first, and only using roman numeral conversion as a fallback to minimize the side effects.
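
Roughly, the fallback idea could look like the sketch below. This is an illustration under assumptions, not what shipped: `search_with_fallback` is a hypothetical name, `search` stands in for whatever performs the real lookup, and the numeral table is deliberately limited to II through X so that the word "I" is never converted.

```python
import re
from typing import Callable, Optional

# Deliberately small, hypothetical table: "i" is left out on purpose.
_ROMAN_TO_ARABIC = {
    "ii": "2", "iii": "3", "iv": "4", "v": "5", "vi": "6",
    "vii": "7", "viii": "8", "ix": "9", "x": "10",
}


def convert_numerals(text: str) -> str:
    """Replace whole words found in the table with Arabic digits."""
    def swap(match: "re.Match[str]") -> str:
        word = match.group()
        # Words not in the table are returned unchanged.
        return _ROMAN_TO_ARABIC.get(word.lower(), word)
    return re.sub(r"\b\w+\b", swap, text)


def search_with_fallback(
    album: str, search: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Try the album name as-is; convert numerals only if that fails."""
    result = search(album)
    if result is not None:
        return result
    converted = convert_numerals(album)
    # Skip the second search when conversion changed nothing.
    if converted != album:
        return search(converted)
    return None


# Stand-in catalog for demonstration; a real search would query a service.
catalog = {"I Love the Mix": "art-a", "Greatest Hits 2": "art-b"}
print(search_with_fallback("I Love the Mix", catalog.get))    # art-a
print(search_with_fallback("Greatest Hits II", catalog.get))  # art-b
```

Because the unconverted name is tried first, titles that already match are never affected by the conversion, and the extra search only happens for names that would otherwise fail.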

Ok, I just deployed a new version with my changes (1.4.5), please re-test at your convenience.

There were 13 incremental matches from 1.4.4 to 1.4.5 (after deleting skip_artwork.txt).

That's great! Do you think this issue can be closed out now?

Closing this issue. Thank you!