seanap / Audiobooks.bundle

Audiobook metadata agent for Plex

How to parse copyright year

rabelux opened this issue · comments

I'm getting an error when parsing the copyright line of this book:
©Knaus Verlag (P)2002 Mango Studios Köln

The error is AttributeError: 'NoneType' object has no attribute 'group', raised at line 674 while executing helper.date = re.match(".?(\d{4}).*", cstring).group(1)

I had a look at the code and wanted to write a fix, but I don't understand the cases you're trying to catch.
Maybe we could collect different examples and their expected output?

As far as I understand, you're stripping the string down to the part before (P) and extracting the date from that part only.
What speaks against matching the first 4-digit number in the whole copyright string?
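To make the failure concrete, here is a minimal Python sketch that replays the two steps on the copyright line from this report. The variable names are only for illustration; the patterns are the ones quoted in the traceback and in the existing code. As far as I can tell, re.match anchors at the start of the string and .? allows at most one character before the digits, so the result is None whenever the year does not sit right at the beginning.

```python
# -*- coding: utf-8 -*-
import re

# Copyright line from the report (non-ASCII characters written as escapes).
cstring = u"\xa9Knaus Verlag (P)2002 Mango Studios K\xf6ln"

# Step 1 of the existing logic: keep only the part before (P).
before_p = re.match(r"(.*)\(P\).*", cstring).group(1)

# Step 2: the anchored pattern only succeeds if four digits start
# at position 0 or 1 of the remaining text.
anchored = re.match(r".?(\d{4}).*", before_p)

# Alternative: search the whole string for the first run of four digits.
anywhere = re.search(r"\d{4}", cstring)

print(anchored)          # None, so calling .group(1) raises AttributeError
print(anywhere.group())  # 2002
```

With this input the part before (P) contains no digits at all, so any approach that only looks there cannot find a year.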

This code came from unending's fork: Unending/Audiobooks.bundle@85694cb

I haven't personally tested it. I can try to help a bit later. The regex is saying something along the lines of "match 4 digits in a row from the given string". regex101 is a great tool to learn more about regexes. Since copyrights only contain years, all it needs to match is those 4 digits.

My code starting in line 658 currently looks like this:

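        if cstring:
            if "Public Domain" in cstring:
                # Public domain: use the year directly after (P)
                match = re.match(r".*\(P\)(\d{4})", cstring)
                if match:
                    helper.date = match.group(1)
            else:
                # Drop a leading copyright sign, then take the first 4-digit run
                if cstring.startswith(u'\xA9'):
                    cstring = cstring[1:]
                match = re.search(r'\d{4}', cstring)
                if match:
                    helper.date = match.group()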
                #if "(P)" in cstring:
                #    cstring = re.match("(.*)\(P\).*", cstring).group(1)
                #if ";" in cstring:
                #    helper.date = str(
                #        min(
                #            [int(i) for i in cstring.split() if i.isdigit()]
                #        )
                #    )
                #else:
                #    helper.date = re.match(".?(\d{4}).*", cstring).group(1)

It matches the first 4 digits it finds after the (c).
But I see what Unending did there. He tried to prioritize, whereas I don't see any reason to do that at this point.
I think (P) stands for the sound recording copyright and should be treated as equivalent to (c).

I'm just guessing here so everybody is invited to enlighten me.

Audible isn't very consistent, but the most common usage I've noticed is that (C) is the original copyright year of the work and (P) is the copyright year of the specific publication. See here for reference: https://www.audible.com/pd/East-of-Eden-Audiobook/B00546SXO0

I personally prioritize the (C) year, as I think that sorting by year or filtering by decade works better when the original copyright year is used, but the (P) year is also important and should be equivalent to the release date. Both dates need to be used, but I don't know of any player that takes advantage of them. Ideally the ID3 tags should be:
ORIGYEAR = (C) year
YEAR = (P) year
RELEASETIME = (P) date
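A rough sketch of how both years could be pulled out of one copyright string, in Python. The function name split_copyright_years and the sample string are hypothetical, and the sketch assumes the (C) part always comes before the first (P), which Audible may not guarantee.

```python
import re


def split_copyright_years(cstring):
    """Return (c_year, p_year); either one is None if no year was found."""
    # Assumption: everything before the first "(P)" belongs to the (C) part.
    c_part, sep, p_part = cstring.partition("(P)")
    c_match = re.search(r"\d{4}", c_part)
    # Only look for a (P) year if the string actually contained "(P)".
    p_match = re.search(r"\d{4}", p_part) if sep else None
    return (
        c_match.group() if c_match else None,
        p_match.group() if p_match else None,
    )


# Hypothetical sample string, not copied from Audible.
c_year, p_year = split_copyright_years(
    u"\xa91952 Example Author (P)2011 Example Audio"
)
tags = {"ORIGYEAR": c_year, "YEAR": p_year}
```

For the string from the original report this would give no (C) year and 2002 as the (P) year, so the caller still has to decide what to fall back to.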

Regarding the example you posted: Would you prefer to have the year set to 1952, or 1980?

As we only have one year to set in Plex, I'd suggest simplifying the copyright parsing and doing it in the following order:
Take the first year that can be found, unless there is a ; in the string, in which case take the first year after the ;
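The rule above could be sketched like this in Python. The function name pick_year and the second sample string are hypothetical, and returning None when nothing is found after the ; is my own assumption about the edge case, not something settled in this thread.

```python
import re


def pick_year(cstring):
    """First 4-digit year, or the first one after ';' if a ';' is present."""
    if ";" in cstring:
        # Only consider the text after the first semicolon.
        cstring = cstring.split(";", 1)[1]
    match = re.search(r"\d{4}", cstring)
    # Assumption: return None instead of raising when no year is found.
    return match.group() if match else None


# String from the original report: no ';', so the first year wins.
first = pick_year(u"\xa9Knaus Verlag (P)2002 Mango Studios K\xf6ln")

# Hypothetical string with a ';' to show the second branch.
after_semicolon = pick_year(u"\xa91952 Example Author; 1980 Example Estate (P)2011 Example Audio")
```

Here first would be 2002 and after_semicolon would be 1980, and the caller can keep the previous date whenever None comes back.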

For (C) it should be the original year, so 1952. The actual Plex tag is "Release Date", so I think (P) 2011 should be the year/date actually imported into Plex.

The part of the code I'm talking about is only called if the preferences are set to "use copyright year instead of date published".
So in that case it should be correct to use the first year found - unless the setting has to be renamed or changed to a dropdown list.