scttcper / video-filename-parser

Scene release name parser

Home Page:https://video-filename-parser.vercel.app

Ambiguous TV shows such as "Wilfred.US" wreck title parsing

dezza opened this issue

Hello.

Nice lib, but I found one issue that I think needs fixing. I'll gladly help, as long as we can agree on the issue.

For example Wilfred exists as both an AU and US show.

AU (first released, 2007)

https://www.themoviedb.org/tv/3297

US (2011)

https://www.themoviedb.org/tv/39525-wilfred

As a result, the title of the US show is currently parsed as "Wilfred US" rather than "Wilfred".

It would be a safe assumption that a trailing capitalized country code (US|UK|AU|NZ|CA) marks an ambiguous title and narrows it down to the show from that country.

Of course, on rare occasions a title could be something like Toys.R.Us, but it is unlikely that the last word would be fully capitalized. If it is, that's a real corner case not worth optimizing for!

https://scenerules.org/html/2020_WDX_unformatted.html

    19.8) Different shows with the same title produced in different countries must have the ISO 3166-1 alpha 2 country code in  the show name.
        19.8.1) Except for UK shows, which must use UK, not GB.
        19.8.2) This rule does not apply to an original show, only shows that succeed the original.
                e.g. The.Office.S01E01 and The.Office.US.S01E01.
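To make the rule concrete, here is a minimal sketch of how a trailing country tag could be split off a title that has already been parsed. The helper name `splitCountryTag` and the `country` field are hypothetical and not part of video-filename-parser; the tag list is just the one mentioned in this thread. The check is deliberately case-sensitive, so "Us" in "Toys R Us" is left alone.

```javascript
// Hypothetical helper, not part of video-filename-parser.
// Tags taken from this thread; UK is used instead of ISO "GB" per rule 19.8.1.
const COUNTRY_TAGS = new Set(['US', 'UK', 'AU', 'NZ', 'CA'])

function splitCountryTag(title) {
  const words = title.trim().split(/\s+/)
  const last = words[words.length - 1]
  // Case-sensitive lookup: only a fully capitalized last word counts as a tag.
  if (words.length > 1 && COUNTRY_TAGS.has(last)) {
    return { title: words.slice(0, -1).join(' '), country: last }
  }
  return { title: words.join(' '), country: null }
}

console.log(splitCountryTag('The Office US')) // { title: 'The Office', country: 'US' }
console.log(splitCountryTag('The Office'))    // { title: 'The Office', country: null }
console.log(splitCountryTag('Toys R Us'))     // { title: 'Toys R Us', country: null }
```

Keeping the country as a separate field, instead of discarding it, would let a caller pick the right show when looking the title up.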

I've mostly ignored the TV show parsing; if you want to improve it, feel free. I think I fixed a similar issue in movies by looking for the movie year and assuming everything before it was the title. I'm sure something similar can be done for TV.
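A rough sketch of what "something similar" might look like for TV: treat the season/episode marker as the anchor, the way the year is used for movies, and take everything before it as the title. The function name `titleBeforeEpisode`, the marker pattern, and the release name below are made up for illustration; this is not the library's actual parser, and it only handles the SxxEyy form.

```javascript
// Hypothetical helper, not the library's actual parser.
// Matches a separator followed by a marker such as S01E01 (case-insensitive).
const EPISODE_MARKER = /[ ._-]S\d{1,2}E\d{1,3}/i

function titleBeforeEpisode(releaseName) {
  const match = EPISODE_MARKER.exec(releaseName)
  if (match === null) {
    return null // no season/episode marker found
  }
  // Everything before the marker is assumed to be the title.
  return releaseName.slice(0, match.index).replace(/[._]+/g, ' ').trim()
}

console.log(titleBeforeEpisode('Wilfred.US.S01E01.720p')) // Wilfred US
console.log(titleBeforeEpisode('The.Office.S01E01'))      // The Office
console.log(titleBeforeEpisode('Some.Movie.2011.1080p'))  // null
```

Note that the country tag still ends up inside the title here ("Wilfred US"), which is exactly the ambiguity this issue is about.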

I guess there is a small possibility of a corner case, something like:

Food.in.the.US # the country

But then, why would a show title end with "the" (given that the default is to strip the US country code at the end)? That's something you could check for if it ever became a problem, which is unlikely, though the chance is never zero. Assuming anything about titles is flaky at best; there is always the possibility of another weird title.

It kind of sucks that the scene does it like this, because there is no way to tell whether the code is actually part of the title, except for small clues such as the preceding "the" mentioned above.

I wrote some logic for this that I think makes sense. I think you will be able to tell from it how I think the most reasonable way to handle it would be.

If the second-to-last word is not "the", the last word is definitely not referring to an actual country.

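/**
 * Strips a trailing country tag (US, UK, NZ, AU, CA) from a TV show title,
 * unless the word before it is "the" (e.g. "Food in the US").
 * @param {SceneTags} scenetags
 * @returns {SceneTags}
 */
function stripTVShowCountry(scenetags) {
  const lastElement = -1
  const words = scenetags.title.split(' ')
  if (scenetags.type === 'tvshow' &&
      // Never strip a title that consists of a single word
      words.length > 1 &&
      // Anchored so that only a whole word such as "US" matches, not e.g. "BONUS"
      /^(?:US|UK|NZ|AU|CA)$/u.test(words.at(lastElement)) &&
      words.at(lastElement - 1) !== 'the'
  ) {
    scenetags.title = words.slice(0, lastElement).join(' ')
  }
  return scenetags
}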

// Ends with country
console.log("Ends with country")
console.log(stripTVShowCountry({title: 'Wilfred US', type: 'tvshow'}))
console.log(stripTVShowCountry({title: 'Oy mate Crocodile Hunter AU', type: 'tvshow'}))

console.log()

// Ends with an actual country and the second-to-last word is "the", so it is treated as a real title
console.log("Ends with country, next last is 'the'. Concludes its a real title")
console.log(stripTVShowCountry({title: 'Soldiers in the US', type: 'movie'}))
console.log(stripTVShowCountry({title: 'Food in the US', type: 'tvshow'}))
console.log(stripTVShowCountry({title: 'Queen of the UK', type: 'tvshow'}))

Output:

Ends with country
{ title: 'Wilfred', type: 'tvshow' }
{ title: 'Oy mate Crocodile Hunter', type: 'tvshow' }

Ends with country, next last is 'the'. Concludes its a real title
{ title: 'Soldiers in the US', type: 'movie' }
{ title: 'Food in the US', type: 'tvshow' }
{ title: 'Queen of the UK', type: 'tvshow' }