platelminto / parse-torrent-title

Extract media information from a torrent-like filename

Multiple matches before title starts leads to empty title

platelminto opened this issue

Most torrent names start with the title, sometimes preceded by a website name, and both cases are handled fine. When there are multiple matches before the title, though, the title comes out empty, because the logic that decides where the title begins is much simpler than the logic that decides where it ends (this is by design: you can't use the same logic for both). Example:

[06] Documentry -BBC - The Ottomans: Europe's Muslim Emperors (2013) eng.ara sub [Etcohod]

One way to fix this would be to change how the title is parsed, reusing what process_excess() does in parse.py. Currently, two markers on the string (title_start and title_end) move closer to the title, which forces one side of the title to be preferred over the other. Instead, we would use match_slices to find the longest unmatched string and assume that's the title (after stripping punctuation).
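As a rough illustration of the "longest unmatched string" idea, here is a minimal sketch. It is not the code in parse.py: the function names are hypothetical, and it assumes match_slices is a list of (start, end) index pairs into the raw name, which may not be how the library stores them.

```python
import string

# Characters trimmed from both ends of a candidate title.
STRIP_CHARS = string.punctuation + string.whitespace


def unmatched_segments(raw, match_slices):
    """Return (start, end) pairs of `raw` not covered by any match slice.

    Assumes `match_slices` is a list of (start, end) pairs (hypothetical shape).
    """
    gaps = []
    cursor = 0
    for start, end in sorted(match_slices):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)  # tolerate overlapping matches
    if cursor < len(raw):
        gaps.append((cursor, len(raw)))
    return gaps


def longest_unmatched_title(raw, match_slices):
    """Pick the longest unmatched segment, stripped of punctuation."""
    candidates = [
        raw[start:end].strip(STRIP_CHARS)
        for start, end in unmatched_segments(raw, match_slices)
    ]
    return max(candidates, key=len, default="")


# Made-up name and made-up matches, purely for illustration.
raw = "[06] Show Name (2013) eng [Group]"
matched = ["[06]", "(2013)", "eng", "[Group]"]
slices = [(raw.index(m), raw.index(m) + len(m)) for m in matched]
print(longest_unmatched_title(raw, slices))  # Show Name
```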

This is risky - the current behaviour of favouring field matching after the title is very effective as very few torrents have more than 1 field (usually website) before the title. Changing this could break more titles than it would fix. In particular, short titles with long unmatched strings would pick that excess over the title itself.

An alternative could be a similar approach, but rather than picking the longest, picking the first (ignoring unmatched segments that are solely punctuation).

Went with the second approach as there are many cases of titles being shorter than unmatched items. Using the first (non-punctuation) non-match is essentially the same as the current method when there are no matches before the title, and now supports multiple matches before it.
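The "first non-punctuation non-match" rule can be sketched roughly as follows. Again this is only an illustration under assumptions, not the actual implementation: first_unmatched_title is a hypothetical name, match_slices is assumed to be a list of (start, end) pairs, and the matches in the usage example are invented rather than taken from the real parser.

```python
import string

# Characters trimmed from both ends of a candidate title.
STRIP_CHARS = string.punctuation + string.whitespace


def first_unmatched_title(raw, match_slices):
    """Return the first unmatched segment that is not solely punctuation.

    Assumes `match_slices` is a list of (start, end) pairs (hypothetical shape).
    """
    cursor = 0
    # An empty sentinel slice at the end lets the loop also see the trailing gap.
    for start, end in sorted(match_slices) + [(len(raw), len(raw))]:
        if start > cursor:
            candidate = raw[cursor:start].strip(STRIP_CHARS)
            if candidate:  # skip gaps such as " -" that are only punctuation
                return candidate
        cursor = max(cursor, end)
    return ""


def slices_for(raw, matched):
    """Helper for the demo: locate each (invented) match in the raw name."""
    return [(raw.index(m), raw.index(m) + len(m)) for m in matched]


# Two matches before the title: the " -" gap between them is skipped.
raw = "[06] -BBC - The Ottomans (2013) eng [Etcohod]"
print(first_unmatched_title(raw, slices_for(raw, ["[06]", "BBC", "(2013)", "eng", "[Etcohod]"])))
# The Ottomans

# No matches before the title: same result as the marker-based behaviour.
raw = "The Ottomans (2013)"
print(first_unmatched_title(raw, slices_for(raw, ["(2013)"])))
# The Ottomans
```

With no matches before the title, the first gap starts at index 0, so the result is the same as moving a start marker forward; with several leading matches, punctuation-only gaps are simply stepped over.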

There isn't really a way to improve this further. Unmatched strings before the 'real' title will themselves be treated as the title, which makes sense: there is no better candidate, since the longest unmatched string is often not the title, and there aren't many other sensible ways to pick it.

This change also impacted how excess is obtained, and generally cleaned up some parts of the code as we now have information about all matches' locations, unmatched parts of the release title, and which parts correspond to which locations.