yt-dlp / yt-dlp

A feature-rich command-line audio/video downloader

Home Page:https://discord.gg/H5MNcFW63r

[Help needed] Merging pending changes from youtube-dl

pukkandan opened this issue · comments

Upstream commits that have not been merged


Why not copy as-is from youtube-dl?

There were some modifications made to these extractors in youtube-dlc. I don't want to blindly remove those changes. That is why someone who understands the inner workings of these extractors needs to resolve the conflicts correctly.

For example, let's consider the youtube extractor. This fork supports search URLs, can download multiple feed pages, automatically redirects channel pages to \video, has better age gate bypass etc. If we were to just copy the code from youtube-dl when they make any change, we could lose these.

I understand the youtube extractor well enough to pick the best of both worlds when I do the merge. But for me, the same isn't true for the above-mentioned extractors.

commented

@pukkandan the archive.org extractor is completely different from the youtube-dl or dlc version because it comes from here ytdl-org/youtube-dl#27156

so the other commits are useless. The PR I linked is an almost complete rewrite of the extractor that adds support for more media types, playlists, etc.

the archive.org extractor is completely different from the youtube-dl or dlc version because it comes from here ytdl-org/youtube-dl#27156

Yes, I'm aware. I am the one who merged it after all :D

But since youtube-dl devs didn't merge it and instead made their own changes to the extractor, I assumed that the changes they made must have some merit. But if our version can do everything youtube-dl's version can, then great

commented

all the tests in the extractor worked fine and I tested a couple of other random pages and they worked as well.

There's also this extensive list of new extractors written but not merged into youtube-dl (which may make more sense as a new issue, but this seemed related)

ytdl-org/youtube-dl#28054

commented

@pukkandan The TMZ extractor is in a similar situation to the archive.org extractor.

They are different, but all the tests of the one in here pass, and it also works with the URLs added in the tests of the other version.

This one has a single class handling everything. The other has two classes managing different types of website pages. Both versions of the extractor work.

Considering yt-dlp's tmz extractor is tiny, that is impressive.

@nixxo I really appreciate you taking time to go through and test these extractors :D

commented

@pukkandan the stitcher extractor merge failed because you missed one previous commit that added support for shows, not only single episodes.

I made a PR with the two commits from youtube-dl #175

Something strange about archive.org: The archived copies of videos' webpages sometimes have usable links to manifests, thumbnails, subtitles, etc. The links need to have the prefix http://web.archive.org/web/[0-9]+ removed from them. Even if a webpage has been deleted or updated, the actual video sometimes can still be accessed on the original server. Note that this is different than videos hosted by archive.org.
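The prefix removal described above could be sketched roughly as follows. This is only an illustration: the helper name `strip_wayback_prefix` is made up, and it assumes the archived link has exactly the form `http(s)://web.archive.org/web/<digits>/<original URL>`.

```python
import re

# Assumption: the archived link is the original URL prefixed with
# http(s)://web.archive.org/web/<digits>/ and nothing else.
_WAYBACK_PREFIX = re.compile(r'^https?://web\.archive\.org/web/[0-9]+/')


def strip_wayback_prefix(url):
    """Hypothetical helper: return the original URL hidden inside an archived link.

    URLs that do not start with the assumed prefix are returned unchanged.
    """
    return _WAYBACK_PREFIX.sub('', url, count=1)


archived = 'http://web.archive.org/web/20210101000000/https://example.com/video/master.m3u8'
print(strip_wayback_prefix(archived))
```

Whether the recovered link still resolves depends on the original server, as noted above.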

commented

@2ShedsJackson can you provide a url that gives this behaviour?

@2ShedsJackson can you provide a url that gives this behaviour?

The nightly opera streams from https://www.metopera.org/ sometimes still work for a couple of days after they've been replaced.

@Ashish0804 Could you have a look at the differences between your NDR implementation 23dd2d9 and youtube-dl's ytdl-org/youtube-dl#30531?

Related: #2337

commented

The major difference is in the NDRIE._extract_embed() method, whose signature was also changed. I don't recall exactly, but I'm not sure that the yt-dlp target still exists; if it did, I would have included it. The pages that the PR was fixing use the sophoraID scheme and that may now be all of them (eg #2337 (comment)). Any thumbnail is found by _search_json_ld().

NJoyIE._extract_embed() is also somewhat different and again I believe that the yt-dl version is more up-to-date.

In the yt-dlp extractor NDRIE._VALID_URL has a pattern for daserste.de but no test (yt-dl handles that domain in ard.py).

The tests should show which versions work.

Everything else is conventional or fixing tests.

yt-dl handles that domain in ard.py

yt-dlp mentions Daserste in the regexes for both ard.py and ndr.py. I don't know how to read regexes, but an online regex tester I used suggests that the ndr.py one is for daserste.ndr.de.
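For anyone who would rather check this locally than in an online tester, a pattern can be tried against sample URLs with Python's `re` module. The pattern and URLs below are invented for illustration and are not the real `_VALID_URL` from ndr.py or ard.py.

```python
import re

# Hypothetical pattern, NOT the actual NDRIE._VALID_URL.
# It accepts ndr.de with an optional "www." and an optional "daserste." subdomain.
HYPOTHETICAL_PATTERN = r'https?://(?:www\.)?(?:daserste\.)?ndr\.de/'

# Made-up sample URLs for the two domains being discussed.
sample_urls = [
    'https://daserste.ndr.de/some/page.html',
    'https://www.daserste.de/some/page.html',
]

for url in sample_urls:
    # re.match only succeeds if the pattern matches from the start of the string
    matched = re.match(HYPOTHETICAL_PATTERN, url) is not None
    print(url, 'matches' if matched else 'does not match')
```

Substituting the real `_VALID_URL` strings from the two extractors would show which extractor claims which domain.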

It was added in dc9d8f4, which was part of blackjack4494/youtube-dlc#95 (yt-dlp is a fork of youtube-dlc), a PR which fixed ytdl-org/youtube-dl#26563

commented

ITV extractor was based on the last yt-dlp PR and should be OK with these considerations:

  • remove _sort_formats()
  • tweak any removed compat_xxx
  • optionally, delete _search_nextjs() shim in favour of yt-dlp native method
  • possibly investigate if subtitles are available in the m3u8 manifest and use _extract_m3u8_formats_and_subtitles() if so.

If there is a test port or PR I could exercise it in the UK.

[jsinterp] Improve parsing

@dirkf Could you explain the point of the regex rework in jsinterp?

We don't have . implemented except for some specific cases. So this does nothing. Even if it were to be implemented, the method names are generally different between JS and Python. Some fields like split/search/flags etc will work, but the test seems to imply you are expecting the others as well?

Actually, none of the YT players actually use the regexes. We only needed to parse them out. The .compile can even be completely removed (possible speed improvement?)

The newly added replace also doesn't work. When passed a string, as in "a".replace(".", "b"), JS interprets . literally, while the code considers it a regex. On the other hand, "a".replace(/\./, "b") also doesn't work and raises an error, since JS_Regex doesn't inherit from re.Pattern
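The distinction between a string pattern and a regex pattern can be sketched in plain Python. This is a minimal illustration, not the jsinterp code: `js_replace` is a made-up name, a compiled `re` pattern stands in for a JS regex literal without the `g` flag, and JS replacement patterns such as `$1` or `$&` are not handled.

```python
import re


def js_replace(text, pattern, replacement):
    """Hypothetical sketch of String.prototype.replace for simple cases."""
    if isinstance(pattern, str):
        # A string pattern is taken literally; only the first occurrence is replaced
        return text.replace(pattern, replacement, 1)
    # Assumed stand-in for a JS regex without the "g" flag: first match only.
    # The lambda keeps Python from interpreting backslashes in the replacement.
    return pattern.sub(lambda m: replacement, text, count=1)


print(js_replace('a', '.', 'b'))               # string pattern: "." is a literal dot
print(js_replace('a', re.compile(r'.'), 'b'))  # regex pattern: "." matches any character
print(js_replace('a.c', re.compile(r'\.'), 'b'))
```

Dispatching on the argument type is what keeps the two cases apart; treating every pattern as a regex is the behaviour criticised above.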

commented

That was a bit of WIP, apparently, even though the specification of String.prototype.replace[All]() has all the logic and consistency that one has come to expect from Brendan's busy week.

The linked test was really checking that the SO hack behaved sensibly.

So far there hasn't been any use of replace() but if it happens there's now this: dirkf/youtube-dl@3be072e.

The .compile could indeed be lazy, like this: dirkf/youtube-dl@5f0eea7. It doesn't seem to affect the time for test/test_youtube_signature.py much: maybe 5% quicker. Py2.7/Py3.9 runtime ratio still ~1.6, though.
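As a rough idea of what lazy compilation means here, the pattern can be stored when it is parsed and only compiled the first time it is actually used. The sketch below is not the code from the linked commit; `LazyRegex` and its attributes are invented for illustration.

```python
import re


class LazyRegex:
    """Hypothetical wrapper: keep the pattern text, compile on first use."""

    def __init__(self, pattern, flags=0):
        self.pattern = pattern
        self.flags = flags
        self._compiled = None  # nothing is compiled at construction time

    def _get(self):
        # Compile once, then reuse the cached pattern object
        if self._compiled is None:
            self._compiled = re.compile(self.pattern, self.flags)
        return self._compiled

    def __getattr__(self, name):
        # Called only for attributes not found on the wrapper itself.
        # Private names are rejected to avoid recursion on half-built objects.
        if name.startswith('_'):
            raise AttributeError(name)
        return getattr(self._get(), name)


rx = LazyRegex(r'\d+')
print(rx._compiled is None)         # True until a regex method is needed
print(rx.search('ab12cd').group(0))
print(rx._compiled is not None)
```

One trade-off of this approach is that an invalid pattern is only reported on first use rather than at parse time.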