ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites

Home Page: http://ytdl-org.github.io/youtube-dl/

--cookies option, get cookies dynamically

Iridium-Lo opened this issue

Checklist

  • I'm reporting a site feature request
  • I've verified that I'm running youtube-dl version 2021.12.17
  • I've searched the bugtracker for similar site feature requests including closed ones

Description

When having to use the --cookies option for sites with Cloudflare protection:

  • rather than having users install a third-party extension (as stated in the guide) and then manually run it each time they need to download

Suggestion

  • get the cookies dynamically; use curl, as it saves cookies in Netscape format (otherwise you have to convert them to Netscape format yourself, which takes some work)
  • capture the cookies in a variable or something instead of writing them to a file (writing a file is more work)

I have done this for a script I use, using curl; here is a module from it:

IFS=$'\n'

# fetch fresh cookies for the site and store them in Netscape format
getCookies() {
    curl "$1" \
      --silent \
      --output /dev/null \
      --user-agent "$userAgent" \
      --cookie-jar ~/cookies.txt
}

# wrapper around youtube-dl that reuses the stored cookies and user agent
ytdl() {
    youtube-dl \
      --no-part \
      --no-check-certificate \
      --cookies ~/cookies.txt \
      --user-agent "$userAgent" \
      --download-archive arc.txt \
      "$@"
}

downloadSimultaneously() {
    local site urlArray
    IFS=$'\n'
    # exported so the shells spawned by GNU parallel can see it
    export userAgent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:123.0) Gecko/20100101 Firefox/123.0'
    site=$1
    urlArray=("$@")

    getCookies "$site"

    parallel -j 0 \
      ytdl ::: "${urlArray[@]}"
}

export -f ytdl

For Python, something like this:

import requests

headers = {
    # same user agent as the one passed to youtube-dl
    "User-Agent": "Mozilla/5.0 ...",
}

# visiting the site sets the cookies on the response
response = requests.get("https://example.com/", headers=headers)  # site from download URL
cookies = response.cookies
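
If the cookies have to end up in Netscape format for --cookies anyway, the standard library can write that file directly. A rough sketch (the example.com URL and the user agent string are placeholders):

import http.cookiejar

import requests

# MozillaCookieJar reads/writes the Netscape cookies.txt format that --cookies expects
jar = http.cookiejar.MozillaCookieJar("cookies.txt")

session = requests.Session()
session.cookies = jar  # requests stores any Set-Cookie values in this jar
session.get("https://example.com/", headers={"User-Agent": "Mozilla/5.0 ..."})

# ignore_discard keeps session cookies that would otherwise be dropped on save
jar.save(ignore_discard=True)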

If you can just replay cookies set by visiting some URL with no user interaction, we can easily do that in the extractor itself.

I'm not sure what you mean by that?

getCookies gets the new/updated cookies each time downloadSimultaneously is executed

When yt-dl receives a Set-Cookie header from the site the cookie is stashed in a Python cookielib/http.cookiejar CookieJar accessible to extractor code as the cookiejar attribute of the extractor. The file specified by --cookies ... is updated with the received cookie, and could be passed to a subsequent yt-dl invocation. So yt-dl already includes the same functionality that curl offers (it doesn't have a --download-archive ... option, does it?).
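
For illustration only, a minimal sketch via the embedded API (the URL is a placeholder; the cookiefile option is what --cookies maps to):

import youtube_dl

# cookiefile is read at startup and rewritten with whatever cookies the site
# sets during the run, so the same file can be reused by the next invocation,
# much like curl's --cookie-jar
ydl_opts = {'cookiefile': 'cookies.txt'}

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(['https://example.com/watch/12345'])  # placeholder URL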

Sometimes a site that rejects requests or redirects requests to a captcha page will send authorisation cookies to bypass this blockage if a specific site page or API URL is visited. An extractor for the site can do that as part of its _real_initialize() method.
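
A rough sketch of that pattern (the site, URL and regex are invented; _real_initialize() and the _request_webpage()/_download_webpage() helpers are the existing InfoExtractor machinery):

from .common import InfoExtractor


class ExampleSiteIE(InfoExtractor):
    _VALID_URL = r'https?://(?:www\.)?example\.com/watch/(?P<id>\d+)'

    def _real_initialize(self):
        # Hitting this page once makes the site send its authorisation
        # cookies; they end up in the shared cookiejar and are replayed
        # on the requests made during extraction.
        self._request_webpage(
            'https://example.com/', None,
            note='Retrieving authorisation cookies')

    def _real_extract(self, url):
        video_id = self._match_id(url)
        webpage = self._download_webpage(url, video_id)
        return {
            'id': video_id,
            'title': self._og_search_title(webpage),
            'url': self._og_search_video_url(webpage),
        }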

Oops, obviously curl doesn't have that option; I'll edit that.

So the --cookies option doesn't just read the txt file, it actually writes to it?

Meaning I could just specify --cookies cookies.txt and it would write the cookies to the file? If so, the docs need updating. Or does cookies.txt need to be set once, after which yt-dl will set it fresh on every subsequent run?

The only other hurdle is that even with the cookies set, you have to go to the site and get past the Cloudflare protection, or it will still give a 404.

Does yt-dl have a parallel download option like my script?

cookies.txt starts as you wrote it and is updated by yt-dl as new cookies and values are set by the site.

yt-dl doesn't support the syntax/library modules needed for parallel execution, but there is some support for it in yt-dlp.

So you set cookies.txt once and then don't need to keep setting it.

You say there is some support for parallel downloads?

Can I add a PR for my script?

If you read the original comment, it's a better way of doing things (less work) than creating playlists, aside from the parallel part.

Or add a parallel download option?

Running multiple instances of yt-dl in parallel using the same output directory or download archive (etc) is not really supported. See #350 and the yt-dlp thread linked there.

Alright, so essentially parallel downloads just aren't going to integrate well with the existing code?

That thread is mostly a discussion between users; I don't see any dev input (I might have missed it). Anyway, for me it's not an issue.

From what you have said, it will be easy to carry out this feature request. Could I help?

Caveat

Even when you have the latest cookies, you still have to visit the site and get through the Cloudflare protection manually, or it will give a 404, so we (if you point me in the right direction) or the yt-dl maintainers will need to look into that.

Just a note: I install yt-dl from the repo, not via a package manager, so I can get access to branches with fixes before they are merged to master (it can take a while sometimes).

The caveat is the real problem that needs to be solved. yt-dlp hopes to do it with curl_cffi. An implementation of that solution here would require so much shimming and/or imposition of limiting dependencies that just using yt-dlp instead would be more sensible.

As you may observe, anyone can be a dev, but almost all the knowledgeable and active contributors are working on yt-dlp. Features and relevant fixes here generally get pulled downstream, and downstream improvements, especially extractor modules, may also be pulled here. I'm not likely to merge a PR that adds a feature already implemented downstream unless it behaves in the same way (API, CLI) for cases that are covered by both implementations.

Oh right, I see... That's why I thought yt-dlp is better; I had things the wrong way around and thought yt-dl was the one with more support.

Although you said yt-dlp has some parallel support. GNU parallel is also limited, as it only goes to 60 instances max.

Could you let me know the next best thing (the command) for generating an archive from URLs with yt-dlp (with minimal downloading), like you did for yt-dl, please?

Ignore that, the archive generation command is the same. By the way, yt-dlp is much faster.