aliparlakci / bulk-downloader-for-reddit

Downloads and archives content from reddit

Home Page: https://pypi.org/project/bdfr

[BUG] Imgur - Response code 429

gemini0x2 opened this issue

  • I am reporting a bug.
  • I am running the latest version of BDFR.
  • I have read the Opening an issue guidelines.

Description

Imgur links are returning response code 429.
Note: I'm able to browse Imgur normally in my browser and even access the direct links of files that return 429 in bdfr. This error continues even after waiting 24 hours. Never had this issue before.

Command

python bdfr --user reddituser --submitted

Environment

  • OS: macOS
  • Python version: 3.10.6

Logs

[2023-05-23 22:53:15,330 - bdfr.connector - DEBUG] - Disabling the following modules: 
[2023-05-23 22:53:15,330 - bdfr.connector - Level 9] - Created download filter
[2023-05-23 22:53:15,331 - bdfr.connector - Level 9] - Created time filter
[2023-05-23 22:53:15,331 - bdfr.connector - Level 9] - Created sort filter
[2023-05-23 22:53:15,331 - bdfr.connector - Level 9] - Create file name formatter
[2023-05-23 22:53:15,331 - bdfr.connector - DEBUG] - Using unauthenticated Reddit instance
[2023-05-23 22:53:15,332 - bdfr.connector - Level 9] - Created site authenticator
[2023-05-23 22:53:15,332 - bdfr.connector - Level 9] - Retrieved subreddits
[2023-05-23 22:53:15,332 - bdfr.connector - Level 9] - Retrieved multireddits
[2023-05-23 22:53:15,744 - bdfr.connector - Level 9] - Retrieved user data
[2023-05-23 22:53:15,744 - bdfr.connector - Level 9] - Retrieved submissions for given links
[2023-05-23 22:53:27,185 - bdfr.downloader - DEBUG] - Attempting to download submission 13nbj2e
[2023-05-23 22:56:28,016 - bdfr.downloader - DEBUG] - Attempting to download submission 13jz8vx
[2023-05-23 22:56:28,016 - bdfr.downloader - DEBUG] - Using Imgur with url https://i.imgur.com/XpO4ZNm.gifv
[2023-05-23 22:56:28,293 - bdfr.resource - WARNING] - Error occured downloading from https://i.imgur.com/XpO4ZNm.mp4, waiting 60 seconds: Response code 429
[2023-05-23 22:57:28,485 - bdfr.resource - WARNING] - Error occured downloading from https://i.imgur.com/XpO4ZNm.mp4, waiting 120 seconds: Response code 429
[2023-05-23 22:59:28,586 - bdfr.resource - ERROR] - Max wait time exceeded for resource at url https://i.imgur.com/XpO4ZNm.mp4
[2023-05-23 22:59:28,586 - bdfr.downloader - ERROR] - Failed to download resource https://i.imgur.com/XpO4ZNm.mp4 in submission 13jz8vx with downloader Imgur: Could not download resource: Response code 429

Having the same issue

Imgur is nuking all NSFW content from Reddit. Not sure if this can be fixed, but that's most likely the cause; it must be screwing with the API.

@GarethFreeman I know about that, but if that's the cause of the 429 response code, then why can we still access any content in the browser without a problem? That makes no sense, unless they somehow detect something peculiar about how bdfr is making the download requests.

HTTP code 429 is a rate-limiting error. It means that Imgur has received too many requests from the browser or application. There's not really any way for us to deal with this or get around it; it just means you have to make requests more slowly or download fewer Imgur posts.
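
For context, that is exactly the behaviour already visible in the log above (wait 60 seconds, then 120, then give up). A generic version of the same idea, as a minimal sketch with requests rather than bdfr's actual code (the wait limits here are made up):

    import time

    import requests

    def fetch_with_backoff(url: str, max_wait: int = 240) -> bytes:
        """Retry a download with doubling waits while the server answers 429."""
        wait = 60
        while True:
            response = requests.get(url, timeout=30)
            if response.status_code != 429:
                response.raise_for_status()
                return response.content
            if wait > max_wait:
                raise RuntimeError(f"Giving up on {url}: still rate limited (429)")
            time.sleep(wait)  # back off before retrying
            wait *= 2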

I was having the same issue but noticed that curl was able to download the same urls with no problem.

Adding curl's default headers to the resource method below fixed the issue for me.

    @staticmethod
    def http_download(url: str, download_parameters: dict) -> Optional[bytes]:
        # headers = download_parameters.get("headers")  # original lookup, replaced below
        # Use curl's default headers instead; Imgur stops answering 429 with these
        headers = {
            "user-agent": "curl/7.84.0",
            "accept": "*/*",
        }
        ...

I expect it's the accept header more than the user-agent, but I haven't tried without both.
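
If someone wants to check which header matters, a rough diagnostic along these lines would isolate them, using one of the failing links from the log above (not part of bdfr, just a standalone test script):

    import requests

    # Diagnostic sketch: try each header combination against one of the links
    # that returns 429 in the log above, to see which header Imgur cares about.
    # Note: requests merges these with its own defaults, so "accept only" is
    # close to the default behaviour (requests already sends "Accept: */*").
    URL = "https://i.imgur.com/XpO4ZNm.mp4"

    candidates = {
        "requests defaults": {},
        "accept only": {"accept": "*/*"},
        "user-agent only": {"user-agent": "curl/7.84.0"},
        "both": {"user-agent": "curl/7.84.0", "accept": "*/*"},
    }

    for name, headers in candidates.items():
        status = requests.get(URL, headers=headers, timeout=30).status_code
        print(f"{name:18} -> HTTP {status}")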

Adding the curl headers fixed the issue. Too bad I didn't figure this out sooner. Thanks, @eawooten, for the solution!

Thank you! Adding the curl headers fixes the issue with Imgur, but breaks Redgifs.

@GGaroufalis Right, I hadn't noticed that! A conditional statement will help.

@Gavriik I think this one fixes it

    @staticmethod
    def http_download(url: str, download_parameters: dict) -> Optional[bytes]:
        domain = urlparse(url).hostname
        if domain and fnmatch.fnmatch(domain, "*.redgifs.com"):
            # Redgifs breaks with the curl headers, so keep the original ones
            headers = download_parameters.get("headers")
        else:
            # Everything else (notably Imgur) gets curl's default headers
            headers = {
                "user-agent": "curl/8.1.1",
                "accept": "*/*",
            }
        ...

You also need to add these imports at the top of the file:

    from urllib.parse import urlparse
    import fnmatch
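
One quirk of the fnmatch pattern that's easy to verify in a REPL: it only matches hostnames that have a subdomain, so a bare redgifs.com link would still get the curl headers:

    import fnmatch

    # "*.redgifs.com" only matches hostnames with a subdomain label, so a bare
    # redgifs.com link would still fall through to the curl headers.
    print(fnmatch.fnmatch("thumbs2.redgifs.com", "*.redgifs.com"))  # True
    print(fnmatch.fnmatch("redgifs.com", "*.redgifs.com"))          # False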

Not sure why you wouldn't put it here rather than make the download function super janky...

@GGaroufalis Thanks! I can confirm that the conditional statement works as expected, but wouldn't it be better to switch the condition? That way the modified headers are only used for Imgur, not for every other site.

    @staticmethod
    def http_download(url: str, download_parameters: dict) -> Optional[bytes]:
        headers = download_parameters.get("headers")
        domain = urlparse(url).hostname
        if domain and fnmatch.fnmatch(domain, "*.imgur.com"):
            # Only Imgur gets the curl headers; every other site keeps the originals
            headers = {
                "user-agent": "curl/8.1.1",
                "accept": "*/*",
            }
        ...

@Soulsuck24 That does not work. If I'm not wrong, that header is only used to retrieve the direct links of an Imgur post; it is not used for the actual download.

You're right, I was thinking of this one instead, my bad.

It's weird, though, that your connection to the API and through a browser works while the script gets 429s on the direct-link download. The changes here just switch the downloader from the default requests user-agent to the curl one. This would be the first time I've seen them limit on something other than IP, but then it can't be solely the user-agent, since the requests one is used to access the API and that isn't getting 429s. Odd.
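
For reference, the headers requests sends by default can be inspected directly, which makes the difference from curl easy to see (the exact version string depends on your install):

    import requests

    # The headers requests sends when none are supplied; compare with curl's
    # plain "user-agent: curl/x.y.z" plus "accept: */*".
    print(requests.utils.default_headers())
    # e.g. {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate',
    #       'Accept': '*/*', 'Connection': 'keep-alive'}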

How do you implement this? Apologies for not being an experienced coder.

Rename the attached file to resource.py and drop it in the bdfr folder:
resource.txt

Do you still use the bdfr command as before, or does curl work differently? Could you provide an example of a Reddit user download?

@GarethFreeman Same command as before; just make sure the modified resource.py is in the right location.

@Gavriik C:\Users\AppData\Local\BDFR\bdfr right? It's still giving me the 429 response code.

@GarethFreeman mine is in C:\Users\Administrator\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\bdfr

Yours might differ a bit depending on your Python version.

@Gavriik I literally don't have that folder at all. I'm on 3.10, and I just don't understand the problem. There are no other folders where I can place that file.

The following command should give you the correct location:
python3 -m pip show bdfr
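
If pip's output is confusing, the same folder can also be found from Python itself (assuming bdfr is importable with the interpreter you actually run it with):

    # Assumes bdfr is importable with the interpreter you actually run it with;
    # resource.py lives in the same folder as the printed __init__.py.
    import bdfr
    print(bdfr.__file__)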

@Gavriik Finally got it working, thanks for all the help mate.