LazyRemoteFile sometimes raises a 403 Forbidden error because of urlretrieve's default headers
joeyjurjens opened this issue · comments
First of all: it does not raise a 403 every time, but lately I've run into it quite a few times.
LazyRemoteFile uses urlretrieve to download an image from a given URL and save it to a file.
However, the default user-agent it sends seems to be blocked by quite a few websites.
Unfortunately, urlretrieve doesn't let us set request headers.
If we want to pass headers with urllib, we could do so as follows:
import shutil
import urllib.request

req = urllib.request.Request('http://www.example.com/')
req.add_header('User-Agent', 'Mozilla/5.0')
# Read the response content and save it to a file
with urllib.request.urlopen(req) as r, open(file_path, 'wb') as f:
    shutil.copyfileobj(r, f)
We could also use the requests library, which looks a bit cleaner (in my opinion):
import requests

r = requests.get(self.url, headers={'User-Agent': 'Mozilla/5.0'})
# Read the response content and save it to a file
with open(file_path, 'wb') as f:
    f.write(r.content)
Is this something I can make a PR for, and if so what method would be preferred?
Please use just urllib; we aren't making many requests, and keeping the dependencies minimal is a goal of this project. Make sure the User-Agent has a sane default but can be overridden by a setting, and provide some example settings in the documentation for emulating common browsers.
☝️
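A minimal sketch of what the maintainer describes above, using only urllib. The setting name DEFAULT_USER_AGENT and the download helper are hypothetical here; the real project would read the override from its own settings mechanism.

```python
import shutil
import urllib.request

# Hypothetical default; the project would let a setting override this value.
DEFAULT_USER_AGENT = 'Mozilla/5.0'


def download(url, dest_path, user_agent=DEFAULT_USER_AGENT):
    """Download url to dest_path, sending a configurable User-Agent header."""
    req = urllib.request.Request(url, headers={'User-Agent': user_agent})
    # Stream the response body straight into the destination file.
    with urllib.request.urlopen(req) as response, open(dest_path, 'wb') as out:
        shutil.copyfileobj(response, out)
```

With this shape, emulating a common browser is just a matter of passing (or configuring) a different user_agent string, e.g. a recent Chrome or Firefox UA.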