Option to set Request User-Agent string
mafrosis opened this issue
I've been trialling riko and it seems great. I do have a small request, however: that an option be added to change the User-Agent on outgoing requests. Some servers will block the default User-Agent: Python-urllib/3.5.
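For context, a minimal sketch of how the standard library lets a caller override that default header. This is plain urllib.request, not riko's API; the URL and the 'Special-Agent' value are placeholders.

```python
from urllib.request import Request

# Attach a custom User-Agent to a single request object.
# No network call is made here; the request is only constructed.
req = Request('http://example.com', headers={'User-Agent': 'Special-Agent'})

# urllib stores header names capitalized as 'User-agent'.
ua = req.get_header('User-agent')
print(ua)
```

Passing req to urllib.request.urlopen would then send this header instead of the Python-urllib default.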
Alternatively, have you considered using urllib3 instead of the mess that's in Python core? In that case you can easily pass headers into the PoolManager constructor.
https://urllib3.readthedocs.io/en/latest/
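A short sketch of what that could look like, assuming urllib3 is installed. The header value is a placeholder, and nothing here reflects how riko is actually wired up.

```python
import urllib3

# Headers given to the PoolManager become the defaults for
# requests made through it. Constructing the manager opens no connections.
http = urllib3.PoolManager(headers={'User-Agent': 'Special-Agent'})

default_headers = dict(http.headers)
print(default_headers)
```

A later http.request('GET', url) call would send these headers unless the call supplies its own.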
Thanks for your work!
Glad you are enjoying riko and thanks for the suggestion! I agree this is a useful feature, but I'm not sure when I will be able to work on it, since it will touch multiple files and require a bit of time to properly integrate into the entire project. If this is something you are willing to take a stab at, I can happily point you in the right direction :).
My initial thought is to add a ua key to the conf kwarg of the appropriate pipes. Then you could do, e.g., pipe(conf={'url': 'example.com', 'ua': 'Special-Agent'}). There would also need to be an option added to SyncPipe (plus the async versions of both).
IIRC, urllib3 can't read local files (file://), only remote ones (http://).
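To illustrate the local-file point, here is a small sketch showing that the standard library's urlopen accepts file:// URLs, even when the request carries a User-Agent header. The read_local helper and the file name are made up for this example.

```python
import tempfile
from pathlib import Path
from urllib.request import Request, urlopen


def read_local(path, ua='Special-Agent'):
    """Read a local file through urlopen using a file:// URL (hypothetical helper)."""
    url = Path(path).resolve().as_uri()  # e.g. file:///tmp/.../feed.xml
    # The header is irrelevant for local files but does no harm.
    req = Request(url, headers={'User-Agent': ua})
    with urlopen(req) as resp:
        return resp.read()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'feed.xml'
    sample.write_bytes(b'<rss></rss>')
    content = read_local(sample)

print(content)
```

So a single code path can serve both local and remote sources when it stays on urllib.request.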
Hi! I just took a look through the source code, and the part I'm not really clear on is what would need to change in SyncPipe. It seems the new "ua" field would just be passed down into each module via kwargs?
Also, which modules will want this feature? I was looking specifically at fetchpage, but I guess fetchdata and xpathfetchpage are obvious candidates.
It seems the new "ua" field would just be passed down into each module via kwargs?
This would be true if ua were passed as
SyncPipe('fetch', conf={'url': 'example.com'}, ua='Special-Agent')
instead of
SyncPipe('fetch', conf={'url': 'example.com', 'ua': 'Special-Agent'})
The choice of whether ua should be in conf or kwargs essentially boils down to how you want to extract the value:
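As a rough illustration of the two options (not riko's actual code), the sketch below uses two hypothetical stand-in functions, fetch_via_conf and fetch_via_kwargs, that only show where the value would be read from.

```python
def fetch_via_conf(conf=None, **kwargs):
    """Hypothetical pipe: ua travels inside the conf dict."""
    conf = conf or {}
    return {'url': conf.get('url'), 'ua': conf.get('ua')}


def fetch_via_kwargs(conf=None, ua=None, **kwargs):
    """Hypothetical pipe: ua is its own keyword argument."""
    conf = conf or {}
    return {'url': conf.get('url'), 'ua': ua}


from_conf = fetch_via_conf(conf={'url': 'example.com', 'ua': 'Special-Agent'})
from_kwargs = fetch_via_kwargs(conf={'url': 'example.com'}, ua='Special-Agent')
print(from_conf == from_kwargs)
```

Both styles end up with the same value. The difference is whether the pipe looks it up in conf or receives it as a named argument.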
Also, which modules will want this feature?
I would say almost all of the source pipes, with the exceptions being itembuilder and input. The non-source pipe exchangerate is also a candidate. For the source pipes, we could intercept the parse_rss function call and pass the required kwargs to urlopen. There may be some edge cases as well, but a simple search for all uses of urlopen should suffice. Plus the async variant async_url_open. Plus py2 and py3 compatibility. Plus the appropriate unit tests.... Phew!
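One way to picture the urlopen part is a single wrapper that every call site could go through. This is a Python 3 only sketch under assumed names (build_request and open_url are hypothetical, not functions in riko), and it leaves out the async and py2 sides.

```python
from urllib.request import Request, urlopen


def build_request(url, ua=None, headers=None):
    """Build a Request, adding a User-Agent only when one is given."""
    merged = dict(headers or {})
    if ua:
        merged['User-Agent'] = ua
    return Request(url, headers=merged)


def open_url(url, ua=None, **kwargs):
    """Hypothetical drop-in for urlopen; kwargs (e.g. timeout) pass straight through."""
    return urlopen(build_request(url, ua=ua), **kwargs)


with_ua = build_request('http://example.com', ua='Special-Agent')
without_ua = build_request('http://example.com')
print(with_ua.get_header('User-agent'), without_ua.has_header('User-agent'))
```

With a helper like this, each existing urlopen call would only need to forward the ua value it extracted, and the default behaviour stays unchanged when no ua is supplied.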
I hope that didn't overwhelm you :)
Hey, I'm sorry, but I really haven't had time to take on this work. I wish I did! It's looking more complicated than I can realistically tackle right now, so please close this issue if you are unlikely to implement it yourself.
Thanks!
I'll keep it open since it's a valid request. Any area in particular causing difficulty?
CR #45