nerevu / riko

A Python stream processing engine modeled after Yahoo! Pipes

Option to set Request User-Agent string

mafrosis opened this issue · comments

I've been trialling riko and it seems great. I do have a small request, however: an option to change the User-Agent on outgoing requests. Some servers will block the default User-Agent, Python-urllib/3.5.
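For reference, the standard library can already attach a custom User-Agent to an individual request. This is a minimal sketch, assuming Python 3's urllib.request; the URL and agent string are placeholders, and nothing here is riko's own API.

```python
from urllib.request import Request

# Placeholder URL and agent string, chosen for illustration only.
req = Request('http://example.com', headers={'User-Agent': 'Special-Agent'})

# urllib stores header names capitalized, hence 'User-agent' on lookup.
user_agent = req.get_header('User-agent')
print(user_agent)

# Passing `req` (instead of a bare URL string) to urllib.request.urlopen
# would send the request with this header.
```

No network call is made above; the header is only attached to the request object.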

Alternatively, have you considered using urllib3 instead of the mess that's in Python core? In that case you can easily pass headers into the PoolManager constructor.

https://urllib3.readthedocs.io/en/latest/

Thanks for your work!

Glad you are enjoying riko and thanks for the suggestion! I agree this is a useful feature, but I'm not sure when I will be able to work on it, since it will touch multiple files and require a bit of time to properly integrate into the entire project. If this is something you are willing to take a stab at, I can happily point you in the right direction :).

My initial thought is to add a ua key to the conf kwarg of the appropriate pipes. Then you could do, e.g., pipe(conf={'url': 'example.com', 'ua': 'Special-Agent'}). There would also need to be an option added to SyncPipe (plus the async versions of both).

IIRC, I don't think urllib3 can read local files (file://), only remote (http://).
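To illustrate the local-file point, here is a small sketch showing that the standard library's urlopen does accept file:// URLs. It writes a throwaway temporary file so it does not depend on any existing path; it makes no claim about urllib3's behaviour.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Write a throwaway file so the example does not depend on any existing path.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('local content')
    path = Path(f.name)

try:
    # Path.as_uri() turns an absolute path into a file:// URL.
    with urlopen(path.as_uri()) as response:
        body = response.read().decode('utf-8')
finally:
    os.remove(path)

print(body)
```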

Hi! I just took a look through the source code, and the part I'm not really clear on is what would need to change in SyncPipe. It seems the new "ua" field would just be passed down into each module via kwargs?

Also, which modules will want this feature? I was looking specifically at fetchpage, but I guess fetchdata and xpathfetchpage are obvious candidates.

> It seems the new "ua" field would just be passed down into each module via kwargs?

This would be true if ua were passed as

SyncPipe('fetch', conf={'url': 'example.com'}, ua='Special-Agent')

instead of

SyncPipe('fetch', conf={'url': 'example.com', 'ua': 'Special-Agent'})

The choice of whether ua should be in conf or kwargs essentially boils down to how you want to extract the value: from the conf dict or from the keyword arguments.
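The two extraction styles might look roughly like this. The function names are hypothetical stand-ins for a pipe's internals, not riko functions, and the sketch assumes conf arrives as a plain dict.

```python
def ua_from_conf(conf=None, **kwargs):
    """Hypothetical: read the agent from the conf dict."""
    return (conf or {}).get('ua')


def ua_from_kwargs(conf=None, **kwargs):
    """Hypothetical: read the agent from the keyword arguments."""
    return kwargs.get('ua')


# Style 1: ua travels inside conf.
in_conf = ua_from_conf(conf={'url': 'example.com', 'ua': 'Special-Agent'})

# Style 2: ua travels as its own keyword argument.
in_kwargs = ua_from_kwargs(conf={'url': 'example.com'}, ua='Special-Agent')

print(in_conf, in_kwargs)
```

Either way the pipe ends up with the same string; the difference is only in which argument the caller has to populate.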

> Also, which modules will want this feature?

I would say almost all of the source pipes, with the exceptions being itembuilders and input. The non-source pipe exchangerate is also a candidate. For the source pipes, we could intercept the parse_rss function call and pass the required kwargs to urlopen. There may be some edge cases as well, but a simple search for all uses of urlopen should suffice. Plus the async variant async_url_open. Plus py2 and py3 compatibility. Plus the appropriate unit tests.... Phew!
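One way to keep the urlopen changes small would be a thin wrapper that each call site switches to. This is only a sketch under assumptions: build_request and open_with_ua are hypothetical names, the code targets Python 3's urllib.request only, and it ignores the async and py2 paths mentioned above.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import Request, urlopen


def build_request(url, ua=None):
    """Hypothetical helper: attach a User-Agent header only when one is given."""
    headers = {'User-Agent': ua} if ua else {}
    return Request(url, headers=headers)


def open_with_ua(url, ua=None, **kwargs):
    """Hypothetical drop-in for direct urlopen calls; not part of riko."""
    return urlopen(build_request(url, ua), **kwargs)


# The header is attached for remote URLs (no request is actually sent here).
header = build_request('http://example.com', ua='Special-Agent').get_header('User-agent')

# Local files keep working: the file handler simply ignores the extra header.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('local content')
    path = Path(f.name)

try:
    with open_with_ua(path.as_uri(), ua='Special-Agent') as response:
        local_body = response.read().decode('utf-8')
finally:
    os.remove(path)
```

Because the wrapper falls back to an empty header dict when ua is omitted, existing callers that pass no agent would see the same default behaviour as before.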

I hope that didn't overwhelm you :)

Hey, I'm sorry, I really haven't had time to take on this work. I wish I did! It's looking more complicated than I can realistically tackle right now, so please close this issue if you are unlikely to implement it yourself.

Thanks!

I'll keep it open since it's a valid request. Any area in particular causing difficulty?