scrapy / scrapely

A pure-python HTML screen-scraping library

safehtml omits some important (all) attributes of tags

SirbitoX opened this issue

Let's say someone (like me) wants to keep an img tag, so the src attribute of that tag matters to them. But the safehtml() function omits all attributes of the relevant tag.
I think it would be better to keep the attributes of allowed_tags, or to add another parameter named allowed_attributes to specify which attributes to keep.
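
To make the proposal concrete, here is a minimal sketch of what an allowed_attributes whitelist could look like. It is not scrapely's code and does not call safehtml(); it is a standalone illustration built on the standard library's html.parser. The names AttributeFilteringSanitizer and sanitize, and the dict shape of allowed_attributes (tag name mapped to the attribute names to keep), are assumptions made up for this example.

```python
from html import escape
from html.parser import HTMLParser


class AttributeFilteringSanitizer(HTMLParser):
    """Hypothetical sanitizer: keeps whitelisted tags and, on those tags,
    only the attributes listed for them in allowed_attributes."""

    def __init__(self, allowed_tags, allowed_attributes):
        super().__init__(convert_charrefs=True)
        self.allowed_tags = set(allowed_tags)
        # Assumed shape: {"img": ("src", "width", "height")}
        self.allowed_attributes = allowed_attributes
        self.parts = []

    def _render_start(self, tag, attrs, closing=">"):
        # Disallowed tags are dropped; their text content is still kept.
        if tag not in self.allowed_tags:
            return
        keep = self.allowed_attributes.get(tag, ())
        rendered = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name in keep
        )
        self.parts.append("<%s%s%s" % (tag, rendered, closing))

    def handle_starttag(self, tag, attrs):
        self._render_start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img ... />
        self._render_start(tag, attrs, closing=" />")

    def handle_endtag(self, tag):
        if tag in self.allowed_tags:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))


def sanitize(html, allowed_tags, allowed_attributes):
    """Return html with only the whitelisted tags and attributes."""
    parser = AttributeFilteringSanitizer(allowed_tags, allowed_attributes)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


article = '<p class="x">Hi <img src="a.png" width="10" onclick="evil()"></p>'
cleaned = sanitize(article, {"p", "img"}, {"img": ("src", "width", "height")})
print(cleaned)
```

With this whitelist the img tag keeps src and width, while class and onclick are dropped. The sketch does not validate attribute values (for example a javascript: URL in src) and does not purge the contents of script or style elements, so it only illustrates the attribute filtering idea.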

Hi @SirbitoX. I was having a discussion about this last week, and we were thinking about adding a new, less strict version of safe html. The new type would sit somewhere between raw html and safe html, keeping img tags and possibly other tags too.

Other than img tags, what other tags do you add? Would you mind explaining your specific use case? Are you extracting articles, products, or leads?

Hi @ruairif,
I'm extracting articles, and I keep all the images in the description of each scraped article. To do this I need the src attribute, and possibly the height and width attributes, of the img tag.
I will probably keep embedded videos in the description too. But that wouldn't be a problem if we supported something like allowed_attributes.