oalders / html-restrict

HTML::Restrict - Strip away unwanted HTML tags

HTML::Restrict strips too much if the number of '<' and '>' characters doesn't match.

brunobuss opened this issue

Passing 'test<string' to process() returns only 'test'. It should return the complete string.

Some test cases (sorry, I didn't know where to put them under t/, which is why I'm pasting them here):
is( $hr->process( '<' ), '<', 'ok' );
is( $hr->process( 'a<' ), 'a<', 'ok' );
is( $hr->process( '<a' ), '<a', 'ok' );

Also, if the numbers of '<' and '>' don't match, the HTML::Restrict parser seems to use a greedy strategy to find tags. For example, for 'a<s<d>b' I expected it to return "a<sb", but it returned only "ab".

Some more test cases:

is( $hr->process( '<<' ), '<<', 'ok' ); # This is working now.
is( $hr->process( '<<a' ), '<<a', 'ok' ); # This doesn't work; it returns only '<'.

is( $hr->process( 'a<<' ), 'a<<', 'ok' ); # This is working now.
is( $hr->process( 'a<<a' ), 'a<<a', 'ok' ); # This doesn't work; it returns only 'a<'.

is( $hr->process( '<a<' ), '<a<', 'ok' ); # This doesn't work; it returns an empty string.

It seems as though HTML::Parser is interpreting those strings as comments. You can see this by enabling debug mode. If you set allow_comments, all your test cases pass:

my $hr = HTML::Restrict->new( debug => 1, allow_comments => 1 );

The same will be true for the other sanitizing modules that are based on HTML::Parser:

use feature 'say'; # needed for say()
use HTML::Scrubber;
my $s = HTML::Scrubber->new;
say $s->scrub('test<string');
# output: test

I recommend you try to encode unwanted HTML characters, such as '<', as entities before processing the HTML. If that is not possible, then you could pre-process the data using a markup abstraction, such as Markdown, instead of trying to process arbitrary strings as HTML.
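
As a rough illustration of the "encode before processing" idea, here is a minimal sketch that escapes only those '<' characters that do not begin something tag-like, so real tags are still left for the sanitizer. The helper name escape_stray_lt is hypothetical and not part of HTML::Restrict or HTML::Parser, and the regex is a heuristic: it assumes a tag looks like '<', an optional '/', a letter, and then a closing '>' with no '<' or '>' in between. It does not handle comments or attribute values that contain '<'.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): turn a stray '<' into
# '&lt;' unless it starts something that looks like a tag, i.e. '<',
# an optional '/', a letter, and a closing '>' before any other '<'.
sub escape_stray_lt {
    my ($text) = @_;
    $text =~ s{<(?!/?[A-Za-z][^<>]*>)}{&lt;}g;
    return $text;
}

print escape_stray_lt('test<string'), "\n";   # test&lt;string
print escape_stray_lt('a<s<d>b'), "\n";       # a&lt;s<d>b
print escape_stray_lt('<b>bold</b>'), "\n";   # <b>bold</b>
```

The escaped string could then be passed to process() in place of the raw input. Whether this heuristic is acceptable depends on the data; anything the regex mistakes for a tag is still handed to the parser unchanged.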

If I encode/escape all HTML entities before passing the string to HTML::Restrict, it will do nothing, as there won't be any tags left. I don't know what I'll do to solve my problem... but I'm closing this issue, as it's clearly not an HTML::Restrict problem. Thank you for your time 👍.

@brunoboss: Broken HTML is going to be a problem, but I guess opening an issue with HTML::Parser might be helpful. Thanks for adding those test cases to make it clear. And thanks to @perlpong for finding the problem. :)