mibe / FeedWriter

PHP Universal Feed Generator

Home Page: http://ajaxray.com/blog/php-universal-feed-generator-supports-rss-10-rss-20-and-atom


Filter for removing invalid XML chars returns empty string on regex error

mibe opened this issue · comments

Pull request #23 introduced a filter that removes characters which are invalid in XML. It is implemented as a regular expression replace operation performed by the PCRE library.

The problem is that the result of that operation is not checked. The preg_replace() function returns NULL on error. That NULL is then cast to a string, which results in an empty string.
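The failure mode can be shown in isolation. This is only a minimal sketch: the variable names are made up for illustration, and the pattern and input bytes are taken from the reproduction code further down.

```php
<?php
// ISO-8859-2 bytes; this sequence is not valid UTF-8.
$input = "Test\t\xc1\xe9u\xc3";

// The /u modifier makes PCRE validate the subject as UTF-8.
// On invalid input, preg_replace() returns NULL instead of a string.
$result = preg_replace(
    '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u',
    '',
    $input
);

var_dump($result === null);        // the error is signalled by NULL
var_dump((string) $result === ''); // casting NULL silently yields ''
```

Because the cast succeeds without any warning, the original text is lost unless the return value is checked explicitly.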

This behaviour was first noticed by @NeoCsatornaja in issue #28, after setting the feed encoding to ISO-8859-2 and supplying data in that encoding.

The best solution is probably to use the regular expression functions from the Multibyte String extension (mbstring). The problem is that this extension is not enabled by default, so relying on it would make FeedWriter incompatible with installations that lack it. I don't know how common that is, but I could imagine it is the case on cheap shared web hosts.

So a compatible solution is IMHO to check whether preg_replace() failed and, in that case, fall back to a regular expression without multibyte characters.
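One way this fallback could look is sketched below. It is not the code that shipped in FeedWriter: the function name `filterInvalidXmlChars()`, the `_` default replacement, and the byte-oriented fallback pattern are assumptions for illustration, and the type declarations assume PHP 7 or later.

```php
<?php
/**
 * Hypothetical helper (not FeedWriter's actual implementation).
 * Replaces characters that are invalid in XML, without ever
 * returning an empty string just because the regex failed.
 */
function filterInvalidXmlChars(string $text, string $replacement = '_'): string
{
    // First attempt: Unicode-aware pattern. Needs $text to be valid UTF-8.
    $result = preg_replace(
        '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u',
        $replacement,
        $text
    );
    if ($result !== null) {
        return $result;
    }

    // Fallback: byte-oriented pattern without the /u modifier.
    // It only replaces the control bytes below 0x20 that XML forbids.
    $result = preg_replace('/[^\x09\x0a\x0d\x20-\xFF]+/', $replacement, $text);

    // If even the fallback fails, keep the input rather than losing it.
    return $result ?? $text;
}

// Valid UTF-8 input goes through the Unicode-aware pattern.
var_dump(filterInvalidXmlChars("a\x00b"));

// ISO-8859-2 input triggers the fallback and keeps its high bytes.
var_dump(filterInvalidXmlChars("Test\t\xc1\xe9u\xc3"));
```

The trade-off in this sketch is that the fallback is less strict: it cannot reject code points such as U+FFFE, because without /u it only sees single bytes.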

If you want to reproduce the problem yourself, here's the code:

header("Content-Type: text/plain");

$string = "\x54\x65\x73\x74\x09\xc1\xe9\x75\xc3";

mb_regex_encoding('UTF-8');
$after = mb_ereg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', '_', $string);
var_dump($after);

$after = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', '_', $string);
var_dump($after);
var_dump(preg_last_error());

The result is:

string(9) "Test ÁéuÃ"
NULL
int(4)

As you can see, the regex is identical, but preg_replace() failed with a PREG_BAD_UTF8_ERROR error.

Fixed with release of v1.0.3.