Filter for removing invalid XML chars returns empty string on regex error
mibe opened this issue
Pull request #23 introduced a filter that removes characters which are invalid in an XML context. It is implemented as a regular expression replace operation, performed by the PCRE library.
The problem is that the result of that operation is not checked. The preg_replace() function returns NULL on error. That NULL is then cast to a string, which results in an empty string.
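A minimal sketch of that failure mode, for illustration only. The input bytes below are a made-up example of non-UTF-8 data, not taken from FeedWriter; the pattern is the one quoted in the reproduction code in this issue.

```php
<?php
// Hypothetical input: ISO-8859-2 style bytes, which are not valid UTF-8.
$input = "Test \xC1\xE9";

// With the /u modifier, PCRE requires the subject to be valid UTF-8.
$result = preg_replace(
    '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u',
    '',
    $input
);

// The call fails and returns NULL instead of a string.
var_dump($result === null);

// Casting that NULL to a string silently yields an empty string,
// so the original text is lost without any warning.
var_dump((string) $result === '');

// preg_last_error() reports why the call failed.
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```

All three checks should print bool(true) on an input like this one.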
This behaviour was first noticed by @NeoCsatornaja in issue #28, after setting the feed encoding to ISO-8859-2 and supplying data in that encoding.
The best solution is probably to use the regular expression functions from the Multibyte String extension (mbstring). The problem is that this extension is not enabled by default, so relying on it would make FeedWriter incompatible with installations that lack it. I don't know how common that is, but I could imagine it being the case on cheap shared web hosts.
So a compatible solution is, IMHO, to check whether preg_replace() failed and, in that case, fall back to a regular expression without multibyte chars.
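A rough sketch of what such a fallback could look like. This is not the code that shipped in FeedWriter: the function name filterInvalidXmlChars, the '_' default replacement, and the byte-only fallback pattern are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper (not FeedWriter's actual implementation).
 * Replaces characters that are invalid in XML, falling back to a
 * byte-oriented pattern when the Unicode-aware one fails.
 */
function filterInvalidXmlChars(string $text, string $replacement = '_'): string
{
    // First attempt: Unicode-aware pattern. Needs valid UTF-8 input.
    $result = preg_replace(
        '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u',
        $replacement,
        $text
    );

    if ($result !== null) {
        return $result;
    }

    // Fallback: no /u modifier and no multibyte chars. This only replaces
    // ASCII control characters that XML 1.0 forbids and leaves every
    // byte >= 0x80 untouched, so it works for any single-byte encoding.
    $result = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]+/', $replacement, $text);

    // If even the fallback fails, return the input unchanged rather
    // than an empty string.
    return $result ?? $text;
}
```

The fallback is deliberately weaker than the Unicode pattern: it cannot detect invalid code points above 0x7F, but it never discards the whole string.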
If you want to reproduce the problem yourself, here's the code:
header("Content-Type: text/plain");
$string = "\x54\x65\x73\x74\x09\xc1\xe9\x75\xc3";
mb_regex_encoding('UTF-8');
$after = mb_ereg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', '_', $string);
var_dump($after);
$after = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', '_', $string);
var_dump($after);
var_dump(preg_last_error());
The result is:
string(9) "Test ÁéuÃ"
NULL
int(4)
As you can see, the regex is identical, but preg_replace() exited with a PREG_BAD_UTF8_ERROR error.
Fixed with release of v1.0.3.