Harvest hangs on UTF-8 errors
mjlassila opened this issue
I'm wondering, is it intentional that harvesting hangs when invalid UTF-8 is encountered? I'm getting the following error and the harvesting stops.
XML syntax error on line 567: invalid UTF-8
If it is possible, it would be nice if harvesting could continue even in the case of UTF-8 errors, as it does for HTTP errors when the user provides the -ignore-http-errors flag. I'm using metha 0.1.15, installed via go get.
Thanks for the bug report!
is it intentional that harvesting hangs when invalid UTF-8 is encountered
No, not intentional.
If it is possible, it would be nice if harvesting could continue even in the case of UTF-8 errors
Yes, this might be a good idea. Do you have an example endpoint URL, where this problem occurs?
@mjlassila, I actually found one example endpoint myself:
$ metha-sync http://firstmonday.org/ojs/index.php/fm/oai
....
2016/11/24 17:29:40 http://firstmonday.org/ojs/index.php/fm/oai?from=2...
2016/11/24 17:29:41 XML syntax error on line 273: invalid UTF-8
Firefox doesn't like it either and reports "XML Parsing Error: not well-formed".
<dc:creator>Lugano, Giuseppe; University of Jyv�skyl�</dc:creator>
-------------------------------------------------------^
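For illustration, the failure can be reproduced outside of metha with Go's standard encoding/xml package, which rejects character data containing invalid UTF-8. The sketch below is not metha's code: the record type and the parseStrict and parseLenient helpers are hypothetical names, and the payload is a made-up fragment with an ISO-8859-1 "ä" byte (0xE4) embedded in it. It also shows one possible workaround, assuming that replacing the bad bytes with U+FFFD is acceptable for the harvested data.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// record is a hypothetical, minimal stand-in for a harvested record.
type record struct {
	Creator string `xml:"creator"`
}

// parseStrict decodes the payload as-is; encoding/xml returns a
// syntax error if the character data is not valid UTF-8.
func parseStrict(data []byte) (record, error) {
	var r record
	err := xml.Unmarshal(data, &r)
	return r, err
}

// parseLenient replaces every invalid UTF-8 sequence with U+FFFD
// before decoding. This loses the original bytes, so it is only an
// option when a lossy record is preferable to no record at all.
func parseLenient(data []byte) (record, error) {
	return parseStrict(bytes.ToValidUTF8(data, []byte("\uFFFD")))
}

func main() {
	// 0xE4 is "ä" in ISO-8859-1 but not a valid UTF-8 sequence here.
	data := []byte("<record><creator>University of Jyv\xe4skyl\xe4</creator></record>")

	_, err := parseStrict(data)
	fmt.Println("strict:", err)

	r, err := parseLenient(data)
	fmt.Println("lenient:", r.Creator, err)
}
```

Whether a harvester should sanitize like this or reject the response outright is a design choice: sanitizing keeps the harvest going, while rejecting keeps malformed data out of the local cache.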
Unfortunately, the endpoint I'm using is in an access-restricted network, but I'm glad that you found a suitable open endpoint for testing.
I cannot reproduce this error with my previous example, http://firstmonday.org/ojs/index.php/fm/oai. For XML that is not well-formed, it might be OK to reject it. If you do not object, I will close this issue for now.
Thanks!