miku / metha

Command line OAI-PMH harvester and client with built-in cache.

Home Page: https://lab.ub.uni-leipzig.de/metha/

Harvest hangs on UTF-8 errors

mjlassila opened this issue

I'm wondering, is it intentional that harvesting hangs when invalid UTF-8 is encountered? I'm getting the following error and the harvesting stops.

XML syntax error on line 567: invalid UTF-8

If possible, it would be nice if harvesting could continue after UTF-8 errors, as it already does after HTTP errors when the user passes the -ignore-http-errors flag. I'm using metha 0.1.15, installed via go get.

Thanks for the bug report!

> is it intentional that harvesting hangs when invalid UTF-8 is encountered

No, not intentional.

> If it is possible, it would be nice if harvesting could continue even in the case of UTF-8 errors

Yes, this might be a good idea. Do you have an example endpoint URL, where this problem occurs?

@mjlassila, I actually found one example endpoint myself:

$ metha-sync http://firstmonday.org/ojs/index.php/fm/oai
....
2016/11/24 17:29:40 http://firstmonday.org/ojs/index.php/fm/oai?from=2...
2016/11/24 17:29:41 XML syntax error on line 273: invalid UTF-8

Firefox doesn't like it either and reports "XML Parsing Error: not well-formed":

	<dc:creator>Lugano, Giuseppe; University of Jyv�skyl�</dc:creator>
-------------------------------------------------------^

Unfortunately, the endpoint I'm using is on an access-restricted network, but I'm glad you found a suitable open endpoint for testing.

I can no longer reproduce this error with my previous example, http://firstmonday.org/ojs/index.php/fm/oai. When the XML is not well-formed, it might be OK to reject it. If you do not object, I would close this issue for now.
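For comparison, the alternative to rejecting such a response would be to repair the bytes before parsing. The following is only a sketch of that idea, not metha's behavior: decodeLenient and record are hypothetical names, and it assumes that replacing invalid bytes with U+FFFD (the Unicode replacement character) is acceptable for the harvested data.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"unicode/utf8"
)

// record is a hypothetical stand-in for a harvested record; metha's real types differ.
type record struct {
	Creator string `xml:"creator"`
}

// decodeLenient parses data as XML. If the bytes are not valid UTF-8, it first
// replaces each run of invalid bytes with U+FFFD so the decoder does not abort.
// The second return value reports whether a repair was needed.
func decodeLenient(data []byte) (record, bool, error) {
	repaired := false
	if !utf8.Valid(data) {
		data = bytes.ToValidUTF8(data, []byte("\uFFFD"))
		repaired = true
	}
	var r record
	err := xml.Unmarshal(data, &r)
	return r, repaired, err
}

func main() {
	// Assumed input: "ä" encoded as the Latin-1 byte 0xE4 instead of UTF-8.
	bad := []byte("<record><creator>Jyv\xe4skyl\xe4</creator></record>")
	r, repaired, err := decodeLenient(bad)
	fmt.Printf("creator=%q repaired=%v err=%v\n", r.Creator, repaired, err)
}
```

The trade-off is that the repaired record silently loses the original characters, which is one reason rejecting malformed responses can be the safer default.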

Thanks!