ERROR:b'UserId' is not in list
kelson42 opened this issue
Unable to scrape 3dprinting: https://farm.openzim.org/recipes/3dprinting.stackexchange.com_en
[ThreadPoolExecutor-0_0::2024-04-08 21:21:41,161] INFO:Extracting 3dprinting.stackexchange.com.7z
[MainThread::2024-04-08 21:21:43,073] INFO:removed badges headers
[MainThread::2024-04-08 21:21:43,073] ERROR:FAILED. An error occurred: b'UserId' is not in list
[MainThread::2024-04-08 21:21:43,074] ERROR:b'UserId' is not in list
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/entrypoint.py", line 348, in main
    sys.exit(scraper.run())
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/scraper.py", line 164, in run
    ark_manager.check_and_prepare_dumps()
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/archives.py", line 160, in check_and_prepare_dumps
    merge_users_with_badges(workdir=self.build_dir, delete_src=self.delete_src)
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/utils/preparation.py", line 511, in merge_users_with_badges
    sort_dump_by_id(
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/utils/preparation.py", line 94, in sort_dump_by_id
    func(src=src, dst=dst, field_num=get_index_in(src, id_attr), delete_src=delete_src)
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/utils/preparation.py", line 51, in get_index_in
    return re.split(rb'\s([a-zA-Z]+)="', line).index(id_attr.encode(UTF8))
ValueError: b'UserId' is not in list
[MainThread::2024-04-08 21:21:43,075] DEBUG:Removing /3dprinting.stackexchange.com_7i2zdhz6
This issue in fact seems to affect all Stack Exchange sites. At least, all new tasks seem to be failing. I'm investigating.
Looks like the issue is linked to the fact that the XML dumps are now stored in UTF-16-LE, while most of the code seems to expect UTF-8 files.
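To illustrate why the encoding would produce exactly this `ValueError`, here is a minimal sketch. It applies the regex from the traceback to one line of bytes, first encoded as UTF-8 and then as UTF-16-LE. The helper `get_index_in_line` and the sample row are hypothetical simplifications for illustration, not sotoki's actual code or real dump data.

```python
import re

UTF8 = "utf-8"


def get_index_in_line(line: bytes, id_attr: str) -> int:
    # Hypothetical stand-in: same expression as in the traceback,
    # but applied to a line passed in directly instead of read from a file.
    return re.split(rb'\s([a-zA-Z]+)="', line).index(id_attr.encode(UTF8))


# Made-up row, shaped like a Stack Exchange dump row.
row = '<row Id="1" UserId="42" Name="Teacher" />'

utf8_line = row.encode("utf-8")
utf16_line = row.encode("utf-16-le")

# In UTF-8 the attribute names are contiguous ASCII bytes, so the split finds them.
utf8_index = get_index_in_line(utf8_line, "UserId")

# In UTF-16-LE every ASCII character is followed by a NUL byte, so the
# byte pattern never matches and the split returns the line unchanged.
try:
    get_index_in_line(utf16_line, "UserId")
    utf16_failed = False
except ValueError:
    utf16_failed = True

print(utf8_index, utf16_failed)
```

With UTF-16-LE input the regex matches nothing, `re.split` returns a one-element list, and `.index()` raises, which is consistent with nothing being parsed at all.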
@rgaudin does this ring a bell?
No; what's happening exactly? Nothing gets parsed at all?
Yup, not parsed at all. Re-encoding gets a little bit further, but there are still many issues to fix. Obviously the SO dumper has been updated, and there are "maybe" too many magic values in sotoki ^^
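For reference, the re-encoding step mentioned above could look roughly like the sketch below. This is only an assumption about one way to do it, not what sotoki does: `reencode_to_utf8` and the file names are hypothetical, and whether the real dumps start with a byte order mark is not established in this thread, so the sketch simply drops one if present.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, src_encoding: str = "utf-16-le") -> None:
    # Hypothetical helper: stream the file line by line so large dumps
    # do not have to fit in memory. newline="" keeps line endings as-is.
    with open(src, "r", encoding=src_encoding, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for number, line in enumerate(fin):
            if number == 0:
                # Assumption: drop a leading byte order mark if there is one.
                line = line.lstrip("\ufeff")
            fout.write(line)


# Made-up sample content, written as UTF-16-LE and then converted.
sample = '<badges>\n  <row Id="1" UserId="42" Name="Teacher" />\n</badges>\n'
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "Badges.xml"
    dst = Path(tmp) / "Badges.utf8.xml"
    src.write_bytes(sample.encode("utf-16-le"))
    reencode_to_utf8(src, dst)
    converted = dst.read_bytes()

print(converted.decode("utf-8"))
```

As noted, this only gets past the first failure; any other assumptions in the code about the dump layout would still need to be checked separately.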