openzim / sotoki

StackExchange websites to ZIM scraper

Home Page: https://library.kiwix.org/?category=stack_exchange

ERROR:b'UserId' is not in list

kelson42 opened this issue

Unable to scrape 3dprinting: https://farm.openzim.org/recipes/3dprinting.stackexchange.com_en

[ThreadPoolExecutor-0_0::2024-04-08 21:21:41,161] INFO:Extracting 3dprinting.stackexchange.com.7z
[MainThread::2024-04-08 21:21:43,073] INFO:removed badges headers
[MainThread::2024-04-08 21:21:43,073] ERROR:FAILED. An error occurred: b'UserId' is not in list
[MainThread::2024-04-08 21:21:43,074] ERROR:b'UserId' is not in list
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/entrypoint.py", line 348, in main
    sys.exit(scraper.run())
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/scraper.py", line 164, in run
    ark_manager.check_and_prepare_dumps()
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/archives.py", line 160, in check_and_prepare_dumps
    merge_users_with_badges(workdir=self.build_dir, delete_src=self.delete_src)
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/utils/preparation.py", line 511, in merge_users_with_badges
    sort_dump_by_id(
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/utils/preparation.py", line 94, in sort_dump_by_id
    func(src=src, dst=dst, field_num=get_index_in(src, id_attr), delete_src=delete_src)
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/utils/preparation.py", line 51, in get_index_in
    return re.split(rb'\s([a-zA-Z]+)="', line).index(id_attr.encode(UTF8))
ValueError: b'UserId' is not in list
[MainThread::2024-04-08 21:21:43,075] DEBUG:Removing /3dprinting.stackexchange.com_7i2zdhz6

This issue in fact seems to affect all StackExchange sites. At least, all new tasks seem to be failing. I'm investigating.

It looks like the issue is linked to the fact that the XML dumps are stored in UTF-16-LE, while most of the code seems to expect UTF-8 files.
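To illustrate why the encoding would produce exactly this `ValueError`, here is a minimal sketch that reuses the `re.split` expression from the traceback on a single line. The sample row and the `get_index_in_line` helper are made up for illustration (the real `get_index_in` reads the line from a file); only the regular expression comes from the traceback. In UTF-16-LE every ASCII character is followed by a `\x00` byte, so the byte pattern never matches and the attribute name never ends up in the split result.

```python
import re

UTF8 = "utf-8"


def get_index_in_line(line: bytes, id_attr: str) -> int:
    # Same expression as in the traceback (preparation.py, get_index_in),
    # applied here to a line passed in directly (hypothetical helper).
    return re.split(rb'\s([a-zA-Z]+)="', line).index(id_attr.encode(UTF8))


# Hypothetical badges row, not taken from a real dump
row = '  <row Id="1" UserId="2" Name="Autobiographer" />\n'

utf8_line = row.encode("utf-8")
utf16_line = row.encode("utf-16-le")

# UTF-8 bytes: the pattern finds each attribute name
print(get_index_in_line(utf8_line, "UserId"))

# UTF-16-LE bytes: each whitespace byte is followed by \x00, not a letter,
# so nothing is split and .index() raises ValueError
try:
    get_index_in_line(utf16_line, "UserId")
except ValueError as exc:
    print("ValueError:", exc)
```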

@rgaudin does it ring any bell in your memory?

No; what's happening exactly? Does nothing get parsed at all?

Yup, nothing is parsed at all. Re-encoding lets it get a little bit further, but there are still many issues to fix. Obviously the SO dumper has been updated, and there are "maybe" too many magic values in sotoki ^^
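For reference, the re-encoding step mentioned above could look roughly like the sketch below. This is not what sotoki does; `reencode_to_utf8`, the file names, and the sample row are all hypothetical, and it assumes the dump is UTF-16-LE with an optional leading BOM. It streams in chunks so a large dump does not need to fit in memory, and it does not touch any XML declaration the file may carry.

```python
import pathlib
import tempfile


def reencode_to_utf8(src: pathlib.Path, dst: pathlib.Path, chunk_size: int = 1 << 20) -> None:
    """Stream-convert a UTF-16-LE text file to UTF-8 (hypothetical helper)."""
    # newline="" keeps line endings exactly as they are in the source
    with open(src, "r", encoding="utf-16-le", newline="") as fin, open(
        dst, "w", encoding="utf-8", newline=""
    ) as fout:
        first = True
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            if first:
                # Drop a leading byte-order mark if the file has one
                if chunk.startswith("\ufeff"):
                    chunk = chunk[1:]
                first = False
            fout.write(chunk)


# Small self-contained demonstration on a temporary file
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "Badges.xml"
    dst = pathlib.Path(tmp) / "Badges.utf8.xml"
    sample = '<row Id="1" UserId="2" Name="Autobiographer" />\n'
    src.write_bytes(b"\xff\xfe" + sample.encode("utf-16-le"))  # BOM + UTF-16-LE
    reencode_to_utf8(src, dst)
    converted = dst.read_bytes()

print(converted)
```

After such a conversion the byte pattern ` UserId="` is present again, which matches the observation that re-encoding gets the scraper past this first error.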