algolia / docsearch-scraper

DocSearch - Scraper

Home Page: https://docsearch.algolia.com/

Python errors when running own scraper

pmenichelli opened this issue · comments

I made the documentation site for the company I work for using Docusaurus and integrated Algolia's DocSearch with it, and I'm getting some strange Python errors when our CI runs the scraper.

These errors don't happen consistently; sometimes the CI runs the scraper successfully. The website we scrape is https://docs.surfly.com and the config file I'm using for the crawler is the following:

DocSearch config
{
  "index_name": "surfly-docs",
  "start_urls": [
    "https://docs.surfly.com/"
    ],
  "sitemap_urls": [
    "https://docs.surfly.com/sitemap.xml"
  ],
  "sitemap_alternate_links": true,
  "stop_urls": [
    "/tests"
  ],
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "conversation_id": [
    "833762294"
  ],
  "nb_hits": 46250
}

If anyone has any tips on what could be going wrong, it would really help me. So far, googling these errors hasn't helped much: the stack traces don't reference any of the scraper's source files, they look more like internal Python errors, so I end up looking at random search results.

Here are the stack traces from the runs where the CI job fails. In every case the crash happens while pipenv is still importing its own modules, before the scraper itself even starts.

The command is:

podman run --rm --env-file=.env -e "CONFIG=$(cat ./docsearch-config.json | jq -r tostring)" algolia/docsearch-scraper
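
For what it's worth, the config itself seems fine: since the job sometimes succeeds with the exact same file, the JSON must parse. A quick standalone check (using only jq, which the command above already relies on) confirms this:

# Sanity check: jq exits non-zero if the config is not valid JSON,
# so this catches quoting/encoding problems before the container runs.
jq empty ./docsearch-config.json && echo "config is valid JSON"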

Error stack traces:

unsupported operand type
[       2ms] > Running command: podman run --rm --env-file=.env -e "CONFIG=$(cat ./docsearch-config.json | jq -r tostring)" algolia/docsearch-scraper
[     396ms] Traceback (most recent call last):
[     396ms]   File "/usr/local/bin/pipenv", line 7, in <module>
[     396ms]     from pipenv import cli
[     396ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/__init__.py", line 22, in <module>
[     396ms]     from pipenv.vendor.urllib3.exceptions import DependencyWarning
[     396ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/urllib3/__init__.py", line 7, in <module>
[     396ms]     import logging
[     396ms]   File "/usr/lib/python3.6/logging/__init__.py", line 28, in <module>
[     396ms]     from string import Template
[     396ms]   File "/usr/lib/python3.6/string.py", line 77, in <module>
[     396ms]     class Template(metaclass=_TemplateMetaclass):
[     396ms]   File "/usr/lib/python3.6/string.py", line 74, in __init__
[     396ms]     cls.pattern = _re.compile(pattern, cls.flags | _re.VERBOSE)
[     396ms]   File "/usr/lib/python3.6/re.py", line 233, in compile
[     396ms]     return _compile(pattern, flags)
[     396ms]   File "/usr/lib/python3.6/re.py", line 301, in _compile
[     396ms]     p = sre_compile.compile(pattern, flags)
[     396ms]   File "/usr/lib/python3.6/sre_compile.py", line 562, in compile
[     396ms]     p = sre_parse.parse(p, flags)
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 855, in parse
[     396ms]     p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
[     396ms]     not nested and not items))
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 765, in _parse
[     396ms]     p = _parse_sub(source, state, sub_verbose, nested + 1)
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
[     396ms]     not nested and not items))
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 765, in _parse
[     396ms]     p = _parse_sub(source, state, sub_verbose, nested + 1)
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
[     396ms]     not nested and not items))
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 764, in _parse
[     396ms]     not (del_flags & SRE_FLAG_VERBOSE))
[     396ms] TypeError: unsupported operand type(s) for &: 'tuple' and 'int'
[     703ms] > Exit code: 1
SystemError: unknown opcode
[       3ms] > Running command: podman run --rm --env-file=.env -e "CONFIG=$(cat ./docsearch-config.json | jq -r tostring)" algolia/docsearch-scraper
[     389ms] XXX lineno: 774, opcode: 163
[     391ms] Traceback (most recent call last):
[     391ms]   File "/usr/local/bin/pipenv", line 7, in <module>
[     391ms]     from pipenv import cli
[     391ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/__init__.py", line 22, in <module>
[     391ms]     from pipenv.vendor.urllib3.exceptions import DependencyWarning
[     391ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/urllib3/__init__.py", line 7, in <module>
[     391ms]     import logging
[     391ms]   File "/usr/lib/python3.6/logging/__init__.py", line 26, in <module>
[     391ms]     import sys, os, time, io, traceback, warnings, weakref, collections
[     391ms]   File "/usr/lib/python3.6/traceback.py", line 5, in <module>
[     391ms]     import linecache
[     391ms]   File "/usr/lib/python3.6/linecache.py", line 11, in <module>
[     391ms]     import tokenize
[     391ms]   File "/usr/lib/python3.6/tokenize.py", line 37, in <module>
[     391ms]     cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)', re.ASCII)
[     391ms]   File "/usr/lib/python3.6/re.py", line 233, in compile
[     391ms]     return _compile(pattern, flags)
[     391ms]   File "/usr/lib/python3.6/re.py", line 301, in _compile
[     391ms]     p = sre_compile.compile(pattern, flags)
[     391ms]   File "/usr/lib/python3.6/sre_compile.py", line 562, in compile
[     391ms]     p = sre_parse.parse(p, flags)
[     391ms]   File "/usr/lib/python3.6/sre_parse.py", line 855, in parse
[     391ms]     p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
[     391ms]   File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
[     391ms]     not nested and not items))
[     391ms]   File "/usr/lib/python3.6/sre_parse.py", line 774, in _parse
[     391ms]     subpatternappend((AT, AT_BEGINNING))
[     391ms] SystemError: unknown opcode
[     709ms] > Exit code: 1
AttributeError: 'Environment' object has no attribute 'scan'
[       2ms] > Running command: podman run --rm --env-file=.env -e "CONFIG=$(cat ./docsearch-config.json | jq -r tostring)" algolia/docsearch-scraper
[     677ms] Traceback (most recent call last):
[     677ms]   File "/usr/local/bin/pipenv", line 11, in <module>
[     677ms]     sys.exit(cli())
[     677ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 829, in __call__
[     677ms]     return self.main(*args, **kwargs)
[     677ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 781, in main
[     677ms]     with self.make_context(prog_name, args, **extra) as ctx:
[     677ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 700, in make_context
[     677ms]     self.parse_args(ctx, args)
[     677ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 1212, in parse_args
[     678ms]     rest = Command.parse_args(self, ctx, args)
[     678ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 1044, in parse_args
[     678ms]     parser = self.make_parser(ctx)
[     678ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 965, in make_parser
[     678ms]     for param in self.get_params(ctx):
[     678ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 912, in get_params
[     678ms]     help_option = self.get_help_option(ctx)
[     678ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/cli/options.py", line 27, in get_help_option
[     678ms]     from ..core import format_help
[     678ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/core.py", line 33, in <module>
[     678ms]     from .project import Project
[     678ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/project.py", line 30, in <module>
[     679ms]     from .vendor.requirementslib.models.utils import get_default_pyproject_backend
[     679ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/requirementslib/__init__.py", line 9, in <module>
[     679ms]     from .models.lockfile import Lockfile
[     679ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/requirementslib/models/lockfile.py", line 9, in <module>
[     680ms]     import plette.lockfiles
[     680ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/plette/__init__.py", line 8, in <module>
[     680ms]     from .lockfiles import Lockfile
[     680ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/plette/lockfiles.py", line 13, in <module>
[     680ms]     from .models import DataView, Meta, PackageCollection
[     680ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/plette/models/__init__.py", line 8, in <module>
[     680ms]     from .base import (
[     680ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/plette/models/base.py", line 2, in <module>
[     680ms]     import cerberus
[     680ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/cerberus/__init__.py", line 21, in <module>
[     680ms]     __version__ = get_distribution("Cerberus").version
[     680ms]   File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 472, in get_distribution
[     680ms]     dist = get_provider(dist)
[     680ms]   File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 344, in get_provider
[     680ms]     return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
[     680ms]   File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 892, in require
[     681ms]     needed = self.resolve(parse_requirements(requirements))
[     681ms]   File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 765, in resolve
[     681ms]     env = Environment(self.entries)
[     681ms]   File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 976, in __init__
[     681ms]     self.scan(search_path)
[     681ms] AttributeError: 'Environment' object has no attribute 'scan'
[    1.096s] > Exit code: 1

Alternatively, if there's a way to run the crawler in a more verbose mode so I can get a clue about what's failing, that would help a lot too.
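
One idea I haven't fully explored yet (assuming the image ships a shell, which I haven't verified): override the container entrypoint to get an interactive shell, then try to reproduce the failing pipenv startup by hand.

# Drop into a shell inside the scraper image instead of running the scraper,
# so the intermittent pipenv import failure can be inspected directly.
# Assumes /bin/bash exists in the image; substitute /bin/sh if it doesn't.
podman run --rm -it --env-file=.env \
  -e "CONFIG=$(jq -r tostring < ./docsearch-config.json)" \
  --entrypoint /bin/bash \
  algolia/docsearch-scraper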