algolia / docsearch-scraper

DocSearch - Scraper

Home Page: https://docsearch.algolia.com/

Error crawling from docker image, invalid JSON

jacknewwl opened this issue:

Hi guys,

I'm trying to run a crawl, but it fails with an error, and I'm not sure whether it's a syntax error or whether I've got the files in the wrong places. Hoping to get some feedback or guidance. Thanks.

Input:
docker run -it --env-file=.env -e "CONFIG=$(cat ./website/config.json | jq -r tostring)" algolia/docsearch-scraper

Output:
Traceback (most recent call last):
  File "/root/src/config/config_loader.py", line 107, in _load_config
    data = json.loads(config, object_pairs_hook=OrderedDict)
  File "/usr/lib/python3.6/json/__init__.py", line 367, in loads
    return cls(**kw).decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/src/index.py", line 98, in <module>
    run_config(environ['CONFIG'])
  File "/root/src/index.py", line 28, in run_config
    config = ConfigLoader(config)
  File "/root/src/config/config_loader.py", line 75, in __init__
    data = self._load_config(config)
  File "/root/src/config/config_loader.py", line 112, in _load_config
    raise ValueError('CONFIG is not a valid JSON')
ValueError: CONFIG is not a valid JSON

My folder arrangement:
(screenshot of the folder layout, not preserved)

config.json:
{
  "index_name": "treaty",
  "start_urls": ["https://competent-lalande-599ab3.netlify.com/docs/homepage"],
  "selectors": {
    "lvl0": "#content header h1",
    "lvl1": "#content article h1",
    "lvl2": "#content section h3",
    "lvl3": "#content section h4",
    "lvl4": "#content section h5",
    "lvl5": "#content section h6",
    "text": "#content header p,#content section p,#content section ol"
  }
}
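
The config.json shown above is well-formed JSON, so one plausible reading of "Expecting value: line 1 column 1 (char 0)" is that the CONFIG variable reached the container empty, or still holding the unexpanded $(cat ...) text. The sketch below is not the scraper's code; it only mirrors the json.loads call visible in the traceback. The helper name config_error is made up for illustration.

```python
import json
from collections import OrderedDict


def config_error(raw):
    """Return None if raw parses as JSON, otherwise the decoder's message.

    Hypothetical helper: it repeats the json.loads call from the traceback
    so the failure can be reproduced outside the container.
    """
    try:
        json.loads(raw, object_pairs_hook=OrderedDict)
    except json.JSONDecodeError as exc:
        return str(exc)
    return None


# A well-formed config string parses without error.
print(config_error('{"index_name": "treaty"}'))

# An empty CONFIG fails at the very first character.
print(config_error(""))

# So does a command substitution that the shell never expanded.
print(config_error("$(cat ./website/config.json | jq -r tostring)"))
```

Both failing cases report the same position as the traceback in this issue, which is why checking what the shell actually put into CONFIG is a reasonable first step.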

For reference, this issue was solved on Discord.

Troubleshooting:

  • The cat command is not available on Windows (for example in cmd.exe)
  • Make sure there is no ' (single quote) in the .env file
  • Make sure the Docker image is up to date with the one on Docker Hub
  • Watch out for the typo algolia/docsearch-scrapper. The image name is algolia/docsearch-scraper.
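
The single-quote check in the list above can be automated. This is a minimal sketch, assuming that docker run --env-file passes values through literally, so a quote in the file ends up inside the value. The function name and the sample values are placeholders for illustration, not real credentials.

```python
def lines_with_single_quote(env_text):
    """Return (line_number, line) pairs for entries containing a single quote.

    Blank lines and lines starting with '#' are skipped.
    """
    hits = []
    for number, line in enumerate(env_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if "'" in stripped:
            hits.append((number, stripped))
    return hits


# Placeholder .env content: the first entry is quoted, the second is not.
sample = "APPLICATION_ID='YOUR_APP_ID'\nAPI_KEY=YOUR_API_KEY\n"
for number, line in lines_with_single_quote(sample):
    print(f"line {number}: remove the quotes in {line}")
```

To check a real file, pass it the file contents, for example the result of open(".env").read().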

I have a similar problem and it's not working for me. I tried the troubleshooting steps too.

What can we use instead of the cat command on Windows?
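
One possible answer, offered as a sketch and not as the project's recommended method: let Python read and compact the config instead of cat and jq, and hand the arguments to Docker as a list so no shell quoting is involved. The function names compact_config and docker_command are made up for illustration; the paths and image name are the ones from the command in this issue.

```python
import json
from pathlib import Path


def compact_config(path):
    """Read a JSON config file and return it as a single-line string.

    json.loads raises an error here if the file is not valid JSON, so a
    broken config is caught before Docker is started.
    """
    text = Path(path).read_text(encoding="utf-8")
    data = json.loads(text)
    return json.dumps(data, separators=(",", ":"))


def docker_command(config_path, env_file=".env",
                   image="algolia/docsearch-scraper"):
    """Build the argument list for the docker run call from this issue."""
    return [
        "docker", "run", "-it",
        f"--env-file={env_file}",
        "-e", f"CONFIG={compact_config(config_path)}",
        image,
    ]
```

From a terminal in the project folder, the command could then be started with subprocess.run(docker_command("./website/config.json"), check=True) after import subprocess. This should behave the same on Windows, macOS, and Linux, since it does not depend on the shell expanding $(...).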