algolia / docsearch-scraper

DocSearch - Scraper

Home Page: https://docsearch.algolia.com/

Crawler isn't following links

lorensr opened this issue

I'm using the docker container and this config:

https://github.com/GraphQLGuide/book/blob/411bb46629b622a785312f199053e5c55234608d/docsearch.json

When I run the docker command, I get 303 "nb hits", but they all point to different anchors on the start_url page (https://graphql.guide/contents). None of them are for the other pages linked from that page.

Hi @lorensr,

The start_urls are more "a pattern of URLs the crawler should accept" than "which URL should I start with", which is why the other pages are not crawled.

Since the /contents route doesn't have any child routes, nothing else is found, but if you try with "start_urls": ["https://graphql.guide/vue"], it will work.

One way to solve this issue could be to create a sitemap.xml just for the crawler, so it can follow all the pages listed in it (doc).
Or use a more generic start URL, for example "start_urls": ["https://graphql.guide/"].
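
As a rough sketch of the second option, the config could look something like the following. Only the start_urls value comes from this thread; the index_name and the selectors are placeholder assumptions, and the real values should be taken from the docsearch.json linked above.

```json
{
  "index_name": "graphql-guide",
  "start_urls": ["https://graphql.guide/"],
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "text": "p, li"
  }
}
```

With the site root as the accepted pattern, pages such as /contents and /vue both fall under it, so links between them can be followed. For the sitemap approach, the scraper config has a sitemap_urls key that takes a list of sitemap URLs.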

Thank you so much! Generic solution worked great ☺️

No worries, feel free to close the bounty or give it to a charity of your choice :D

Hey @shortcuts, please take a look at my problem in #571.
It is not generating hits for child routes, for example /docs.
And when I enter the complete URL with the /docs route, it shows "ignored start URL".