algolia / docsearch-scraper

DocSearch - Scraper

Home Page: https://docsearch.algolia.com/


Cannot index pages when using a custom port

ArthurFlag opened this issue · comments

Hi,

I've been using DocSearch without issue for weeks, but suddenly most of my content is not being indexed.
I'm running a static website built with Sphinx, hosted locally at localhost:8080.

I'm currently indexing it with a local install of docsearch-scraper (updated to the latest master), using the following config:

```json
{
  "index_name": "abc-index",
  "sitemap_urls": ["http://127.0.0.1:8080/sitemap.xml"],
  "start_urls": [
    {
      "url": "http://127.0.0.1:8080/docs/"
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/abc.html",
      "selector_key": "api-docs",
      "page_rank": 5
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/associate.html",
      "selector_key": "api-docs",
      "page_rank": 4
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/fulfillment.html",
      "selector_key": "api-docs",
      "page_rank": 0
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/consumer.html",
      "selector_key": "api-docs",
      "page_rank": -1
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/",
      "selector_key": "api-docs"
    }
  ],
  "stop_urls": [
    "http://127.0.0.1:8080/abc-cloud/../",
    "http://127.0.0.1:8080/abc-cloud/new.html",
    "http://127.0.0.1:8080/abc-cloud/hooks_new.html"
  ],
  "selectors": {
    "default": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "h5",
      "text": "p, li"
    },
    "api-docs": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "pre code.json",
      "text": "p, li"
    }
  }
}
```

I get the following output from docsearch:

> DocSearch: http://127.0.0.1:8080/docs/ 60 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/ 1 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/associate.html 509 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/consumer.html 745 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/fulfillment.html 662 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/abc.html 2288 records)

What I notice is that the crawler does not seem to descend into /docs/* or /abc-cloud/*; only the pages listed as start URLs are crawled.
How do I make the crawler recursive?

Thank you
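
For anyone skimming the config above: each `lvlN` entry in a `selectors` block is a CSS selector, and the scraper combines the matched elements into hierarchical records. A rough illustration of that idea, using BeautifulSoup as a stand-in; this is not the scraper's actual extraction code:

```python
# Illustrative only: approximates how a selector set like "default"
# maps headings and text to one hierarchical record.
from bs4 import BeautifulSoup

html = """
<h1>ABC Cloud</h1>
<h2>Getting started</h2>
<p>Install the client.</p>
"""
soup = BeautifulSoup(html, "html.parser")

selectors = {"lvl0": "h1", "lvl1": "h2", "text": "p, li"}

record = {}
for level, css in selectors.items():
    el = soup.select_one(css)
    record[level] = el.get_text(strip=True) if el else None

print(record)
# {'lvl0': 'ABC Cloud', 'lvl1': 'Getting started', 'text': 'Install the client.'}
```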

Got a response from @s-pace on the docsearch repo:

👋 @arthurflageul

This repo is only related to the front-end part and the documentation of the product. Please move it here: https://github.com/algolia/docsearch-scraper

Some quick leads that might help you debug:

* Do not use URLs with a port, as it can adversely affect the crawl

* Are you sure that the sitemap is correctly parsed? Pages crawled from the sitemap are printed in cyan

* Are you sure that the missing pages are linked from a crawled page via an `<a>` tag?

* The stop_url `"http://127.0.0.1:8080/abc-cloud/../"` is interpreted as a regex, so be careful of side effects (see the sketch after this list)

* Comment out [these two lines](https://github.com/algolia/docsearch-scraper/blob/master/scraper/src/index.py#L55-L56) and run it again to see the full Scrapy logs; you will have more details about the crawl
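
To make the regex point concrete: because `stop_urls` entries are treated as regular expressions, the unescaped dots in `/../` match any two characters, so that entry can exclude far more than the literal parent path. A quick check in Python (illustrative; the scraper's exact matching may differ slightly):

```python
import re

# Taken verbatim from the config above; "." matches any character in a regex.
stop = "http://127.0.0.1:8080/abc-cloud/../"

# Unintended match: "/../" matches any two characters between slashes, so
# pages in two-character subdirectories of /abc-cloud/ get excluded too.
print(bool(re.search(stop, "http://127.0.0.1:8080/abc-cloud/ab/page.html")))  # True

# Escaping the pattern restricts it to the literal "../" path:
print(bool(re.search(re.escape(stop), "http://127.0.0.1:8080/abc-cloud/ab/page.html")))  # False
```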

OK, so after double-checking, it seems to be because I'm running on port 8080.

For a bit of context, I'm indexing a private website, so I run it locally and have a Jenkins job run the docsearch Docker image against it.
Port 80 is typically reserved on most setups, so I have to use something else when I have Jenkins set up a local server for itself.

How much effort would it be to allow other ports?
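
(To illustrate why port 80 is off the table for me: on a typical Linux CI agent, binding any port below 1024 requires root or CAP_NET_BIND_SERVICE, which the Jenkins user doesn't have. A quick check:)

```python
# Run as a non-root user on Linux: binding a privileged port fails.
import socket

s = socket.socket()
try:
    s.bind(("127.0.0.1", 80))  # PermissionError unless privileged
    print("bound port 80")
except PermissionError as e:
    print("cannot bind port 80:", e)
finally:
    s.close()
```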

It's just how we parse the URL; using a port will break everything. I would recommend using a local DNS entry instead
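
Concretely, that advice would look something like this (a sketch, assuming you can edit the hosts file and bind port 80; the hostname `docs.local` is made up for illustration): add `127.0.0.1 docs.local` to `/etc/hosts`, serve the docs on port 80, and drop the port from every URL in the config:

```json
{
  "index_name": "abc-index",
  "sitemap_urls": ["http://docs.local/sitemap.xml"],
  "start_urls": [{ "url": "http://docs.local/docs/" }]
}
```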

Thanks for the answer.
This is a very unfortunate design decision, and it is making things complicated for some of your paying customers 😞
If there is a way to push a feature to your backlog, I would like to request the ability to crawl any URL, regardless of the port.

Anyway, as a first improvement, I think this should be clearly documented.

It is not good practice to use a port in production, which is why we do not document it.

Thanks for sharing. I would recommend using the default port 80 and not specifying it explicitly.

🙏

It just hit me as well. I too think that it should at least be documented, or generate a meaningful error. Developing and testing a website on localhost port 8080 is standard procedure, which makes this bug (?) a perfect beginners' trap.

Ok

Let me start by saying I really appreciate you guys making this tool open source; it's amazing.

Now, I just got hit by this as well, and it took me many hours before finding this thread.

> It is not good practice to use a port in production, which is why we do not document it.
> Thanks for sharing. I would recommend using the default port 80 and not specifying it explicitly.

isn't a good justification. Tons of people run on non-default ports. This should, at a minimum, be documented.

Same here. Let me also say that I genuinely like Algolia, but this here is a bit ignorant:

> Port 80 is typically reserved on most setups

> I would recommend using the default port 80 and not specifying it explicitly.

I also spent a good few hours on this and still have no workaround.


Gonna bump this one because we need that port feature to test the scraper in a dev environment.