algolia / docsearch-scraper

DocSearch - Scraper

Home Page: https://docsearch.algolia.com/


Cannot index pages when using a custom port

ArthurFlag opened this issue · comments

Hi,

I've been using DocSearch without issue for weeks, but suddenly most of my content is not being indexed.
I'm running a static website built with Sphinx, hosted locally at localhost:8080.

I'm currently indexing it with a local install of docsearch-scraper (updated to the latest master), using the following config:

```json
{
  "index_name": "abc-index",
  "sitemap_urls": ["http://127.0.0.1:8080/sitemap.xml"],
  "start_urls": [
    {
      "url": "http://127.0.0.1:8080/docs/"
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/abc.html",
      "selector_key": "api-docs",
      "page_rank": 5
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/associate.html",
      "selector_key": "api-docs",
      "page_rank": 4
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/fulfillment.html",
      "selector_key": "api-docs",
      "page_rank": 0
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/consumer.html",
      "selector_key": "api-docs",
      "page_rank": -1
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/",
      "selector_key": "api-docs"
    }
  ],
  "stop_urls": [
    "http://127.0.0.1:8080/abc-cloud/../",
    "http://127.0.0.1:8080/abc-cloud/new.html",
    "http://127.0.0.1:8080/abc-cloud/hooks_new.html"
  ],
  "selectors": {
    "default": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "h5",
      "text": "p, li"
    },
    "api-docs": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "pre code.json",
      "text": "p, li"
    }
  }
}
```

I get the following output from docsearch:

> DocSearch: http://127.0.0.1:8080/docs/ 60 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/ 1 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/associate.html 509 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/consumer.html 745 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/fulfillment.html 662 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/abc.html 2288 records)

What I notice is that the crawler does not seem to descend into /docs/* or /abc-cloud/*; only the pages listed as start URLs are crawled.
How do I make the crawler recursive?

Thank you
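
For anyone skimming the config above: each `lvlN` entry in a `selectors` block is a CSS selector, and the scraper combines the matched elements into hierarchical records. A rough illustration of that idea, using BeautifulSoup as a stand-in; this is not the scraper's actual extraction code:

```python
# Illustrative only: approximates how a selector set like "default"
# maps headings and text to one hierarchical record.
from bs4 import BeautifulSoup

html = """
<h1>ABC Cloud</h1>
<h2>Getting started</h2>
<p>Install the client.</p>
"""
soup = BeautifulSoup(html, "html.parser")

selectors = {"lvl0": "h1", "lvl1": "h2", "text": "p, li"}

record = {}
for level, css in selectors.items():
    el = soup.select_one(css)
    record[level] = el.get_text(strip=True) if el else None

print(record)
# {'lvl0': 'ABC Cloud', 'lvl1': 'Getting started', 'text': 'Install the client.'}
```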

Got a response from @s-pace on the docsearch repo:

👋 @arthurflageul

This repo is only related to the front-end part and the documentation of the product. Please move it here: https://github.com/algolia/docsearch-scraper

Some quick leads that might help you debug:

* Do not use URLs with a port, as it can adversely affect the crawl

* Are you sure that the sitemap is correctly parsed? Pages crawled from the sitemap are printed in cyan

* Are you sure that the missing pages are linked from a crawled page via an `<a>` tag?

* The stop_url `"http://127.0.0.1:8080/abc-cloud/../"` is interpreted as a regex, so be careful of side effects (see the sketch after this list)

* Comment out [these two lines](https://github.com/algolia/docsearch-scraper/blob/master/scraper/src/index.py#L55-L56) and run it again to see the full Scrapy logs; you will have more details about the crawl
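
To make the regex point concrete: because `stop_urls` entries are treated as regular expressions, the unescaped dots in `/../` match any two characters, so that entry can exclude far more than the literal parent path. A quick check in Python (illustrative; the scraper's exact matching may differ slightly):

```python
import re

# Taken verbatim from the config above; "." matches any character in a regex.
stop = "http://127.0.0.1:8080/abc-cloud/../"

# Unintended match: "/../" matches any two characters between slashes, so
# pages in two-character subdirectories of /abc-cloud/ get excluded too.
print(bool(re.search(stop, "http://127.0.0.1:8080/abc-cloud/ab/page.html")))  # True

# Escaping the pattern restricts it to the literal "../" path:
print(bool(re.search(re.escape(stop), "http://127.0.0.1:8080/abc-cloud/ab/page.html")))  # False
```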

OK, so after double-checking, it seems to be because I'm running on port 8080.

For a bit of context, I'm indexing a private website, so I run it locally and have a Jenkins job run the docsearch Docker image against it.
Port 80 is typically reserved on most setups, so I have to use something else when I have Jenkins set up a local server for itself.

How much effort would it be to allow other ports?
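
(To illustrate why port 80 is off the table for me: on a typical Linux CI agent, binding any port below 1024 requires root or CAP_NET_BIND_SERVICE, which the Jenkins user doesn't have. A quick check:)

```python
# Run as a non-root user on Linux: binding a privileged port fails.
import socket

s = socket.socket()
try:
    s.bind(("127.0.0.1", 80))  # PermissionError unless privileged
    print("bound port 80")
except PermissionError as e:
    print("cannot bind port 80:", e)
finally:
    s.close()
```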

It's just how we parse the URL; using a port will break everything. I would recommend using a local DNS entry instead
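
Concretely, that advice would look something like this (a sketch, assuming you can edit the hosts file and bind port 80; the hostname `docs.local` is made up for illustration): add `127.0.0.1 docs.local` to `/etc/hosts`, serve the docs on port 80, and drop the port from every URL in the config:

```json
{
  "index_name": "abc-index",
  "sitemap_urls": ["http://docs.local/sitemap.xml"],
  "start_urls": [{ "url": "http://docs.local/docs/" }]
}
```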

Thanks for the answer.
This is a very unfortunate design decision, and it is making things complicated for some of your paying customers 😞
If there is a way to push a feature to your backlog, I would like to request the ability to crawl any URL, regardless of the port.

Anyway, as a first improvement, I think this should be clearly documented.

It is not good practice to use a port in production, which is why we do not document it.

Thanks for sharing. I would recommend using the default port 80 and not specifying it explicitly.

🙏

It just hit me as well. I too think that it should at least be documented, or generate a meaningful error. Developing and testing a website on localhost port 8080 is standard procedure, which makes this bug (?) a perfect beginners' trap.

Ok

Let me start by saying I really appreciate you guys making this tool open source; it's amazing.

Now, I just got hit by this as well, and it took me many hours before finding this thread.

> It is not good practice to use a port in production, which is why we do not document it.
> Thanks for sharing. I would recommend using the default port 80 and not specifying it explicitly.

isn't a good justification. Tons of people run on non-default ports. This should, at a minimum, be documented.

Same here. Let me also say that I genuinely like Algolia, but this here is a bit ignorant:

> Port 80 is typically reserved on most setups

> I would recommend using the default port 80 and not specifying it explicitly.

I also spent a good few hours on this and still have no workaround.


Gonna bump this one because we need that port feature to test the scraper in a dev environment.