algolia / docsearch-scraper

DocSearch - Scraper

Home Page: https://docsearch.algolia.com/

Scraper indexes folders, not just their files

ArthurFlag opened this issue · comments

Hi,

My website is behind authentication, so to index it I start a local HTTP server (either httpd via its Docker image or Python's http.server).
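
For reference, a minimal sketch of the Python option, assuming the built site lives under ./docs of the current directory (with a non-default port like this, the host in start_urls below would need :8000 appended):

# serve the current directory, so the docs are reachable at http://127.0.0.1:8000/docs/
python3 -m http.server 8000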

I have content folders that contain no index.html, only other HTML files:

├── myapp
│      ├── catalog.html
│      ├── users.html
│      └── myapp.html

And somehow, only certain folders are indexed (along with their content, which is good) 🤔:
[screenshot of the indexed folder records]

It does this with only 2 folders out of the dozens I'm scraping. There is nothing different about these folders compared to the rest.

I'm using this basic config file:

{
  "index_name": "newstore-index",
  "start_urls": [
    {
      "url": "http://127.0.0.1/docs/"
    }
  ],
  "selectors": {
    "default": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "h5",
      "text": "p, li"
    }
  }
}

I'm not sure where the problem is: could it be the server or the scraper?

👋 @arthurflageul,

The source of the problem could be either. I would highly recommend using a sitemap; you can find some details here. It will be straightforward.
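
As a sketch, the scraper's config accepts a sitemap_urls list alongside start_urls; the sitemap.xml path below is an assumption about your setup:

{
  "index_name": "newstore-index",
  "sitemap_urls": [
    "http://127.0.0.1/docs/sitemap.xml"
  ],
  "start_urls": [
    { "url": "http://127.0.0.1/docs/" }
  ]
}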

Some points to check:

  • Make sure the missing pages are referenced from another crawled page via a hyperlink (an <a/> tag).
  • Make sure the missing pages are served with a 200 HTTP status (a quick check is sketched below).
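
For that second point, a HEAD request against one of the pages shows its status (the path here is taken from the tree above):

# the first line of the response shows the HTTP status
curl -I http://127.0.0.1/docs/myapp/catalog.html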

If you want to investigate further, you can look under the hood and focus on Scrapy's logs by switching this parameter to DEBUG. You will then need to run the crawler from source, as sketched below.
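
A sketch of running from source, assuming the repository's pipenv workflow (commands may differ between versions):

# from a clone of algolia/docsearch-scraper
pipenv install
pipenv shell
# with the log level switched to DEBUG, Scrapy logs every request it crawls
./docsearch run /path/to/config.json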

Closing this issue since it is related to a personal setup that does not only involve the scraper.

Great, thanks a lot (again) 👍