lgraubner / sitemap-generator

Easily create XML sitemaps for your website.

Adding urls stuck in a loop

kure- opened this issue · comments

Hey,
I've discovered that if you create a robots.txt like this:
User-agent: *
Sitemap: http://localhost:1337/sitemap.xml
The generator (or the crawler itself) gets stuck in a loop while adding URLs. I've bound logging to all known events, and only the "add" event is being called. If I also call the "getStats" method, the count of added URLs (3 in total) increases from 0 to 3, then resets and starts over from 0, and the done event is never called.

A similar problem might also happen if you create a robots.txt file like this:
User-agent: *
Disallow: /
Right now it basically ignores all URLs (3 URLs) and doesn't call any event, so it hangs until the server responds with a timeout.

My case is that I'm using Sails.js and I have a robots.txt generator which responds with different content depending on the configured environment, to minimize or deny any indexing of that environment.

Thanks

Interesting. Seems like this problem is not in the scope of this project, but I will look into it and open an issue in the crawler package. Should be easy to reproduce.

Well, yes and no. The same thing happens in this case:
If you put a <meta name="robots" content="noindex, nofollow" /> in your layout, your base URL is ignored, no other URL is crawled, and no other event is called :)

You are partly right. The meta tag is indeed handled by sitemap-generator, but the result is that it returns 0 discovered URLs to the crawler, and the crawler's complete event does not seem to be fired.

I tested the simplecrawler package directly with the robots.txt examples you provided, and they both worked fine so far.
Are you sure the robots.txt content is the same for every request the crawler makes? I'm curious, because it should not reset anything.

I must admit that the behaviour is different right now, which makes my case just weird :-) Right now, with the provided robots.txt, the URLs are just ignored and no complete/done event is fired. No more endless loop for me, but still, no event is being called here :)

Good to know. 😜 I will check if the other problem is caused by this module.

After some investigation I discovered the problem. The callback which emits done was not executed because there were no items to process. Fixed!
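
For readers who hit the same pattern elsewhere, here is a minimal sketch of this kind of bug and its guard. It is not the actual sitemap-generator code: `processAll` and `createGenerator` are hypothetical names, and the only real API used is Node's built-in `events` module. The idea is that a "call done when the last item finishes" counter never reaches its trigger when the list is empty, so the empty case needs an explicit check.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: runs `worker` on every item and calls `onAllDone`
// exactly once after the last item has reported completion.
function processAll(items, worker, onAllDone) {
  // Guard for the empty case: with no items, no worker callback ever runs,
  // so the completion callback must be invoked explicitly.
  if (items.length === 0) {
    onAllDone();
    return;
  }

  let pending = items.length;
  items.forEach((item) => {
    worker(item, () => {
      pending -= 1;
      if (pending === 0) {
        onAllDone();
      }
    });
  });
}

// Hypothetical stand-in for a generator: emits "add" per URL, then "done".
function createGenerator(urls) {
  const emitter = new EventEmitter();
  emitter.start = () => {
    processAll(
      urls,
      (url, next) => {
        emitter.emit('add', url);
        next();
      },
      () => emitter.emit('done')
    );
  };
  return emitter;
}
```

Without the `items.length === 0` check, calling `start()` on a generator whose URLs were all filtered out (for example by `Disallow: /` or a `noindex, nofollow` meta tag) would return without ever emitting `done`, which matches the hang described above.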

I have nothing more to say, except that I love your attitude and love your library 👍

Thanks! 😊