seantomburke / sitemapper

Parses sitemaps for Node.js

Home Page: https://www.npmjs.com/package/sitemapper

zero sites found for https://www.newyorker.com/sitemap.xml

palashkulsh opened this issue · comments

sitemapper is not able to find any sites in https://www.newyorker.com/sitemap.xml. I am not able to figure out why.

That doesn’t look like a proper .xml file. It’s called sitemap.xml, but it looks like just a list of links.

It shows up in XML format in my browser (screenshot attached). Please let me know if I am missing anything.

What I think is happening is that, because there are a lot of child sitemaps, sitemapper tries to fetch all of them in parallel, and as a result no request completes. I am digging further into it and trying to figure out how to reduce the concurrency of the parallel requests.

Yes, that's the case. It is not able to fetch because there is no limit on the number of parallel requests that can be made at a time, and hence all the requests fail.

@seantomburke any ideas on how to curb the unlimited parallel requests?

Let me check by increasing the timeout. Eventually, though, the uncontrolled hits need to be addressed. I am trying to find a way to do that and will raise a PR if I find one.

I don't think increasing the timeout will solve the problem. We will need to work on concurrency.
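
For anyone following along, capping the number of in-flight requests could look roughly like the sketch below. This is not sitemapper's actual code: `mapWithLimit` and the `worker` callback are hypothetical names introduced only for illustration, and the worker stands in for whatever function fetches a single child sitemap.

```javascript
// Hypothetical helper (not part of sitemapper): run an async `worker` over
// `items` with at most `limit` calls in flight at any time.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has claimed yet

  // Each runner repeatedly claims the next index until none remain.
  // Claiming is synchronous, so two runners never take the same item.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  // Start `limit` runners (fewer if there are fewer items) and wait for all.
  const runnerCount = Math.min(limit, items.length);
  const runners = Array.from({ length: runnerCount }, () => runner());
  await Promise.all(runners);
  return results; // same order as `items`
}
```

With something like this, a list of child sitemap URLs could be processed as `mapWithLimit(urls, 5, fetchOne)`, where `fetchOne` is again a placeholder for the per-URL fetch. If any worker call rejects, the whole call rejects, so a real implementation would probably want per-item error handling as well.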

The New Yorker blocks the request because of the user agent.
Try adding this to your options:

const sitemapper = new Sitemapper({
  url: 'https://www.newyorker.com/sitemap.xml',
  timeout: 15000,
  headers: {
    'User-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
  }
});