How to debug feeds that throw an error?

Question

kylealwyn opened this issue 2 years ago

I'm trying to read a feed such as https://www.nature.com/nature.rss, but I get an error both locally and in the demo. I ran the address through the W3C validator and it came up valid.

Somewhat related: I'm also trying to use a proxy, but to no avail, as http://api_key@proxy.scrapingbee.com:8887 throws an Invalid URL error.

Kyle Alwyn · Answer 1 · Thu Jan 12 2023 08:24:34 GMT+0800 (China Standard Time)

I'm curious whether something similar to extractus/article-extractor#326 is viable for this library. It would be great to fetch the XML on my own and provide it to this parser.

Kyle Alwyn · Answer 2 · Thu Jan 12 2023 08:28:55 GMT+0800 (China Standard Time)

Sorry, one last thing: I think the type for headers in FetchOptions is incorrect. I believe it should be something like Record<string, string> rather than string[]:

export interface FetchOptions {
  /**
   * list of request headers
   * default: null
   */
  headers?: string[];
  /**
   * the values to configure proxy
   * default: null
   */
  proxy?: ProxyConfig;
}
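
For illustration, here is a minimal sketch of what the suggested typing might look like. The interface names FetchOptionsSketch and ProxyConfigSketch are hypothetical, the header values are made up, and the only ProxyConfig field shown is target, which is taken from the usage later in this thread. It is not the library's actual declaration.

```typescript
// Hypothetical stand-in for ProxyConfig; only `target` is known from this thread.
interface ProxyConfigSketch {
  target: string;
}

// Suggested correction: headers as a map from header name to header value.
interface FetchOptionsSketch {
  headers?: Record<string, string>;
  proxy?: ProxyConfigSketch;
}

// Example value that type-checks against the suggested shape.
const options: FetchOptionsSketch = {
  headers: {
    'user-agent': 'my-feed-reader/1.0',
    accept: 'application/rss+xml, application/xml',
  },
};

// Header names are available as keys, which a string[] type cannot express.
const headerNames = Object.keys(options.headers ?? {});
```

With string[], there is no obvious place to put the header name, which is why a name-to-value map seems like the better fit.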

Dong Nguyen · Answer 3 · Thu Jan 12 2023 10:01:31 GMT+0800 (China Standard Time)

@kylealwyn same idea, this lib should have that method too.

Dong Nguyen · Answer 4 · Thu Jan 12 2023 10:20:52 GMT+0800 (China Standard Time)

@kylealwyn https://www.nature.com/nature.rss uses RDF. It's been a long time since I've seen this format!

Dong Nguyen · Answer 5 · Thu Jan 12 2023 13:06:31 GMT+0800 (China Standard Time)

@kylealwyn v6.2.1 has just been released with 2 new methods for extracting feed data from an XML or JSON string. That may resolve your case.

Regarding https://www.nature.com/nature.rss, we have no plan to support the RDF format right now, because this format is quite rarely used.

> Somewhat related, I'm also trying to use a proxy but to no avail as http://api_key@proxy.scrapingbee.com:8887 is throwing Invalid URL

Could you share more of your code here? This lib does not modify or validate the proxy URL. It simply prefers to use the URL from the proxy config if one is present.

Kyle Alwyn · Answer 6 · Thu Jan 12 2023 13:43:43 GMT+0800 (China Standard Time)

Awesome! Will check it out. It would be great to expose the utils that validate whether a string is an XML or JSON feed, or to have a unified entrypoint that runs the validation and normalization, but I will copy those over for now!
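
To make the "unified entrypoint" idea concrete, here is a rough sketch under stated assumptions. detectFeedKind and parseFeed are hypothetical names, not functions exported by this library, and the actual XML and JSON parsers are passed in by the caller rather than assumed.

```typescript
type FeedKind = 'json' | 'xml' | 'unknown';

// Guess the feed kind from the first non-whitespace character.
// This is a cheap heuristic, not a validation of the feed itself.
function detectFeedKind(text: string): FeedKind {
  const trimmed = text.trimStart();
  if (trimmed.startsWith('<')) return 'xml';
  if (trimmed.startsWith('{')) {
    try {
      JSON.parse(trimmed);
      return 'json';
    } catch {
      return 'unknown';
    }
  }
  return 'unknown';
}

// Dispatch to whichever parser the caller supplies for the detected kind.
function parseFeed<T>(
  text: string,
  parsers: { xml: (source: string) => T; json: (source: string) => T },
): T {
  const kind = detectFeedKind(text);
  if (kind === 'unknown') {
    throw new Error('Input does not look like an XML or JSON feed');
  }
  return parsers[kind](text);
}
```

A caller would fetch the feed body on their own, then hand the string to parseFeed along with the library's two string-based methods as the xml and json parsers.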

> Regarding https://www.nature.com/nature.rss, we have no plan to support the RDF format right now, because this format is quite rarely used.

Makes sense

> Could you share more of your code here? This lib does not modify or validate the proxy URL. It simply prefers to use the URL from the proxy config if one is present.

I'm doing something like this:

const res = await read(
  feed.xmlUrl,
  {},
  {
    proxy: {
      target: 'http://127.0.0.1:3001',
    },
  },
);

Here the target is the URL I initially shared, or any IP/port combination, and I get back an Invalid URL error.
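
One way to narrow this down is to check which string actually fails to parse. The sketch below assumes, without confirmation from this thread, that the request URL is built by appending the encoded feed URL to target. canParseUrl is a hypothetical helper, and the '/?url=' prefix is only an example of a target that ends where a URL parameter can be appended.

```typescript
// Report whether the WHATWG URL parser accepts the given string.
function canParseUrl(candidate: string): boolean {
  try {
    new URL(candidate);
    return true;
  } catch {
    return false;
  }
}

const feedUrl = 'https://www.nature.com/nature.rss';

// Target exactly as used in the snippet above.
const bareTarget = 'http://127.0.0.1:3001';
// Hypothetical target that ends with a query parameter for the feed URL.
const prefixTarget = 'http://127.0.0.1:3001/?url=';

// Assumed combination step: target followed by the encoded feed URL.
const combinedBare = bareTarget + encodeURIComponent(feedUrl);
const combinedPrefix = prefixTarget + encodeURIComponent(feedUrl);

console.log({
  bareTarget: canParseUrl(bareTarget),
  combinedBare: canParseUrl(combinedBare),
  combinedPrefix: canParseUrl(combinedPrefix),
});
```

If the assumption holds, the target parses fine on its own, but the bare combination fails because the encoded feed URL runs straight into the port number. That would explain why any IP/port target produces the same error.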

Kyle Alwyn · Answer 7 · Thu Jan 12 2023 14:11:40 GMT+0800 (China Standard Time)

Also, what would be the lift on supporting RDF feeds? https://rss.slashdot.org/Slashdot/slashdotMain is another big one I'm interested in. I'm seeing the format quite a bit in my explorations.

Dong Nguyen · Answer 8 · Thu Jan 12 2023 14:59:22 GMT+0800 (China Standard Time)

@kylealwyn thank you. RDF can reuse almost all of the logic from the RSS parser. I will try to implement a draft.
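
As a rough illustration of where RDF would slot in, the sketch below tells the three XML feed flavors apart by their root element: rss for RSS 2.0, feed for Atom, and rdf:RDF for RSS 1.0. detectXmlFeedFormat is a hypothetical helper, not part of this library. It uses a regex rather than a real XML parser and assumes the conventional rdf namespace prefix.

```typescript
type XmlFeedFormat = 'rss' | 'atom' | 'rdf' | 'unknown';

function detectXmlFeedFormat(xml: string): XmlFeedFormat {
  // Drop processing instructions, comments, and DOCTYPE so the first
  // remaining tag is the root element.
  const withoutProlog = xml
    .replace(/<\?[\s\S]*?\?>/g, '')
    .replace(/<!--[\s\S]*?-->/g, '')
    .replace(/<!DOCTYPE[^>]*>/gi, '');

  // Capture the root element name, including an optional namespace prefix.
  const match = withoutProlog.match(/<([A-Za-z_][\w.-]*(?::[\w.-]+)?)/);
  if (!match) return 'unknown';

  const root = match[1].toLowerCase();
  if (root === 'rss') return 'rss';
  if (root === 'feed') return 'atom';
  if (root === 'rdf:rdf') return 'rdf';
  return 'unknown';
}
```

Once a document is recognized as RDF, its item elements carry title, link, and description much like RSS items do, which is presumably why most of the RSS logic can be reused.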