extractus / feed-extractor

Simplest way to read & normalize RSS/ATOM/JSON feed data

Home Page: https://extractor-demos.pages.dev/feed-extractor

How to debug feeds that throw an error?

kylealwyn opened this issue

I'm trying to pull something like https://www.nature.com/nature.rss and getting an error both locally and in the demo. I ran the address through the W3C validator and it came up valid.

Somewhat related: I'm also trying to use a proxy, but to no avail, as http://api_key@proxy.scrapingbee.com:8887 throws Invalid URL.

Curious if something similar to extractus/article-extractor#326 is viable for this library - it'd be great to fetch the xml on my own and provide that to this parser

Sorry, last thing: I think the type for headers in FetchOptions is incorrect. I believe it should be something like Record<string, string>:

export interface FetchOptions {
  /**
   * list of request headers
   * default: null
   */
  headers?: string[];
  /**
   * the values to configure proxy
   * default: null
   */
  proxy?: ProxyConfig;
}
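
For illustration, here is a minimal sketch of what the suggested typing could look like. It is not the library's actual declaration: ProposedFetchOptions and normalizeHeaders are hypothetical names, and the ProxyConfig shape only includes the target field that appears in this thread.

```typescript
// Hypothetical shape: only `target` is mentioned in this thread.
interface ProxyConfig {
  target: string;
}

// Proposed typing: header names mapped to header values.
interface ProposedFetchOptions {
  headers?: Record<string, string>;
  proxy?: ProxyConfig;
}

// Hypothetical helper: lower-cases header names so lookups are predictable.
function normalizeHeaders(options: ProposedFetchOptions): Record<string, string> {
  const normalized: Record<string, string> = {};
  for (const [name, value] of Object.entries(options.headers ?? {})) {
    normalized[name.toLowerCase()] = value;
  }
  return normalized;
}

const requestHeaders = normalizeHeaders({
  headers: { 'User-Agent': 'my-feed-reader/1.0', Accept: 'application/rss+xml' },
});
console.log(requestHeaders['user-agent']); // my-feed-reader/1.0
```

With headers typed as string[], the object literal above would not type-check, which is the mismatch being reported.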

@kylealwyn same idea, this lib should have that method too.

@kylealwyn https://www.nature.com/nature.rss uses RDF, It's been a long time since I've seen this format!

@kylealwyn v6.2.1 has just been released with 2 new methods for extracting feed data from an XML or JSON string. That may resolve your case.

Regarding https://www.nature.com/nature.rss, we have no plan to support the RDF format right now, because it is quite rarely used.

> Somewhat related: I'm also trying to use a proxy, but to no avail, as http://api_key@proxy.scrapingbee.com:8887 throws Invalid URL.

Could you share more info about your code here? This lib does not modify or verify the proxy URL. It simply uses the URL from the proxy option if one is present.

Awesome! Will check it out. It would be great to expose the utils that check whether a feed is XML or JSON, or to have a unified entry point that runs the validation and normalization, but I will copy those over for now!
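
A rough sketch of what such a unified entry point might look like, assuming the feed text has already been fetched. detectFeedKind and parseFeed are hypothetical names, not part of the library, and the xml and json callbacks stand in for whichever string-based methods v6.2.1 provides (their names are not stated in this thread).

```typescript
type FeedKind = 'json' | 'xml' | 'unknown';

// Cheap sniffing on the first non-whitespace character; not a full validation.
function detectFeedKind(text: string): FeedKind {
  const trimmed = text.trim();
  if (trimmed.startsWith('{')) {
    try {
      JSON.parse(trimmed);
      return 'json';
    } catch {
      return 'unknown';
    }
  }
  if (trimmed.startsWith('<')) {
    return 'xml';
  }
  return 'unknown';
}

// The parsers are injected, so this sketch does not assume any library API.
interface FeedParsers<T> {
  xml: (source: string) => T;
  json: (source: string) => T;
}

function parseFeed<T>(text: string, parsers: FeedParsers<T>): T {
  const kind = detectFeedKind(text);
  if (kind === 'unknown') {
    throw new Error('Input does not look like an XML or JSON feed');
  }
  return parsers[kind](text);
}

// Usage with stand-in parsers that only report which branch ran.
const branch = parseFeed('<?xml version="1.0"?><rss></rss>', {
  xml: () => 'xml parser called',
  json: () => 'json parser called',
});
console.log(branch); // xml parser called
```

The check is deliberately shallow: it only decides which parser to hand the string to and leaves real validation to that parser.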

> Regarding https://www.nature.com/nature.rss, we have no plan to support the RDF format right now, because it is quite rarely used.

Makes sense

> Could you share more info about your code here? This lib does not modify or verify the proxy URL. It simply uses the URL from the proxy option if one is present.

I'm doing something like

const res = await read(
  feed.xmlUrl,
  {},
  {
    proxy: {
      target: 'http://127.0.0.1:3001',
    },
  },
);

Here the target is either the URL I initially shared or any IP/port combination, and in both cases I get back an Invalid URL error.
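
One way to narrow this down is to check which string actually fails to parse. The sketch below uses only the standard URL constructor. It rests on an unconfirmed assumption: that the final request URL is built by appending the encoded feed URL directly to target. The '/?url=' suffix is likewise a hypothetical proxy endpoint, not something stated in this thread.

```typescript
// Returns true when the WHATWG URL parser accepts the string.
function isParsableUrl(candidate: string): boolean {
  try {
    new URL(candidate);
    return true;
  } catch {
    return false;
  }
}

const target = 'http://127.0.0.1:3001';
const feedUrl = 'https://www.nature.com/nature.rss';

// Assumption: target and the encoded feed URL are concatenated as-is.
const concatenated = target + encodeURIComponent(feedUrl);

// Hypothetical proxy endpoint that takes the feed URL as a query parameter.
const withQuery = target + '/?url=' + encodeURIComponent(feedUrl);

console.log(isParsableUrl(target)); // true
console.log(isParsableUrl('http://api_key@proxy.scrapingbee.com:8887')); // true
console.log(isParsableUrl(concatenated)); // false
console.log(isParsableUrl(withQuery)); // true
```

Both proxy addresses parse fine on their own. If the concatenation assumption holds, the encoded feed URL lands right after the port number, which the parser rejects, and that would explain an Invalid URL error for any bare host and port target.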

Also, what would be the lift on supporting RDF feeds? https://rss.slashdot.org/Slashdot/slashdotMain is another big one I'm interested in. Seeing the format quite a bit through my explorations.

@kylealwyn thank you. RDF can reuse almost all of the logic from the RSS parser. I will try to implement a draft.
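
To illustrate why that reuse is plausible: in RDF-based feeds (RSS 1.0), item elements are siblings of channel under the root, while in RSS 2.0 they are nested inside channel. The sketch below reshapes an already-parsed RDF document into the RSS 2.0 layout so that an RSS-oriented normalizer could take over. All type and function names are hypothetical, and this is not the maintainer's implementation.

```typescript
interface FeedItem {
  title: string;
  link: string;
}

interface ChannelInfo {
  title: string;
  link: string;
}

// Assumed parsed form of RSS 1.0: items sit next to the channel.
// Some XML parsers return a single object when only one item exists.
interface RdfDocument {
  channel: ChannelInfo;
  item?: FeedItem | FeedItem[];
}

// Assumed parsed form of RSS 2.0: items live inside the channel.
interface RssDocument {
  channel: ChannelInfo & { item: FeedItem[] };
}

function rdfToRssShape(rdf: RdfDocument): RssDocument {
  // Normalize "no item", "one item" and "many items" to an array.
  const items = rdf.item === undefined ? [] : Array.isArray(rdf.item) ? rdf.item : [rdf.item];
  return { channel: { ...rdf.channel, item: items } };
}

const reshaped = rdfToRssShape({
  channel: { title: 'Example feed', link: 'https://example.com/' },
  item: { title: 'First entry', link: 'https://example.com/1' },
});
console.log(reshaped.channel.item.length); // 1
```

Field-level differences, such as dates carried in Dublin Core elements instead of pubDate, would still need their own handling.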