miniflux / v2

Minimalist and opinionated feed reader

Home Page: https://miniflux.app


Extracting the wrong description tag

KevinCFechtel opened this issue · comments

Hi, it seems that miniflux ignores the actual description of an RSS item if a media description is present.
As an example, here is an item from the NYT feed (https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml):

<item>
  <title>A Popular Israeli Minister’s Meeting in London Sends a Message to Netanyahu</title>
  <link>https://www.nytimes.com/2024/03/07/world/middleeast/netanyahu-cameron-gantz-israel.html</link>
  <guid isPermaLink="true">https://www.nytimes.com/2024/03/07/world/middleeast/netanyahu-cameron-gantz-israel.html</guid>
  <atom:link href="https://www.nytimes.com/2024/03/07/world/middleeast/netanyahu-cameron-gantz-israel.html" rel="standout"></atom:link>
  <description>A meeting between the British foreign secretary, David Cameron, and an Israeli minister, Benny Gantz, carried more weight than usual, analysts said, and stressed the frustration of Israel’s allies.</description>
  <dc:creator>Mark Landler</dc:creator>
  <pubDate>Thu, 07 Mar 2024 08:52:25 +0000</pubDate>
  <category domain="http://www.nytimes.com/namespaces/keywords/des">Israel-Gaza War (2023- )</category>
  <category domain="http://www.nytimes.com/namespaces/keywords/des">International Relations</category>
  <category domain="http://www.nytimes.com/namespaces/keywords/des">United States International Relations</category>
  <category domain="http://www.nytimes.com/namespaces/keywords/des">Politics and Government</category>
  <category domain="http://www.nytimes.com/namespaces/keywords/des">Palestinians</category>
  <category domain="http://www.nytimes.com/namespaces/keywords/nyt_per">Cameron, David</category>
  <category domain="http://www.nytimes.com/namespaces/keywords/nyt_per">Eisenkot, Gadi</category>
  <category domain="http://www.nytimes.com/namespaces/keywords/nyt_per">Gantz, Benny</category>
  <category domain="http://www.nytimes.com/namespaces/keywords/nyt_per">Netanyahu, Benjamin</category>
  <category domain="http://www.nytimes.com/namespaces/keywords/nyt_geo">Israel</category>
  <category domain="http://www.nytimes.com/namespaces/keywords/nyt_geo">United States</category>
  <category domain="http://www.nytimes.com/namespaces/keywords/nyt_geo">Great Britain</category>
  <media:content height="1146" medium="image" url="https://static01.nyt.com/images/2024/03/07/multimedia/07UK-Gantz-jmwq/07UK-Gantz-jmwq-mediumSquareAt3X.jpg" width="1146"></media:content>
  <media:credit>Aaron Chown/Press Association, via Associated Press</media:credit>
  <media:description>Benny Gantz, right, a key member of Israel’s War Cabinet and the top political rival of Israel’s prime minister, and Britain’s foreign secretary, David Cameron, in London on Wednesday.</media:description>
</item>

In this example, miniflux ignores the text in the description tag and extracts only the text in the media:description tag, which merely describes the attached image.
This behavior is reproducible for every item that has a media:description. For items without one, the actual description is extracted correctly.

Regards,
Kevin

After a bit of research, the problem seems to be related to how Go's encoding/xml package parses XML.
The struct tag offers no way to explicitly select an element that has no namespace: a tag without a namespace matches elements with that local name in any namespace.
So both the element without a namespace (correct) and the element in the media namespace (incorrect) match the same field, and whichever appears later in the item overwrites the other.
I found the following thread on Stack Overflow:
https://stackoverflow.com/questions/14145864/dealing-with-namespaces-while-parsing-xml-in-go

The place in the code:

Description string `xml:"description"`
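To illustrate, here is a minimal sketch that reproduces the behavior with a trimmed-down item and shows one possible workaround: collect every element named description together with its xml.Name, then keep only the one whose namespace is empty. The type and function names (naiveItem, descriptionElement, plainDescription, and so on) are made up for this example and are not taken from the miniflux code base, and the sample feed is a shortened stand-in for the NYT item above.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// sample is a shortened, hypothetical stand-in for the NYT item above.
const sample = `<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<item>
<description>Article summary</description>
<media:description>Image caption</media:description>
</item>
</channel>
</rss>`

// naiveItem uses the same tag as the line quoted above. A tag without a
// namespace matches <description> in any namespace, so the later
// <media:description> overwrites the plain one.
type naiveItem struct {
	Description string `xml:"description"`
}

type naiveFeed struct {
	Items []naiveItem `xml:"channel>item"`
}

// descriptionElement records the full element name so the namespace
// can be inspected after decoding.
type descriptionElement struct {
	XMLName xml.Name
	Value   string `xml:",chardata"`
}

type filteredItem struct {
	Descriptions []descriptionElement `xml:"description"`
}

type filteredFeed struct {
	Items []filteredItem `xml:"channel>item"`
}

// naiveDescription returns the first item's description as decoded by
// the namespace-agnostic tag.
func naiveDescription(data string) (string, error) {
	var feed naiveFeed
	if err := xml.Unmarshal([]byte(data), &feed); err != nil {
		return "", err
	}
	if len(feed.Items) == 0 {
		return "", nil
	}
	return feed.Items[0].Description, nil
}

// plainDescription returns the first item's description element that
// has no namespace, skipping media:description and similar elements.
func plainDescription(data string) (string, error) {
	var feed filteredFeed
	if err := xml.Unmarshal([]byte(data), &feed); err != nil {
		return "", err
	}
	if len(feed.Items) == 0 {
		return "", nil
	}
	for _, d := range feed.Items[0].Descriptions {
		if d.XMLName.Space == "" {
			return d.Value, nil
		}
	}
	return "", nil
}

func main() {
	naive, err := naiveDescription(sample)
	if err != nil {
		panic(err)
	}
	plain, err := plainDescription(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("naive:", naive) // naive: Image caption
	fmt.Println("plain:", plain) // plain: Article summary
}
```

Filtering on XMLName.Space after decoding is only one option. Giving media:description its own field with an explicit namespace in the tag does not help on its own, because the plain `xml:"description"` field would still match both elements.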