ageitgey / node-unfluff

Automatically extract body content (and other cool stuff) from an html document

Date isn't always ISO format?

mooniker opened this issue

Is the date attribute supposed to be consistently an ISO date string? Or is it normal for it to sometimes be (an attempt at) a human-readable string, e.g. "August 2, 2017"? Here the date came back fused with the byline:

{
    "title": "What Made the Moon? New Ideas Try to Rescue a Troubled Theory",
    "softTitle": "What Made the Moon? New Ideas Try to Rescue a Troubled Theory",
    "date": "ByRebecca BoyleAugust 2, 2017",
    "author": [
      "Rebecca Boyle"
    ],
    "publisher": "Quanta Magazine",
    "copyright": "2017",
    "description": "Textbooks say that the moon was formed after a Mars-size mass smashed the young Earth. But new evidence has cast doubt on that story, leaving researchers to",
    "keywords": "chemistry,geochemistry,geophysics,physics,planetary science",
    "lang": null,
    "canonicalLink": "https://www.quantamagazine.org/what-made-the-moon-new-ideas-try-to-rescue-a-troubled-theory-20170802/",
    "tags": [
      "planetary science",
      "chemistry",
      "geochemistry",
      "geophysics",
      "physics"
    ],
    "image": "https://d2r55xnwy6nx47.cloudfront.net/uploads/2017/08/Synestia_520x2921.jpg",
    "videos": [],
    "links": [
      {
        "text": "published a paper on the physics of synestias",
        "href": "http://onlinelibrary.wiley.com/doi/10.1002/2016JE005239/full"
      },
      {
        "text": "captured asteroids",
        "href": "http://aasnova.org/2016/09/23/explaining-the-birth-of-the-martian-moons/"
      },
      {
        "text": "others argue formed from Martian impacts",
        "href": "http://onlinelibrary.wiley.com/doi/10.1002/2017GL074002/abstract"
      },
      {
        "text": "she argued that Earth’s moon is not the original moon",
        "href": "http://www.nature.com/ngeo/journal/v10/n2/abs/ngeo2866.html"
      },
      {
        "text": "Robin Canup",
        "href": "https://www.boulder.swri.edu/~robin/"
      }
    ],
    "text": "Conditions in this structure are indescribably hellish; there is no surface, but instead clouds of molten rock, with every region of the cloud forming molten-rock raindrops. The moon grew inside this vapor, Lock said, before the vapor eventually cooled and left in its wake the Earth-moon system.\n\nGiven the structure’s unusual characteristics, Lock and Stewart thought it deserved a new name. They tried several versions before coining synestia, which uses the Greek prefix syn-, meaning together, and the goddess Hestia, who represents the home, hearth and architecture. The word means “connected structure,” Stewart said.\n\n“These bodies aren’t what you think they are. They don’t look like what you thought they did,” she said.\n\nIn May, Lock and Stewart published a paper on the physics of synestias; their paper arguing for a synestia lunar origin is still in review. They presented the work at planetary science conferences in the winter and spring and say their fellow researchers were intrigued but hardly sold on the idea. That may be because synestias are still just an idea; unlike ringed planets, which are common in our solar system, and protoplanetary disks, which are common across the universe, no one has ever seen one.\n\n“But this is certainly an interesting pathway that could explain the features of our moon and get us over this hump that we’re in, where we have this model that doesn’t seem to work,” Lock said.\n\nAmong natural satellites in the solar system, Earth’s moon may be most striking for its solitude. Mercury and Venus lack natural satellites, in part because of their nearness to the sun, whose gravitational interactions would make their moons’ orbits unstable. Mars has tiny Phobos and Deimos, which some argue are captured asteroids and others argue formed from Martian impacts. 
And the gas giants are chockablock with moons, some rocky, some watery, some both.\n\nIn contrast to these moons, Earth’s satellite also stands out for its size and the physical burden it carries. The moon is about 1 percent the mass of Earth, while the combined mass of the outer planets’ satellites is less than one-tenth of 1 percent of their parents. Even more important, the moon contains 80 percent of the angular momentum of the Earth-moon system. That is to say, the moon is responsible for 80 percent of the motion of the system as a whole. For the outer planets, this value is less than 1 percent.\n\nThe moon may not have carried all this weight the whole time, however. The face of the moon bears witness to its lifelong bombardment; why should we assume that just one rock was responsible for carving it out of Earth? It’s possible that multiple impacts made the moon, said Raluca Rufu, a planetary scientist at the Weizmann Institute of Science in Rehovot, Israel.\n\nIn a paper published last winter, she argued that Earth’s moon is not the original moon. It is instead a compendium of creation by a thousand cuts — or at the very least, a dozen, according to her simulations. Projectiles coming in from multiple angles and at multiple speeds would hit Earth and form disks, which coalesce into “moonlets,” essentially crumbs that are smaller than Earth’s current moon. Interactions between moonlets of different ages cause them to merge, eventually forming the moon we know today.\n\nPlanetary scientists were receptive when her paper was published last year; Robin Canup, a lunar scientist at the Southwest Research Institute and a dean of moon-formation theories, said it was worth considering. More testing remains, however. Rufu is not sure whether the moonlets would have been locked in their orbital positions, similar to how Earth’s moon constantly faces the same direction; if so, she is not sure how they could have merged. 
“That’s what we are trying to figure out next,” Rufu said.\n\nMeanwhile, others have turned to another explanation for the similarity of Earth and the moon, one that might have a very simple answer. From synestias to moonlets, new physical models — and new physics — may be moot. It’s possible that the moon looks just like Earth because Theia did, too.",
  }

The date can come from many different sources (both machine-readable meta tags and human-written text):

dateCandidates = doc("meta[property='article:published_time'], \
meta[itemprop*='datePublished'], meta[name='dcterms.modified'], \
meta[name='dcterms.date'], \
meta[name='DC.date.issued'], meta[name='dc.date.issued'], \
meta[name='dc.date.modified'], meta[name='dc.date.created'], \
meta[name='DC.date'], \
meta[name='DC.Date'], \
meta[name='dc.date'], \
meta[name='date'], \
time[itemprop*='pubDate'], \
time[itemprop*='pubdate'], \
span[itemprop*='datePublished'], \
span[property*='datePublished'], \
p[itemprop*='datePublished'], \
p[property*='datePublished'], \
div[itemprop*='datePublished'], \
div[property*='datePublished'], \
li[itemprop*='datePublished'], \
li[property*='datePublished'], \
time, \
span[class*='date'], \
p[class*='date'], \
div[class*='date']")

Because of that, there's no guarantee of what format the date will be in.
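
If you need a consistent format on your side, one option is to normalize the value after extraction. Here is a minimal sketch of that idea, not something node-unfluff does itself: `normalizeDate` and `toIsoDate` are hypothetical helper names, and the sketch assumes only two shapes are worth handling, a value that starts with an ISO 8601 date and an English "Month D, YYYY" phrase buried in scraped text. Anything else comes back as null so the caller can decide what to do.

```javascript
// Hypothetical helpers, not part of node-unfluff.
const MONTHS = [
  'january', 'february', 'march', 'april', 'may', 'june',
  'july', 'august', 'september', 'october', 'november', 'december'
];

// Matches "August 2, 2017" anywhere in the string, even when it is glued to
// other text (no word boundary is required before the month name).
const HUMAN_DATE = new RegExp(
  '(' + MONTHS.join('|') + ')\\s+(\\d{1,2}),?\\s+(\\d{4})',
  'i'
);

// Build a YYYY-MM-DD string, or return null for impossible dates such as
// February 30 (Date.UTC would silently roll those over into March).
function toIsoDate(year, month, day) {
  const d = new Date(Date.UTC(year, month - 1, day));
  const valid =
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day;
  return valid ? d.toISOString().slice(0, 10) : null;
}

// Turn the free-form `date` field into a calendar date, or null.
function normalizeDate(raw) {
  if (typeof raw !== 'string') return null;

  // Case 1: the value already starts with an ISO 8601 date, which is what
  // the meta tag selectors usually yield. The time and offset are ignored.
  const iso = raw.trim().match(/^(\d{4})-(\d{2})-(\d{2})(?:[T ]|$)/);
  if (iso) {
    return toIsoDate(Number(iso[1]), Number(iso[2]), Number(iso[3]));
  }

  // Case 2: a human-readable date picked up from an HTML element.
  const human = raw.match(HUMAN_DATE);
  if (human) {
    const month = MONTHS.indexOf(human[1].toLowerCase()) + 1;
    return toIsoDate(Number(human[3]), month, Number(human[2]));
  }

  return null;
}

console.log(normalizeDate('ByRebecca BoyleAugust 2, 2017'));
```

The month-name match is a heuristic: because it does not require a word boundary, it can recover the date from a string like the one in the output above, but it could also misfire on a word that happens to end in a month name. It deliberately avoids passing arbitrary text to `Date.parse`, whose handling of non-ISO strings varies between JavaScript engines.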

Very good to know. It looks like a lot of websites (at least the ones I've worked with so far) provide the date in ISO format in their meta tags, while the later fallbacks in that list pull it out of ordinary HTML elements (which must be what happened in the case above).

Thanks for pointing that out so I can take that into account.

Sure, no problem :)