microlinkhq / metascraper

Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.

Home Page: https://metascraper.js.org

Issues with some article pages on nytimes.com

DePasqualeOrg opened this issue

  • [✔️] I'm using the latest version.

Metascraper is not returning the relevant title, image, or description for some pages on nytimes.com. The respective meta tags are in fact present on these pages. Perhaps there's a subscription reminder popup or something that is interfering with the scraping?

A few examples picked at random:

Doesn't work for these pages:
https://www.nytimes.com/2022/06/15/business/romania-energy-nuclear-power-natural-gas.html
https://www.nytimes.com/2022/06/15/business/lego-first-factory-united-states.html

{
  title: 'nytimes.com',
  image: null,
  description: null,
}

Works for this page:
https://www.nytimes.com/interactive/2022/06/14/climate/congo-rainforest-logging.html

{
  title: 'Raft by Raft, a Rainforest Loses Its Trees',
  image: 'https://static01.nyt.com/images/2022/06/16/climate/16cli-congoriver-promo2/16cli-congoriver-promo2-facebookJumbo.jpg',
  description: 'The Congo River Basin rainforest, vital in the fight against climate change, has long been protected in part by its remoteness. But the river acts as a highway for sprawling flotillas of logs, sent downstream by tiny villages and international lumber companies alike, all seeking profit from a vulner…',
}

Hello,

I think the part that is failing is fetching the HTML from the target URLs, which isn't related to metascraper itself.

I say that because all of these URLs work fine with api.microlink.io, which is essentially a hosted version of metascraper.

To improve the HTML-fetching step, take a look at html-get 🙂

Ah, thanks. I should have checked the results of the fetch request first. It was hitting a captcha page. I was able to get the HTML using puppeteer:

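import puppeteer from 'puppeteer';

// Launch one browser and reuse a single page for every request.
const browser = await puppeteer.launch();
const page = await browser.newPage();

// Load the URL in the browser and return the page's HTML.
const getHTML = async (url) => {
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  return page.content();
};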
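
For anyone debugging the same symptom, a quick way to tell whether the fetch step or metascraper is at fault is to check the raw HTML for the expected meta tags before parsing it. The snippet below is only a minimal sketch: `findMetaTags` is a hypothetical helper (not part of metascraper or html-get), the regex check is a rough heuristic rather than a real HTML parser, and both HTML strings are made-up stand-ins for an article page and a captcha page.

```javascript
// Hypothetical helper, not part of metascraper: list which of the
// expected Open Graph tags appear in a raw HTML string.
const EXPECTED_TAGS = ['og:title', 'og:image', 'og:description'];

const findMetaTags = (html) =>
  EXPECTED_TAGS.filter((name) => {
    // Rough heuristic: look for <meta ... property="og:..."> or name="og:...".
    const pattern = new RegExp(`<meta[^>]+(?:property|name)=["']${name}["']`, 'i');
    return pattern.test(html);
  });

// Illustrative input: an article page that carries Open Graph tags.
const articleHtml = `<html><head>
  <meta property="og:title" content="Example article title">
  <meta property="og:image" content="https://example.com/promo.jpg">
  <meta property="og:description" content="Example description">
</head><body></body></html>`;

// Illustrative input: a captcha-style page with no article metadata.
const captchaHtml = '<html><head><title>nytimes.com</title></head><body></body></html>';

console.log(findMetaTags(articleHtml)); // all three tags are found
console.log(findMetaTags(captchaHtml)); // empty array: the fetch step is the likely problem
```

If the list comes back empty for a page that shows the tags in a normal browser, the HTML being passed along is probably a block or captcha page, and the fix belongs in the fetching step (for example the puppeteer approach above) rather than in metascraper.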