Issues with some article pages on nytimes.com

Question

DePasqualeOrg opened this issue 2 years ago

[✔️] I'm using the latest version.

Metascraper is not returning the relevant title, image, or description for some pages on nytimes.com. The respective meta tags are in fact present on these pages. Perhaps there's a subscription reminder popup or something that is interfering with the scraping?

A few examples picked at random:

Doesn't work for these pages:
https://www.nytimes.com/2022/06/15/business/romania-energy-nuclear-power-natural-gas.html
https://www.nytimes.com/2022/06/15/business/lego-first-factory-united-states.html

{
  title: 'nytimes.com',
  image: null,
  description: null,
}

Works for this page:
https://www.nytimes.com/interactive/2022/06/14/climate/congo-rainforest-logging.html

{
  title: 'Raft by Raft, a Rainforest Loses Its Trees',
  image: 'https://static01.nyt.com/images/2022/06/16/climate/16cli-congoriver-promo2/16cli-congoriver-promo2-facebookJumbo.jpg',
  description: 'The Congo River Basin rainforest, vital in the fight against climate change, has long been protected in part by its remoteness. But the river acts as a highway for sprawling flotillas of logs, sent downstream by tiny villages and international lumber companies alike, all seeking profit from a vulner…',
}

Kiko Beats · Answer 1 · Thu Jun 16 2022 00:26:45 GMT+0800 (China Standard Time)

Hello,

I think what's failing for you is fetching the HTML from the target URLs, and that isn't related to metascraper.

I say that because all of these URLs work fine with api.microlink.io, which is essentially a hosted version of metascraper.

To improve your HTML-fetching step, take a look at html-get 🙂

Anthony · Answer 2 · Thu Jun 16 2022 02:25:43 GMT+0800 (China Standard Time)

Ah, thanks. I should have checked the results of the fetch request first. It was hitting a captcha page. I was able to get the HTML using puppeteer:

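import puppeteer from 'puppeteer';

// Launch one headless browser and reuse a single page for every request.
const browser = await puppeteer.launch();
const page = await browser.newPage();

// Load the URL in the browser and return the resulting HTML.
const getHTML = async (url) => {
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  return page.content();
};

// Call `await browser.close()` once all pages have been fetched.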
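
For anyone who wants to check the fetched HTML before handing it to metascraper, here is a minimal sketch of a sanity check. The helper name `findOpenGraphTags` is hypothetical and not part of metascraper or puppeteer. It assumes the page exposes its metadata through `og:title`, `og:image`, and `og:description` meta tags, and it uses a simple regular expression rather than a real HTML parser, so treat it as a rough diagnostic only.

```javascript
// Hypothetical helper: reports which Open Graph tags appear in an HTML string.
// A captcha or block page will typically contain none of them.
const findOpenGraphTags = (html) => {
  const found = {};
  for (const name of ['og:title', 'og:image', 'og:description']) {
    // Match <meta ... property="og:title" ...> with either quote style.
    const pattern = new RegExp(`<meta[^>]+property=["']${name}["']`, 'i');
    found[name] = pattern.test(html);
  }
  return found;
};
```

Running it on the output of `getHTML(url)` and logging the result should show whether the HTML you received actually contains the tags. If every value is `false`, the problem is likely in the fetch step rather than in the scraping.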