matthewmueller / x-ray

The next web scraper. See through the <html> noise.

Can't scrape Instagram.com?

mileung opened this issue · comments

Subject of the issue

Using Chrome's selector to scrape data on Instagram's website yields nothing.

Your environment

  • version of node: v9.11.2
  • version of npm: 5.6.0

Steps to reproduce

Copy the selector of an element in Chrome's dev tools.

const Xray = require('x-ray');
const x = Xray();

function getInstagramFollowers(username) {
  let url = `https://www.instagram.com/${username}/`;
  let selector = '#react-root > section > main > div > ul > li:nth-child(2) > span > span';

  x(url, selector)((err, count) => {
    console.log('COUNT', count);
  });
}

getInstagramFollowers('facebook')

Expected behaviour

Get the number of followers back

Actual behaviour

count is an empty string

If you set the selector to just 'html', you get back what appears to be the raw JS served before React renders.

I believe the selector isn't exactly a CSS selector. I wasn't able to get :nth-child(2) or similar selectors working, but there is additional functionality like a@href for selecting attributes. You can also nest selectors, assuming this is an x-ray issue and not an anti-scraping issue.

Which CSS selectors can't I use? Is > OK?

@mileung instagram.com is a client-side React app, which means the HTML isn't present in the response that comes from the server; instead, the HTML is constructed with JavaScript. If you take a look at the HTML source of the site, you'll see the only plain HTML inside the <body /> is <span id="react-root"></span>.

This means you'll need to use a driver that understands JavaScript -- you could try x-ray-phantom, which requires PhantomJS to be installed on your computer/server.

However the data you're looking for (follower count) is still available elsewhere in the HTML source of the page.

For example, in the header you'll find:

<meta property="og:description" content="3m Followers, 9 Following, 315 Posts - See Instagram photos and videos from Facebook (@facebook)" />

Which could be selected and parsed.

However an even better source of data would be the <script> tag in the <body> which contains the initial data object used by their React app to render. We can use x-ray or just about anything to reach in and grab that data.

const XRay = require('x-ray');
const x = XRay();

const url = 'https://instagram.com/facebook';

x(url, 'body script@html').then(res => {
  // First strip the variable declaration
  res = res.replace('window._sharedData = ', '');

  // Next strip the trailing semicolon, as that's not valid JSON
  res = res.replace(/;$/, '');

  // Now we parse the string as JSON
  const data = JSON.parse(res);

  // Then we deeply select the user object from the data
  const user = data.entry_data.ProfilePage[0].graphql.user;

  // And log just the follower count --
  // though there's heaps of useful data in the user object
  console.log(user.edge_followed_by.count);
});

You can see a working online example here: https://repl.it/@levibuzolic/x-ray-instagram-followers

This whole approach is of course pretty brittle: like scraping of any website, it relies on Instagram not changing their HTML or JS data structure in order to keep working. Instagram has an API you could just use, or there are third-party sites/tools that will get this data for you and take on the burden of keeping their service working.

@levibuzolic How do you detect if a site is a client-side React app?

@mileung while it's not specific to React, you should be able to tell you're dealing with a client-side app by looking at the difference between the HTML that comes back in the request (view source) and the HTML that's present after JS has run (inspect element).