dart-lang / html

Dart port of html5lib. For parsing HTML/HTML5 with Dart. Works in the client and on the server.

Home Page: https://pub.dev/packages/html

getElementsByTagName doesn't work on some sites

sgehrman opened this issue

I'm scraping websites to get the title, description, and other metadata, but it doesn't work on all sites.

For example:
https://www.youtube.com/watch?v=3AIZAGwMRg8

final List elements = document.head.getElementsByTagName('title');

elements is an empty list: []

But other sites work just fine, like https://apple.com

I'm also using:

  final List<Element> metas = document.head.getElementsByTagName('meta');

On that site, I'm also not seeing all of the meta tags.

It won't work because those tags are rendered by JavaScript, which this library does not run.

Disable JavaScript in your browser before loading a page, and you can see what can be scraped and what cannot.

I installed a Chrome extension to do this (https://chrome.google.com/webstore/detail/toggle-javascript/cidlcjdalomndpeagkjpnefhljffbnlo), but you can also do it by pressing F12 to open the console and then pressing `Ctrl + Shift + P` to open the command line. Then just type javascript and the option will show up for you.
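The same effect can be reproduced without a browser. The sketch below is only an illustration: `scriptOnlyPage` and `countTags` are hypothetical names, and the markup is a made-up stand-in for a site that sets its title from a script. It parses the raw string with package:html and counts the tags that exist in the static markup.

```dart
import 'package:html/parser.dart' show parse;

// Hypothetical page, standing in for a site that sets its title in a script.
const String scriptOnlyPage = r'''
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <script>document.title = "Set by script";</script>
  </head>
  <body></body>
</html>
''';

/// Counts the elements named [tag] in the static markup of [html].
/// The parser never executes scripts, so only tags present in the
/// source text are found.
int countTags(String html, String tag) =>
    parse(html).getElementsByTagName(tag).length;

void main() {
  print(countTags(scriptOnlyPage, 'title')); // 0: the script was not run
  print(countTags(scriptOnlyPage, 'meta')); // 1: present in the raw HTML
}
```

If a tag is missing here, it is also missing from what the scraper sees, whatever a browser shows after running the scripts.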

If you NEED JavaScript, I recommend running a library like Puppeteer first and then parsing the post-rendered HTML.

YouTube also has an API you can use instead of scraping their site. See if that fits your needs.

If you set the User-Agent to that of a bot when retrieving the document, it will return all of the tags.
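A rough sketch of that suggestion, assuming the page is fetched with dart:io (so it runs on the server or command line, not in a browser). `fetchHtml` and `crawlerUserAgent` are hypothetical names, and the User-Agent value is an assumption: it is Googlebot's published string, picked only as an example of a bot. Whether a given site actually returns more tags for it is not guaranteed.

```dart
import 'dart:convert';
import 'dart:io';

import 'package:html/parser.dart' show parse;

// Assumption: Googlebot's published User-Agent string as the "bot" value.
const String crawlerUserAgent =
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

/// Downloads [url] as text, sending [userAgent] in the User-Agent header.
Future<String> fetchHtml(Uri url,
    {String userAgent = crawlerUserAgent}) async {
  final client = HttpClient();
  try {
    final request = await client.getUrl(url);
    // Replace the default Dart User-Agent with the one supplied.
    request.headers.set(HttpHeaders.userAgentHeader, userAgent);
    final response = await request.close();
    return await response.transform(utf8.decoder).join();
  } finally {
    client.close();
  }
}

Future<void> main() async {
  final html = await fetchHtml(
      Uri.parse('https://www.youtube.com/watch?v=3AIZAGwMRg8'));
  final document = parse(html);
  // Compare these counts with a run that uses the default User-Agent.
  print(document.getElementsByTagName('title').length);
  print(document.getElementsByTagName('meta').length);
}
```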