getElementsByTagName doesn't work on some sites
sgehrman opened this issue · comments
I'm scraping a website to get the title, description, and other metadata, but it's not working on all sites.
For example:
https://www.youtube.com/watch?v=3AIZAGwMRg8
final List elements = document.head.getElementsByTagName('title');
`elements` comes back as `[]`.
But other sites work just fine, like https://apple.com
I'm also using:
final List<Element> metas = document.head.getElementsByTagName('meta');
And on that site, I'm not seeing all the meta tags.
It won't work because all of that is rendered through JavaScript, which this library does not run.
Disable JavaScript before loading a page, and then you can see what can be scraped and what cannot.
I installed a Chrome extension to do this (https://chrome.google.com/webstore/detail/toggle-javascript/cidlcjdalomndpeagkjpnefhljffbnlo), but you can also do it by pressing F12 to open the console and then pressing `Ctrl + Shift + P` to open the command menu. Then just type "javascript" and the option will show up for you.
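To illustrate the point about JavaScript, here is a minimal sketch that parses two hand-written HTML strings with `package:html`. It assumes a null-safe version of the package (where `document.head` is nullable); `countTitles` and both sample strings are made up for this example. The parser only sees what is in the markup, so a title that a script would set at runtime never shows up.

```dart
import 'package:html/dom.dart';
import 'package:html/parser.dart' as html_parser;

/// Counts the <title> elements found in the <head> of [html].
/// Hypothetical helper, not part of package:html.
int countTitles(String html) {
  final Document document = html_parser.parse(html);
  final List<Element> titles =
      document.head?.getElementsByTagName('title') ?? <Element>[];
  return titles.length;
}

void main() {
  // The title here would only exist after a browser runs the script.
  const scriptOnly =
      '<html><head><script>document.title = "Set by JS";</script></head>'
      '<body></body></html>';

  // The title here is part of the static markup.
  const staticTitle =
      '<html><head><title>Static title</title></head><body></body></html>';

  print(countTitles(scriptOnly)); // 0: the script is never executed
  print(countTitles(staticTitle)); // 1
}
```

If the first case matches what you see for a real page, the data you want is probably being added by scripts after the page loads.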
If you NEED JavaScript, I recommend running a library like Puppeteer first and then parsing the post-rendered HTML.
YouTube also has an API you can tap into instead of scraping their site. See if that can fit your needs.
If you set the User-Agent to a bot when retrieving the document, then it will return all of the tags.
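A rough sketch of the User-Agent suggestion, assuming `package:http` for the request and a null-safe `package:html` for parsing. The `botUserAgent` string is a placeholder I made up, and `extractMetadata` and `fetchMetadata` are hypothetical helpers. Nothing in this thread says which bot User-Agent values a given site accepts, so you may need to experiment.

```dart
import 'package:html/dom.dart';
import 'package:html/parser.dart' as html_parser;
import 'package:http/http.dart' as http;

// Placeholder bot-style User-Agent; an assumption, not a known-good value.
const botUserAgent = 'Mozilla/5.0 (compatible; ExampleBot/1.0)';

/// Collects the <title> text and every <meta> tag that has a `name` or
/// `property` attribute along with a `content` attribute.
Map<String, String> extractMetadata(String html) {
  final Document document = html_parser.parse(html);
  final result = <String, String>{};

  final titleText = (document.querySelector('title')?.text ?? '').trim();
  if (titleText.isNotEmpty) {
    result['title'] = titleText;
  }

  for (final Element meta in document.querySelectorAll('meta')) {
    final key = meta.attributes['name'] ?? meta.attributes['property'];
    final content = meta.attributes['content'];
    if (key != null && content != null) {
      result[key] = content;
    }
  }
  return result;
}

/// Fetches [url] with the bot User-Agent header and extracts its metadata.
Future<Map<String, String>> fetchMetadata(String url) async {
  final response = await http.get(
    Uri.parse(url),
    headers: {'User-Agent': botUserAgent},
  );
  if (response.statusCode != 200) {
    throw Exception('Request failed with status ${response.statusCode}');
  }
  return extractMetadata(response.body);
}

Future<void> main() async {
  final metadata =
      await fetchMetadata('https://www.youtube.com/watch?v=3AIZAGwMRg8');
  metadata.forEach((key, value) => print('$key: $value'));
}
```

Keeping the parsing in `extractMetadata` separate from the network call makes it easy to check the extraction against a saved copy of the HTML before blaming the request headers.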