Web Scraping With Playwright
This article discusses everything you need to know about news scraping, including the benefits and use cases of news scraping as well as how you can use Python to create an article scraper.
For a detailed explanation, see our blog post.
Support for proxies in Playwright
Without Proxy.js
// Node.js
const { chromium } = require('playwright'); "
const browser = await chromium.launch();
# Python
from playwright.async_api import async_playwright
import asyncio
with async_playwright() as p:
browser = await p.chromium.launch()
With Proxy
// Node.js
const launchOptions = {
proxy: {
server: 123.123.123.123:80'
},
headless: false
}
const browser = await chromium.launch(launchOptions);
# Python
proxy_to_use = {
'server': '123.123.123.123:80'
}
browser = await p.chromium.launch(proxy=proxy_to_use, headless=False)
Basic scraping with Playwright
Node.Js
npm init -y
npm install playwright
const playwright = require('playwright');
(async () => {
const browser = await playwright.chromium.launch({
headless: false // Show the browser.
});
const page = await browser.newPage();
await page.goto('https://books.toscrape.com/');
await page.waitForTimeout(1000); // wait for 1 seconds
await browser.close();
})();
Python
pip install playwright
from playwright.async_api import async_playwright
import asyncio
async def main():
async with async_playwright() as pw:
browser = await pw.chromium.launch(
headless=False # Show the browser
)
page = await browser.new_page()
await page.goto('https://books.toscrape.com/')
# Data Extraction Code Here
await page.wait_for_timeout(1000) # Wait for 1 second
await browser.close()
if __name__ == '__main__':
asyncio.run(main())
Web Scraping
Node.JS
const playwright = require('playwright');
(async () => {
const browser = await playwright.chromium.launch();
const page = await browser.newPage();
await page.goto('https://books.toscrape.com/');
const books = await page.$$eval('.product_pod', all_items => {
const data = [];
all_items.forEach(book => {
const name = book.querySelector('h3').innerText;
const price = book.querySelector('.price_color').innerText;
const stock = book.querySelector('.availability').innerText;
data.push({ name, price, stock});
});
return data;
});
console.log(books);
await browser.close();
})();
Python
from playwright.async_api import async_playwright
import asyncio
async def main():
async with async_playwright() as pw:
browser = await pw.chromium.launch()
page = await browser.new_page()
await page.goto('https://books.toscrape.com')
all_items = await page.query_selector_all('.product_pod')
books = []
for item in all_items:
book = {}
name_el = await item.query_selector('h3')
book['name'] = await name_el.inner_text()
price_el = await item.query_selector('.price_color')
book['price'] = await price_el.inner_text()
stock_el = await item.query_selector('.availability')
book['stock'] = await stock_el.inner_text()
books.append(book)
print(books)
await browser.close()
if __name__ == '__main__':
asyncio.run(main())
If you wish to find out more about Web Scraping With Playwright, see our blog post.