michelssousa / playwright-web-scraping

A tutorial for web scraping using Playwright headless browser

Web Scraping With Playwright

This tutorial covers web scraping with the Playwright headless browser, including how to configure proxies and how to write a basic scraper in both Node.js and Python.

For a detailed explanation, see our blog post.

Support for proxies in Playwright

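Without Proxy

// Node.js

const { chromium } = require('playwright');
// Run inside an async function (CommonJS has no top-level await)
const browser = await chromium.launch();

# Python

from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        await browser.close()

asyncio.run(main())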

With Proxy

// Node.js
const launchOptions = {
    proxy: {
        server: '123.123.123.123:80'
    },
    headless: false
}
const browser = await chromium.launch(launchOptions);

# Python
proxy_to_use = {
    'server': '123.123.123.123:80'
}
browser = await p.chromium.launch(proxy=proxy_to_use, headless=False)
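
Playwright's proxy setting also accepts optional `username` and `password` keys for proxies that require authentication. The following is a minimal sketch of a helper that assembles that dictionary. The helper name `build_proxy_settings` is hypothetical and not part of the original tutorial; only the `server` value comes from the example above.

```python
from typing import Optional


def build_proxy_settings(server: str,
                         username: Optional[str] = None,
                         password: Optional[str] = None) -> dict:
    """Return a dict intended for the `proxy` argument of chromium.launch().

    Hypothetical helper: only the keys that are actually set are included.
    """
    if not server:
        raise ValueError("proxy server must not be empty")
    settings = {"server": server}
    if username is not None:
        settings["username"] = username
    if password is not None:
        settings["password"] = password
    return settings


# Same proxy as in the example above, without credentials
proxy_to_use = build_proxy_settings("123.123.123.123:80")
print(proxy_to_use)
```

The resulting dictionary can be passed as `proxy=proxy_to_use` to `p.chromium.launch()`, exactly like the hand-written one.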

Basic scraping with Playwright

Node.js

npm init -y
npm install playwright

const playwright = require('playwright');
(async () => {
    const browser = await playwright.chromium.launch({
        headless: false // Show the browser. 
    });

    const page = await browser.newPage();
    await page.goto('https://books.toscrape.com/');
    await page.waitForTimeout(1000); // Wait for 1 second
    await browser.close();
})();

Python

pip install playwright
playwright install

from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as pw: 
        browser = await pw.chromium.launch(
            headless=False  # Show the browser
        )
        page = await browser.new_page()
        await page.goto('https://books.toscrape.com/')
        # Data Extraction Code Here
        await page.wait_for_timeout(1000)  # Wait for 1 second
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())

Web Scraping

Node.js

const playwright = require('playwright');

(async () => {
    const browser = await playwright.chromium.launch();
    const page = await browser.newPage();
    await page.goto('https://books.toscrape.com/');
    const books = await page.$$eval('.product_pod', all_items => {
        const data = [];
        all_items.forEach(book => {
            const name = book.querySelector('h3').innerText;
            const price = book.querySelector('.price_color').innerText;
            const stock = book.querySelector('.availability').innerText;
            data.push({ name, price, stock});
        });
        return data;
    });
    console.log(books);
    await browser.close();
})();

Python

from playwright.async_api import async_playwright
import asyncio


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://books.toscrape.com')

        all_items = await page.query_selector_all('.product_pod')
        books = []
        for item in all_items:
            book = {}
            name_el = await item.query_selector('h3')
            book['name'] = await name_el.inner_text()
            price_el = await item.query_selector('.price_color')
            book['price'] = await price_el.inner_text()
            stock_el = await item.query_selector('.availability')
            book['stock'] = await stock_el.inner_text()
            books.append(book)
        print(books)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())
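
The values collected above are raw strings. The following is a sketch of one possible post-processing step. It assumes that prices look like `£51.77` and that the availability text contains `In stock`. The function name `clean_book` and the sample record are illustrative only and are not output from an actual run.

```python
import re


def clean_book(book: dict) -> dict:
    """Normalize one scraped record (hypothetical helper)."""
    # Extract the first decimal number from the price text, e.g. '£51.77' -> 51.77
    match = re.search(r"\d+(?:\.\d+)?", book["price"])
    price = float(match.group()) if match else None
    return {
        "name": book["name"].strip(),
        "price": price,
        # inner_text() may include surrounding whitespace, so match loosely
        "in_stock": "in stock" in book["stock"].lower(),
    }


# Made-up record in the same shape as the dictionaries built by the scraper
sample = {"name": " Example Book ", "price": "£51.77", "stock": "\n  In stock\n"}
print(clean_book(sample))
```

Applying it to the whole result is a one-liner such as `[clean_book(b) for b in books]`.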

If you wish to find out more about Web Scraping With Playwright, see our blog post.
