This library returns the HTML of a Single Page Application (SPA) after the page has loaded. This HTML can then be passed to BeautifulSoup for parsing.
sudo apt-get install -y nodejs npm
npm i puppeteer
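The traditional method of web scraping in Python with requests and BeautifulSoup isn't effective for modern pages and SPAs, because requests only retrieves the initial HTML and never executes the page's JavaScript. This library dynamically generates a JavaScript file that uses puppeteer to fully load the page and return the HTML that is dynamically generated in the Document Object Model (DOM).
The primary function, get_soup, accepts a full URL (including the scheme, e.g. http://) as a string and returns the rendered page's HTML as a string. Despite its name, it does not return a BeautifulSoup object; pass the result to BeautifulSoup yourself.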
Typical Workflow (requests/BeautifulSoup):
import requests
from bs4 import BeautifulSoup
res = requests.get('http://example.com')
soup = BeautifulSoup(res.text, 'html.parser')
New Workflow with javasoup:
from bs4 import BeautifulSoup
from javasoup import get_soup
soup = BeautifulSoup(get_soup('http://example.com'), 'html.parser')
When called, get_soup:
- Creates the necessary JavaScript file in the current working directory (you MUST have write privileges there)
- Runs that JavaScript file using the URL provided and stores the value returned
- Deletes the temporary JavaScript file
- Returns the HTML content
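The four steps above can be sketched roughly as follows. This is a minimal illustration and not javasoup's actual source: the names JS_TEMPLATE, build_script, run_script, and get_rendered_html are hypothetical, the puppeteer script is an assumed template, and the `networkidle0` wait condition is a choice made for this example. It assumes `node` is on the PATH and puppeteer is installed in the current working directory.

```python
import json
import os
import subprocess
import tempfile

# Assumed puppeteer script; the template javasoup really generates may differ.
JS_TEMPLATE = """\
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(%s, { waitUntil: 'networkidle0' });
  process.stdout.write(await page.content());
  await browser.close();
})();
"""


def build_script(url):
    """Return JavaScript source that prints the rendered HTML of `url`."""
    # json.dumps produces a safely quoted JavaScript string literal.
    return JS_TEMPLATE % json.dumps(url)


def run_script(source, runner=("node",)):
    """Write `source` to a temporary .js file, run it, and return its stdout."""
    # Step 1: create the file in the current working directory, so that
    # require('puppeteer') resolves against the local node_modules folder.
    fd, path = tempfile.mkstemp(suffix=".js", dir=os.getcwd())
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(source)
        # Step 2: run the file and capture whatever it prints.
        result = subprocess.run(
            [*runner, path], capture_output=True, text=True, check=True
        )
        # Step 4: hand the captured HTML back to the caller.
        return result.stdout
    finally:
        # Step 3: delete the temporary file, even if the run failed.
        os.remove(path)


def get_rendered_html(url):
    """Hypothetical stand-in for get_soup: URL in, rendered HTML string out."""
    return run_script(build_script(url))
```

Calling get_rendered_html('http://example.com') would then return the rendered HTML as a string, ready to pass to BeautifulSoup. Deleting the file in a `finally` block keeps the working directory clean even when node exits with an error.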
pip install javasoup