andrealenzi11 / py-web-miner

An extensible web miner for extracting information from web pages. It is built on the Requests HTTP library, the Beautiful Soup parser, and Selenium WebDriver.

Usage Example

Use the Selenium WebDriver scraper. The chromedriver or geckodriver executable must be in a folder listed in the PATH environment variable (e.g. /usr/local/bin/):

from py_web_miner.scraping import SeleniumScraper

scraper_obj = SeleniumScraper(
    random_user_agent_flag=True,
    random_screen_resolution_flag=True,
    browser="chrome",  # "chrome" / "firefox"
    bs4_parser="html.parser",
    proxy=None
)
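Before constructing the scraper, it can help to confirm that the driver executable is actually reachable through PATH. The following is a minimal sketch using only the Python standard library; it is not part of py-web-miner, and the names `DRIVER_FOR_BROWSER` and `find_driver` are hypothetical helpers introduced here for illustration.

```python
import shutil

# Executable names conventionally used for each browser's WebDriver.
# This mapping is an illustrative assumption, not taken from py-web-miner.
DRIVER_FOR_BROWSER = {"chrome": "chromedriver", "firefox": "geckodriver"}


def find_driver(browser):
    """Return the full path of the driver for `browser`, or None if it is not on PATH."""
    driver_name = DRIVER_FOR_BROWSER[browser]
    # shutil.which searches the folders listed in the PATH environment variable.
    return shutil.which(driver_name)


if __name__ == "__main__":
    for browser in DRIVER_FOR_BROWSER:
        path = find_driver(browser)
        print(f"{browser}: {path or 'driver not found on PATH'}")
```

If the function returns None for the browser you plan to use, the driver needs to be installed or its folder added to PATH before the Selenium scraper is started.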

Alternatively, use the Requests scraper (HTML parsing only, no JavaScript execution):

from py_web_miner.scraping import RequestsScraper

scraper_obj = RequestsScraper(
    random_user_agent_flag=True,
    random_screen_resolution_flag=True,
    bs4_parser="html.parser",
    proxy=None
)
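To see what "no JavaScript execution" implies in practice, consider a page whose content is filled in by a script. The sketch below uses the standard library's `html.parser` as a stand-in for a static parser (it is not how py-web-miner is implemented, and `VisibleTextParser`, `visible_text`, and `PAGE` are hypothetical names). Text that a browser would render after running the script never appears in the statically parsed result.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


# Illustrative page: the <div> is empty until the script runs in a browser.
PAGE = """<html><body>
<h1>Static title</h1>
<div id="app"></div>
<script>document.getElementById("app").textContent = "Rendered by JS";</script>
</body></html>"""


def visible_text(html):
    """Return the text a static parser can see, joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(visible_text(PAGE))
```

Here only the static heading is recovered. For pages like this one, the Selenium scraper is likely the better choice, since a real browser executes the script before the HTML is read.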

Start the scraper, retrieve the HTML source, then extract the raw text and all external links:

# start the scraper object
scraper_obj.start()

# retrieve the HTML source
html_body = scraper_obj.retrieve_html(
    url="https://github.com/andrealenzi11",
    wait_seconds=1.0,
    delete_cookies_flag=True
)

# extract the raw text
extracted_text = scraper_obj.extract_text(
    html_body=html_body
)

# extract the links
extracted_links = scraper_obj.extract_links(
    html_body=html_body
)

# quit
scraper_obj.quit()
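The README does not define what counts as an "external" link. One plausible reading is a link whose host differs from the host of the page it was found on. The sketch below illustrates that reading with the standard library only; it is an assumption for illustration, not the logic of `extract_links`, and `LinkCollector` and `external_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_links(html_body, page_url):
    """Return absolute http(s) links whose host differs from that of page_url."""
    collector = LinkCollector()
    collector.feed(html_body)
    collector.close()

    page_host = urlparse(page_url).netloc
    result = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        is_web_link = parts.scheme in ("http", "https")
        if is_web_link and parts.netloc != page_host and absolute not in result:
            result.append(absolute)
    return result


if __name__ == "__main__":
    sample = (
        '<a href="/andrealenzi11/py-web-miner">repo</a>'
        '<a href="https://www.python.org/">python</a>'
        '<a href="#top">top</a>'
    )
    print(external_links(sample, "https://github.com/andrealenzi11"))
```

Relative links and fragment links resolve to the page's own host and are dropped, while links to other hosts are kept in the order they appear.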

License: Apache License 2.0

Language: Python 100.0%