forhadsidhu / Beautifulsoup-CheatSheet

Basic code for web-scraping using beautifulsoup


Beautifulsoup-CheatSheet

Library import

import requests
from bs4 import BeautifulSoup

Making Simple Requests

r = requests.get("http://example.com/page")

(usually used when sending information to the server, like submitting a form)

r = requests.post("http://example.com/page", data=dict(
    email="me@domain.com",
    password="secret_value"
))

(usually used when making a search query or paging through results)

r = requests.get("http://example.com/page", params=dict(
    query="web scraping",
    page=2
))

Getting the response status code

print(r.status_code)

Full response as text

print(r.text)

Find substring

if "blocked" in r.text:
    print "we've been blocked"

Find Content Type

print(r.headers.get("content-type", "unknown"))

Parsing Using BeautifulSoup

soup = BeautifulSoup(r.text, "html.parser")

Find all links

links = soup.find_all("a")

Find all tags with a specific class

tags = soup.find_all("li", "search-result")

Find a tag with a specific id

tag = soup.find("div", id="bar")

Look for nested patterns of tags

tags = soup.find("div", id="search-results").find_all("a", "external-links")

Look for all tags matching CSS selectors

tags = soup.select("#search-results .external-links")

Get a list of strings representing the inner contents of a tag

inner_contents = soup.find("div", id="price").contents

Return only the text contents within this tag, ignoring the text representation of other HTML tags

inner_text = soup.find("div", id="price").text.strip()

Encode the extracted text from unicode to UTF-8 bytes

inner_text = soup.find("div", id="price").text.strip().encode("utf-8")
    
Using XPath Selectors

BeautifulSoup doesn't currently support XPath selectors. If you need XPath, parse the page with a library such as lxml instead, as sketched below.
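A minimal sketch of XPath extraction with lxml (assuming lxml is installed; the "price" div is a placeholder target):

import requests
from lxml import html

r = requests.get("http://example.com/page")
tree = html.fromstring(r.text)

# XPath equivalent of soup.find("div", id="price").text
price_texts = tree.xpath('//div[@id="price"]/text()')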
    
Writing to CSV

import csv
import os

# "~" is not expanded automatically, so expand it; newline="" avoids blank rows on Windows
with open(os.path.expanduser("~/Desktop/output.csv"), "w", newline="") as f:
    writer = csv.writer(f)

    # collected_items = [
    #   ["Product #1", "$10", "http://example.com/product-1"],
    #   ["Product #2", "$25", "http://example.com/product-2"],
    #   ...
    # ]

    for item_property_list in collected_items:
        writer.writerow(item_property_list)

import csv
import os

field_names = ["Product Name", "Price", "Detail URL"]

with open(os.path.expanduser("~/Desktop/output.csv"), "w", newline="") as f:
    writer = csv.DictWriter(f, field_names)

    # collected_items = [
    #   {
    #       "Product Name": "Product #1",
    #       "Price": "$10",
    #       "Detail URL": "http://example.com/product-1"
    #   },
    #   ...
    # ]

    # Write a header row
    writer.writeheader()

    for item_property_dict in collected_items:
        writer.writerow(item_property_dict)
    
Writing to a SQLite Database

import sqlite3

conn = sqlite3.connect("/tmp/output.sqlite")
cur = conn.cursor()

# (assumes the scraped_data table already exists)
for item in collected_items:
    cur.execute(
        "INSERT INTO scraped_data (title, price, url) VALUES (?, ?, ?)",
        (item["title"], item["price"], item["url"])
    )

conn.commit()  # persist the inserts
conn.close()

Different Parsers

Python's html.parser    BeautifulSoup(markup, "html.parser")
lxml's HTML parser      BeautifulSoup(markup, "lxml")
lxml's XML parser       BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
html5lib                BeautifulSoup(markup, "html5lib")
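A quick sketch of switching parsers (html.parser ships with Python; lxml and html5lib must be installed separately, e.g. via pip):

from bs4 import BeautifulSoup

markup = "<ul><li>one<li>two</ul>"  # note the unclosed <li> tags

# The built-in parser is always available
print(BeautifulSoup(markup, "html.parser").prettify())

# lxml is generally faster and very lenient with broken markup
print(BeautifulSoup(markup, "lxml").prettify())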
    

What is Web Scraping?

Web scraping is the technique of extracting and reading data from the internet. The collected data can be saved and reused for data analytics.
    

Explain the Web Scraping Procedure.

There are multiple steps involved in web scraping (a minimal end-to-end sketch follows this list):

Reading data (the source code of the web page at a given URL) from the website
Parsing this data based on its HTML tags
Storing or displaying the desired scraped information

Scraped data is very useful in data analytics.
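A minimal sketch combining all three steps with the libraries from this cheat sheet (the URL and the link selector are placeholders):

import csv
import requests
from bs4 import BeautifulSoup

# Step 1: read the page source
r = requests.get("http://example.com/page")

# Step 2: parse it based on HTML tags
soup = BeautifulSoup(r.text, "html.parser")
rows = [[a.text.strip(), a.get("href")] for a in soup.find_all("a")]

# Step 3: store the scraped information
with open("output.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)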
    

What are the preferred programming languages for web scraping?

Python is the most preferred programming language for web scraping.
It has many libraries for reading and extracting data from the internet, and for parsing and manipulating that data.
    

What are the Python libraries you have used for web scraping?

Beautiful Soup and Scrapy are the two most useful Python modules for scraping web information.
The requests module is used to read the data from internet web pages.
The json library is used to read, write, and dump JSON-formatted objects.
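A small sketch of saving and reloading scraped items with the standard json library (the items are placeholders):

import json

collected_items = [{"title": "Product #1", "price": "$10"}]

# Dump scraped items to a JSON file...
with open("output.json", "w") as f:
    json.dump(collected_items, f, indent=2)

# ...and read them back later
with open("output.json") as f:
    items = json.load(f)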
    

What is the purpose of the requests module in Python?

The requests module is used to read data from internet web pages.
You pass it the URL you want to read, along with the HTTP request method and
header information such as the encoding method, the expected response data format, and session cookies.

In the HTTP response, you get data back from the website. The data can be in any format, such as a plain string,
JSON, XML, or YAML, depending on the format asked for in the request and on the server's response.
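A sketch of a request that sets headers and cookies explicitly (the header and cookie values are illustrative):

import requests

r = requests.get(
    "http://example.com/page",
    headers={"User-Agent": "my-scraper/1.0", "Accept": "application/json"},
    cookies={"session_id": "abc123"},
)

print(r.status_code)
print(r.headers.get("content-type", "unknown"))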
    

How do you deal with your IP address being blocked by a website?

If you access a website more often than a certain threshold, your IP address can be blocked by the website.
Proxy IPs/servers can be used to access the web pages if your IP address is blocked.

Data analytics companies often scrape millions of web pages, and their IP addresses frequently get blocked.
To overcome this they use a VPN (Virtual Private Network). There are many VPN service providers.
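A sketch of routing requests through a proxy with the requests module (the proxy address is a placeholder):

import requests

proxies = {
    "http": "http://10.10.1.10:3128",   # placeholder proxy address
    "https": "http://10.10.1.10:3128",
}

r = requests.get("http://example.com/page", proxies=proxies)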
    

How does a VPN work?

You send a request to the VPN server, which reads the data from the website and sends the response back to your IP address.
The VPN hides your IP address from the websites you scrape.
    

What is robots.txt?

robots.txt contains instructions telling crawlers which parts of a site they may access. Following them makes it less likely that you will be blocked, and the file also reveals the site's structure.
    

Where do you get robots.txt?

Navigate to your domain and just add "/robots.txt" (for example, http://example.com/robots.txt). You can also check it programmatically, as sketched below.
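A sketch of checking robots.txt with Python's built-in urllib.robotparser:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://example.com/robots.txt")
rp.read()

# Check whether our user agent may fetch a given path
print(rp.can_fetch("my-scraper", "http://example.com/page"))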
    
