store-parser

A Python 3 library that queries product prices from popular online stores.

How it works

It connects to online store websites using HTTP GET and parses the markup using lxml (https://github.com/lxml/lxml). Each store parser placed under the plugins directory is loaded at runtime and called to extract product information.
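As a rough illustration of runtime plugin loading, the sketch below imports every .py file in a directory and collects the classes whose names end in "Parser". The function name load_plugin_classes and the suffix check are assumptions made for this example; the library's actual loader may work differently.

```python
import importlib.util
import inspect
from pathlib import Path


def load_plugin_classes(plugins_dir, suffix="Parser"):
    """Import each .py file in plugins_dir and return its classes named *<suffix>.

    Hypothetical helper for illustration; not the library's own loader.
    """
    classes = []
    for path in sorted(Path(plugins_dir).glob("*.py")):
        if path.name.startswith("_"):
            continue  # skip __init__.py and private helper modules
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for name, obj in inspect.getmembers(module, inspect.isclass):
            # Keep classes defined in this file, not ones it merely imported
            # (such as the Parser base class itself).
            if obj.__module__ == module.__name__ and name.endswith(suffix):
                classes.append(obj)
    return classes
```

Each returned class could then be instantiated and asked to run a search, which keeps the core code unaware of individual stores.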

The sequence of steps is controlled by the Parser superclass, as described below:

An initial request is made with the supplied search term to extract the number of result pages.
For every page, a separate request is made to extract the list of products.
The results from all stores are combined into a single list and returned to the caller.
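The flow above can be sketched roughly as follows. This is not the Parser superclass itself: search_store, search_all and the fetch callable (which takes a term and a page number and returns markup) are hypothetical names introduced for the example. Only extract_page_count and extract_results come from the plugin interface shown in this README.

```python
def search_store(parser, fetch, term):
    """Run the request sequence for a single store (illustrative sketch)."""
    # Step 1: an initial request, used only to learn how many pages exist.
    first_markup = fetch(term, 1)
    page_count = parser.extract_page_count(first_markup)

    # Step 2: one separate request per result page, as the README describes.
    results = []
    for page in range(1, page_count + 1):
        markup = fetch(term, page)
        results.extend(parser.extract_results(markup))
    return results


def search_all(parsers, fetch, term):
    """Step 3: merge the results of every store into one list."""
    unified = []
    for parser in parsers:
        unified.extend(search_store(parser, fetch, term))
    return unified
```

Passing fetch in as a parameter is a choice made for this sketch; it lets the flow be exercised with canned markup instead of live HTTP requests.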

Installation

Prerequisites

Requires Python 3 (on macOS with Homebrew): brew install python3
Requires lxml: pip3 install lxml

Usage

python3 store-parser -q "marshall amp"

Adding support for other stores

Steps

Add a [store-name-in-lowercase].py file to the plugins directory
Write a class that inherits from Parser, named by this convention: [Store-name-capitalized]Parser
In your __init__ method, call the superclass __init__ with (name, host, path, key_query, key_page), where:
- name is the name of the plugin, used only for logging
- host is the domain of the store, e.g. mystore.com
- path is the relative path of the search page, e.g. /search?
- key_query is the query string key used for the search term, e.g. keywords in search?keywords=tube+screamer
- key_page is the query string key used for the result page number, e.g. page in search?keywords=tube+screamer&page=2
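To make the role of each argument concrete, here is a minimal sketch of how the five values could be combined into a search URL. The function build_search_url, its scheme parameter and the separator handling are assumptions for illustration; the Parser class may assemble its URLs differently.

```python
from urllib.parse import urlencode


def build_search_url(host, path, key_query, key_page, term, page=None, scheme="https"):
    """Combine the plugin settings into a full search URL (hypothetical helper)."""
    params = {key_query: term}
    if page is not None:
        params[key_page] = page

    # Pick the separator based on what the path already contains:
    # "/search" needs "?", "/search?" needs nothing, "/search?lang=en" needs "&".
    if "?" not in path:
        separator = "?"
    elif path.endswith(("?", "&")):
        separator = ""
    else:
        separator = "&"

    # urlencode escapes the term, turning spaces into "+".
    return f"{scheme}://{host}{path}{separator}{urlencode(params)}"
```

With the values from the example plugin in this README and the term "tube screamer" on page 2, the helper yields https://www.mystore.com/search?lang=en&keywords=tube+screamer&page=2.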

Example

from parsing import Parser, HtmlTraversal
from query import Result


class MyStoreParser(Parser):
    def __init__(self):
        super().__init__("mystore", "www.mystore.com", "/search?lang=en", "keywords", "page")

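    def extract_page_count(self, markup):
        # Default to a single page when no pagination link is found.
        count = 1

        traverse = HtmlTraversal(markup)
        elements = traverse.get_elements("a", {"class": "page last"})

        # The text of the last matching link is taken as the page count.
        if len(elements) > 0:
            count = int(traverse.in_element(elements[-1]).get_value())

        return count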

    def extract_results(self, markup):
        results = []

        traverse = HtmlTraversal(markup)
        # Each product on the page is an <li class="result"> element.
        elements = traverse.get_elements("li", {"class": "result"})

        for x in elements:
            # Read the title, link and price from within the current item.
            result = Result()
            result.title = traverse.in_element(x).get_value_of("a", {"class": "title"})
            result.url = traverse.in_element(x).get_attr_of("a", {"class": "title"}, "href")
            result.price = traverse.in_element(x).get_value_of("span", {"class": "price"})
            results.append(result)

        return results

Contribution

Contributions of new features and support for new stores are very welcome. The only requirement is that the code conforms to PEP 8, the style guide for Python code.
