stphivos / store-parser

A parser library that queries product prices in popular online stores

store-parser

A Python 3 library that queries product prices in popular online stores.

How it works

It connects to online store websites using HTTP GET and parses the markup using lxml (https://github.com/lxml/lxml). Each store parser placed under the plugins directory is loaded at runtime and called to extract product information.
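The runtime loading of store parsers could look roughly like the sketch below. It is not the library's actual loader: `load_parser_classes` is a hypothetical helper, and the matching rule (class name equals the file name plus "Parser", compared case-insensitively) is an assumption derived from the naming convention described under "Adding support for other stores".

```python
import importlib.util
import inspect
from pathlib import Path


def load_parser_classes(plugins_dir):
    """Import every .py file in plugins_dir and collect its parser class.

    Hypothetical helper: the real library may discover plugins differently.
    Returns a dict mapping the module name (e.g. "mystore") to the class.
    """
    found = {}
    for path in sorted(Path(plugins_dir).glob("*.py")):
        if path.name.startswith("_"):
            continue  # skip __init__.py and other private modules

        # Load the file as a module without requiring it to be on sys.path.
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        # Assumed convention: mystore.py defines MyStoreParser.
        expected = (path.stem + "Parser").lower()
        for name, obj in inspect.getmembers(module, inspect.isclass):
            if name.lower() == expected:
                found[path.stem] = obj
    return found
```

Each class returned this way can then be instantiated and asked to extract product information for a search term.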

The sequence of steps is controlled by the Parser superclass as described below:

  • An initial request is made with the supplied search term to extract the number of result pages
  • For every page, a separate request is made to extract the list of products
  • The results from all stores are added to a unified list and returned to the caller
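The steps above can be sketched as a small template-method class. This is an illustration only: `SketchParser`, `fetch`, `search`, `search_all` and `FakeStore` are hypothetical names, and the fake store returns canned strings instead of performing HTTP requests. Only `extract_page_count` and `extract_results` mirror method names from the example plugin in this README.

```python
class SketchParser:
    """Hypothetical stand-in for the Parser superclass."""

    def fetch(self, term, page):
        # The real library performs an HTTP GET here (assumed signature).
        raise NotImplementedError

    def extract_page_count(self, markup):
        raise NotImplementedError

    def extract_results(self, markup):
        raise NotImplementedError

    def search(self, term):
        # Step 1: initial request, used only to learn the page count.
        page_count = self.extract_page_count(self.fetch(term, 1))

        # Step 2: one separate request per result page.
        results = []
        for page in range(1, page_count + 1):
            results.extend(self.extract_results(self.fetch(term, page)))
        return results


def search_all(parsers, term):
    # Step 3: merge the results of every store into one unified list.
    unified = []
    for parser in parsers:
        unified.extend(parser.search(term))
    return unified


class FakeStore(SketchParser):
    """Canned two-page store so the sketch runs without network access."""

    PAGES = {1: "amp A|amp B", 2: "amp C"}

    def fetch(self, term, page):
        return self.PAGES[page]

    def extract_page_count(self, markup):
        return len(self.PAGES)

    def extract_results(self, markup):
        return markup.split("|")


products = search_all([FakeStore(), FakeStore()], "marshall amp")
```

A store plugin only supplies the two `extract_*` methods; the request loop and the merging stay in one place.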

Installation

Prerequisites

  • Python 3, e.g. on macOS with Homebrew: brew install python3
  • lxml: pip3 install lxml

Usage

python3 store-parser -q "marshall amp"

Adding support for other stores

Steps

  • Add your [store-name-in-lowercase].py to the plugins directory
  • Write a class that inherits from Parser, named by this convention: [Store-name-capitalized]Parser
  • In your __init__ method, call the superclass __init__ with (name, host, path, key_query, key_page) where:
    • name is the name of the plugin, used only for logging
    • host is the domain of the store, e.g. mystore.com
    • path is the relative path of the search page, e.g. /search?
    • key_query is the query string key used for the search term, e.g. keywords in search?keywords=tube+screamer
    • key_page is the query string key used for the result page, e.g. page in search?keywords=tube+screamer&page=2
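To show how these five values could combine into a request URL, here is a minimal sketch. `build_search_url` is a hypothetical helper, and both the `https` scheme and the separator handling are assumptions; the library's own URL construction may differ.

```python
from urllib.parse import urlencode


def build_search_url(host, path, key_query, key_page, term, page=1, scheme="https"):
    """Compose a search URL from the plugin settings (hypothetical helper)."""
    # urlencode escapes the term, e.g. "tube screamer" -> "tube+screamer".
    params = urlencode({key_query: term, key_page: page})

    # The path may be "/search", "/search?" or "/search?lang=en".
    if "?" not in path:
        separator = "?"
    elif path.endswith(("?", "&")):
        separator = ""
    else:
        separator = "&"
    return f"{scheme}://{host}{path}{separator}{params}"


url = build_search_url(
    "www.mystore.com", "/search?lang=en", "keywords", "page", "tube screamer", 2
)
# url == "https://www.mystore.com/search?lang=en&keywords=tube+screamer&page=2"
```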

Example

from parsing import Parser, HtmlTraversal
from query import Result


class MyStoreParser(Parser):
    def __init__(self):
        # name, host, path, key_query, key_page
        super().__init__("mystore", "www.mystore.com", "/search?lang=en", "keywords", "page")

    def extract_page_count(self, markup):
        # Default to a single page when no pagination links are found.
        count = 1

        traverse = HtmlTraversal(markup)
        elements = traverse.get_elements("a", {"class": "page last"})

        if len(elements) > 0:
            # The text of the last matching link holds the number of pages.
            count = int(traverse.in_element(elements[-1]).get_value())

        return count

    def extract_results(self, markup):
        results = []

        traverse = HtmlTraversal(markup)
        elements = traverse.get_elements("li", {"class": "result"})

        # Each matching list item describes one product.
        for element in elements:
            result = Result()
            result.title = traverse.in_element(element).get_value_of("a", {"class": "title"})
            result.url = traverse.in_element(element).get_attr_of("a", {"class": "title"}, "href")
            result.price = traverse.in_element(element).get_value_of("span", {"class": "price"})
            results.append(result)

        return results

Contribution

Contributions of new features and support for new stores are very welcome. The only requirement is that the code conforms to PEP 8, the style guide for Python code.
