A python3 library that queries product prices in popular online stores.
It connects to online store websites using HTTP GET and parses the markup using lxml (https://github.com/lxml/lxml). Each store parser placed under the plugins directory is loaded at runtime and called to extract product information.
The sequence of steps is controlled by the Parser super class as described below:
- An initial request using the supplied term is performed, to extract the page count of the results
- For every page, a separate request is made to extract the list of products
- The results from all stores are added to a unified list and returned to the caller
- Requires python 3: brew install python3
- Requires lxml: pip3 install lxml
python3 store-parser -q "marshall amp"
- Add your [store-name-in-lowercase].py in the plugins directory
- Write a class that inherits from Parser named by this convention: [Store-name-capitalized]Parser
-
In your __init__ method call the super class __init__ using (name, host, path, key_query, key_page) where:
- name is the name of the plugin - only used for logging purposes
- host is the domain of the store, eg. mystore.com
- path is the relative path of the search page, eg. /search?
- key_query is the query string key used for the search term, eg. search?keywords=tube+screamer
- key_page is the query string key used for the result page, eg. search?keywords=tube+screamer&page=2
from parsing import Parser, HtmlTraversal
from query import Result
class MyStoreParser(Parser):
def __init__(self):
super().__init__("mystore", "www.mystore.com", "/search?lang=en", "keywords", "page")
def extract_page_count(self, markup):
count = 1
traverse = HtmlTraversal(markup)
elements = traverse.get_elements("a", {"class": "page last"})
if len(elements) > 0:
count = int(traverse.in_element(elements[-1]).get_value())
return count
def extract_results(self, markup):
results = []
traverse = HtmlTraversal(markup)
elements = traverse.get_elements("li", {"class": "result"})
for x in elements:
result = Result()
result.title = traverse.in_element(x).get_value_of("a", {"class": "title"})
result.url = traverse.in_element(x).get_attr_of("a", {"class": "title"}, "href")
result.price = traverse.in_element(x).get_value_of("span", {"class": "price"})
results.append(result)
return results
Contributions for new features and support for new stores are very welcome. The only requirement is that the code conforms to PEP 8 - coding style guide for python.