Entrophy
Reduce complexity, create simplicity
what does it do?
- creates product groups with rich metadata from a raw mess of online products
- this is a medium-sized project with around 10k lines of Python 3 as of June 2020: around 200 files, averaging 50 lines per file
directory walkthrough
api/
cloud entry points are here,
scripts here are defined as services in setup.py
and cron jobs are set for them in Scrapinghub
visitor.py visits all websites
refresher.py refreshes the matching and syncs the new matching
also Sentry is initialized in visitor, backup, and refresher. it is connected to my personal account, so please replace the API keys because I will revoke the present ones
build/
autogenerated
constants/
frequently used dictionary keys are defined here
data_services/
when a function needs to read or write some data, it just calls a service for it.
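a minimal sketch of the pattern, assuming pymongo and hypothetical db, collection, and function names:

```python
# sketch of a data service; db, collection, and function names are assumptions
from pymongo import MongoClient

client = MongoClient()  # connection details live in the service, not in callers
collection = client["entrophy"]["raw_docs"]

def get_raw_docs(market: str) -> list:
    """read service: callers never touch pymongo directly"""
    return list(collection.find({"market": market}))

def save_raw_doc(doc: dict) -> None:
    """write service: upsert by source url"""
    collection.replace_one({"url": doc["url"]}, doc, upsert=True)
```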
enrich/
enrich skus with brand, category, color etc.
topic-modelling
experimental, not in production
mobile/
helper scripts to provide some necessary data to mobile, usually one-off scripts
project.egg/
autogenerated on deploy
services/
reusable generic services
spec/
defines data models and exceptions, not an exhaustive spec though
spiders/
all collection logic here,
spider_modules/
parsers here
for a new module to be registered, it has to be defined in SPIDER_MODULES at spiders/settings.py
on start, scrapy imports every module defined in SPIDER_MODULES, so be careful: don't leave stray functions or top-level code lying around
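for reference, the registration looks roughly like this (the module paths below are assumptions, match them to the repo layout):

```python
# spiders/settings.py : the module paths below are assumptions
SPIDER_MODULES = [
    "spiders.spider_modules.market_a",
    "spiders.spider_modules.market_b",
]
```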
pipelines/
they are very important, all data goes through a pipeline
they clean the parsed data and sync them to mongo, elastic, and firebase
cleaning and syncing are involved operations, so be careful
in settings.py the default pipeline is set to MarketPipeline; almost all data goes through it
ITEM_PIPELINES = {"spiders.pipelines.market_pipeline.MarketPipeline": 300}
however some parsers define a different pipeline in their custom_settings
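a minimal sketch of such a pipeline (the cleaning and sync steps here are placeholders, not the real logic):

```python
# sketch of a Scrapy item pipeline; the clean and sync steps are placeholders
class MarketPipeline:
    def process_item(self, item, spider):
        # clean: normalize whitespace in the name field (real cleaning is more involved)
        item["name"] = " ".join(item.get("name", "").split())
        # sync: this is where the writes to mongo, elastic, and firebase would go
        spider.logger.debug("synced %s", item.get("url"))
        return item  # always return the item so later pipelines still run
```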
base.py
spider template
visitor.py
calling all spiders one by one would be very tedious
visitor collects all of them and runs them concurrently via run_spiders_concurrently
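run_spiders_concurrently probably boils down to something like this sketch with Scrapy's CrawlerProcess (not the actual implementation):

```python
# sketch of running all spiders concurrently with Scrapy; not the actual visitor.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def run_spiders_concurrently(spider_classes):
    process = CrawlerProcess(get_project_settings())
    for spider_cls in spider_classes:
        process.crawl(spider_cls)  # schedule every spider in the same reactor
    process.start()  # blocks until all spiders finish
```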
test_spider.py
test runner; pipelines are not activated when you run it
supermatch/
gets raw docs and turns them into groups of SKUs and products
syncer.py
writes the created groups to elasticsearch and firestore
test/
tests and one-off scratch scripts
.deepsource.toml
static analysis tool, connected to github
travis.yml
for Travis CI; it is not active
Pipfile
package management
requirements
Scrapinghub needs this instead of the Pipfile
congrats, that's all :)
how to deploy
in a terminal, from the top-level directory,
run shub deploy
deployment configs are in
- setup.py
- scrapy.cfg
- scrapinghub.yml
- and spiders/settings.py
how does it work?
system summary
- crawlers collect raw docs and save them to mongo
- supermatch creates groups
  - docs as nodes of a graph
  - edges from barcodes, names, and promoted links
  - connected components are groups of docs
  - reduce a doc group to a single SKU
  - group SKUs using links and names
  - reduce SKUs into products
a key library here is networkx; it's used to create and update the graph
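the core of the grouping step looks roughly like this (a sketch with made-up doc ids):

```python
# sketch of the doc-grouping idea with networkx; ids and edges are made up
import networkx as nx

G = nx.Graph()
G.add_nodes_from(["doc1", "doc2", "doc3", "doc4"])
G.add_edge("doc1", "doc2")  # e.g. same barcode
G.add_edge("doc2", "doc3")  # e.g. same promoted link

# every connected component is one group of docs, i.e. one SKU candidate
groups = [set(c) for c in nx.connected_components(G)]
# -> [{"doc1", "doc2", "doc3"}, {"doc4"}]
```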
- assigns a category, brand, type, size, color, and other identifiers to SKUs by using names and info in raw docs
- saves SKUs to elastic and firestore
- instant update -> when crawling a raw doc the next time, write the fresh price to the related elastic SKU (sketched after this summary)
result: a clean set of products with rich metadata
SKUs can be regenerated confidently at any time from the raw docs, so we only back up the raw docs
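the instant price update might look like this with the elasticsearch client (the index name and field names are assumptions):

```python
# sketch of the instant price update; index name and field names are assumptions
from elasticsearch import Elasticsearch

es = Elasticsearch()

def update_price(sku_id: str, price: float) -> None:
    # partial update: only the price changes, the rest of the SKU stays as-is
    es.update(index="skus", id=sku_id, body={"doc": {"price": price}})
```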
tools
use pylint, mypy, black regularly
write automated tests for critical parts
beware the code smells
any one of them in your code?
- long classes
- long functions
- hardcoded variables
- too many args
- flags
- errors not handled properly
- copy-paste
- mixed styles
- dead code
- commented out lines
- no tests
- functions are hard to test
- bad names
- unnecessary complexity
- functions that do more than one thing
image match
you may take a similarity hash of images and compare them to match items
https://github.com/JohannesBuchner/imagehash
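for example with imagehash (the distance threshold here is a guess to tune):

```python
# compare two product images by perceptual hash; the threshold is a guess to tune
from PIL import Image
import imagehash

hash_a = imagehash.phash(Image.open("item_a.jpg"))
hash_b = imagehash.phash(Image.open("item_b.jpg"))

# subtracting hashes gives the Hamming distance: small distance ~ similar images
if hash_a - hash_b <= 8:
    print("probably the same item")
```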
topic modelling
gensim is great
https://radimrehurek.com/gensim/models/ldamodel.html
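a minimal LdaModel run on a toy corpus (real runs need far more data and tuning):

```python
# minimal gensim LDA example; the corpus here is a toy
from gensim import corpora
from gensim.models import LdaModel

texts = [["red", "shirt", "cotton"], ["blue", "jeans", "denim"], ["red", "dress", "cotton"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())
```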
next steps
a few bits of advice :)
- think about data structures and their relationships
- think ahead
  - set up backups
  - have a plan B
  - make it easy to recover
- solve the right problem
- If it is hard to explain, it's a bad idea.
- define what is most important and focus on it
ride now, ride and fear no darkness...