Entrophy
Reduce complexity, create simplicity
what does it do?
- creates product groups with rich metadata from a raw mess of online products
- this is a medium-sized project with around 10k lines of Python 3 as of June 2020: around 200 files, averaging 50 lines per file
directory walkthrough
api/
cloud entry points are here,
scripts here are defined as services in setup.py
and cron jobs are set for them in Scrapinghub
visitor.py visits all websites
refresher.py refreshes the matching and syncs the new matching
also Sentry is initialized in visitor, backup, and refresher. it is connected to my personal account, so please replace the API keys because I will revoke the present ones
build/
autogenerated
constants/
frequently used dictionary keys are defined here
data_services/
when a function needs to read or write some data, it just calls a service for it.
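a minimal sketch of the pattern, assuming pymongo and hypothetical db, collection, and function names:

```python
# sketch of a data service; db, collection, and function names are assumptions
from pymongo import MongoClient

client = MongoClient()  # connection details live in the service, not in callers
collection = client["entrophy"]["raw_docs"]

def get_raw_docs(market: str) -> list:
    """read service: callers never touch pymongo directly"""
    return list(collection.find({"market": market}))

def save_raw_doc(doc: dict) -> None:
    """write service: upsert by source url"""
    collection.replace_one({"url": doc["url"]}, doc, upsert=True)
```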
enrich/
enrich skus with brand, category, color etc.
topic-modelling
experimental, not in production
mobile/
helper scripts to provide some necessary data to mobile, usually one-off scripts
project.egg/
autogenerated on deploy
services/
reusable generic services
spec/
defines data models and exceptions, not an exhaustive spec though
spiders/
all collection logic here,
spider_modules/
parsers here
for a new module to be registered, it has to be defined in SPIDER_MODULES at spiders/settings.py
on start, scrapy imports every module defined in SPIDER_MODULES, so be careful: don't leave stray functions or top-level code lying around
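for reference, the registration looks roughly like this (the module paths below are assumptions, match them to the repo layout):

```python
# spiders/settings.py : the module paths below are assumptions
SPIDER_MODULES = [
    "spiders.spider_modules.market_a",
    "spiders.spider_modules.market_b",
]
```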
pipelines/
they are very important, all data goes through a pipeline
they clean the parsed data and sync them to mongo, elastic, and firebase
cleaning and syncing are involved operations, so be careful
in settings.py the default pipeline is set to MarketPipeline; almost all data goes through it
ITEM_PIPELINES = {"spiders.pipelines.market_pipeline.MarketPipeline": 300}
however some parsers define a different pipeline in their custom_settings
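a minimal sketch of such a pipeline (the cleaning and sync steps here are placeholders, not the real logic):

```python
# sketch of a Scrapy item pipeline; the clean and sync steps are placeholders
class MarketPipeline:
    def process_item(self, item, spider):
        # clean: normalize whitespace in the name field (real cleaning is more involved)
        item["name"] = " ".join(item.get("name", "").split())
        # sync: this is where the writes to mongo, elastic, and firebase would go
        spider.logger.debug("synced %s", item.get("url"))
        return item  # always return the item so later pipelines still run
```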
base.py
spider template
visitor.py
calling all spiders one by one would be very tedious
visitor collects all of them and runs them concurrently via run_spiders_concurrently
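run_spiders_concurrently probably boils down to something like this sketch with Scrapy's CrawlerProcess (not the actual implementation):

```python
# sketch of running all spiders concurrently with Scrapy; not the actual visitor.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def run_spiders_concurrently(spider_classes):
    process = CrawlerProcess(get_project_settings())
    for spider_cls in spider_classes:
        process.crawl(spider_cls)  # schedule every spider in the same reactor
    process.start()  # blocks until all spiders finish
```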
test_spider.py
test runner; pipelines are not activated when you run it
supermatch/
gets raw docs and turns them into groups of SKUs and products
syncer.py
writes the created groups to elasticsearch and firestore
test/
tests and one-off scratch scripts
.deepsource.toml
static analysis tool, connected to github
travis.yml
for Travis CI; it is not active
Pipfile
package management
requirements
Scrapinghub needs this instead of the Pipfile
congrats, that's all :)
how to deploy
in a terminal, from the top-level directory,
run shub deploy
deployment configs are in
- setup.py
- scrapy.cfg
- scrapinghub.yml
- and spiders/settings.py
how does it work?
system summary
- crawlers collect raw docs and save them to mongo
- supermatch creates groups
  - docs as nodes of a graph
  - edges from barcodes, names, and promoted links
  - connected components are groups of docs
  - reduce a doc group to a single SKU
  - group SKUs using links and names
  - reduce SKUs into products
a key library here is networkx; it's used to create and update the graph
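the core of the grouping step looks roughly like this (a sketch with made-up doc ids):

```python
# sketch of the doc-grouping idea with networkx; ids and edges are made up
import networkx as nx

G = nx.Graph()
G.add_nodes_from(["doc1", "doc2", "doc3", "doc4"])
G.add_edge("doc1", "doc2")  # e.g. same barcode
G.add_edge("doc2", "doc3")  # e.g. same promoted link

# every connected component is one group of docs, i.e. one SKU candidate
groups = [set(c) for c in nx.connected_components(G)]
# -> [{"doc1", "doc2", "doc3"}, {"doc4"}]
```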
- assigns a category, brand, type, size, color, and other identifiers to SKUs by using names and info in raw docs
- saves SKUs to elastic and firestore
- instant update -> when crawling a raw doc the next time, write the fresh price to the related elastic SKU (sketched after this summary)
result: a clean set of products with rich metadata
SKUs can be regenerated confidently at any time from the raw docs, so we only back up the raw docs
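the instant price update might look like this with the elasticsearch client (the index name and field names are assumptions):

```python
# sketch of the instant price update; index name and field names are assumptions
from elasticsearch import Elasticsearch

es = Elasticsearch()

def update_price(sku_id: str, price: float) -> None:
    # partial update: only the price changes, the rest of the SKU stays as-is
    es.update(index="skus", id=sku_id, body={"doc": {"price": price}})
```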
tools
use pylint, mypy, black regularly
write automated tests for critical parts
beware the code smells
any one of them in your code?
- long classes
- long functions
- hardcoded variables
- too many args
- flags
- errors not handled properly
- copy-paste
- mixed styles
- dead code
- commented out lines
- no tests
- functions are hard to test
- bad names
- unnecessary complexity
- functions that do more than one thing
image match
you may take a similarity hash of images and compare them to match items
https://github.com/JohannesBuchner/imagehash
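for example with imagehash (the distance threshold here is a guess to tune):

```python
# compare two product images by perceptual hash; the threshold is a guess to tune
from PIL import Image
import imagehash

hash_a = imagehash.phash(Image.open("item_a.jpg"))
hash_b = imagehash.phash(Image.open("item_b.jpg"))

# subtracting hashes gives the Hamming distance: small distance ~ similar images
if hash_a - hash_b <= 8:
    print("probably the same item")
```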
topic modelling
gensim is great
https://radimrehurek.com/gensim/models/ldamodel.html
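a minimal LdaModel run on a toy corpus (real runs need far more data and tuning):

```python
# minimal gensim LDA example; the corpus here is a toy
from gensim import corpora
from gensim.models import LdaModel

texts = [["red", "shirt", "cotton"], ["blue", "jeans", "denim"], ["red", "dress", "cotton"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())
```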
next steps
a few bits of advice :)
- think about data structures and their relationships
- think ahead
  - set up backups
  - have a plan B
  - make it easy to recover
- solve the right problem
- If it is hard to explain, it's a bad idea.
- define what is most important and focus on it
ride now, ride and fear no darkness...