A simple Python application that processes 'recipe' files and records their output in a separate GitHub repo at https://github.com/weitzman/diffi. A recipe is a small JSON file giving the URL of an organization's privacy policy or terms of service. We record changes to the fetched documents in our "data" Git repository, which gives a full history of how each one has changed over time. In addition to saving a full HTML file, we save a Markdown variant for easy browsing and change presentation.
Collaboration is welcome on this project. Please file PRs for new/updated recipes and use the issue tracker for communication.
To get started:
- Clone this repo:
git clone https://github.com/weitzman/difficode.git
- Change into the new directory:
cd difficode
- Get dependencies:
pip install -r requirements.txt
- Some commands you may want to run:
- Process all recipes:
app.py all
- Process one recipe:
app.py one recipes/uber/privacy.json
- Increase log verbosity via an environment variable:
DEBUGGING=1 app.py all
- Write output to a custom dir:
REPO_PATH=/my/path app.py all
The output directory defaults to /tmp/diffidata.
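The two environment variables above could be read along these lines. This is a minimal sketch, not the app's actual code: the helper name `load_settings` is hypothetical, and treating any non-empty `DEBUGGING` value as "on" is an assumption. Only the variable names `DEBUGGING` and `REPO_PATH` and the `/tmp/diffidata` default come from this README.

```python
import logging
import os


def load_settings(environ=os.environ):
    """Hypothetical helper: read the environment variables named in this README."""
    # REPO_PATH falls back to the documented default output directory.
    repo_path = environ.get("REPO_PATH", "/tmp/diffidata")
    # Assumption: any non-empty DEBUGGING value turns on verbose logging.
    level = logging.DEBUG if environ.get("DEBUGGING") else logging.INFO
    return repo_path, level


repo_path, level = load_settings()
logging.basicConfig(level=level)
logging.info("Writing output to %s", repo_path)
```

Passing a plain dict instead of `os.environ` makes the helper easy to try out without touching the real environment.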
- A recipe is a JSON file (for example, recipes/uber/privacy.json). Its main properties:
- url: The web page to fetch
- selector: A CSS selector so we can extract only the policy content, and not page navigation.
- See all properties in the Recipe class.
- Ideally a recipe directory contains a maintainers.json file. A maintainer helps fix problems when the policy web page changes or moves.
- A recipe can have an accompanying Python script that does arbitrary cleanup before Markdown is extracted. Example.
- Why not add more recipes by submitting a PR to this repo?
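To make the recipe format concrete, here is a minimal sketch of a recipe and one way it might be loaded. The URL and selector values are made up, and the `Recipe` dataclass and `parse_recipe` function are hypothetical stand-ins that cover only the two properties listed above. The real Recipe class in this repo may define more properties and may load files differently.

```python
import json
from dataclasses import dataclass

# Made-up recipe content for illustration; a real recipe is a file
# such as recipes/uber/privacy.json.
EXAMPLE_RECIPE = """
{
  "url": "https://example.com/privacy",
  "selector": "main"
}
"""


@dataclass
class Recipe:
    """Hypothetical stand-in for the app's Recipe class."""
    url: str       # The web page to fetch
    selector: str  # CSS selector that isolates the policy content


def parse_recipe(text):
    """Parse recipe JSON text; unknown or missing keys raise TypeError."""
    return Recipe(**json.loads(text))


recipe = parse_recipe(EXAMPLE_RECIPE)
print(recipe.url, recipe.selector)
```

Failing loudly on unknown keys is a design choice of this sketch: it catches typos in a recipe file early, at the cost of rejecting properties the sketch does not know about.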
We stand on the shoulders of giants.
- Requests-HTML. Make HTTP Requests.
- Beautiful Soup. Parse HTML.
- Python Fire. Make a CLI based on your existing objects and functions.
- jsons. (De)serializing JSON into Python objects.
- This app runs daily on Heroku.
- Make sure you have Python 3.7 or higher:
python3 --version