willb / nachlass

automation to make it possible to publish your ephemeral note(book)s on Kubernetes

nachlass

A philosopher's Nachlass is the set of notes he or she leaves behind unpublished. This project is a tool to make it easier to publish your Jupyter notebooks; it started with some experiments building tooling to treat notebooks as software artifacts and currently includes a source-to-image builder to translate from scikit-learn pipelines to scoring services that publish prediction metrics for classifiers and regressors. Our goal with this work is to streamline the model lifecycle, shorten feedback loops, and make models more reproducible.

See a demo

Sophie Watson and Will Benton gave a talk at OSCON 2019 that includes a demonstration of an early version of Nachlass, including pipeline service generation and metrics publishing.

Trying it out

Quick start

First, download the s2i command-line tool. Then build a model service image from the example notebook repository:

s2i build -e S2I_SOURCE_NOTEBOOK_LIST=03-feature-engineering-tfidf.ipynb,04-model-logistic-regression.ipynb https://github.com/willb/ml-workflows-notebook quay.io/willbenton/nachlass-s2i:latest

This will generate a container image for a model service. Deploy this service in your favorite Kubernetes environment and post text data to it with cURL:

curl --request POST \
  --url http://pipeline:8080/predict \
  --header 'content-type: application/x-www-form-urlencoded' \
  --data-urlencode 'json_args=["example text goes here"]'

or with Python's requests library:

import json

import requests

DEFAULT_BASE_URL = "http://pipeline:8080/%s"


def score_text(text, url=None):
    """POST one string or a list of strings to the service's predict endpoint."""
    url = url or (DEFAULT_BASE_URL % "predict")
    if isinstance(text, str):
        text = [text]
    # requests form-encodes a dict passed as data and sets the content type.
    response = requests.post(url, data={"json_args": json.dumps(text)})
    response.raise_for_status()
    return response.json()

(In either case, substitute your actual service name for pipeline.)
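
The request body in these examples is a single form field, json_args, whose value is a JSON document. The following is a minimal sketch of that encoding using only the Python standard library; build_payload is a hypothetical helper name, and wrapping a lone string in a list simply mirrors what score_text does above.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_payload(texts):
    """Form-encode one string or a list of strings as a json_args field."""
    if isinstance(texts, str):
        texts = [texts]
    return urlencode({"json_args": json.dumps(texts)})


body = build_payload("example text goes here")
print(body)  # json_args=%5B%22example+text+goes+here%22%5D

# Decoding the form body recovers the original JSON list.
decoded = json.loads(parse_qs(body)["json_args"][0])
print(decoded)  # ['example text goes here']
```

Printing the body this way can help when comparing what a hand-written cURL command sends against what the Python client sends.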

This service will classify text as either spam or legitimate -- in the context of the deployed model, "spam" means that supplied text resembles customer-supplied food reviews from a retail website and "legitimate" means that the supplied text resembles Jane Austen novels. (See this repository for background on this model.)

With your own pipeline

One of our design goals for Nachlass was to require as little as possible of machine learning practitioners -- a data scientist does not need to know that a notebook or set of notebooks will run on Kubernetes, as a service, or under Nachlass so long as she follows some basic principles:

  1. notebooks must be committed to a git repository,
  2. notebook requirements must be specified in a requirements.txt file,
  3. notebooks must produce the expected output when cells are executed in order from start to finish, and
  4. each notebook that produces a pipeline stage must save it to a file whose location is specified by the environment variable S2I_PIPELINE_STAGE_SAVE_FILE.

The last of these is the only requirement that should change the contents of a disciplined notebook, but we can make it less of a burden by supplying a simple helper function:

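import os
import cloudpickle as cp  # assumption: cp refers to the cloudpickle module

def serialize_to(obj, default_filename):
    filename = os.getenv("S2I_PIPELINE_STAGE_SAVE_FILE", default_filename)
    with open(filename, "wb") as f:  # close the file once the stage is written
        cp.dump(obj, f)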

If practitioners save pipeline stages with serialize_to, the filenames they specify are used during interactive work, and Nachlass builds override them by setting S2I_PIPELINE_STAGE_SAVE_FILE.
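
To see the override behavior without running a build, the sketch below calls a variant of the helper twice, once with the environment variable unset and once with it set. It is an illustration under assumptions: the standard-library pickle module stands in for the serializer, the helper returns the path it wrote (which the version above does not), and the file names and the placeholder stage object are hypothetical.

```python
import os
import pickle  # assumption: stands in for whatever serializer `cp` refers to
import tempfile


def serialize_to(obj, default_filename):
    # Same logic as the helper above, but returns the path that was written.
    filename = os.getenv("S2I_PIPELINE_STAGE_SAVE_FILE", default_filename)
    with open(filename, "wb") as f:
        pickle.dump(obj, f)
    return filename


workdir = tempfile.mkdtemp()
default_path = os.path.join(workdir, "stage.pkl")
build_path = os.path.join(workdir, "from-build.pkl")
stage = {"kind": "placeholder"}  # hypothetical stand-in for a fitted pipeline stage

# Interactive use: the variable is unset, so the notebook's default applies.
os.environ.pop("S2I_PIPELINE_STAGE_SAVE_FILE", None)
interactive_result = serialize_to(stage, default_path)

# Build-like use: the variable is set, so it overrides the notebook's default.
os.environ["S2I_PIPELINE_STAGE_SAVE_FILE"] = build_path
build_result = serialize_to(stage, default_path)
os.environ.pop("S2I_PIPELINE_STAGE_SAVE_FILE")

print(interactive_result == default_path, build_result == build_path)
```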

To build a model service, use s2i (or a source build strategy in OpenShift) and set S2I_SOURCE_NOTEBOOK_LIST to a comma-separated list of the notebooks corresponding to each pipeline stage. The general form of the command line is this:

s2i build -e S2I_SOURCE_NOTEBOOK_LIST=notebook1.ipynb,notebook2.ipynb https://github.com/your-username/your-repo quay.io/willbenton/nachlass-s2i:latest
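
When the notebook list is assembled by a script rather than typed by hand, a small helper can produce the value passed to -e. This is only a sketch: notebook_list_setting is a hypothetical name, the checks for an empty list, embedded commas, and the .ipynb suffix are assumptions about what a sensible value looks like rather than documented Nachlass rules, and the list is presumed to be in pipeline-stage order.

```python
def notebook_list_setting(notebooks):
    """Return the NAME=value string for s2i's -e flag from an ordered list."""
    if not notebooks:
        raise ValueError("at least one notebook is required")
    for name in notebooks:
        # A comma inside a name would be indistinguishable from a separator.
        if "," in name:
            raise ValueError(f"notebook name contains a comma: {name!r}")
        # Assumed sanity check: every entry should be a Jupyter notebook file.
        if not name.endswith(".ipynb"):
            raise ValueError(f"not a notebook file: {name!r}")
    return "S2I_SOURCE_NOTEBOOK_LIST=" + ",".join(notebooks)


setting = notebook_list_setting(["notebook1.ipynb", "notebook2.ipynb"])
print(setting)  # S2I_SOURCE_NOTEBOOK_LIST=notebook1.ipynb,notebook2.ipynb
```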

If your notebooks require access to nonpublic data, you will need to pass credentials and service endpoints to them through the build environment or through Kubernetes secrets. (Small, nonsensitive data sets may be stored in the git repo alongside models.)
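
Inside a notebook, one way to consume such settings is to read them from environment variables and fail early when any are missing. The sketch below assumes the settings arrive as environment variables; the names DATA_ENDPOINT, DATA_USER, and DATA_PASSWORD and the helper load_data_settings are hypothetical and not part of Nachlass.

```python
import os


def load_data_settings(env=os.environ):
    """Read a data endpoint and credentials from the environment.

    The variable names below are hypothetical placeholders.
    """
    required = ("DATA_ENDPOINT", "DATA_USER", "DATA_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail before any training code runs, and name what is missing.
        raise RuntimeError("missing settings: " + ", ".join(missing))
    return {name.lower(): env[name] for name in required}


# Passing a plain dict makes the helper easy to try outside a build.
example_env = {
    "DATA_ENDPOINT": "https://data.example.com",
    "DATA_USER": "reader",
    "DATA_PASSWORD": "not-a-real-secret",
}
settings = load_data_settings(example_env)
print(sorted(settings))  # ['data_endpoint', 'data_password', 'data_user']
```

Keeping the lookup in one function means the notebook body never hard-codes a secret, which also keeps credentials out of the git repository.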

Work in progress

Our work in progress includes:

  • a Tekton backend for builds,
  • a Kubernetes operator and custom resources to make it simpler to deploy pipelines, and
  • more flexible training and publishing options.

License: MIT License

