otto-de-legacy / crystal-gazer

Recommender POC

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Proof of concept (POC) of a recommender of time series categorical data. We use Tensorboard and nmslib for visualization and nearest neighbor queries, respectively.

The virtual conda environment can be setup via the included setup.sh script.

What is under the hood beside python standard libraries such as pandas, scipy, ...:

Pull-requests are welcome.

#license

Copyright 2018 Otto GmbH & Co Kg

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

training and testing

prior data preparation

Prepare data with integer numbers as in test/train_data_small which is used as input. Each line is one unser interaction history ordered w.r.t. time (see attached data), e.g.:

_,0
a,1
b,2
c,3
d,4
e,5
f,6
...

A map to string is needed (see attached data), e.g.:

15,8,34,15,76,77,78,38,76,79,40,76,8,50
15,15,92,15,92,15,92,15,92
15,35,38,37,39,36,40,41,35,15,35,37,39,36,38,40,41,35
29,30
68,68,15
15,6,15,6,3,4,5,15,6,15,6,15
... 

overview

Naming convention:

  • interaction: one single integer in the history given in the data
  • event: one event contains features and targets beside having an occurence count.

We use the data to generate events. The events are sparse one-hot encoded according to the integer given in the data file. Thus, make sure, the user interactions are mapped to unique sequential integer values! For the translation, we use a mapping, defined in the interaction_map file for easier interpretation.

profiling and evaluation

Some performance metrics are plotted with Tensorboard. Below, the three subsequent runs can be seen, each time adding new data and new cathegories up to 1 million. The metrics include TODO: explain metrics.

tensorboard_example

The (final) json file produced can be viewed via a chrome browser: chrome://tracing

json_profiling_example

rest-api falcon

The reco rest api for a single recommender (conditional) index is shown as example: tensorboard_example The metrics endpoint is shown in the profiling section below.

starting

We use gunicorn in the virtual environment (see environment.yml) to start the server. Start server (set timeout to let it create the index):

    gunicorn reco:application --timeout=1500

profiling

install:

    sudo apt install apache2-utils

profile via, e.g.:

    ab -c 10 -n 100000 http://127.0.0.1:8000/recos?url=aa

Profiling output of the rest-api calls on an i7 local machine for ca. 600 thousand items with a vector size of 100:

Percentage of the requests served within a certain time (ms)
  50%      7
  66%      7
  75%      7
  80%      7
  90%      7
  95%      8
  98%      8
  99%      8
 100%     26 (longest request)

The internal nmslib call was (shown in the metrics endpoint): tensorboard_example

Carl says that's ok.

About

Recommender POC

License:Apache License 2.0


Languages

Language:Python 99.9%Language:Shell 0.1%