abevoelker / forgetsy

A trending library

Forgetsy

Forgetsy is a scalable trending library designed to track temporal trends in non-stationary categorical distributions. It uses forget-table style data structures which decay observations over time. Using a ratio of two such sets decaying over different lifetimes, it picks up on changes to recent dynamics in your observations, whilst forgetting historical data responsibly. The technique is closely related to exponential moving average (EMA) ratios used for detecting trends in financial data.

Trends are encapsulated by a construct named Delta. A Delta consists of two sets of counters, each of which implements exponential time decay of the form:

$$X_{t_1} = X_{t_0} \, e^{-\lambda (t_1 - t_0)}$$

Here λ (lambda) is the decay rate, and its inverse is the mean lifetime of an observation in the set. Normalising such a set by a second set with half the decay rate (double the mean lifetime) yields a trending score for each category in the distribution. This score expresses the change in the rate of observations of a category over the lifetime of the set, as a proportion in the range 0..1.
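
To make the decay and the normalisation concrete, here is a minimal sketch of the arithmetic for a single observation. The helper `decayed` and the constants are illustrative names, not part of Forgetsy's API. The sketch assumes the trending score is simply the faster-decaying count divided by the slower-decaying one, as described above.

```ruby
# Illustrative arithmetic only; `decayed` is a hypothetical helper, not Forgetsy API.

DAY  = 24 * 60 * 60
WEEK = 7 * DAY

# Exponential decay: count * e^(-lambda * elapsed), with lambda = 1 / mean_lifetime.
def decayed(count, elapsed, mean_lifetime)
  count * Math.exp(-elapsed.to_f / mean_lifetime)
end

elapsed = 3 * DAY                        # one observation made three days ago
fast = decayed(1.0, elapsed, WEEK)       # set with mean lifetime t
slow = decayed(1.0, elapsed, 2 * WEEK)   # set with half the decay rate (lifetime 2t)

score = fast / slow                      # in (0, 1] because fast <= slow
```

Provided both sets receive the same increments, the faster set never holds more than the slower one, so the ratio stays within 0..1. For a lone observation it reduces to e^(-elapsed / 2t).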

Forgetsy removes the need for manually sliding time windows or explicitly maintaining rolling counts, as observations naturally decay away over time. It's designed for heavy writes and sparse reads, as it implements decay at read time.

Each set is implemented as a Redis sorted set, and keys are scrubbed once their count has decayed to near zero, which keeps storage compact.
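
The read-time decay and key scrubbing described above can be pictured with a small in-memory stand-in. This is a sketch under stated assumptions, not Forgetsy's implementation: `LazyDecaySet`, its `epsilon` threshold, and the plain Hash used in place of a Redis sorted set are all hypothetical.

```ruby
# Hypothetical in-memory stand-in for one decaying set (a Hash instead of Redis).
class LazyDecaySet
  def initialize(mean_lifetime, now: Time.now)
    @tau = mean_lifetime.to_f        # mean lifetime in seconds (1 / lambda)
    @counts = Hash.new(0.0)
    @last_decayed_at = now
  end

  # Writes are a plain increment; no decay work happens here.
  def incr(bin, by = 1)
    @counts[bin] += by
  end

  # Reads decay every count by the time elapsed since the last decay,
  # then scrub counts that have fallen below epsilon.
  def fetch(now: Time.now, epsilon: 0.001)
    factor = Math.exp(-(now - @last_decayed_at) / @tau)
    @counts.transform_values! { |count| count * factor }
    @counts.delete_if { |_bin, count| count < epsilon }
    @last_decayed_at = now
    @counts.dup
  end
end

t0  = Time.at(0)
set = LazyDecaySet.new(60, now: t0)   # mean lifetime of 60 seconds
set.incr("a", 10)
set.incr("b")

after_one = set.fetch(now: t0 + 60)   # one mean lifetime later
later     = set.fetch(now: t0 + 600)  # long enough for both keys to be scrubbed
```

Keeping writes to a single increment and deferring the multiplication to `fetch` is what makes this pattern suit heavy writes and sparse reads.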

Forgetsy handles distributions with up to around 10^6 active categories, receiving hundreds of writes per second, without much fuss. Its scalability depends on your Redis deployment.

By default, it expects Redis to be running on localhost at the default port (6379).

Installation

Add to Gemfile:

gem "forgetsy", git: "https://github.com/abevoelker/forgetsy.git"

Setup

Forgetsy will default to using Redis.current for Redis commands, but can be configured to use a specific Redis client:

Forgetsy.redis = Redis.new(host: "10.0.1.1", port: 6380, db: 15)

or a hash of options that will be passed to Redis.new:

Forgetsy.redis = { host: "10.0.1.1", port: 6380, db: 15 }

The hash options also support a special namespace key which will namespace all Forgetsy keys under a prefix using the redis-namespace gem. For example:

Forgetsy.redis = { host: "localhost", port: 6379, db: 1, namespace: "forgetsy" }

which is equivalent to:

require "redis-namespace"
client = Redis.new(host: "localhost", port: 6379, db: 1)
Forgetsy.redis = Redis::Namespace.new(:forgetsy, redis: client)

Note that if you use the namespace key, you are responsible for adding the redis-namespace gem to your project's Gemfile; Forgetsy doesn't list it as a dependency.

Usage

Take, for example, a social network in which users can follow each other. You want to track trending users. You construct a one-week delta to capture trends in your follows data over one-week periods:

follows_delta = Forgetsy::Delta.create('user_follows', t: 1.week, replay: true)

The delta consists of two sets of counters indexed by category identifiers. In this example, the identifiers will be user ids. One set decays over the mean lifetime specified by t, and another set decays over double the lifetime.

You can now add observations to the delta, in the form of follow events. Each time a user follows another, you increment the followed user id. We can also do this retrospectively, since we have passed the replay option to the factory method above:

follows_delta = Forgetsy::Delta.fetch('user_follows')
follows_delta.incr('UserFoo', date: 2.weeks.ago)
follows_delta.incr('UserBar', date: 10.days.ago)
follows_delta.incr('UserBar', date: 1.week.ago)
follows_delta.incr('UserFoo', date: 1.day.ago)
follows_delta.incr('UserFoo')

Providing an explicit date is useful if you are processing data asynchronously. You can also use incr_by to increment a counter in batches.

You can now consult your follows delta to find your top trending users:

puts follows_delta.fetch

Will print:

{ 'UserFoo' => 0.667, 'UserBar' => 0.500 }

Each user is given a dimensionless score in the range 0..1 corresponding to the normalised follows delta over the time period. It expresses the follows the user gained over the last week as a proportion of those gained over double that lifetime.

Optionally fetch the top n users, or an individual user's trending score:

follows_delta.fetch(n: 20)
follows_delta.fetch(bin: 'UserFoo')
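
As a rough picture of what the normalisation and the top-n lookup involve, the sketch below divides one set of counts by another and ranks the result. The helper `trending_scores` and the decayed counts fed into it are made up for illustration; they are not Forgetsy's internals or real data.

```ruby
# Hypothetical helper: divide each fast-set count by the matching slow-set count,
# then rank categories by score, highest first.
def trending_scores(fast_counts, slow_counts, n: nil)
  scores = fast_counts.each_with_object({}) do |(bin, count), out|
    norm = slow_counts[bin]
    out[bin] = (norm.nil? || norm.zero?) ? 0.0 : count / norm.to_f
  end
  ranked = scores.sort_by { |_bin, score| -score }
  ranked = ranked.first(n) if n
  ranked.to_h
end

# Made-up decayed counts for the two sets of a delta.
fast = { "UserBar" => 1.0, "UserFoo" => 2.0 }
slow = { "UserBar" => 2.0, "UserFoo" => 3.0 }

all_scores = trending_scores(fast, slow)        # every category, best first
top_one    = trending_scores(fast, slow, n: 1)  # only the top category
```

A category missing from the slower set scores 0.0 here, which avoids a division by zero; that fallback is a choice of this sketch.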

Contributing

Just fork the repo and submit a pull request.

Copyright & License

MIT license. See LICENSE for details.

(c) 2013 Art.sy Inc.
