amallia / taily

Implementation of the Taily algorithm as described by Aly et al. in the 2013 paper "Taily: shard selection using the tail of score distributions."

This library implements the Taily algorithm as described by Aly et al. in the 2013 paper "Taily: shard selection using the tail of score distributions."

Disclaimer

At this early stage of development, the library interface is subject to change. If you rely on it now, I advise pinning a specific git tag.

Installation

taily is a header-only library. For now, copy the include/taily.hpp file into your project and include it.

CMake and Conan support to come.

Dependencies

The library compiles with GCC >= 4.9 and Clang >= 4, and it requires C++14. The only other dependency is the Boost.Math library, which is used for the Gamma distribution.

Usage

Chances are you will only need to call one function that scores all shards with respect to one query:

std::vector<double> score_shards(
    const CollectionStatistics& global_stats,
    const std::vector<CollectionStatistics>& shard_stats,
    const int ntop)

global_stats contains statistics for the entire index, the shard_stats vector holds the statistics of the individual shards, and ntop is Taily's parameter: the number of top results for which a score threshold will be estimated.

CollectionStatistics is a simple structure that contains the collection size and a vector of FeatureStatistics whose length equals the number of query terms.

struct CollectionStatistics {
    std::vector<FeatureStatistics> term_stats;
    int size;
};

Each element of term_stats contains the values needed for computations:

struct FeatureStatistics {
    double expected_value;
    double variance;
    int frequency;

    template<typename FeatureRange>
    static FeatureStatistics from_features(const FeatureRange& features);

    template<typename ForwardIterator>
    static FeatureStatistics from_features(ForwardIterator first, ForwardIterator last);
};
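To make the expected shape of the inputs concrete, here is a minimal sketch that assembles statistics for a two-term query over two shards. The two structs are stand-ins that only mirror the fields listed above, so the snippet compiles without taily.hpp. QueryInputs, make_example_inputs() and inputs_are_consistent() are hypothetical helpers that are not part of the library, and all numbers are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins mirroring the fields shown above, so the sketch compiles
// without taily.hpp. In real code, include the header instead.
struct FeatureStatistics {
    double expected_value;
    double variance;
    int frequency;
};

struct CollectionStatistics {
    std::vector<FeatureStatistics> term_stats;
    int size;
};

// Hypothetical helper (not part of taily): true when the global statistics
// and every shard carry exactly one FeatureStatistics entry per query term.
bool inputs_are_consistent(const CollectionStatistics& global_stats,
                           const std::vector<CollectionStatistics>& shard_stats,
                           std::size_t query_length) {
    if (global_stats.term_stats.size() != query_length) return false;
    for (const auto& shard : shard_stats) {
        if (shard.term_stats.size() != query_length) return false;
    }
    return true;
}

// Hypothetical bundle of the first two arguments of score_shards().
struct QueryInputs {
    CollectionStatistics global_stats;
    std::vector<CollectionStatistics> shard_stats;
};

// Builds illustrative inputs for a two-term query over two shards.
// Entry i of every term_stats vector refers to the i-th query term.
QueryInputs make_example_inputs() {
    QueryInputs in;
    in.global_stats = CollectionStatistics{{{1.5, 0.4, 30}, {2.0, 0.9, 12}}, 100};
    in.shard_stats = {
        CollectionStatistics{{{1.6, 0.3, 20}, {2.1, 0.8, 4}}, 60},
        CollectionStatistics{{{1.3, 0.5, 10}, {1.9, 1.0, 8}}, 40},
    };
    assert(inputs_are_consistent(in.global_stats, in.shard_stats, 2));
    return in;
}
```

With the real header included, the result of make_example_inputs() could then be passed along as score_shards(in.global_stats, in.shard_stats, ntop), which presumably returns one score per shard, in the order of shard_stats.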

Generating and Writing Features

If you also want to use this library to produce the stored statistics, you can use the from_features() helper functions to compute them from raw feature values:

const std::vector<double>& features = fetch_or_generate_features(term);
auto stats = FeatureStatistics::from_features(features);

or

double* features = fetch_or_generate_features(term);
auto stats = FeatureStatistics::from_features(features, features + len);

The first overload takes any forward range, such as std::vector or std::array, for which std::begin() and std::end() return forward iterators over doubles. The second takes a pair of such iterators.
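For readers who want to see how the two overloads relate, the following is a rough sketch of what such helpers could compute. It is not the library's actual implementation. It assumes that expected_value is the arithmetic mean, that variance is the population variance (divided by n), that frequency is the number of feature values, and that at least one value is given. The sketch namespace and the stand-in struct are there only to keep the snippet self-contained.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

namespace sketch {

// Stand-in mirroring the data members of FeatureStatistics shown above.
struct FeatureStatistics {
    double expected_value;
    double variance;
    int frequency;
};

// Iterator-pair form: one pass for the mean, a second pass for the variance.
// Assumptions: population variance, frequency = number of feature values.
template <typename ForwardIterator>
FeatureStatistics from_features(ForwardIterator first, ForwardIterator last) {
    const auto count = std::distance(first, last);
    assert(count > 0 && "this sketch requires at least one feature value");
    const double n = static_cast<double>(count);

    double sum = 0.0;
    for (auto it = first; it != last; ++it) {
        sum += *it;
    }
    const double mean = sum / n;

    double squared_diffs = 0.0;
    for (auto it = first; it != last; ++it) {
        const double diff = *it - mean;
        squared_diffs += diff * diff;
    }
    return FeatureStatistics{mean, squared_diffs / n, static_cast<int>(count)};
}

// Range form: forwards to the iterator-pair form via std::begin/std::end,
// so it accepts std::vector, std::array, and built-in arrays alike.
template <typename FeatureRange>
FeatureStatistics from_features(const FeatureRange& features) {
    return sketch::from_features(std::begin(features), std::end(features));
}

}  // namespace sketch
```

Because the range form only forwards to the iterator-pair form, both calls produce the same statistics for the same values. Two passes are used here for readability; a single-pass formulation would work as well.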

License: MIT License


Languages: C++ 71.9%, CMake 20.2%, Python 7.9%