siboehm / lleaves

Compiler for LightGBM gradient-boosted trees, based on LLVM. Speeds up prediction by ≥10x.

Home Page: https://lleaves.readthedocs.io/en/latest/

Benchmarking

skaae opened this issue · comments

I tried benchmarking lleaves vs treelite and found that lleaves is slightly slower than treelite.
I might be doing something wrong?

I benchmark with Google Benchmark at batch size 1 using random features. I have ~600 trees with 450 leaves and max depth 13.
Treelite is compiled with Clang 10.0; I think we did see that treelite was a lot slower when compiled with GCC.

I noticed that the compile step for lleaves took several hours, so maybe the forest I'm using is somehow off?

In any case I think your library looks very nice :)

Xeon E-2278G

------------------------------------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
BM_LLEAVES       32488 ns        32487 ns        21564
BM_TREELITE      27251 ns        27250 ns        25635

EPYC 7402P

------------------------------------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
BM_LLEAVES       38020 ns        38019 ns        18308
BM_TREELITE      32155 ns        32154 ns        21579
#include <benchmark/benchmark.h>
#include <iostream>
#include <random>
#include <vector>

#include "lleavesheader.h"
#include "treeliteheader.h"

constexpr int NUM_FEATURES = 108;
constexpr std::size_t N = 10000000;                 // total feature values in the buffer
constexpr std::size_t NUM_ROWS = N / NUM_FEATURES;  // complete rows available

static void BM_LLEAVES(benchmark::State& state)
{
    std::random_device dev;
    std::mt19937 rng(dev());
    // The distribution type must be signed; instantiating
    // uniform_int_distribution with an unsigned result type and the
    // range (-10, 10) is undefined behavior.
    std::uniform_int_distribution<int> dist(-10, 10);

    std::vector<double> f;
    f.reserve(N);
    for (std::size_t i = 0; i < N; ++i) {
        f.push_back(dist(rng));
    }

    double out;
    std::size_t i = 0;
    for (auto _ : state) {
        forest_root(f.data() + NUM_FEATURES * i, &out, (int)0, (int)1);
        benchmark::DoNotOptimize(out);  // keep the result alive
        i = (i + 1) % NUM_ROWS;         // wrap instead of running off the buffer
    }
}

static void BM_TREELITE(benchmark::State& state)
{
    std::random_device dev;
    std::mt19937 rng(dev());
    std::uniform_int_distribution<int> dist(-10, 10);

    std::vector<DE::Entry> f;
    f.reserve(N);
    for (std::size_t i = 0; i < N; ++i) {
        auto e = DE::Entry();
        // Entry is a union: writing qvalue after fvalue would clobber it,
        // and missing == -1 marks the feature as missing. Set fvalue only.
        e.fvalue = dist(rng);
        f.push_back(e);
    }

    std::size_t i = 0;
    for (auto _ : state) {
        union DE::Entry* pFeatures = f.data() + NUM_FEATURES * i;
        benchmark::DoNotOptimize(predict(pFeatures, 1));  // treelite predict function
        i = (i + 1) % NUM_ROWS;
    }
}

BENCHMARK(BM_LLEAVES);
BENCHMARK(BM_TREELITE);
BENCHMARK_MAIN();

Thank you for reporting your results! My single-batch benchmarking setup (c_bench) is pretty poor (as you probably noticed), so I should really improve this. Could you send me your full benchmarking code / a PR for this?

Regarding the compile time: Several hours is rather curious. How long did treelite take? I don't think I've ever seen any model take longer than 30min, and those were larger, 1000-tree models. Can you send me the model.txt (email is fine)? Otherwise it'll be hard for me to debug.

Regarding the runtime: I'm not super surprised by this; the gap between treelite and lleaves for numerical models and single-batch prediction has always been small. I'm currently working on making lleaves more configurable by adding compiler flags: https://github.com/siboehm/lleaves/tree/compile_flags. You could install from that branch and try using the small codemodel and disabling cache blocking. Be aware that without cache blocking, compilation will take significantly longer (unless you also disable function inlining).

Regarding your benchmark: I'm not a fan of benchmarking using random data. LightGBM generates very imbalanced trees so out-of-distribution datasets are likely to hit only short paths in the tree which would distort results. If I were you I'd benchmark on the shuffled training dataset. Also I'd argue the data matrix generation should go inside the loop for treelite, as this is really extra overhead, but maybe I'm just sour ;)

Thank you for reporting your results! My single-batch benchmarking setup (c_bench) is pretty poor (as you probably noticed), so I should really improve this. Could you send me your full benchmarking code / a PR for this?

Sure! Do you want the Google Benchmark code and the CMake changes? For lleaves I linked against the .o file, and for treelite we exported the C code and compiled it with clang++.
Do you have something similar in mind?

Regarding your benchmark: I'm not a fan of benchmarking using random data.

True, I will try to rerun it with real data.

Yes, the CMake and full benchmark code would be sweet!

Closing since I cannot debug this issue without further details. lleaves now has proper benchmarking implemented using Google benchmark on the minibenchmarks branch.

Feel free to reopen with more info.