libbitcoin / libbitcoin-system

Bitcoin Cross-Platform C++ Development Toolkit

Home Page: https://libbitcoin.info/


wallet::mnemonic: expected performance question.

alonargel opened this issue

I wrote the example below and got ~600 iterations per second. That seems very slow for C++, since similar examples in Python run 5x or more faster. Why is it so slow, and is it possible to speed it up?

#include <bitcoin/system.hpp>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <string>

// bc::system::main is libbitcoin's utf8-aware entry point; the hosting
// translation unit declares it with the BC_USE_LIBBITCOIN_MAIN macro.
BC_USE_LIBBITCOIN_MAIN

int bc::system::main(int, char* [])
{
    using namespace bc;
    using namespace bc::system;
    using namespace bc::system::wallet;

    int iterations = 0;
    auto start_time = std::chrono::steady_clock::now();

    while (true) {
        const std::string mnemonic_phrase = "bunker churn kangaroo melt bleak chalk vacant alert reason exit forward language";

        const mnemonic recovery_seed(mnemonic_phrase);
        const auto context = ctx::btc::main::p2pkh;

        // Account private key derivation.
        const auto passphrase = "";
        const auto account_private_key = recovery_seed.to_key(passphrase, context);
        const auto& m = account_private_key;

        iterations += 1;

        const auto end_time = std::chrono::steady_clock::now();
        const auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();

        // Report the rate once per second, then reset the counter.
        if (duration >= 1000) {
            const double iterations_per_second = iterations / (duration / 1000.0);
            std::cout << "Iterations per second: " << std::fixed << std::setprecision(0) << iterations_per_second << std::endl;
            iterations = 0;
            start_time = end_time;
        }
    }
}

A decent compiler with optimizations enabled should eliminate everything except the timing calls: since m is unused, everything above it is dead code to the compiler. So it should be as fast as the clock. And even if m were used, the loop computes the same value each time, so a decent compiler could reduce it to a single actual iteration. Getting good performance timing results can be tricky.
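One way to defeat this without adding measurable overhead is an empty inline-asm sink that forces the compiler to treat the value as live, the same trick benchmark harnesses such as Google Benchmark use. A minimal sketch, assuming GCC or Clang (MSVC needs a different mechanism):

// Optimization barrier: convinces the compiler the value escapes, so the
// computation that produced it cannot be eliminated as dead code.
template <typename Type>
inline void do_not_optimize(const Type& value)
{
    __asm__ __volatile__("" : : "g"(&value) : "memory");
}

// Inside the loop above:
//   do_not_optimize(m);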

I reduced some of the counting overhead and passed the language to the constructor, both of which sped it up a bit. I also verified that my compiler optimizations are enabled and that the call is not being optimized out.

#include <bitcoin/system.hpp>
#include <chrono>
#include <iomanip>
#include <iostream>

BC_USE_LIBBITCOIN_MAIN

int bc::system::main(int, char* [])
{
    using namespace bc;
    using namespace bc::system;
    using namespace bc::system::wallet;
    using namespace std::chrono;

    constexpr auto iterations = 10'000;
    const auto passphrase = "";
    const auto mnemonic_phrase = "bunker churn kangaroo melt bleak chalk vacant alert reason exit forward language";
    const auto context = ctx::btc::main::p2pkh;
    const auto start_time = steady_clock::now();

    for (auto i = zero; i < iterations; ++i)
    {
        // Specifying the language avoids dictionary detection on each parse.
        const auto m = mnemonic{ mnemonic_phrase, language::en }.to_key(passphrase, context);
    }

    const auto end_time = steady_clock::now();
    const auto duration = duration_cast<seconds>(end_time - start_time).count();
    std::cout << "Iterations per second: " << std::fixed << std::setprecision(0)
        << static_cast<double>(iterations) / duration << std::endl;
    return 0;
}

On my Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz it reports 476 iterations per second.

[Profiler screenshots: heat map and module view.]

The heat map and module view above show that 95.41% of the time is spent in to_key/seeder/pbkd/accumulator/sha512. That is expected: BIP39 seed derivation is PBKDF2-HMAC-SHA512 with 2048 iterations, so each to_key call performs on the order of 4096 SHA-512 compressions.

The implementation takes advantage of most hashing CPU optimizations, but not presently sha-ni or its platform equivalents. If your machine supports sha intrinsics and the Python implementation takes advantage of them, that could account for the difference.
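For what it's worth, on x86 a quick runtime check for the SHA extensions is the compiler's CPU-feature builtin. A minimal sketch, assuming GCC or Clang (on MSVC you would query __cpuid instead):

#include <iostream>

int main()
{
    // "sha" reports the x86 SHA-NI extensions via the GCC/Clang builtin.
    std::cout << "sha-ni supported: "
        << (__builtin_cpu_supports("sha") ? "yes" : "no") << std::endl;
    return 0;
}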

However, it may also be the case that the Python libraries cache recent hash results. Given that you are regenerating the same key each time, this would eliminate most of the hashing cost, but it would not be useful behavior in a real scenario.

In any case, this isn't a language issue, and it's not likely an implementation issue (apart from the missing optimization). I've performance tested our hashing libraries extensively against the satoshi client and they are marginally to materially better in most scenarios (except, of course, sha-ni).

I would suggest writing test cases that accept a numeric seed, iterate over that incrementing seed to generate passphrases, parse each passphrase, and perform some sort of accumulation on the key so that it cannot be optimized out, with the timing outside the loop (see the sketch below). Running this on the same machine in both scenarios would give you a better comparison. But I would assume that the Python hash libs do incorporate sha intrinsics, either through openssl or independently.
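A minimal sketch of that shape follows. It assumes the v4 wallet API: that mnemonic is constructible from entropy bytes, that sentence() returns the phrase, and that hd_private exposes secret(); verify those against your headers before relying on it.

#include <bitcoin/system.hpp>
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>

BC_USE_LIBBITCOIN_MAIN

// Sketch only: derives a fresh key per iteration from an incrementing seed
// and accumulates a byte of each key so the loop cannot be optimized out.
int bc::system::main(int, char* [])
{
    using namespace bc::system;
    using namespace bc::system::wallet;
    using namespace std::chrono;

    constexpr auto iterations = 10'000;
    const auto context = ctx::btc::main::p2pkh;

    uint8_t checksum = 0;
    data_chunk entropy(16, 0x00); // 128 bits of entropy -> 12-word mnemonic.

    const auto start_time = steady_clock::now();

    for (uint64_t i = 0; i < iterations; ++i)
    {
        // Write the incrementing seed into the entropy.
        std::copy_n(reinterpret_cast<const uint8_t*>(&i), sizeof(i), entropy.begin());

        // Generate a phrase, re-parse it, and derive the key (assumed API).
        const mnemonic generated(entropy, language::en);
        const mnemonic parsed(generated.sentence(), language::en);
        const auto m = parsed.to_key("", context);

        // Accumulate so the result is observably used.
        checksum ^= m.secret().front();
    }

    const auto end_time = steady_clock::now();
    const auto ms = duration_cast<milliseconds>(end_time - start_time).count();
    std::cout << "Iterations per second: " << (1000.0 * iterations) / ms
        << " (checksum " << static_cast<int>(checksum) << ")" << std::endl;
    return 0;
}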

A 5-6x performance difference in the processing of sha512 is expected.

Typically an interpreted language like Python would suffer, but in this case it doesn't show because nearly all of the work is being performed in the hashing library, overwhelming the other overhead.


Ok, thanks for the explanation.