[proposal] Faster algorithm for computing the Mertens function

Question

trizen opened this issue 5 years ago · comments

Hi. I would like to suggest a faster algorithm for computing the Mertens function, by trading memory for speed.

A simple implementation in Perl would be:

use 5.020;
use strict;
use warnings;

use experimental qw(signatures);
use ntheory qw(moebius sqrtint rootint mertens);

sub mertens_fast($n) {

    my $lookup_size = 2 * rootint($n, 3)**2;

    my @moebius_lookup = moebius(0, $lookup_size);
    my @mertens_lookup = (0);

    foreach my $i (1 .. $lookup_size) {
        $mertens_lookup[$i] = $mertens_lookup[$i - 1] + $moebius_lookup[$i];
    }

    my %seen;

    sub ($n) {

        if ($n <= $lookup_size) {
            return $mertens_lookup[$n];
        }

        if (exists $seen{$n}) {
            return $seen{$n};
        }

        my $s = sqrtint($n);
        my $M = 1;

        foreach my $k (2 .. int($n/($s+1))) {
            $M -= __SUB__->(int($n/$k));
        }

        foreach my $k (1 .. $s) {
            $M -= $mertens_lookup[$k] * (int($n/$k) - int($n/($k+1)));
        }

        $seen{$n} = $M;

    }->($n);
}

#foreach my $n (1 .. 9) {    # takes ~1.6 seconds
#    say "M(10^$n) = ", mertens_fast(10**$n);
#}

say mertens_fast(10**9);     # takes ~1.3 seconds
say mertens(10**9);          # takes ~5.2 seconds

Computing M(10^10) = -33722 with the above implementation takes ~6 seconds and about 727 MB of RAM.

Computing M(10^11) = -87856 takes ~29 seconds and about 3.2 GB of RAM.

If implemented in C, the memory usage may be considerably lower and the speed will be better as well.
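
For readers who do not use Perl, here is a minimal Python sketch of the same idea. It relies on the standard identity sum_{k=1..n} M(floor(n/k)) = 1, which gives M(n) = 1 - sum_{k=2..n} M(floor(n/k)); small arguments come from a sieved table and large ones are memoized. The helper names `icbrt` and `mertens_table` are illustrative assumptions, not part of ntheory, and the sieve is a plain linear sieve rather than whatever `moebius()` does internally.

```python
from math import isqrt

def icbrt(n):
    """Integer cube root: largest r with r**3 <= n."""
    r = round(n ** (1.0 / 3.0))
    while r ** 3 > n:
        r -= 1
    while (r + 1) ** 3 <= n:
        r += 1
    return r

def mertens_table(limit):
    """Return [M(0), ..., M(limit)] using a linear sieve for the Moebius function."""
    mu = [1] * (limit + 1)
    composite = [False] * (limit + 1)
    primes = []
    for i in range(2, limit + 1):
        if not composite[i]:
            primes.append(i)
            mu[i] = -1
        for p in primes:
            if i * p > limit:
                break
            composite[i * p] = True
            if i % p == 0:
                mu[i * p] = 0          # p^2 divides i*p
                break
            mu[i * p] = -mu[i]
    table = [0] * (limit + 1)
    for i in range(1, limit + 1):
        table[i] = table[i - 1] + mu[i]
    return table

def mertens_fast(n):
    """M(n) via the recurrence, with a precomputed table of about 2*n^(2/3) values."""
    if n < 1:
        return 0
    lookup_size = 2 * icbrt(n) ** 2
    table = mertens_table(lookup_size)
    seen = {}

    def M(x):
        if x <= lookup_size:
            return table[x]
        if x in seen:
            return seen[x]
        s = isqrt(x)
        total = 1
        # Large quotients floor(x/k): recurse (or hit the table).
        for k in range(2, x // (s + 1) + 1):
            total -= M(x // k)
        # Small quotients v <= s: each occurs floor(x/v) - floor(x/(v+1)) times.
        for v in range(1, s + 1):
            total -= table[v] * (x // v - x // (v + 1))
        seen[x] = total
        return total

    return M(n)

print(mertens_fast(10**6))  # 212
```

The structure mirrors the Perl code one to one, so it shares the same memory trade-off: the table dominates, while the memo dictionary stays small.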

Dana Jacobsen · Answer 1 · Mon Jun 01 2020 18:25:28 GMT+0800 (China Standard Time)

I just implemented a version in C. It uses a very simple hash of about 8*n^(1/3) values, which seemed to be the point of diminishing returns. That is trivial memory use even at 16 bytes per entry. The resulting function also seems to scale as ~ n^(2/3), versus ~ n for the original (as its comments note).
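
A rough way to see why a hash of that size is enough: the recursion only ever visits arguments of the form floor(n/k), and only those above the precomputed table need to be cached. The Python sketch below counts them for n = 10^10. It assumes the table size 2*rootint(n,3)^2 from the Perl code in the question (the C version's table size is not stated here), and `icbrt` and `large_arguments` are illustrative names.

```python
def icbrt(n):
    """Integer cube root: largest r with r**3 <= n."""
    r = round(n ** (1.0 / 3.0))
    while r ** 3 > n:
        r -= 1
    while (r + 1) ** 3 <= n:
        r += 1
    return r

def large_arguments(n, lookup_size):
    """All values floor(n/k) that exceed lookup_size (these need a cache slot)."""
    args = []
    k = 1
    while n // k > lookup_size:   # quotients are non-increasing in k
        args.append(n // k)
        k += 1
    return args

n = 10**10
lookup_size = 2 * icbrt(n) ** 2   # assumption: same table size as the Perl code
args = large_arguments(n, lookup_size)
slots = 8 * icbrt(n)              # hash size quoted in this answer

print("arguments to cache:", len(args))
print("hash slots:        ", slots)
print("hash bytes @16/slot:", slots * 16)
```

Under these assumptions the number of cached arguments is on the order of n^(1/3), about a thousand for n = 10^10, so the hash is small next to the base Mertens table.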

It's much faster than the original code. Timings on my laptop:

For 10^10:
  41.0s  Original C
  11.9s  Your Perl code
   0.22s New C

For 10^11:
 405.2s  Original C
  58.4s  Your Perl code
   1.11s New C
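
As a quick sanity check of the scaling claim, the exponent e in time ~ n^e can be estimated from the two timings per implementation, since n grows by a factor of 10 between them. This Python snippet uses only the numbers in the table above; with two data points each it is a rough indication, not a measurement of asymptotic behaviour.

```python
from math import log10

# Seconds at n = 10^10 and n = 10^11, copied from the table above.
timings = {
    "Original C": (41.0, 405.2),
    "Perl": (11.9, 58.4),
    "New C": (0.22, 1.11),
}

def scaling_exponent(t_small, t_large, n_ratio=10.0):
    """Estimate e in time ~ n^e from two timings whose inputs differ by n_ratio."""
    return log10(t_large / t_small) / log10(n_ratio)

exponents = {name: scaling_exponent(*t) for name, t in timings.items()}
for name, e in exponents.items():
    print(f"{name}: ~ n^{e:.2f}")
```

The estimates come out near 1 for the original C code and near 0.7 for both the Perl code and the new C code, consistent with ~ n versus ~ n^(2/3).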

Dana Jacobsen · Answer 2 · Mon Jun 01 2020 19:40:32 GMT+0800 (China Standard Time)

I left a TODO item about speeding up the Pure Perl version. A couple of thoughts there:

As pure Perl, your code is a huge improvement in speed. But I am a bit worried about the memory use. There might be some way to mitigate it.
The case of a 32-bit Perl is probably important. There we have fast 32-bit C functions, but the caller wants the result for a larger input. There might be ways to exploit this a bit.

While the memory use of the C version seems to be OK, it still grows faster than it did before, so perhaps it could be tuned some more.

Daniel Șuteu · Answer 3 · Mon Jun 01 2020 20:47:54 GMT+0800 (China Standard Time)

Thank you very much for implementing this in C. The performance is really impressive!

Regarding the PP version, one solution would be to use this algorithm for numbers <= B, for some choice of B (say B = 10^9 or B = 10^10), and use the current PP implementation for numbers > B, to prevent extreme memory usage.
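
One way to pick B is from a memory budget rather than a fixed constant. The Python sketch below estimates the table memory from the figures reported in the question (about 727 MB for a table of 2*rootint(10^10,3)^2 entries, so roughly 78 bytes per table index in that Perl implementation) and dispatches accordingly. The constant, the function names, and the "lookup"/"current" labels are all illustrative assumptions; "current" stands in for the existing PP implementation, which is not shown here.

```python
def icbrt(n):
    """Integer cube root: largest r with r**3 <= n."""
    r = round(n ** (1.0 / 3.0))
    while r ** 3 > n:
        r -= 1
    while (r + 1) ** 3 <= n:
        r += 1
    return r

# Assumption: ~727 MB / 9,279,432 table entries at n = 10^10 (from the question).
BYTES_PER_ENTRY = 78

def lookup_entries(n):
    """Table size used by the proposed algorithm: 2 * floor(cbrt(n))^2."""
    return 2 * icbrt(n) ** 2

def estimated_bytes(n):
    """Rough memory estimate for the lookup tables at input n."""
    return lookup_entries(n) * BYTES_PER_ENTRY

def pick_algorithm(n, budget_bytes):
    """Use the lookup-based algorithm only while its tables fit the budget."""
    return "lookup" if estimated_bytes(n) <= budget_bytes else "current"

for e in (9, 10, 11):
    n = 10**e
    print(f"10^{e}: ~{estimated_bytes(n) / 1e6:.0f} MB ->",
          pick_algorithm(n, budget_bytes=10**9))
```

With a 1 GB budget this selects the lookup-based algorithm for 10^9 and 10^10 and falls back for 10^11, which matches the range of B suggested above.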

Dana Jacobsen · Answer 4 · Mon Jun 01 2020 22:54:13 GMT+0800 (China Standard Time)

Following up on the last reply, I added Perl code. It scales the same but uses about 8x less memory: about 34 MB for 10^9, 133 MB for 10^10, and 534 MB for 10^11 (75 seconds). I think the main slowdown is the references rather than the use of an inner sub; even some inner loop refactoring couldn't make up for it. It goes a bit faster when using double the memory (64 s) and slower for me after that, but I suspect this laptop has slow memory access.

The C code uses only 2 bytes per entry in the base Mertens array, so that helps immensely. But the current code, even after adjusting a bit, can still go up into the GBs for things like 2^47 or higher.
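
To illustrate the compact-storage idea outside C, here is a Python sketch that keeps the Moebius values in 1-byte cells and the Mertens prefix sums in 16-bit cells using the standard `array` module. It assumes every stored M(i) fits in a signed 16-bit integer; if one does not, the assignment raises `OverflowError` instead of silently wrapping. The function name is illustrative, and this is not the actual C code.

```python
from array import array

def mertens_table_int16(limit):
    """Return M(0..limit) as an array of signed 16-bit integers."""
    mu = array("b", [1]) * (limit + 1)          # 1 byte per Moebius value
    is_prime = bytearray([1]) * (limit + 1)
    for p in range(2, limit + 1):
        if is_prime[p]:
            for m in range(p, limit + 1, p):
                if m > p:
                    is_prime[m] = 0
                mu[m] = -mu[m]                  # one more prime factor
            for m in range(p * p, limit + 1, p * p):
                mu[m] = 0                       # not squarefree
    table = array("h", [0]) * (limit + 1)       # 'h' = signed short
    running = 0
    for i in range(1, limit + 1):
        running += mu[i]
        table[i] = running                      # OverflowError if out of range
    return table

table = mertens_table_int16(10**4)
print(table.itemsize, table[1000], table[10**4])
```

On common platforms `table.itemsize` is 2, compared with the much larger per-element cost of a list of Python integers or a Perl array of scalars.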

Dana Jacobsen · Answer 5 · Fri Jan 22 2021 21:10:05 GMT+0800 (China Standard Time)

See the 2021 paper by Helfgott and Thompson: https://arxiv.org/abs/2101.08773

It is substantially more code, but it looks like it does have pseudocode (13 pages) for everything necessary. I'm curious how it would compare. If nothing else the space usage would be better.