oracle / truffleruby

A high performance implementation of the Ruby programming language, built on GraalVM.

Home Page:https://www.graalvm.org/ruby/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Serious performance regression for method pow(a, m)

jzakiya opened this issue · comments

The code below shows this issue.
On Truffleruby, using pow(a, m) is orders of magnitude slower than on [C|J]Ruby, where it's the fastest.

# Enable YJIT if using CRuby >= 3.3"
RubyVM::YJIT.enable if RUBY_ENGINE == 'ruby' and RUBY_VERSION.to_f >= 3.3

def modinv(a0, m0)
  return 1 if m0 == 1
  a, m = a0, m0
  x0, inv = 0, 1
  while a > 1
    inv -= (a / m) * x0
    a, m = m, a % m
    x0, inv = inv, x0
  end
  inv += m0 if inv < 0
  inv
end

def modinv1(r, m)
  inv = r
  while (r * inv) % m != 1; inv %= m; inv *= r end
  inv % m
end

def modinv2(r, m)
  inv = r
  while (r * inv) % m != 1; inv = (inv * r) end
  inv #% m
end

def modinv3(r, m, rescnt)
  (r ** (rescnt-1)) % m
end

def modinv4(r, m, rescnt)
  r.pow(rescnt-1, m)
end

def tm; t = Time.now; yield; Time.now - t end

resp7 = [11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,
          97,101,103,107,109,113,121,127,131,137,139,143,149,151,157,163,
         167,169,173,179,181,187,191,193,197,199,209,211]

print resp7.map { |r| modinv(r, 210) }
puts
print resp7.map { |r| modinv1(r, 210) }
puts
print resp7.map { |r| modinv2(r, 210) }
puts
print resp7.map { |r| modinv3(r, 210, 48) }
puts
print resp7.map { |r| modinv4(r, 210, 48) }
puts

puts tm{ 100000.times {resp7.each { |r| modinv(r, 210) } } }

puts tm{ 100000.times {resp7.each { |r| modinv1(r, 210) } } }

puts tm{ 100000.times {resp7.each { |r| modinv2(r, 210) } } }

puts tm{ 100000.times {resp7.each { |r| modinv3(r, 210, 48) } } }

puts tm{ 100000.times {resp7.each { |r| modinv4(r, 210, 48) } } }

Thank you for reporting the issue!

Could you please share your benchmarking results?

Which version of TruffleRuby are you using?

Related: #1999
We do modular exponentiation now:

Primitive.mod_pow(self, e, m)

But always convert to BigInteger, maybe that's the overhead?
If that's the case we could use similar logic to CRuby/JRuby for the small integers cases, but would be good to make sure that's the issue and how much slower it is before importing that code.

First code correction.

def modinv2(r, m)
  inv = r
  while (r * inv) % m != 1; inv = (inv * r) % m end
  inv #% m
end

Here are times on my system for Ruby 3.3 w/wo YJIT, JRuby 9.4.6.0 and TruffleRuby 24.0.0, via rvm.
Total times in secs, smaller the faster.

Ruby 3.3 w/o YJIT
1.544888113
1.054920322
0.954105064
3.531902812
0.369263064

Ruby 3.3 w/YJIT
0.375549548
0.383313453
0.36227265
3.213001315
0.235034272

JRuby 9.4.6.0
0.9543389999999999
0.99756
0.870005
1.6784890000000001
0.266971

TruffleRuby 24.0.0
0.222029
0.33000100000000004
0.19171500000000002
2.120432
4.006869

With a slightly modified script which:

  • uses Benchmark.realtime to use a monotonic clock
  • reports 3 timings for each method (to get an idea if things are warmed up or not yet, and get an idea of the error in the measurements)
  • uncomments the #% m for modinv2, otherwise the result differs from other methods
  • cleans up the check for RubyVM::YJIT.enable
require 'benchmark'

# Enable YJIT if using CRuby >= 3.3"
RubyVM::YJIT.enable if defined?(RubyVM::YJIT.enable)
puts RUBY_DESCRIPTION

def modinv(a0, m0)
  return 1 if m0 == 1
  a, m = a0, m0
  x0, inv = 0, 1
  while a > 1
    inv -= (a / m) * x0
    a, m = m, a % m
    x0, inv = inv, x0
  end
  inv += m0 if inv < 0
  inv
end

def modinv1(r, m)
  inv = r
  while (r * inv) % m != 1; inv %= m; inv *= r end
  inv % m
end

def modinv2(r, m)
  inv = r
  while (r * inv) % m != 1; inv = (inv * r) % m end
  inv % m
end

def modinv3(r, m, rescnt)
  (r ** (rescnt-1)) % m
end

def modinv4(r, m, rescnt)
  r.pow(rescnt-1, m)
end

def tm(&b)
  puts
  3.times do
    puts Benchmark.realtime(&b)
  end
end

resp7 = [11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,
          97,101,103,107,109,113,121,127,131,137,139,143,149,151,157,163,
         167,169,173,179,181,187,191,193,197,199,209,211]

print resp7.map { |r| modinv(r, 210) }
puts
print resp7.map { |r| modinv1(r, 210) }
puts
print resp7.map { |r| modinv2(r, 210) }
puts
print resp7.map { |r| modinv3(r, 210, 48) }
puts
print resp7.map { |r| modinv4(r, 210, 48) }
puts

tm{ 100000.times {resp7.each { |r| modinv(r, 210) } } }

tm{ 100000.times {resp7.each { |r| modinv1(r, 210) } } }

tm{ 100000.times {resp7.each { |r| modinv2(r, 210) } } }

tm{ 100000.times {resp7.each { |r| modinv3(r, 210, 48) } } }

tm{ 100000.times {resp7.each { |r| modinv4(r, 210, 48) } } }

I get:

On Intel:
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) +YJIT [x86_64-linux]
1.347028863034211
1.3474754459457472
1.3452114629908465

1.2710674100089818
1.2754993910202757
1.2733817050466314

1.3521089899586514
1.351813138986472
1.3528226349735633

7.390963806945365
7.495342062029522
7.4823689730255865

0.8994149370118976
0.8945350690046325
0.9004962909966707

truffleruby 24.0.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
0.3218597859959118
0.24441085202852264
0.21928530297009274

0.20125181798357517
0.18273464200319722
0.1594632159685716

0.1859166320064105
0.16585971100721508
0.13765042601153255

4.247611464001238
4.127755031979177
4.079420825000852

8.350773187004961
8.356869303970598
8.30608620395651
On AMD Ryzen 7 3700X 8-Core Processor
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) +YJIT [x86_64-linux]
0.6688195530005032
0.6663745859996197
0.6647927419999178

0.7047924210000929
0.7027385810006308
0.70747054200001

0.6540912820000813
0.6548124430000826
0.6545112009998775

5.6871033949992125
5.716956408999977
5.703890113999478

0.46341663200018957
0.46297978599977796
0.46605407200058835

truffleruby 24.0.1, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
0.2904605989997435
0.207193976999406
0.18607605099987268

0.1610029299999951
0.13540141500016034
0.11250887000005605

0.13582420100010495
0.10616527599995607
0.08965392699974473

2.6358709440000894
2.63863470600063
2.630714566000279

5.454730677999578
5.439872395000748
5.455230954999934

truffleruby 24.0.1, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
0.30870855699959066
0.19738048799990793
0.18506349099970976

0.17732479499954934
0.12824568500036548
0.12031344499973784

0.13241845700031263
0.09785182599989639
0.09070625899948936

2.0937121340002705
1.7953665820004971
1.7865803080003388

2.732251799000551
2.6282954050002445
2.6161076440002944

I noticed YJIT is much slower in my measurements, maybe you are on a different architecture, I guess darwin-aarch64?

But overall it's similar results, TruffleRuby is fastest everywhere except for the last method, modinv4.
You mentioned regression, do you mean a regression compared to an older version of TruffleRuby, or you mean regression as "slower than CRuby/JRuby"?

Either way modinv4 amounts to (r between 11 and 211).pow(47, 210) so that's all small integers as inputs, so we should optimize that case too like CRuby/JRuby, it seems the java.math.BigInteger#modPow code doesn't optimize that case well unfortunately.

My system is a Lenovo Legion Slimjet laptop (2022), AMD Ryzen 9 5900HX, with Linux.

The issue is that pow, which is used in modinv4, is significantly slower in Truffleruby than in [C|J]Ruby.
In [C|J]Ruby using pow is the fastest among different implementations, but is slowest in Truffleruby.

Fixed in #3552

Now modinv4 is as fast as modinv1 and modinv2:

On AMD Ryzen 7 3700X 8-Core Processor

truffleruby 24.1.0-dev-52596ca3, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
0.35175757099932525
0.21934590799719444
0.19075998200059985

0.20475720999820624
0.13995916100247996
0.11859457800164819

0.16480916699947556
0.10902587799864705
0.09228703799817595

2.012781329001882
1.7254886529990472
1.7228568719983741

0.21183235100033926
0.13329798599806963
0.11889045199859538

Same but running 5 times to ensure enough warmup since this is on jargraal and not libgraal:
truffleruby 24.1.0-dev-52596ca3, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
0.36994781300018076
0.221259508998628
0.18763062200014247
0.19057154599795467
0.18439673299872084

0.19098085299992817
0.1328596249986731
0.11979981799959205
0.11720118900120724
0.11687311399873579

0.1548915470011707
0.10623463299998548
0.10084365199872991
0.09816986200166866
0.09051773600003798

2.040468414998031
1.7358531079989916
1.755651747000229
1.6976114799981588
1.7026588670014462

0.22171137699842802
0.13052473000061582
0.11817796299874317
0.11785270799737191
0.11395291900043958

As a bonus the new code is written in Ruby (which in this case is a lot faster than Java's BigInteger#modPow!) and IMO looks a lot cleaner than the equivalents in CRuby / JRuby.