instacart / makara

A read-write proxy for connections; also provides an ActiveRecord adapter.


Newrelic RPM Performance Regression

mnelson opened this issue · comments

Issues #45 and #51 have both reported performance problems when using Makara alongside certain versions of newrelic/rpm. In an attempt to be as transparent as possible, and to help isolate the issue, the following table collects the information needed to check your stack against other community reports.

New Relic has released a fix (newrelic_rpm 3.11.2).

The range of affected newrelic_rpm releases is 3.7.2 <= X < 3.11.2.

| Makara Version | AR Version | RPM Version | Regression |
| --- | --- | --- | --- |
| 0.0.x | 3.2 | 3.6.9 | no |
| 0.2.x | 4.x | 3.6.5.130 | no |
| 0.2.x | 4.x | 3.6.6.147 | no |
| 0.3.x | 4.x | 3.6.6.147 | no |
| **BROKE** | --- | --- | --- |
| 0.2.x | 4.x | 3.7.2 | yes |
| 0.2.x | 4.x | 3.9.4.245 | yes |
| 0.3.x | 4.x | 3.8.1.221 | yes |
| 0.3.x | 4.x | 3.9.5.251 | yes |
| 0.3.x | 4.x | 3.9.6.257 | yes |
| 0.3.x | 4.x | 3.9.7.266 | yes |
| **FIXED** | --- | --- | --- |
| 0.x.x | x.x | 3.11.2 | no |

👍 thanks @mnelson!

Trying makara v0.3.0rc3 for the first time with AR 4.1.8 and newrelic_rpm 3.9.7.266, I can confirm the performance problems.

Another for you

Makara 0.3.x + AR 4.x + RPM 3.9.6.257 = regression yes

Hey there! I'm a developer on newrelic_rpm and got linked to these perf issues today. Major bummer! I'd like to help get to the bottom of it if we can.

I haven't used makara in earnest before, and plugging it into my basic Rails 4 test app didn't show me much perf difference yet. I'm sure I'm missing something in my configuration, though.

Does anyone have a relatively simple duplication case that I could use to dig in? Ideally it would be a repo with a Rails app where I could toggle between versions and see the perf differences, but any pointer on where to head would be appreciated.

Here you go! https://github.com/taskrabbit/makara_newrelic_test

makara_newrelic_test [master] ~> be rake bm_nr
-- create_table("users", {:force=>true})
   -> 0.0580s
-- initialize_schema_migrations_table()
   -> 0.0065s
Rehearsal ------------------------------------------
insert   8.660000   0.200000   8.860000 (  9.134924)
select   2.600000   0.060000   2.660000 (  2.768211)
-------------------------------- total: 11.520000sec

             user     system      total        real
insert   8.670000   0.150000   8.820000 (  9.086117)
select   2.520000   0.040000   2.560000 (  2.667939)
makara_newrelic_test [master] ~> be rake bm
-- create_table("users", {:force=>true})
   -> 0.0358s
-- initialize_schema_migrations_table()
   -> 0.0014s
Rehearsal ------------------------------------------
insert   0.260000   0.030000   0.290000 (  0.418320)
select   0.080000   0.010000   0.090000 (  0.157242)
--------------------------------- total: 0.380000sec

             user     system      total        real
insert   0.280000   0.020000   0.300000 (  0.419354)
select   0.090000   0.010000   0.100000 (  0.167273)

Let me know if there's anything I can do to help.

#71 will likely be of interest as well.

Based on the rubygem releases, I believe the issue will be found between 3.6.8.168 and 3.6.9.171: 3.6.8.168 runs fine, 3.6.9 takes 40x longer. newrelic/newrelic-ruby-agent@3.6.8.168...3.6.9.171
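A bisection like this can be driven from the repro app's Gemfile by pinning each candidate release and rerunning the benchmark (a sketch, not necessarily how it was done here):

# Pin one candidate release at a time, then `bundle && be rake bm_nr`.
gem 'newrelic_rpm', '3.6.8.168' # next: '3.6.9.171', etc.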

#75

I'm seeing an improvement in this branch, but it's still about 4x slower:

makara_newrelic_test [master] ~> be rake bm_nr
-- create_table("users", {:force=>true})
   -> 0.0723s
-- initialize_schema_migrations_table()
   -> 0.0040s
Rehearsal ------------------------------------------
insert   1.130000   0.080000   1.210000 (  1.449123)
select   0.490000   0.030000   0.520000 (  0.615719)
--------------------------------- total: 1.730000sec

             user     system      total        real
insert   1.180000   0.050000   1.230000 (  1.425478)
select   0.370000   0.010000   0.380000 (  0.471990)
makara_newrelic_test [master] ~> be rake bm
-- create_table("users", {:force=>true})
   -> 0.0439s
-- initialize_schema_migrations_table()
   -> 0.0017s
Rehearsal ------------------------------------------
insert   0.290000   0.030000   0.320000 (  0.485325)
select   0.080000   0.010000   0.090000 (  0.165335)
--------------------------------- total: 0.410000sec

             user     system      total        real
insert   0.270000   0.020000   0.290000 (  0.419044)
select   0.090000   0.010000   0.100000 (  0.170199)

Thanks @mnelson, that's going to be super helpful in tracking this down! I probably won't be able to dig deep on this until next week, but having those dups will be huge in getting to the bottom of this. Thanks again!

As expected, that repro case was awesome ✨. Thanks again @mnelson for putting it together!

The slowdown is happening at this point in newrelic_rpm: https://github.com/newrelic/rpm/blob/5560d8ecff0dcece53a80edbc4535cc60fc188d8/lib/new_relic/agent/instrumentation/active_record_subscriber.rb#L87-L93. On Rails 4.x+, New Relic uses ActiveSupport::Notifications, but the event payload only gives us the object_id of the connection. Because we need more information off the connection (the database adapter type, for instance), we search through the entire connection list, and that's where the bottleneck is.

Here's a slightly nicer version of the same code from the Ruby agent that you can put in a task or script to run:

require 'benchmark'

n = 1000
Benchmark.bm do |x|
  x.report do
    n.times do
      # Collect every checked-out connection across all pools.
      connections = ::ActiveRecord::Base.connection_handler.connection_pool_list.map do |pool|
        pool.connections
      end.flatten

      # Look the connection up by object_id. 1234 is a dummy id, so this
      # always misses and scans the whole list (the worst case).
      connection = connections.detect { |cnxn| cnxn.object_id == 1234 }
    end
  end
end

When I run this in a Rails project with the mysql adapter, I get the following timing:

       user     system      total        real
   0.030000   0.000000   0.030000 (  0.033124)

Running that with makara instead (it doesn't matter whether New Relic's in the picture for this) yields this:

       user     system      total        real
   1.600000   0.030000   1.630000 (  1.625922)

I haven't dug into makara to see why this path is slow. Stackprof gives some slightly confusing results, but points at Delegator and SimpleDelegator a lot.
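For anyone who wants to reproduce that profile, a minimal harness along these lines should work inside the repro app (a sketch, assuming the stackprof gem is available; the loop mirrors the lookup code above):

require 'stackprof'

# Profile the same pool-scan-plus-detect loop shown earlier.
profile = StackProf.run(mode: :cpu) do
  1_000.times do
    connections = ::ActiveRecord::Base.connection_handler.connection_pool_list.map(&:connections).flatten
    connections.detect { |c| c.object_id == 1234 }
  end
end

StackProf::Report.new(profile).print_text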

FWIW, we don't really like this connection lookup, and are continuing to look for a solution from the Rails end to avoid it. But even if we get an update it'll only be available in some as-yet-unavailable Rails version, so this code is likely to live on even then as a fallback on Rails 4.x.

Any thoughts where to go from here @mnelson?

@jasonrclark Is there an issue with using ObjectSpace._id2ref(object_id) rather than detecting the object by iterating the connections? I agree that there's a problem somewhere but from my quick attempt an id2ref lookup at least avoids it.


Oh, just remembered I use id2ref for the makara logging as well: https://github.com/taskrabbit/makara/blob/master/lib/makara/logging/subscriber.rb#L25
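For reference, the lookup in question is a one-liner; a minimal sketch, assuming the sql.active_record event payload carries the connection's object_id under :connection_id (as it does on Rails 4.x):

# Resolve the connection directly from its object_id instead of scanning
# every pool. `event` is the ActiveSupport::Notifications event.
connection = ObjectSpace._id2ref(event.payload[:connection_id])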

Btw, the repro is updated with some id2ref rake tasks https://github.com/taskrabbit/makara_newrelic_test

@mnelson There actually are problems with _id2ref: we had it that way for our original implementation of this code and had to move to the more complex scheme we've got today.

The two issues were that 1) _id2ref performs really badly on Rubinius and 2) it isn't even supported on JRuby without passing a special startup flag (for similar perf reasons).

Given those constraints, we can't go back to using _id2ref. With your knowledge of the project, does it seem possible to optimize the connection code? If so, that seems like the best option to me.

Failing that, I've got an idea or two about how we could let you override the connection retrieval. It's a little awkward because of load ordering (I expect makara will typically be required before newrelic_rpm), but I know how we could do it.

With a little more digging I've determined that the bottleneck occurs due to the flatten invocation. My guess is that there's something weird going on with object comparison or array coercion due to the delegate, maybe even down at the C level in rb_check_array_type. I'll keep poking around.
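To make that suspicion concrete, here's a hypothetical micro-benchmark (not from the thread) contrasting flatten over plain objects with flatten over SimpleDelegator-wrapped ones; flatten has to ask each element whether it converts to an Array (rb_check_array_type -> to_ary), and for a delegator that check routes through respond_to_missing?:

require 'benchmark'
require 'delegate'

plain   = Array.new(100) { Object.new }
wrapped = Array.new(100) { SimpleDelegator.new(Object.new) }

Benchmark.bm(8) do |x|
  # Each flatten probes all 100 elements for to_ary support.
  x.report('plain')   { 10_000.times { [plain].flatten } }
  x.report('wrapped') { 10_000.times { [wrapped].flatten } }
end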

FWIW, this is a potential fix for the NR codebase, even though I understand the real problem is in Makara:

connection = ActiveRecord::Base.connection_handler.connection_pool_list.detect do |pool|
  pool.connections.detect do |c|
    c.object_id == 4 # dummy object_id for illustration
  end
end

That's a great idea @mnelson. Tested it out, and with a little tweaking (can't use the outer detect since that returns the connection pool, not the inner connection), I was able to use this instead.
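An illustrative version of that tweak (`target_id` stands in for the object_id from the notification payload; this is a sketch, not the agent's actual code):

# Scan pool by pool, returning the matching connection itself rather
# than the pool that contains it, and skip the intermediate flatten.
connection = nil
ActiveRecord::Base.connection_handler.connection_pool_list.each do |pool|
  connection = pool.connections.detect { |c| c.object_id == target_id }
  break if connection
end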

It also neatly avoids a couple extra object allocations, so in addition to helping makara out, it's a small win for every ActiveRecord 4.x client using newrelic_rpm 🎉

This isn't likely to make it into the next release of the agent, which is already baking internally, but I'll ping here when we have a branch of the agent to test against directly.

Hi guys. I just ran into the same issue. What is the workaround while it's not yet fixed?

With reference to the table above, you can use one of the makara or rpm versions which doesn't have the performance issue. I also have a branch which somewhat improves the situation; you can absolutely give it a try, but it won't completely fix the problem. The underlying perf issue seems to be in Ruby's Delegator class, and I'm still working on a solution that at the very least avoids it.
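For example, one way to pin around the regression in the Gemfile until the fix lands (a sketch; pick whichever known-good combination from the table fits your app):

# Stay below the first affected newrelic_rpm release.
gem 'newrelic_rpm', '< 3.7.2'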

I have tested makara 0.3.x (latest version) with rpm 3.6.6.147 and AR 4.x and it is working fine!

Thanks @danicuki, added to the table.

Hey y'all - we've just released version 3.11.2 of the newrelic_rpm gem, which should address this issue. Thanks again for your help in isolating and fixing it!

Makara Version: 0.3.3
AR Version: 3.2.14
RPM Version: 3.9.7.266

Is this configuration tested?

@nikhilvs Any newrelic_rpm >= 3.7.2 and < 3.11.2 will have the regression. I'd recommend upgrading newrelic_rpm to avoid the problem.
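Concretely, that's just a Gemfile constraint, e.g.:

# Require the release that contains the fix.
gem 'newrelic_rpm', '>= 3.11.2'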

@mnelson, could we mark down when the fix came through in the table to avoid folks having to follow the whole comment thread to find out what happened?

@jasonrclark done. Thanks.