microsoft / ghcrawler

Crawl GitHub APIs and store the discovered orgs, repos, commits, ...

mongodb: commit - unindexed _metadata.url

grooverdan opened this issue

Crawling a largish tree like MariaDB generated a rising CPU profile on the server for the mongodb process:

[graph: mongod CPU usage rising steadily during the crawl, dropping sharply at the end after the index below was added]

The mongodb process was niced just before the fix below was applied.

A look at the profiler shows a collection scan looking for a _metadata.url:

use ghcrawler
db.setProfilingLevel(1)
db.system.profile.find().pretty()
{
	"op" : "query",
	"ns" : "ghcrawler.commit",
	"command" : {
		"find" : "commit",
		"filter" : {
			"_metadata.url" : "https://api.github.com/repos/MariaDB/server/commits/6e791795a2d7319e32a65a4d8a2cb6ed54cfc5c6"
		},
		"limit" : 1,
		"batchSize" : 1,
		"singleBatch" : true,
		"$db" : "ghcrawler"
	},
	"keysExamined" : 0,
	"docsExamined" : 108972,
	"cursorExhausted" : true,
	"numYield" : 857,
	"locks" : {
		"Global" : {
			"acquireCount" : {
				"r" : NumberLong(1716)
			}
		},
		"Database" : {
			"acquireCount" : {
				"r" : NumberLong(858)
			}
		},
		"Collection" : {
			"acquireCount" : {
				"r" : NumberLong(858)
			}
		}
	},
	"nreturned" : 0,
	"responseLength" : 104,
	"protocol" : "op_query",
	"millis" : 168,
	"planSummary" : "COLLSCAN",
	"execStats" : {
		"stage" : "LIMIT",
		"nReturned" : 0,
		"executionTimeMillisEstimate" : 149,
		"works" : 108974,
		"advanced" : 0,
		"needTime" : 108973,
		"needYield" : 0,
		"saveState" : 857,
		"restoreState" : 857,
		"isEOF" : 1,
		"invalidates" : 0,
		"limitAmount" : 1,
		"inputStage" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"_metadata.url" : {
					"$eq" : "https://api.github.com/repos/MariaDB/server/commits/6e791795a2d7319e32a65a4d8a2cb6ed54cfc5c6"
				}
			},
			"nReturned" : 0,
			"executionTimeMillisEstimate" : 149,
			"works" : 108974,
			"advanced" : 0,
			"needTime" : 108973,
			"needYield" : 0,
			"saveState" : 857,
			"restoreState" : 857,
			"isEOF" : 1,
			"invalidates" : 0,
			"direction" : "forward",
			"docsExamined" : 108972
		}
	},
	"ts" : ISODate("2018-06-07T03:54:57.791Z"),
	"client" : "172.18.0.5",
	"allUsers" : [ ],
	"user" : ""
}
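For reference, the same profiler collection can be filtered to surface just the slow collection scans, using the ns, planSummary and millis fields visible in the entry above (a quick sketch, not part of the original fix):

db.system.profile.find({ "ns" : "ghcrawler.commit", "planSummary" : "COLLSCAN" }).sort({ "millis" : -1 }).limit(5).pretty()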

Adding an index on this field caused a rather quick drop in CPU, as shown at the end of the graph:

db.commit.createIndex( { "_metadata.url": "hashed" } )
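To confirm the index is actually picked up, the same lookup can be re-run with explain() (a quick sketch reusing the commit URL from the profiler entry above; once the index exists, the winning plan should show an IXSCAN stage instead of the COLLSCAN, with keysExamined/docsExamined of at most 1 rather than ~109k):

db.commit.find(
	{ "_metadata.url" : "https://api.github.com/repos/MariaDB/server/commits/6e791795a2d7319e32a65a4d8a2cb6ed54cfc5c6" }
).limit(1).explain("executionStats")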

If this index could be created by default, that would be most useful.
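As a rough idea of what the default could look like, here is a minimal Node.js sketch using the mongodb driver; the helper name and the place it would hook into ghcrawler's Mongo store initialization are assumptions, and other entity collections may warrant the same treatment:

// Sketch only: ensure the index exists when the store connects.
// ensureMetadataUrlIndex is a hypothetical helper; the actual ghcrawler
// Mongo store class and initialization hook may differ.
const { MongoClient } = require('mongodb');

async function ensureMetadataUrlIndex(mongoUrl, collectionName = 'commit') {
  const client = await MongoClient.connect(mongoUrl);
  const db = client.db('ghcrawler');
  // createIndex is idempotent, so calling it on every startup is safe.
  await db.collection(collectionName).createIndex({ '_metadata.url': 'hashed' });
  return client;
}

A hashed index keeps the entries compact for long URL strings and supports the exact-match lookups seen in the profiler output; a regular ascending index on _metadata.url would also eliminate the collection scan.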

Before the change, mongotop was showing the read/total time for ghcrawler.commit in the 1000s of ms.
After:

                            ns    total    read    write    2018-06-07T04:39:42Z
              ghcrawler.commit      3ms     3ms      0ms                        
       admin.system.namespaces      0ms     0ms      0ms                        
            admin.system.roles      0ms     0ms      0ms                        
          admin.system.version      0ms     0ms      0ms                        
      config.system.namespaces      0ms     0ms      0ms                        
        config.system.sessions      0ms     0ms      0ms                        
ghcrawler.admin.system.version      0ms     0ms      0ms                        
             ghcrawler.commits      0ms     0ms      0ms                        
        ghcrawler.contributors      0ms     0ms      0ms                        
          ghcrawler.deadletter      0ms     0ms      0ms        

Thanks @grooverdan. The change looks simple. Does anything else need to be done? Do you want to do a PR for it, just to make sure the change goes where you expect, etc.? (Not a lot of Mongo expertise is available on the team right now.)

I haven't looked at where in the code this is or should be. I'll take a look.

This is my first look at Mongo too. Runaway CPU consumption aids a bit of rapid learning :-)