pkdone / mongo-parallel-agg

Tool to demo splitting up a MongoDB Aggregation into concurrently run sub-pipelines

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Mongo Parallel Agg

The mongo-parallel-agg tool demos using an aggregation pipeline to calculate the average value of a field in the documents of a collection in a MongoDB database, but splitting the workload up into multiple aggregation pipeline jobs executed in parallel against subsets of documents to show how the total response time for the workload can be reduced.

The mongo-parallel-agg tool relies on analysing an inflated version of the Atlas sample data set) and specifically the samples_mflix movies data. This doesn't mean the target cluster has to be in Atlas, it is just typically easier that way. Alternatively you can download your own copy of the sample data set and load it into your self-managed MongoDB database cluster. The instructions here will assume you are using an Atlas cluster.

The mongo-parallel-agg tool also leverages the mongo-mangler utility to expand the sample movies data-set to a collection of 100 million records, using duplication.

For more information on using this tool to demonstrate decreasing the see execution time of an aggregation pipeline, see the blog post Achieving At Least An Order Of Magnitude Aggregation Performance Improvement By Scaling & Parallelism.

Steps To Run

  1. Ensure you have a running MongoDB Atlas cluster (e.g. an M40-tier 2-shard cluster) deployed and which is network accessible from your client workstation (by configuring an IP Access List entry).

  2. Ensure you are connecting to the MongoDB cluster with a database user which has read privileges for the source database and read + write privileges the target database. If you are running a Sharded cluster, the database user must also have the privileges to run the 'enablingSharding' and 'splitChunk' commands. In Atlas, you would typically need to assign the 'Atlas Admin' role to the database user to achieve this.

  3. On your client workstation, ensure you have Python 3 (version 3.8 or greater) and the MongoDB Python Driver (PyMongo) installed. Example to install PyMongo:

pip3 install --user pymongo
  1. Download the mongo-mangler project and unpack it on your local workstation, ready to use, and ensure the mongo-mangler.py file is executable.

  2. For your Atlas Cluster in the Atlas Console, choose the option to Load Sample Dataset.

  3. In a terminal, change directory to the unpacked mongo-mangler root folder and execute the following to connect to the Atlas cluster and copy and expand the data from an the existing sample_mflix.movies collection to an a new collection, testdb.big_collection, to contain 10 million documents (if the target is a sharded cluster this will automatically define a range shard key on the field with pre-split chunks):

./mongo-mangler.py -m "mongodb+srv://myusr:mypwd@mycluster.abc1.mongodb.net/" -d "sample_mflix" -c "movies" -t "movies_big" -k "title" -s 100000000

    NOTE: Before executing the above command, first change the URL's username, password, and hostname to match the URL of your running Atlas cluster.

  1. From the terminal, change directory to the root folder of this mongo-parallel-agg project and execute the following to connect to the Atlas cluster and execute a simple aggregation pipeline, split into 16 sub-pipelines run in parallel, which calculates and then prints out the average "metacritic" score across all movies in the collection of 100 million records, including printing out the total execution time at the end of the run:
./mongo-parallel-agg.py -m "mongodb+srv://myusr:mypwd@mycluster.abc1.mongodb.net/" -d "sample_mflix" -c "movies_big" -s 16 -p "title" -a "metacritic"

    NOTE: Before executing the above command, first change the URL's username, password, and hostname to match the URL of your running Atlas cluster. Also, optionally change the -s parameter to a different value to run a different number of sub-processes. You can set this value to 1 to instruct the tool to not split up the aggregation pipeline and instead run the full aggregation in one go from the main single process.

About

Tool to demo splitting up a MongoDB Aggregation into concurrently run sub-pipelines

License:MIT License


Languages

Language:Python 100.0%