google / gcp_scanner

A comprehensive scanner for Google Cloud

multiprocessing: break down the main scan loop

peb-peb opened this issue

commented

Some of the architectural design decisions:

Decision 1

By default (on macOS and Windows), a new process is created with the spawn start method. But, as mentioned in the Python docs here, it is rather slow:

Starting a process using this method is rather slow compared to using fork or forkserver.

So, should we pick different start methods depending on the machine the tool runs on? (This could have some impact, but we would have to test to quantify it exactly.)

Some questions to help settle this:

  1. Support for all 3 types of systems?
  2. On which platform would our tool be running most of the time?
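For reference, a minimal sketch of selecting a start method explicitly with Python's multiprocessing module; the crawl helper and resource names below are placeholders, not the tool's real code:

```python
import multiprocessing as mp

def crawl(resource_name: str) -> str:
    # Placeholder for a single crawl task.
    return f"scanned {resource_name}"

if __name__ == "__main__":
    # "spawn" works the same way on Linux, macOS, and Windows;
    # "fork" and "forkserver" are only available on POSIX systems.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=3) as pool:
        print(pool.map(crawl, ["buckets", "instances", "firewalls"]))
```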

Decision 2

We have to break down the main scan loop into several atomic components such as get_project_list, crawl_client, and others (see the sketch below).

  • I'll propose the design in a few hours, then we can have a discussion regarding this.
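As a rough illustration of the kind of split being discussed (only get_project_list and crawl_client are names from this thread; their signatures, the dummy return values, and scan are assumptions):

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat

def get_project_list(credentials):
    # Hypothetical stand-in: the real function would query the list of reachable projects.
    return ["project-a", "project-b", "project-c"]

def crawl_client(project_id, credentials):
    # Hypothetical stand-in: crawl all resources of one project (one unit of parallel work).
    return {project_id: "scanned"}

def scan(credentials):
    # The project list is queried once, up front; each project is then crawled in parallel.
    projects = get_project_list(credentials)
    with ProcessPoolExecutor() as executor:
        # Arguments cross a process boundary, so they have to be picklable.
        return list(executor.map(crawl_client, projects, repeat(credentials)))

if __name__ == "__main__":
    print(scan(credentials="dummy-token"))
```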

Decision 3

As discussed with @ZetaTwo, we might sometimes need to run some service requests before others (not as of now).

So, should we group the requests from the beginning and have them run as a group?

Decision 4

What should we go with: multiprocessing.Process or concurrent.futures.ProcessPoolExecutor?

Differences: concurrent.futures.ProcessPoolExecutor handles locks and process scheduling for us and is the safer, higher-level option. multiprocessing.Process covers the same ground but additionally exposes optional, lower-level control over things like locks, queues, and Managers (a level of control we don't need as of now).

[Updated below] Hmm, I would go with multithreading for the GCP resource crawler (e.g. use multiprocessing.pool.ThreadPool or concurrent.futures.ThreadPoolExecutor). Forking/spinning up a new process for each request/set of requests might be expensive. I'd do actual multiprocessing per GCP project and/or service account key, where the time spent on forking/spinning up a new process is negligible compared to the scanning time. I think having one solution for all OSes will make maintenance and implementation easier (so spawn is preferable).
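A rough sketch of that layered idea, i.e. one process per project with threads for the I/O-bound requests inside it; fetch_resource, crawl_project, and the resource list are hypothetical names, not the scanner's actual API:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from functools import partial

RESOURCE_TYPES = ["buckets", "instances", "firewalls"]  # hypothetical resource list

def fetch_resource(project_id, resource_type):
    # Hypothetical stand-in for one I/O-bound API request.
    return f"{project_id}/{resource_type}"

def crawl_project(project_id):
    # Inside one worker process, fan the per-resource requests out to threads:
    # they are I/O bound, so threads are cheap compared to extra processes.
    with ThreadPoolExecutor(max_workers=8) as threads:
        return list(threads.map(partial(fetch_resource, project_id), RESOURCE_TYPES))

def scan(projects):
    # One worker process per project; process start-up cost is negligible next to scan time.
    with ProcessPoolExecutor() as processes:
        return dict(zip(projects, processes.map(crawl_project, projects)))

if __name__ == "__main__":
    print(scan(["project-a", "project-b"]))
```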

Update: I just read the documentation about pools of workers. You can disregard what's written above about a process per request/set of requests: you have a pool of process workers waiting for new requests to come in. BUT we need to make sure all workers live as long as the scanning loop lives; this way you pay the start-up cost only once (this is not super important IMO). However, keep in mind that the objects you pass must be picklable in the case of concurrent.futures.ProcessPoolExecutor, according to this (https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor). Everything sent to/received from our functions should be picklable AFAIK, but I'd check it anyway.
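One quick way to check that assumption for a given payload (a minimal sketch; the example payload is made up):

```python
import pickle

def assert_picklable(obj):
    # Sanity check: anything sent to or returned from a ProcessPoolExecutor
    # task has to survive a pickle round trip.
    pickle.loads(pickle.dumps(obj))

if __name__ == "__main__":
    # Hypothetical example payload for a crawl task.
    assert_picklable({"projectId": "project-a", "sa_key_path": "/tmp/key.json"})
```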

commented

Meeting Notes for 10/07/2023:

Decision 1

We will go with the default approach (spawn) and support all three platforms. The primary reason is that the I/O tasks take on the order of seconds, so optimizing them with fork would only save milliseconds (which is negligible in comparison).

Decision 2

To be discussed in further PRs and issues, but a starting point would be parallelizing the Storage Bucket crawl.

Decision 3

The projects list needs to be queried first. So, instead of grouping requests and handling the groups, we can make an exception and query the projects list outside the loop.

Decision 4

concurrent.futures.ProcessPoolExecutor is the way to go.

Completed with PRs #265 and #269.
closing