ipython / ipyparallel

IPython Parallel: Interactive Parallel Computing in Python

Home Page: https://ipyparallel.readthedocs.io/


imap - only a single engine is still used after a few iterations

tommedema opened this issue

I've set up an ipyparallel cluster, which resulted in a decent speedup from 15 it/s to about 80-120 it/s:
[Screenshot: tqdm progress bar showing ~80-120 it/s]

Sample code:

import ipyparallel as ipp
import os
    
clusterProcessesCount = os.cpu_count() # == 16

cluster = ipp.Cluster(n = clusterProcessesCount)
cluster.start_cluster_sync()
rc = cluster.connect_client_sync()
rc.wait_for_engines(clusterProcessesCount)
lview = rc.load_balanced_view()
dview = rc[:]

# ...

# push settings to child processes
dview.push(dict(
    cvTrainingFoldMinMatchCount = cvTrainingFoldMinMatchCount,
    cvTrainingFoldMaxMatchCount = cvTrainingFoldMaxMatchCount,
    cvEarlyAbandonMinPromisingSuccessRate = cvEarlyAbandonMinPromisingSuccessRate
))

# ....

for result in lview.imap(parallelTrainTestQuery, cvIndicesAndQueries, ordered = False, max_outstanding = 'auto'):
    print(result)

Note that since I am using this for cross-validation, I am referring to the number of folds that were processed before the engines stop being used. After 2 folds (each fold being a full pass of imap over all tasks), it is only using 1 engine out of 16:
[Screenshot: CPU usage showing only one busy Python process]

There are no error messages that I can see. At the first fold all 16 engines are fully used at 100% CPU. I did notice that when I first boot the cluster there are various warning messages, but none seem to stop the cluster from working.

Is there anything I can do to help resolve this?

Is there a chance the 100% CPU process is not an engine at all, but rather the notebook kernel or perhaps a scheduler? Can you tell which process that is (you can check the command line of the process with e.g. psutil.Process(pid).cmdline() or ps ax)?
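For reference, a minimal sketch of that check; the pid value below is a placeholder for whichever process is pegged at 100% CPU (a kernel's command line typically mentions ipykernel, while an engine's mentions ipyparallel or ipengine):

import psutil

pid = 12345  # placeholder: PID of the process stuck at 100% CPU

# cmdline() returns the process's full command-line argument list,
# which is usually enough to tell a kernel from an engine or scheduler
print(psutil.Process(pid).cmdline())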

@minrk thanks for the response!

ps ax gave me:

1264 ?? Rs 1130:11.85 /Users/tommedema/opt/anaconda3/bin/python -m ipykern

Also, here are some screenshots from Activity Monitor:

[Screenshots: Activity Monitor process list showing the busy Python process]

That means it's your kernel (the client), not any of the engines, that is stuck doing work, perhaps processing incoming results. If you interrupt your notebook when this happens, do you get a traceback? How big are the result objects of your individual tasks? 300k is quite a few tasks. Depending on how many you have, you might want to add e.g. chunksize=10 to bundle 10 function calls per IPython Parallel message.
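As a rough illustration (not from the thread), here is what that could look like using the non-blocking map call rather than imap, assuming map's documented chunksize and ordered options; the function and sequence names are taken from the snippets above:

# Hedged sketch: bundle 10 calls per ipyparallel message via map's chunksize
# (as noted further down in the thread, imap itself does not accept chunksize)
amr = lview.map(
    parallelTrainTestQuery,
    cvIndicesAndQueries,
    block=False,
    ordered=False,
    chunksize=10,
)
for result in amr:
    print(result)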

@minrk interesting, because processing a result is as simple as adding it to a pandas DataFrame:

    with tqdm(total=cvQueriesLength) as pbar:
        for result in lview.imap(parallelTrainTestQuery, cvIndicesAndQueries, ordered = False, max_outstanding = 'auto'):
            pbar.update(1)

            queryIndex = int(result[0])
            successRate = result[1]
            matchCount = int(result[2])
            maxDistance = result[3]
            
            cvResults.loc[cvResults.shape[0]] = {
                'query_index': queryIndex,
                'success_rate': successRate,
                'match_count': matchCount,
                'max_distance': maxDistance
            }

The result object is quite small; it's a numpy array with 4 floats:

return np.array([queryIndex, successRate, matchCount, trainingMaxDistance], dtype = np.float32)

I indeed have 300k tasks for each fold.

@minrk chunksize seems interesting, but from the docs it appears to apply only to map and not imap? I am using imap with max_outstanding = 'auto'.

I did just try setting chunksize and got: TypeError: imap() got an unexpected keyword argument 'chunksize'

I did just interrupt it while this happened, and this is the traceback:

https://gist.github.com/tommedema/b7107e66f4f70d1b2fa669927b2a4cff

Sorry, you're right - imap doesn't support chunksize yet.

That traceback shows it was waiting in your pandas append, not an IPP call. This could be a coincidence, but I suspect it's because appending a row to a pandas DataFrame makes a copy of the whole DataFrame. That gets expensive when you have 300k rows: each incoming result means building a brand-new DataFrame that is already hundreds of thousands of rows long.

Pre-allocating the whole DataFrame should be loads faster and less memory intensive:

N = len(cvIndicesAndQueries)

cvResults = pd.DataFrame(
    columns=["query_index", "success_rate", "match_count", "max_distance"],
    # defining the index ensures all the rows are defined
    index=np.arange(0, N),
)
...
for i, result in enumerate(lview.imap(...)):
    ...
    # addressing an _existing_ row doesn't create a new DataFrame
    # or maybe the index should be query_index?
    cvResults.iloc[i] = {...}
...
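For anyone who wants to try the pattern in isolation, here is a small self-contained sketch of the same idea with synthetic data (none of these names come from the thread): pre-allocate the DataFrame once, then assign into existing rows instead of appending.

import numpy as np
import pandas as pd

N = 1000  # synthetic task count, for illustration only
columns = ["query_index", "success_rate", "match_count", "max_distance"]

# pre-allocate all N rows up front (filled with NaN until assigned)
results = pd.DataFrame(columns=columns, index=np.arange(N), dtype=float)

for i in range(N):
    # stand-in for one imap result: a 4-element float32 array
    row = np.array([i, 0.5, 10, 1.25], dtype=np.float32)
    # assigning into an existing row avoids copying the whole DataFrame
    results.iloc[i] = row

print(results.head())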

@minrk wow, very sharp! This helped tremendously. Thank you so much.