tcbenchstack / tcbench

tcbench is a Machine Learning and Deep Learning framework to train models from traffic packet time series or other input representations.

Home Page: https://tcbenchstack.github.io/tcbench/

multiprocessing error

suyihhhhh opened this issue · comments

I successfully built tcbench using WSL, but encountered a multiprocessing error during the second run of the artifacts campaign while testing "tcbench campaign augment-at-loading". The error occurs specifically in the "data augmentation (rotate)" step. The error message is shown below. Could you please advise me on how to resolve it? @tcbenchstack
[screenshots: error message and traceback]

Here is the code that I entered:
[screenshot: command entered]

mmm...I never encountered this problem.

Can I ask you to

  1. Share the full stack trace. From what you shared I cannot see exactly where it points to. No need for a screenshot (a text file is enough)

  2. Tell me more about your machine: how many CPU cores do you have? Under the hood the augmentations are done using 20 processes. The --workers option allows you to control this: what happens if you use fewer workers? (See the snippet after this list for a quick way to check your core count)

  3. Have you tried running the 2nd individual run of the campaign in isolation via tcbench run? I suspect this would work, since from the screenshot the first run in the campaign seems fine

  4. Have you tried with another augmentation?

  5. In the imc23 branch you can find requirements-imc23.txt, which details all package versions used for our IMC23 paper
    (https://tcbenchstack.github.io/tcbench/papers). Have you tried doing a diff to see if something major changed between your environment and the original versions?
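For point 2, a quick way to check the available cores from Python (note that under WSL this reflects what Windows exposes to the Linux VM, not necessarily the physical core count):

import multiprocessing
import os

print(os.cpu_count())               # logical cores visible to the interpreter
print(multiprocessing.cpu_count())  # same value, via the multiprocessing module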

Thank you very much for your prompt response!
Below I have provided the complete stack trace as well as details about my CPU configuration.

In fact, I have tried changing the --workers option to 2, 5, and 10. However, the results remain as described above.

I also changed the augmentations and found that, apart from noaug, which runs normally, all other augmentation options hit errors during the second run.

However, as you mentioned, I encountered no errors when running the second run separately via tcbench run. Meanwhile, I tried tcbench campaign contralearn-and-finetune, and it executed successfully. I believe this should resolve my current problem.

[screenshot: CPU configuration]
stacktrace.txt

Thanks for the info.

Very curious scenario.

First of all, some high-level information about the code.

When using contrastive learning, the augmentations are applied during the training loop.
In other words, a DataLoader forms a batch by invoking the dataset's __getitem__()

def __getitem__(self, idx):

In this scenario there is no multiprocessing involved.
So, indeed it makes sense that you do not experience the problem.
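To make that concrete, here is a minimal sketch of the pattern (a hypothetical class, not tcbench's actual code): the augmentation runs inline, inside whatever process fetches the sample, with no explicit multiprocessing.Pool.

from torch.utils.data import Dataset

class AugmentedDataset(Dataset):
    # Hypothetical sketch of the contrastive-learning path:
    # the transform is applied sample by sample at fetch time.
    def __init__(self, samples, augmentation):
        self.samples = samples
        self.augmentation = augmentation  # e.g. a random rotation

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # No multiprocessing.Pool involved: the augmentation runs
        # in the process that builds the batch.
        return self.augmentation(self.samples[idx])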

Conversely, multiprocessing happens when applying the augmentations at loading.
I was hoping the full stack trace would point to the exact line where this is triggered, but it does not.
Since the stack trace mentions multiprocessing.Pool, my best guess is that it happens in the augmentations loop

if worker_func.__name__ == "_worker_aug_torch":

Notice that augmentations applied to FlowPics are done via pytorch, while the rest are done in numpy

worker_func = self._worker_aug_numpy

But if you tested both kinds, I exclude problems related to the two functions handling them.
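For comparison, a minimal sketch of the augment-at-loading pattern, with a hypothetical numpy-based worker (the real code dispatches to _worker_aug_torch for FlowPics and _worker_aug_numpy otherwise): all augmented samples are produced up front by a pool of worker processes, which is where a broken Pool would surface.

import multiprocessing

import numpy as np

def _worker_aug(sample):
    # Hypothetical numpy augmentation: additive Gaussian noise.
    return sample + np.random.normal(0.0, 0.1, size=sample.shape)

def augment_at_loading(samples, workers=20):
    # The whole dataset is augmented before training starts.
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(_worker_aug, samples)

if __name__ == "__main__":
    data = [np.zeros(10) for _ in range(100)]
    augmented = augment_at_loading(data, workers=4)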

Notice also the comment about an issue with multiprocessing we experienced during development.
The scenario, however, was different from what you experience.
Specifically, only when using "rotate" would a campaign hang at the 2nd run.
Instead, in the scenario you face, the process breaks.

The stack trace you reported mentions only internal elements of the multiprocessing module.
Looking around, I found
https://bugs.python.org/issue47029
now migrated here
python/cpython#91185

The reported stack trace seems to match yours.
Which version of Python are you using?
Installed via conda or directly from python.org?
Looking back at my dev environment, I used 3.10.11 (installed via conda)
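For reference, both questions can be answered from inside the interpreter (the path in sys.executable usually reveals a conda env, and conda builds typically mention the packager in sys.version):

import sys

print(sys.version)     # e.g. "3.10.11 | packaged by conda-forge | ..."
print(sys.executable)  # a path under .../envs/ suggests a conda install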

If you share the output of your python -m pip list --format freeze I can try to replicate your specific setting.

Thank you once again for your detailed response!
In fact, I completely followed the process you provided on https://tcbenchstack.github.io/tcbench/papers/imc23/artifacts/

To set up the environment and install tcbench, I used the conda create -n tcbench python=3.10 pip and python -m pip install tcbench[dev] commands.
1.txt provides the result of python -m pip list --format freeze

I compared it with requirements-imc23.txt under the imc23 branch and indeed found some differences.
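For reference, such a comparison can be scripted; a minimal sketch, assuming both files (1.txt from this thread and requirements-imc23.txt) use the plain name==version pip freeze format:

def load(path):
    # Parse "name==version" lines into a dict (skips anything else).
    with open(path) as f:
        return dict(line.strip().split("==", 1) for line in f if "==" in line)

mine = load("1.txt")
ref = load("requirements-imc23.txt")
for name in sorted(set(mine) | set(ref)):
    if mine.get(name) != ref.get(name):
        print(f"{name}: mine={mine.get(name)} imc23={ref.get(name)}")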

Therefore, I created a new conda environment and used the pip install -r requirements-imc23.txt command to rebuild it. Since there was an error with tcbench==0.0.16, I changed it to version 0.0.17 on my own initiative.
The versions of the other packages remain the same, but the result is still the same as before.