Evaluator hangs while training

Question

jiqiujia opened this issue 2 years ago

Environment:

Python version 3.7
Spark version 2.4
TensorFlow version 2.5
TensorFlowOnSpark version 2.2.3
Cluster version hadoop

Describe the bug:
I found that the evaluator node stops working after some time, while the training nodes keep working fine and the cluster as a whole doesn't crash. Training runs for 80000 steps in total, but the evaluator only evaluates up to step 10000+. After that, it outputs no more logs.

Lee Yang · Answer 1 · Wed Aug 03 2022 00:44:36 GMT+0800 (China Standard Time)

I don't see anything obvious in your logs. Given that the evaluator process appears to have stalled or quit, I'd check CPU and memory usage on that node (while it's running) for more clues. You can also try running the TF cluster at a smaller scale on a single node without Spark, by running the code in separate processes using TF_CONFIG, i.e. using distributed TF by itself. With local processes, you should be able to debug the evaluator node more easily and see why it may be stalling.
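
For reference, here is a minimal sketch of what running the cluster as local processes with TF_CONFIG could look like. It is only an illustration under assumptions: the ports, the `CLUSTER` layout, the helper names `make_tf_config`, `launch` and `main`, and the script name `train.py` are all hypothetical, and it assumes your training script reads TF_CONFIG from the environment (as Estimator-style `train_and_evaluate` code does, where the evaluator is a special task that is not listed in the cluster spec).

```python
import json
import os
import subprocess
import sys

# Hypothetical local cluster layout; the ports are placeholders.
# The evaluator is deliberately not listed here: it only gets a "task" entry.
CLUSTER = {
    "chief": ["localhost:2222"],
    "worker": ["localhost:2223"],
}


def make_tf_config(task_type, task_index, cluster=CLUSTER):
    """Return the TF_CONFIG JSON string for one local process."""
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


def launch(script, task_type, task_index):
    """Start `script` in its own process with its own TF_CONFIG and log file."""
    env = dict(os.environ, TF_CONFIG=make_tf_config(task_type, task_index))
    # The child keeps its own copy of the log file descriptor,
    # so closing the file in the parent is safe.
    with open(f"{task_type}_{task_index}.log", "w") as log:
        return subprocess.Popen(
            [sys.executable, script],
            env=env,
            stdout=log,
            stderr=subprocess.STDOUT,
        )


def main(script):
    """Launch chief, worker and evaluator locally and wait for all of them."""
    tasks = [("chief", 0), ("worker", 0), ("evaluator", 0)]
    procs = [launch(script, task_type, index) for task_type, index in tasks]
    for proc in procs:
        proc.wait()
```

Calling `main("train.py")` (with your own script in place of the hypothetical `train.py`) would give each role a separate log file such as `evaluator_0.log`, so you can watch the evaluator on its own, attach a debugger to it, or check its CPU and memory usage while the training processes keep running.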