Evaluator hangs while training
jiqiujia opened this issue · comments
Environment:
- Python version 3.7
- Spark version 2.4
- TensorFlow version 2.5
- TensorFlowOnSpark version 2.2.3
- Cluster version hadoop
Describe the bug:
The evaluator node stops working after some time, while the training nodes keep running fine and the cluster as a whole doesn't crash. Training runs for 80000 steps in total, but the evaluator only evaluates up to step 10000+. After that it outputs no more logs.
I don't see anything obvious in your logs. Given that it looks like the evaluator process stalled or quit, I'd check CPU and memory usage on that node while it's running to get more clues. You can also try running the TF cluster at a smaller scale on a single node without Spark, by launching the code in separate processes configured through TF_CONFIG, i.e. using distributed TF by itself. With local processes, you should be able to debug the evaluator more easily and see why it may be stalling.
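For the local, Spark-free run suggested above, here is a minimal sketch of generating one TF_CONFIG value per process. The ports, the role layout, and the script name `train.py` are placeholders, not taken from this issue. The sketch also assumes the evaluator gets its own task entry but is not listed in the cluster spec, which is the convention used with `tf.estimator.train_and_evaluate`; adjust it if your setup differs.

```python
import json


def make_tf_configs(ports_by_role, host="localhost", with_evaluator=True):
    """Return {(role, index): TF_CONFIG JSON string} for a local test cluster.

    ports_by_role maps a training role ("chief", "worker", "ps") to a list
    of ports. All values here are hypothetical and only for local debugging.
    """
    cluster = {
        role: [f"{host}:{port}" for port in ports]
        for role, ports in ports_by_role.items()
    }
    # One task entry per address in the training cluster.
    tasks = [(role, i) for role, addrs in cluster.items() for i in range(len(addrs))]
    if with_evaluator:
        # Assumption: the evaluator is a separate task outside the cluster spec.
        tasks.append(("evaluator", 0))
    return {
        (role, index): json.dumps(
            {"cluster": cluster, "task": {"type": role, "index": index}}
        )
        for role, index in tasks
    }


# Print one shell command per process; run each in its own terminal.
for (role, index), cfg in make_tf_configs({"chief": [2222], "worker": [2223]}).items():
    print(f"# {role}:{index}")
    print(f"TF_CONFIG='{cfg}' python train.py")
```

Running the evaluator command in its own terminal keeps its stdout and stderr separate from the training processes, so a stall is easier to spot and the process can be inspected directly with a debugger or a system monitor.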