deeplearning4j / deeplearning4j-examples

Deeplearning4j Examples (DL4J, DL4J Spark, DataVec)

Home Page: http://deeplearning4j.konduit.ai

DL4J with SharedTrainingMaster on Spark reports "ERROR - publicationUnblockTimeoutNs=15000000000 <= clientLivenessTimeoutNs=30000000000"

byanjie opened this issue · comments

Issue Description

The node IPs are 10.0.6.201 through 10.0.6.204. In VoidConfiguration I set networkMask=10.0.6.0/16 and the UDP port to 40123. When deployed in Spark Standalone mode, the job keeps failing with: "Caused by: io.aeron.exceptions.ConfigurationException: ERROR - publicationUnblockTimeoutNs=15000000000 <= clientLivenessTimeoutNs=30000000000". The firewall is disabled on all Spark cluster servers.
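
As a side check on the network mask above, the following sketch applies plain CIDR arithmetic to the addresses from this report. It is only an illustration: the class `CidrCheck` and its methods are hypothetical helpers, and how VoidConfiguration itself interprets networkMask is not shown in this thread. With a /16 prefix only the first two octets are compared, so 10.0.6.0/16 covers every 10.0.x.x address, including all four nodes; a /24 prefix would restrict the match to 10.0.6.x.

```java
/** Hypothetical helper: generic IPv4 CIDR matching, not part of DL4J. */
final class CidrCheck {

    /** Packs a dotted-quad IPv4 string into a 32-bit int. */
    static int toInt(String ipv4) {
        String[] octets = ipv4.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ipv4);
        }
        int value = 0;
        for (String octet : octets) {
            int n = Integer.parseInt(octet);
            if (n < 0 || n > 255) {
                throw new IllegalArgumentException("Octet out of range: " + octet);
            }
            value = (value << 8) | n;
        }
        return value;
    }

    /** Returns true if ip falls inside the network given as "address/prefix". */
    static boolean inSubnet(String ip, String cidr) {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);
        // A prefix of 0 matches everything; otherwise keep the top `prefix` bits.
        int mask = (prefix == 0) ? 0 : (-1 << (32 - prefix));
        return (toInt(ip) & mask) == (toInt(parts[0]) & mask);
    }

    public static void main(String[] args) {
        // Node addresses and mask taken from the issue description.
        for (int last = 201; last <= 204; last++) {
            String ip = "10.0.6." + last;
            System.out.println(ip + " in 10.0.6.0/16: " + inSubnet(ip, "10.0.6.0/16"));
        }
    }
}
```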

Version Information

Please indicate relevant versions, including, if relevant:

  • Deeplearning4j version: 1.0.0-M2
  • Platform information (OS, etc.): CentOS 7.9
  • CUDA version, if used: none (CPU only)
  • NVIDIA driver version, if in use

Contributing

If you'd like to help us fix the issue by contributing some code, but would
like guidance or help in doing so, please mention it!

@byanjie tweak the liveness configuration in Aeron itself. The error is right there:
Caused by: io.aeron.exceptions.ConfigurationException: ERROR - publicationUnblockTimeoutNs=15000000000 <= clientLivenessTimeoutNs=30000000000

You can find all of the relevant Aeron overrides here: https://github.com/real-logic/aeron/blob/master/aeron-driver/src/main/java/io/aeron/driver/Configuration.java

If you want further support, please post on the community forums: https://community.konduit.ai/. This repo is not monitored much.
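
The constraint behind this message can be sketched in a few lines. This is a minimal illustration, not DL4J or Aeron code: the class `AeronTimeoutCheck` and its methods are hypothetical helpers, and the 30-second liveness value is simply the number taken from the error message. The property name `aeron.publication.unblock.timeout` is the one quoted in this thread; the error text implies the unblock timeout has to be strictly greater than the client liveness timeout, and both values are in nanoseconds.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: checks and sets the Aeron unblock timeout in the current JVM. */
final class AeronTimeoutCheck {

    // System property read by the Aeron driver configuration (value in nanoseconds).
    static final String UNBLOCK_PROP = "aeron.publication.unblock.timeout";

    /** Mirrors the failing condition: the setup is rejected when unblock <= liveness. */
    static boolean isValid(long unblockNs, long livenessNs) {
        return unblockNs > livenessNs;
    }

    /**
     * Sets the unblock timeout for this JVM only. It has to run before the
     * Aeron media driver is created, otherwise the old value is already in use.
     */
    static void apply(long unblockNs, long livenessNs) {
        if (!isValid(unblockNs, livenessNs)) {
            throw new IllegalArgumentException(
                "unblock timeout " + unblockNs + " ns must exceed liveness timeout "
                    + livenessNs + " ns");
        }
        System.setProperty(UNBLOCK_PROP, Long.toString(unblockNs));
    }

    public static void main(String[] args) {
        long livenessNs = TimeUnit.SECONDS.toNanos(30); // value from the error message
        long defaultUnblockNs = TimeUnit.SECONDS.toNanos(15); // value from the error message

        // The reported combination fails the check.
        System.out.println("15s vs 30s valid? " + isValid(defaultUnblockNs, livenessNs));

        // Raising the unblock timeout to 60 s satisfies it.
        apply(TimeUnit.SECONDS.toNanos(60), livenessNs);
        System.out.println(UNBLOCK_PROP + "=" + System.getProperty(UNBLOCK_PROP));
    }
}
```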

But I have already configured setProperty("aeron.publication.unblock.timeout", "60000000000"); and the error is still reported in Spark cluster mode.
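
One possible explanation, not confirmed anywhere in this thread, is that the setProperty call runs only in the Spark driver JVM. System.setProperty affects just the JVM that calls it, so executor JVMs on other nodes would still start with the 15-second value shown in the error. The sketch below builds a `-D` option string that could be handed to executors through Spark's `spark.executor.extraJavaOptions` setting. The helper class `ExecutorJvmOptions` is hypothetical, and whether this resolves the error on DL4J 1.0.0-M2 is untested.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: turns system properties into a JVM option string. */
final class ExecutorJvmOptions {

    /** Renders each entry as -Dkey=value, separated by spaces, in insertion order. */
    static String toJavaOptions(Map<String, String> props) {
        return props.entrySet().stream()
            .map(e -> "-D" + e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        // Same property and value as in the comment above (60 s in nanoseconds).
        props.put("aeron.publication.unblock.timeout", "60000000000");

        String options = toJavaOptions(props);

        // Printed in the form accepted by spark-submit's --conf flag.
        System.out.println("--conf \"spark.executor.extraJavaOptions=" + options + "\"");
    }
}
```

The printed flag would be added to the spark-submit command line so that each executor JVM starts with the property already set, before any Aeron driver is created there.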

@byanjie then please post on the community forums. If you want help and you're asking for our time, the least you can do is go where other people can benefit from the discussion. Thanks.