Angel-ML / sona

Spark On Angel, arming Spark with a powerful Parameter Server, which enables Spark to train very large models

Repository from Github https://github.com/Angel-ML/sona

Bug when running the demo of the latest SONA version

lcx517 opened this issue

Hi, I'm running the SONA example and it FAILED; the stdout log is below.
PLEASE HELP~~

2019-12-26 14:09:19 INFO  SignalUtils:54 - Registered signal handler for TERM
2019-12-26 14:09:19 INFO  SignalUtils:54 - Registered signal handler for HUP
2019-12-26 14:09:19 INFO  SignalUtils:54 - Registered signal handler for INT
2019-12-26 14:09:19 INFO  SecurityManager:54 - Changing view acls to: deepthought
2019-12-26 14:09:19 INFO  SecurityManager:54 - Changing modify acls to: deepthought
2019-12-26 14:09:19 INFO  SecurityManager:54 - Changing view acls groups to: 
2019-12-26 14:09:19 INFO  SecurityManager:54 - Changing modify acls groups to: 
2019-12-26 14:09:19 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(deepthought); groups with view permissions: Set(); users  with modify permissions: Set(deepthought); groups with modify permissions: Set()
2019-12-26 14:09:20 INFO  UserGroupInformation:964 - Login successful for user deepthought using keytab file deepthought.keytab-4169bc48-f895-42c2-9dde-091feb49f3c5
2019-12-26 14:09:20 INFO  ApplicationMaster:54 - Preparing Local resources
2019-12-26 14:09:22 WARN  Client:677 - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2019-12-26 14:09:28 INFO  ApplicationMaster:54 - ApplicationAttemptId: appattempt_1576380960005_2467808_000001
2019-12-26 14:09:28 INFO  AMCredentialRenewer:54 - Scheduling login from keytab in 64776907 millis.
2019-12-26 14:09:28 INFO  ApplicationMaster:54 - Starting the user application in a separate Thread
2019-12-26 14:09:28 ERROR ApplicationMaster:91 - Uncaught exception: 
java.lang.ClassNotFoundException: org.apache.spark.angel.examples.JsonRunnerExamples
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.apache.spark.deploy.yarn.ApplicationMaster.startUserApplication(ApplicationMaster.scala:715)
	at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:491)
	at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:815)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
	at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:814)
	at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:839)
	at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
2019-12-26 14:09:28 INFO  ApplicationMaster:54 - Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.lang.ClassNotFoundException: org.apache.spark.angel.examples.JsonRunnerExamples)
2019-12-26 14:09:28 INFO  ShutdownHookManager:54 - Shutdown hook called

My SONA-example script:

source ./spark-on-angel-env.sh
export HADOOP_CONF_DIR=/usr/lib/hadoop/etc/hadoop

$SPARK_HOME/bin/spark-submit \
        --master yarn-cluster \
        --driver-java-options "-Djava.library.path=/usr/lib/hadoop/lib/native" \
        --keytab /home/deepthought/deepthought.keytab \
        --principal deepthought \
        --queue longyuan.p0 \
	--conf spark.ps.jars=$SONA_ANGEL_JARS \
	--conf spark.ps.instances=10 \
	--conf spark.ps.cores=2 \
	--conf spark.ps.memory=6g \
	--jars $SONA_SPARK_JARS\
	--name "LR-spark-on-angel" \
	--files /data/angel/sona-0.1.0-bin/jsons/logreg.json \
	--driver-memory 10g \
	--num-executors 10 \
	--executor-cores 2 \
	--executor-memory 4g \
	--class org.apache.spark.angel.examples.JsonRunnerExamples \
	./../lib/angelml-${SONA_VERSION}.jar \
	data:viewfs://hadoop-bd/user/deepthought/test/angel/sona-0.1.0-bin/data/angel/a9a/a9a_123d_train.libsvm \
	modelPath:viewfs://hadoop-bd/user/deepthought/test/output \
	jsonFile:./lr.json \
	lr:0.1

And my spark-on-angel-env.sh:

export JAVA_HOME=/usr
export HADOOP_HOME=/usr/lib/hadoop
export SPARK_HOME=/usr/local/spark/spark-2.3.1-bin-hadoop2.6
export SONA_HOME=/data/angel/sona-0.1.0-bin
export SONA_HDFS_HOME=viewfs://hadoop-bd/user/deepthought/test/angel/sona-0.1.0-bin
export SONA_VERSION=0.1.0
export ANGEL_VERSION=3.0.1
export ANGEL_UTILS_VERSION=0.1.1
export ANGEL_MLCORE_VERSION=0.1.2

...<not changed default content below>...

The class name has already changed, while the doc is outdated!

You need to change "org.apache.spark.angel.examples.JsonRunnerExamples" to "com.tencent.angel.sona.examples.JsonRunnerExamples".
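If you want to confirm which package a class actually lives in before editing `--class`, you can list the entries of the application jar. The following is only a rough sketch: the helper names `entry_to_class` and `find_class_in_jar` are made up for illustration, and it assumes `unzip` is installed on the submitting machine (`jar tf` prints a similar listing).

```shell
# Hypothetical helpers, not part of SONA or Spark.

# Turn a jar entry path such as a/b/C.class into the class name a.b.C.
entry_to_class() {
  printf '%s\n' "${1%.class}" | tr '/' '.'
}

# Print the fully qualified name of every class in jar $1
# whose simple name is exactly $2.
find_class_in_jar() {
  # unzip -Z1 lists one entry name per line (assumes unzip is available).
  unzip -Z1 "$1" | grep -E "(^|/)$2\.class\$" | while IFS= read -r entry; do
    entry_to_class "$entry"
  done
}
```

For example, running `find_class_in_jar ./../lib/angelml-${SONA_VERSION}.jar JsonRunnerExamples` from the directory of the submit script should print the name to pass to `--class`. If it prints nothing, the class is not in that jar at all.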

Good luck~