A synchronization problem and ICE networking error
metacret opened this issue
Hi
When running Yahoo LDA on my Hadoop cluster, I found the following problems:
- Permission denied for executables contained in the jar package
- To resolve this issue, I added chmod 755 $LDALibs/* to Formatter.sh and LDA.sh
- Synchronization problem with global/lda.dict.dump
I found that if other processes run the following command before process 0 has finished writing global/lda.dict.dump:
${HADOOP_CMD} dfs -get ${mapred_output_dir}/global/lda.dict.dump lda.dict.dump.global
they cannot download the file and the whole job crashes. So I added synchronization code such as wait_for 60 ${mapred_output_dir}/global/lda.dict.dump.
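A minimal sketch of the kind of wait described above, assuming a polling loop is acceptable. The helper name wait_until_ok and its timeout/interval arguments are hypothetical and not part of Yahoo LDA; the source does not say what the 60 in wait_for means, so this sketch does not try to reproduce it. The commented example uses the Hadoop shell's -test -e check.

```shell
#!/bin/sh
# Hypothetical helper (not part of Yahoo LDA): poll a command until it
# succeeds or a timeout expires.
# usage: wait_until_ok TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until_ok() {
    timeout=$1
    interval=$2
    shift 2
    waited=0
    # Rerun the check command until it exits 0.
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${waited}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example (assumes HADOOP_CMD and mapred_output_dir are set as in LDA.sh):
# wait_until_ok 600 5 ${HADOOP_CMD} dfs -test -e "${mapred_output_dir}/global/lda.dict.dump" &&
#     ${HADOOP_CMD} dfs -get "${mapred_output_dir}/global/lda.dict.dump" lda.dict.dump.global
```

Note that a file can become visible in HDFS before the writer has finished it, so an existence check alone may not be enough. Waiting on a separate marker that is created only after the dump is complete would likely be safer.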
- A critical problem with multi-machine Yahoo LDA
Finally, I hit the following problem. It is not related to the run scripts, so how can I recover from this situation?
1020 03:57:06.626588 20423 Merge_Topic_Counts.cpp:103] Initializing global dictionary from lda.dict.dump.global
W1020 03:57:11.659412 20423 Merge_Topic_Counts.cpp:105] global dictionary Initialized
terminate called after throwing an instance of 'Ice::ConnectionLostException'
what(): TcpTransceiver.cpp:248: Ice::ConnectionLostException:
connection lost: Connection reset by peer
Should I modify the LDA.sh script to check the exit code of each module and rerun it until it succeeds?
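The idea in this question could be sketched roughly as follows. The retry function, its attempt limit, and its delay are hypothetical and not part of LDA.sh; some_step stands in for whichever module invocation is being wrapped. Whether rerunning a module is safe depends on that module resuming correctly from its last checkpoint.

```shell
#!/bin/sh
# Hypothetical helper (not part of LDA.sh): rerun a command until it exits 0,
# giving up after MAX_ATTEMPTS tries.
# usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        status=0
        "$@" || status=$?          # capture the exit code of this attempt
        [ "$status" -eq 0 ] && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts (exit $status): $*" >&2
            return "$status"
        fi
        echo "attempt $attempt failed (exit $status); retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example (some_step is a placeholder for a module invocation):
# retry 3 30 some_step arg1 arg2 || exit 1
```

Bounding the number of attempts avoids looping forever when the failure is not transient.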
Thank you!
On Friday 21 October 2011 03:23 AM, metacret wrote:
Hi
When running Yahoo LDA on my Hadoop cluster, I found the following problems:
- Permission denied for executables contained in the jar package
- To resolve this issue, I added chmod 755 $LDALibs/* to Formatter.sh and LDA.sh
cool!
2. Synchronization problem with global/lda.dict.dump
I found that if other processes run the following command before process 0 has finished writing global/lda.dict.dump:
${HADOOP_CMD} dfs -get ${mapred_output_dir}/global/lda.dict.dump lda.dict.dump.global
I don't know why you say that other processes try to get the global dictionary before it's written. There is already a
wait_for_all 60 ${synch_dir}"/global_dict";
that takes care of it.
they cannot download the file and the whole job crashes. So I added synchronization code such as wait_for 60 ${mapred_output_dir}/global/lda.dict.dump.
- A critical problem with multi-machine Yahoo LDA
Finally, I hit the following problem. It is not related to the run scripts, so how can I recover from this situation?
1020 03:57:06.626588 20423 Merge_Topic_Counts.cpp:103] Initializing global dictionary from lda.dict.dump.global
W1020 03:57:11.659412 20423 Merge_Topic_Counts.cpp:105] global dictionary Initialized
terminate called after throwing an instance of 'Ice::ConnectionLostException'
what(): TcpTransceiver.cpp:248: Ice::ConnectionLostException:
connection lost: Connection reset by peer
This is more of a Hadoop problem. A connection can be lost for many reasons beyond the control of LDA, so it's only the checkpointing and restart mechanism that will take care of these. You need to worry about these. LDA is configured to automatically restart from the last checkpointed iteration.
Thanks,
--Shravan
PS: Sorry for the late response. Was really busy with some other stuff
Should I modify the LDA.sh script to check the exit code of each module and rerun it until it succeeds?
Thank you!