A synchronization problem and ICE networking error

Question

metacret opened this issue 13 years ago

Jae Hyeon Bae commented 13 years ago

Hi

While running Yahoo LDA on my Hadoop cluster, I ran into the following problems:

1. Permission denied for the executables contained in the jar package
To resolve this, I added chmod 755 $LDALibs/* to Formatter.sh and LDA.sh.

2. Synchronization problem with global/lda.dict.dump
I found that if other processes run the following command before process 0 has finished writing global/lda.dict.dump:

${HADOOP_CMD} dfs -get ${mapred_output_dir}/global/lda.dict.dump lda.dict.dump.global

they cannot download the file and the whole job crashes. So I added synchronization code such as wait_for 60 ${mapred_output_dir}/global/lda.dict.dump.

3. The critical problem with multi-machine Yahoo LDA

Finally, I hit the following error. It is not related to the run scripts, so how can I recover from it?

1020 03:57:06.626588 20423 Merge_Topic_Counts.cpp:103] Initializing global dictionary from lda.dict.dump.global
W1020 03:57:11.659412 20423 Merge_Topic_Counts.cpp:105] global dictionary Initialized
terminate called after throwing an instance of 'Ice::ConnectionLostException'
what(): TcpTransceiver.cpp:248: Ice::ConnectionLostException:
connection lost: Connection reset by peer

Should I modify the LDA.sh script to check the exit code of each module and rerun it until it succeeds?

Thank you!
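For the synchronization problem in item 2, the idea of waiting for the dump before fetching it could be sketched roughly as below. This is only an illustration: `wait_for_hdfs_file` is a hypothetical helper, not the `wait_for` function that ships with Yahoo LDA, and it assumes `${HADOOP_CMD} dfs -test -e` is available on the cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of Yahoo LDA): poll HDFS until a path exists.
# Usage: wait_for_hdfs_file <poll_seconds> <max_attempts> <hdfs_path>
wait_for_hdfs_file() {
    poll_seconds=$1
    max_attempts=$2
    hdfs_path=$3
    attempt=1
    # `dfs -test -e` exits with 0 only when the path exists.
    until ${HADOOP_CMD:-hadoop} dfs -test -e "$hdfs_path"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "timed out waiting for $hdfs_path" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$poll_seconds"
    done
    return 0
}
```

It would be called right before the download, for example `wait_for_hdfs_file 60 10 ${mapred_output_dir}/global/lda.dict.dump`, where the poll interval and attempt count are arbitrary. Note that a file can become visible before the writer has closed it, so waiting on a separate marker that is written after the dump would be more robust than testing for the dump itself.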

Shravan M Narayanamurthy · Answer 1 · Fri Nov 18 2011 17:03:27 GMT+0800 (China Standard Time)

On Friday 21 October 2011 03:23 AM, metacret wrote:

Hi

While running Yahoo LDA on my Hadoop cluster, I ran into the following problems:

1. Permission denied for the executables contained in the jar package

To resolve this, I added chmod 755 $LDALibs/* to Formatter.sh and LDA.sh.

cool!

2. Synchronization problem with global/lda.dict.dump
I found that if other processes run the following command before process 0 has finished writing global/lda.dict.dump:

${HADOOP_CMD} dfs -get ${mapred_output_dir}/global/lda.dict.dump lda.dict.dump.global

I dunno why you say that other processes try to get the global
dictionary before it's written. There is already a
wait_for_all 60 ${synch_dir}"/global_dict";

that takes care of it.

they cannot download the file and the whole job crashes. So I added synchronization code such as wait_for 60 ${mapred_output_dir}/global/lda.dict.dump.

3. The critical problem with multi-machine Yahoo LDA

Finally, I hit the following error. It is not related to the run scripts, so how can I recover from it?

1020 03:57:06.626588 20423 Merge_Topic_Counts.cpp:103] Initializing global dictionary from lda.dict.dump.global
W1020 03:57:11.659412 20423 Merge_Topic_Counts.cpp:105] global dictionary Initialized
terminate called after throwing an instance of 'Ice::ConnectionLostException'
what(): TcpTransceiver.cpp:248: Ice::ConnectionLostException:
connection lost: Connection reset by peer

This is more of a Hadoop problem. A connection can be lost for many
reasons beyond the control of LDA. So it's only the checkpointing &
restart mechanism that will take care of these. You need to worry about
these. LDA is configured to automatically restart from the last
checkpointed iteration.

Thanks,
--Shravan
PS: Sorry for the late response. Was really busy with some other stuff
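Since LDA is said to restart from its last checkpointed iteration, the retry loop asked about in the question could be sketched along these lines. `retry` is a hypothetical wrapper, not something LDA.sh provides, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# Hypothetical wrapper: rerun a command until it exits 0 or attempts run out.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    max_attempts=$1
    delay_seconds=$2
    shift 2
    attempt=1
    while :; do
        status=0
        # Capture the exit code without tripping `set -e` in the caller.
        "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts (last exit code $status)" >&2
            return "$status"
        fi
        echo "attempt $attempt failed with exit code $status; retrying" >&2
        attempt=$((attempt + 1))
        sleep "$delay_seconds"
    done
}
```

A step in the run script would then be wrapped as `retry 3 30 <command>`. This only helps if the rerun command really does pick up from the last checkpoint; otherwise each retry repeats the work from the start.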
