SpartanJ / efsw

efsw is a C++ cross-platform file system watcher and notifier.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

[Bug] Crash while thread running after FileWatcherInotfiy destructed.

ashione opened this issue · comments

My main instruction likes

int main () {
FileWatcherInotify A = new FileWatcherInotify(...);
A.watch('...');
delete A;
return 0;
}

There are two threads cross running in process,

  • main thread
  • watcher thread

We found watcher thread crashed and this backtrace remained:

#4  0x00007f78d88248f3 in uw_update_context () from /usr/lib64/libgcc_s.so.1
#5  0x00007f78d88254ae in _Unwind_Backtrace () from /usr/lib64/libgcc_s.so.1
#6  0x00007f78d8759946 in backtrace () from /usr/lib64/libc.so.6
#7  0x00007f78da46f8cf in google::GetStackTrace(void**, int, int) () from /home/admin/.pyenv/versions/3.6.13/lib/python3.6/site-packages/ray/cpp/lib/libray_api.so
#8  0x00007f78da4707f9 in google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) ()
   from /home/admin/.pyenv/versions/3.6.13/lib/python3.6/site-packages/ray/cpp/lib/libray_api.so
#9  <signal handler called>
#10 0x00007f7876d34f2d in efsw::Private::ThreadMemberFunc<efsw::FileWatcherInotify>::run (this=<error reading variable: Cannot access memory at address 0x27f7773ffe798>)
    at bazel-out/k8-dbg/bin/external/com_github_efsw/_virtual_includes/efsw/efsw/Thread.hpp:78

main thread stack :

#0  0x00007f78d862cac7 in __pthread_timedjoin_ex () from /usr/lib64/libpthread.so.0
#1  0x00007f7876d43541 in efsw::Platform::ThreadImpl::wait (this=0x7f777cfa36a0) at external/com_github_efsw/src/efsw/platform/posix/ThreadImpl.cpp:25
#2  0x00007f7876d3de3b in efsw::Thread::wait (this=0x7f777c14c2a0) at external/com_github_efsw/src/efsw/Thread.cpp:22
#3  0x00007f7876d3dd48 in efsw::Thread::~Thread (this=0x7f777c14c2a0, __in_chrg=<optimized out>) at external/com_github_efsw/src/efsw/Thread.cpp:9
#4  0x00007f7876d3dd98 in efsw::Thread::~Thread (this=0x7f777c14c2a0, __in_chrg=<optimized out>) at external/com_github_efsw/src/efsw/Thread.cpp:12
#5  0x00007f7876d2cf0b in efsw::FileWatcherInotify::~FileWatcherInotify (this=0x7f7768002fe0, __in_chrg=<optimized out>)
    at external/com_github_efsw/src/efsw/FileWatcherInotify.cpp:50
#6  0x00007f7876d2d0d6 in efsw::FileWatcherInotify::~FileWatcherInotify (this=0x7f7768002fe0, __in_chrg=<optimized out>)
    at external/com_github_efsw/src/efsw/FileWatcherInotify.cpp:68
#7  0x00007f7876d2b351 in efsw::FileWatcher::~FileWatcher (this=0x7f777c847160, __in_chrg=<optimized out>) at external/com_github_efsw/src/efsw/FileWatcher.cpp:67
#8  0x00007f7876d2b378 in efsw::FileWatcher::~FileWatcher (this=0x7f777c847160, __in_chrg=<optimized out>) at external/com_github_efsw/src/efsw/FileWatcher.cpp:68

According to above stackes, it's watcher deletion happened before watcher entry_point running, which means m_objest had been deleted and memory undetected in that time

// Specialization using a member function
template <typename C> struct ThreadMemberFunc : ThreadFunc {
	ThreadMemberFunc( void ( C::*function )(), C* object ) :
		m_function( function ), m_object( object ) {}
	virtual void run() { ( m_object->*m_function )(); }
	void ( C::*m_function )();
	C* m_object;
};

Hi!
Why are you using FileWatcherInotify directly? That class is private. But as far as I can remember it should work if called directly.
I'm unable to reproduce the bug with a minimal example:

#include <efsw/efsw.hpp>

int main( int argc, char** argv ) {
	efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();
	fileWatcher->watch();
	delete fileWatcher;
	return 0;
}

Neither with a watched dir:

#include <efsw/efsw.hpp>
#include <iostream>

class UpdateListener : public efsw::FileWatchListener {
  public:
	UpdateListener() {}

	void handleFileAction( efsw::WatchID, const std::string&, const std::string&, efsw::Action,
						   std::string ) {}
};

int main( int argc, char** argv ) {
	efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();
	UpdateListener* ul = new UpdateListener();
	fileWatcher->addWatch( ".", ul );
	fileWatcher->watch();
	delete fileWatcher;
	delete ul;
	return 0;
}

I'm missing something?

Hi!

Why are you using FileWatcherInotify directly? That class is private. But as far as I can remember it should work if called directly.

I'm unable to reproduce the bug with a minimal example:

#include <efsw/efsw.hpp>



int main( int argc, char** argv ) {

	efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();

	fileWatcher->watch();

	delete fileWatcher;

	return 0;

}

Neither with a watched dir:

#include <efsw/efsw.hpp>

#include <iostream>



class UpdateListener : public efsw::FileWatchListener {

  public:

	UpdateListener() {}



	void handleFileAction( efsw::WatchID, const std::string&, const std::string&, efsw::Action,

						   std::string ) {}

};



int main( int argc, char** argv ) {

	efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();

	UpdateListener* ul = new UpdateListener();

	fileWatcher->addWatch( ".", ul );

	fileWatcher->watch();

	delete fileWatcher;

	delete ul;

	return 0;

}

I'm missing something?

I just want to make someone getting main code path and why it crashed,which does not means i use filewatcherinotfiy directly.
You might catch this from stack filewatcher was used.

In my usage scenario, it will appear once about an hour running. Especially in the case of high load machine, pthread_create returns does not mean that the thread has been created and run entry point function successfully. Actually you can deduce it from the way the code runs, which is indeed possible.
@SpartanJ

Oh, I understand. I thought that your code snipped was a literal example.
If I understand well mIsActive should be only true when entryPoint is called / running and it should be atomic. Do you agree?

Let me know if the last commit helps with your issue.

Let me know if the last commit helps with your issue.

I pack this commit to my program but still crashed. So perhas it not works yet.

Sorry, the fix I sent was for another kind of error and also was worst than before. I misinterpreted the crash completely. I think that now I have a clear picture of the issue but I have no idea how I can avoid it in 100% of the cases.
Reading the Linux documentation says basically that the thread can start at any time, I'm assuming that in your case is taking enough time to be called so late that the owner (the Thread instance and the ThreadImpl instance) does not exist anymore. I assumed that the call to pthread_join should prevent that (the Thread destructor calls to pthread_join before destroying ThreadImpl and its own instance), but it seems that pthread_join does nothing if the thread did not start. I sent another change that will check if the join worked, and if that didn't work will call to a pthread_cancel. If that doesn't work, I actually have no idea how this could be fixed.
In your use case, you're spawning many efsw::FileWatcher and destroying them very quickly or something alike? Otherwise, I don't see how this could happen at all.

Also: I tried to replicate it by doing something stupid like:

int main( int argc, char** argv ) {
	int c = 0;
	UpdateListener* ul = new UpdateListener();
	while ( c++ < 10000 ) {
		efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();
		fileWatcher->addWatch( ".", ul );
		fileWatcher->watch();
		delete fileWatcher;
	}

	delete ul;
	return 0;
}

And pthread_join is doing its job, it always waits for the entryPoint to run at completion.
That confuses me more.

I'm going to figure out the root cause of this issue and upload lastest clues ASAP.

Maybe using c++11 threads helps?

maybe try my PR that uses c++11 primitives for threading and locking:
#170

Thanks for the collaboration. It seems that still needs some work to compile properly. You had this same issue? Because I wasn't able to reproduce it yet.

what needs work? I have no problem comliling?

no i did not, but the report somewhat lowered my trust in the hand rolled thread management, thus I pulled in the c++11 stuff, which I think is the best approach

no i did not, but the report somewhat lowered my trust in the hand rolled thread management, thus I pulled in the c++11 stuff, which I think is the best approach

Perphas it's better way to use standard thread library. Actually it's weird and no more detail information or core dump can be digged out.
I notice that filewatherinotify and some objects might not be threadsafe in high frequency creation or destruction.

I agree, efsw was using custom thread implementation because it was created actually before C++11 and I kept compatibility for older compilers until just recently. But know it's possible to migrate everything to C++11 standard headers. The PR looks OK but I'll not merge it until I have enough time to test it on all platforms.