apnadkarni / iocp

Implements Tcl channels based on Windows I/O completion ports.

Home Page: https://iocp.magicsplat.com

unestablished async connect with non-blocking unclosed socket causes hanging on exit

sebres opened this issue

This is mostly a placeholder for me (I'll investigate more deeply later).

The following script illustrates the freeze:

if {![llength $::argv]} {
  set ::argv [list ::iocp::inet::socket -async localhost 9999]
}
if {"::iocp::inet::socket" in $::argv} {
    load tcliocpsock iocp
}

puts "WARN: ensure there is no listener on port [lindex $::argv end]"

puts "connect: \[$::argv\]"
set ch [{*}$::argv]

chan configure $ch -blocking 0
#chan event $ch writable ...; chan event $ch readable ...

puts "send test data"
chan puts -nonewline $ch zzz; chan flush $ch

# after 100 {puts "done."; set done 1}; vwait done

# puts "close"; close $ch

puts "exit."

Output with some extra debug messages (marked with **):

WARN: ensure there is no listener on port 9999
connect: [::iocp::inet::socket -async localhost 9999]
send test data
exit.
** global NS deleted.
** interp deleted.
!FROZEN!

Prerequisites:

  • there is no listener on the port the script is trying to connect to;
  • the script is executed in a Tcl application with a proper teardown process (the interp of the main thread gets deleted at the end), so it may still work in the stock tclsh with its usual exit (a plain exit() call without proper cleanup);
  • there must be no extra automatic close/GC for abandoned sockets (it does not hang if the socket gets closed before "exiting").
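
As a stopgap, the last prerequisite suggests a script-level workaround: close the abandoned socket explicitly before the script ends. A minimal sketch, assuming the channel name is still known at that point; closeQuietly is a hypothetical helper name, not part of iocp or Tcl:

```tcl
# Hypothetical helper (not part of iocp or Tcl): close a channel if it is
# still registered in this interp, ignoring any error raised by the close
# (for example a failed connect surfacing during the final flush).
# Returns 1 if a close was attempted, 0 if the channel was already gone.
proc closeQuietly {ch} {
    if {$ch ni [chan names]} {
        return 0
    }
    catch {chan close $ch}
    return 1
}
```

Calling closeQuietly $ch right before the final puts "exit." in the script above corresponds to enabling its commented-out close line. Whether this avoids the freeze in every setup has not been verified here.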

With Tcl 8.6 the freeze is only sporadic in the normal case, but it is quite reproducible with Tcl 8.5.
The reason for this difference is TIP #398: 8.6 simply doesn't try to flush non-blocking channels on exit by default. To force the flush, one needs to set the environment variable TCL_FLUSH_NONBLOCKING_ON_EXIT, for example:

set TCL_FLUSH_NONBLOCKING_ON_EXIT=1 && tclsh86 test-sock-async.tcl

Supplying socket -async localhost 9999 to the script (so that it uses the core Tcl sockets) doesn't show the freeze at all.
Execution in a different thread/interp doesn't show it either. It looks like some clean-up handler conflicts or deadlocks with the Tcl API finalization stage.

More debug output shows the difference: the close handler (IocpChannelClose) is always called before the exit in the success case and is not called in the freeze case. Somehow the presence of a pending async flush operation seems to prevent the channel from being deleted after DetachSocket (could this be a ref-count issue?).

OK, this may be an issue with roots similar to #16 (or rather https://core.tcl-lang.org/tcl/tktview?name=75d525d37c); to be precise, it could be a missing handler to signal an interp detach. This is hardly fixable at module scope at the moment (maybe iocp could use different ref-counters to protect the socket and its states).

Meanwhile, it looks like a flush in TclFinalizeIOSubsystem is responsible for the freeze: it simply enters a wait without ever obtaining the completion for it.

This is not easy to fix in the iocp module in its current state, for two reasons. First, the whole handling would probably have to be rewritten to use the Tcl_Async* mechanisms instead of simple event sources. Second, by the time TclFinalizeIOSubsystem is executed, the IocpCompletionThread has already exited (because IocpProcessCleanup closes the overlapped handle) and IocpThreadExitHandler has already been called (because Tcl_FinalizeThread calls the thread exit handlers first).

As an alternative solution, one could try to register IocpProcessCleanup as a late exit handler (called after Tcl_FinalizeThread):

- Tcl_CreateExitHandler(IocpProcessCleanup, NULL);
+ TclCreateLateExitHandler(IocpProcessCleanup, NULL);

(Note the similar bug in the Tcl repo [1028264], which is basically the reason why TclCreateLateExitHandler was introduced in the first place.)
However, TclCreateLateExitHandler is unavailable at module scope (at least without binding to tclInt.h and its stubs).

I guess this is pending for now, unless the module gets back-ported to the Tcl core.

Thanks for the report. I'm out for the next few weeks of summer. Will catch up once I get back.