Is there a way to reset ZMQ_PUSH underlying pipe without re-creating another ZMQ_PUSH zmq_socket?
Lounarok opened this issue · comments
Issue description
Hi,
I have a case where a ZMQ_PUSH client gets EAGAIN forever.
The only way to recover is to restart the application (I haven't tried re-creating the zmq socket, but I guess that would work).
Re-creating the ZMQ_PUSH socket might lose all the messages in its queue, so I'd like to know if there is another way to just fix the pipe.
Or may I just zmq_connect() to the same endpoint again, i.e. create a new pipe?
Because I cannot reproduce the problem, I'm just wondering whether zmq_disconnect() plus zmq_connect() would help.
Edit: I checked the code, and it seems that zmq_disconnect() plus zmq_connect() will destroy the pipe and its queue.
It seems that zmq::session_base_t::reconnect () calls _pipe->hiccup (), but I didn't find a manual way to re-trigger reconnect ().
Setup
The problem occurs in a complex setup where an haproxy uses a shared IP to round-robin messages to its backends.
In rare cases under high load, the zmq pipe state machine seems to be confused by haproxy, and the ZMQ_PUSH socket's pipe stays in an incorrect state forever. It cannot be reproduced locally.
This is a complex setup and I cannot change it.
ZMQ_PUSH                                      ZMQ_PULL
client1                                       server1
client2 -------> haproxy (shared vip) --->    server2
client3                                       server3
...more clients                               ... more servers
Environment
- libzmq version (commit hash if unreleased): 4.3.2, 4.3.4. Two different applications using different libzmq versions show the same result. Both applications connect to the shared vip and they share the same set of ZMQ_PULL servers
- OS: ubuntu 20.04
Minimal test code / Steps to reproduce the issue
- There are 4 threads, and each thread has its own zmq_socket.
- Each zmq_socket connects to the same tcp://10.1.1.1:5555 once, while the process is starting up.
- Each zmq_socket sends with zmq_msg_send(frame, zmq_socket, ZMQ_DONTWAIT)
- netstat -tupn | grep '10.1.1.1:5555' shows 4 connections to the haproxy vip.
- All zmq_socket instances have ZMQ_TCP_KEEPALIVE enabled; all other options are left at their defaults.
What's the actual result? (include assertion message & call stack if applicable)
- When the issue occurs, errno is always EAGAIN after zmq_msg_send()
- netstat -tupn | grep '10.1.1.1:5555' will show 0~3 connections.
- Missing connections are not re-established (tested for 8 hours) and they stay absent until the process restarts.
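To tell the stuck state described above apart from a short-lived EAGAIN, an application could track how long non-blocking sends have been failing without a single success. The following is only a minimal sketch: send_stall_t, send_stall_record() and send_stall_exceeded() are hypothetical names and not part of libzmq, and the helper only looks at the return value and errno of zmq_msg_send(), so it says nothing about why the pipe is stuck.

```c
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper, not part of libzmq: tracks how long non-blocking
   sends have been failing with EAGAIN without a single success. */
typedef struct {
    time_t first_eagain;    /* start of the current EAGAIN run; valid only while failures > 0 */
    unsigned long failures; /* consecutive EAGAIN results */
} send_stall_t;

/* Record the outcome of one zmq_msg_send(..., ZMQ_DONTWAIT) call.
   rc is its return value, err is errno captured right after the call. */
void send_stall_record(send_stall_t *s, int rc, int err, time_t now)
{
    if (rc >= 0) {
        /* A successful send ends the current run of failures. */
        s->first_eagain = 0;
        s->failures = 0;
    } else if (err == EAGAIN) {
        if (s->failures == 0)
            s->first_eagain = now;
        s->failures++;
    }
    /* Any other error is left to the caller to handle. */
}

/* True once EAGAIN has persisted for at least limit_sec seconds. */
bool send_stall_exceeded(const send_stall_t *s, time_t now, double limit_sec)
{
    return s->failures > 0 && difftime(now, s->first_eagain) >= limit_sec;
}
```

Each sending thread would keep one send_stall_t per socket, call send_stall_record() with time(NULL) after every zmq_msg_send(), and log or alert when send_stall_exceeded() returns true. The time limit is an application choice, not a value from libzmq.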
What's the expected result?
- EAGAIN should go away after a while, but in fact the result is always EAGAIN.
- Connections should be re-established after a while.