Is there a way to reset ZMQ_PUSH underlying pipe without re-creating another ZMQ_PUSH zmq_socket?

Question

Lounarok opened this issue 4 months ago

Issue description

Hi,
I have a case where a ZMQ_PUSH client gets EAGAIN on every send, forever.
The only way I have found to recover is to restart the application. (I haven't tried re-creating the zmq socket, but I guess that would work.)
Re-creating the ZMQ_PUSH socket might lose all the messages still in its queue, so I'd like to know whether there is another way to fix just the pipe.
Or could I simply call zmq_connect() on the same endpoint again, i.e. create a new pipe?
Since I cannot reproduce the problem, I'm wondering whether zmq_disconnect() followed by zmq_connect() would help.
Edit: I checked the code, and it seems that zmq_disconnect() followed by zmq_connect() will destroy the pipe and its queue.

It seems that zmq::session_base_t::reconnect () calls _pipe->hiccup (), but I didn't find a way to trigger reconnect () manually.
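Since the failure cannot be reproduced on demand, it may help to at least detect it at runtime. The following is a minimal sketch of a stall detector that reports when non-blocking sends have failed with EAGAIN continuously for longer than a chosen limit. The names eagain_watch_t, eagain_watch_init and eagain_watch_update are hypothetical and not part of libzmq; the caller is assumed to pass in the return value of zmq_msg_send(), the errno captured right after it, and a monotonic timestamp in milliseconds.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of libzmq): tracks how long sends have
   been failing with EAGAIN without a single success in between. */
typedef struct {
    int64_t stall_limit_ms;  /* how long continuous EAGAIN is tolerated */
    int64_t first_eagain_ms; /* start of the current EAGAIN run, -1 if none */
} eagain_watch_t;

void eagain_watch_init(eagain_watch_t *w, int64_t stall_limit_ms)
{
    assert(w != NULL && stall_limit_ms > 0);
    w->stall_limit_ms = stall_limit_ms;
    w->first_eagain_ms = -1;
}

/* Record the outcome of one non-blocking send.
   rc     : return value of the send call (-1 on failure)
   err    : errno captured immediately after the call
   now_ms : caller-supplied monotonic clock reading, in milliseconds
   Returns true once EAGAIN has persisted for at least stall_limit_ms. */
bool eagain_watch_update(eagain_watch_t *w, int rc, int err, int64_t now_ms)
{
    assert(w != NULL);
    if (rc >= 0 || err != EAGAIN) {
        /* A success, or a different error, ends the current EAGAIN run. */
        w->first_eagain_ms = -1;
        return false;
    }
    if (w->first_eagain_ms < 0)
        w->first_eagain_ms = now_ms; /* first EAGAIN of a new run */
    return now_ms - w->first_eagain_ms >= w->stall_limit_ms;
}
```

Each sending thread would keep one eagain_watch_t per socket. When eagain_watch_update() returns true, the application can log the condition or try a recovery step such as re-creating the socket; this sketch only detects the stall and does not repair the pipe.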

Setup

The problem occurs in a complex setup where haproxy uses a shared IP to round-robin messages to its backends.
In rare cases under high load, the zmq pipe state machine seems to get confused by haproxy, and the ZMQ_PUSH socket's pipe stays in an incorrect state forever. It cannot be reproduced locally.
I cannot change this setup.

ZMQ_PUSH                                            ZMQ_PULL
client1                                             server1
client2         -------> haproxy (shared vip) --->  server2
client3                                             server3
...more clients                                 ... more servers

Environment

libzmq version (commit hash if unreleased): 4.3.2 and 4.3.4. Two different applications using different libzmq versions show the same result. Both applications connect to the shared vip and share the same set of ZMQ_PULL servers.
OS: ubuntu 20.04

Minimal test code / Steps to reproduce the issue

There are 4 threads, and each thread has its own zmq_socket.
Each zmq_socket connects to the same tcp://10.1.1.1:5555 once, while the process is starting up.
Each zmq_socket sends with zmq_msg_send(frame, zmq_socket, ZMQ_DONTWAIT).
netstat -tupn | grep '10.1.1.1:5555' shows 4 connections to the haproxy vip.
All zmq_sockets have ZMQ_TCP_KEEPALIVE enabled; all other options are left at their defaults.

What's the actual result? (include assertion message & call stack if applicable)

When the issue occurs, errno is always EAGAIN after zmq_msg_send().
netstat -tupn | grep '10.1.1.1:5555' shows only 0 to 3 connections.
The missing connections are never re-established (observed for 8 hours); they stay absent until the process is restarted.

What's the expected result?

EAGAIN should go away after a while, but in fact it persists forever.
Connections should be re-established after a while.