zeromq / libzmq

ZeroMQ core engine in C++, implements ZMTP/3.1

Home Page:https://www.zeromq.org

Is there a way to reset ZMQ_PUSH underlying pipe without re-creating another ZMQ_PUSH zmq_socket?

Lounarok opened this issue

Issue description

Hi,
I have a case where a ZMQ_PUSH client gets EAGAIN forever.
The only way to recover is to restart the application. (I haven't tried re-creating the zmq socket, but I guess that would work.)
Re-creating the ZMQ_PUSH socket might lose all the messages in its queue, so I'd like to know whether there is another way to fix just the pipe.
Or may I just zmq_connect to the same endpoint again, i.e. create a new pipe?
Because I cannot reproduce the problem, I'm wondering whether zmq_disconnect() plus zmq_connect() would help.
Edit: I checked the code, and it seems zmq_disconnect() plus zmq_connect() will destroy the pipe and its queue.

It seems zmq::session_base_t::reconnect () calls _pipe->hiccup (), but I couldn't find a way to trigger reconnect () manually.

Setup

The problem occurs in a complex setup where haproxy uses a shared IP to round-robin messages to its backends.
In rare cases under high load, the zmq pipe state machine seems to be confused by haproxy, and the ZMQ_PUSH socket's pipe stays in an incorrect state forever. I cannot reproduce it locally.
The setup is complex and I cannot change it.

ZMQ_PUSH                                            ZMQ_PULL
client1                                             server1
client2         -------> haproxy (shared vip) --->  server2
client3                                             server3
...more clients                                 ... more servers

Environment

  • libzmq version (commit hash if unreleased): 4.3.2 and 4.3.4. Two different applications using different libzmq versions show the same result. Both applications connect to the shared VIP and share the same set of ZMQ_PULL servers.
  • OS: Ubuntu 20.04

Minimal test code / Steps to reproduce the issue

  1. There are 4 threads, and each thread has its own zmq_socket.
  2. Each zmq_socket connects to the same tcp://10.1.1.1:5555 once, at process startup.
  3. Each zmq_socket sends with zmq_msg_send(frame, zmq_socket, ZMQ_DONTWAIT).
  4. netstat -tupn | grep '10.1.1.1:5555' shows 4 connections to the haproxy VIP.
  5. Every zmq_socket has ZMQ_TCP_KEEPALIVE enabled; all other options are left at their defaults.

What's the actual result? (include assertion message & call stack if applicable)

  1. When the issue occurs, zmq_msg_send() always fails with errno EAGAIN.
  2. netstat -tupn | grep '10.1.1.1:5555' shows only 0 to 3 connections.
  3. The missing connections are not re-established (observed for 8 hours) and stay absent until the process restarts.

What's the expected result?

  1. EAGAIN should go away after a while, but in fact every send keeps failing with EAGAIN.
  2. Connections should be re-established after a while.
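
To make "after a while" measurable when logging this, an application could track how long the current run of EAGAIN results has lasted. The sketch below is purely illustrative: eagain_watch_t and eagain_watch_update are hypothetical names, not part of libzmq, and the caller supplies the timestamp.

```c
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper state, not part of libzmq. */
typedef struct
{
    bool in_streak;      /* true while consecutive sends fail with EAGAIN */
    time_t streak_start; /* time of the first EAGAIN in the current streak */
} eagain_watch_t;

/* Record the outcome of one send attempt. `err` is 0 on success or the
 * errno value on failure. Returns how many seconds the current EAGAIN
 * streak has lasted, or 0 when the last attempt was not EAGAIN. */
static double eagain_watch_update (eagain_watch_t *w, int err, time_t now)
{
    if (err != EAGAIN) {
        w->in_streak = false; /* any other outcome ends the streak */
        return 0.0;
    }
    if (!w->in_streak) {
        w->in_streak = true;
        w->streak_start = now;
    }
    return difftime (now, w->streak_start);
}
```

A sender could call eagain_watch_update (&w, err, time (NULL)) after each zmq_msg_send() and log or alert once the returned value exceeds a threshold of its choosing.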