quickfixgo / quickfix

The Go FIX Protocol Library :rocket:

Home Page:https://www.quickfixgo.org/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Race condition during bilateral initial resend request

sami-sweng opened this issue · comments

commented

Race condition during bilateral initial resend request

I believe I have found a race condition when the initiator successfully connects to an acceptor.
The specific scenarios which cause a bug is when both have a few messages of advance at the time of the first successful logon and the initiator unsent messages are all administrative messages (Logon).

Expected sequence of event

The expected flow is as follow:

Direction Message type
init -> acc Logon
acc -> init Logon
acc -> init ResendRequest
init -> acc ResendRequest
acc -> init News (the 1st message to resend)
acc -> init News (the 2nd message to resend)
acc -> init SequenceReset
init -> acc SequenceReset

Issue

The issue is that sometimes, the initiator sends the sequenceReset immediatly after the resendRequest of the acceptor
and then sends it's own resendRequest.

The sequenceReset is visibly out of order as it has a wrong timestamp (a later one) and the resendRequest continues with the sequence number of before the sequence reset.

The issue is that the acceptor that I'm connecting to (which is using Quickfix Java I think) doesn't tolerate the wrong sequence number and rejects it with a Logout.
And thus the resend mechanism fails and the acceptor messages are lost.

Actual sequence of event

Here are the outputed FIX messages during the scenario:

8=FIX.4.4|A9=76|A35=A|A34=4|A49=INIT|A52=20210525-22:05:30.686|A56=ACC|A369=0|A98=0|A108=90|A10=220|A
8=FIX.4.4|A9=71|A35=A|A34=14|A49=ACC|A52=20210525-22:05:30.803|A56=INIT|A98=0|A108=90|A10=239|A
8=FIX.4.4|A9=68|A35=2|A34=15|A49=ACC|A52=20210525-22:05:30.804|A56=INIT|A7=1|A16=0|A10=068|A
8=FIX.4.4|A9=107|A35=4|A34=1|A43=Y|A49=INIT|A52=20210525-22:05:30.833|A56=ACC|A122=20210525-22:05:30.833|A369=15|A36=6|A123=Y|A10=013|A
8=FIX.4.4|A9=73|A35=2|A34=5|A49=INIT|A52=20210525-22:05:30.821|A56=ACC|A369=0|A7=1|A16=0|A10=030|A
8=FIX.4.4|A9=108|A35=5|A34=16|A49=ACC|A52=20210525-22:05:30.848|A56=INIT|A58=MsgSeqNum too low, expecting 6 but received 5|A10=159|A
8=FIX.4.4|A9=76|A35=A|A34=7|A49=INIT|A52=20210525-22:05:40.882|A56=ACC|A369=1|A98=0|A108=90|A10=223|A
8=FIX.4.4|A9=71|A35=A|A34=17|A49=ACC|A52=20210525-22:05:40.996|A56=INIT|A98=0|A108=90|A10=000|A
8=FIX.4.4|A9=68|A35=2|A34=18|A49=ACC|A52=20210525-22:05:40.997|A56=INIT|A7=6|A16=0|A10=090|A
8=FIX.4.4|A9=107|A35=4|A34=6|A43=Y|A49=INIT|A52=20210525-22:05:41.031|A56=ACC|A122=20210525-22:05:41.031|A369=18|A36=9|A123=Y|A10=008|A
8=FIX.4.4|A9=73|A35=2|A34=8|A49=INIT|A52=20210525-22:05:41.015|A56=ACC|A369=1|A7=2|A16=0|A10=032|A
8=FIX.4.4|A9=108|A35=5|A34=19|A49=ACC|A52=20210525-22:05:41.046|A56=INIT|A58=MsgSeqNum too low, expecting 9 but received 8|A10=160|A

Root cause

I believe I identified the root cause of the issue.

There are essentially three goroutines involved in processing messages in quickfixgo:

  • the read loop (go readLoop()), which parse and sends messages to the messageIn channel.
  • the processing loop (session.run()), which selects from either messageIn (message to be processed) or messageEvent (signaling message in the sending queue)
  • the write loop (go writeLoop())

Step 1

When the acceptor Logon message is received, we are in the LogonState.
The Logon message is written in the messageIn channel, and the session gets it and forwards it to the FixMsgIn() of the LogonState.
So the FixMsgIn() of the logon state essentially create a resendRequest which is simply put in the queue of message to send as we are not logged and trigger the messageEvent signal.
This puts us in the state resendState which in our case will simply proxy the next FixMsgIn to the loggedOn state as there is nothing else to do.

Step 2.a

The session get the messageEvent signal from step 1, and as the resendState signal is considered a loggedOn state, it dequeue the single resend message and immediately sends it.

Step 2.b

The readLoop gets the resendRequest of the acceptor and writes it in the messageIn channel.
The session get the message and submits it to the FixMsgIn of the loggedOn state. This creates a resendRequest and IMMEDIATELY sends it as bytes.

Issue

The issue is that the step 2.a and 2.b are concurrent, as during the time requested to perform step 1, the readloop could have first written the resendRequest in the messageIn which would cause step 2.b to be executed before step 2.a.
The problem is that in that case, step 2.b ignores the fact that there is already a message to send in the queue (the resend) and bypass it.

Possible remediation

If step 2.b happens first, if instead of immediately sending the sequenceReset, it first send the queued resendRequest, no out of order messages are sent.