eModbus / eModbus

Modbus library for RTU, ASCII and TCP protocols. Primarily developed on and for ESP32 MCUs.

Home Page: https://emodbus.github.io

TCP requests are dropped

zivillian opened this issue

Environment

I'm currently using the library in two different projects on an ESP32 to build a TCP-to-RTU bridge. Both show the same behaviour and I'm out of ideas. One RTU slave is a Heidelberg Wallbox and the other an SDM630, two completely different devices. One Modbus TCP client is evcc and the other mbmd. As far as I can tell, both use the same Go Modbus library.

Problem

Every few minutes I receive a timeout error on the TCP client side. I captured the traffic between the client and the ESP32 and can see that the request is transmitted and that the ESP32 also sends an ACK, but there is no response at all, so the client closes the connection after a timeout of 10 seconds:

[image]

I also added a log statement in ModbusServerTCPtemp.h:

        // Extract request data
        ModbusMessage request;
        request.add(m.data() + 6, m.size() - 6);
-
+        LOG_E(">");
        // Protocol ID shall be 0x0000 - is it?
        if (m[2] == 0 && m[3] == 0) {

and this was logged only when there was no timeout. So my current guess is that the packet is received by the ESP but not processed by eModbus.

Example code

https://github.com/zivillian/heidelberg-1p3p or https://github.com/zivillian/esp32-modbus-gateway

The hardware I'm currently using does not expose an additional HardwareSerial, so I can only get the logs via telnet (that's why I have no trace log). I'm currently not even sure if this is a problem with emodbus or if all of my hardware or wifi is somehow broken.

Do you have an idea how I can further debug this?

Strange. There is one possibility left that would go unnoticed in the library code: if the request TCP packet is somehow chopped into pieces of less than 8 bytes, the outermost if in the receive loop will discard them without a message. You could put a log statement in an else block to see if that is really happening. When I wrote the code I did not really expect that to happen, but who knows?
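To illustrate the 8-byte check, here is a minimal standalone sketch of the idea. It is not the library's code: `FrameCheck` and `checkFrame` are hypothetical names, and the logic only mirrors what is described here (a size check, then the protocol ID check on bytes 2 and 3).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Outcome of inspecting one chunk of bytes read from the TCP socket.
// (Hypothetical type, for illustration only.)
enum class FrameCheck { TooShort, BadProtocolId, Ok };

// A Modbus TCP request starts with the 7-byte MBAP header
// (transaction ID, protocol ID, length, unit ID) followed by at least
// a function code, so fewer than 8 bytes cannot be a complete request.
FrameCheck checkFrame(const std::vector<uint8_t>& m) {
  constexpr std::size_t kMinFrame = 8;
  if (m.size() < kMinFrame) {
    return FrameCheck::TooShort;  // the case that would be dropped silently
  }
  // Bytes 2 and 3 hold the protocol ID, which is 0x0000 for Modbus.
  if (m[2] != 0 || m[3] != 0) {
    return FrameCheck::BadProtocolId;
  }
  return FrameCheck::Ok;
}
```

Logging in the `TooShort` branch is the equivalent of the else block suggested above.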

I've added another log statement, but this does not show up in the log:

        } else {
          // No, protocol ID was something weird
          response.setError(request.getServerID(), request.getFunctionCode(), TCP_HEAD_MISMATCH);
        }
+     } else {
+       LOG_E("size < 8");
      }
      delay(1);
      // Do we have a response to send?

I'll try to find another esp with an additional HardwareSerial, so I can get a debug log...

Well, at least we now have learned there are no fragmented packets.

Since we do a straight read() as soon as available() gets true, there is no way to drop data in between. Hence available() seems not to get >0 at all.

Do you happen to have another task running that is handling TCP traffic at the same time?

The only other TCP traffic is the web interface from https://github.com/me-no-dev/ESPAsyncWebServer.

Yesterday I was able to capture a debug log but it shows no error or hint:

[D] 900189| ModbusServer.cpp     [  38] getWorker: Worker found for 01/ANY
[D] 900190| ModbusBridgeTemp.h   [ 174] bridgeWorker: Request (01/04) sent
[D] 900201| ModbusClientRTU.cpp  [ 230] addToQueue: RC=01
[D] 900202| ModbusClientRTU.cpp  [ 248] handleConnection: Pulled request from queue
[D] Sent packet: @3FFDE394/6:
  | 0000: 01 04 00 0A 00 03                                 |......          |
[D] 900218| ModbusClientRTU.cpp  [ 253] handleConnection: Request sent.
[D] 900231| RTUutils.cpp         [ 436] receive: C/[D] Received packet: @3FFDF054/9:
  | 0000: 01 04 06 00 ED 00 09 00  08                       |.........       |
[D] 900242| ModbusClientRTU.cpp  [ 268] handleConnection: Data response (9 bytes) received.
[D] 900257| ModbusClientRTU.cpp  [ 289] handleConnection: Response generated.
[D] 900261| ModbusServerTCPtemp.h [ 314] worker: Data response

<-- here a query is missing, which results in the timeout and disconnect of the client

[D] 901344| ModbusServerTCPtemp.h [ 358] worker: Worker stopping due to client disconnect.
[D] 901345| ModbusServerTCPtemp.h [ 210] accept: Started client 1 task 1073610040
[D] 901346| ModbusServerTCPtemp.h [ 261] worker: Worker started, timeout=60000
[D] 901361| ModbusServer.cpp     [  38] getWorker: Worker found for 01/ANY
[D] 901356| ModbusServerTCPtemp.h [ 239] serve: Accepted connection - 1 clients running
[D] 901362| ModbusBridgeTemp.h   [ 174] bridgeWorker: Request (01/04) sent
[D] 901384| ModbusClientRTU.cpp  [ 248] handleConnection: Pulled request from queue
[D] Sent packet: @3FFDF09C/6:
  | 0000: 01 04 00 11 00 02                                 |......          |
[D] 901400| ModbusClientRTU.cpp  [ 253] handleConnection: Request sent.
[D] 901403| RTUutils.cpp         [ 436] receive: C/[D] Received packet: @3FFDE3B0/7:
  | 0000: 01 04 04 00 00 00 00                              |.......         |
[D] 901413| ModbusClientRTU.cpp  [ 268] handleConnection: Data response (7 bytes) received.
[D] 901424| ModbusClientRTU.cpp  [ 289] handleConnection: Response generated.
[D] 901390| ModbusClientRTU.cpp  [ 230] addToQueue: RC=01
[D] 901436| ModbusServerTCPtemp.h [ 314] worker: Data response
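As a cross-check that the RTU side answered correctly, the two responses in the log can be decoded by hand. The following is a small sketch, assuming the packets are exactly as logged (server ID, function code, byte count, then 16-bit registers with the high byte first, and no CRC); `decodeRegisters` is a hypothetical helper, not part of eModbus.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Decodes a read-registers response as it appears in the log:
//   [server ID][function code][byte count][register data ...]
// Returns the register values, or an empty vector if the layout is wrong.
std::vector<uint16_t> decodeRegisters(const std::vector<uint8_t>& pdu) {
  std::vector<uint16_t> regs;
  if (pdu.size() < 3) {
    return regs;  // too short to carry a byte count
  }
  const std::size_t byteCount = pdu[2];
  if (byteCount % 2 != 0 || pdu.size() != 3 + byteCount) {
    return regs;  // byte count does not match the packet length
  }
  for (std::size_t i = 3; i + 1 < pdu.size(); i += 2) {
    // Registers are transmitted high byte first.
    regs.push_back(static_cast<uint16_t>((pdu[i] << 8) | pdu[i + 1]));
  }
  return regs;
}
```

Applied to `01 04 06 00 ED 00 09 00 08`, this yields the three registers 237, 9 and 8; the second response `01 04 04 00 00 00 00` yields two registers of 0. Both are well-formed, which is consistent with the missing query never reaching the bridge at all.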

Beats me. Even if the Modbus bridge (TCP server side) would be pausing/dreaming/whatever, the TCP packet would be held in the TCP stack until someone was going to read it. The connection is closed for inactivity, so nothing was available() in the meantime.

One long shot came to my mind, though: what if the request came in exactly while the connection was cut? You should try with a longer timeout on the server side, like 10s or more - anything longer than the interval between two requests will do.

I wrote a small client which queries the same register over and over with a timeout of 1s. After the first query times out, it never recovers:

Code:

using var client = new TcpClient();
await client.ConnectAsync("192.168.23.134", 502, cancellationToken);
var stream = client.GetStream();
var query = new byte[12]
{
    0x00, 0x00, //transaction identifier
    0x00, 0x00, //protocol identifier
    0x00, 0x06, //length
    0x01, //unit identifier
    0x04, //function code
    0x00,0x06, // reference number
    0x00,0x03 // word count
};
var response = new byte[15];
short trans = 0;
while (true)
{
    trans++;
    query[0] = (byte)(trans >> 8);
    query[1] = (byte)(trans & 0xff);
    await stream.WriteAsync(query, cancellationToken);
    Console.WriteLine("query sent");
    var read = stream.ReadAsync(response, cancellationToken).AsTask();
    var timeout = Task.Delay(1000, cancellationToken);
    var finished = await Task.WhenAny(read, timeout);
    if (finished == timeout)
    {
        Console.WriteLine("timed out");
    }
    else
    {
        await read;
        Console.WriteLine($"response received");
    }
}

Output:

starting
query sent
response received
query sent
response received
query sent
timed out
query sent
timed out
query sent
timed out
query sent
timed out
query sent
timed out

Next I'll build a firmware which responds directly to requests without the bridge. Hopefully that lets me pinpoint the problem further.

My bad - it didn't recover because of my client code. With the following code I can successfully read again after a timeout:

using var client = new TcpClient();
await client.ConnectAsync("192.168.23.134", 502, cancellationToken);
var stream = client.GetStream();
var query = new byte[12]
{
    0x00, 0x00, //transaction identifier
    0x00, 0x00, //protocol identifier
    0x00, 0x06, //length
    0x01, //unit identifier
    0x04, //function code
    0x00,0x06, // reference number
    0x00,0x03 // word count
};
var response = new byte[15];
short trans = 0;
stream.ReadTimeout = 1000;
while (true)
{
    trans++;
    query[0] = (byte)(trans >> 8);
    query[1] = (byte)(trans & 0xff);
    await stream.WriteAsync(query, cancellationToken);
    Console.WriteLine("query sent");
    try
    {
        stream.Read(response);
        Console.WriteLine($"response received");
    }
    catch (IOException)
    {
        Console.WriteLine("timed out");
    }
}

Output:

starting
query sent
response received
query sent
response received
query sent
response received
query sent
response received
query sent
response received
query sent
timed out
query sent
response received
query sent
timed out
query sent
response received

Yeah, but what happens if you set the bridge's timeout to around 10s? It is shorter now, so if my assumption is right, your test client will see fewer timeouts with a longer bridge timeout.

Do you mean changing the timeout in the client I posted above, like so?

- stream.ReadTimeout = 1000;
+ stream.ReadTimeout = 10000;

This doesn't resolve the error.

I've built a minimal reproducible example, and it looks like ESPAsyncWebServer is interfering with the Modbus traffic. I'm not even starting the web server, only setting up a handler which does nothing useful. I also haven't sent a single HTTP request:

#include <Arduino.h>
#include <WiFiManager.h>
#include <ESPAsyncWebServer.h>
#include <ModbusServerWiFi.h>

#define debugSerial Serial
#define dbgln(x...) debugSerial.println(x);

ModbusMessage worker(ModbusMessage msg);
void setupPages(AsyncWebServer* server);

AsyncWebServer webServer(80);
ModbusServerWiFi MBserver;
WiFiManager wm;

void setup() {
  debugSerial.begin(115200);
  dbgln("[wifi] start");
  WiFi.mode(WIFI_STA);
  wm.setClass("invert");
  wm.autoConnect();
  dbgln("[wifi] finished");
  dbgln("[modbus] start");
  MBserver.registerWorker(1, READ_INPUT_REGISTER, worker);
  MBserver.start(502, 10, 30000);
  dbgln("[modbus] finished");
  setupPages(&webServer); //<- if I remove this line it works
  //webServer.begin(); //I don't even start the webServer
  dbgln("[setup] finished");
}

void loop() {
  // put your main code here, to run repeatedly:
}

ModbusMessage worker(ModbusMessage msg){
  uint16_t addr = 0;
  uint16_t words = 0;
  msg.get(2, addr);
  dbgln(addr);
  msg.get(4, words);
  ModbusMessage response;
  response.add(msg.getServerID(), msg.getFunctionCode(), (uint8_t)(words * 2));
  for (size_t i = 0; i < words; i++)
  {
    response.add((uint16_t)i);
  }
  return response;
}

void setupPages(AsyncWebServer *server){
  server->on("/", HTTP_GET, [](AsyncWebServerRequest *request){
    dbgln("[webserver] GET /");
    auto *response = request->beginResponseStream("text/html");
    request->send(response);
  });
}
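For reference, the bytes that this worker returns can be reproduced without the library. The sketch below uses a plain `std::vector` instead of `ModbusMessage`, and it assumes that 16-bit values are appended high byte first, as is usual for Modbus; `buildResponse` is a hypothetical name.

```cpp
#include <cstdint>
#include <vector>

// Rebuilds the response produced by the worker above:
// server ID, function code, byte count (words * 2), then one register
// per requested word whose value is simply its index.
std::vector<uint8_t> buildResponse(uint8_t serverId, uint8_t functionCode,
                                   uint16_t words) {
  std::vector<uint8_t> out;
  out.push_back(serverId);
  out.push_back(functionCode);
  out.push_back(static_cast<uint8_t>(words * 2));
  for (uint16_t i = 0; i < words; ++i) {
    out.push_back(static_cast<uint8_t>(i >> 8));    // high byte
    out.push_back(static_cast<uint8_t>(i & 0xFF));  // low byte
  }
  return out;
}
```

For the test client's request (server 1, function code 04, 3 words) this gives `01 04 06 00 00 00 01 00 02`, so any reply seen in a capture should carry exactly that payload after the TCP header.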

Log

...
[D] 16033| ModbusServer.cpp     [  30] getWorker: Worker found for 01/04
579
[D] 16035| ModbusServerTCPtemp.h [ 314] worker: Data response
[D] 16045| ModbusServer.cpp     [  30] getWorker: Worker found for 01/04
580
[D] 16047| ModbusServerTCPtemp.h [ 314] worker: Data response
[D] 26414| ModbusServerTCPtemp.h [ 358] worker: Worker stopping due to client disconnect.

I've modified my client so that the queried register and transaction id are the same:

[image]
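The idea of reusing one counter for both fields can be sketched as follows. This is an illustration in C++ rather than the actual client code; `buildReadInputRequest` and `transactionIdOf` are hypothetical names, and the frame layout follows the 12-byte query used in the client above.

```cpp
#include <array>
#include <cstdint>

// Builds a 12-byte Modbus TCP "read input registers" (0x04) request.
std::array<uint8_t, 12> buildReadInputRequest(uint16_t transactionId,
                                              uint8_t unitId,
                                              uint16_t address,
                                              uint16_t words) {
  return {{
      static_cast<uint8_t>(transactionId >> 8),
      static_cast<uint8_t>(transactionId & 0xFF),  // transaction identifier
      0x00, 0x00,                                  // protocol identifier
      0x00, 0x06,                                  // length of what follows
      unitId,                                      // unit identifier
      0x04,                                        // function code
      static_cast<uint8_t>(address >> 8),
      static_cast<uint8_t>(address & 0xFF),        // reference number
      static_cast<uint8_t>(words >> 8),
      static_cast<uint8_t>(words & 0xFF)           // word count
  }};
}

// Reads the transaction identifier back out of a request frame.
uint16_t transactionIdOf(const std::array<uint8_t, 12>& frame) {
  return static_cast<uint16_t>((frame[0] << 8) | frame[1]);
}
```

Calling `buildReadInputRequest(n, 1, n, 3)` with the same counter `n` for the transaction ID and the address means the number printed by the worker on the ESP and the transaction ID in the capture can be compared directly.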

The ESP logs the queried register and the following packet capture shows the transaction ID. You can clearly see that the ESP has processed up to ID 580, but the packet capture also shows the query for ID 581, which was ACKed by the ESP.

[image]

Excellent! This indeed narrows it down to some async interference.

@bertmelis Would you please have a look, as it is the async stuff again? ESP32 this time, I am afraid, so it most probably is not caused by the missing multitasking in the ESP8266.

@Miq1 I've found the root cause and opened a PR to fix it.

@zivillian Thanks, that sounds reasonable as a cause. I am merging the PR ATM. There may be side effects, but we will see.