apache / plc4x

PLC4X: The Industrial IoT adapter

Home Page: https://plc4x.apache.org/

[Bug]: plc4j-tools-connection-cache: broken connections remaining in the cache on timeout

QuanticPony opened this issue · comments

What happened?

Summary

When a connection stored in the connection-cache breaks due to a network failure, the connection is not removed from the cache and blocks future uses of the same connection string.

Context

Encountered while trying to solve a problem similar to #623 in the NiFi integration:
When a processor is running and the network connection to the PLC is interrupted, the processor continues to throw errors even after the network connection is restored.

I brought this up on the mailing list (https://lists.apache.org/thread/xm38nh8xzh1m1kj0y74dx0goo81cos82), which sparked a pull request by heyoulin (#818), an issue by splatch (#821), and a commit by @chrisdutz (9b06c2d).

The commit (9b06c2d) did not fully address the problem, so here is my attempt to fix it.

Replicate the problem

To replicate the problem, use the code at the end of this issue and follow these steps:

  1. Start the main below.
  2. Disconnect the network.
  3. Wait until errors are shown on stdout.
  4. You will see the connection is still being used after it fails:

16:38:22.486 [main] DEBUG o.a.p.j.u.c.CachedPlcConnectionManager.getConnection:72 - Reusing exising connection
Failed to read due to:
java.util.concurrent.TimeoutException

  5. Reconnect the network. The problem persists.

Possible Solution

The LeasedConnection returns a Future that wraps the Future that actually connects to the PLC. Only the inner Future can mark the connection as invalid for removal. For the moment I have been able to work around this by overriding the get method of the outer Future:

@Override
public PlcReadResponse get(long timeout, TimeUnit unit)
        throws InterruptedException, ExecutionException, TimeoutException {
    try {
        return super.get(timeout, unit);
    } catch (TimeoutException e) {
        // Propagate the client-side timeout to the inner (driver-side)
        // future so the cache can invalidate the broken connection.
        future.completeExceptionally(e);
        throw e;
    }
}
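
For context, here is a minimal, self-contained sketch of the kind of wrapper this override could live in. This is an assumption about the surrounding class, not the actual fork code: a CompletableFuture that mirrors the driver-side future and pushes client-side timeouts back into it.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper, for illustration only.
class TimeoutPropagatingFuture<T> extends CompletableFuture<T> {
    private final CompletableFuture<T> future; // the driver-side (inner) future

    TimeoutPropagatingFuture(CompletableFuture<T> future) {
        this.future = future;
        // Relay the inner future's outcome to this wrapper.
        future.whenComplete((value, throwable) -> {
            if (throwable != null) {
                completeExceptionally(throwable);
            } else {
                complete(value);
            }
        });
    }

    @Override
    public T get(long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        try {
            return super.get(timeout, unit);
        } catch (TimeoutException e) {
            // Fail the inner future too, so the cache's error handling
            // can mark the underlying connection as invalid.
            future.completeExceptionally(e);
            throw e;
        }
    }
}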

You can see my solution in the zylklab fork (https://github.com/zylklab/plc4x/tree/Fix/nifi-integration-timeout). If you could give me some feedback I would like to make this into a PR as soon as possible.


import java.time.Duration;
import java.util.concurrent.TimeUnit;

import org.apache.plc4x.java.DefaultPlcDriverManager;
import org.apache.plc4x.java.api.PlcConnection;
import org.apache.plc4x.java.api.messages.PlcReadRequest;
import org.apache.plc4x.java.api.messages.PlcReadResponse;
import org.apache.plc4x.java.utils.cache.CachedPlcConnectionManager;

public class ManualTest {

    public static void main(String[] args) throws InterruptedException {
        CachedPlcConnectionManager cachedPlcConnectionManager = CachedPlcConnectionManager
            .getBuilder(new DefaultPlcDriverManager())
            .withMaxLeaseTime(Duration.ofMinutes(5))
            .build();
        for (int i = 0; i < 100; i++) {
            Thread.sleep(1000);
            try (PlcConnection connection = cachedPlcConnectionManager.getConnection("s7://10.105.143.7:102?remote-rack=0&remote-slot=1&controller-type=S7_1200")) {
                PlcReadRequest.Builder plcReadRequestBuilder = connection.readRequestBuilder();
                plcReadRequestBuilder.addTagAddress("foo", "%DB1:DBX0.0:BOOL");
                PlcReadRequest plcReadRequest = plcReadRequestBuilder.build();

                PlcReadResponse plcReadResponse = plcReadRequest.execute().get(1000, TimeUnit.MILLISECONDS);
                System.out.printf("Run %d: Value: %f%n", i, plcReadResponse.getFloat("foo"));
            } catch (Exception e) {
                System.out.println("Failed to read due to: ");
                e.printStackTrace();
            }
        }
    }
}

Version

v0.11.0-SNAPSHOT

Programming Languages

  • plc4j
  • plc4go
  • plc4c
  • plc4net

Protocols

  • AB-Ethernet
  • ADS/AMS
  • BACnet/IP
  • CANopen
  • DeltaV
  • DF1
  • EtherNet/IP
  • Firmata
  • KNXnet/IP
  • Modbus
  • OPC-UA
  • S7

Can you try my repository? If it works, I can push the fix for the PlcConnection bug.

@chrisdutz can you look into this?

Sorry for the late response. I was able to reproduce the problem.

Interestingly it did recover after quite some time ... will look into this.

Run 0: Value: 1,000000
Run 1: Value: 1,000000
Run 2: Value: 1,000000
Run 3: Value: 1,000000
Run 4: Value: 1,000000
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Run 12: Value: 1,000000
Run 13: Value: 1,000000
Run 14: Value: 1,000000
Run 15: Value: 1,000000
Run 16: Value: 1,000000
Run 17: Value: 1,000000
Run 18: Value: 1,000000
Run 19: Value: 1,000000

So I guess the problem is that we have a connection timeout that is longer than your 1-second request timeout ... so it looks as if the connection fails and the next connect runs into the void ... we then wait for this to time out and then create a new attempt ... if the network is available again, it connects and the application recovers; if it's not, we wait for the next connection timeout and then try again.

Ok ... keeping the network disconnected for longer than the connection timeout resulted in a different set of errors. Unfortunately I didn't quite understand your solution, or couldn't find it in the branch.

This is not a connection-cache bug. It is a bug in the base Plc4xNettyWrapper that has existed for a long time. I fixed it in #818, but you did not accept my fix. You can check #801; it partly resolves this huge bug.

@spnettec I tried your branch a long time ago and it seemed to work. I only used your changes to the connection-cache in that testing.

@chrisdutz When I created this issue I didn't get a single re-connection. Having pulled from develop, I now also get re-connections. I will try the NiFi integration that was not working and let you know if it works properly!

> Ok ... keeping the network disconnected for longer than the connection timeout resulted in a different set of errors. Unfortunately I didn't quite understand your solution, or couldn't find it in the branch.

If you are looking for my changes you can view them here.

@spnettec I didn't accept your PR because I think it addressed multiple things at once and changed the API of PLC4X in an undesirable way. That was why I tried implementing an alternate path.

@QuanticPony thanks for the pointer ... that link helps a lot. Will have a look at it after work.

Ok ... so having a look at the changes, it seems your branch has diverged quite a bit from develop ... I'd like to help work on fixing any issues you might be having here.

Hi again. I had a look into the NiFi integration after updating the branch:
Found the S7 driver reconnecting, but the OPC UA driver didn't.
Tried the manual test I posted in this issue with both drivers. Same result.
Additionally, I found that the S7 driver always reconnected after 15 disconnect messages.
From this I take it that it depends on how each driver decides that a disconnection has happened.

Not sure how we should handle the reconnection from the NiFi integration side. Right now, using it to connect to an OPC UA PLC means that after a disconnection a user must manually restart the processors or restart NiFi.

Tried again with the changes I made. It reconnects in both cases.

I think my main problem is how to simulate the situation in a unit test to prove the cache works correctly.
Could you please describe the situation exactly and what you're seeing the system do? I'd like to replicate this; however, I couldn't replicate the driver not coming back to life.

@chrisdutz I have made a unit test for the behavior that I think is correct. You can check it here

The problem I see is that any exception while (for example) reading from the PLC causes the LeasedPlcConnection to be invalidated, but a timeout due to a broken connection does not.
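
To make the expected behavior concrete, here is a hedged sketch of such a test. It assumes a hypothetical connection string "mock:test" served by a driver whose reads never complete; the actual tests in the linked branch may look different.

import static org.junit.jupiter.api.Assertions.assertThrows;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.plc4x.java.DefaultPlcDriverManager;
import org.apache.plc4x.java.api.PlcConnection;
import org.apache.plc4x.java.api.messages.PlcReadRequest;
import org.apache.plc4x.java.utils.cache.CachedPlcConnectionManager;
import org.junit.jupiter.api.Test;

class CachedConnectionTimeoutTest {

    @Test
    void timeoutShouldInvalidateCachedConnection() throws Exception {
        CachedPlcConnectionManager manager = CachedPlcConnectionManager
            .getBuilder(new DefaultPlcDriverManager())
            .build();

        try (PlcConnection connection = manager.getConnection("mock:test")) {
            PlcReadRequest request = connection.readRequestBuilder()
                .addTagAddress("foo", "%DB1:DBX0.0:BOOL")
                .build();
            // The mock never answers, so get(...) times out on the client side.
            assertThrows(TimeoutException.class,
                () -> request.execute().get(50, TimeUnit.MILLISECONDS));
        }

        // Expected: the timed-out connection was invalidated, so the next
        // getConnection(...) call builds a fresh connection rather than
        // reusing the broken one from the cache.
    }
}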

Thanks for that ... of course I first have to finish my changes on the subscription API, or nothing will compile ;-)

Ok ... so the first test passes in my setup ... however, for the second one I'm not really sure how it should fail ... the timeout is on the request execution ... the client says "give me that in 50 ms", but this timeout exists only in the CompletableFuture, and the driver has no way of knowing how long the client is willing to wait.

I guess I should think of a way to make the operation time out internally (without waiting 10 seconds or so).

Or how would the driver know about your CompletableFuture timeout?

The connection cache hands out a leased connection with the real connection inside. If an exception occurs in the real connection, the leased connection is invalidated. This is the first test.

I think the problem is that when the client gets a timeout on the leased connection, the CompletableFuture of the real connection (the one that invalidates the leased connection) does not get one. This is the case that breaks the cache for me, as that connection is no longer usable. I think a timeout on the leased connection should propagate into the real connection, marking the real connection invalid. Otherwise the client has no way of removing the real connection from the connection cache.
This is the second test.

I see two solutions: either propagate the timeout, or allow the client to manually invalidate a connection in the cache.
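
A minimal sketch of the first option, assuming the leased connection gets hold of the client's deadline. With CompletableFuture.orTimeout (Java 9+), the driver-side future itself completes exceptionally, so the same error handling that invalidates the leased connection on driver errors would also fire on client-side timeouts. This is an illustration, not the actual PLC4X plumbing:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class TimeoutPropagationSketch {
    // Illustrative helper: attach the client's deadline to the driver-side
    // future so a timeout fails that future instead of only the client's
    // blocking get(...) call.
    static <T> CompletableFuture<T> withDeadline(CompletableFuture<T> driverFuture,
                                                 long timeout, TimeUnit unit) {
        return driverFuture.orTimeout(timeout, unit);
    }
}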

Well, I did locally change the LeasedConnection to simply react to Exception instead of PlcRuntimeException ... this should now catch timeout errors too ... but I'm checking whether we shouldn't be catching the timeouts and converting them to Plc exceptions.
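
A hedged sketch of what that change amounts to; the real LeasedConnection code differs, and the invalidate callback here stands in for whatever mechanism removes the broken connection from the cache:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class WidenedCatchSketch {
    static <T> T getAndInvalidateOnError(CompletableFuture<T> future,
                                         long timeout, TimeUnit unit,
                                         Runnable invalidate) throws Exception {
        try {
            return future.get(timeout, unit);
        } catch (Exception e) { // previously only PlcRuntimeException was caught
            invalidate.run();   // e.g. remove the connection from the cache
            throw e;
        }
    }
}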

Think I found it ... NettyHashTimerTimeoutManager.java creates a TimeoutException in line 54 ... this is not a PlcException or PlcRuntimeException, so it falls outside all catch blocks ... I think changing this to a PlcTimeoutException that extends PlcRuntimeException should also fix many of these issues.
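
A sketch of what such an exception type could look like; name, package, and constructor shape are assumptions based on the description above, not the committed code:

import org.apache.plc4x.java.api.exceptions.PlcRuntimeException;

// Hypothetical shape: extends PlcRuntimeException so existing catch
// blocks for Plc exceptions also handle timeouts.
public class PlcTimeoutException extends PlcRuntimeException {
    public PlcTimeoutException(long timeoutMillis) {
        super("Timeout of " + timeoutMillis + " ms exceeded");
    }
}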

Could you folks please try this again and give me feedback on whether this issue is now fixed?

So ... is this fixed, or was the thumbs-up just an acknowledgement that you'll be doing that?

Acknowledgement that I will test this next week.

@chrisdutz I have tested it in the NiFi integration with an S7 and an OPC UA PLC. Same as last time:
The S7 driver detected the invalid connection and successfully reconnected. The OPC UA driver did not.
I am pretty sure this is a driver-related problem at this point.
Nevertheless, I would suggest adding a way of removing LeasedPlcConnections from the cache manually.
I will close this issue. Thank you for your help!

I just added two methods that allow manual removal of connections from the cache ... not tested at all ... feel free to try them out. There are now two methods:

  • getCachedConnections, which returns the set of cached connection urls
  • removeCachedConnection(String), which removes the connection with the given url from the cache
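
A quick usage sketch of those two methods, assuming the signatures above (untested, as noted):

import org.apache.plc4x.java.DefaultPlcDriverManager;
import org.apache.plc4x.java.utils.cache.CachedPlcConnectionManager;

public class CacheEvictionExample {
    public static void main(String[] args) {
        CachedPlcConnectionManager manager = CachedPlcConnectionManager
            .getBuilder(new DefaultPlcDriverManager())
            .build();

        // Inspect which connection urls are currently cached ...
        for (String url : manager.getCachedConnections()) {
            System.out.println("cached: " + url);
        }

        // ... and evict one manually, e.g. after a client-side timeout.
        manager.removeCachedConnection("s7://10.105.143.7:102");
    }
}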

@chrisdutz Already implemented in our fork and working pretty well. Will be posting a PR soon with the changes needed for the NiFi integration to work properly again.

Well ... we're planning on cutting the RC for 0.10.0 on Monday ...