CommunitySolidServer / CommunitySolidServer

An open and modular implementation of the Solid specifications

Home Page: https://communitysolidserver.github.io/CommunitySolidServer/

Lock expired error when multiple clients fetch a particular resource from the Community Solid Server.

argahsuknesib opened this issue

Environment

  • Server version: 6.0.1
  • Node.js version: 16.20.2
  • npm version: 8.19.4

Description

I have a Solid server located at http://n061-14a.wall2.ilabt.iminds.be:3000/ with 24 workers. (Please use an IDLab-IGent VPN to reach it.)

When multiple clients fetch the same resource (i.e. multiple GET requests), the server throws an error:

2024-02-08T15:15:40.090Z [BasicResponseWriter] {W-191651} error: Aborting streaming response because of server error; headers already sent.
2024-02-08T15:15:40.090Z [BasicResponseWriter] {W-191651} error: Response error: Lock expired after 6000ms on http://n061-14a.wall2.ilabt.iminds.be:3000/participant6/skt/1706796226630/e1e1828b-017a-4484-8c9b-a0cae6d93af2
2024-02-08T15:15:40.104Z [WrappedExpiringReadWriteLocker] {W-191658} error: Lock expired after 6000ms on http://n061-14a.wall2.ilabt.iminds.be:3000/participant6/skt/1706796226630/e1e1828b-017a-4484-8c9b-a0cae6d93af2
2024-02-08T15:15:40.106Z [HandlerServerConfigurator] {W-191658} error: Request error: aborted
2024-02-08T15:15:40.106Z [StreamUtil] {W-???} warn: Piped stream errored with Lock expired after 6000ms on http://n061-14a.wall2.ilabt.iminds.be:3000/participant6/skt/1706796226630/e1e1828b-017a-4484-8c9b-a0cae6d93af2

This behaviour can be reproduced with:

// Fire 300 GET requests for the same resource in parallel, without awaiting them
for (let i = 0; i < 300; i++) {
    fetch('http://n061-14a.wall2.ilabt.iminds.be:3000/participant6/skt/1706796226630/e1e1828b-017a-4484-8c9b-a0cae6d93af2');
}

Since the server implements a multiple-read/single-write lock, this is unexpected: the number of GET requests is lower than the 789 (with 24 workers) demonstrated in graph 3 of the test here.

Moreover, using the LDES reader to read within a time window, as in the code below, makes 4 GET requests to the CSS.

Please install the versionawareldesinldp package before executing the code:

npm i @treecg/versionawareldesinldp

import { LDESinLDP, LDPCommunication } from "@treecg/versionawareldesinldp";

async function main() {
    let ldes_location = "http://n061-14a.wall2.ilabt.iminds.be:3000/participant6/skt/";
    let ldes = new LDESinLDP(ldes_location, new LDPCommunication());
    let to_date_skt = new Date("2024-02-01T17:54:03.024Z");
    let from_date_skt = new Date("2024-02-01T17:49:03.012Z");
    let readable_stream = await ldes.readMembersSorted({
        from: from_date_skt,
        until: to_date_skt,
        chronological: true
    });
    readable_stream.on('data', async (data) => {
        console.log(data);
    })
}
main();

However, if I simulate 25 clients (i.e. 25 × 4 = 100 GET requests) with the following code,

import { LDESinLDP, LDPCommunication} from "@treecg/versionawareldesinldp";

async function main() {
    let ldes_location = "http://n061-14a.wall2.ilabt.iminds.be:3000/participant6/skt/";
    let ldes = new LDESinLDP(ldes_location, new LDPCommunication());
    let to_date_skt = new Date("2024-02-01T17:54:03.024Z");
    let from_date_skt = new Date("2024-02-01T17:49:03.012Z");
    let readable_stream = await ldes.readMembersSorted({
        from: from_date_skt,
        until: to_date_skt,
        chronological: true
    });
    readable_stream.on('data', async (data) => {
        console.log(data);
    })
}

for (let i = 0; i < 25; i++) {
    main();
}

I get the following error on the server side,

2024-02-08T15:41:54.838Z [BasicResponseWriter] {W-207433} error: Response error: Lock expired after 6000ms on http://n061-14a.wall2.ilabt.iminds.be:3000/participant6/skt/1706808237692/753550fe-85e8-45fe-9ce0-ae2b13ee2582
2024-02-08T15:41:55.035Z [WrappedExpiringReadWriteLocker] {W-207419} error: Lock expired after 6000ms on http://n061-14a.wall2.ilabt.iminds.be:3000/participant6/skt/1706808237692/753550fe-85e8-45fe-9ce0-ae2b13ee2582
2024-02-08T15:41:55.037Z [HandlerServerConfigurator] {W-207419} error: Request error: aborted
2024-02-08T15:41:55.037Z [StreamUtil] {W-???} warn: Piped stream errored with Lock expired after 6000ms on http://n061-14a.wall2.ilabt.iminds.be:3000/participant6/skt/1706808237692/753550fe-85e8-45fe-9ce0-ae2b13ee2582
2024-02-08T15:41:55.037Z [BasicResponseWriter] {W-207419} error: Aborting streaming response because of server error; headers already sent.
2024-02-08T15:41:55.037Z [BasicResponseWriter] {W-207419} error: Response error: Lock expired after 6000ms on http://n061-14a.wall2.ilabt.iminds.be:3000/participant6/skt/1706808237692/753550fe-85e8-45fe-9ce0-ae2b13ee2582

and on the client side,

TypeError: terminated
    at Fetch.onAborted (node:internal/deps/undici/undici:11323:53)
    at Fetch.emit (node:events:513:28)
    at Fetch.emit (node:domain:489:12)
    at Fetch.terminate (node:internal/deps/undici/undici:10578:14)
    at Object.onError (node:internal/deps/undici/undici:11418:36)
    at Request.onError (/home/kush/Code/RSP/solid-stream-aggregator-evaluation/node_modules/undici/lib/core/request.js:314:27)
    at errorRequest (/home/kush/Code/RSP/solid-stream-aggregator-evaluation/node_modules/undici/lib/client.js:2280:13)
    at Socket.onSocketClose (/home/kush/Code/RSP/solid-stream-aggregator-evaluation/node_modules/undici/lib/client.js:1163:5)
    at Socket.emit (node:events:513:28)
    at Socket.emit (node:domain:489:12) {
  [cause]: SocketError: other side closed
      at Socket.onSocketEnd (/home/kush/Code/RSP/solid-stream-aggregator-evaluation/node_modules/undici/lib/client.js:1129:22)
      at Socket.emit (node:events:525:35)
      at Socket.emit (node:domain:489:12)
      at endReadableNT (node:internal/streams/readable:1359:12)
      at processTicksAndRejections (node:internal/process/task_queues:82:21) {
    code: 'UND_ERR_SOCKET',
    socket: {
      localAddress: '192.168.124.202',
      localPort: 36220,
      remoteAddress: '10.2.32.126',
      remotePort: 3000,
      remoteFamily: 'IPv4',
      timeout: undefined,
      bytesWritten: 498,
      bytesRead: 314671
    }

The server behaves unexpectedly when responding to this number of GET requests.

Some preliminary results. I did some small tests sending a lot of requests to a server on my own machine, with a single document and a single worker thread, sending all requests at the same time as in the for loop from the issue above and awaiting all their results. 2,000 requests were fine, but when I did a loop of 3,000 they no longer got a response, even after waiting much longer than the lock expiration time, except for the very first request, which immediately returned a 401 (instead of a 200). In this case the server also only looks up the ACL for the first request; for the other 2,999 there is no log entry of trying to access the ACL. After stopping the client that is sending all the requests and starting a new one with only 1 request, there is still no result, so it seems that once the server gets stuck it stays stuck, or perhaps it takes longer to get rid of the original connections than I waited. The question then is where it gets stuck and why.

I tried the same for loop but with the default.json config, which uses an in-memory locker and backend. There I still had no issues after sending 10,000 requests. 100,000 did start throwing errors, but that is still different from the file-based situation, where there was just no response. And there the logs showed that the ACL was at least still being accessed.

Thanks for the reply @joachimvh, indeed it performs better when awaiting the results. However, using an await isn't applicable in a real-world scenario where multiple clients are requesting at the same time without any communication among themselves: client 57 doesn't know that it has to wait until a previous client 12's GET promise is resolved.

indeed it performs better when awaiting the results

I meant that I created 300 promises, each doing a fetch, and put them all in a Promise.all. So they were all executed in parallel.
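
Concretely, that looks something like this (a minimal sketch; the URL is the one from the reproduction above):

(async function() {
  // Create all 300 fetch promises up front and await them together, so the requests run in parallel.
  const url = 'http://n061-14a.wall2.ilabt.iminds.be:3000/participant6/skt/1706796226630/e1e1828b-017a-4484-8c9b-a0cae6d93af2';
  const promises = [];
  for (let i = 0; i < 300; i++) {
    promises.push(fetch(url).then((res) => res.status));
  }
  console.log(await Promise.all(promises));
})();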

2000 requests were fine, but when I did a loop of 3000 they no longer got a response

Do you mean 200 and 300 here, or 2k and 3k? Because I am getting the lock error with 300 clients.
Although the status code is 200, when logging the response.text() I still get the socket closed error on the server side and can't log the content of the particular resource.

do you mean 200 and 300 here or 2k and 3k

2k and 3k. But the same machine was client and server in this case, which helps with the results. The core point is that there is a number of requests at which the server seems to become unresponsive.

Indeed. Can you think of some ways in which we can improve the responsiveness of the server, or is there something in the architecture that is currently an obstacle?

To find the cause, more investigation would be necessary. It could have something to do with how the file system is used, but there is no way to really tell from this what exactly is causing it.

Adding caching of resources in memory could probably help if most requests are GETs, as could be seen from the better results of the memory backend. But then you would also need some way to invalidate the cache of other worker threads when you're using more than one.
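
Purely to illustrate the idea (this is not CSS code; the class and the message shape are made up), a naive per-worker cache that gets invalidated across worker threads through the primary process could look like this:

import cluster from 'node:cluster';

// Hypothetical per-worker cache for GET responses, keyed by resource path.
// After any write to a resource, the writing worker notifies the primary,
// which tells every worker to drop its cached copy of that resource.
class WorkerResourceCache {
  private readonly cache = new Map<string, { body: Buffer; contentType: string }>();

  public constructor() {
    // Each worker listens for invalidation messages relayed by the primary.
    process.on('message', (msg: any): void => {
      if (msg?.type === 'invalidate' && typeof msg.path === 'string') {
        this.cache.delete(msg.path);
      }
    });
  }

  public get(path: string): { body: Buffer; contentType: string } | undefined {
    return this.cache.get(path);
  }

  public set(path: string, body: Buffer, contentType: string): void {
    this.cache.set(path, { body, contentType });
  }

  // Call after a PUT/PATCH/DELETE on `path`.
  public invalidate(path: string): void {
    this.cache.delete(path);
    process.send?.({ type: 'invalidate', path });
  }
}

// In the primary process: fan invalidation messages out to all workers.
if (cluster.isPrimary) {
  cluster.on('message', (_worker, msg: any): void => {
    if (msg?.type === 'invalidate') {
      for (const worker of Object.values(cluster.workers ?? {})) {
        worker?.send(msg);
      }
    }
  });
}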

GETs are only part of the experiment, when sensor data is read from the pod; for new data to arrive and be written to the Solid pod, I would expect more PATCH requests to be made.
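
For reference, such a write could look roughly like this (the resource URL and triple are placeholders, not taken from the experiment; CSS also accepts N3 Patch bodies besides SPARQL Update):

// Placeholder example of a SPARQL Update PATCH appending a triple to an RDF resource on the pod.
(async function() {
  const res = await fetch('http://n061-14a.wall2.ilabt.iminds.be:3000/participant6/skt/some-resource', {
    method: 'PATCH',
    headers: { 'content-type': 'application/sparql-update' },
    body: 'INSERT DATA { <#obs1> <https://saref.etsi.org/core/hasValue> "42.0" . }',
  });
  console.log(res.status);
})();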

I spent some time running evaluations with different server configurations trying to see if I could find the cause. Lots of text incoming before I reach my conclusion.

I only tested a for loop with multiple connections. I did not look into all the ldes stuff. This was the test code:

const total = 10000;
const mod = Math.floor(total / 100);

(async function() {
  // Create the test resource first
  await fetch('http://localhost:3000/foo', {
    method: 'PUT',
    headers: {
      'content-type': 'text/plain',
    },
    body: 'hello',
  });
  console.log('starting runs');
  // Fire all GET requests in parallel and wait for every status code
  const promises = [];
  for (let i = 0; i < total; ++i) {
    promises.push(doCall(i));
  }
  console.log(await Promise.all(promises));
})();

async function doCall(i) {
  const res = await fetch('http://localhost:3000/foo');
  // Log progress for every 1% of the requests
  if (i % mod === 0) {
    console.log(i, res.status);
  }
  return res.status;
}

This is a table with the max requests I could do on my machine before running into issues. All of these started from the config/file-root.json config before making changes. I also ran tests where I removed the entire WrappedExpiringReadWriteLocker from the lock setup but that seemed to have no impact. Mostly posting the results here for posterity.

locker   backend   RW locker                max requests
file     file      PartialReadWriteLocker   1,000*
void     file      /                        15,000+
file     memory    PartialReadWriteLocker   1,000*
file     memory    EqualReadWriteLocker     2,000**
memory   file      GreedyReadWriteLocker    1,000*
memory   memory    GreedyReadWriteLocker    5,000***
memory   file      EqualReadWriteLocker     1,000*
  • *: Trying more requests resulted in a 401. The server seemed to get stuck after getting the ACL of the first request.
  • **: More requests caused the locks to time out as this locking method does not allow simultaneous reads.
  • ***: Seemed to get ECONNRESET errors after more. Not sure why. The default.json config gives the same result.

The results seemed to indicate that the problem, or at least one problem, is quite probably related to the locking system. After some more digging, it seemed that the issue was mostly caused by the lock that is acquired on the resource keeping track of the number of open read requests on a resource, as done here:

const read = this.getCountLockIdentifier(identifier);
await this.countLocker.acquire(read);
try {
  return await whileLocked();
} finally {
  await this.countLocker.release(read);
}

For both the PartialReadWriteLocker and the GreedyReadWriteLocker from the table above, that locker is a MemoryResourceLocker, which makes use of the async-lock library. Instead of using the MemoryResourceLocker I tried replacing it with a simple locker:

// When running outside the server's own source tree, these can be imported from the
// @solid/community-server package; inside the repository they would be relative imports.
import { InternalServerError, ResourceIdentifier, ResourceLocker } from '@solid/community-server';

class SimpleLocker implements ResourceLocker {
  // Per resource path: a queue of resolvers for waiting acquire() calls.
  // The presence of an entry (even an empty array) means the resource is currently locked.
  protected locked: Record<string, (() => void)[] | undefined> = {};

  public async acquire(identifier: ResourceIdentifier): Promise<void> {
    const promises = this.locked[identifier.path];
    if (!promises) {
      // Resource is free: mark it as locked with an empty waiter queue.
      this.locked[identifier.path] = [];
      return;
    }
    // Resource is locked: queue a resolver and wait until release() calls it.
    let resolve: () => void;
    const prom = new Promise<void>((res): void => {
      resolve = res;
    });
    promises.push(resolve!);
    await prom;
  }

  public async release(identifier: ResourceIdentifier): Promise<void> {
    // Unlock the next promise if there is one
    const promises = this.locked[identifier.path];
    if (!promises) {
      throw new InternalServerError(`Trying to unlock resource that is not locked: ${identifier.path}`);
    }
    if (promises.length === 0) {
      // No waiters left: the resource becomes free again.
      delete this.locked[identifier.path];
      return;
    }
    // Hand the lock to the oldest waiter (FIFO).
    promises.splice(0, 1)[0]();
  }
}
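
For reference, using it directly looks something like this (a trivial sketch; the identifier path is just an example). Waiters for the same path are queued in FIFO order, so each release hands the lock to the oldest pending acquire:

// Inside an async function:
const locker = new SimpleLocker();
const identifier = { path: 'http://localhost:3000/foo' };
await locker.acquire(identifier);
try {
  // ... read or write the resource while holding the lock ...
} finally {
  await locker.release(identifier);
}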

Using this locker with the default file-root.json config allowed for 10,000 simultaneous requests, seemingly fixing the issue. The only reason it's not more is because then I hit the lock expiration time.

I'm not exactly sure how robust and correct that locking code is, so I also looked into a different locking library. I tried out the async-mutex library, with which I created a similar locker. That one got up to 2,000 requests before hitting the expiration limit, but increasing the expiration allowed for more requests, unlike the current locker using async-lock.
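
For illustration, such a locker could look roughly like the sketch below (not the exact code I ran; it uses the same ResourceLocker interface as above, assumes a single holder per path, and never evicts unused mutexes):

import { Mutex } from 'async-mutex';
// ResourceIdentifier and ResourceLocker as in the SimpleLocker example above.
import { ResourceIdentifier, ResourceLocker } from '@solid/community-server';

class MutexResourceLocker implements ResourceLocker {
  // One mutex per resource path, plus the releaser of the current lock holder.
  // Note: entries are never cleaned up here; a real implementation would evict unused mutexes.
  private readonly mutexes = new Map<string, Mutex>();
  private readonly releasers = new Map<string, () => void>();

  public async acquire(identifier: ResourceIdentifier): Promise<void> {
    let mutex = this.mutexes.get(identifier.path);
    if (!mutex) {
      mutex = new Mutex();
      this.mutexes.set(identifier.path, mutex);
    }
    // Resolves once the mutex is free; the returned releaser unlocks it again.
    const release = await mutex.acquire();
    this.releasers.set(identifier.path, release);
  }

  public async release(identifier: ResourceIdentifier): Promise<void> {
    const release = this.releasers.get(identifier.path);
    if (!release) {
      throw new Error(`Trying to unlock resource that is not locked: ${identifier.path}`);
    }
    this.releasers.delete(identifier.path);
    release();
  }
}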

So probably going to look into replacing the MemoryResourceLocker with a different implementation.

Note that all of this is only relevant when using the file or memory locker. None of this is relevant when using the Redis locker. So if you also have problems when using that one, they would not be solved by this.

I also ran some tests using the Redis locker just now. While it gave 401s for some requests due to the locker timing out, if you put the expiration high enough that these don't occur, the behaviour is similar to not having a locker at all. So while we should probably replace our memory locker, the Redis locker can already support situations with more requests. So if you still have issues when using that locker, even after increasing the expiration, there is a different problem that can't be reproduced by just running a bunch of simultaneous requests.