Error using chol Matrix must be positive definite on liblsl 1.14 + LabRecorderCLI 1.13.1

Question

garygan89 opened this issue 4 years ago

Environment Info

debian@sr-imx8:~/liblsl-build/liblsl/build/install/bin$ ./lslver
LSL version: 114
git:v1.14.0b4-2-g1eaaf08c/branch:master/build:/compiler:GNU-10.2.0/link:shared

Tried both LabRecorderCLI tagged v1.13.1 and the latest from [master] branch, but still the same problem.

Issue

When using LSL v1.14 with LabRecorder (master branch from https://github.com/labstreaminglayer/App-LabRecorder), the error in the title occurs when loading roughly 1 out of 5 XDF recordings (or none, if I'm lucky). It is really hard to reproduce, and a Google search points me to #15, which mentions a clock offset bug introduced in liblsl 1.13. Is it still unsolved, and has it somehow crept into 1.14?

I'm running the producer using the sample code, SendDataC. I also make sure to press Enter to close the file correctly, as follows:

debian@sr-imx8:~$ LabRecorderCLI SendDataC4.xdf 'type="EEG"'
Found SendDataC@sr-imx8 matching 'type="EEG"'
Starting the recording, press Enter to quit
2020-10-16 00:09:39.348 (   1.003s) [        AE0941C0]             common.cpp:50    INFO| git:v1.14.0b4-2-g1eaaf08c/branch:master/build:/compiler:GNU-10.2.0/link:shared
Opened the stream SendDataC.
Received header for stream SendDataC.
Started data collection for stream SendDataC.

Offsets thread is finished
Wrote footer for stream SendDataC.
Closing the file.

Not sure if this issue is related to liblsl or LabRecorderCLI.

Chadwick Boulay · Answer 1 · Fri Oct 16 2020 09:30:09 GMT+0800 (China Standard Time)

What's the longest duration xdf file you've had with this problem? Can you attach a problematic one that is at least 1 minute long to this issue?
And you're using the Matlab loader to import it?

If I recall correctly, the clock offsets are retrieved via UDP whereas the data come via TCP. If you have any reason to think that your network configuration might be dropping a large number of UDP packets then this could be the source of your problem.

David Medine · Answer 2 · Fri Oct 16 2020 11:13:06 GMT+0800 (China Standard Time)

I've been encountering this bug lately too. But, I have only gotten it when sending data from a Linux computer and recording on a Windows PC. However, since this has been a pseudo-random problem, this may just be a coincidence. On the other hand, the clock offsets between Windows and non-Windows PCs are getting bigger and bigger all the time. At some point there will be numerical problems when trying to invert the matrix when load_xdf.m does the clock synchronization. I can report that in these cases, I have not seen missing clock offset measurements in the xdf footer. If memory serves, this was the case when we had the UDP packet bug a couple of years ago.

David Medine · Answer 3 · Fri Oct 16 2020 12:57:25 GMT+0800 (China Standard Time)

@garygan89, can you please confirm what Chad asked and also give details of your setup? Specifically, I am interested in what OS you are using on the outlet side and which OS you are using to host LabRecorder.

I can also say that my recent bouts with this issue have only occurred when using liblsl (or is it just LSL now?) >=14. When I have a chance I will downgrade to 13 and see if the problem persists.

garygan89 · Answer 4 · Sat Oct 17 2020 02:04:00 GMT+0800 (China Standard Time)

> What's the longest duration xdf file you've had with this problem? Can you attach a problematic one that is at least 1 minute long to this issue?

The problem seems to be unrelated to file size or duration, since I have hit the error with both 5 MB and 20 MB files. I will upload the problematic files when I'm at the lab later.

> And you're using the Matlab loader to import it?

Yes, I loaded the files with the load_xdf function from EEGLAB (SCCN repo), using both the command line and the EEGLAB GUI.

> If I recall correctly, the clock offsets are retrieved via UDP whereas the data come via TCP. If you have any reason to think that your network configuration might be dropping a large number of UDP packets then this could be the source of your problem.

I ran both the LSL consumer and the producer (SendDataC) in a closed-loop system, specifically the Freescale IMX8 SOM (ARM64/aarch64 architecture) mounted on our custom board, with no IP assigned to eth0 because our custom board does not have a real Ethernet port. I suspected the missing IP at first, but the error also happened on my reference board, which has an IP assigned to eth0.

garygan89 · Answer 5 · Sat Oct 17 2020 02:06:13 GMT+0800 (China Standard Time)

> @garygan89, can you please confirm what Chad asked and also give details of your setup? Specifically, I am interested in what OS you are using on the outlet side and which OS you are using to host LabRecorder.
>
> I can also say that my recent bouts with this issue have only occurred when using liblsl (or is it just LSL now?) >=14. When I have a chance I will downgrade to 13 and see if the problem persists.

Yes, I just followed up on that with Chad. I'm running it on a Freescale IMX8 SOM on our custom board; the OS is Debian 10 (bullseye). The consumer (LabRecorderCLI v1.13.1 with liblsl v1.14) and the producer (SendDataC from the liblsl examples) both run on the same host so that we can form a closed-loop system. I didn't assign an IP to the eth0 interface.

This issue seems to have started with liblsl v1.13 as reported, but I'm not sure whether it has somehow crept into v1.14.

Chadwick Boulay · Answer 6 · Sat Oct 17 2020 03:58:42 GMT+0800 (China Standard Time)

I've never tried that kind of network setup. I'm happy you're using LSL and we'll try to fix this problem as best we can, but if you're running everything in a closed system on a custom platform, why not use shared memory? Shared memory will definitely have lower latency and be more efficient than LSL. LSL wins on flexibility, network synchronization across computers, and compatibility with many devices, but it sounds like you aren't using any of those features. Maybe you plan to?

As you're debugging this, please use https://github.com/xdf-modules/xdf-Matlab instead of the loader that comes with EEGLAB. Ultimately they should be the same thing, but if we provide a fix then it'll appear in xdf-Matlab before EEGLAB.

Also note in the load_xdf function there are many command line options like load_xdf(..., 'HandleClockSynchronization', false);
That'll stop it from trying to do clock synchronization, and clock synchronization is unnecessary when everything is on the same system.

I hope to get Matlab again in a couple weeks. Until then I'll use pyxdf. Please attach the file when you can so I can try loading it in pyxdf to see if it loads and if it doesn't where the error is coming from, then maybe work backwards to find the source.
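For readers using pyxdf rather than the Matlab loader, the counterpart of the HandleClockSynchronization switch appears to be the synchronize_clocks keyword of pyxdf.load_xdf. The helper below is only a sketch: the function name load_without_clock_sync is made up for illustration, and it assumes pyxdf is installed and that skipping synchronization is acceptable because all streams share one host clock.

```python
def load_without_clock_sync(path):
    """Load an XDF file with pyxdf while skipping clock-offset correction.

    Hypothetical helper name; only pyxdf.load_xdf and its
    synchronize_clocks keyword come from the pyxdf package itself.
    """
    import pyxdf  # third-party package, imported lazily: pip install pyxdf

    # synchronize_clocks=False leaves time stamps as recorded, so the
    # regression on clock offsets (and its Cholesky step) is never run.
    streams, header = pyxdf.load_xdf(path, synchronize_clocks=False)
    return streams, header


# Example call (the file name is taken from the recording command above):
# streams, header = load_without_clock_sync("SendDataC4.xdf")
```

This only sidesteps the failing step. It is not appropriate when streams come from different machines.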

garygan89 · Answer 7 · Sat Oct 17 2020 12:18:58 GMT+0800 (China Standard Time)

Thanks Chad. The primary motivation for using LSL in our closed-loop setting is really how LSL is able to synchronize multiple streams (we have EEG, visual stimulus presentation, and markers all running on the same board). And I reckon using LSL is the fastest way for me to pipe them together.

Here is the list of XDF files I uploaded to MF: http://www.mediafire.com/folder/osblwmlc4u9at/LSL_XDF

The ones that gave the error are in the "Problematic" folder. The producer is the SendDataC code from the liblsl examples.

The -noeth in the filename just indicates that LabRecorderCLI was run on the system without any IP assigned to the eth0 interface. eth0 is the only network interface in the system.

I will further try to load them using xdf-Matlab after the weekend and see if that improves.

David Medine · Answer 8 · Mon Oct 19 2020 08:47:42 GMT+0800 (China Standard Time)

I believe this problem results from a numerical issue. When calculating the mapping from outlet to LabRecorder inlet, load_xdf.m must perform a Cholesky decomposition of the matrix that is a combination of timestamps on the outlet PC.

What appears to be happening is that when the timestamps are very, very high---which they are when timestamps are the number of seconds since January 1, 1970---the combination A'A (where the first column of A is 1/.0001 and the second column is the timestamps/.0001) results by definition in a square, symmetric 2x2 matrix. This is theoretically guaranteed to be positive semidefinite, and positive definite whenever there are at least two distinct timestamps, so theoretically it can always be decomposed into LL* by Cholesky. But what appears to be happening is that sometimes the computed A'A has an eigenvalue that is a very, very, very small negative number. This appears to happen randomly as a result of the limits of numerical precision when performing the eigenvalue and Cholesky decompositions. I suspect that this eigenvalue test, or something similar, is what Matlab's chol function does to test for positive definiteness, and this is why it is reporting an error.

For example, when examining eig(A'*A) after breaking at https://github.com/xdf-modules/xdf-Matlab/blob/master/load_xdf.m#L459 and stepping into the robust fit function (https://github.com/xdf-modules/xdf-Matlab/blob/master/load_xdf.m#L747-L783) on the sample @garygan89 provided called 'SendData-C-LabRecorderv1.13.1-noeth0-run1.xdf', I get the following:
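The precision argument can be checked outside Matlab. The sketch below is plain Python and makes several assumptions that are not in the thread: twelve hypothetical clock-offset timestamps in Unix seconds spaced 5 s apart, a scale of 1e4 standing in for 1/WinsorThreshold, and a hand-written 2x2 Cholesky in place of Matlab's chol. It compares the exact value of the second Cholesky pivot with the spacing of float64 numbers near the largest entry of A'A, and then repeats the factorization after subtracting the first timestamp. Centering is shown only as a common remedy for this kind of cancellation, not as the change that was eventually merged.

```python
import math
from fractions import Fraction

SCALE = 1e4  # assumed: 1 / WinsorThreshold for the default threshold of 1e-4


def gram(ts, scale=SCALE):
    """Float64 entries (a, b, c) of A'A = [[a, b], [b, c]], rows [scale, scale*t]."""
    a = sum(scale * scale for _ in ts)
    b = sum(scale * (scale * t) for t in ts)
    c = sum((scale * t) ** 2 for t in ts)
    return a, b, c


def exact_pivot(ts, scale=SCALE):
    """Second Cholesky pivot c - b^2/a, computed exactly with rationals."""
    s2 = Fraction(scale) ** 2
    n = len(ts)
    sum_t = sum(Fraction(t) for t in ts)
    sum_t2 = sum(Fraction(t) ** 2 for t in ts)
    return s2 * (sum_t2 - sum_t ** 2 / n)


def try_cholesky_2x2(a, b, c):
    """Return (l11, l21, l22), or None if a pivot is not strictly positive."""
    if a <= 0:
        return None
    l11 = math.sqrt(a)
    l21 = b / l11
    d = c - l21 * l21  # catastrophic cancellation happens here
    if d <= 0:
        return None
    return l11, l21, math.sqrt(d)


# Hypothetical timestamps: Unix seconds from mid-October 2020, one every 5 s.
raw = [1602806979 + 5 * k for k in range(12)]
centered = [t - raw[0] for t in raw]  # same spacing, small magnitudes

a, b, c = gram(raw)
true_pivot = exact_pivot(raw)
print("exact second pivot      :", float(true_pivot))
print("float64 spacing near c  :", math.ulp(c))
print("Cholesky, raw timestamps:", try_cholesky_2x2(a, b, c))
print("Cholesky, centered      :", try_cholesky_2x2(*gram(centered)))
```

With these inputs the exact pivot is smaller than the gap between adjacent float64 values near c, so the computed pivot on raw timestamps is decided by rounding and can land on either side of zero. The centered data has the same exact pivot but no large terms to cancel.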

K>> eig(A'*A)

ans =

   1.0e+27 *

   0.000000000000000
   2.312098056682916

And here chol works. However, if I do the same for the 'problematic' data in 'SendDataC-LabRecorderv1.13.1-noeth0-run3.xdf' I get this:

K>> eig(A'*A)

ans =

   1.0e+27 *

  -0.000000000000000
   3.082798078938963

Note the negative sign in the first value. In both cases the first eigenvalue should be 0, or very close to it on the positive side, but due to precision, it sometimes ends up on the negative and this (I am guessing) is what stops chol in its tracks.

The workaround seems to be to increase the WinsorThreshold value. I confess that I have never fully understood how this works or how this parameter truly has a Winsorizing effect on the ADMM algorithm, but when I set it to 1 (as I mentioned above, the default is .0001), the matrix A is smaller by a factor of 1e4 (so A'A is smaller by a factor of 1e8) and the numerical problem disappears:

K>> eig(A'*A)

ans =

   1.0e+19 *

   0.000000000000000
   3.082798078938964

By calling load_xdf.m with this option (s = load_xdf('SendDataC-LabRecorderv1.13.1-noeth0-run3.xdf', 'WinsorThreshold', 1.0);), I can successfully load the data set.
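The effect of the threshold on the size of A'A can be reproduced with a few lines of Python. This is a sketch under assumptions: the timestamps are hypothetical (Unix seconds, 5 s apart), and A is taken to have rows [1/w, t/w] for a threshold w, as described above. It shows that changing w from 1e-4 to 1 divides the largest eigenvalue by 1e8, the same orders of magnitude (1e27 versus 1e19) as in the console output.

```python
import math


def gram(ts, scale):
    """Float64 entries (a, b, c) of A'A = [[a, b], [b, c]], rows [scale, scale*t]."""
    a = sum(scale * scale for _ in ts)
    b = sum(scale * (scale * t) for t in ts)
    c = sum((scale * t) ** 2 for t in ts)
    return a, b, c


def largest_eigenvalue(a, b, c):
    """Largest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    return (a + c) / 2 + math.hypot((a - c) / 2, b)


# Hypothetical clock-offset timestamps: Unix seconds, one every 5 s.
ts = [1602806979 + 5 * k for k in range(12)]

lam_default = largest_eigenvalue(*gram(ts, 1 / 1e-4))  # WinsorThreshold = 1e-4
lam_relaxed = largest_eigenvalue(*gram(ts, 1 / 1.0))   # WinsorThreshold = 1
ratio = lam_default / lam_relaxed
print(f"{lam_default:.3e}  {lam_relaxed:.3e}  ratio = {ratio:.3e}")
```

Every entry of A'A is multiplied by the same constant, so in exact arithmetic the ratio between its two eigenvalues does not change. Only the rounding pattern differs, which suggests the rescaling is not a dependable fix.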

Again, I am not sure how this affects the precision of the clock offset mapping, but this will allow you to load these problematic sets. I am also unsure what to do to fix this. If we are at the point where the time since the Epoch is so great that this is going to happen, then this whole mechanism needs to be fixed. After all, this workaround will stop working in about 100 million seconds ;-).

I am also unsure where to re-open this issue. Is it a problem with xdf-Matlab or liblsl? It is definitely not, however, a problem with LabRecorder, and that is a good thing.

David Medine · Answer 9 · Mon Oct 19 2020 08:49:50 GMT+0800 (China Standard Time)

Also, 100 million seconds is only 3 years, so the clock is literally ticking!

garygan89 · Answer 10 · Tue Oct 20 2020 06:17:20 GMT+0800 (China Standard Time)

Thanks for the detailed investigation @dmedine! I must admit I have little knowledge of Cholesky decomposition, but it is certainly good that these problematic XDF files are still loadable with no data loss. The loss of clock offset precision might not be as important in my case, since everything is streamed and timestamped on the same closed-loop host (I hope this is a correct statement for recording and streaming on the same host). More investigation is probably needed to see its effect on synchronizing multiple data streams.

As mentioned, it does sound like this is more of an issue inherent to liblsl than to LabRecorder, since the data streams are captured without any loss.

I will try your method and see if I can salvage all of those problematic ones.

David Medine · Answer 11 · Mon Oct 26 2020 12:12:01 GMT+0800 (China Standard Time)

I have been trying to do some more experiments with Raspberry Pi, and I believe this problem is very terrible. Currently, I am unable to synchronize streams between Windows and Raspbian when recording on Windows. The winsor threshold trick is distorting the signal beyond recognition.

I am not sure how to confirm my original hypothesis that this is a numerical issue, but I will try some XDF surgery and see what I can figure out. In the meantime, I would say that you should proceed with extreme caution. Sorry.

Chadwick Boulay · Answer 12 · Thu Dec 31 2020 05:24:27 GMT+0800 (China Standard Time)

The fix has been merged in both xdf-Matlab and pyxdf.
@garygan89 , please let us know if you are still experiencing any problems.