read error UTF8

Question

bungernut opened this issue 2 years ago · comments

I have a Digilent USB scope that saves waveforms in TDMS format with the WaveForms program, which I think is built on NI software.

I have done a few experiments where triggers save waveforms and everything works.
In the latest experiment, none of the files are readable by nptdms, although all of them are readable with NI Scout and their Excel plug-in.

Any help figuring out why these files are not readable would be appreciated. I have zipped and attached a good and a bad file.

Traceback of bad file read

~\Anaconda3\envs\pydan\lib\site-packages\nptdms\types.py in read(file, endianness)
    206         size_bytes = file.read(4)
    207         size = _struct_unpack(endianness + 'L', size_bytes)[0]
--> 208         return file.read(size).decode('utf-8')
    209 
    210     @classmethod

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 2: invalid start byte

Good: good_file.zip
Bad: bad_utf8.zip

Adam Reeve · Answer 1 · Mon Jan 16 2023 16:50:29 GMT+0800 (China Standard Time)

Hi @bungernut, it looks like your file has a property with an invalid UTF-8 encoded string. I tested changing the string decoding to replace any invalid bytes with a replacement character:

--- a/nptdms/types.py
+++ b/nptdms/types.py
@@ -205,7 +205,7 @@ class String(TdmsType):
     def read(file, endianness="<"):
         size_bytes = file.read(4)
         size = _struct_unpack(endianness + 'L', size_bytes)[0]
-        return file.read(size).decode('utf-8')
+        return file.read(size).decode('utf-8', errors='replace')
 
     @classmethod
     def read_values(cls, file, number_values, endianness="<"):

With this I can successfully read the file but the Phase property of the file is decoded as "0 �".

Looking at the raw bytes, the property is b'0 \xb0'. It looks like your file might actually be using the ISO/IEC 8859-1 encoding, which uses a single \xb0 byte for the degree symbol. But the TDMS documentation clearly states that all strings should be encoded in UTF-8, and in UTF-8 a degree symbol is \xc2\xb0.
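
The byte-level difference can be checked directly in Python. This is a small sketch using only the built-in codecs; `raw` is the property value quoted above, and the variable names are just for illustration.

```python
raw = b'0 \xb0'  # bytes of the Phase property as stored in the file

# Strict UTF-8 decoding fails: 0xb0 is a continuation byte, not a valid start byte.
try:
    raw.decode('utf-8')
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Replacement handling substitutes U+FFFD for the invalid byte.
replaced = raw.decode('utf-8', errors='replace')

# ISO/IEC 8859-1 (Latin-1) maps the single byte 0xb0 to the degree symbol.
latin1 = raw.decode('latin-1')

# A file that follows the TDMS documentation would store the degree symbol as two bytes.
utf8_degree = '\u00b0'.encode('utf-8')
```

The failure is reported at byte offset 2, which matches the "position 2" in the traceback.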

Are you able to change how you write files to avoid this issue? Otherwise it might make sense to make the replacement error-handling behaviour the default but log a warning if an error is encountered, or make the error-handling behaviour configurable.
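
As a rough sketch of what configurable error handling with a warning could look like (this is not the npTDMS implementation; `read_string`, its `errors` parameter, and the logger name are hypothetical and only illustrate the idea):

```python
import io
import logging
import struct

log = logging.getLogger("tdms_string_sketch")  # hypothetical logger name


def read_string(file, endianness="<", errors="replace"):
    """Read a 4-byte length prefix followed by that many UTF-8 bytes.

    Hypothetical helper: with errors="strict" invalid bytes raise as before,
    otherwise a warning is logged and the given error handler is applied.
    """
    size = struct.unpack(endianness + "L", file.read(4))[0]
    raw = file.read(size)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        if errors == "strict":
            raise
        log.warning("Invalid UTF-8 in string %r: %s", raw, exc)
        return raw.decode("utf-8", errors=errors)


# Build a length-prefixed string holding the problematic property bytes.
payload = b"0 \xb0"
buffer = io.BytesIO(struct.pack("<L", len(payload)) + payload)
value = read_string(buffer)  # logs a warning and returns the repaired string
```

Trying the strict decode first means valid files take the same path as before, and the warning is only emitted when a string actually contains invalid bytes.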

Adam Reeve · Answer 2 · Tue Jan 17 2023 04:39:38 GMT+0800 (China Standard Time)

I checked how the LabVIEW TDMS File Viewer handles this property and it displays a "?" instead of the degree symbol (screenshot: NI TDMS File Viewer properties).

Brian Mong · Answer 3 · Tue Jan 17 2023 06:18:08 GMT+0800 (China Standard Time)

That's really fantastic that you figured that out. I will contact Digilent and see if they are aware of the UTF-8 issue. I also really appreciate the fix idea; I will test it and probably apply it in my own fork. Do you think this would be something you'd support in the future?

Brian Mong · Answer 4 · Tue Jan 17 2023 06:31:28 GMT+0800 (China Standard Time)

For posterity, I have reported the issue on Digilent's forums: https://forum.digilent.com/topic/24719-waveforms-tdms-output-files-not-utf-8/

Adam Reeve · Answer 5 · Tue Jan 17 2023 06:47:27 GMT+0800 (China Standard Time)

Yeah I think changing npTDMS to use the replacement error handling approach would be a good idea, with logging a warning so that users know the data might have an issue. That way a minor issue like this doesn't prevent the rest of the file from being read. I'm happy to make this change.