adamreeve / npTDMS

NumPy based Python module for reading TDMS files produced by LabView

Home Page: http://nptdms.readthedocs.io

read error UTF8

bungernut opened this issue

I have a Digilent USB scope saving waveforms in TDMS format with the WaveForms program, which I think is built on NI software.

I have done a few experiments where triggers save waveforms, and everything worked.
In the last experiment, however, none of the files are readable by nptdms, although all of them are readable with NI Scout and their Excel plug-in.

Any help working out why these files are not readable would be appreciated. I have zipped and attached a good and a bad file.

Traceback of bad file read

~\Anaconda3\envs\pydan\lib\site-packages\nptdms\types.py in read(file, endianness)
    206         size_bytes = file.read(4)
    207         size = _struct_unpack(endianness + 'L', size_bytes)[0]
--> 208         return file.read(size).decode('utf-8')
    209 
    210     @classmethod

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 2: invalid start byte

Good: image
Bad: image

good_file.zip
bad_utf8.zip

Hi @bungernut, it looks like your file has a property with an invalid UTF-8 encoded string. I tested changing the string decoding to replace any invalid bytes with a replacement character:

--- a/nptdms/types.py
+++ b/nptdms/types.py
@@ -205,7 +205,7 @@ class String(TdmsType):
     def read(file, endianness="<"):
         size_bytes = file.read(4)
         size = _struct_unpack(endianness + 'L', size_bytes)[0]
-        return file.read(size).decode('utf-8')
+        return file.read(size).decode('utf-8', errors='replace')
 
     @classmethod
     def read_values(cls, file, number_values, endianness="<"):

With this I can successfully read the file but the Phase property of the file is decoded as "0 �".

Looking at the raw bytes, the property is b'0 \xb0'. It looks like your file might actually be using the ISO/IEC 8859-1 encoding, which uses a single \xb0 byte for the degree symbol. But the TDMS documentation clearly states that all strings should be encoded in UTF-8, and in UTF-8 a degree symbol is \xc2\xb0.
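The difference between the two encodings can be checked directly in Python. This is a minimal sketch using only the bytes quoted above; the variable names are illustrative and not part of npTDMS.

```python
# The raw property bytes quoted above: "0", a space, then a single 0xb0 byte.
raw = b"0 \xb0"

# Strict UTF-8 decoding fails, because 0xb0 on its own is a continuation
# byte and cannot start a character.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print(exc)

# With errors="replace", the invalid byte becomes U+FFFD (the replacement character).
replaced = raw.decode("utf-8", errors="replace")

# Interpreted as ISO/IEC 8859-1 (Latin-1), the same byte is the degree symbol.
latin1 = raw.decode("iso-8859-1")

# A correctly written file would store the degree symbol as two UTF-8 bytes.
utf8_degree = "\u00b0".encode("utf-8")

print(repr(replaced), repr(latin1), utf8_degree)
```

So the property value is well-formed Latin-1 but not well-formed UTF-8, which matches the error in the traceback.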

Are you able to change how you write files to avoid this issue? Otherwise it might make sense to make the replacement error-handling behaviour the default but log a warning if an error is encountered, or make the error-handling behaviour configurable.

I checked how the LabView TDMS File Viewer handles this property and it displays a "?" instead of the degree symbol:

NI TDMS File Viewer properties

For posterity, I have reported the issue on Digilent's forum: https://forum.digilent.com/topic/24719-waveforms-tdms-output-files-not-utf-8/

Yeah, I think changing npTDMS to use the replacement error-handling approach would be a good idea, along with logging a warning so that users know the data might have an issue. That way a minor issue like this doesn't prevent the rest of the file from being read. I'm happy to make this change.
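As a rough illustration of that approach, here is a standalone sketch of a length-prefixed string reader that falls back to replacement decoding and logs a warning. It is not the actual npTDMS change: the function name `read_string`, the logger, and the warning text are all hypothetical, and it uses the standard `struct` module in place of the library's internal `_struct_unpack` helper.

```python
import io
import logging
import struct

log = logging.getLogger(__name__)


def read_string(file, endianness="<"):
    """Read a 4-byte length prefix followed by that many bytes of UTF-8 text.

    Hypothetical helper for illustration only. Invalid bytes are replaced
    with U+FFFD and a warning is logged instead of raising an exception.
    """
    size_bytes = file.read(4)
    size = struct.unpack(endianness + "L", size_bytes)[0]
    raw = file.read(size)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Warn so users know the decoded value may not match the original.
        log.warning("Error decoding string from bytes %r: %s", raw, exc)
        return raw.decode("utf-8", errors="replace")


# Usage: the invalid property from this issue, wrapped in an in-memory file.
payload = b"0 \xb0"
stream = io.BytesIO(struct.pack("<L", len(payload)) + payload)
value = read_string(stream)

# A correctly encoded string is returned unchanged and logs nothing.
good_payload = "0 \u00b0".encode("utf-8")
good_value = read_string(io.BytesIO(struct.pack("<L", len(good_payload)) + good_payload))
```

Trying strict decoding first keeps the common path unchanged and only pays for the warning when a file is actually malformed.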