read error UTF8
bungernut opened this issue · comments
I have a Digilent USB scope saving waveforms in TDMS format with the WaveForms program, which I think is built on NI software.
I have done a few experiments where triggers save waveforms and everything works.
In the last experiment, none of the files are readable by nptdms, but all of them are readable with NI Scout and their Excel plugin.
Any help figuring out why these files are not readable would be appreciated. I have zipped and attached a good and a bad file.
Traceback of bad file read
~\Anaconda3\envs\pydan\lib\site-packages\nptdms\types.py in read(file, endianness)
206 size_bytes = file.read(4)
207 size = _struct_unpack(endianness + 'L', size_bytes)[0]
--> 208 return file.read(size).decode('utf-8')
209
210 @classmethod
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 2: invalid start byte
Hi @bungernut, it looks like your file has a property with an invalid UTF-8 encoded string. I tested changing the string decoding to replace any invalid bytes with a replacement character:
--- a/nptdms/types.py
+++ b/nptdms/types.py
@@ -205,7 +205,7 @@ class String(TdmsType):
     def read(file, endianness="<"):
         size_bytes = file.read(4)
         size = _struct_unpack(endianness + 'L', size_bytes)[0]
-        return file.read(size).decode('utf-8')
+        return file.read(size).decode('utf-8', errors='replace')
 
     @classmethod
     def read_values(cls, file, number_values, endianness="<"):
With this I can successfully read the file, but the Phase property of the file is decoded as "0 �".
Looking at the raw bytes, the property is b'0 \xb0'. It looks like your file might actually be using the ISO/IEC 8859-1 encoding, which uses a single \xb0 byte for the degree symbol. But the TDMS documentation clearly states that all strings should be encoded in UTF-8, and in UTF-8 a degree symbol is \xc2\xb0.
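To make the byte-level difference concrete, here is a small standalone check using only Python's built-in codecs. The variable names are just for illustration; the only input taken from the file is the raw value b'0 \xb0' quoted above.

```python
raw = b"0 \xb0"  # raw bytes of the Phase property, as quoted above

# Strict UTF-8 decoding fails: 0xb0 is a continuation byte, not a valid start byte.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# errors="replace" substitutes U+FFFD (the replacement character) for the bad byte.
replaced = raw.decode("utf-8", errors="replace")

# ISO/IEC 8859-1 (Latin-1) maps the single byte 0xb0 to the degree sign.
latin1 = raw.decode("iso-8859-1")

# A degree sign correctly encoded as UTF-8 takes two bytes.
utf8_degree = "\u00b0".encode("utf-8")

print(strict_error)
print(repr(replaced), repr(latin1), utf8_degree)
```

The strict decode fails at position 2, matching the traceback, while the Latin-1 decode recovers the intended degree sign.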
Are you able to change how you write files to avoid this issue? Otherwise it might make sense to make the replacement error-handling behaviour the default but log a warning if an error is encountered, or make the error-handling behaviour configurable.
For posterity, I have reported the issue on Digilent's forums: https://forum.digilent.com/topic/24719-waveforms-tdms-output-files-not-utf-8/
Yeah I think changing npTDMS to use the replacement error handling approach would be a good idea, with logging a warning so that users know the data might have an issue. That way a minor issue like this doesn't prevent the rest of the file from being read. I'm happy to make this change.
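As a rough idea of what "replace and warn" could look like, here is a minimal standalone sketch. It is not the actual npTDMS change: `read_string`, `pack_string`, the `errors` parameter, and the logger name are all hypothetical, and it uses the standard `struct` module directly rather than the library's internal helpers.

```python
import io
import logging
import struct

log = logging.getLogger("tdms_sketch")  # hypothetical logger name


def read_string(file, endianness="<", errors="replace"):
    """Read a 4-byte length prefix, then decode that many bytes as UTF-8.

    On invalid UTF-8, either re-raise (errors="strict") or log a warning
    and decode again with the requested error handler.
    """
    size_bytes = file.read(4)
    size = struct.unpack(endianness + "L", size_bytes)[0]
    raw = file.read(size)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        if errors == "strict":
            raise
        log.warning(
            "Invalid UTF-8 in string %r (%s); decoding with errors=%r",
            raw, exc.reason, errors,
        )
        return raw.decode("utf-8", errors=errors)


def pack_string(raw, endianness="<"):
    """Hypothetical helper: build a length-prefixed string for testing."""
    return struct.pack(endianness + "L", len(raw)) + raw


# The invalid value from the bad file is read with a replacement character.
bad = read_string(io.BytesIO(pack_string(b"0 \xb0")))
# A correctly encoded value is returned unchanged and logs nothing.
good = read_string(io.BytesIO(pack_string("0 \u00b0".encode("utf-8"))))
print(repr(bad), repr(good))
```

Passing `errors="strict"` keeps the current raising behaviour, so callers who would rather fail on a malformed file could still opt in to that.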