logpai / loghub

A large collection of system log datasets for AI-driven log analytics [ISSRE'23]

Encoding issue with Linux log

asrmnw opened this issue

I'm not sure whether this is an issue with the dataset itself, but when I try to read the current Linux.log (Zenodo, md5: 6d1802d7778126f21c001c6aa7b6b106) with Python, I get:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 20: invalid start byte

Can you confirm that, or is something going wrong on my side?

My fault. Never mind.

Sorry for the back and forth. After fixing my own mistake, I still get a UnicodeDecodeError. This time:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf7 in position 4536: invalid start byte

This is one place I found when opening the original decompressed log file Linux.log with vim:
[image]
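For anyone who wants to find such places without a viewer, here is a minimal sketch that scans the raw bytes and reports each span that fails to decode. The helper name `find_bad_bytes` and the sample bytes are made up for illustration; only the `start` and `end` attributes of `UnicodeDecodeError` come from Python itself.

```python
def find_bad_bytes(data: bytes, encoding: str = "utf-8", context: int = 20):
    """Yield (offset, bad_bytes, surrounding_bytes) for each undecodable span.

    Hypothetical helper: it re-decodes the remainder after every failure,
    which is simple but not tuned for very large files.
    """
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode(encoding)
            return  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are relative to the slice that was decoded
            start, end = pos + exc.start, pos + exc.end
            window = data[max(0, start - context):end + context]
            yield start, data[start:end], window
            pos = end  # continue scanning after the bad span


# Synthetic sample, not taken from Linux.log
sample = b"Jun 14 sshd\xbf ok\nJun 15 su \xf7 fail\n"
for offset, bad, window in find_bad_bytes(sample, context=8):
    print(offset, bad, window)
```

For a real file, the bytes could be obtained with `open(path, "rb").read()`, where `path` points at the decompressed log.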

To read the file without an exception, I have to pass the errors= option to Python's open function.
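A minimal sketch of what that looks like, assuming the file is meant to be mostly UTF-8. The function name `read_lines_lenient` and the sample log lines are invented for the demo; `errors="replace"` and `errors="ignore"` are standard values accepted by the built-in `open`.

```python
import os
import tempfile


def read_lines_lenient(path, errors="replace"):
    """Read a text file as UTF-8 without raising on undecodable bytes.

    errors="replace" substitutes U+FFFD for each bad byte,
    errors="ignore" silently drops it.
    """
    with open(path, encoding="utf-8", errors=errors) as fh:
        return [line.rstrip("\n") for line in fh]


# Demo on a small synthetic file (not the real Linux.log)
sample = b"Jun 14 15:16:01 sshd: ok\nJun 15 02:04:59 bad \xbf byte\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".log") as tmp:
    tmp.write(sample)
demo_path = tmp.name

try:
    replaced = read_lines_lenient(demo_path)
    ignored = read_lines_lenient(demo_path, errors="ignore")
finally:
    os.remove(demo_path)

print(repr(replaced[1]))
print(repr(ignored[1]))
```

With `errors="replace"` the damaged positions stay visible in the text, which makes them easier to audit later than with `errors="ignore"`.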

If you use raw logs from production, such errors are not uncommon. Please just skip those rows if there are not too many of them.
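One way to skip such rows, sketched under the assumption that the log is line-oriented and otherwise UTF-8: read the file in binary mode, decode each line strictly, and drop the lines that fail. The name `read_skipping_bad_rows` is hypothetical and not part of loghub.

```python
def read_skipping_bad_rows(path, encoding="utf-8"):
    """Return (decoded_lines, skipped_count), dropping lines that fail to decode."""
    good, skipped = [], 0
    with open(path, "rb") as fh:
        for raw in fh:  # binary files iterate line by line, split on b"\n"
            try:
                good.append(raw.decode(encoding).rstrip("\r\n"))
            except UnicodeDecodeError:
                skipped += 1  # keep a count so the loss can be checked
    return good, skipped
```

Calling `rows, skipped = read_skipping_bad_rows("Linux.log")` and comparing `skipped` with `len(rows)` gives a quick check that only a small share of rows was dropped.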