Encoding issue (German Umlaute)
GeroVanMi opened this issue · comments
Preface: Thank you for taking the time to read my issue. I am unsure whether this is an issue with this library or some underlying dependency and could not find a workaround to solve it. Just reporting this in case others have the same issue.
Have you tried latest version of polars?
- [yes]
What version of polars are you using?
0.13.0
What operating system are you using polars on?
Ubuntu 24.04
What node version are you using?
node v20.12.2
Describe your bug.
[German Umlaute](https://en.wikipedia.org/wiki/Umlaut_(linguistics)) are not parsed by the readCSV() function even though they should be part of the UTF-8 encoding. Instead they are replaced with null or, in the case of utf8-lossy, with special error characters.
I could not figure out why this is the case.
What are the steps to reproduce the behavior?
Dataset (copy this into a file umlaut.csv):
Deutsch;English
Fassadenbegrünung;Green wall
Code:
let df = pl.readCSV(`umlaut.csv`, {
  sep: ";",
  // encoding: "utf8",
  encoding: "utf8-lossy",
});
What is the actual behavior?
With utf8-lossy:
│ Fassadenbegr�nung ┆ Green wall ┆
With utf8:
│ null ┆ Green wall ┆
What is the expected behavior?
│ Fassadenbegrünung ┆ Green wall ┆
Thank you for your time spent building this library!
This is a minor issue and I will just rewrite my code in Python and use pandas for the time being, so there is no rush from my side.
Either encoding works fine on macOS Sonoma 14.5. I have tried Bun and Python.
Can you please check your OS or terminal settings? Thx
shape: (1, 2)
┌───────────────────┬────────────┐
│ Deutsch           ┆ English    │
│ ---               ┆ ---        │
│ str               ┆ str        │
╞═══════════════════╪════════════╡
│ Fassadenbegrünung ┆ Green wall │
└───────────────────┴────────────┘
Ah indeed, thank you!
The issue lies in the original CSV, which is encoded in ISO-8859-3 instead of UTF-8. (Microsoft!!!)
Sorry for the inconvenience and thank you for the quick reply!
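For anyone who lands here with the same symptom (null with utf8, replacement characters with utf8-lossy), the cause found above suggests checking the file's bytes before handing it to polars. The following is only a minimal sketch, not part of the library: `isValidUtf8` and `decodeCsvBytes` are hypothetical helper names, and the fallback label `"iso-8859-3"` is simply the encoding reported in this thread. It relies on the standard `TextDecoder`, which I assume is available with full ICU support (the default for official Node builds), since legacy encodings may be missing otherwise.

```typescript
// Hypothetical helper: returns true when the bytes are well-formed UTF-8.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    // With fatal: true, decode() throws on malformed input
    // instead of silently inserting U+FFFD replacement characters.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: decodes as UTF-8 when the bytes are valid UTF-8,
// otherwise falls back to a legacy single-byte encoding.
// The default fallback is an assumption taken from this thread.
function decodeCsvBytes(
  bytes: Uint8Array,
  fallback: string = "iso-8859-3",
): string {
  if (isValidUtf8(bytes)) {
    return new TextDecoder("utf-8").decode(bytes);
  }
  return new TextDecoder(fallback).decode(bytes);
}
```

One possible way to use it: read the raw file with `fs.readFileSync("umlaut.csv")`, pass the buffer to `decodeCsvBytes`, write the resulting string to a new file with `fs.writeFileSync(path, text, "utf8")`, and then call `pl.readCSV` on that UTF-8 copy exactly as in the original report. Note that the fallback cannot detect which legacy encoding was used, so the label has to be known or guessed.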