pola-rs / nodejs-polars

nodejs front-end of polars

Home Page: https://pola-rs.github.io/nodejs-polars/


Encoding issue (German Umlaute)

GeroVanMi opened this issue · comments

Preface: Thank you for taking the time to read my issue. I am unsure whether this is an issue with this library or some underlying dependency and could not find a workaround to solve it. Just reporting this in case others have the same issue.

Have you tried latest version of polars?

  • [yes]

What version of polars are you using?

0.13.0

What operating system are you using polars on?

Ubuntu 24.04

What node version are you using?

node v20.12.2

Describe your bug.

[German Umlaute](https://en.wikipedia.org/wiki/Umlaut_(linguistics)) are not parsed by the readCSV() function, even though they are part of UTF-8. Instead, with encoding "utf8" the affected value is replaced with null, and with "utf8-lossy" the umlaut is replaced with a special error character (�).

I could not figure out why this is the case.

What are the steps to reproduce the behavior?

Dataset (copy this into a file umlaut.csv)

Deutsch;English
Fassadenbegrünung;Green wall

Code:

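const pl = require("nodejs-polars");

// Read the semicolon-separated file; swap the encoding option to compare both behaviors.
const df = pl.readCSV("umlaut.csv", {
  sep: ";",
  // encoding: "utf8",
  encoding: "utf8-lossy",
});
console.log(df);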

What is the actual behavior?

With utf8-lossy:

│ Fassadenbegr�nung             ┆ Green wall                      ┆

With utf8:

│ null             ┆ Green wall                      ┆

What is the expected behavior?

│ Fassadenbegrünung             ┆ Green wall                      ┆

Thank you for your time spent building this library!
This is a minor issue, and I will just rewrite my code in Python and use pandas for the time being, so there is no rush from my side.

Either encoding works fine on macOS Sonoma 14.5. I have tried Bun and Python.
Can you please check your OS or terminal settings? Thx

shape: (1, 2)
┌───────────────────┬────────────┐
│ Deutsch           ┆ English    │
│ ---               ┆ ---        │
│ str               ┆ str        │
╞═══════════════════╪════════════╡
│ Fassadenbegrünung ┆ Green wall │
└───────────────────┴────────────┘
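Besides OS and terminal settings, it can help to check the bytes of the file itself. The following is a minimal sketch, assuming a Node.js build with full ICU (the default for official builds) so that the global TextDecoder supports strict decoding. The helper name isValidUtf8 and the sample buffers are illustrative only and are not part of nodejs-polars.

```javascript
// Returns true if the bytes decode as strict UTF-8, false otherwise.
function isValidUtf8(bytes) {
  try {
    // fatal: true makes the decoder throw on invalid byte sequences.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// In UTF-8, "ü" is stored as the two bytes 0xC3 0xBC.
const asUtf8 = Buffer.from("Fassadenbegrünung", "utf8");
// In a single-byte ISO-8859 encoding, "ü" is the single byte 0xFC,
// which is never valid in UTF-8.
const asSingleByte = Buffer.from("Fassadenbegrünung", "latin1");

console.log(isValidUtf8(asUtf8)); // true
console.log(isValidUtf8(asSingleByte)); // false

// Non-strict decoding substitutes U+FFFD for the invalid byte, which matches
// the "Fassadenbegr�nung" output reported for utf8-lossy.
console.log(new TextDecoder("utf-8").decode(asSingleByte)); // Fassadenbegr�nung
```

To apply the same check to a real file, the bytes could come from fs.readFileSync("umlaut.csv") instead of the sample buffers. A false result suggests the file is stored in some other encoding.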

Ah indeed, thank you!
The issue lies in the original CSV, which is encoded in ISO-8859-3 instead of UTF-8. (Microsoft!!!)

Sorry for the inconvenience and thank you for the quick reply!
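For anyone who cannot re-export the CSV as UTF-8, one possible workaround is to decode the bytes with the file's actual encoding before handing the data to polars. This is a sketch under a few assumptions: the Node.js build includes full ICU, so that TextDecoder accepts the "iso-8859-3" label; the helper name decodeCsvBytes is hypothetical; and the sample buffer stands in for the real file contents.

```javascript
// Decode bytes from a known legacy encoding into a JavaScript string.
// The default label is the encoding named in this thread; adjust it for other files.
function decodeCsvBytes(bytes, encoding = "iso-8859-3") {
  return new TextDecoder(encoding).decode(bytes);
}

// Stand-in for fs.readFileSync("umlaut.csv"): the sample data with "ü" stored
// as the single byte 0xFC, as it is in ISO-8859-3.
const raw = Buffer.from(
  "Deutsch;English\nFassadenbegr\u00fcnung;Green wall\n",
  "latin1",
);

const text = decodeCsvBytes(raw);

// Re-encode as UTF-8; these bytes could be written to a new file and read by path.
const utf8Bytes = Buffer.from(text, "utf8");

console.log(text);
```

The re-encoded bytes can be saved with fs.writeFileSync to a new file, which pl.readCSV can then read with the default "utf8" encoding. If the installed nodejs-polars version accepts CSV contents as well as a path in readCSV, passing the decoded string directly may also work, but that should be checked against its documentation.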