databricks / koalas

Koalas: pandas API on Apache Spark

pandas fixed width file support

johnayoub opened this issue

Is there a plan to support the fixed-width API?

Koalas is now officially part of Apache Spark, so let's file an issue there. From a cursory look, it seems we could implement it by (1) distributing the input StringIO, and (2) reading any file from the distributed file source.

cc @xinrong-databricks and @itholic since you guys are triaging the issues.

Thanks @HyukjinKwon. Let me know if you need me to open an issue there.

@johnayoub Sure, can you open an issue in the Apache Spark JIRA?

@itholic I opened a new issue there.

@itholic @HyukjinKwon any update on this, and when can I expect it to be included in Koalas?

Hi, @johnayoub

Unfortunately, we have no clear plan to add read_fwf yet (at the earliest, it would be available in Spark 3.3 or later).

In any case, it will be added to PySpark first and to Koalas afterwards.
(So we'd recommend using PySpark rather than Koalas, since Koalas is now in maintenance mode.)

FYI, you can easily convert your Koalas code to PySpark with a single-line change, as below:

# import databricks.koalas as ks
import pyspark.pandas as ks

By the way, just in case: if you want to read files over HTTP, it will take longer, since PySpark doesn't support reading from such file sources yet. Refer to #1219 for more details about HTTP support.

Thanks @itholic!

The import that you mentioned isn't supported yet, is it?

import pyspark.pandas as ks

Also, in the meantime, any recommendation for dealing with fixed-width format in Spark?