databricks / koalas

Koalas: pandas API on Apache Spark

Series.to_json(orient='records') does not return records-based JSON

klenium opened this issue · comments

import databricks.koalas as ks

df = ks.DataFrame([['a', 'b'], ['c', 'd']], columns=['col 1', 'col 2'])

def add_json(row):
  row['serialized_row_content'] = row.to_json()
  return row

df = df.apply(add_json, axis=1)

print(df)

  col 1 col 2     serialized_row_content
0     a     b  {"col 1":"a","col 2":"b"}
1     c     d  {"col 1":"c","col 2":"d"}

That works as expected. The documentation says:

orient str, default ‘records’
It should be always ‘records’ for now.

So if instead of row.to_json() I write row.to_json(orient='records'), the output should be identical. But it isn't:

  col 1 col 2 serialized_row_content
0     a     b              ["a","b"]
1     c     d              ["c","d"]

That looks more like pandas' 'values'-style output than 'records'.

Very interesting; I don't see the reason for this behavior in its source code. :)

row['type'] = str(type(row))  # -> <class 'pandas.core.series.Series'>

Well, that's unexpected. Why is a pandas Series used there? And why doesn't it return records-based JSON?
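If the row really is a plain pandas Series inside apply, that would explain both outputs: for a pandas Series, to_json() defaults to the 'index' orientation (a label-to-value object), while orient='records' emits just the values as a JSON array. A minimal plain-pandas sketch of that behavior (no Koalas involved, just reproducing the two serializations):

```python
import pandas as pd

# Simulate the row that apply() passes to the function: a pandas Series
# whose index holds the column labels.
row = pd.Series({"col 1": "a", "col 2": "b"})

# Default orientation for a Series is 'index': labels become JSON keys.
print(row.to_json())                  # {"col 1":"a","col 2":"b"}

# 'records' on a Series drops the labels and emits only the values.
print(row.to_json(orient="records"))  # ["a","b"]
```

So the documented "always 'records'" wording doesn't match what actually happens when the underlying object is a pandas Series rather than a Koalas one.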

The same applies to pandas API on Spark. If I follow the documentation and call to_json('records'), the output is None, so I get errors.