databricks / koalas

Koalas: pandas API on Apache Spark

Koalas idxmin() does not pick the minimum value from a DataFrame, but pandas idxmin() does

nikeshv opened this issue

Hi,
I have a Koalas DataFrame with Age and Income columns. I calculated the z-score of each column (Age_zscore and Income_zscore), then calculated the norm of each pair of z-scores and stored it in a new column named sq_dist. When I call idxmin() on the new column, the index it returns does not point to the minimum value.
I performed the same operations on a pandas DataFrame, and there idxmin() does return the index of the minimum value.
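
To make the intended computation concrete, here is a minimal plain-Python sketch of the same steps on a tiny hand-made dataset. The sample values and the helper names `zscore` and `idxmin` are made up for illustration and are not the Koalas or pandas implementations. It assumes the population standard deviation, which is the default in `scipy.stats.zscore`.

```python
import math
import statistics


def zscore(values):
    """Standardize values using the population standard deviation (ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def idxmin(values):
    """Return the position of the smallest value (the first one on ties)."""
    return min(range(len(values)), key=values.__getitem__)


# Hypothetical sample data, chosen only for illustration.
age = [25, 40, 60, 35]
income = [30000, 52000, 90000, 48000]

age_z = zscore(age)
income_z = zscore(income)

# Euclidean norm of each (Income_zscore, Age_zscore) pair.
sq_dist = [math.hypot(i, a) for i, a in zip(income_z, age_z)]

min_index = idxmin(sq_dist)

# The value at the returned index should equal the overall minimum.
assert sq_dist[min_index] == min(sq_dist)
print(min_index, sq_dist[min_index])
```

The final assertion is the property I expect to hold in the notebook below: the value at the index returned by idxmin() should equal the minimum of sq_dist.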

Please find attached the notebook with the step-by-step operations I performed.

cmd1
import databricks.koalas as ks
import pandas as pd
import random

cmd2
# Create a sample DataFrame in Koalas
df = ks.DataFrame.from_dict({
    'Age': [random.randint(0, 100000) for _ in range(100000)],
    'Income': [random.randint(0, 100000) for _ in range(100000)]
})

print(df.head(5))

cmd3
import scipy.stats as stats
import numpy as np
# Allow operations that combine different Koalas DataFrames/Series
ks.set_option('compute.ops_on_diff_frames', True)
# Z-score each column, then take the norm of each (Income_zscore, Age_zscore) row
df['Income_zscore'] = ks.Series(stats.zscore(df['Income'].to_numpy()))
df['Age_zscore'] = ks.Series(stats.zscore(df['Age'].to_numpy()))
df['sq_dist'] = [np.linalg.norm(i) for i in df[['Income_zscore','Age_zscore']].to_numpy()]
ks.set_option('compute.ops_on_diff_frames', False)

cmd4
#display(df)

cmd5
# Find the index of the minimum sq_dist
minindex = df['sq_dist'].idxmin()
minindex

cmd6
# Display the sq_dist value at that index
df['sq_dist'].iloc[minindex]

cmd7
df.to_spark().createOrReplaceTempView("koalastable")

cmd8
%sql
select min(sq_dist) from koalastable -- This doesn't match the value we got in cmd6

cmd9
# Repeat the same operations with pandas
df_spark = df.to_spark()
stats_array = np.array(df_spark.select('Age', 'Income').collect())
normalized_data = stats.zscore(stats_array, axis=0)
df_pd = pd.DataFrame(data=normalized_data, columns=['Age', 'Income'])
df_pd['sq_dist'] = [np.linalg.norm(i) for i in normalized_data]
df_pd.head(5)

cmd10
minindex_pd = df_pd['sq_dist'].idxmin()
minindex_pd

cmd11
# Display the minimum of sq_dist using pandas
df_pd['sq_dist'].iloc[minindex_pd]

cmd12
spark.createDataFrame(df_pd).createOrReplaceTempView("pandastable")

cmd13
%sql
select min(sq_dist) from pandastable -- This matches the value we got in cmd11