hablapps / sparkOptics

Optics for Spark DataFrames

Execution slows down after 3 modifications of complex structure.

chethan1212 opened this issue

I am trying to modify multiple values in a complex DF.
I have copied the relevant parts below.

df.printSchema
root
..|--eventData
........|--stringVals
............|--idNumber
............|--firstName
............|--secondName
............|--email
............|--phone
............|--age
............|--dob
..|--idNumber
..|--firstName
..|--secondName
..|--email
..|--phone

import org.apache.spark.sql.functions.col

var tempDF = df
val listOfFields = List("idNumber", "firstName", "secondName", "email", "phone")

listOfFields.foreach { eachField =>
  val lens = Lens("eventData.stringVals." + eachField)(tempDF.schema)
  // df contains a root-level column holding the replacement value for each field
  val tempFunc = lens.setDF(col(eachField))
  tempDF = tempFunc(tempDF)
}

With 3 fields in listOfFields, the code executes quickly, but when I add 5 fields it slows down. I am trying to replace around 25 values in a complex DF that contains hundreds of columns at multiple levels.

Please review and suggest a better option.

Thank you,
Che

Hi @chethan1212, what version of Spark are you using? In newer versions, there's an official alternative to sparkOptics: https://towardsdatascience.com/spark-3-nested-fields-not-so-nested-anymore-9b8d34b00b95
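
For readers on Spark 3.1 or later, the official alternative referred to here is presumably `Column.withField`, which accepts a dotted path into a nested struct. The following is a minimal sketch under that assumption, not a tested replacement for the lens-based loop: `replaceNested` and `sampleDf` are hypothetical helper names, and the two-field sample data only imitates the shape of the schema in the question. It does not apply to Spark 2.4.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

object WithFieldSketch {

  // Hypothetical helper: builds ONE column expression that overwrites each
  // eventData.stringVals.<field> with the root-level column of the same name,
  // then applies it in a single withColumn call.
  // Requires Spark 3.1+, where Column.withField was introduced.
  def replaceNested(df: DataFrame, fields: Seq[String]): DataFrame = {
    val updated: Column = fields.foldLeft(col("eventData")) { (acc, field) =>
      acc.withField(s"stringVals.$field", col(field))
    }
    df.withColumn("eventData", updated)
  }

  // Hypothetical sample data: nested values start as "old-*",
  // root-level columns carry the "new-*" replacement values.
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("old-id", "old-name", "new-id", "new-name"))
      .toDF("oldId", "oldName", "idNumber", "firstName")
      .select(
        struct(
          struct(
            col("oldId").as("idNumber"),
            col("oldName").as("firstName")
          ).as("stringVals")
        ).as("eventData"),
        col("idNumber"),
        col("firstName")
      )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("withField-sketch")
      .getOrCreate()

    val result = replaceNested(sampleDf(spark), Seq("idNumber", "firstName"))
    result.select("eventData.stringVals.*").show()

    spark.stop()
  }
}
```

The fold builds a single expression and calls `withColumn` once, rather than deriving a new DataFrame per field. Whether that changes planning time for 25 fields on a wide schema would need to be measured on the real data.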

Hi @alfonsorr ,

Thanks for the quick response. Unfortunately we are still on Spark 2.4.
Is there anything I could do differently with the Optics API?

Thank you,
Che