AnkushKhanna / spark-common

Spark Commons, some hacks to simplify programming with Spark.

Chaining multiple transformers:

When combining several transformations, I repeatedly had to chain Transformers by hand or build a pipeline.

Although a pipeline is the go-to approach, it sometimes became cumbersome while experimenting with different transformations.

For example, it requires maintaining intermediate transformation columns and passing column names between transformers.
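To make the bookkeeping concrete, here is a minimal sketch of the standard Spark ML pipeline for the same three steps. It is not taken from this repository: the column names ("text", "tokens", "rawFeatures", "features"), the object name and the sample sentences are all assumptions chosen for illustration. Note how each stage's input column has to be kept in sync with the previous stage's output column.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Illustrative only: column names and sample data are assumptions.
object PipelineColumns {
  // Every stage must name its input column after the previous stage's output column.
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("tokens")
  val hashingTF = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures")
  val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")

  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pipeline-columns")
      .getOrCreate()
    import spark.implicits._

    val docs = Seq(
      "spark makes chaining easy",
      "chaining transformers in spark"
    ).toDF("text")

    // Fit the IDF stage and run all three stages in order.
    pipeline.fit(docs).transform(docs).select("text", "features").show(truncate = false)
    spark.stop()
  }
}
```

Dropping or swapping a stage here means renaming the columns of its neighbours as well, which is the overhead the Transform class tries to hide.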

So I built a Transform class, which is extended by traits for the transformers I use most often: Tokenizer, Hashing, and TFIDF.

To chain Tokenizer and Hashing, we can now write:

val transform = new Transform with TTokenize with THashing

To add TFIDF the transformer would look like:

val transform = new Transform with TTokenize with THashing with TIDF

The traits are applied from left to right: first Tokenizer, then Hashing, and finally TFIDF.

This makes life easier when trying out different Transformer combinations, without the headache of maintaining intermediate columns.
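The left-to-right behaviour can be reproduced with Scala's stackable trait pattern. The sketch below is Spark-free and is not the repository's implementation: `Frame` is a hypothetical stand-in for a DataFrame that only tracks column names, and the `_tokens`, `_hashed` and `_idf` suffixes are made up. Each trait calls `super.transform` first and then adds its own step, so the leftmost trait runs first and the intermediate column name is derived automatically.

```scala
// Hypothetical stand-in for a DataFrame: the ordered list of columns produced so far.
final case class Frame(columns: Vector[String]) {
  def last: String = columns.last
  def withColumn(name: String): Frame = copy(columns = columns :+ name)
}

// Base class: passes the input through unchanged.
class Transform {
  def transform(in: Frame): Frame = in
}

// Each trait runs the steps to its left first (via super), then appends its own
// output column, named after the most recent column.
trait TTokenize extends Transform {
  override def transform(in: Frame): Frame = {
    val prev = super.transform(in)
    prev.withColumn(prev.last + "_tokens")
  }
}

trait THashing extends Transform {
  override def transform(in: Frame): Frame = {
    val prev = super.transform(in)
    prev.withColumn(prev.last + "_hashed")
  }
}

trait TIDF extends Transform {
  override def transform(in: Frame): Frame = {
    val prev = super.transform(in)
    prev.withColumn(prev.last + "_idf")
  }
}

object StackingDemo {
  def main(args: Array[String]): Unit = {
    val transform = new Transform with TTokenize with THashing with TIDF
    val out = transform.transform(Frame(Vector("text")))
    // prints: text -> text_tokens -> text_tokens_hashed -> text_tokens_hashed_idf
    println(out.columns.mkString(" -> "))
  }
}
```

Because every step only looks at the last column of the frame it receives, removing or reordering a trait in the `with` chain needs no other changes.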

See source code

See usage code

To extend this class with further transformations, you can check out Source code extension
