iemejia / spark-playground

README

Tests whether double aggregations are allowed in Spark Structured Streaming. We define a Dataset where each element is a tuple of 3 elements: (key, window, value).

  1. We group by key and window and sum the values, producing (key, window, sum_per_key_and_window).
  2. We group the result of (1) by key and count the number of elements per key, producing (key, num_windows).

The watermark definition is present in the code but disabled, to check whether multiple aggregations can be done without it.

How to run

In a terminal run:

nc -lk 9999

Then run the com.talend.beam.SparkStructuredStreamingDoubleAggregation class, which connects to the socket. You can do this via Maven by running:

mvn exec:java

Then start sending data into the socket by pasting data into the terminal, for example:

1 10 1
1 20 2
2 10 1
2 10 1
2 30 2
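
For reference, the result the two aggregations are meant to produce on the sample input above can be sketched with plain Java collections. This is only an illustration of the intended semantics, not the Spark code in this repository: the class and method names are hypothetical, and it assumes each line is "key window value" separated by whitespace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the repository: plain-Java version of the two steps.
public class DoubleAggregationSketch {

    // Step 1: group by (key, window) and sum the values.
    // The map key is "key,window"; assumes lines of the form "key window value".
    static Map<String, Long> sumPerKeyAndWindow(List<String> lines) {
        return lines.stream()
                .map(line -> line.trim().split("\\s+"))
                .collect(Collectors.groupingBy(
                        fields -> fields[0] + "," + fields[1],
                        TreeMap::new,
                        Collectors.summingLong(fields -> Long.parseLong(fields[2]))));
    }

    // Step 2: group the result of step 1 by key and count the entries (windows) per key.
    static Map<String, Long> windowsPerKey(Map<String, Long> sums) {
        return sums.keySet().stream()
                .collect(Collectors.groupingBy(
                        keyAndWindow -> keyAndWindow.substring(0, keyAndWindow.indexOf(',')),
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1 10 1", "1 20 2", "2 10 1", "2 10 1", "2 30 2");
        Map<String, Long> sums = sumPerKeyAndWindow(lines);
        System.out.println(sums);                // (key, window) -> sum_per_key_and_window
        System.out.println(windowsPerKey(sums)); // key -> num_windows
    }
}
```

On the sample input, step 1 yields four (key, window) groups, with the two "2 10 1" lines summed to 2, and step 2 yields two windows for each of the keys 1 and 2.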

Conclusion

With or without the watermark definition, the Dataset operation fails:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Multiple streaming aggregations are not supported with streaming DataFrames/Datasets;;
