README

Tests whether double (chained) aggregations are allowed in Spark Structured Streaming. We define a Dataset in which each element is a tuple of three elements: (key, window, value).

1. We group by key and window and sum the values, yielding (key, window, sum_per_key_and_window).
2. We group the result of (1) by key and count the number of elements per key, yielding (key, num_windows).
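
The two grouping steps above can be illustrated with a minimal plain-Java sketch. It runs as a batch computation over an in-memory list using java.util.stream, not the Spark API, so it only shows the intended result of the double aggregation and does not reproduce the streaming restriction. The class name DoubleAggregationSketch, the Event type, and the sample values are hypothetical and not taken from the project.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class DoubleAggregationSketch {

    // Hypothetical element type mirroring the (key, window, value) tuple.
    static final class Event {
        final String key;
        final long window;
        final long value;

        Event(String key, long window, long value) {
            this.key = key;
            this.window = window;
            this.value = value;
        }
    }

    // Step 1: group by key and window, summing the values.
    // Result shape: key -> (window -> sum_per_key_and_window).
    static Map<String, Map<Long, Long>> sumPerKeyAndWindow(List<Event> events) {
        return events.stream().collect(
            Collectors.groupingBy((Event e) -> e.key,
                Collectors.groupingBy((Event e) -> e.window,
                    Collectors.summingLong((Event e) -> e.value))));
    }

    // Step 2: group the result of step 1 by key and count its entries,
    // i.e. the number of windows seen per key (key -> num_windows).
    static Map<String, Long> windowsPerKey(Map<String, Map<Long, Long>> sums) {
        Map<String, Long> counts = new HashMap<>();
        for (Map.Entry<String, Map<Long, Long>> entry : sums.entrySet()) {
            counts.put(entry.getKey(), (long) entry.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        List<Event> events = Arrays.asList(
            new Event("a", 1L, 10L),
            new Event("a", 1L, 5L),
            new Event("a", 2L, 7L),
            new Event("b", 1L, 3L));

        Map<String, Map<Long, Long>> sums = sumPerKeyAndWindow(events);
        System.out.println("sum per key and window: " + sums);
        System.out.println("windows per key: " + windowsPerKey(sums));
    }
}
```

With this sample data, key a ends up with two windows (sums 15 and 7) and key b with one window (sum 3). The second step consumes the complete output of the first, which is straightforward in a batch setting.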

The watermark definition is present in the code but disabled, so that we can check whether multiple aggregations work without it.

How to run

In a terminal run:

nc -lk 9999

Then run the com.talend.beam.SparkStructuredStreamingDoubleAggregation class, which connects to the socket. You can do this via Maven by running:

mvn exec:java

Then start sending data into the socket by pasting data into the terminal, for example:

Conclusion

With or without the watermark definition, the Dataset operation fails:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Multiple streaming aggregations are not supported with streaming DataFrames/Datasets;;
