README
Tests if double aggregations are allowed on Spark Structured Streaming.
We define a Dataset where each element is a tuple of 3 elements: (key, window, value).
1. We group by key and window and sum the values (key, window, sum_per_key_and_window).
2. We group the result of (1) by key and count the number of elements per key (key, num_windows).
The watermark definition is present but disabled, to check whether multiple aggregations are possible without it.
How to run
In a terminal run:
nc -lk 9999
Then run the com.talend.beam.SparkStructuredStreamingDoubleAggregation class, which will connect to the socket.
You can do this via Maven by running:
mvn exec:java
Then start sending data into the socket by pasting data into the terminal, for example:
1 10 1
1 20 2
2 10 1
2 10 1
2 30 2
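For reference, here is a rough plain-Java sketch (no Spark involved) of what the two aggregations would compute on the sample input above if it were processed as a simple batch. It assumes each line is ordered as (key, window, value), matching the tuple layout described earlier. The class and method names are hypothetical and are not part of this project.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, only meant to illustrate the expected results.
class DoubleAggregationSketch {

    // Aggregation 1: sum the values per (key, window).
    // Result layout: key -> (window -> sum_per_key_and_window).
    static Map<Integer, Map<Integer, Integer>> sumPerKeyAndWindow(List<String> lines) {
        Map<Integer, Map<Integer, Integer>> sums = new TreeMap<>();
        for (String line : lines) {
            // Assumed column order: key, window, value (space separated).
            String[] parts = line.trim().split("\\s+");
            int key = Integer.parseInt(parts[0]);
            int window = Integer.parseInt(parts[1]);
            int value = Integer.parseInt(parts[2]);
            sums.computeIfAbsent(key, k -> new TreeMap<>())
                .merge(window, value, Integer::sum);
        }
        return sums;
    }

    // Aggregation 2: count the rows produced by aggregation 1 per key.
    // Result layout: key -> num_windows.
    static Map<Integer, Integer> windowsPerKey(Map<Integer, Map<Integer, Integer>> sums) {
        Map<Integer, Integer> counts = new TreeMap<>();
        sums.forEach((key, perWindow) -> counts.put(key, perWindow.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1 10 1", "1 20 2", "2 10 1", "2 10 1", "2 30 2");
        Map<Integer, Map<Integer, Integer>> sums = sumPerKeyAndWindow(input);
        System.out.println(sums);                // {1={10=1, 20=2}, 2={10=2, 30=2}}
        System.out.println(windowsPerKey(sums)); // {1=2, 2=2}
    }
}
```

This only shows the intended semantics of the chained aggregations. The streaming query itself does not produce this output, as explained in the conclusion.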
Conclusion
With or without the watermark definition, the Dataset operation fails:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Multiple streaming aggregations are not supported with streaming DataFrames/Datasets;;