arthurdysart / ChargeTracker

Near real-time ingestion, analysis, and visualization of rechargeable battery data. For 2018 Insight data engineering fellowship.

Issue: Spark Streaming does not consume from Kafka

arthurdysart opened this issue

commented

Running the PySpark Streaming script does not produce or print any intermediate output for the micro-batch RDD evaluations.

PySpark reports the following errors: "TypeError: can't pickle generator objects" and "ERROR PythonDStream$$anon$1:91 - Cannot connect to Python process. It's probably dead. Stopping StreamingContext."

The error points to the function "save_to_database()" in "cycle_step_analysis.py". But which generator object is being pickled?

commented

Solved: in the PySpark Streaming script, the lambda that produces "parsed_rdd" must wrap its result in an explicit tuple(); otherwise the result is a lazily evaluated generator object, which cannot be pickled.

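For example, see the parenthesized ("tuple") comprehension used together with the function "sum()": without an explicit tuple(), Python treats it as a generator expression rather than a tuple.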
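
The fix can be reproduced without Spark, since the failure comes from Python's pickle module. Below is a minimal sketch, not the actual code from "cycle_step_analysis.py": the helper names parse_lazy, parse_eager, and can_pickle and the sample fields are hypothetical, chosen only to show why a parenthesized comprehension fails to serialize while an explicit tuple() succeeds.

```python
import pickle


def parse_lazy(fields):
    # Hypothetical parser: parentheses around a comprehension create a
    # generator expression, not a tuple, so nothing is evaluated yet.
    return (float(x) for x in fields)


def parse_eager(fields):
    # Wrapping the comprehension in tuple() forces evaluation and yields
    # a plain tuple, which pickle can serialize.
    return tuple(float(x) for x in fields)


def can_pickle(obj):
    # pickle raises TypeError for generator objects.
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False


sample = ["1.0", "2.5"]  # made-up field values for illustration
lazy = parse_lazy(sample)
eager = parse_eager(sample)

print(can_pickle(lazy))   # False
print(can_pickle(eager))  # True

# Passing a generator expression directly to sum() is fine, because sum()
# consumes it immediately and returns a plain number.
total = sum(x for x in eager)
print(total)  # 3.5
```

In a PySpark job, the same idea would apply inside the lambda passed to map(): any record that Spark has to serialize between stages should be a concrete tuple or list rather than a generator.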