benjaminbluhm / spark_parallel_forecasting

The repository contains the source code and dataset to reproduce the parallel computing exercise described in the paper:

Time Series Econometrics at Scale - A Practical Guide to Parallel Computing in (Py)Spark

Abstract

This paper provides a practical programming guide to setting up a minimum working example of a distributed system for parallel time series analysis. The system is built in Apache Spark on top of Amazon's Hadoop-based service Elastic MapReduce (EMR). A simple forecasting exercise with 1,000 time series illustrates the proposed parallelization scheme, which reduces total runtime by about 95% relative to a single-core, single-machine setting. The ease of implementing this scheme makes this guide a useful reference for econometricians with a limited background in parallel programming. To facilitate reproducibility of the practical steps in this guide, the PySpark/Python code is available for download on GitHub.
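The scheme works because each of the 1,000 series can be forecast independently, so the workload is embarrassingly parallel: Spark simply distributes the per-series forecasting function across executor cores. As a minimal local sketch of this idea (using Python's standard-library thread pool as a stand-in for Spark executors; the function names and the naive last-3-observation forecast are illustrative, not the paper's actual model):

```python
from concurrent.futures import ThreadPoolExecutor


def forecast_one(item):
    """Forecast a single series; in Spark, this is the function
    shipped to each executor. Here: mean of the last 3 observations
    (a placeholder for a real econometric model)."""
    series_id, values = item
    return series_id, sum(values[-3:]) / 3


def forecast_all(series, max_workers=4):
    """Apply forecast_one to every series in parallel.
    Because series are independent, no coordination is needed --
    the same property the paper's Spark setup exploits at scale."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(forecast_one, series.items()))


if __name__ == "__main__":
    # 5 toy series instead of the paper's 1,000
    data = {f"series_{i}": list(range(10)) for i in range(5)}
    print(forecast_all(data))
```

In the actual PySpark setting, the same split-apply-combine pattern would run over a DataFrame grouped by series identifier, with Spark handling the distribution of groups across EMR worker nodes.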

Link to the paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3226976

Languages

- TeX 86.0%
- Python 12.8%
- R 0.9%
- Shell 0.3%