prestodb / presto

The official home of the Presto distributed SQL query engine for big data

Home Page: http://prestodb.io

Fault Tolerance for Presto Clusters on long running queries

voycey opened this issue · comments

We have recently moved over to using Hadoop with Presto and we are very impressed by the speed of geospatial joins and queries. We query a lot of data and often have to run long-running jobs that process and join billions of rows. Presto is very efficient at this until a node fails, which currently causes the query to fail.
I was wondering if there are any plans to implement some kind of fault tolerance within Presto so that these queries either don't fail or can pick up where they left off?

(Or, if anyone has any pointers on how we can achieve something similar, I would be interested in hearing them. So far we have explored batch processing, query optimisation and custom partitioning as ways to either reduce the query time or restart failed queries.)
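
To illustrate the batch-and-restart approach mentioned above, here is a minimal sketch in Python. It is not part of Presto: `run_in_batches`, `run_batch` and the date-style batch keys are hypothetical names chosen for illustration, and `run_batch` stands in for whatever client call submits one batch's query. The idea is that when a node failure kills a query, only that batch is re-run instead of the whole job.

```python
import time
from typing import Callable, Dict, Iterable


def run_in_batches(
    batch_keys: Iterable[str],
    run_batch: Callable[[str], None],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> Dict[str, int]:
    """Run one query per batch key, retrying only the batches that fail.

    run_batch is a hypothetical callable that submits the query for a
    single batch (for example, one partition) and raises on failure.
    Returns a mapping of batch key -> number of attempts used.
    """
    attempts: Dict[str, int] = {}
    for key in batch_keys:
        for attempt in range(1, max_attempts + 1):
            try:
                run_batch(key)
                attempts[key] = attempt
                break  # this batch succeeded, move on to the next one
            except Exception as exc:
                if attempt == max_attempts:
                    raise RuntimeError(
                        f"batch {key!r} failed after {attempt} attempts"
                    ) from exc
                # Simple linear backoff before retrying the same batch.
                time.sleep(backoff_seconds * attempt)
    return attempts
```

This only works safely if each batch query is idempotent, for example if it overwrites its own output partition, so that a retried batch does not duplicate rows.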

Thanks

Hi @voycey

Support for fault tolerance is on the community roadmap for the near future. It would be achieved via a combination of failure recovery, temporary tables, multi-stage and bucket-by-bucket execution.
@martint talked about it in his presentation at Presto Summit: https://www.slideshare.net/kbajda/presto-summit-2018-01-facebook-presto/
For a recap of Presto Summit, see: https://www.starburstdata.com/technical-blog/presto-summit-2018-recap/
There were other very interesting Presto-related presentations as well.

Let me close this issue in favor of #9855.

Thanks all - it's great that this is on the roadmap for the near future!