Fault Tolerance for Presto Clusters on long running queries

Question

voycey opened this issue 6 years ago

We have recently moved over to using Hadoop with Presto and we are very impressed with the speed of geospatial joins and queries. We query a lot of data and often have to run long-running jobs to process and join billions of rows. Presto is very efficient at this until a node fails, which currently causes the whole query to fail.
I was wondering if there are any plans to implement some kind of fault tolerance within Presto, so that these queries either don't fail or can pick up where they left off?

(If anyone has pointers on how we can achieve something similar, I would be interested in hearing them. So far we have explored batch processing, query optimisation and custom partitioning as ways to either reduce the query time or restart failed queries.)
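To make the batch-and-restart idea concrete, here is a minimal client-side sketch. It is not a Presto feature. It assumes the long job can be split into independent batches (for example one per partition) and that re-running a batch is safe. The names `run_batches_with_checkpoint`, `run_batch` and `checkpoint_path` are hypothetical; `run_batch` stands in for whatever code submits one batch query to the cluster.

```python
import json
import os
from typing import Callable, Iterable, List


def run_batches_with_checkpoint(
    batch_ids: Iterable[str],
    run_batch: Callable[[str], None],  # hypothetical: submits one batch query
    checkpoint_path: str,              # hypothetical: local file listing finished batches
    max_attempts: int = 3,
) -> List[str]:
    """Run each batch, retrying on failure and skipping batches already done."""
    done = set()
    if os.path.exists(checkpoint_path):
        # Resume: load the batches that finished in an earlier run.
        with open(checkpoint_path) as f:
            done = set(json.load(f))

    for batch_id in batch_ids:
        if batch_id in done:
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                run_batch(batch_id)
                break
            except Exception:
                # Give up on this run once the retry budget is spent;
                # the checkpoint still records everything finished so far.
                if attempt == max_attempts:
                    raise
        done.add(batch_id)
        # Persist progress after every batch so a later run can resume here.
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(done), f)

    return sorted(done)
```

With this shape, a node failure costs at most one batch of work rather than the whole job, provided each batch writes its output in a way that can be safely overwritten on retry.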

Thanks

Karol Sobczak · Answer 1 · Fri Aug 10 2018 16:37:10 GMT+0800 (China Standard Time)

Hi @voycey

Support for fault tolerance is on the community roadmap for the near future. It would be achieved via a combination of failure recovery, temporary tables, and multi-stage and bucket-by-bucket execution.
@martint talked about it in his presentation at Presto Summit: https://www.slideshare.net/kbajda/presto-summit-2018-01-facebook-presto/
For a recap of Presto Summit, you can visit: https://www.starburstdata.com/technical-blog/presto-summit-2018-recap/
There were other very interesting Presto-related presentations as well.

Piotr Findeisen · Answer 2 · Sat Aug 11 2018 23:39:02 GMT+0800 (China Standard Time)

Let me close this issue in favor of #9855.

Dan Voyce · Answer 3 · Mon Aug 13 2018 08:01:43 GMT+0800 (China Standard Time)

Thanks all - it's great that this is on the roadmap for the near future!