prestodb / presto

The official home of the Presto distributed SQL query engine for big data

Home Page: http://prestodb.io

[Design] Presto-on-Spark: A Tale of Two Computation Engines

wenleix opened this issue

Links/Resources:

Abstract

The architecture tradeoff between MapReduce and parallel databases has been an open discussion since the dawn of MapReduce systems over a decade ago. At Facebook, we have spent the past several years scaling Presto to Facebook-scale batch workloads.

Presto Unlimited aims at solving such scalability challenges. After revisiting the key architecture changes (e.g. disaggregated shuffle) required to further scale Presto, we decided on Presto-on-Spark as the path to further scale Presto. See the rest of the design doc for details.

We believe this is only a first step towards more confluence between the Spark and Presto communities, and a major step towards enabling a unified SQL experience across interactive and batch use cases.

Introduction

Presto was originally designed for interactive queries but has evolved into a unified engine for both interactive and batch use cases. Scaling an MPP architecture database to batch data processing over Internet-scale datasets is known to be an extremely difficult problem [1].

Presto Unlimited aims at solving such scalability challenges. To truly scale Presto Unlimited to Internet-scale batch workloads we need the following (excluding coordinator scaling and spilling):

  1. Scale shuffle. This requires either implementing a MapReduce-style shuffle or integrating with a disaggregated shuffle service such as Cosco.
  2. Scale Presto worker execution. This includes resource isolation, straggler detection, speculative execution, etc.
  3. Scale Presto resource management. Fine-grained resource management is required when a single query can take years of CPU. Such a concept is known as mapper/reducer in MapReduce, executor in Spark, and lifespan in Presto, similar to YARN/Mesos.

We realized that this work lays down the foundation for a general-purpose parallel data processing system, such as Spark, FlumeJava, or Dryad. Note that such a data processing system has its own usage and well-defined programming abstraction, and requires years to mature.

We concluded that Presto should leverage existing, well-developed systems to scale to large batch workloads, instead of “embedding” such a system inside Presto. We also believe such collaboration would help the whole Big Data community better understand the abstraction between a SQL engine and a data processing system, as well as evolve and refine the execution primitives to provide near-optimal performance without sacrificing the abstractions.

We chose to leverage Spark as the parallel data processing system to further scale Presto Unlimited, as it is the most widely used open source system in this category. However, the design and architecture here should apply to any other parallel data processing system as well.

Architecture

[Architecture diagram (screenshot)]

  1. The Presto planner needs to know it is generating a plan for Spark execution, so it can remove unnecessary nodes (e.g. LocalExchange).

  2. On the Spark worker, execution includes (see the sketch after this list):
    Constructing the operator factory chain (a.k.a. DriverFactory) through LocalExecutionPlanner
    Instantiating the driver by binding the input split, and running the driver

  3. Sending the output data to a SparkOutputBuffer, which emits it to Spark.
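
A minimal sketch of how steps 2 and 3 can map onto Spark's RDD API. The DriverFactory/driver/output-buffer interfaces below are simplified stand-ins for the real presto-spark classes (their actual signatures differ); only JavaRDD.mapPartitions is the real Spark API:

  import java.io.Serializable;
  import java.util.Iterator;

  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.function.FlatMapFunction;

  // Simplified stand-ins for the Presto classes described above.
  interface SerializedPrestoSplit extends Serializable {}

  interface OutputCollector {
      void emit(byte[] serializedPage);              // role played by SparkOutputBuffer
      Iterator<byte[]> drain();
  }

  interface PrestoDriver {
      void bindSplit(SerializedPrestoSplit split);   // step 2: bind the input split
      void run();                                    // run the operator chain to completion
  }

  interface DriverFactory extends Serializable {
      OutputCollector createOutputBuffer();
      PrestoDriver createDriver(OutputCollector output);
  }

  final class PrestoOnSparkWorkerSketch {
      // Turn an RDD of Presto splits into an RDD of serialized output pages by
      // running one Presto driver per split inside each Spark task.
      static JavaRDD<byte[]> runFragment(JavaRDD<SerializedPrestoSplit> splits, DriverFactory driverFactory) {
          FlatMapFunction<Iterator<SerializedPrestoSplit>, byte[]> processPartition = splitIterator -> {
              OutputCollector output = driverFactory.createOutputBuffer();
              while (splitIterator.hasNext()) {
                  PrestoDriver driver = driverFactory.createDriver(output);
                  driver.bindSplit(splitIterator.next());
                  driver.run();
              }
              // Hand the buffered pages back to Spark (step 3).
              return output.drain();
          };
          return splits.mapPartitions(processPartition);
      }
  }

Running one driver per split inside mapPartitions keeps the Presto operator pipeline intact while letting Spark own scheduling, shuffle, and fault tolerance.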

Excellent job! A unified entry point for batch data processing and ad-hoc queries is very important for users. Spark, Hive, Flink, MySQL, Elasticsearch, MongoDB, and so on: some are for computation and others store data, but users can connect to them all through Presto!

TODOs (for tracking purposes, keep updating):

  • Memory related config refactor: #13760 (comment)
  • Refactor Map<String, Iterator<Tuple2<Integer, byte[]>>> -- basically a map from PlanNodeId to a Scala tuple as reducer inputs: #13760 (comment)
  • I understand I originally suggested the package name presto-spark-classloader-interface. But revisiting it now, maybe presto-spark-common is better, since there are also some common classes shared between presto-spark and presto-spark-launcher.
  • Reuse serialized byte array in OutputBuffer: #13760 (comment)
  • Refactor SparkRddFactory to be closer to SqlQueryScheduler#createStageExecutions: #13760 (comment)

presto-spark-classloader-interface

As per earlier discussion, we decided to go with this name explicitly to emphasize that this module is only needed for classloader isolation, and not for anything fundamental. Once Spark supports classloader isolation internally (or once it migrates to Java 9+, which supports Java modules), this artificial module should be removed.
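
For context, here is a generic sketch of the classloader-isolation pattern this module enables, assuming an illustrative shared interface and package name (this is not the actual presto-spark-classloader-interface API):

  package com.example.interfaceonly;   // placeholder package for the shared interface types

  import java.net.URL;
  import java.net.URLClassLoader;

  public final class IsolationSketch {
      // The only type shared between the Spark side and the Presto side. It must be
      // loaded by the parent (application) classloader so both sides see the same Class.
      public interface QueryRunner {
          void run(String query);
      }

      // Child-first classloader: Presto and its (possibly conflicting) dependencies
      // resolve inside this loader; only the shared interface package is delegated
      // to the parent.
      static final class IsolatedClassLoader extends URLClassLoader {
          IsolatedClassLoader(URL[] prestoJars, ClassLoader parent) {
              super(prestoJars, parent);
          }

          @Override
          protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
              synchronized (getClassLoadingLock(name)) {
                  Class<?> clazz = findLoadedClass(name);
                  if (clazz == null) {
                      if (name.startsWith("com.example.interfaceonly.")) {
                          clazz = super.loadClass(name, false);   // shared types come from the parent
                      }
                      else {
                          try {
                              clazz = findClass(name);            // child-first for everything else
                          }
                          catch (ClassNotFoundException e) {
                              clazz = super.loadClass(name, false);
                          }
                      }
                  }
                  if (resolve) {
                      resolveClass(clazz);
                  }
                  return clazz;
              }
          }
      }

      // Load the Presto implementation inside the isolated loader, but use it only
      // through the shared QueryRunner interface.
      public static QueryRunner loadIsolatedRunner(URL[] prestoJars, String implClassName) throws Exception {
          ClassLoader isolated = new IsolatedClassLoader(prestoJars, IsolationSketch.class.getClassLoader());
          return (QueryRunner) isolated.loadClass(implClassName).getDeclaredConstructor().newInstance();
      }
  }

The key point is that only the interface types are visible to both classloaders, so Presto's dependency versions can never clash with Spark's.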

@arhimondr :

I see. But I do think we might also want to put some common classes into the classloader-interface package; I think TaskProcessors is already there. See for example
#13760 (comment): always using serialized byte arrays makes the code more difficult to understand.

What's the difference between doing this and SparkSQL?

@wubiaoi : From a user experience perspective, Presto-on-Spark will provide exactly the same language and semantics for interactive and batch. While both Presto and SparkSQL are ANSI-SQL compatible, note that there is no “ANSI SQL” as a language: ANSI SQL is a (somewhat loose) specification. Many SQL dialects claim to be ANSI SQL compatible (notably Oracle, SQL Server, and DB2), yet they are significantly incompatible with each other.

As explained in more detail in this Quora answer:

ANSI SQL is a specification, not a particular product. It's a document, describing the official features of the SQL language.

Every brand of SQL RDBMS implements a subset of ANSI SQL. Every brand of SQL RDBMS I'm aware of adds some features to the language that are not in the ANSI SQL specification (example: indexes). And each brand implements features in its own way, not necessarily compatible with the others.

Even when the language and semantics are exactly the same, Presto-on-Spark provides a unified SQL experience for interactive and batch use cases. A unified SQL experience means not only that the SQL language and semantics are the same, but that the overall experience is also similar. This is because, while SQL was originally designed as a declarative language, in almost all practice users depend on engine-specific implementation details and use it as an imperative language in places to get the best performance. The SQL experience includes, but is not limited to:

  • Semantics (e.g. NULL handling)
  • Subtle behavior (e.g. the maximum array/map size that can be handled, emitting NULL vs. throwing an exception)
  • Language hints
  • UDF experience
  • How the plan will be optimized
  • How the SQL will be executed (e.g. performance implications of different ways to write SQL, such as UNNEST vs. lambda)

I will explain the technical perspective in a separate comment :)

@wubiaoi : From a technical perspective, the SparkSQL execution model is row-oriented + whole-stage codegen [1], while the Presto execution model is columnar processing + vectorization. So, architecture-wise, Presto-on-Spark will be more similar to the early research prototype Shark [2].

The design trade-offs between row-oriented + whole-stage codegen and columnar processing + vectorization deserve a very long discussion; I will let @oerling provide more insights :) . However, with modern Big Data, where denormalization is omnipresent, we see ever-increasing value in columnar processing + vectorization [3].

[1] Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
[2] Shark: SQL and Rich Analytics at Scale: https://cs.stanford.edu/~matei/papers/2013/sigmod_shark.pdf
[3] Everything You Always Wanted To Do in Table Scan: https://prestodb.io/blog/2019/06/29/everything-you-always-wanted-to-do-in-a-table-scan
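
To make the contrast concrete, here is a toy, engine-agnostic sketch (neither Presto nor SparkSQL code) of a filter-and-sum over a BIGINT column evaluated row-at-a-time versus column-at-a-time:

  final class ExecutionStyles {
      // Row-oriented: each row flows through the operator logic one at a time,
      // which whole-stage codegen compiles into a single tight loop over rows.
      static long rowAtATime(long[][] rows, int column) {
          long sum = 0;
          for (long[] row : rows) {
              long value = row[column];
              if (value > 0) {        // filter
                  sum += value;       // aggregate
              }
          }
          return sum;
      }

      // Columnar + vectorized: operators work on a whole column (a block/batch)
      // at a time; tight per-column loops are friendly to CPU caches and SIMD.
      static long vectorized(long[] columnValues) {
          boolean[] selected = new boolean[columnValues.length];
          for (int i = 0; i < columnValues.length; i++) {   // filter pass over the column
              selected[i] = columnValues[i] > 0;
          }
          long sum = 0;
          for (int i = 0; i < columnValues.length; i++) {   // aggregation pass
              if (selected[i]) {
                  sum += columnValues[i];
              }
          }
          return sum;
      }
  }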

@wenleix 👍 Thank you very much for the explanation.
Is it better to run only the costly stages on Spark?

@wubiaoi : While this is certainly possible, it complicates the execution a lot, as it requires coordination between two (heterogeneous) execution engines. Also, why not use Presto Unlimited in this case? :)

@wenleix Hi, is there a doc/markdown on how to run this without the command line? I see PrestoSparkRunner.run prints results; is there a way to run a query and get the results back?

(Using a Spark cluster)

@KannarFr Currently it is not exposed. But it is supported by https://github.com/prestodb/presto/blob/master/presto-spark-launcher/src/main/java/com/facebook/presto/spark/launcher/PrestoSparkRunner.java#L81, which allows you to specify an output file where the query results are stored in JSON format (queryDataOutputPath). We should expose this parameter as a CLI argument.

@arhimondr Ok, can I help work on a streaming way to output this as HTTP chunks? WDYT?

@KannarFr Could you please elaborate a little bit more on that?

Currently, the implementation outputs results to files. I wonder if we can expose results as a stream.

After checking the code, it seems that we need the query to finish and accumulate the results in RAM before writing them to a file, correct? So I would like to know whether we could directly expose results as a stream as we get them.

@KannarFr Usually the technique is to pipe the results to system out. We just have to make sure the system out is not spoiled with any logging messages.

In a CLI context, I understand this POV @arhimondr.

If we want to extend the usage and provide a service that receives queries and answers over HTTP (for example), we can't do that in a clean way yet.

So, if you agree, I would like to implement this behavior.

Could you please elaborate more on how you are going to implement this feature? (If you think it is easy to implement, a prototype will also do.)

@arhimondr Let's say Presto receives an HTTP query; we run this query and return the results as HTTP chunks. We need a way to stream results. This will reduce RAM usage, as each resulting part would be sent as an HTTP chunk.

I wrote https://github.com/CleverCloud/presto/pull/1/files, but I'm not familiar with Java streams. I think it waits for all parts to be computed before returning the stream :/.

Is it more clear? :)

Let's say Presto receives an HTTP query; we run this query and return the results as HTTP chunks. We need a way to stream results. This will reduce RAM usage, as each resulting part would be sent as an HTTP chunk.

In Presto on Spark there's no HTTP server in the picture. The query is submitted via spark-submit, which is a CLI tool.

I wrote https://github.com/CleverCloud/presto/pull/1/files, but I'm not familiar with Java streams. I think it waits for all parts to be computed before returning the stream :/.

Left a comment: https://github.com/CleverCloud/presto/pull/1/files#r494322290

As I'm looking for an HTTP service instead of spark-submit, I can work on it. But now you get what I want to do, right? WDYT about it?

Classic Presto acts as a service, with an HTTP endpoint to fetch the results. Are you hitting a scalability wall with classic Presto?

As I'm looking for an HTTP service instead of spark-submit, I can work on it. But now you get what I want to do, right? WDYT about it?

I'm not sure whether Spark even supports gradual fetching of the results. You can investigate it. But currently we are collecting results via the collect call, which returns all the results at once.
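
For reference, a sketch of the difference using only standard Spark Java RDD calls; the incremental variant (toLocalIterator) is exactly the kind of thing to investigate, not something the current presto-spark code does:

  import java.util.Iterator;
  import java.util.List;

  import org.apache.spark.api.java.JavaRDD;

  final class ResultFetchSketch {
      static void fetchAllAtOnce(JavaRDD<byte[]> serializedResults) {
          // collect() pulls every partition into the driver's memory at once.
          List<byte[]> all = serializedResults.collect();
          all.forEach(ResultFetchSketch::writeChunk);
      }

      static void fetchIncrementally(JavaRDD<byte[]> serializedResults) {
          // toLocalIterator() fetches one partition at a time, so the driver only
          // holds a single partition's results in memory while streaming them out.
          Iterator<byte[]> it = serializedResults.toLocalIterator();
          while (it.hasNext()) {
              writeChunk(it.next());
          }
      }

      private static void writeChunk(byte[] chunk) {
          // e.g. write to stdout or an HTTP chunked response
      }
  }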

As a middle ground, you can change your workload slightly:

  1. Run INSERT INTO tmp_table .... in Presto on Spark, which will write the results into a temporary table
  2. Run SELECT * FROM tmp_table in classic Presto to fetch the results

Generally speaking, Presto on Spark is mostly designed to run INSERT queries; that's why we don't care much about returning the results.

Presto on Spark allows changing the catalog for each query by creating a Presto runner for each query, correct? Classic Presto does not support loading/unloading catalogs: #12605.

My main goal is to provide context (a Presto catalog) for each query in classic Presto. But in fact we need to support a very high scale, and I found this project, which seems to match my requirements.

Could you please describe your use case a little bit more? Maybe there's a better way to achieve this dynamic catalog behaviour?

Consider millions of catalogs of different types (MySQL, PostgreSQL, ...). Thousands of clients. So a lot of queries.

A client comes to an HTTP service with its catalog and query.

This service sends the catalog list and the query to run to Presto/Presto-on-Spark (let's call it the system).

Then the system should run the query and stream the results back to the client through HTTP chunks, to limit RAM usage if possible.

This is the use case; it seems simple, but its implementation is not.

@KannarFr : From an operations/service perspective, Presto-on-Spark is more like Spark. Thus, in my opinion, we should leverage what Spark provides for such a service (instead of thinking about it in the Presto coordinator way).

@arhimondr @wenleix Is it possible to run multiple SQL queries in the query file?

@djiangc Unfortunately no. But that should be an easy feature to add.

@arhimondr @wenleix Another question: it seems I can't use cluster deploy mode with spark-submit for presto-spark-launcher; only client mode is supported. Is this true, or am I missing something?

@djiangc Yes, currently only the client mode is supported.

@arhimondr Thanks for your response. I have another question: can I do an insert overwrite?
set session hive.insert_existing_partitions_behavior='OVERWRITE';
insert into test3 select *, 2 from test

@djiangc Currently the launcher doesn't support setting session properties. You must enable the OVERWRITE behaviour with a configuration property: https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java#L561

Also, it should be pretty easy to add a parameter to the launcher that accepts session properties.
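
For example (an assumption based on the linked HiveClientConfig definition, not verified here), that would mean putting a line like hive.insert-existing-partitions-behavior=OVERWRITE into the Hive catalog properties file used by the launcher.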

Many thanks for your help and the pointer. I got the partition overwrite working with presto-spark-launcher @arhimondr

@arhimondr I am not able to run an insert in overwrite mode by setting the above property. Is it not supported with S3?
@djiangc Are you using S3 or HDFS?
Getting the below exception:
java.lang.IllegalStateException: Overwriting existing partition doesn't support DIRECT_TO_TARGET_EXISTING_DIRECTORY write mode

Hi,

@arhimondr I understand the philosophy behind this sentence

Generally speaking, Presto on Spark is mostly designed to run INSERT queries

but the insert needs a predefined destination table with a schema, format, and location, right?

As an AWS user, what I would find very useful is to write the result of a Presto-on-Spark SELECT to an S3 location and run a Glue crawler on that location to have the table and the inferred schema automatically created.

Maybe a CLI argument configuring the dataOutputLocation would do the trick?

@rguillome Hi! Thanks for reaching out.

In our case we know the output schema in advance, thus we always end up running INSERT INTO ... on an existing table. If the schema is unknown in your use case, did you consider running CREATE TABLE AS SELECT ... to create a temporary table with a well-defined schema?

Hi @arhimondr

I was trying to CREATE TABLE AS SELECT ... with an external location but encountered this line in the presto-hive HiveMetadata.beginCreateTable method:

  if (getExternalLocation(tableMetadata.getProperties()) != null) {
      throw new PrestoException(NOT_SUPPORTED, "External tables cannot be created using CREATE TABLE AS");
  }

So basically I will try to push an MR with those changes, which have already been made in trinodb.

I wonder if the ultimate solution shouldn't be an option to write each final split directly to an HDFS or S3 location, to avoid the gathering at the driver level. We could imagine having all the benefits of Hadoop FS organization (partitioning, bucketing, sorting, and splits). But I'm not yet comfortable with all the details it would involve, so I won't dig into this for now.

Is DynamicFilter (vs. DynamicPartitionPrune) applied to Presto-on-Spark's physical plan?

@wubiaoi : From a technical perspective, the SparkSQL execution model is row-oriented + whole-stage codegen [1], while the Presto execution model is columnar processing + vectorization. So, architecture-wise, Presto-on-Spark will be more similar to the early research prototype Shark [2].

The design trade-offs between row-oriented + whole-stage codegen and columnar processing + vectorization deserve a very long discussion; I will let @oerling provide more insights :) . However, with modern Big Data, where denormalization is omnipresent, we see ever-increasing value in columnar processing + vectorization [3].

[1] Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html [2] Shark: SQL and Rich Analytics at Scale: https://cs.stanford.edu/~matei/papers/2013/sigmod_shark.pdf [3] Everything You Always Wanted To Do in Table Scan: https://prestodb.io/blog/2019/06/29/everything-you-always-wanted-to-do-in-a-table-scan

The SparkSQL 3.0+ execution model is also columnar processing + vectorization.

Is DynamicFilter (vs. DynamicPartitionPrune) applied to Presto-on-Spark's physical plan?

It is!


@wenleix Hello, I have a question. Although compatibility is increased, isn't query speed slowed down for queries over small amounts of data after adding materialized shuffle? At the same time, I would like to ask how the improved Presto and SparkSQL compare on large amounts of data.

@wenleix Hello, I have a question. Although compatibility is increased, isn't query speed slowed down for queries over small amounts of data after adding materialized shuffle? At the same time, I would like to ask how the improved Presto and SparkSQL compare on large amounts of data.

The idea is to run small queries on classic Presto, and run large queries (that won't fit within the memory limit) / long-running queries (more likely to be affected by cluster stability issues) using Presto-on-Spark.

@rongrong
One more question: how does Presto-on-Spark deal with the large amount of data transfer when executing large queries?
As I understand it, data is transported by a broadcast mechanism. Will all of this moved data go through the Spark driver, which would be a single point coordinating all global data streams? Is there any bottleneck?


@rongrong Does this mean that if the user fails to execute a query through Presto and finds that the SQL is a large query, they then submit it through Presto on Spark? Does the user have a process for switching the submission method?
At first, I mistakenly thought that all Presto queries were submitted through Presto on Spark.

@rongrong Does this mean that if the user fails to execute a query through Presto and finds that the SQL is a large query, they then submit it through Presto on Spark? Does the user have a process for switching the submission method?
At first, I mistakenly thought that all Presto queries were submitted through Presto on Spark.

As far as I know, these are two totally separate processes; you must develop your own judging logic to decide whether it is a large query.
Presto-on-Spark is exactly a Spark process if you ignore Presto's code logic. Neither has anything to do with the other. @whutpencil
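
A purely hypothetical sketch of what such judging logic could look like on the submission side; the threshold, interfaces, and both submission paths below are assumptions for illustration, not existing Presto features:

  // Hypothetical submission-side routing between classic Presto and Presto-on-Spark.
  final class QueryRouter {
      private static final long LARGE_QUERY_CPU_DAYS_THRESHOLD = 1;   // assumed cutoff

      interface Submitter {
          void submit(String sql);
      }

      private final Submitter classicPresto;     // e.g. HTTP submission to the coordinator
      private final Submitter prestoOnSpark;     // e.g. spark-submit based submission

      QueryRouter(Submitter classicPresto, Submitter prestoOnSpark) {
          this.classicPresto = classicPresto;
          this.prestoOnSpark = prestoOnSpark;
      }

      void submit(String sql, long estimatedCpuDays) {
          // Small / interactive queries stay on classic Presto; large or
          // long-running queries go through Presto-on-Spark.
          if (estimatedCpuDays >= LARGE_QUERY_CPU_DAYS_THRESHOLD) {
              prestoOnSpark.submit(sql);
          }
          else {
              classicPresto.submit(sql);
          }
      }
  }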

Using the same SQL, does Presto-on-Spark use less memory but take more time?