The SeqDataSourceV2 package allows reading Hadoop Sequence Files from Spark SQL.
It's compatible only with Spark 2.4.
- The SeqDataSourceV2 automatically detects the schema, unlike the RDD API, which requires prior knowledge of the types.
- The SeqDataSourceV2 is 1.3x faster than the RDD API (see the benchmark in SeqDataSourceV2Benchmark).
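To illustrate the first point, here is a minimal sketch contrasting the two read paths, assuming a sequence file `data.seq` with `IntWritable` keys and `FloatWritable` values:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .getOrCreate()

// RDD API: the key/value types must be known in advance.
val rdd = spark.sparkContext.sequenceFile[Int, Float]("data.seq")

// SeqDataSourceV2: the schema is detected automatically.
val df = spark.read.format("seq").load("data.seq")
df.printSchema()
```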
The following table contains the type mapping and the types supported by this Data Source.
Some types support the vectorized read optimization (aka Arrow optimization).
| Spark Type | Vectorized Read Path | Hadoop Type |
|---|---|---|
| LongType | Supported | LongWritable |
| DoubleType | Supported | DoubleWritable |
| FloatType | Supported | FloatWritable |
| IntegerType | Supported | IntWritable |
| BooleanType | Supported | BooleanWritable |
| NullType | Not Supported | NullWritable |
| StringType | Not Supported | BytesWritable |
| StringType | Not Supported | Text |
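If you want to try the mapping yourself, here is a small sketch (the file path is illustrative) that writes a sequence file with the RDD API and reads it back through the data source:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()

// Keys are written as IntWritable and values as LongWritable.
spark.sparkContext
  .parallelize(Seq((1, 10L), (2, 20L)))
  .saveAsSequenceFile("/tmp/ints.seq")

// The data source maps them back to IntegerType and LongType.
spark.read.format("seq").load("/tmp/ints.seq").printSchema()
```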
N.B.:
- The vectorized read path is disabled by default. You can turn it on by setting `spark.sql.seq.enableVectorizedReader` to `true`:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local[1]")
  .config("spark.sql.seq.enableVectorizedReader", "true")
  .getOrCreate()
```
- If one column doesn't support the vectorized read path, the SeqDataSourceV2 will fall back to the normal read path. For example:
  - The schema (key: IntegerType, value: FloatType) supports the vectorized read path.
  - The schema (key: IntegerType, value: StringType) doesn't support the vectorized read path.
- It's possible to control the number of rows per batch in the vectorized read path with `spark.sql.seq.columnarReaderBatchSize`. By default, the size of the batch is 4096 rows (see the sketch after this list).
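For example, a sketch that enables the vectorized read path and lowers the batch size (1024 is an arbitrary value for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local[1]")
  .config("spark.sql.seq.enableVectorizedReader", "true")
  .config("spark.sql.seq.columnarReaderBatchSize", "1024") // default: 4096
  .getOrCreate()
```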
You need to download the latest release from the packages page and include it with spark-submit.
Example with spark-submit:
```
$ spark-submit --class Main --jars seq-datasource-v2-0.2.0.jar Example-SNAPSHOT.jar
```
Example with pyspark:
```
$ pyspark --jars seq-datasource-v2-0.2.0.jar
```
You can directly include the package with the `--packages` parameter; you can find the latest release in the Spark packages.
Example with spark-submit:
```
$ spark-submit --class Main --packages garawalid:seq-datasource-v2:0.2.0
```
You can include the SeqDataSourceV2 as a dependency with Maven; the latest release is on the packages page.
Example with Maven:
<dependency>
<groupId>org.gwalid</groupId>
<artifactId>seq-datasource-v2</artifactId>
<version>0.2.0</version>
</dependency>
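If you build with sbt, the equivalent coordinate should be the following (an assumption derived from the Maven coordinates above; verify against the packages page):

```scala
// sbt: assumed equivalent of the Maven dependency above
libraryDependencies += "org.gwalid" % "seq-datasource-v2" % "0.2.0"
```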
The SeqDataSourceV2 is compatible with all the APIs. Here are some examples with both the Scala and Python APIs.
Scala API
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .getOrCreate()

val df = spark.read.format("seq").load("data.seq")
df.show()
```
Python API
```python
df = spark.read.format("seq").load("data.seq")
df.printSchema()
```
Schema
It's possible to pass a schema to the DataFrame API. There are a few rules around the schema.
- The field names must be key and/or value. The name key will project the key field of the Seq file. The same goes for the value.
- The field type should match the type of the Seq file.
```scala
import org.apache.spark.sql.types.{IntegerType, LongType, StructType}

val schema = new StructType()
  .add("key", IntegerType, true)
  .add("value", LongType, true)

val df = spark.read.format("seq").schema(schema).load("path")
```
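Since the field names may be key and/or value, a schema with a single field projects just that column. A minimal sketch, reusing the session from the snippet above:

```scala
import org.apache.spark.sql.types.{LongType, StructType}

// Omitting the key field projects only the values of the Seq file.
val valueOnly = new StructType().add("value", LongType, true)
spark.read.format("seq").schema(valueOnly).load("path").printSchema()
```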
You are welcome to submit pull requests with any changes to this repository at any time. I'll be very glad to see any contributions.