apache/sedona

A cluster computing framework for processing large-scale geospatial data

Home Page: https://sedona.apache.org/


Error reading file greater than 2GB

chitra-psg opened this issue

Expected behavior

Expecting TIFF files of any size to be readable

from pyspark.sql.functions import expr

sample_raster = (sedona.read.format("binaryFile").load(vFilePath)
                 .withColumn("raster", expr("RS_FromGeoTiff(content)")))

sample_raster.createOrReplaceTempView("sample_raster")
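A hypothetical follow-up query (not part of the original report) showing how the view would be used once the load succeeds, using Sedona's RS_Metadata function:

sedona.sql("SELECT RS_Metadata(raster) FROM sample_raster").show()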

Actual behavior

Error while reading file dbfs:/mnt/XYZ/sample.tif.
Caused by: SparkException: The length of dbfs:/mnt/XYZ/sample.tif is 11894624583, which exceeds the max length allowed: 2147483647.

Steps to reproduce the problem

Use files of size greater than 2 GB as the source

Settings

Sedona version = 1.5.0

@chitra-psg

There is no direct way to fix this if you use Databricks. Spark's binaryFile data source reads each file into a single byte array, and JVM arrays are capped at Integer.MAX_VALUE (2147483647) bytes, hence the 2 GB limit in the error. Sedona's in-memory raster computation engine is not intended to load a large GeoTiff into memory; it is designed to handle a massive number of small GeoTiff images.

The correct way to handle this is to split the huge image into small TIFF files on S3, then load those into Sedona.
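For example, here is a minimal pre-tiling sketch (not from this thread) using rasterio's windowed reads; the input path, the tiles/ output directory, and the 1024-pixel tile size are all assumptions:

import os
import rasterio
from rasterio.windows import Window

TILE = 1024  # tile edge length in pixels (assumption)

os.makedirs("tiles", exist_ok=True)
with rasterio.open("sample.tif") as src:
    for row in range(0, src.height, TILE):
        for col in range(0, src.width, TILE):
            # Read only this window, so the full multi-GB file never sits in memory
            win = Window(col, row,
                         min(TILE, src.width - col),
                         min(TILE, src.height - row))
            profile = src.profile.copy()
            profile.update(width=win.width, height=win.height,
                           transform=src.window_transform(win))
            # Each tile becomes an independent GeoTiff that the
            # binaryFile reader shown above can load
            with rasterio.open(f"tiles/tile_{row}_{col}.tif", "w", **profile) as dst:
                dst.write(src.read(window=win))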

SedonaDB from Wherobots (https://wherobots.com/) offers a new raster processing mode called out-db mode (https://docs.wherobots.services/latest/references/havasu/raster/out-db-rasters/), which solves this exact problem.

df = (sedona.read.format("binaryFile")
      .load("s3a://XXX/*.tif")
      .drop("content")
      .withColumn("rast", expr("RS_FromPath(path)")))
df.selectExpr("RS_TileExplode(rast) as (x, y, rast)").show()
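With out-db mode, RS_FromPath stores a reference to the file rather than its bytes (note the content column is dropped), so the 2 GB binary-column limit does not apply, and RS_TileExplode then emits one row per tile for parallel processing.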

If you are interested, please try it on Wherobots Cloud (https://www.wherobots.services/)