apache/sedona

A cluster computing framework for processing large-scale geospatial data

Home Page: https://sedona.apache.org/


Error reading file greater than 2GB

chitra-psg opened this issue

Expected behavior

Expecting TIFF files of any size to be readable

from pyspark.sql.functions import expr

sample_raster = (sedona.read.format("binaryFile").load(vFilePath)
                 .withColumn("raster", expr("RS_FromGeoTiff(content)")))

sample_raster.createOrReplaceTempView("sample_raster")
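A hypothetical follow-up query (not part of the original report) showing how the view would be used once the load succeeds, using Sedona's RS_Metadata function:

sedona.sql("SELECT RS_Metadata(raster) FROM sample_raster").show()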

Actual behavior

Error while reading file dbfs:/mnt/XYZ/sample.tif.
Caused by: SparkException: The length of dbfs:/mnt/XYZ/sample.tif is 11894624583, which exceeds the max length allowed: 2147483647.

Steps to reproduce the problem

Use files of size greater than 2 GB as the source

Settings

Sedona version = 1.5.0

@chitra-psg

There is no direct way to fix this if you use Databricks. Spark's binaryFile data source reads each file into a single byte array, and JVM arrays are capped at Integer.MAX_VALUE (2147483647) bytes, hence the 2 GB limit in the error. Sedona's in-memory raster computation engine is not intended to load a large GeoTiff into memory; it is designed to handle a massive number of small GeoTiff images.

The correct way to handle this is to split the huge image into small TIFF files on S3, then load those into Sedona.
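For example, here is a minimal pre-tiling sketch (not from this thread) using rasterio's windowed reads; the input path, the tiles/ output directory, and the 1024-pixel tile size are all assumptions:

import os
import rasterio
from rasterio.windows import Window

TILE = 1024  # tile edge length in pixels (assumption)

os.makedirs("tiles", exist_ok=True)
with rasterio.open("sample.tif") as src:
    for row in range(0, src.height, TILE):
        for col in range(0, src.width, TILE):
            # Read only this window, so the full multi-GB file never sits in memory
            win = Window(col, row,
                         min(TILE, src.width - col),
                         min(TILE, src.height - row))
            profile = src.profile.copy()
            profile.update(width=win.width, height=win.height,
                           transform=src.window_transform(win))
            # Each tile becomes an independent GeoTiff that the
            # binaryFile reader shown above can load
            with rasterio.open(f"tiles/tile_{row}_{col}.tif", "w", **profile) as dst:
                dst.write(src.read(window=win))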

SedonaDB from Wherobots (https://wherobots.com/) offers a new raster processing mode called out-db mode (https://docs.wherobots.services/latest/references/havasu/raster/out-db-rasters/), which solves this exact problem.

df = (sedona.read.format("binaryFile")
      .load("s3a://XXX/*.tif")
      .drop("content")
      .withColumn("rast", expr("RS_FromPath(path)")))
df.selectExpr("RS_TileExplode(rast) as (x, y, rast)").show()
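With out-db mode, RS_FromPath stores a reference to the file rather than its bytes (note the content column is dropped), so the 2 GB binary-column limit does not apply, and RS_TileExplode then emits one row per tile for parallel processing.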

If you are interested, please try it on Wherobots Cloud (https://www.wherobots.services/)