Error reading files greater than 2 GB
chitra-psg opened this issue
Expected behavior
Expecting TIFF files of any size to be readable.
sample_raster = sedona.read.format("binaryFile").load(vFilePath) \
    .withColumn("raster", expr("RS_FromGeoTiff(content)"))
sample_raster.createOrReplaceTempView("sample_raster")
Actual behavior
Error while reading file dbfs:/mnt/XYZ/sample.tif.
Caused by: SparkException: The length of dbfs:/mnt/XYZ/sample.tif is 11894624583, which exceeds the max length allowed: 2147483647.
Steps to reproduce the problem
Use files of size greater than 2 GB as the source.
Settings
Sedona version = 1.5.0
There is no direct way to fix this if you use Databricks. The limit of 2147483647 bytes is Java's Int.MaxValue: Spark's binaryFile source reads each file into a single byte array, which cannot exceed that size. Sedona's in-memory raster computation engine is not intended to load a large GeoTiff into memory; it is designed to handle a massive number of small GeoTiff images.
The correct approach is to split this huge image into small TIFF files on S3, then load those into Sedona.
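As a minimal sketch of the splitting step, the snippet below computes 512x512 pixel windows covering a raster and builds the corresponding `gdal_translate -srcwin` commands. The raster dimensions and output file names are illustrative assumptions; actual pixel extraction would be done with GDAL or rasterio against your real file.

```python
# Sketch: compute 512x512 tile windows for splitting a large GeoTIFF.
# Dimensions and file names are illustrative assumptions.

def tile_windows(width, height, tile=512):
    """Yield (xoff, yoff, xsize, ysize) windows covering the raster,
    clipping the last row/column of tiles at the raster edge."""
    for yoff in range(0, height, tile):
        for xoff in range(0, width, tile):
            yield (xoff, yoff,
                   min(tile, width - xoff),
                   min(tile, height - yoff))

def gdal_commands(src, width, height, tile=512):
    """Build one gdal_translate -srcwin command per tile."""
    return [
        f"gdal_translate -srcwin {x} {y} {w} {h} {src} tile_{x}_{y}.tif"
        for x, y, w, h in tile_windows(width, height, tile)
    ]

# Example: a 1100x600 raster splits into a 3x2 grid of six tiles.
cmds = gdal_commands("sample.tif", 1100, 600)
print(len(cmds))
```

Running the generated commands (or using GDAL's `gdal_retile.py` utility directly) produces small tiles that fit well under the 2 GB limit and can then be uploaded to S3 and loaded with `RS_FromGeoTiff`.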
SedonaDB from Wherobots (https://wherobots.com/) offers a new raster processing mode called out-db mode (https://docs.wherobots.services/latest/references/havasu/raster/out-db-rasters/). It can solve this exact problem.
df = sedona.read.format("binaryFile") \
    .load("s3a://XXX/*.tif") \
    .drop("content") \
    .withColumn("rast", expr("RS_FromPath(path)"))
df.selectExpr("RS_TileExplode(rast) as (x, y, rast)").show()
If you are interested, please try it on Wherobots Cloud (https://www.wherobots.services/).