apache / sedona

A cluster computing framework for processing large-scale geospatial data

Home Page: https://sedona.apache.org/


Unable to use sedona.global.charset in ShapefileReader

adamaps opened this issue

Expected behavior

ShapefileReader.readToGeometryRDD(sedona_context, shp_file) should use the sedona.global.charset configuration property set in the Spark session when reading shapefiles containing non-ASCII characters.

E.g. A shapefile containing an attribute value "Ariñiz/Aríñez" should appear in a dataframe as "Ariñiz/Aríñez".

Actual behavior

ShapefileReader.readToGeometryRDD(sedona_context, shp_file) does not use the charset configuration property set in the Spark context.

E.g. a shapefile containing an attribute value "Ariñiz/Aríñez" appears in a dataframe with garbled (mojibake) characters instead, because the attribute bytes are decoded with the wrong charset.

Steps to reproduce the problem

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.spark import SedonaContext
from sedona.utils.adapter import Adapter

# Set the charset as a Spark configuration property before creating the session.
conf = SparkConf()
conf.set("sedona.global.charset", "utf8")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

sedona = SedonaContext.create(spark)
sedona_context = sedona.sparkContext

# Read the shapefile into a SpatialRDD and convert it to a dataframe.
shp_file = '[aws s3 path to shapefile]'
shp_rdd = ShapefileReader.readToGeometryRDD(sedona_context, shp_file)
shp_df = Adapter.toDf(shp_rdd, sedona)

I can confirm that ("sedona.global.charset", "utf8") appears in the configuration settings by using:

print(sedona_context.getConf().getAll())

I also tried setting the charset property after creating the sedona context as follows (although this appears to be an older solution):

sedona_context.setSystemProperty("sedona.global.charset", "utf8")

Please confirm how to set this configuration property correctly.

Settings

Sedona version = 1.5.1

Apache Spark version = 3.3.0

API type = Python

Python version = 3.10

Environment = AWS Glue 4.0 using sedona-spark-shaded-3.0_2.12-1.5.1.jar and geotools-wrapper-1.5.1-28.2.jar

@adamaps If you are running Sedona in cluster mode, this needs to be set via spark.executorEnv.[EnvironmentVariableName]: https://spark.apache.org/docs/latest/configuration.html

In your case, you might want to try this:

spark.executorEnv.sedona.global.charset utf8
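In PySpark, a sketch of the same setting applied when the session is built (assuming you construct the session yourself; not verified on Glue):

from pyspark.sql import SparkSession

# Export the environment variable to executor processes at session build time.
spark = (SparkSession.builder
         .config("spark.executorEnv.sedona.global.charset", "utf8")
         .getOrCreate())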

spark.executorEnv is a runtime config that can be set after your SparkSession or SedonaContext has been initialized:

spark.conf.set("spark.executorEnv.sedona.global.charset","utf8")
sedona.conf.set("spark.executorEnv.sedona.global.charset","utf8")

Thank you for the quick response, @jiayuasu !

I tested the following in client mode before creating the Sedona SparkSession/SparkContext (via a local Docker container):

conf = SparkConf()
conf.set("sedona.global.charset", "utf8")  # I have other conf settings not shown here
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sedona = SedonaContext.create(spark)

And I tested both of the following in cluster mode after creating the Sedona SparkSession/SparkContext (via AWS Glue):

spark.conf.set("spark.executorEnv.sedona.global.charset","utf8")
sedona.conf.set("spark.executorEnv.sedona.global.charset","utf8")

Unfortunately I still see the same issue in both cases.
Are you able to replicate (or reject) the issue using the attached shapefile sample?

shapefile_sample.zip

sedona.global.charset has to be set as a Java system property. You can try setting the following Spark configurations:

spark.driver.extraJavaOptions  -Dsedona.global.charset=utf8
spark.executor.extraJavaOptions  -Dsedona.global.charset=utf8
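If you want to confirm the property actually reached the driver JVM, a quick check through PySpark's py4j gateway (note that _jvm is an internal PySpark handle, so treat this as a debugging sketch; assumes spark is your active session):

# Read the JVM system property back from the driver.
# This should print "utf8" once the -D flag has taken effect.
print(spark.sparkContext._jvm.java.lang.System.getProperty("sedona.global.charset"))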

The dataframe loaded from the sample shapefile:

+--------------------+--------------------+--------------------+--------------------+
|            geometry|                  ID|                Name|          Name_ASCII|
+--------------------+--------------------+--------------------+--------------------+
|MULTIPOLYGON (((-...|01015               |Ariñiz/Aríñez    ...|Ariniz/Arinez    ...|
+--------------------+--------------------+--------------------+--------------------+

Thank you, @Kontinuation! 🎉

I can confirm that setting the following configuration parameter in PySpark worked for my local setup. And thanks @jiayuasu for updating the docs.

conf.set("spark.driver.extraJavaOptions", "-Dsedona.global.charset=utf8")

Running on AWS Glue still causes issues, but this seems specific to our setup.