[SUPPORT] URI too long error
michael1991 opened this issue
Describe the problem you faced

I'm using Spark 3.5 + Hudi 0.15.0 with a partitioned table. When I choose req_date and req_hour as the partition column names, I get the error below, although the task does eventually complete successfully. When I choose date and hour as the partition column names instead, the error disappears.

Expected behavior

Making the partition column names slightly longer should not cause any errors.
Environment Description
- Hudi version : 0.15.0
- Spark version : 3.5.0
- Hive version : NA
- Hadoop version : 3.3.6
- Storage (HDFS/S3/GCS..) : GCS
- Running on Docker? (yes/no) : no
Stacktrace
2024-06-13 13:21:13 ERROR PriorityBasedFileSystemView:129 - Got error running preferred function. Trying secondary
org.apache.hudi.exception.HoodieRemoteException: URI Too Long
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.loadPartitions(RemoteHoodieTableFileSystemView.java:447) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.loadPartitions(RemoteHoodieTableFileSystemView.java:465) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.lambda$loadPartitions$6e5c444d$1(PriorityBasedFileSystemView.java:187) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:69) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.loadPartitions(PriorityBasedFileSystemView.java:185) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:133) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:174) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.execute(CleanPlanActionExecutor.java:200) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.scheduleCleaning(HoodieSparkCopyOnWriteTable.java:212) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.client.BaseHoodieTableServiceClient.scheduleTableServiceInternal(BaseHoodieTableServiceClient.java:647) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:746) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:843) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:816) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:847) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.client.BaseHoodieWriteClient.autoCleanOnCommit(BaseHoodieWriteClient.java:581) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.client.BaseHoodieWriteClient.mayBeCleanAndArchive(BaseHoodieWriteClient.java:560) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:251) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:108) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.HoodieSparkSqlWriterInternal.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1082) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:508) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:473) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) ~[spark-sql-api_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:473) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:449) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:859) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:388) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:361) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:240) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) [scala-library-2.12.18.jar:?]
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) [scala-library-2.12.18.jar:?]
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) [scala-library-2.12.18.jar:?]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) [spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1032) [spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194) [spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217) [spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) [spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1124) [spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1133) [spark-core_2.12-3.5.0.jar:3.5.0]
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) [spark-core_2.12-3.5.0.jar:3.5.0]
Caused by: org.apache.hudi.org.apache.http.client.HttpResponseException: URI Too Long
at org.apache.hudi.org.apache.http.impl.client.AbstractResponseHandler.handleResponse(AbstractResponseHandler.java:69) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.org.apache.http.client.fluent.Response.handleResponse(Response.java:90) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.org.apache.http.client.fluent.Response.returnContent(Response.java:97) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:189) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.loadPartitions(RemoteHoodieTableFileSystemView.java:445) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
... 71 more
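For context on why longer column names can matter here: the RemoteHoodieTableFileSystemView in the stack trace talks to the embedded timeline server over HTTP, and HTTP servers reject requests whose URI exceeds a configured limit ("URI Too Long", status 414). A rough back-of-envelope sketch, assuming the hive-style partition list is carried in the request URI and an 8 KB limit (Jetty's common default); the per-partition format and the limit are assumptions for illustration, not Hudi's exact wire format:

```python
# Back-of-envelope: if each hive-style partition path travels in the request
# URI, per-partition overhead scales with partition column-name length.
def query_string_len(col1: str, col2: str, num_partitions: int) -> int:
    # Assumed shape per partition: "<col1>=2024-06-17/<col2>=00" plus a
    # one-character separator (hypothetical, for illustration only).
    per_partition = len(f"{col1}=2024-06-17/{col2}=00") + 1
    return per_partition * num_partitions

LIMIT = 8192  # assumed 8 KB request-URI/header limit

# Two extra "req_" prefixes per partition are enough to cross the limit
# at a few hundred partitions, while the short names stay under it.
print(query_string_len("req_date", "req_hour", 300))  # 9600 > 8192
print(query_string_len("date", "hour", 300))          # 7200 < 8192
```

Under these assumptions, lengthening each partition column name from date/hour to req_date/req_hour pushes a few hundred hourly partitions past the limit, which would match the symptom that the shorter names make the error disappear.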
@michael1991 Thanks for raising this. Can you help me reproduce this issue? I tried the code below, but it worked fine for me.
# Repro attempt (requires: faker, pandas, and a SparkSession `spark` with the
# Hudi bundle on the classpath; PATH is the target table path)
from faker import Faker
import pandas as pd

fake = Faker()
data = [{"ID": fake.uuid4(), "EventTime": "2023-03-04 14:44:42.046661",
         "FullName": fake.name(), "Address": fake.address(),
         "CompanyName": fake.company(), "JobTitle": fake.job(),
         "EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
         "RandomText": fake.sentence(), "CityNameDummyBigFieldName": fake.city(), "ts": "1",
         "StateNameDummyBigFieldName": fake.state(), "Country": fake.country()} for _ in range(1000)]
pandas_df = pd.DataFrame(data)

hoodie_properties = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.write.recordkey.field': 'ID',
    'hoodie.datasource.write.partitionpath.field': 'StateNameDummyBigFieldName,CityNameDummyBigFieldName',
    'hoodie.table.name': 'test'
}

spark.sparkContext.setLogLevel("WARN")
df = spark.createDataFrame(pandas_df)
df.write.format("hudi").options(**hoodie_properties).mode("overwrite").save(PATH)
for i in range(1, 50):
    df.write.format("hudi").options(**hoodie_properties).mode("append").save(PATH)
Hi @ad1happy2go , glad to hear from you again ~
Can you try column names with underscores? I'm not sure whether enabling urlencode for partitions, combined with underscores in the partition column names, could make this happen.
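One note on the URL-encoding guess: underscores are in the RFC 3986 unreserved set, so standard percent-encoding leaves them unchanged and would not by itself lengthen the URI. A quick illustration with Python's urllib (an illustration of percent-encoding rules, not Hudi's actual encoder):

```python
from urllib.parse import quote

# '_' and '-' are unreserved characters (RFC 3986) and survive
# percent-encoding as-is; only reserved characters such as '=' and '/'
# get escaped when safe="".
print(quote("req_date=2024-06-17/req_hour=00", safe=""))
# → req_date%3D2024-06-17%2Freq_hour%3D00
```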
@michael1991
How many partitions are in the table? Is it possible to get the URI? I was not able to reproduce this, though.
@ad1happy2go Partitions are hourly, for example gs://bucket/tables/hudi/r_date=2024-06-17/r_hour=00. The problem only occurs with two partition columns whose names contain underscores; we also use a single partition column like yyyyMMddHH and that works fine. Not sure of the exact cause.
Can you try reproducing this issue with the sample code, @michael1991? That will help us triage it better.