apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.

Home Page: https://hudi.apache.org/

[SUPPORT] URI too long error

michael1991 opened this issue

Describe the problem you faced

I'm using Spark 3.5 + Hudi 0.15.0 with a partitioned table. When I choose req_date and req_hour as the partition column names, I get the error below, although the task eventually completes successfully.
When I choose date and hour as the partition column names instead, the error disappears.
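
For reference, a minimal sketch of the writer configuration described above (the record key, table name, and target path are placeholders; only the Spark/Hudi versions and the partition column names come from this report):

hudi_options = {
    'hoodie.datasource.write.recordkey.field': 'id',                     # placeholder record key
    'hoodie.datasource.write.partitionpath.field': 'req_date,req_hour',  # the failing column names
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.table.name': 'events',                                       # placeholder table name
}

df.write.format("hudi").options(**hudi_options).mode("append").save("gs://bucket/tables/hudi")  # placeholder GCS path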

Expected behavior

No errors should occur just because the partition column names are a bit longer.

Environment Description

  • Hudi version : 0.15.0

  • Spark version : 3.5.0

  • Hive version : NA

  • Hadoop version : 3.3.6

  • Storage (HDFS/S3/GCS..) : GCS

  • Running on Docker? (yes/no) : no

Stacktrace

2024-06-13 13:21:13 ERROR PriorityBasedFileSystemView:129 - Got error running preferred function. Trying secondary
org.apache.hudi.exception.HoodieRemoteException: URI Too Long
	at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.loadPartitions(RemoteHoodieTableFileSystemView.java:447) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.loadPartitions(RemoteHoodieTableFileSystemView.java:465) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.lambda$loadPartitions$6e5c444d$1(PriorityBasedFileSystemView.java:187) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:69) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.loadPartitions(PriorityBasedFileSystemView.java:185) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:133) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:174) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.execute(CleanPlanActionExecutor.java:200) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.scheduleCleaning(HoodieSparkCopyOnWriteTable.java:212) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.client.BaseHoodieTableServiceClient.scheduleTableServiceInternal(BaseHoodieTableServiceClient.java:647) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:746) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:843) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:816) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:847) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.client.BaseHoodieWriteClient.autoCleanOnCommit(BaseHoodieWriteClient.java:581) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.client.BaseHoodieWriteClient.mayBeCleanAndArchive(BaseHoodieWriteClient.java:560) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:251) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:108) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.HoodieSparkSqlWriterInternal.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1082) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:508) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:473) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) ~[spark-sql-api_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:473) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:449) ~[spark-catalyst_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142) ~[spark-sql_2.12-3.5.0.jar:0.15.0]
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:859) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:388) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:361) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:240) ~[spark-sql_2.12-3.5.0.jar:3.5.0]
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) [scala-library-2.12.18.jar:?]
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) [scala-library-2.12.18.jar:?]
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) [scala-library-2.12.18.jar:?]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) [spark-core_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1032) [spark-core_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194) [spark-core_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217) [spark-core_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) [spark-core_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1124) [spark-core_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1133) [spark-core_2.12-3.5.0.jar:3.5.0]
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) [spark-core_2.12-3.5.0.jar:3.5.0]
Caused by: org.apache.hudi.org.apache.http.client.HttpResponseException: URI Too Long
	at org.apache.hudi.org.apache.http.impl.client.AbstractResponseHandler.handleResponse(AbstractResponseHandler.java:69) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.org.apache.http.client.fluent.Response.handleResponse(Response.java:90) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.org.apache.http.client.fluent.Response.returnContent(Response.java:97) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:189) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.loadPartitions(RemoteHoodieTableFileSystemView.java:445) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
	... 71 more

@michael1991 Thanks for raising this. Can you help me reproduce the issue? I tried the code below, but it worked fine for me.

from faker import Faker
import pandas as pd

# Assumes an existing SparkSession `spark` (with the Hudi bundle on the classpath)
# and a target table path `PATH`.
fake = Faker()
data = [{"ID": fake.uuid4(), "EventTime": "2023-03-04 14:44:42.046661",
         "FullName": fake.name(), "Address": fake.address(),
         "CompanyName": fake.company(), "JobTitle": fake.job(),
         "EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
         "RandomText": fake.sentence(), "CityNameDummyBigFieldName": fake.city(), "ts": "1",
         "StateNameDummyBigFieldName": fake.state(), "Country": fake.country()} for _ in range(1000)]
pandas_df = pd.DataFrame(data)

hoodie_properties = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.write.recordkey.field': 'ID',
    'hoodie.datasource.write.partitionpath.field': 'StateNameDummyBigFieldName,CityNameDummyBigFieldName',
    'hoodie.table.name': 'test'
}

spark.sparkContext.setLogLevel("WARN")
df = spark.createDataFrame(pandas_df)
# Initial write, followed by repeated upserts to trigger cleaning/table services.
df.write.format("hudi").options(**hoodie_properties).mode("overwrite").save(PATH)

for _ in range(1, 50):
    df.write.format("hudi").options(**hoodie_properties).mode("append").save(PATH)

Hi @ad1happy2go, glad to hear from you again!
Could you try column names with underscores? I'm not sure whether enabling URL encoding for partition paths, combined with partition column names that contain underscores, could cause this.
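
For example, a variation of your repro with underscore-style partition column names and URL encoding enabled might look like this (a sketch only; the renamed columns are illustrative, and it reuses `df`, `hoodie_properties`, and `PATH` from the code above):

# Rename the partition columns to underscore-style names.
df_underscore = (df
    .withColumnRenamed('StateNameDummyBigFieldName', 'state_name_dummy_big_field')
    .withColumnRenamed('CityNameDummyBigFieldName', 'city_name_dummy_big_field'))

underscore_properties = dict(hoodie_properties)
underscore_properties.update({
    'hoodie.datasource.write.partitionpath.field': 'state_name_dummy_big_field,city_name_dummy_big_field',
    # URL-encode the partition path values, as mentioned above.
    'hoodie.datasource.write.partitionpath.urlencode': 'true',
})

df_underscore.write.format("hudi").options(**underscore_properties).mode("overwrite").save(PATH)
for _ in range(50):
    df_underscore.write.format("hudi").options(**underscore_properties).mode("append").save(PATH)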

@michael1991
How many partitions does the table have? Is it possible to get the failing URI? I was not able to reproduce this.

@ad1happy2go Partitions are hourly, for example gs://bucket/tables/hudi/r_date=2024-06-17/r_hour=00. The problem only occurs with two partition columns whose names contain underscores; with a single partition column like yyyyMMddHH everything works fine. I'm not sure of the exact cause.
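
For comparison, a rough sketch of the single-column layout that works for us, expressed on top of the repro above (the derived column name is illustrative):

from pyspark.sql import functions as F

# Derive a single yyyyMMddHH partition column instead of two separate ones.
df_single = df.withColumn('partition_hour', F.date_format('EventTime', 'yyyyMMddHH'))

single_properties = dict(hoodie_properties)
single_properties['hoodie.datasource.write.partitionpath.field'] = 'partition_hour'

df_single.write.format("hudi").options(**single_properties).mode("append").save(PATH)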

@michael1991 Can you try reproducing the issue with the sample code above? That will help us triage it better.