wgzhao / Addax

Addax is a versatile open-source ETL tool that can seamlessly transfer data between various RDBMS and NoSQL databases, making it an ideal solution for data migration.

Home Page: https://wgzhao.github.io/Addax/


[Bug]: ORC-format data written via hdfswriter cannot be read

chen-rj88 opened this issue · comments

What happened?

Writing ORC format via hdfswriter succeeds, but querying the data through Hive fails with "ORC split generation failed with exception: java.lang.ArrayIndexOutOfBoundsException". The same job works correctly in DataX.

Version

4.1.3 (Default)

OS Type

Linux (Default)

Java JDK Version

Oracle JDK 1.8.0

Relevant log output

No response

Thanks for the report. Please provide the complete run log so we can reproduce and locate the problem.

Hello, the detailed execution log is attached; the screenshot shows the result of querying the data via SQL after the job completed successfully. Looking forward to your reply~
addax_oracle-hive-orc_20240711_150451_186301.log
err_msg

Please provide the complete Addax execution log, preferably as text rather than screenshots.

Here is the log output:

2024-07-11 15:04:51.673 [        main] INFO  VMInfo               - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2024-07-11 15:04:51.689 [        main] INFO  Engine               - 
{
	"setting":{
		"speed":{
			"channel":1
		},
		"errorLimit":{
			"record":0,
			"percentage":0.02
		}
	},
	"content":{
		"reader":{
			"name":"sqlserverreader",
			"parameter":{
				"username":"YjsySjdj",
				"password":"*****",
				"column":[
					"[termcode]",
					"[termname]",
					"[year]",
					"[term]",
					"[weeks]",
					"[begindate]",
					"[enddate]",
					"[currentxq]"
				],
				"where":"termcode != '3'",
				"connection":[
					{
						"table":[
							"dbo.py_tabterm"
						],
						"jdbcUrl":[
							"jdbc:sqlserver://10.100.98.16:1433;DatabaseName=gmis5;CurrentSchema=dbo;encrypt=false"
						]
					}
				]
			}
		},
		"writer":{
			"name":"hdfswriter",
			"parameter":{
				"defaultFS":"hdfs://10.100.98.94:8020",
				"fileType":"orc",
				"fieldDelimiter":"\u0001",
				"path":"/data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709",
				"fileName":"ods_py_tabterm2",
				"writeMode":"truncate",
				"column":[
					{
						"name":"termcode",
						"type":"string"
					},
					{
						"name":"termname",
						"type":"string"
					},
					{
						"name":"year",
						"type":"int"
					},
					{
						"name":"term",
						"type":"string"
					},
					{
						"name":"weeks",
						"type":"int"
					},
					{
						"name":"begindate",
						"type":"string"
					},
					{
						"name":"enddate",
						"type":"string"
					},
					{
						"name":"currentxq",
						"type":"string"
					}
				]
			}
		}
	}
}

2024-07-11 15:04:51.717 [        main] INFO  JobContainer         - The jobContainer begins to process the job.
2024-07-11 15:04:52.148 [       job-0] INFO  OriginalConfPretreatmentUtil - Available jdbcUrl [jdbc:sqlserver://10.100.98.16:1433;DatabaseName=gmis5;CurrentSchema=dbo;encrypt=false].
2024-07-11 15:04:52.188 [       job-0] INFO  OriginalConfPretreatmentUtil - The table [dbo.py_tabterm] has columns [termcode,termname,year,term,weeks,begindate,enddate,currentxq].
2024-07-11 15:04:52.427 [       job-0] WARN  NativeCodeLoader     - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-07-11 15:04:52.905 [       job-0] INFO  JobContainer         - The Reader.Job [sqlserverreader] perform prepare work .
2024-07-11 15:04:52.905 [       job-0] INFO  JobContainer         - The Writer.Job [hdfswriter] perform prepare work .
2024-07-11 15:04:53.053 [       job-0] INFO  JobContainer         - Job set Channel-Number to 1 channel(s).
2024-07-11 15:04:53.055 [       job-0] INFO  JobContainer         - The Reader.Job [sqlserverreader] is divided into [1] task(s).
2024-07-11 15:04:53.056 [       job-0] INFO  HdfsWriter$Job       - Begin splitting ...
2024-07-11 15:04:53.066 [       job-0] INFO  HdfsWriter$Job       - The split wrote files :[/data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa/ods_py_tabterm2_20240711_150453_065_8tw5v1gu.orc]
2024-07-11 15:04:53.067 [       job-0] INFO  HdfsWriter$Job       - Finish splitting.
2024-07-11 15:04:53.067 [       job-0] INFO  JobContainer         - The Writer.Job [hdfswriter] is divided into [1] task(s).
2024-07-11 15:04:53.088 [       job-0] INFO  JobContainer         - The Scheduler launches [1] taskGroup(s).
2024-07-11 15:04:53.095 [ taskGroup-0] INFO  TaskGroupContainer   - The taskGroupId=[0] started [1] channels for [1] tasks.
2024-07-11 15:04:53.097 [ taskGroup-0] INFO  Channel              - The Channel set byte_speed_limit to -1, No bps activated.
2024-07-11 15:04:53.097 [ taskGroup-0] INFO  Channel              - The Channel set record_speed_limit to -1, No tps activated.
2024-07-11 15:04:53.106 [  reader-0-0] INFO  CommonRdbmsReader$Task - Begin reading records by executing SQL query: [select [termcode],[termname],[year],[term],[weeks],[begindate],[enddate],[currentxq] from dbo.py_tabterm where (termcode != '3')].
2024-07-11 15:04:53.112 [  writer-0-0] INFO  HdfsWriter$Task      - Begin to write file : [/data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa/ods_py_tabterm2_20240711_150453_065_8tw5v1gu.orc]
2024-07-11 15:04:53.141 [  writer-0-0] INFO  HadoopShimsCurrent   - Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
2024-07-11 15:04:53.151 [  reader-0-0] INFO  CommonRdbmsReader$Task - Finished reading records by executing SQL query: [select [termcode],[termname],[year],[term],[weeks],[begindate],[enddate],[currentxq] from dbo.py_tabterm where (termcode != '3')].
2024-07-11 15:04:53.189 [  writer-0-0] INFO  PhysicalFsWriter     - ORC writer created for path: /data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa/ods_py_tabterm2_20240711_150453_065_8tw5v1gu.orc with stripeSize: 67108864 blockSize: 268435456 compression: Compress: none buffer: 262144
2024-07-11 15:04:53.218 [  writer-0-0] INFO  WriterImpl           - ORC writer created for path: /data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa/ods_py_tabterm2_20240711_150453_065_8tw5v1gu.orc with stripeSize: 67108864 options: Compress: none buffer: 262144
2024-07-11 15:04:53.405 [  writer-0-0] INFO  HdfsWriter$Task      - Finish write
2024-07-11 15:04:56.104 [       job-0] INFO  AbstractScheduler    - The scheduler has completed all tasks.
2024-07-11 15:04:56.104 [       job-0] INFO  JobContainer         - The Writer.Job [hdfswriter] perform post work.
2024-07-11 15:04:56.106 [       job-0] INFO  HdfsHelper           - Begin to move file from [hdfs://10.100.98.94:8020/data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa/ods_py_tabterm2_20240711_150453_065_8tw5v1gu.orc] to [ds=20240709].
2024-07-11 15:04:56.114 [       job-0] INFO  HdfsHelper           - Finish move file(s).
2024-07-11 15:04:56.114 [       job-0] INFO  HdfsHelper           - Begin to delete temporary dir [/data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa] .
2024-07-11 15:04:56.119 [       job-0] INFO  HdfsHelper           - Finish deleting temporary dir [/data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa] .
2024-07-11 15:04:56.119 [       job-0] INFO  JobContainer         - The Reader.Job [sqlserverreader] perform post work.
2024-07-11 15:04:56.121 [       job-0] INFO  StandAloneJobContainerCommunicator - Total 43 records, 1979 bytes | Speed 659B/s, 14 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2024-07-11 15:04:56.122 [       job-0] INFO  JobContainer         - 
Job start  at             : 2024-07-11 15:04:51
Job end    at             : 2024-07-11 15:04:56
Job took secs             :                  4s
Average   bps             :              659B/s
Average   rps             :             14rec/s
Number of rec             :                  43
Failed record             :                   0

From the log, everything looks fine.
Could you also provide the following:

  • the Hive CREATE TABLE statement
  • the Hive version
  • the ORC file that fails to query, if sharing it does not leak sensitive data
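
If it helps, the exact DDL and partition layout can be dumped from Hive itself (the database name gmc_ods below is inferred from the HDFS path in the job config):

-- Dump the table definition as Hive sees it
SHOW CREATE TABLE gmc_ods.ods_py_tabterm2;
-- List the partitions registered in the metastore
SHOW PARTITIONS gmc_ods.ods_py_tabterm2;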

Got it. The details are as follows:

  • Hive version: 2.1.1
  • Hive CREATE TABLE statement:

    CREATE TABLE ods_py_tabterm2 (
      termcode STRING,
      termname STRING,
      year INT,
      term STRING,
      weeks INT,
      begindate STRING,
      enddate STRING,
      currentxq STRING
    ) PARTITIONED BY (ds STRING) STORED AS ORC;

  • The ORC data file is attached:
    ods_py_tabterm2_20240711_154057_418_n2vwuwwb.orc.zip

I used the data you provided, and it reads without problems in my Hive environment. The steps were:

Create the external table and add the partition

CREATE EXTERNAL TABLE ods_py_tabterm2 (
  termcode STRING,
  termname STRING,
  year INT,
  term STRING,
  weeks INT,
  begindate STRING,
  enddate STRING,
  currentxq STRING
) PARTITIONED BY (ds STRING) STORED AS ORC
LOCATION "/tmp/ods_py_tabterm2";

ALTER TABLE ods_py_tabterm2 ADD IF NOT EXISTS PARTITION (ds='20240711');

Upload the ORC file to the corresponding partition directory (a SQL-only alternative is sketched below)
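
However the file is uploaded (e.g. with hdfs dfs -put), an equivalent route from inside Hive would be LOAD DATA, assuming the file has first been copied to a staging path on HDFS (the staging path below is hypothetical):

-- Move the ORC file from a staging location into the partition
LOAD DATA INPATH '/tmp/staging/ods_py_tabterm2_20240711_154057_418_n2vwuwwb.orc'
INTO TABLE ods_py_tabterm2 PARTITION (ds='20240711');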

Run the query

select * from ods_py_tabterm2 where ds = '20240711' limit 10;
INFO  : Compiling command(queryId=hive_20240711203322_704e35c0-ab5c-440b-b18d-8946b167ba52): select * from ods_py_tabterm2 where ds = '20240711' limit 10
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:ods_py_tabterm2.termcode, type:string, comment:null), FieldSchema(name:ods_py_tabterm2.termname, type:string, comment:null), FieldSchema(name:ods_py_tabterm2.year, type:int, comment:null), FieldSchema(name:ods_py_tabterm2.term, type:string, comment:null), FieldSchema(name:ods_py_tabterm2.weeks, type:int, comment:null), FieldSchema(name:ods_py_tabterm2.begindate, type:string, comment:null), FieldSchema(name:ods_py_tabterm2.enddate, type:string, comment:null), FieldSchema(name:ods_py_tabterm2.currentxq, type:string, comment:null), FieldSchema(name:ods_py_tabterm2.ds, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20240711203322_704e35c0-ab5c-440b-b18d-8946b167ba52); Time taken: 0.285 seconds
INFO  : Executing command(queryId=hive_20240711203322_704e35c0-ab5c-440b-b18d-8946b167ba52): select * from ods_py_tabterm2 where ds = '20240711' limit 10
INFO  : Completed executing command(queryId=hive_20240711203322_704e35c0-ab5c-440b-b18d-8946b167ba52); Time taken: 0.015 seconds
INFO  : OK
+---------------------------+---------------------------+-----------------------+-----------------------+------------------------+----------------------------+--------------------------+----------------------------+---------------------+
| ods_py_tabterm2.termcode  | ods_py_tabterm2.termname  | ods_py_tabterm2.year  | ods_py_tabterm2.term  | ods_py_tabterm2.weeks  | ods_py_tabterm2.begindate  | ods_py_tabterm2.enddate  | ods_py_tabterm2.currentxq  | ods_py_tabterm2.ds  |
+---------------------------+---------------------------+-----------------------+-----------------------+------------------------+----------------------------+--------------------------+----------------------------+---------------------+
| 10                        | 2006-2007春学期              | 2007                  | 2                     | 20                     | 2007-03-05 00:00:00.0      | 2007-07-14 00:00:00.0    | 0                          | 20240711            |
| 11                        | 2007-2008秋学期              | 2007                  | 1                     | 20                     | 2007-08-26 00:00:00.0      | 2008-01-20 00:00:00.0    | 0                          | 20240711            |
| 12                        | 2007-2008春学期              | 2008                  | 2                     | 20                     | 2008-02-25 00:00:00.0      | 2008-07-12 00:00:00.0    | 0                          | 20240711            |
| 13                        | 2008-2009秋学期              | 2008                  | 1                     | 20                     | 2008-08-25 00:00:00.0      | 2009-01-09 00:00:00.0    | 0                          | 20240711            |
| 14                        | 2008-2009春学期              | 2009                  | 2                     | 20                     | 2009-02-16 00:00:00.0      | 2009-07-03 00:00:00.0    | 0                          | 20240711            |
| 15                        | 2009-2010秋学期              | 2009                  | 1                     | 21                     | 2009-08-23 00:00:00.0      | 2010-01-16 00:00:00.0    | 0                          | 20240711            |
| 16                        | 2009-2010春学期              | 2010                  | 2                     | 20                     | 2010-02-27 00:00:00.0      | 2010-07-11 00:00:00.0    | 0                          | 20240711            |
| 17                        | 2010-2011秋学期              | 2010                  | 1                     | 20                     | 2010-08-28 00:00:00.0      | 2011-01-16 00:00:00.0    | 0                          | 20240711            |
| 18                        | 2010-2011春学期              | 2011                  | 2                     | 20                     | 2011-02-19 00:00:00.0      | 2011-07-08 00:00:00.0    | 0                          | 20240711            |
| 19                        | 2011-2012秋学期              | 2011                  | 1                     | 20                     | 2011-08-29 00:00:00.0      | 2012-01-15 00:00:00.0    | 0                          | 20240711            |
+---------------------------+---------------------------+-----------------------+-----------------------+------------------------+----------------------------+--------------------------+----------------------------+---------------------+
10 rows selected (0.357 seconds)

My Hive version is 3.1.0, and hdfswriter currently depends on Hadoop 3.0, so there may be a version compatibility issue.

Let me see if I can find a 2.1.1 environment to test against.
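
In the meantime, one hedged thing worth trying (not verified on 2.1.1): the exception is thrown during split generation, where Hive's default HYBRID/ETL strategies read the ORC file footers. Forcing the BI strategy makes Hive cut splits per file without touching the footers:

-- Session-level setting: skip footer reads during ORC split generation
SET hive.exec.orc.split.strategy=BI;
SELECT * FROM gmc_ods.ods_py_tabterm2 WHERE ds = '20240709' LIMIT 10;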

Hello, I just changed hdfswriter's Hadoop dependency to 2.7.1 and tried again, but the data still cannot be read.
Also, I'd like to ask whether this project has a discussion/chat group; if so, I'd like to join. Thanks~