wgzhao / Addax

Addax is a versatile open-source ETL tool that can seamlessly transfer data between various RDBMS and NoSQL databases, making it an ideal solution for data migration.

Home Page: https://wgzhao.github.io/Addax/

[Bug]: Doris returns incorrect values when querying Decimal data on HDFS

wgzhao opened this issue

What happened?

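  1. Install the latest Doris, then create a catalog that connects to Hive.
  2. Use the latest version of Addax to write an ORC file containing a Decimal column to HDFS.
  3. Query the table through Doris. The Decimal column displays incorrect values, as shown below: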
mysql> switch hive;
Query OK, 0 rows affected (0.01 sec)

mysql> select * from `default`.addax_test ;
+------+---------------+
| id   | fee           |
+------+---------------+
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 | -80444563.314 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
|   10 |       123.120 |
+------+---------------+
20 rows in set (0.12 sec)

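The first 10 rows in the table were written by Addax. They display correctly when queried from the Hive command line and from Trino, but incorrectly in Doris.
The last 10 rows were written from the Hive command line with insert into addax_test select * from addax_test, and these 10 rows display correctly.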

Version

4.1.3 (Default)

OS Type

Linux (Default)

Java JDK Version

Oracle JDK 1.8.0

Relevant log output

No response

Metadata of the ORC file that reads correctly:

File Version: 0.12 with ORC_135
Rows: 10
Compression: ZLIB
Compression size: 262144
Type: struct<id:int,fee:decimal(20,3)>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 10 hasNull: false
    Column 1: count: 10 hasNull: false bytesOnDisk: 5 min: 10 max: 10 sum: 100
    Column 2: count: 10 hasNull: false bytesOnDisk: 16 min: 123.12 max: 123.12 sum: 1231.2

File Statistics:
  Column 0: count: 10 hasNull: false
  Column 1: count: 10 hasNull: false bytesOnDisk: 5 min: 10 max: 10 sum: 100
  Column 2: count: 10 hasNull: false bytesOnDisk: 16 min: 123.12 max: 123.12 sum: 1231.2

Stripes:
  Stripe: offset: 3 data: 21 rows: 10 tail: 44 index: 71
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 25
    Stream: column 2 section ROW_INDEX start: 39 length 35
    Stream: column 1 section DATA start: 74 length 5
    Stream: column 2 section DATA start: 79 length 11
    Stream: column 2 section SECONDARY start: 90 length 5
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2

File length: 324 bytes
Padding length: 0 bytes
Padding ratio: 0%

Metadata of the ORC file that reads incorrectly:

File Version: 0.12 with FUTURE
Rows: 10
Compression: LZ4
Compression size: 262144
Type: struct<id:int,fee:decimal(38,18)>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 10 hasNull: false
    Column 1: count: 10 hasNull: false bytesOnDisk: 5 min: 10 max: 10 sum: 100
    Column 2: count: 10 hasNull: false bytesOnDisk: 21 min: 123.12 max: 123.12 sum: 1231.2

File Statistics:
  Column 0: count: 10 hasNull: false
  Column 1: count: 10 hasNull: false bytesOnDisk: 5 min: 10 max: 10 sum: 100
  Column 2: count: 10 hasNull: false bytesOnDisk: 21 min: 123.12 max: 123.12 sum: 1231.2

Stripes:
  Stripe: offset: 3 data: 26 rows: 10 tail: 59 index: 78
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 25
    Stream: column 2 section ROW_INDEX start: 39 length 42
    Stream: column 1 section DATA start: 81 length 5
    Stream: column 2 section DATA start: 86 length 16
    Stream: column 2 section SECONDARY start: 102 length 5
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2

File length: 371 bytes
Padding length: 0 bytes
Padding ratio: 0%
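
The `Type:` lines of the two dumps declare different decimal types for `fee`. As a rough sketch of how that difference could be checked automatically, the snippet below parses two ORC struct type strings and reports decimal columns whose precision or scale differ. The helper names `decimal_fields` and `decimal_mismatches` are hypothetical and are not part of Addax, Doris, or the ORC tooling. The regex only handles flat structs like the ones shown here.

```python
import re

# Matches "name:decimal(p,s)" inside a flat ORC struct type string.
DECIMAL_FIELD = re.compile(r"(\w+):decimal\((\d+),(\d+)\)")


def decimal_fields(orc_type: str) -> dict:
    """Return {column: (precision, scale)} for every decimal column."""
    return {
        name: (int(precision), int(scale))
        for name, precision, scale in DECIMAL_FIELD.findall(orc_type)
    }


def decimal_mismatches(expected: str, actual: str) -> dict:
    """Return {column: (expected_type, actual_type)} where the two differ."""
    exp = decimal_fields(expected)
    act = decimal_fields(actual)
    return {
        col: (exp[col], act[col])
        for col in exp
        if col in act and exp[col] != act[col]
    }


# Type strings copied from the two metadata dumps in this issue.
normal = "struct<id:int,fee:decimal(20,3)>"
abnormal = "struct<id:int,fee:decimal(38,18)>"

print(decimal_mismatches(normal, abnormal))
# {'fee': ((20, 3), (38, 18))}
```

A check like this could run against the output of an ORC metadata dump before a file is exposed to Doris, under the assumption that the type declared in the file should equal the type in the table definition.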

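Further testing suggests the cause is likely a precision mismatch: the file that reads correctly declares decimal(20,3), while the file that reads incorrectly declares decimal(38,18). Doris requires the precision in the column definition to match the precision declared for that column in the ORC file in order to read the column correctly; otherwise the values come back wrong.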
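
To illustrate why a scale mismatch matters in general, here is a minimal sketch. A decimal is commonly represented as an unscaled integer plus a scale, and the ORC format stores the two in separate streams (DATA and SECONDARY in the dumps above). If a reader applies a different scale than the writer used, the same integer decodes to a different number. This is only a conceptual illustration: it is not Doris's actual decoding path, it does not reproduce the exact value -80444563.314 shown in the query output, and the function names are hypothetical.

```python
from decimal import Decimal


def to_unscaled(value: str, scale: int) -> int:
    """Encode a decimal string as an unscaled integer for the given scale."""
    return int(Decimal(value).scaleb(scale))


def from_unscaled(unscaled: int, scale: int) -> Decimal:
    """Decode an unscaled integer using the given scale."""
    return Decimal(unscaled).scaleb(-scale)


# Writer side: 123.12 encoded with scale 18, as in decimal(38,18).
stored = to_unscaled("123.12", 18)

# Reader using the same scale recovers the original value.
print(from_unscaled(stored, 18) == Decimal("123.12"))  # True

# Reader assuming scale 3, as in decimal(20,3), gets a very different number.
print(from_unscaled(stored, 3) == Decimal("123.12"))   # False
```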