[Feature]: Add table summary metrics

zhoujinsong opened this issue 2 months ago · comments

ZhouJinsong commented 2 months ago

Description

Add summary metrics for each table to help users better understand the state of their tables.

Use case/motivation

Amoro already has a Metric System that exposes various metrics externally.

However, it is also crucial to have table-level metrics such as data size, file count, and record count. By adding these metrics, we aim to help users better understand the state of their tables.

Describe the solution

We should define proper table summary metrics first.
AMS refreshes the table runtime every 3 minutes (by default), so it can update the table summary metrics at that stage.
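A minimal sketch of the refresh step described above, assuming a simple in-memory gauge store. `GaugeRegistry`, `refresh_table`, and `load_summary` are hypothetical names used only for illustration; they are not Amoro's actual metric API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

TableId = Tuple[str, str, str]  # (catalog, database, table) tags

REFRESH_INTERVAL_SECONDS = 3 * 60  # the default interval mentioned above


@dataclass
class GaugeRegistry:
    """Hypothetical in-memory gauge store; not Amoro's real metric API."""
    values: Dict[Tuple[str, TableId], int] = field(default_factory=dict)

    def set_gauge(self, name: str, table: TableId, value: int) -> None:
        # A gauge keeps only the latest value, so every refresh overwrites it.
        self.values[(name, table)] = value


def refresh_table(table: TableId,
                  load_summary: Callable[[TableId], Dict[str, int]],
                  registry: GaugeRegistry) -> None:
    """One refresh pass: recompute the summary and overwrite the gauges."""
    for name, value in load_summary(table).items():
        registry.set_gauge(name, table, value)
```

A scheduler would call `refresh_table` for every table once per `REFRESH_INTERVAL_SECONDS`; because the values are gauges, a missed refresh only leaves the previous value in place.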

Subtasks

No response

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

ZhouJinsong · Answer 1 · Thu Aug 15 2024 17:25:41 GMT+0800 (China Standard Time)

Here are some table summary metrics for Iceberg tables:

Metric Name	Type	Tags	Description
table_summary_total_files	Gauge	catalog, database, table	Total number of files in the table
table_summary_data_files	Gauge	catalog, database, table	Number of data files in the table
table_summary_equality_delete_files	Gauge	catalog, database, table	Number of equality delete files in the table
table_summary_position_delete_files	Gauge	catalog, database, table	Number of position delete files in the table
table_summary_total_files_size	Gauge	catalog, database, table	Total size of files in the table
table_summary_data_files_size	Gauge	catalog, database, table	Size of data files in the table
table_summary_equality_delete_files_size	Gauge	catalog, database, table	Size of equality delete files in the table
table_summary_position_delete_files_size	Gauge	catalog, database, table	Size of position delete files in the table
table_summary_total_records	Gauge	catalog, database, table	Total records in the table
table_summary_data_files_records	Gauge	catalog, database, table	Records of data files in the table
table_summary_equality_delete_files_records	Gauge	catalog, database, table	Records of equality delete files in the table
table_summary_position_delete_files_records	Gauge	catalog, database, table	Records of position delete files in the table
table_summary_snapshots	Gauge	catalog, database, table	Number of snapshots in the table
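As a rough illustration of how the gauges in this table could be derived, here is a sketch that folds a list of file entries into the 13 metric values. `FileEntry` and `summarize` are hypothetical helpers, and the sketch assumes that the "total" metrics are the sum over data, equality delete, and position delete files; the table itself does not state this.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

FILE_KINDS = ("data", "equality_delete", "position_delete")


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical per-file record; not an Iceberg or Amoro class."""
    kind: str          # one of FILE_KINDS
    size_bytes: int
    record_count: int


def summarize(files: Iterable[FileEntry], snapshot_count: int) -> Dict[str, int]:
    metrics = {
        "table_summary_total_files": 0,
        "table_summary_total_files_size": 0,
        "table_summary_total_records": 0,
    }
    for kind in FILE_KINDS:
        metrics[f"table_summary_{kind}_files"] = 0
        metrics[f"table_summary_{kind}_files_size"] = 0
        metrics[f"table_summary_{kind}_files_records"] = 0

    for f in files:
        if f.kind not in FILE_KINDS:
            raise ValueError(f"unknown file kind: {f.kind}")
        # Per-kind gauges.
        metrics[f"table_summary_{f.kind}_files"] += 1
        metrics[f"table_summary_{f.kind}_files_size"] += f.size_bytes
        metrics[f"table_summary_{f.kind}_files_records"] += f.record_count
        # Totals: assumed here to include delete files as well.
        metrics["table_summary_total_files"] += 1
        metrics["table_summary_total_files_size"] += f.size_bytes
        metrics["table_summary_total_records"] += f.record_count

    metrics["table_summary_snapshots"] = snapshot_count
    return metrics
```

Initializing every key to zero means an empty table still reports all gauges, which keeps dashboards from showing gaps.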

Please feel free to share your idea about table summary metrics.

baiyangtx · Answer 2 · Thu Aug 15 2024 17:57:34 GMT+0800 (China Standard Time)

Do "table_summary_total_files" and "table_summary_data_files" count only the files in the latest snapshot, or all files that exist on HDFS?

baiyangtx · Answer 3 · Thu Aug 15 2024 17:58:19 GMT+0800 (China Standard Time)

Should we consider making the metrics more general so that they support different table formats?

Congxian Qiu · Answer 4 · Thu Aug 15 2024 18:31:58 GMT+0800 (China Standard Time)

Thanks for driving this; the summary will help users understand what is happening under the hood. Apart from the table-level metrics: 1) Could we add metrics for tasks? Task-level metrics would give more information and help optimize the planning process. 2) Could we add metrics for the commit process?

We have already added some of these metrics in our internal version and can contribute them if needed.

ZhouJinsong · Answer 5 · Thu Aug 15 2024 19:37:58 GMT+0800 (China Standard Time)

Should we consider making the metrics more general so that they support different table formats?

I am glad to make the metric names more general where they can be used for multiple formats, like table_summary_total_files or table_summary_data_files.

However, I also agree with adding specific metrics for different table formats, such as table_summary_equality_delete_files and table_summary_position_delete_files. We should allow different table formats to have different metric items.
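One way to picture this split is a shared set of metric names plus optional per-format additions. This is only a sketch: the format keys and the `metrics_for` helper are hypothetical, and only the Iceberg entries come from the list proposed earlier in this thread.

```python
from typing import Dict, Tuple

# Names that could apply to any table format.
COMMON_METRICS: Tuple[str, ...] = (
    "table_summary_total_files",
    "table_summary_data_files",
)

# Format-specific additions. The keys are illustrative; formats without
# an entry simply expose the common metrics only.
FORMAT_SPECIFIC_METRICS: Dict[str, Tuple[str, ...]] = {
    "iceberg": (
        "table_summary_equality_delete_files",
        "table_summary_position_delete_files",
    ),
}


def metrics_for(table_format: str) -> Tuple[str, ...]:
    """Return the metric names a table of the given format would expose."""
    return COMMON_METRICS + FORMAT_SPECIFIC_METRICS.get(table_format, ())
```

With this shape, adding a new format only requires a new dictionary entry, and dashboards built on the common names keep working across formats.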

ZhouJinsong · Answer 6 · Thu Aug 15 2024 19:40:22 GMT+0800 (China Standard Time)

Do "table_summary_total_files" and "table_summary_data_files" count only the files in the latest snapshot, or all files that exist on HDFS?

IMO, they refer to the summary of the current snapshot.

ZhouJinsong · Answer 7 · Thu Aug 15 2024 19:55:55 GMT+0800 (China Standard Time)

1) Could we add metrics for tasks? Task-level metrics would give more information and help optimize the planning process. 2) Could we add metrics for the commit process?

Great suggestion for plan/commit metrics.

Adding metrics related to queries and commits is indeed very valuable. However, it may depend on Amoro implementing the Iceberg MetricsReporter to collect plan/commit information from different engines. That is a planned feature for Amoro, but it may not be included in this feature.

Nguyễn Quốc Vương · Answer 8 · Fri Aug 16 2024 11:31:08 GMT+0800 (China Standard Time)

Hi @zhoujinsong, we can use this AWS solution as a reference.

ConradJam · Answer 9 · Fri Aug 16 2024 16:26:21 GMT+0800 (China Standard Time)

Hi @zhoujinsong, we can use this AWS solution as a reference.

Thanks for the feedback and for finding this. We can refer to the metrics in this project for our implementation:
https://github.com/aws-samples/monitoring-apache-iceberg-table-metadata-layer

hhippodnsla · Answer 10 · Sat Aug 17 2024 15:39:14 GMT+0800 (China Standard Time)

In my opinion, we can borrow concepts from data warehousing (DW), such as identifying dimensions and facts.
For example, the dimensions could be:

Which table type? Pure Iceberg, mixed Iceberg, or mixed Hive
Which data type? Data or metadata
Is it in use? Expired files or files in use
Which file type for data/metadata? data/eq-del/pos-del/manifest/manifest-list and so on
Is the table partitioned, and which partition?

The facts could be:

number of files
total size of files
90th-percentile file size
median file size
max file size
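To make the dimension/fact idea concrete, here is a small sketch that groups file records by a chosen set of dimensions and computes the facts listed above for each group. The record layout, the `aggregate` helper, and the nearest-rank percentile definition are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Mapping, Sequence, Tuple


def percentile_nearest_rank(sorted_values: Sequence[int], p: int) -> int:
    """Nearest-rank percentile for an integer p in 1..100 (assumed definition)."""
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # ceil(p * n / 100)
    return sorted_values[rank - 1]


def aggregate(files: Iterable[Mapping[str, object]],
              dims: Sequence[str]) -> Dict[Tuple, Dict[str, float]]:
    """Group file records by the given dimensions and compute size facts."""
    groups: Dict[Tuple, List[int]] = defaultdict(list)
    for f in files:
        key = tuple(f[d] for d in dims)
        groups[key].append(int(f["size"]))

    facts = {}
    for key, sizes in groups.items():
        sizes.sort()
        facts[key] = {
            "file_count": len(sizes),
            "total_size": sum(sizes),
            "p90_size": percentile_nearest_rank(sizes, 90),
            "median_size": median(sizes),
            "max_size": sizes[-1],
        }
    return facts
```

For example, `aggregate(files, ["table_type", "file_type"])` yields one set of facts per (table type, file type) pair; adding `"partition"` to `dims` gives a finer breakdown at the cost of many more groups.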

Here is our experience since 0.4:
(let me bring it from my work computer on Monday...)

Furthermore, referencing the metrics used in Iceberg, as czy006 suggested, might be a good idea: it would give us a more detailed view of the data inside the table. It needs more consideration when the table format is mixed Hive, though.

We also need to consider the capacity of the Prometheus reporter, since we have already run into trouble here: a large number of metrics on a single page causes performance issues.
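A back-of-the-envelope way to size this concern is to count time series: every table contributes one series per metric, and every extra label multiplies that by the number of values the label can take. The helper below is a hypothetical sketch, and the table counts in the usage note are made-up inputs, not measurements.

```python
from math import prod
from typing import Sequence


def estimate_series(table_count: int,
                    metrics_per_table: int,
                    extra_label_cardinalities: Sequence[int] = ()) -> int:
    """Worst-case number of series exposed on one metrics page.

    extra_label_cardinalities lists the number of distinct values for each
    additional label beyond (catalog, database, table).
    """
    return table_count * metrics_per_table * prod(extra_label_cardinalities)
```

With the 13 summary metrics proposed earlier in this thread, `estimate_series(10_000, 13)` gives 130,000 series for a hypothetical 10,000 tables; adding a partition label with 100 values per table would raise the worst case to 13,000,000.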

On the other hand, I totally agree with klion26: this would give us a better understanding of what is happening while self-optimizing is running (e.g. add to OptimizerGroupMetric?).