[Feature]: Add table summary metrics
zhoujinsong opened this issue · comments
Description
Add summary metrics for each table to help users better understand the details of their tables.
Use case/motivation
Amoro has currently implemented the Metric System to provide various metric information externally.
However, it is also crucial to have table-level metrics for important aspects such as data size, file count, and record count. By adding these metrics, we aim to help users better understand the state of their tables.
Describe the solution
We should define proper table summary metrics first.
Then, since AMS refreshes the table runtime every 3 minutes (by default), it can update the table summary metrics at that stage.
Subtasks
No response
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct
Here are some table summary metrics for Iceberg tables:
Metric Name | Type | Tags | Description |
---|---|---|---|
table_summary_total_files | Gauge | catalog, database, table | Total number of files in the table |
table_summary_data_files | Gauge | catalog, database, table | Number of data files in the table |
table_summary_equality_delete_files | Gauge | catalog, database, table | Number of equality delete files in the table |
table_summary_position_delete_files | Gauge | catalog, database, table | Number of position delete files in the table |
table_summary_total_files_size | Gauge | catalog, database, table | Total size of files in the table |
table_summary_data_files_size | Gauge | catalog, database, table | Size of data files in the table |
table_summary_equality_delete_files_size | Gauge | catalog, database, table | Size of equality delete files in the table |
table_summary_position_delete_files_size | Gauge | catalog, database, table | Size of position delete files in the table |
table_summary_total_records | Gauge | catalog, database, table | Total records in the table |
table_summary_data_files_records | Gauge | catalog, database, table | Records of data files in the table |
table_summary_equality_delete_files_records | Gauge | catalog, database, table | Records of equality delete files in the table |
table_summary_position_delete_files_records | Gauge | catalog, database, table | Records of position delete files in the table |
table_summary_snapshots | Gauge | catalog, database, table | Number of snapshots in the table |
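
To make the proposed gauges concrete, here is a minimal sketch of how their values could be derived from a list of files. `FileEntry`, `summarize`, and the kind constants are hypothetical names for illustration, not Amoro or Iceberg APIs. The sketch also assumes that the "total" metrics sum over all file kinds, which the table above does not spell out.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical file kinds; chosen so they slot directly into the metric names.
DATA = "data"
EQ_DELETE = "equality_delete"
POS_DELETE = "position_delete"


@dataclass(frozen=True)
class FileEntry:
    """Illustrative stand-in for one file of a table (not a real Amoro class)."""
    kind: str
    size_bytes: int
    record_count: int


def summarize(files, snapshot_count):
    """Build a {metric_name: value} dict for the 13 gauges proposed above."""
    counts, sizes, records = Counter(), Counter(), Counter()
    for f in files:
        counts[f.kind] += 1
        sizes[f.kind] += f.size_bytes
        records[f.kind] += f.record_count

    summary = {}
    for kind in (DATA, EQ_DELETE, POS_DELETE):
        summary[f"table_summary_{kind}_files"] = counts[kind]
        summary[f"table_summary_{kind}_files_size"] = sizes[kind]
        summary[f"table_summary_{kind}_files_records"] = records[kind]

    # Assumption: totals cover every file passed in, regardless of kind.
    summary["table_summary_total_files"] = sum(counts.values())
    summary["table_summary_total_files_size"] = sum(sizes.values())
    summary["table_summary_total_records"] = sum(records.values())
    summary["table_summary_snapshots"] = snapshot_count
    return summary
```

Each value would then be published as a gauge tagged with catalog, database, and table.
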
Please feel free to share your idea about table summary metrics.
Do "table_summary_total_files" and "table_summary_data_files" refer to the quantities in the latest snapshot, or to the quantities of all files existing on HDFS?
Should we consider making the metrics more generalized to support different table formats?
Thanks for driving this. The summary is very helpful for users to understand what is happening under the hood. Apart from the table-level metrics: 1) Could we add metrics for tasks? Task-level metrics would give us more information and help optimize the planning process. 2) Could we add a metric for the committing process?
We've added some metrics like these in our internal version and can contribute them if needed.
> Should we consider making the metrics more generalized to support different table formats?
I am glad to make the metric name more general if it could be used for multiple formats, like table_summary_total_files or table_summary_data_files.
However, I also agree with adding specific metrics for different table formats, such as table_summary_equality_delete_files and table_summary_position_delete_files. We should allow different table formats to have different metric items.
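
One possible shape for this, sketched under assumptions: a shared set of metric names plus per-format additions. The format keys and the `metrics_for` helper are hypothetical and only illustrate the idea of letting each table format declare its own metric items.

```python
# Metric names shared by every table format (taken from the list above).
COMMON_METRICS = (
    "table_summary_total_files",
    "table_summary_data_files",
)

# Hypothetical mapping from table format to its format-specific metrics.
FORMAT_SPECIFIC_METRICS = {
    "iceberg": (
        "table_summary_equality_delete_files",
        "table_summary_position_delete_files",
    ),
}


def metrics_for(table_format):
    """Return the metric names to register for a given table format."""
    return COMMON_METRICS + FORMAT_SPECIFIC_METRICS.get(table_format, ())
```

A format without delete files would simply get the common metrics.
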
> Do "table_summary_total_files" and "table_summary_data_files" refer to the quantities in the latest snapshot, or to the quantities of all files existing on HDFS?
IMO, they refer to the summary of the current snapshot.
> 1) Could we add metrics for tasks? Task-level metrics would give us more information and help optimize the planning process. 2) Could we add a metric for the committing process?
Great suggestion for plan/commit metrics.
Adding metrics related to queries and commits is indeed very valuable. However, it may depend on Amoro implementing the Iceberg MetricsReporter to collect plan/commit information from different engines. This is a planned feature for Amoro, but it may not be included in the current feature.
Hi @zhoujinsong, we can use this AWS solution as a reference.
> Hi @zhoujinsong, we can use this AWS solution as a reference.
Thanks for the feedback and the pointer. We can refer to the metrics in this project when implementing ours:
https://github.com/aws-samples/monitoring-apache-iceberg-table-metadata-layer
In my opinion, we can borrow concepts from data warehousing (DW), such as identifying dimensions and facts.
e.g.
the dimensions such as:
- Which table type? Pure Iceberg, mixed Iceberg, or mixed Hive
- Which data type? Data or metadata
- Is it in use? Files that are expired or still in use
- Which file type for data/metadata? data, eq-del, pos-del, manifest, manifest-list, and so on
- Is the table partitioned, and which partition?
the facts such as:
- number of files
- total size of files
- 90th-percentile file size
- median file size
- max file size
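
As a rough illustration of the dimension/fact idea, the sketch below groups file sizes by a dimension key and computes the facts listed above per group. The function name and input shape are hypothetical. It reads "90% file size" as the 90th percentile and uses the nearest-rank definition, which is an assumption.

```python
import statistics
from collections import defaultdict


def file_size_facts(files):
    """Compute per-dimension facts.

    `files` is an iterable of (dimension_key, size_bytes) pairs, where
    dimension_key is any hashable value, e.g. (table_type, file_type).
    """
    groups = defaultdict(list)
    for key, size in files:
        groups[key].append(size)

    facts = {}
    for key, sizes in groups.items():
        sizes.sort()
        n = len(sizes)
        # Nearest-rank 90th percentile: rank = ceil(0.9 * n), in integer math.
        rank = (9 * n + 9) // 10
        facts[key] = {
            "file_count": n,
            "total_size": sum(sizes),
            "p90_size": sizes[rank - 1],
            "median_size": statistics.median(sizes),
            "max_size": sizes[-1],
        }
    return facts
```

Every extra dimension multiplies the number of reported series, so the set of dimensions is worth keeping small.
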
Here is our experience since 0.4:
(let me bring it from my working computer on Monday....)
Furthermore, referencing the metrics used in Iceberg, as czy006 suggested, might be a good idea, since it would give us a more detailed view of the data inside the table. It needs more consideration when the table format is mixed Hive, though.
We also need to consider the capability of the Prometheus reporter, since we've run into a pitfall here: a large number of metrics on a single page causes performance issues.
On the other hand, I totally agree with klion26: such metrics would give us a better understanding of what is happening while self-optimizing is running (e.g. add them to OptimizerGroupMetric?).