apache / amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.

Home Page: https://amoro.apache.org/

[Feature]: Add table summary metrics

zhoujinsong opened this issue · comments

Description

Add summary metrics for each table to help users better understand the details of their tables.

Use case/motivation

Amoro already has a Metric System that exposes various metrics externally.

However, it is also crucial to have table-level metrics for important aspects such as data size, file count, and record count. By adding these metrics, we aim to help users better understand the state of their tables.

Describe the solution

We should define proper table summary metrics first.
AMS refreshes the table runtime every 3 minutes (by default), so it can update the table summary metrics during that refresh.
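
As a rough illustration of the idea (not Amoro's actual code), the refresh step could overwrite a set of gauge values each time it runs. Everything below is hypothetical: the class name `TableSummaryRefresher`, the `summarySupplier` hook, and the way gauges are stored are assumptions made only for this sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch: gauges updated in the same periodic pass that refreshes a table. */
final class TableSummaryRefresher {
  // Latest gauge values, keyed by metric name (e.g. "table_summary_total_files").
  private final Map<String, Long> gauges = new ConcurrentHashMap<>();
  // Hypothetical hook that computes the summary of the table's current state.
  private final Supplier<Map<String, Long>> summarySupplier;

  TableSummaryRefresher(Supplier<Map<String, Long>> summarySupplier) {
    this.summarySupplier = summarySupplier;
  }

  /** One refresh pass: recompute the summary and overwrite the gauge values. */
  void refreshOnce() {
    gauges.putAll(summarySupplier.get());
  }

  /** Reads a gauge; returns 0 if the metric has not been populated yet. */
  long gauge(String name) {
    return gauges.getOrDefault(name, 0L);
  }

  /** Schedules refreshOnce() with a fixed delay; the caller must shut the executor down. */
  ScheduledExecutorService start(long interval, TimeUnit unit) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleWithFixedDelay(this::refreshOnce, 0, interval, unit);
    return executor;
  }

  public static void main(String[] args) {
    TableSummaryRefresher refresher =
        new TableSummaryRefresher(() -> Map.of("table_summary_total_files", 42L));
    refresher.refreshOnce();
    System.out.println(refresher.gauge("table_summary_total_files"));
  }
}
```

Calling `start(3, TimeUnit.MINUTES)` would mirror the default interval mentioned above. In a real integration the gauges would be registered with the existing Metric System rather than kept in a plain map.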

Subtasks

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Here are some table summary metrics for Iceberg tables:

| Metric Name | Type | Tags | Description |
| --- | --- | --- | --- |
| table_summary_total_files | Gauge | catalog, database, table | Total number of files in the table |
| table_summary_data_files | Gauge | catalog, database, table | Number of data files in the table |
| table_summary_equality_delete_files | Gauge | catalog, database, table | Number of equality delete files in the table |
| table_summary_position_delete_files | Gauge | catalog, database, table | Number of position delete files in the table |
| table_summary_total_files_size | Gauge | catalog, database, table | Total size of files in the table |
| table_summary_data_files_size | Gauge | catalog, database, table | Size of data files in the table |
| table_summary_equality_delete_files_size | Gauge | catalog, database, table | Size of equality delete files in the table |
| table_summary_position_delete_files_size | Gauge | catalog, database, table | Size of position delete files in the table |
| table_summary_total_records | Gauge | catalog, database, table | Total records in the table |
| table_summary_data_files_records | Gauge | catalog, database, table | Records of data files in the table |
| table_summary_equality_delete_files_records | Gauge | catalog, database, table | Records of equality delete files in the table |
| table_summary_position_delete_files_records | Gauge | catalog, database, table | Records of position delete files in the table |
| table_summary_snapshots | Gauge | catalog, database, table | Number of snapshots in the table |
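
To make the file, size, and record metrics concrete, here is a minimal sketch of how they could be derived from a list of files. It is only an illustration: `TableSummaryMetrics`, `Content`, and `FileInfo` are hypothetical names, not Amoro or Iceberg classes, and it assumes that `table_summary_total_records` sums the records of all file types, which the proposal does not actually specify. `table_summary_snapshots` is left out because it is not derived from files.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: aggregates per-file info into the proposed summary metrics. */
final class TableSummaryMetrics {
  /** File categories used in the proposed metric names. */
  enum Content {
    DATA("data_files"),
    EQUALITY_DELETE("equality_delete_files"),
    POSITION_DELETE("position_delete_files");

    final String label;

    Content(String label) {
      this.label = label;
    }
  }

  /** Hypothetical stand-in for one file of the table. */
  static final class FileInfo {
    final Content content;
    final long sizeBytes;
    final long records;

    FileInfo(Content content, long sizeBytes, long records) {
      this.content = content;
      this.sizeBytes = sizeBytes;
      this.records = records;
    }
  }

  static Map<String, Long> summarize(List<FileInfo> files) {
    Map<String, Long> m = new LinkedHashMap<>();
    // Start every metric at 0 so that empty categories are still reported.
    m.put("table_summary_total_files", 0L);
    m.put("table_summary_total_files_size", 0L);
    m.put("table_summary_total_records", 0L);
    for (Content c : Content.values()) {
      m.put("table_summary_" + c.label, 0L);
      m.put("table_summary_" + c.label + "_size", 0L);
      m.put("table_summary_" + c.label + "_records", 0L);
    }
    for (FileInfo f : files) {
      String prefix = "table_summary_" + f.content.label;
      m.merge(prefix, 1L, Long::sum);
      m.merge(prefix + "_size", f.sizeBytes, Long::sum);
      m.merge(prefix + "_records", f.records, Long::sum);
      m.merge("table_summary_total_files", 1L, Long::sum);
      m.merge("table_summary_total_files_size", f.sizeBytes, Long::sum);
      // Assumption: total records counts records of every file type.
      m.merge("table_summary_total_records", f.records, Long::sum);
    }
    return m;
  }

  public static void main(String[] args) {
    List<FileInfo> files = Arrays.asList(
        new FileInfo(Content.DATA, 100, 10),
        new FileInfo(Content.DATA, 200, 20),
        new FileInfo(Content.EQUALITY_DELETE, 50, 5));
    summarize(files).forEach((name, value) -> System.out.println(name + " = " + value));
  }
}
```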

Please feel free to share your idea about table summary metrics.

Do "table_summary_total_files" and "table_summary_data_files" count the files in the latest snapshot, or all the files that exist on HDFS?

Should we consider making the metrics more general so that they support different table formats?

Thanks for driving this. The summary is very helpful for users to understand what is going on underneath. Apart from the table-level metrics: 1) could we add metrics for the tasks? Task-level metrics would give us more information and help optimize the planning process; 2) could we add a metric for the committing process?

We've already added some of the metrics above in our internal version and can contribute them if needed.

Should we consider making the metrics more general so that they support different table formats?

I am glad to make the metric name more general if it could be used for multiple formats, like table_summary_total_files or table_summary_data_files.

However, I also agree with adding specific metrics for different table formats, such as table_summary_equality_delete_files and table_summary_position_delete_files. We should allow different table formats to have different metric items.

Do "table_summary_total_files" and "table_summary_data_files" count the files in the latest snapshot, or all the files that exist on HDFS?

IMO, they refer to the summary of the current snapshot.

1) could we add metrics for the tasks? Task-level metrics would give us more information and help optimize the planning process; 2) could we add a metric for the committing process?

Great suggestion for plan/commit metrics.

Adding metrics related to queries and commits is indeed very valuable. However, it may require Amoro to implement the Iceberg MetricsReporter in order to collect plan/commit information from different engines. This is a planned feature for Amoro, but it may not be included in this current feature.

Hi @zhoujinsong, we can use this AWS solution as a reference.

Hi @zhoujinsong, we can use this AWS solution as a reference.

Thanks for the feedback and for finding this. We can refer to the metrics in this project when implementing ours:
https://github.com/aws-samples/monitoring-apache-iceberg-table-metadata-layer

In my opinion, we can borrow concepts from data warehousing (DW), such as identifying dimensions and facts.
For example, the dimensions could be:

  • which table type? pure Iceberg, mixed Iceberg, mixed Hive
  • which data type? data or metadata
  • is it in use? expired files or files in use?
  • which file type for data/metadata? data/eq-del/pos-del/manifest/manifest-list and so on...
  • is the table partitioned, and which partition?

The facts could be:

  • number of files
  • total size of files
  • 90th-percentile file size
  • median file size
  • max file size
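
As a small sketch of how the facts above could be computed for one combination of dimensions, the following takes the file sizes of a group and derives count, total, median, 90th percentile, and max. The class `FileSizeFacts` is hypothetical, and the nearest-rank percentile definition is an assumption chosen for simplicity; other percentile definitions would give slightly different values.

```java
import java.util.Arrays;

/** Hypothetical sketch: size facts for the files of one dimension combination. */
final class FileSizeFacts {
  final long count;
  final long totalBytes;
  final long medianBytes;
  final long p90Bytes;
  final long maxBytes;

  private FileSizeFacts(long count, long totalBytes, long medianBytes, long p90Bytes, long maxBytes) {
    this.count = count;
    this.totalBytes = totalBytes;
    this.medianBytes = medianBytes;
    this.p90Bytes = p90Bytes;
    this.maxBytes = maxBytes;
  }

  /** Nearest-rank percentile on an ascending array; returns 0 for an empty array. */
  static long percentile(long[] sortedSizes, int p) {
    if (sortedSizes.length == 0) {
      return 0L;
    }
    // rank = ceil(p / 100 * n), computed with integer arithmetic to avoid rounding issues.
    long rank = ((long) p * sortedSizes.length + 99) / 100;
    return sortedSizes[(int) Math.max(rank, 1) - 1];
  }

  static FileSizeFacts of(long[] sizes) {
    long[] sorted = sizes.clone(); // do not reorder the caller's array
    Arrays.sort(sorted);
    long total = 0;
    for (long size : sorted) {
      total += size;
    }
    long max = sorted.length == 0 ? 0L : sorted[sorted.length - 1];
    return new FileSizeFacts(
        sorted.length, total, percentile(sorted, 50), percentile(sorted, 90), max);
  }

  public static void main(String[] args) {
    FileSizeFacts facts = of(new long[] {100, 10, 90, 20, 80, 30, 70, 40, 60, 50});
    System.out.println("count=" + facts.count + " total=" + facts.totalBytes
        + " median=" + facts.medianBytes + " p90=" + facts.p90Bytes + " max=" + facts.maxBytes);
  }
}
```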

Here is our experience since 0.4:
(let me bring it from my working computer on Monday....)

Furthermore, referencing the metrics used in Iceberg, as czy006 suggested, might be a good idea, since it would give us a more detailed view of the data inside the table. It needs more consideration when the table format is mixed Hive, though.

We also need to consider the capacity of the Prometheus reporter, since we've already hit a pitfall here: a large number of metrics on a single page causes performance issues.
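
A quick back-of-the-envelope sketch of why this matters: in Prometheus, every distinct combination of metric name and tag values is its own time series, so the page size grows multiplicatively with tables, metrics, and any extra dimensions. The helper below is hypothetical, and the table count and dimension size in the example are made-up inputs, not measurements.

```java
/** Hypothetical sketch: estimates how many series one scrape page would carry. */
final class SeriesEstimate {
  /**
   * Series = tables x metrics per table x the number of values of each extra dimension.
   * Throws ArithmeticException if the product overflows a long.
   */
  static long seriesCount(long tables, int metricsPerTable, int... extraDimensionSizes) {
    long series = Math.multiplyExact(tables, (long) metricsPerTable);
    for (int size : extraDimensionSizes) {
      series = Math.multiplyExact(series, (long) size);
    }
    return series;
  }

  public static void main(String[] args) {
    // The 13 metrics proposed in this issue, for a made-up catalog of 10,000 tables.
    System.out.println(seriesCount(10_000, 13));
    // The same, if each metric were additionally split by a 5-valued file-type tag.
    System.out.println(seriesCount(10_000, 13, 5));
  }
}
```

This suggests being careful about which dimensions become tags and which stay as separate, coarser metrics.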

On the other hand, I totally agree with klion26: task-level metrics would give us a better understanding of what is happening while self-optimizing is running (e.g. add them to OptimizerGroupMetric?).