[DSIP-35][Cluster Task Insights]:Add a series of monitoring indicators to reflect the running status of tasks

Question

jiangtaoNuc opened this issue a month ago · comments

Search before asking

I had searched in the DSIP and found no similar DSIP.

Motivation

At present, the monitoring items on the DS homepage are too simple to give clear insight into workflows and task runs, either across the whole cluster or within a single project, and they include no statistics on abnormal situations. This proposal adds analysis indicators to help administrators, data developers, and frontline operations staff analyze and adjust task execution.
There are two dimensions. The first is overall scheduling analysis, aimed at cluster administrators. They need to see the number of projects currently scheduled, the number of online workflows, the number of successful schedules per day, the hourly distribution of scheduled tasks, and how many tasks succeeded after a retry. At the task level, they need to see which tasks run longest and which fail most often. This dimension lets cluster administrators quickly assess the operating status and task distribution of the scheduling system and give improvement suggestions to the developers of each project.
The second dimension is project analysis, aimed at the administrators of a particular project. Projects are usually organized with some logic, such as layering or independent operation per business scenario. Project administrators need to see the project's workflows, its tasks, and its hourly scheduling distribution. At the task level, they need to see which tasks run longest and which fail most often.

Design Detail

The list of planned indicators is shown in the following figure
Numeric indicators are presented as number cards, trends and proportions as line or bar charts, and lists as bar charts.

Compatibility, Deprecation, and Migration Plan

No response

Test Plan

No response

Code of Conduct

I agree to follow this project's Code of Conduct

指北 · Answer 1 · Fri Apr 26 2024 13:22:03 GMT+0800 (China Standard Time)

The first image shows the overall scheduling and monitoring across projects, and the second image shows the monitoring of the overall project. Some things to note:

Avoid processing data separately where possible; compute the results from the existing DS metadata tables at query time.
Some users have a very large number of task instances in production, so too many metrics can make queries slow. Indicator calculations should join as few tables as possible. Trend indicators, especially statistics over multiple days of task instances, need a switch so that users can choose whether to enable them.
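To make the two notes above concrete, here is a minimal sketch of a trend query that reads one table only and is guarded by a switch. Everything in it is an assumption for illustration: the `task_instance` table, its `state` and `start_time` columns, and the `trend_enabled` flag are hypothetical stand-ins, not the actual DS schema or configuration, and SQLite is used only so the sketch can run on its own.

```python
import sqlite3
from typing import List, Optional, Tuple

# Hypothetical single-table aggregate: no joins, and a range filter on
# start_time so an index on that column could be used.
TREND_SQL = """
    SELECT CAST(strftime('%H', start_time) AS INTEGER) AS hour,
           state,
           COUNT(*) AS n
    FROM task_instance
    WHERE start_time >= ? AND start_time < ?
    GROUP BY hour, state
    ORDER BY hour, state
"""


def hourly_status_counts(
    conn: sqlite3.Connection,
    day_start: str,
    day_end: str,
    trend_enabled: bool,
) -> Optional[List[Tuple[int, str, int]]]:
    """Return (hour, state, count) rows, or None when the switch is off."""
    if not trend_enabled:
        return None  # skip the potentially expensive query entirely
    return conn.execute(TREND_SQL, (day_start, day_end)).fetchall()


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (name TEXT, state TEXT, start_time TEXT)")
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?)",
    [
        ("a", "SUCCESS", "2024-05-30 01:15:00"),
        ("b", "SUCCESS", "2024-05-30 01:40:00"),
        ("c", "FAILURE", "2024-05-30 02:05:00"),
        ("d", "SUCCESS", "2024-05-31 00:10:00"),
    ],
)
rows = hourly_status_counts(conn, "2024-05-30 00:00:00", "2024-05-31 00:00:00", True)
disabled = hourly_status_counts(conn, "2024-05-30 00:00:00", "2024-05-31 00:00:00", False)
```

With the switch off, the function returns without touching the database, which is the behavior the second note asks for on large installations.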

XIJIU123 · Answer 2 · Fri May 31 2024 10:30:57 GMT+0800 (China Standard Time)

Numeric values

Number of projects

GET /firstPage/query-project-num

Parameters: none

Example response (the msg value "成功" means "success"):

{
  "code": 0,
  "msg": "成功",
  "data": 25,
  "failed": false,
  "success": true
}
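Every sample response in this comment uses the same envelope (code, msg, data, failed, success), so a client could unwrap it in one place. The sketch below assumes that code 0 together with success true means the call succeeded, as in the sample; other codes are not documented in this thread, and the `unwrap` helper and `ApiError` class are hypothetical names.

```python
import json
from typing import Any


class ApiError(Exception):
    """Raised when the response envelope reports a failure."""


def unwrap(body: str) -> Any:
    """Parse a JSON response body and return its `data` field.

    Assumption: code == 0 and success == true signal success.
    """
    payload = json.loads(body)
    if payload.get("code") != 0 or not payload.get("success", False):
        raise ApiError(f"code={payload.get('code')} msg={payload.get('msg')}")
    return payload["data"]


# The sample response for /firstPage/query-project-num from this comment.
sample = '{"code": 0, "msg": "成功", "data": 25, "failed": false, "success": true}'
project_count = unwrap(sample)
print(project_count)
```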

Total workflows, number of online workflows, number of lost workflows

GET /firstPage/query-process-num

Parameters: none

Example response:

{
    "code": 0,
    "msg": "成功",
    "data": {
        "result": [
            {
                "proc_status": 0,
                "proc_count": 475
            },
            {
                "proc_status": 1,
                "proc_count": 599
            }
        ]
    },
    "failed": false,
    "success": true
}

Field description:

proc_status: 0 indicates online workflows, 1 indicates the total number of workflows, and 2 indicates lost workflows

proc_count: the number of workflows
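A client might turn the result list into named counts using the proc_status codes described above. This is only a sketch: the label strings and the `workflow_counts` helper are hypothetical, and statuses missing from the response (such as 2 in the sample) are simply left out rather than assumed to be zero.

```python
from typing import Dict

# Labels follow the field description in this comment.
PROC_STATUS_LABELS = {0: "online", 1: "total", 2: "lost"}


def workflow_counts(data: dict) -> Dict[str, int]:
    """Map each known proc_status in data['result'] to its proc_count."""
    counts: Dict[str, int] = {}
    for row in data["result"]:
        label = PROC_STATUS_LABELS.get(row["proc_status"])
        if label is None:
            continue  # ignore status codes this comment does not describe
        counts[label] = row["proc_count"]
    return counts


# The `data` object from the sample response.
sample_data = {
    "result": [
        {"proc_status": 0, "proc_count": 475},
        {"proc_status": 1, "proc_count": 599},
    ]
}
counts = workflow_counts(sample_data)
print(counts)
```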

The number of online tasks

GET /firstPage/query-task-num

Parameters: none

Example response:

{
    "code": 0,
    "msg": "成功",
    "data": 5756,
    "failed": false,
    "success": true
}

The number of scheduled tasks, the number of successfully scheduled tasks, and the number of tasks that were successfully scheduled yesterday

GET /firstPage/query-scheduler-num

Parameters: none

Example response:

{
    "code": 0,
    "msg": "成功",
    "data": {
        "finishSchedulerNum": 8749,
        "yesterdaySchedulerNum": 8723,
        "totalSchedulerNum": 13638
    },
    "failed": false,
    "success": true
}

Field description:

finishSchedulerNum: the number of tasks successfully scheduled today

totalSchedulerNum: the number of tasks that should be scheduled

yesterdaySchedulerNum: the number of tasks successfully scheduled yesterday
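For a number card, the three fields could be combined into a completion ratio and a day-over-day difference. The thread does not define either derived value, so treat the formulas and the `scheduler_summary` helper below as one plausible reading rather than the proposal's definition.

```python
from typing import Dict, Optional


def scheduler_summary(data: Dict[str, int]) -> Dict[str, Optional[float]]:
    """Derive a completion ratio and a difference from yesterday's count."""
    total = data["totalSchedulerNum"]
    finished = data["finishSchedulerNum"]
    return {
        # Guard against division by zero when nothing is scheduled.
        "completion_ratio": finished / total if total else None,
        "delta_vs_yesterday": finished - data["yesterdaySchedulerNum"],
    }


# The `data` object from the sample response.
summary = scheduler_summary(
    {"finishSchedulerNum": 8749, "yesterdaySchedulerNum": 8723, "totalSchedulerNum": 13638}
)
print(summary)
```

For the sample values, the ratio is 8749 / 13638, roughly 64%, and the difference from yesterday is 26.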

Lists

Top 5 Tasks by Running Duration

GET /firstPage/query-timeouttask-top

Parameters:

startDate: (required, type: string, non-null) start time.

endDate: (required, type: string, non-null) end time.

Example response:

{
    "code": 0,
    "msg": "成功",
    "data": [
        {
            "name": "dwi_breed_estrus_qs",
            "count": 0,
            "duration": 468
        }
    ],
    "failed": false,
    "success": true
}

Field description:

name: the name of the task

count: the number of executions

duration: time spent (minutes)
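A request URL for the list endpoints can be built by URL-encoding the two required parameters. The thread does not state the date format or the server address, so the "YYYY-MM-DD HH:MM:SS" strings and the `BASE_URL` host below are assumptions, and `top_tasks_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://ds.example.com"  # hypothetical host, not from the thread


def top_tasks_url(endpoint: str, start_date: str, end_date: str) -> str:
    """Build a GET URL for a /firstPage list endpoint with both dates set."""
    if not start_date or not end_date:
        # Both parameters are documented as required and non-null.
        raise ValueError("startDate and endDate are required")
    query = urlencode({"startDate": start_date, "endDate": end_date})
    return f"{BASE_URL}/firstPage/{endpoint}?{query}"


url = top_tasks_url(
    "query-timeouttask-top",
    "2024-05-30 00:00:00",  # assumed date format
    "2024-05-31 00:00:00",
)
print(url)
```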

Top 5 Failed Tasks

GET /firstPage/query-failtask-top

Parameters:

startDate: (required, type: string, non-null) start time.

endDate: (required, type: string, non-null) end time.

Example response:

{
    "code": 0,
    "msg": "成功",
    "data": [
        {
            "name": "dwi_breed_estrus_qs",
            "count": 0,
            "duration": 468
        }
    ],
    "failed": false,
    "success": true
}

Field description:

duration: time spent (minutes)

Trends (to be determined)

Task status trends

GET /firstPage/query-task-status-num

Parameters:

startDate: (required, type: string, non-null) start time.

endDate: (required, type: string, non-null) end time.

projectCode: (required, type: string, may be empty) project code.

Example response:

{
    "code": 0,
    "msg": "成功",
    "data": {
        "x": [
            0,
            "...",
            23
        ],
        "y": [
            {
                "data": [
                    0,
                    "...",
                    0
                ],
                "name": "成功"
            },
            {
                "data": [
                    0,
                    "...",
                    0
                ],
                "name": "失败"
            },
            {
                "data": [
                    0,
                    "...",
                    0
                ],
                "name": "停止"
            },
            {
                "data": [
                    0,
                    "...",
                    0
                ],
                "name": "其他"
            },
            {
                "data": [
                    0,
                    "...",
                    0
                ],
                "name": "全部"
            }
        ]
    },
    "failed": false,
    "success": true
}

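Field description:

x: x-axis coordinates (0 to 23 in the sample, which appear to be hours of the day)

y: y-axis series, one entry per task status

name: the series name; in the sample, 成功 = success, 失败 = failure, 停止 = stopped, 其他 = other, 全部 = all

data: the values of the series, presumably one per x coordinate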
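To feed a table or a chart library, the x/y structure can be pivoted into one row per x coordinate. The sketch assumes every series has exactly one value per x coordinate, which the sample suggests but the thread does not state. The numbers below are made up for illustration because the sample response elides its values, and `series_to_rows` is a hypothetical helper.

```python
from typing import Any, Dict, List


def series_to_rows(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pivot {'x': [...], 'y': [{'name', 'data'}, ...]} into per-x rows."""
    xs = data["x"]
    for series in data["y"]:
        if len(series["data"]) != len(xs):
            raise ValueError(f"series {series['name']!r} does not match x length")
    rows = []
    for i, x in enumerate(xs):
        row: Dict[str, Any] = {"x": x}
        for series in data["y"]:
            row[series["name"]] = series["data"][i]
        rows.append(row)
    return rows


# Illustrative values only; series names are taken from the sample response.
trend = {
    "x": [0, 1, 2],
    "y": [
        {"name": "成功", "data": [3, 0, 5]},
        {"name": "失败", "data": [1, 0, 0]},
    ],
}
rows = series_to_rows(trend)
print(rows)
```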