[DSIP-35][Cluster Task Insights]:Add a series of monitoring indicators to reflect the running status of tasks
jiangtaoNuc opened this issue · comments
Search before asking
- I had searched in the DSIP and found no similar DSIP.
Motivation
At present, the monitoring items on the DS homepage are too simple to give clear insight into the status of workflows and tasks, either overall or per project, and they provide no statistics on abnormal situations. The plan is to add analysis indicators that help administrators, data developers, and frontline operations staff analyze and adjust task execution.
There are two dimensions. The first is overall scheduling analysis, aimed at cluster administrators. They need to track the number of projects currently scheduled, the number of online workflows, the number of successful schedules per day, the hourly distribution of scheduled tasks, and how many tasks succeeded after a retry. At the task level, they need to know which tasks run longest and which fail most often. The purpose of this dimension is to let cluster administrators quickly assess the operating status and task distribution of the scheduling system and give improvement suggestions to the developers of each project.
The second dimension is project analysis, aimed at the administrators of a given project. Projects are usually organized with some logic, such as layering or independent operation per business scenario. Project administrators need to track the project's workflows, tasks, hourly scheduling distribution, and so on. At the task level, the focus is on which tasks run longest and which fail most often.
Design Detail
The list of planned indicators is shown in the following figure
In the current development, numeric indicators are presented as number cards, trends and proportions are planned as line or bar charts, and lists are presented as bar charts.
Compatibility, Deprecation, and Migration Plan
No response
Test Plan
No response
Code of Conduct
- I agree to follow this project's Code of Conduct
The first image shows overall scheduling monitoring across all projects, and the second image shows monitoring for a single project. Some things to note:
- Avoid processing data separately where possible; compute the results at query time from the existing DS metadata tables.
- Some users have a very large number of task instances in production, so too many metrics can make queries slow. Indicator calculations should avoid joining too many tables, and trend indicators, especially multi-day task instance statistics, need switches so that users can choose whether to enable them.
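One way to read the second note is as two settings: a switch for trend indicators and a cap on how many days a trend query may cover. The sketch below is only illustrative. `InsightSettings`, `trend_enabled`, `max_trend_days`, and `resolve_trend_window` are hypothetical names, not existing DolphinScheduler configuration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class InsightSettings:
    # Hypothetical switches; not existing DolphinScheduler configuration keys.
    trend_enabled: bool = False  # trend indicators stay off unless the user opts in
    max_trend_days: int = 7      # upper bound on the window a trend query may scan


def resolve_trend_window(
    settings: InsightSettings, start: date, end: date
) -> Optional[Tuple[date, date]]:
    """Return the date window to query, or None when trends are disabled."""
    if not settings.trend_enabled:
        return None
    if end < start:
        raise ValueError("end must not be before start")
    # Clamp the window so a multi-day request cannot scan unbounded task instances.
    earliest = end - timedelta(days=settings.max_trend_days - 1)
    return (max(start, earliest), end)
```

With trends disabled the function returns `None`, so the caller can skip the multi-day query entirely instead of running it and discarding the result.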
numeric value
Number of projects
GET /firstPage/query-project-num
Parameter: empty
Return value example:
{
  "code": 0,
  "msg": "成功",
  "data": 25,
  "failed": false,
  "success": true
}
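All return value examples in this comment share the same envelope: `code`, `msg` (here "成功", meaning "success"), `data`, `failed`, and `success`. A client could unwrap it once and reuse the result for every endpoint. The sketch below assumes that a non-zero `code` or a false `success` means failure, which the issue does not state explicitly; `ApiError` and `unwrap` are hypothetical names.

```python
import json
from typing import Any


class ApiError(Exception):
    """Raised when the response envelope reports a failure."""


def unwrap(body: str) -> Any:
    """Parse a response body and return its "data" field on success."""
    envelope = json.loads(body)
    # Assumption: code 0 together with success=true marks a successful call.
    if envelope.get("code") != 0 or not envelope.get("success", False):
        raise ApiError(f"code={envelope.get('code')} msg={envelope.get('msg')}")
    return envelope["data"]


sample = '{"code": 0, "msg": "成功", "data": 25, "failed": false, "success": true}'
project_count = unwrap(sample)
```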
Total workflows, number of online workflows, number of lost workflows
GET /firstPage/query-process-num
Parameter: empty
Return value example:
{
  "code": 0,
  "msg": "成功",
  "data": {
    "result": [
      {
        "proc_status": 0,
        "proc_count": 475
      },
      {
        "proc_status": 1,
        "proc_count": 599
      }
    ]
  },
  "failed": false,
  "success": true
}
Parameter description:
proc_status: 0 indicates online workflows, 1 indicates total workflows, and 2 indicates lost workflows
proc_count: the number of workflows
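Because the counts come back as a list keyed by `proc_status`, a consumer probably wants them as named values. The following is a minimal sketch: `STATUS_LABELS` and `workflow_counts` are hypothetical names, and treating a status that is missing from the response (status 2 in the example above) as zero is an assumption.

```python
from typing import Dict, List

# Labels follow the parameter description above.
STATUS_LABELS = {0: "online", 1: "total", 2: "lost"}


def workflow_counts(result: List[dict]) -> Dict[str, int]:
    """Map each proc_status entry to its label; absent statuses count as 0."""
    counts = {label: 0 for label in STATUS_LABELS.values()}
    for row in result:
        label = STATUS_LABELS.get(row["proc_status"])
        if label is not None:  # ignore status codes the description does not define
            counts[label] = row["proc_count"]
    return counts


counts = workflow_counts([
    {"proc_status": 0, "proc_count": 475},
    {"proc_status": 1, "proc_count": 599},
])
```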
The number of online tasks
GET /firstPage/query-task-num
Parameter: empty
Return value example:
{
  "code": 0,
  "msg": "成功",
  "data": 5756,
  "failed": false,
  "success": true
}
The number of scheduled tasks, the number of successfully scheduled tasks, and the number of tasks that were successfully scheduled yesterday
GET /firstPage/query-scheduler-num
Parameter: empty
Return value example:
{
  "code": 0,
  "msg": "成功",
  "data": {
    "finishSchedulerNum": 8749,
    "yesterdaySchedulerNum": 8723,
    "totalSchedulerNum": 13638
  },
  "failed": false,
  "success": true
}
Parameter description:
finishSchedulerNum: the number of tasks successfully scheduled today
totalSchedulerNum: the number of tasks that should be scheduled
yesterdaySchedulerNum: the number of tasks successfully scheduled yesterday
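A number card could combine these three values into a completion ratio and a comparison with yesterday. These derived figures are not part of the proposal; the sketch below, with the hypothetical helper `scheduling_summary`, only shows how the fields relate.

```python
def scheduling_summary(data: dict) -> dict:
    """Derive a completion ratio and a day-over-day delta from the response data."""
    finished = data["finishSchedulerNum"]
    total = data["totalSchedulerNum"]
    yesterday = data["yesterdaySchedulerNum"]
    return {
        # Guard against a day with nothing scheduled.
        "completion_ratio": finished / total if total else 0.0,
        "delta_vs_yesterday": finished - yesterday,
    }


summary = scheduling_summary(
    {"finishSchedulerNum": 8749, "yesterdaySchedulerNum": 8723, "totalSchedulerNum": 13638}
)
```

With the sample values, the completion ratio is roughly 64% and today is 26 successful schedules ahead of yesterday.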
List
Top 5 Tasks by Running Duration
GET /firstPage/query-timeouttask-top
Parameter:
startDate: (required, type: string, non-null) start time.
endDate: (required, type: string, non-null) end time.
Return value example:
{
  "code": 0,
  "msg": "成功",
  "data": [
    {
      "name": "dwi_breed_estrus_qs",
      "count": 0,
      "duration": 468
    }
  ],
  "failed": false,
  "success": true
}
Parameter description:
name: the name of the task
count: the number of executions
duration: time spent (minutes)
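The two list endpoints take `startDate` and `endDate` as strings. A caller might build the request URL as sketched below. The timestamp format, the base URL, and the helper name `top_query_url` are assumptions; the issue only says that both parameters are required, non-null strings.

```python
from datetime import datetime
from urllib.parse import urlencode

# Assumed timestamp format; the issue does not specify one.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def top_query_url(base_url: str, path: str, start: datetime, end: datetime) -> str:
    """Build a GET URL carrying the required startDate and endDate parameters."""
    if end < start:
        raise ValueError("endDate must not be before startDate")
    query = urlencode({
        "startDate": start.strftime(DATE_FORMAT),
        "endDate": end.strftime(DATE_FORMAT),
    })
    return f"{base_url.rstrip('/')}{path}?{query}"


url = top_query_url(
    "http://localhost:12345/dolphinscheduler",  # hypothetical base URL
    "/firstPage/query-timeouttask-top",
    datetime(2024, 1, 1),
    datetime(2024, 1, 2),
)
```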
Top 5 Failed Tasks
GET /firstPage/query-failtask-top
Parameter:
startDate: (required, type: string, non-null) start time.
endDate: (required, type: string, non-null) end time.
Return value example:
{
  "code": 0,
  "msg": "成功",
  "data": [
    {
      "name": "dwi_breed_estrus_qs",
      "count": 0,
      "duration": 468
    }
  ],
  "failed": false,
  "success": true
}
Parameter description:
name: the name of the task
count: the number of executions
duration: time spent (minutes)
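The ranking behind a "top 5 failed tasks" list amounts to counting failures per task name and keeping the largest counts. The notes above ask for this to be computed at query time from the existing metadata tables, so the in-memory sketch below only illustrates the ranking logic. The `(task_name, state)` record shape, the `"FAILURE"` label, and the function name `top_failed` are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_failed(instances: Iterable[Tuple[str, str]], n: int = 5) -> List[dict]:
    """Rank task names by how many of their instances ended in the failure state."""
    # Each instance is a (task_name, state) pair; "FAILURE" is an assumed state label.
    failures = Counter(name for name, state in instances if state == "FAILURE")
    return [{"name": name, "count": count} for name, count in failures.most_common(n)]


ranking = top_failed([
    ("task_a", "FAILURE"), ("task_b", "SUCCESS"),
    ("task_a", "FAILURE"), ("task_c", "FAILURE"),
])
```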
Trends (to be determined)
Task status trends
GET /firstPage/query-task-status-num
Parameter:
startDate: (required, type: string, non-null) start time.
endDate: (required, type: string, non-null) end time.
projectCode: (required, type: string, may be empty) project code.
Return value example:
{
  "code": 0,
  "msg": "成功",
  "data": {
    "x": [0, "...", 23],
    "y": [
      {
        "data": [0, "...", 0],
        "name": "成功"
      },
      {
        "data": [0, "...", 0],
        "name": "失败"
      },
      {
        "data": [0, "...", 0],
        "name": "停止"
      },
      {
        "data": [0, "...", 0],
        "name": "其他"
      },
      {
        "data": [0, "...", 0],
        "name": "全部"
      }
    ]
  },
  "failed": false,
  "success": true
}
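Parameter description:
x: x-axis coordinates (0 to 23 in the example, which suggests hours of the day)
y: y-axis series, one per task state
data: the values of a series, one per x coordinate
name: task state type (成功 = success, 失败 = failure, 停止 = stopped, 其他 = other, 全部 = all)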
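The `x`/`y` layout is convenient for charting libraries, but a table or export usually needs one row per x value. A minimal sketch of that conversion follows. The helper name `trend_rows` is hypothetical, and the sample values are illustrative because the example in the issue elides its data with "...".

```python
from typing import Dict, List


def trend_rows(data: dict) -> List[Dict[str, int]]:
    """Turn the x/y series layout into one row per x value."""
    xs = data["x"]
    for series in data["y"]:
        # Every series must supply exactly one value per x coordinate.
        if len(series["data"]) != len(xs):
            raise ValueError(f"series {series['name']!r} does not match the x axis")
    return [
        {"x": x, **{series["name"]: series["data"][i] for series in data["y"]}}
        for i, x in enumerate(xs)
    ]


# Illustrative values only; they do not come from the issue.
rows = trend_rows({
    "x": [0, 1, 2],
    "y": [
        {"name": "成功", "data": [4, 0, 7]},
        {"name": "失败", "data": [1, 0, 2]},
    ],
})
```

The length check catches a series that is shorter or longer than the x axis before it can produce misaligned rows.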