kubernetes-sigs / scheduler-plugins

Repository for out-of-tree scheduler plugins based on scheduler framework.

Find correct combination with given NUMA node ID

283713406 opened this issue

Area

  • Scheduler
  • Controller
  • Helm Chart
  • Documents

Other components

No response

What happened?

func findSuitableCombination(identifier string, qos v1.PodQOSClass, numaNodes NUMANodeList, resources v1.ResourceList, numaNodesCombination [][]int) ([]int, bool) {
	minAvgDistance := minAvgDistanceInCombinations(numaNodes, numaNodesCombination)
	var (
		minDistanceCombination []int
		// init as max distance
		minDistance float32 = 256
	)
	for _, combination := range numaNodesCombination {
		combinationResources := combineResources(numaNodes, combination)
		resourcesFit := checkResourcesFit(identifier, qos, resources, combinationResources)

		if resourcesFit {
			distance := nodesAvgDistance(numaNodes, combination...)
			if distance == minAvgDistance {
				// return early if we can fit resources into combination and provide minDistance
				return combination, true
			}
			// we don't have to check which combination bitmask has lower value since we are generating them from lowest value
			if distance < minDistance {
				minDistance = distance
				minDistanceCombination = combination
			}
		}
	}

	return minDistanceCombination, false
}

Some servers do not have memory modules installed on every NUMA node. If the total resources of a combination satisfy the Pod request, but one NUMA node in the combination has no memory module, the combination is obviously incorrect. Even so, findSuitableCombination still treats it as available.

What did you expect to happen?

A combination that contains a NUMA node without memory modules should be excluded.

For example, with the following NUMANodeList, the existing code logic selects the combination (0, 1, 5), but it should select (2, 4, 6):

{
	numaNodes: NUMANodeList{
		{
			NUMAID: 0,
			Resources: v1.ResourceList{
				gpuResource:       resource.MustParse("1"),
				v1.ResourceCPU:    *resource.NewQuantity(4, resource.DecimalSI),
				v1.ResourceMemory: resource.MustParse("5Gi"),
			},
			Costs: map[int]int{
				0: 10, 1: 20, 2: 40, 3: 30, 4: 20, 5: 30, 6: 50, 7: 40,
			},
		},
		{
			NUMAID: 3,
			Resources: v1.ResourceList{
				gpuResource:    resource.MustParse("1"),
				v1.ResourceCPU: *resource.NewQuantity(4, resource.DecimalSI),
			},
			Costs: map[int]int{
				0: 30, 1: 40, 2: 20, 3: 10, 4: 30, 5: 20, 6: 40, 7: 50,
			},
		},
		{
			NUMAID: 5,
			Resources: v1.ResourceList{
				gpuResource:    resource.MustParse("1"),
				v1.ResourceCPU: *resource.NewQuantity(4, resource.DecimalSI),
			},
			Costs: map[int]int{
				0: 30, 1: 20, 2: 50, 3: 20, 4: 50, 5: 10, 6: 50, 7: 40,
			},
		},
		{
			NUMAID: 7,
			Resources: v1.ResourceList{
				gpuResource:    resource.MustParse("1"),
				v1.ResourceCPU: *resource.NewQuantity(4, resource.DecimalSI),
			},
			Costs: map[int]int{
				0: 40, 1: 50, 2: 30, 3: 50, 4: 20, 5: 40, 6: 30, 7: 10,
			},
		},
		{
			NUMAID: 1,
			Resources: v1.ResourceList{
				gpuResource:    resource.MustParse("1"),
				v1.ResourceCPU: *resource.NewQuantity(4, resource.DecimalSI),
			},
			Costs: map[int]int{
				0: 20, 1: 10, 2: 30, 3: 40, 4: 50, 5: 20, 6: 40, 7: 50,
			},
		},
		{
			NUMAID: 6,
			Resources: v1.ResourceList{
				gpuResource:       resource.MustParse("1"),
				v1.ResourceCPU:    *resource.NewQuantity(4, resource.DecimalSI),
				v1.ResourceMemory: resource.MustParse("5Gi"),
			},
			Costs: map[int]int{
				0: 50, 1: 40, 2: 20, 3: 40, 4: 30, 5: 50, 6: 10, 7: 30,
			},
		},
		{
			NUMAID: 2,
			Resources: v1.ResourceList{
				gpuResource:       resource.MustParse("1"),
				v1.ResourceCPU:    *resource.NewQuantity(4, resource.DecimalSI),
				v1.ResourceMemory: resource.MustParse("5Gi"),
			},
			Costs: map[int]int{
				0: 40, 1: 30, 2: 10, 3: 20, 4: 40, 5: 50, 6: 20, 7: 30,
			},
		},
		{
			NUMAID: 4,
			Resources: v1.ResourceList{
				gpuResource:       resource.MustParse("1"),
				v1.ResourceCPU:    *resource.NewQuantity(4, resource.DecimalSI),
				v1.ResourceMemory: resource.MustParse("5Gi"),
			},
			Costs: map[int]int{
				0: 20, 1: 50, 2: 40, 3: 30, 4: 10, 5: 50, 6: 30, 7: 20,
			},
		},
	},
	podResources: v1.ResourceList{
		v1.ResourceCPU:    *resource.NewQuantity(3, resource.DecimalSI),
		v1.ResourceMemory: resource.MustParse("2Gi"),
		gpuResource:       resource.MustParse("3"),
	},
}

How can we reproduce it (as minimally and precisely as possible)?

No response

Anything else we need to know?

No response

Kubernetes version

[root@master1 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.13", GitCommit:"49433308be5b958856b6949df02b716e0a7cf0a3", GitTreeState:"clean", BuildDate:"2023-04-12T12:15:50Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/arm64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.13", GitCommit:"49433308be5b958856b6949df02b716e0a7cf0a3", GitTreeState:"clean", BuildDate:"2023-04-12T12:08:36Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/arm64"}

Scheduler Plugins version

master

I agree this is a bug. In our initial design we kinda implicitly assumed that all NUMA nodes are consistent and have CPU+memory. More complex and unequal scenarios are indeed possible.