LeaderWorkerSet integration when LeaderWorkerSet created with RestartPolicy should only recreate the deleted pod when policy is set to NoneRestartPolicy #9755
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
/assign
Please also describe what the problem causing the flakes was
b9a01a2 to 14393b0 Compare
Added an explanation to the description
… Policy should recreate pods when policy is set to RecreateGroupOnPodRestart kubernetes-sigs#9751
14393b0 to ca5d46b Compare
/retest-required
I still don't understand what the issue with the previous code was. Sure, I'm +1 on improving the test code to use sets rather than counters for ease of investigation, but I don't see where the bug was. Could you maybe expand the explanation by tracking the events and correlating them with the kubelet logs? That would be very helpful, I think.
@vladikkuzn Could you please also run an experiment with 200 repeats of this test to make sure it’s fixed?
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: vladikkuzn. The full list of commands accepted by this bot can be found here. Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
3ce656d to 7b996dd Compare
@mimowo There are no logs showing that the LWS controller created a replacement. If the LWS controller never created a replacement (so the test’s recreatedCount stayed 0), we cannot distinguish between those cases from the current artifact set. Hence we need to make the test robust by not relying on the pod name to detect “recreated” (e.g. use UID + worker index instead), so that a replacement with the same or a different name is still counted correctly.
Interesting, maybe it didn't create the replacement Pod because the previous one hadn't finished terminating in time?
Ah, I see, indeed we don't wait for the Pod's termination or deletion. Let's introduce this extra step. Sometimes termination of pods can take long.
7b996dd to 8a804fa Compare
@vladikkuzn: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
})

var podToDelete *corev1.Pod
originalPodUIDs := make(map[string]types.UID)
Could you switch to sets.New[types.UID]() here as well?
}, client.InNamespace(lws.Namespace))).Should(gomega.Succeed())
g.Expect(pods.Items).To(gomega.HaveLen(3))

remainingOriginalUIDs := sets.New[types.UID]()
The sets package already has Intersection and Difference. So you can replace the entire loop + counter with:
g.Expect(pods.Items).To(gomega.HaveLen(3))
currentUIDs := sets.New[types.UID]()
for _, pod := range pods.Items {
currentUIDs.Insert(pod.UID)
}
survived := originalPodUIDSet.Intersection(currentUIDs)
newUIDs := currentUIDs.Difference(originalPodUIDSet)
g.Expect(survived).To(gomega.HaveLen(2), "2 original pods should survive")
g.Expect(survived.Has(deletedPodUID)).To(gomega.BeFalse(), "deleted pod should not survive")
g.Expect(newUIDs).To(gomega.HaveLen(1), "exactly 1 replacement pod")
var replacement *corev1.Pod
for i := range pods.Items {
if newUIDs.Has(pods.Items[i].UID) {
replacement = &pods.Items[i]
break
}
}
g.Expect(replacement).NotTo(gomega.BeNil())
g.Expect(replacement.Labels[leaderworkersetv1.WorkerIndexLabelKey]).To(
gomega.Equal(deletedWorkerIndex),
"only the deleted worker slot should be recreated",
)

gomega.Eventually(func(g gomega.Gomega) {
	pod := &corev1.Pod{}
	err := k8sClient.Get(ctx, deletedPodKey, pod)
	g.Expect(apierrors.IsNotFound(err)).To(gomega.BeTrue(), "deleted pod should be removed from the API")
IsNotFound may not always work here — LWS uses a StatefulSet which recreates the pod with the same name, and the replacement could be created before we even check for deletion. So Get(ctx, deletedPodKey, pod) will always find a pod and never return NotFound.
Maybe checking for a UID change would be more reliable since that confirms recreation regardless of timing?
ginkgo.By("Wait for the deleted pod to be recreated with a new UID", func() {
gomega.Eventually(func(g gomega.Gomega) {
pod := &corev1.Pod{}
g.Expect(k8sClient.Get(ctx, deletedPodKey, pod)).To(gomega.Succeed())
g.Expect(pod.UID).NotTo(gomega.Equal(deletedPodUID),
"pod should be recreated with a new UID")
}, util.LongTimeout, util.Interval).Should(gomega.Succeed())
})
What type of PR is this?
/kind flake
What this PR does / why we need it:
The old assertion identified the recreated pod by comparing the current pods against an original map keyed by pod name. That is brittle because, after deletion and recreation, the replacement may not match the “same name, new UID” pattern at the moment the test observes it, so the check can temporarily see no recreated pod and incorrectly conclude that nothing was recreated.
I changed the test to assert the actual behavior more directly:
- exactly 3 pods exist again,
- exactly 1 pod has a new UID,
- 2 original pod UIDs are still present,
- the newly created pod corresponds to the deleted worker index.
So the problem was: the test was using an unstable way to detect pod replacement, which made it report 0 recreated pods even when the system behavior was valid.
Which issue(s) this PR fixes:
Fixes #9750
Special notes for your reviewer:
Does this PR introduce a user-facing change?