Commit 5710105

Fix flakiness in the E2E test e2e_multi_cluster_replica_set_scale_up (#231)
# Summary

The E2E test `e2e_multi_cluster_replica_set_scale_up` has been flaky, and @lucian-tosa suggested that we fix it. It was failing while waiting for the statefulsets (STSs) of a multi-cluster MongoDB deployment to reach the correct number of ready replicas. The problem was that sometimes, after the `MongoDBMultiCluster (mdbmc)` resource reached the Running phase (which implies that all STSs are ready), some of the STSs dropped back into a not-ready state. When the test saw the `mdbmc` resource Running, it asserted that each STS had the correct number of ready replicas, but because of that ready-to-not-ready transition the assertion sometimes ran against an STS that momentarily reported fewer ready replicas, and the test failed.

An STS transitions from ready back to not ready because the pod it manages does the same. Looking into it further, we found that this happens because the pod's readiness probe occasionally fails momentarily: the pod becomes ready, is marked not ready when the probe fails, and then eventually becomes ready again. This is documented in much more detail in the document [here](https://jira.mongodb.org/browse/CLOUDP-329231).

The ideal fix would be to figure out why the readiness probe fails and address that. This PR instead applies a workaround: the tests now wait for the STSs to reach the correct number of ready replicas instead of asserting it once (a sketch of the polling pattern is shown below).

Jira ticket: https://jira.mongodb.org/browse/CLOUDP-329422

## Proof of Work

Ran the test `e2e_multi_cluster_replica_set_scale_up` manually and locally to make sure it passes consistently. I am not able to reproduce the flakiness now.

## Checklist

- [x] Have you linked a jira ticket and/or is the ticket in the title?
- [x] Have you checked whether your jira ticket required DOCSP changes?
- [x] Have you checked for release_note changes?
1 parent 16e1d78 commit 5710105
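For context, the workaround in the diff below relies on `kubetester.wait_until` polling a predicate until it returns `True` or the timeout expires. Below is a minimal sketch of the same pattern, pulled out into a hypothetical `wait_for_ready_replicas` helper that is not part of this change; it assumes, as the diff does, that `read_statefulsets` accepts a list of member-cluster clients and returns a mapping keyed by cluster name.

```python
import kubetester


def wait_for_ready_replicas(mongodb_multi, cluster_client, expected: int, cluster_label: str) -> None:
    """Hypothetical helper: poll until the STS in one member cluster reports the expected ready replicas."""

    def fn() -> bool:
        # Re-read the statefulset on every poll so a momentary not-ready blip
        # (a transient readiness-probe failure) does not fail the test outright.
        statefulsets = mongodb_multi.read_statefulsets([cluster_client])
        return statefulsets[cluster_client.cluster_name].status.ready_replicas == expected

    kubetester.wait_until(
        fn,
        timeout=60,
        message=f"Verifying sts has correct number of replicas in {cluster_label}",
    )
```

With a helper like this, each test could call it once per member cluster; the diff below keeps the equivalent inline `def fn()` blocks instead.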

File tree

1 file changed: +53 -22 lines changed


docker/mongodb-kubernetes-tests/tests/multicluster/multi_cluster_replica_set_scale_up.py

Lines changed: 53 additions & 22 deletions
@@ -1,6 +1,7 @@
 from typing import List

 import kubernetes
+import kubetester
 import pytest
 from kubetester.automation_config_tester import AutomationConfigTester
 from kubetester.certs_mongodb_multi import create_multi_cluster_mongodb_tls_certs
@@ -80,18 +81,30 @@ def test_statefulsets_have_been_created_correctly(
     mongodb_multi: MongoDBMulti,
     member_cluster_clients: List[MultiClusterClient],
 ):
-    statefulsets = mongodb_multi.read_statefulsets(member_cluster_clients)
-    cluster_one_client = member_cluster_clients[0]
-    cluster_one_sts = statefulsets[cluster_one_client.cluster_name]
-    assert cluster_one_sts.status.ready_replicas == 1
+    # Even though the previous test already verified that the MongoDBMultiCluster resource is in the Running phase
+    # (which implies that all STSs are ready), asserting the expected number of ready replicas for each STS right away
+    # makes the test flaky because of the issue described in https://jira.mongodb.org/browse/CLOUDP-329231. That's why
+    # we wait for each STS to reach the expected number of replicas. Revert this once the ticket above is properly fixed.
+    def fn():
+        cluster_one_client = member_cluster_clients[0]
+        cluster_one_statefulsets = mongodb_multi.read_statefulsets([cluster_one_client])
+        return cluster_one_statefulsets[cluster_one_client.cluster_name].status.ready_replicas == 1

-    cluster_two_client = member_cluster_clients[1]
-    cluster_two_sts = statefulsets[cluster_two_client.cluster_name]
-    assert cluster_two_sts.status.ready_replicas == 1
+    kubetester.wait_until(fn, timeout=60, message="Verifying sts has correct number of replicas in cluster one")

-    cluster_three_client = member_cluster_clients[2]
-    cluster_three_sts = statefulsets[cluster_three_client.cluster_name]
-    assert cluster_three_sts.status.ready_replicas == 1
+    def fn():
+        cluster_two_client = member_cluster_clients[1]
+        cluster_two_statefulsets = mongodb_multi.read_statefulsets([cluster_two_client])
+        return cluster_two_statefulsets[cluster_two_client.cluster_name].status.ready_replicas == 1
+
+    kubetester.wait_until(fn, timeout=60, message="Verifying sts has correct number of replicas in cluster two")
+
+    def fn():
+        cluster_three_client = member_cluster_clients[2]
+        cluster_three_statefulsets = mongodb_multi.read_statefulsets([cluster_three_client])
+        return cluster_three_statefulsets[cluster_three_client.cluster_name].status.ready_replicas == 1
+
+    kubetester.wait_until(fn, timeout=60, message="Verifying sts has correct number of replicas in cluster three")


 @pytest.mark.e2e_multi_cluster_replica_set_scale_up
@@ -116,18 +129,36 @@ def test_statefulsets_have_been_scaled_up_correctly(
     mongodb_multi: MongoDBMulti,
     member_cluster_clients: List[MultiClusterClient],
 ):
-    statefulsets = mongodb_multi.read_statefulsets(member_cluster_clients)
-    cluster_one_client = member_cluster_clients[0]
-    cluster_one_sts = statefulsets[cluster_one_client.cluster_name]
-    assert cluster_one_sts.status.ready_replicas == 2
-
-    cluster_two_client = member_cluster_clients[1]
-    cluster_two_sts = statefulsets[cluster_two_client.cluster_name]
-    assert cluster_two_sts.status.ready_replicas == 1
-
-    cluster_three_client = member_cluster_clients[2]
-    cluster_three_sts = statefulsets[cluster_three_client.cluster_name]
-    assert cluster_three_sts.status.ready_replicas == 2
+    # Even though the previous test already verified that the MongoDBMultiCluster resource is in the Running phase
+    # (which implies that all STSs are ready), asserting the expected number of ready replicas for each STS right away
+    # makes the test flaky because of the issue described in https://jira.mongodb.org/browse/CLOUDP-329231. That's why
+    # we wait for each STS to reach the expected number of replicas. Revert this once the ticket above is properly fixed.
+    def fn():
+        cluster_one_client = member_cluster_clients[0]
+        cluster_one_statefulsets = mongodb_multi.read_statefulsets([cluster_one_client])
+        return cluster_one_statefulsets[cluster_one_client.cluster_name].status.ready_replicas == 2
+
+    kubetester.wait_until(
+        fn, timeout=60, message="Verifying sts has correct number of replicas after scale up in cluster one"
+    )
+
+    def fn():
+        cluster_two_client = member_cluster_clients[1]
+        cluster_two_statefulsets = mongodb_multi.read_statefulsets([cluster_two_client])
+        return cluster_two_statefulsets[cluster_two_client.cluster_name].status.ready_replicas == 1
+
+    kubetester.wait_until(
+        fn, timeout=60, message="Verifying sts has correct number of replicas after scale up in cluster two"
+    )
+
+    def fn():
+        cluster_three_client = member_cluster_clients[2]
+        cluster_three_statefulsets = mongodb_multi.read_statefulsets([cluster_three_client])
+        return cluster_three_statefulsets[cluster_three_client.cluster_name].status.ready_replicas == 2
+
+    kubetester.wait_until(
+        fn, timeout=60, message="Verifying sts has correct number of replicas after scale up in cluster three"
+    )


 @pytest.mark.e2e_multi_cluster_replica_set_scale_up
