
Commit 0a9a215

Pedro Figueiredo authored and dahlerlend committed
WL#13574 Include MDL and ACL locks in MTS deadlock detection infra-structure [post-push]
Description
-----------
This patch addresses three issues introduced by the WL#13574 patch:

1) The multi-threaded applier stops with a segmentation fault in ARM
   environments.

2) A pure virtual method invocation error in the MDL graph
   infra-structure is sporadically observed.

3) The `rpl_gtid.rpl_gtid_mts_spco_deadlock_other_locks` test case is
   failing in the 8.0 and trunk branches.

Analysis/Fix
------------
The analysis and proposed fix for each of the above issues:

1) Within `memory::Aligned_atomic`, the L1 cache-line size is fetched
   programmatically at runtime in order to optimize memory usage. The
   method by which this configuration value is acquired differs from OS
   to OS. On Linux, the value was being read from a file in the `proc`
   filesystem. This is not portable; on ARM, for instance, that file
   does not exist. The proper way is to use `sysconf()` with the
   `_SC_LEVEL1_DCACHE_LINESIZE` tag.

2) In the `Commit_order_manager::wait_on_graph` method, a local
   `Commit_order_lock_graph` object, `ticket`, is created and a
   reference to it is passed to the MDL graph as a node to wait for. In
   the same method, a `raii::Sentry<>` object is created to clear the
   `ticket` reference from the MDL graph at the end of the scope. The
   `Commit_order_lock_graph` reference stored in the MDL graph is
   accessed by every thread that executes a deadlock search on the MDL
   graph.

   The problem was that the `raii::Sentry<>` object was being
   instantiated **before** the `Commit_order_lock_graph` object. Since
   the order of destruction is the inverse of the order of creation,
   the `Commit_order_lock_graph` object was destroyed before the
   clean-up performed by the `raii::Sentry<>` ran, leaving a window in
   which the object was already destroyed but still referenced by the
   MDL graph. This is fixed by inverting the order of creation of the
   two objects.

   Why not a segmentation fault? Because the `Commit_order_lock_graph`
   object is local, so its stack memory is still there. However, its
   destructor had already run: there is no memory violation, but the
   object's dynamic type information is gone, so accessing it through a
   reference to the parent class, `MDL_wait_for_subgraph`, produces the
   _pure virtual method invocation_ error.

3) In a multi-threaded applier with _replica-preserve-commit-order_
   enabled, there are two execution paths by which the applier
   coordinator may exit due to a deadlock: either all workers are
   waiting on the commit order queue and are all asked to back off by
   the MDL graph infra-structure, or some workers have not yet reached
   the commit stage and will back off due to the state of the commit
   order queue. The two paths make the applier output different error
   messages. The test cases needed to be updated to reflect the
   difference: for each test case, both execution paths are now
   exercised and tested.

Reviewed-by: Pedro Gomes <[email protected]>
Reviewed-by: Sven Sandberg <[email protected]>
RB: 25616
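For illustration, a minimal sketch of the portable cache-line probe described in fix (1) above. The helper name and the fallback value are hypothetical; this is not the actual `memory::Aligned_atomic` code, only the `sysconf()` / `_SC_LEVEL1_DCACHE_LINESIZE` approach it adopts.

    #include <unistd.h>   // sysconf(), _SC_LEVEL1_DCACHE_LINESIZE
    #include <cstddef>

    // Hypothetical helper: query the L1 data-cache line size portably via
    // sysconf() instead of parsing a /proc file.
    static std::size_t l1_dcache_line_size() {
    #if defined(_SC_LEVEL1_DCACHE_LINESIZE)
      long size = ::sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
      if (size > 0) return static_cast<std::size_t>(size);
    #endif
      return 64;  // Assumed fallback when the value cannot be queried.
    }

On platforms where the value cannot be queried, `sysconf()` may return -1 or 0, so falling back to a conservative constant keeps the alignment optimization harmless.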
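Similarly, a minimal stand-alone sketch of the destruction-order hazard described in fix (2). The `Registry`, `Sentry`, and `wait_on_graph_sketch` names are illustrative stand-ins, not the server's `raii::Sentry<>`, `Commit_order_lock_graph`, or MDL classes; the point is only that C++ destroys local objects in the reverse order of their construction, so the deregistering guard must be constructed after the object it deregisters.

    #include <functional>
    #include <utility>

    // Illustrative stand-ins for the shared structure and the scope guard.
    struct Registry {
      void *registered = nullptr;  // reference that other threads may inspect
    };

    struct Sentry {  // runs a clean-up action when the scope is left
      explicit Sentry(std::function<void()> f) : cleanup(std::move(f)) {}
      ~Sentry() { cleanup(); }
      std::function<void()> cleanup;
    };

    void wait_on_graph_sketch(Registry &registry) {
      // Correct order: create the node first, the guard second. Destructors
      // run in reverse order, so the guard clears the registration while
      // `node` is still alive. Reversing these two declarations reproduces
      // the reported bug: `node` is destroyed first, leaving a window in
      // which the registry still points at an already-destructed object.
      int node = 0;  // stand-in for the graph-node object
      Sentry guard{[&registry]() { registry.registered = nullptr; }};
      registry.registered = &node;
      // ... waiting / deadlock detection would happen here ...
    }  // `guard` clears the registration, then `node` goes out of scope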
1 parent 7daac02 commit 0a9a215

7 files changed, +607 -113 lines changed

mysql-test/extra/rpl_tests/mts_spco_generate_deadlock.inc

Lines changed: 83 additions & 15 deletions
@@ -22,6 +22,20 @@
 # The statement to be execute by one of the workers that should block the
 # client connection.
 #
+# [--let $mts_spco_gd_worker_3_only_runs_after_deadlock = 0|1]
+#   Whether or not worker 3 should only proceed with the transaction
+#   execution after the deadlocak has been detected. If `0`, worker 3 will
+#   wait on the commit order queue before the deadlock is detected, leading
+#   to the use-case where both workers 2 and 3 will be force to rollback by
+#   the MDL graph infra-structure. If `1`, worker 3 will only start the
+#   transaction execution after the deadlock is detected and the coordinator
+#   already started the shutdown process, leading to the use-case where
+#   worker 2 is forced to rollback by the MDL graph infra-structure but
+#   worker 3 will rollback due to the commit queue infra-structure. These
+#   two use-cases generate a different set of error messages, hence the need
+#   to test each of them separately, and prohibit executing a scenario where
+#   either of them can happen nondeterministically.
+#
 # [--let $mts_spco_gd_trx_blocking_client = <STATEMENT>]
 #   The statement to be executed by a client connection that should be
 #   blocked by one or both workers waiting on the commit queue.
@@ -67,7 +81,7 @@
 # 6. On the replica, ensure that client connection B is waiting on the
 #    lock being held by Worker 2 or Worker 3, reaching the state
 #    $mts_spco_gd_state_blocking_client.
-# 7. On the replica, using client conneciton A, rollback the pending
+# 7. On the replica, using client connection A, rollback the pending
 #    transaction, leading to the following lock acquisition dependencies:
 #    Client B --statement required lock--> Worker 2 --commit order
 #    lock--> Worker 1 --statement required lock--> Client B.
@@ -97,6 +111,10 @@ if ($mts_spco_gd_error_expected_replica == '')
 {
   --let $mts_spco_gd_error_expected_replica = 0
 }
+if ($mts_spco_gd_worker_3_only_runs_after_deadlock == '')
+{
+  --let $mts_spco_gd_worker_3_only_runs_after_deadlock = 0
+}
 
 --source include/rpl_connection_slave.inc
 --let $mts_spco_gd_is_replica_sql_running = query_get_value(SHOW REPLICA STATUS, Replica_SQL_Running, 1)
@@ -111,7 +129,7 @@ if ($mts_spco_gd_seqno_to_wait_for == '')
 }
 
 --inc $mts_spco_gd_seqno_to_wait_for
---let $mts_spco_gd_gtid_to_wait_for = "aaaaaaaa-1111-bbbb-2222-cccccccccccc:$mts_spco_gd_seqno_to_wait_for"
+--let $mts_spco_gd_gtid_to_wait_for_1 = "aaaaaaaa-1111-bbbb-2222-cccccccccccc:$mts_spco_gd_seqno_to_wait_for"
 
 # 1. Execute $mts_spco_gd_trx_blocking_worker_1,
 #    $mts_spco_gd_trx_assigned_worker_2 and
@@ -121,27 +139,52 @@ if ($mts_spco_gd_seqno_to_wait_for == '')
 --source include/rpl_connection_master.inc
 --let $debug_point = set_commit_parent_100
 --source include/add_debug_point.inc
---eval SET GTID_NEXT = $mts_spco_gd_gtid_to_wait_for
---eval $mts_spco_gd_trx_blocking_worker_1
 
+--eval SET GTID_NEXT = $mts_spco_gd_gtid_to_wait_for_1
+--eval $mts_spco_gd_trx_blocking_worker_1
 SET GTID_NEXT = AUTOMATIC;
+
 if ($mts_spco_gd_trx_assigned_worker_2 != '')
 {
+  --inc $mts_spco_gd_seqno_to_wait_for
+  --let $mts_spco_gd_gtid_to_wait_for_2 = "aaaaaaaa-1111-bbbb-2222-cccccccccccc:$mts_spco_gd_seqno_to_wait_for"
+  --eval SET GTID_NEXT = $mts_spco_gd_gtid_to_wait_for_2
   --eval $mts_spco_gd_trx_assigned_worker_2
+  SET GTID_NEXT = AUTOMATIC;
 }
 if ($mts_spco_gd_trx_assigned_worker_3 != '')
 {
+  --inc $mts_spco_gd_seqno_to_wait_for
+  --let $mts_spco_gd_gtid_to_wait_for_3 = "aaaaaaaa-1111-bbbb-2222-cccccccccccc:$mts_spco_gd_seqno_to_wait_for"
+  --eval SET GTID_NEXT = $mts_spco_gd_gtid_to_wait_for_3
   --eval $mts_spco_gd_trx_assigned_worker_3
+  SET GTID_NEXT = AUTOMATIC;
 }
+
 --source include/remove_debug_point.inc
 
 # 2. On the replica, using client connection A, start a transaction and
 #    assign it the same GTID as to one of the statements issued on the
 #    source.
 --source include/rpl_connection_slave.inc
---eval SET GTID_NEXT = $mts_spco_gd_gtid_to_wait_for
+--eval SET GTID_NEXT = $mts_spco_gd_gtid_to_wait_for_1
 BEGIN;
 
+if ($mts_spco_gd_trx_assigned_worker_2 != '')
+{
+  --let $rpl_connection_name = rpl_slave_connection_2
+  --source include/rpl_connection.inc
+  --eval SET GTID_NEXT = $mts_spco_gd_gtid_to_wait_for_2
+  BEGIN;
+}
+if ($mts_spco_gd_trx_assigned_worker_3 != '')
+{
+  --let $rpl_connection_name = rpl_slave_connection_3
+  --source include/rpl_connection.inc
+  --eval SET GTID_NEXT = $mts_spco_gd_gtid_to_wait_for_3
+  BEGIN;
+}
+
 --source include/rpl_connection_slave1.inc
 # 3. Start the replication threads
 --source include/start_slave_sql.inc
@@ -166,20 +209,29 @@ BEGIN;
 
 # 4. On the replica, ensure that the applier worker threads are waiting
 #    on the pending client connection transaction.
---let $mts_spco_gd_pending_workers = 0
 if ($mts_spco_gd_trx_assigned_worker_2 != '')
 {
-  --inc $mts_spco_gd_pending_workers
+  --let $rpl_connection_name = rpl_slave_connection_2
+  --source include/rpl_connection.inc
+  ROLLBACK;
+  SET GTID_NEXT = AUTOMATIC;
+  --echo include/wait_condition.inc [First worker must wait on commit order]
+  --let $wait_condition = SELECT count(*) = 1 FROM information_schema.processlist WHERE STATE = "Waiting for preceding transaction to commit"
+  --source include/wait_condition.inc
 }
 if ($mts_spco_gd_trx_assigned_worker_3 != '')
 {
-  --inc $mts_spco_gd_pending_workers
-}
-if ($mts_spco_gd_pending_workers != 0)
-{
-  --echo include/wait_condition.inc [Workers must wait on commit order]
-  --let $wait_condition = SELECT count(*) = $mts_spco_gd_pending_workers FROM information_schema.processlist WHERE STATE = "Waiting for preceding transaction to commit"
-  --source include/wait_condition.inc
+  if ($mts_spco_gd_worker_3_only_runs_after_deadlock == 0)
+  {
+    --source include/wait_condition.inc
+    --let $rpl_connection_name = rpl_slave_connection_3
+    --source include/rpl_connection.inc
+    ROLLBACK;
+    SET GTID_NEXT = AUTOMATIC;
+    --echo include/wait_condition.inc [Second worker must wait on commit order]
+    --let $wait_condition = SELECT count(*) = 2 FROM information_schema.processlist WHERE STATE = "Waiting for preceding transaction to commit"
+    --source include/wait_condition.inc
+  }
 }
 
 # 5. On the replica, using client connection B, execute
@@ -189,6 +241,7 @@ if ($mts_spco_gd_pending_workers != 0)
 #    Worker 2 --commit order lock--> Worker 1 --gtid lock--> Client A.
 if ($mts_spco_gd_trx_blocking_client != '')
 {
+  --source include/rpl_connection_slave1.inc
   --send_eval $mts_spco_gd_trx_blocking_client
 }
 
@@ -209,7 +262,7 @@ if ($mts_spco_gd_state_blocking_client != '')
   --source include/wait_condition.inc
 }
 
-# 7. On the replica, using client conneciton A, rollback the pending
+# 7. On the replica, using client connection A, rollback the pending
 #    transaction, leading to the following lock acquisition dependencies:
 #    Client B --statement required lock--> Worker 2 --commit order
 #    lock--> Worker 1 --statement required lock--> Client B.
@@ -240,6 +293,20 @@ if ($mts_spco_gd_wait_for_coordinator_running_state != '')
   }
 }
 
+if ($mts_spco_gd_trx_assigned_worker_3 != '')
+{
+  if ($mts_spco_gd_worker_3_only_runs_after_deadlock == 1)
+  {
+    --echo include/wait_condition.inc [Coordinator must wait for workers to stop]
+    --let $wait_condition = SELECT count(*) = 1 FROM information_schema.processlist WHERE STATE = "Waiting for workers to exit"
+    --source include/wait_condition.inc
+    --let $rpl_connection_name = rpl_slave_connection_3
+    --source include/rpl_connection.inc
+    ROLLBACK;
+    SET GTID_NEXT = AUTOMATIC;
+  }
+}
+
 # 10. Wait for the replica applier thread to error out with
 #     $mts_spco_gd_error_expected_replica, if any is defined.
 if ($mts_spco_gd_error_expected_replica != 0)
@@ -280,6 +347,7 @@ if ($mts_spco_gd_error_expected_replica != 0)
 --let $mts_spco_gd_trx_blocking_worker_1 =
 --let $mts_spco_gd_trx_assigned_worker_2 =
 --let $mts_spco_gd_trx_assigned_worker_3 =
+--let $mts_spco_gd_worker_3_only_runs_after_deadlock =
 --let $mts_spco_gd_trx_blocking_client =
 --let $mts_spco_gd_state_blocking_client =
 --let $mts_spco_gd_wait_for_coordinator_running_state =

mysql-test/extra/rpl_tests/mts_spco_generate_deadlock_setup.inc

Lines changed: 7 additions & 0 deletions
@@ -30,3 +30,10 @@ SET GLOBAL slave_preserve_commit_order = ON;
 --eval SET GLOBAL slave_transaction_retries = $mts_spco_gd_transaction_retries
 --replace_result $mts_spco_gd_innodb_wait_timeout INNODB_LOCK_WAIT_TIMEOUT
 --eval SET GLOBAL innodb_lock_wait_timeout = $mts_spco_gd_innodb_wait_timeout
+
+--let $rpl_connection_name = rpl_slave_connection_2
+--let $rpl_server_number = 2
+--source include/rpl_connect.inc
+--let $rpl_connection_name = rpl_slave_connection_3
+--let $rpl_server_number = 2
+--source include/rpl_connect.inc
