Commit dc0fce1

Author: Ole John Aske
Bug#22842538 BINLOG SCHEMA DISTRIBUTION TIMEOUT AND FAILS WHEN ANOTHER MYSQL NODE START
When another mysqld node is started and joins (subscribes to) the schema distribution protocol, a mysqld that is waiting for a schema change to be distributed will time out during that wait. This happens because we incorrectly assumed that the newly arrived mysqld node would also 'ack' the schema distribution. However, it arrived too late to be a participant in it.

This patch fixes three issues, all contributing to this failure:

a) There is a potential race between an 'inflight' subscribe event and the start of a schema distribution. The subscribing node might or might not take part in the schema distribution, and its role is unknown at the point in time where the schema operation is started by the coordinator. The set of participating servers can only be determined when the coordinator acks its own schema op: if the subscribe event arrived before its own schema op, then the subscribing node is a participant. This patch modifies the coordinator's ack so that it also clears, in the acked slock_bitmap, the servers *not* participating.

b) check_wakeup_clients() called get_subscriber_bitmask() to get the current set of subscribers. However, 'self' was not included in the subscribers, which it always should be. Fixed this by letting Ndb_schema_dist_data::init() add 'own_nodeid' to the subscribers. Furthermore, this enables us to clean up a couple of places where we used to add own_nodeid to the set retrieved from get_subscribers_bitmask().

c) handle_clear_slock() copied schema->slock into ndb_schema_object->slock_bitmap, thereby overwriting the intersect done as part of a). Changed the copy to do an intersect instead.

This patch also modifies several places where schema distribution progress is printed:
- Always print the more significant part of the bitmask before the less significant part.
- Add some formatting when printing the bitmasks.

Also removes a few clears of bitmasks immediately after an init, which are redundant as ::init() also clears them.
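The interplay of fixes a), b) and c) can be illustrated with a small, self-contained C++ sketch. It is not the patched server code: `NodeMask`, `Subscribers`, `SchemaOp`, `participant_ack()` and `late_joiner_scenario()` are hypothetical names invented for this illustration, `std::bitset` stands in for the server's bitmap type, and the assumption that an operation starts out waiting for every node is ours. The sketch only shows why copying a received slock can resurrect the bit of a late joiner, while intersecting cannot.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Illustrative stand-ins only; none of these names come from the MySQL sources.
constexpr std::size_t kMaxNodes = 8;
using NodeMask = std::bitset<kMaxNodes>;

// Subscriber bookkeeping, loosely modelled on the role of Ndb_schema_dist_data.
struct Subscribers {
  NodeMask mask;
  explicit Subscribers(unsigned own_nodeid) {
    assert(own_nodeid < kMaxNodes);
    mask.set(own_nodeid);  // (b) 'self' is always part of the subscriber set
  }
  void subscribe(unsigned nodeid) {
    assert(nodeid < kMaxNodes);
    mask.set(nodeid);
  }
};

// One schema operation in flight. A set bit means "ack still awaited".
struct SchemaOp {
  NodeMask slock;
  SchemaOp() { slock.set(); }  // assumption: participants are unknown at start

  // (a) The coordinator's own ack fixes the participant set: every node that
  // was not subscribed at this point is cleared, as is the coordinator itself.
  void coordinator_ack(unsigned own_nodeid, const NodeMask& subscribers) {
    slock &= subscribers;
    slock.reset(own_nodeid);
  }

  // (c) Ack from a participant. Intersecting keeps the result of (a);
  // copying (the behaviour described as buggy) overwrites it.
  void participant_ack(const NodeMask& received, bool use_intersect) {
    if (use_intersect) {
      slock &= received;
    } else {
      slock = received;
    }
  }

  bool done() const { return slock.none(); }
};

// Returns true if the schema operation completes although a node joined late.
bool late_joiner_scenario(bool use_intersect) {
  const unsigned coordinator = 1, participant = 2, late_joiner = 3;

  Subscribers subs(coordinator);
  subs.subscribe(participant);

  SchemaOp op;
  op.coordinator_ack(coordinator, subs.mask);  // only node 2 is still awaited

  subs.subscribe(late_joiner);  // arrives too late to be a participant

  // Assumed shape of the slock carried by node 2's ack: nodes 1 and 2 are
  // cleared, but the late joiner's bit is still set.
  NodeMask received;
  received.set();
  received.reset(coordinator);
  received.reset(participant);

  op.participant_ack(received, use_intersect);
  return op.done();
}
```

Calling `late_joiner_scenario(true)` models the intersect from fix c) and reports the operation as complete. Calling `late_joiner_scenario(false)` models the old copy and leaves the operation waiting for node 3, which corresponds to the timeout described above.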
1 parent 65978e0 commit dc0fce1

3 files changed: +221 -85 lines changed

mysql-test/suite/ndb_binlog/r/ndb_binlog_ddl_multi.result

Lines changed: 20 additions & 0 deletions
@@ -282,3 +282,23 @@ mysqld-bin.000001 # Query 2 # drop database mysqltest
 mysqld-bin.000001 # Query 2 # use `test`; drop table `test`.`t1`
 mysqld-bin.000001 # Query 2 # use `test`; create table dummy (dummyk int primary key) engine = ndb
 mysqld-bin.000001 # Query 2 # use `test`; DROP TABLE `dummy` /* generated by server */
+CREATE TABLE progress(
+cnt int, stop_flag int
+) ENGINE NDB;
+insert into progress value(0,0);
+Start background load distributing schema changes.
+call p1();
+Restart mysqld 'server2'
+Checking for schema ops. still making progress
+Restart mysqld 'server2'
+Checking for schema ops. still making progress
+Restart mysqld 'server2'
+Checking for schema ops. still making progress
+Restart mysqld 'server2'
+Checking for schema ops. still making progress
+Stopping background load distributing schema changes.
+update progress set stop_flag=1;
+Wait for background schema distribution load to complete.
+Cleanup
+drop procedure p1;
+drop table progress;

mysql-test/suite/ndb_binlog/t/ndb_binlog_ddl_multi.test

Lines changed: 91 additions & 0 deletions
@@ -250,3 +250,94 @@ enable_query_log;
 drop table dummy;
 --source include/show_binlog_events2.inc
 
+##########################
+# Bug#22842538:
+#
+# New mysqld's joining the schema distribution may cause
+# ongoing DDL operations to never complete -> timeout
+#
+##########################
+connection server1;
+
+
+# Procedure p1 communicate through 'progress' table
+CREATE TABLE progress(
+cnt int, stop_flag int
+) ENGINE NDB;
+
+
+# Create a procedure for producing a background load of
+# DDL operations. Any operations, like a CREATE TABLE, would
+# do. However, we have experienced that CREATE LOGFILE seems
+# to be most likely to trigger this bug (Due to timing?)
+#
+disable_query_log;
+delimiter %;
+create procedure p1()
+begin
+  declare done int default 0;
+  repeat
+    UPDATE progress set cnt=cnt+1;
+    COMMIT;
+    CREATE LOGFILE GROUP lg_1
+      ADD UNDOFILE 'undo_1.dat'
+      INITIAL_SIZE 4M
+      UNDO_BUFFER_SIZE 2M
+      ENGINE NDB;
+    UPDATE progress set cnt=cnt+1;
+    COMMIT;
+    DROP LOGFILE GROUP lg_1 ENGINE NDB;
+    SELECT stop_flag INTO done FROM progress;
+  until done end repeat;
+end%
+delimiter ;%
+enable_query_log;
+
+
+insert into progress value(0,0);
+
+--echo Start background load distributing schema changes.
+send call p1();
+
+connection server2;
+let $1 = 4;
+while ($1)
+{
+  # Ignore the warning generated by ndbcluster's binlog thread
+  # when cluster is restarted
+  --disable_query_log ONCE
+  call mtr.add_suppression("mysqld startup An incident event has been written");
+
+  --echo Restart mysqld 'server2'
+  let $mysqld_name=mysqld.2.1;
+  --source include/restart_mysqld.inc
+
+  --echo Checking for schema ops. still making progress
+  let $initial_cnt = `select cnt from progress`;
+  let $progress_cnt = $initial_cnt;
+
+  let $max_wait = 10;
+  while ($progress_cnt == $initial_cnt)
+  {
+    sleep 1;
+    dec $max_wait;
+    if ($max_wait == 0)
+    {
+      die Schema distribution timed out without progress;
+    }
+    let $progress_cnt = `select cnt from progress`;
+  }
+  dec $1;
+}
+
+--echo Stopping background load distributing schema changes.
+update progress set stop_flag=1;
+
+connection server1;
+
+--echo Wait for background schema distribution load to complete.
+reap;
+
+--echo Cleanup
+drop procedure p1;
+drop table progress;
