MDEV-37541 Race of rolling back and committing transaction to binlog #4301

andrelkin · 2025-09-15T17:31:45Z

This is the 1st of two commits to demonstrate the failure which is
that a rolling back transaction while serializes its completion in
Engine before a contender transaction the two are logged into binary
log in reverse order.

The Jira issue number for this PR is: MDEV-______

Description

TODO: fill description here

Release Notes

TODO: What should the release notes say about this change?
Include any changed system variables, status variables or behaviour. Optionally list any https://mariadb.com/kb/ pages that need changing.

How can this PR be tested?

TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended.
Consult the documentation on "Writing good test cases".

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

Basing the PR against the correct MariaDB version

This is a new feature or a refactoring, and the PR is based against the main branch.
This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

This is the 1st of two commits to demonstrate the failure which is that a rolling back transaction while serializes its completion in Engine before a contender transaction the two are logged into binary log in reverse order.

The 2nd commit of two contains the fixes. Two transactions could binlog their completions in opposite to how it is done in Engine. That is is rare situations ROLLBACK in Engine could be scheduled first by 2pc coordinator to leave by chance binlogging of the transaction and whatever might be dependent on it in execution time. This is fixed with ensuring the binlog rollback handlerton is always executed first in 2pc.

CLAassistant · 2025-09-15T17:31:52Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

…s defined.

bnestere

Hi @andrelkin ! Thanks for the patch!

I left a couple questions on a comment.

bnestere · 2025-09-19T16:24:03Z

sql/handler.cc


+    int err;
+    if (has_binlog_hton(ha_info) &&
+        (err= binlog_hton->rollback(binlog_hton, thd, all)))


I haven't had time to fully investigate some of my questions/concerns before I leave for vacation, but I figured I'd at least voice them before I leave.

@knielsen mentions on JIRA

Hm. Normally this kind of wrong order is prevented by ensuring that binlogging always happens before releasing table and row locks.
Apparently, this is violated here for ROLLBACK. Not sure why or how the ROLLBACK is executed before binlogging happens.

Which I interpret as supporting your solution. But to me it looks a bit unconventional:

Why does the 2pc coordinator/orderer not consider rollback (or does it and I just misunderstand)?

Would it be better to integrate the rollback case with the 2pc coordinator?

Otherwise, it seems we may run other opposite binlog-order problems when rollback is BINLOG->ENGINE and commit can be ENGINE->BINLOG for direct writes.

Why does the 2pc coordinator/orderer not consider rollback

I am somewhat lost with the question.. Let me make it fully clear, the matter at hand is about the engine rollback
runs ahead of its order in the 2pc protocol. The engine completes (a similar would be with COMMIT, but it does not exist that thanks to the same measure as in the patch) the trx before binlog (as coordinator) gets informed.

knielsen · 2025-09-24T08:21:18Z

andrelkin ***@***.***> writes:

This is the 1st of two commits to demonstrate the failure which is that a rolling back transaction while serializes its completion in Engine before a contender transaction the two are logged into binary log in reverse order.<!-- diff --git a/sql/handler.cc b/sql/handler.cc index 5d7138490fdd2..c8ddb0b5ebd5f 100644 --- a/sql/handler.cc +++ b/sql/handler.cc @@ -2306,11 +2306,18 @@ int ha_rollback_trans(THD *thd, bool all) if (is_real_trans) /* not a statement commit */ thd->stmt_map.close_transient_cursors(); + int err; + if (has_binlog_hton(ha_info) && + (err= binlog_hton->rollback(binlog_hton, thd, all))) + { + my_error(ER_ERROR_DURING_COMMIT, MYF(0), err); + error= 1; + } for (; ha_info; ha_info= ha_info_next) { - int err; handlerton *ht= ha_info->ht(); - if ((err= ht->rollback(ht, thd, all))) + + if (ht != binlog_hton && (err= ht->rollback(ht, thd, all)))

If binlog needs to do some operation before engines roll back (and I agree this is needed here), then that operation should not be done inside the `binlog_hton->rollback()` call. This is really ugly, first we iterate a list to find some specific element and operate on it, then we iterate the list again and have a check to skip that specific element. Instead, use an explicit call into the binlog that does the operations needed, at the point it time where it should happen. (And yes, I know this is done similar in `commit_one_phase_2()`, and it is as wrong there as it is here. Repeating a sin doesn't make it right ;-). Also, do we really need to do this in 10.6? That doesn't seem right. This looks like a bug that has always been there, and IIUC it only occurs when mixing transactional and non-transactional tables in the same "transaction", something that is in any case fragile and problem-ridden. - Kristian.

andrelkin · 2025-09-29T19:15:20Z

@knielsen

do we really need to do this in 10.6?

Yes we unfortunately do, there's an active support case. And for that reason I suggest we keep the ugly style for 10.6 alone. After all its pattern is proven to be effectual and robust.

For 10.11 and up I am preparing much better shaped code that addresses the search for binlog hton.
It should be ready for review in a separate PR.

I hope the two branches compromise makes sense to you.

knielsen · 2025-09-30T08:58:23Z

andrelkin ***@***.***> writes:

andrelkin left a comment (MariaDB/server#4301) @knielsen > do we really need to do this in 10.6? Yes we unfortunately do, there's an *active* support case. And for

That's not a valid reason.

I suggest we keep the ugly style for 10.6 alone. After all its pattern is proven to be effectual and robust. For 10.11 and up I am preparing much better shaped code that addresses the search for binlog hton. I hope the two branches compromise makes sense to you.

Ok, sounds reasonable. - Kristian.

knielsen · 2025-09-30T10:35:18Z

> + int err; > + if (has_binlog_hton(ha_info) && > + (err= binlog_hton->rollback(binlog_hton, thd, all))) > + { > + my_error(ER_ERROR_DURING_COMMIT, MYF(0), err); > + error= 1; > + } > for (; ha_info; ha_info= ha_info_next) > { > - int err; > handlerton *ht= ha_info->ht(); > - if ((err= ht->rollback(ht, thd, all))) > + > + if (ht != binlog_hton && (err= ht->rollback(ht, thd, all))) If binlog needs to do some operation before engines roll back (and I agree this is needed here), then that operation should not be done inside the `binlog_hton->rollback()` call. This is really ugly, first we iterate a list to find some specific element and operate on it, then we iterate the list again and have a check to skip that specific element. Instead, use an explicit call into the binlog that does the operations needed, at the point it time where it should happen.

Andrei, please hold off with this patch for a bit. I just talked to Monty, he mentioned that the binlog is supposed to always be first in the ha_info list. Possibly something at some point broke that, and this could be the root cause of this bug (or regression in this case). Monty wants to look into this and discuss with Serg. If we can ensure that binlog is always first in the list, then it automatically should fix this case, and could fix other corner cases as well. - Kristian.

andrelkin · 2025-09-30T12:18:47Z

If we can ensure

Not that it's impossible but I've been pessimistic about it as there're a number of ways to screw this nice property.

Just one simple case. ~~Let's have binlog_ignore_db=R and trx updates first Innodb in the database I to binlog and Rocksdb in R - so to not.~~
~~Then~~ A two engine - like Innodb and Innodb - transaction's trans->ha_list would be built as RocksDB, Binlog, Innodb.

andrelkin · 2025-09-30T16:29:29Z

Well, the above RocksDB, Binlog, Innodb sequence of ha_info actually takes place without any binlog_ignore_db=R or if an Aria statement instead of RocksDB.
That is

Begin;
  insert into t_innodb...;
  insert into t_aria;
Commit;

Now

(gdb) p thd->transaction->all->ha_list->m_ht->commit
+p thd->transaction->all->ha_list->m_ht->commit
$1 = (int (*)(THD *, bool)) 0x5555583dba8a <maria_commit(THD*, bool)>

My feeling remains the same: it costs nothing to locate Binlog (to do that w/o list scan naturally, rather remembering its introduction to ha_list) and exclude hton->complete action on it in the loop (like it is currently). That has been my current plan B.

B because I had A which aimed to keep Binlog always in the head via slight changes in trans_register_ha().
This seemed to be a promising approach except ha_savepoint() and ha_rollback_to_savepoint() are defined to rely on the current always prepend method to trans->ha_list.
I turned to B, as I could not quickly find/decide-on how/whether the function could be adapted to A. E.g with SAVEPOINT *sv to contain a list of trans->ha_info copies. Or we may consider to tolerate some of my_error(ER_ERROR_DURING_ROLLBACK...inha_rollback_to_savepoint()`.

andrelkin · 2025-09-30T16:31:41Z

Added Monty and Serg to the discussion about

I just talked to Monty, he mentioned that the binlog is supposed to always
be first in the ha_info list.

andrelkin · 2025-10-08T17:04:13Z

I gave it some more thinking. Here is a summary of transaction::all::ha_list ordering state of affairs
and my suggestion how to proceed.
First of all, to remind the issue is incorrect binlogging on master spotted in a two transactions scenario with a specific partly non-transactional workload. When one of transactions rolls back it finds its binlog and engine branches ordered "unnaturally" (Binlog is not first in the transaction branches list) which ensues incorrect logging.

There exist few patterns when transaction::all::ha_list is (Engine_n, Binlog, Engine_n-1, ...) and which don't
have to be this way. Like

Begin;
  insert into t_innodb...;
  insert into t_aria ...;
Commit;

where it's (Aria, Binlog, Innodb). The precise "all" list at commit must be (Binlog, Innodb)
because the Aria statement is obviously outside the main transaction.
Its is clear how to improve these cases.

There also exist a harder pattern like mentioned two 2pc engines with one them
such that its updates are filtered out, on either master or slave, of binlog which leads to the binlog-in-the-middle list.
In order to handle the latter pattern I see only two main options available

enforce binlog ha_info be always in the head of transaction->all.ha_list, or
remember (or just search as the current server does) for it to execute its
binlog_hton->"complete"() first in ha_{commit,rollback}_trans() loops.

If we go via p.1, we'll to fix ha_savepoint() etc, as this feature relies on
the prepared-only policy in growing the list.
If we go via p.2 we'd have to exempt the binlog ha_info from "complete"() inside the loop,
similarly to how it's currently done in commit_one_phase_2()

for (binlog_hton->"complete"() when present ; ha_info ; ha_info= ha_info->next)
    if (hton is not binlog)
         hton->complete()

I don't see why the 2nd, being clearly of lesser efforts, should be lesser favorable.

Dear reviewers, please respond with your ideas or voting.

andrelkin · 2025-10-17T08:20:43Z

Somewhat polished version of p.2 of my last comments is pushed to bb-10.11-andrei with the idea it to be 10.11+ version while bb-10.6-MDEV-37541 to cover 10.6 alone.

vuvova · 2025-10-17T12:31:56Z

Disclaimer: I only looked at the first commit. For now I'm trying to understand a problem and what a solution should be, not to discuss a particular solution.

I see a few rather suspicious issues demonstrated by the first commit:

Why binlog starts a transaction for a memory table? It's not going to log anything into the trx cache. What does "transaction for memory table" conceptually mean in binlog? A bit more generally, to include Andrei's case 2, what does "transaction that is not going to be logged into the trx cache" mean?
There's no tc->prepare() call. Even though a transaction must be logged by a tc. This looks wrong. If a transaction needs to be written into binlog, it should go through tc->prepare() whether it's commit or rollback. Kristian wrote something similar above too.

andrelkin · 2025-10-17T16:51:49Z

The immediate issue is indeed that a Memory update caused the binlog branch registration at logging in ROW format and that's due to a legacy pre- (I did not track it further down) 9e62cf67b3cf commit that executes

if (in_multi_stmt_transaction_mode()))
       trans_register_ha(this, TRUE, binlog_hton);
     trans_register_ha(this, FALSE, binlog_hton);

So we do that for any engine modification. The practice introduced back in time may (does) need revising just as you and Kristian point.
As I mention above this dubious pattern extended further over engines like Aria that "all"-registers themselves too.

However as comments mention this issue is wider than how it appears here. See the {Rocks, Binlog, Innodb} Rocksdb's binlog-filtered updates last in transaction example.

vuvova · 2025-10-17T19:16:38Z

binlog-filtered doesn't seem to matter, but, indeed, with two engines one can have binlog in the middle. Still, Kristian and my point stands — Rollback should do full 2pc if it needs to write something to binlog.

andrelkin · 2025-10-18T12:47:04Z

binlog-filtered doesn't seem to matter

:-(

I over-complicated it: {Innodb, Binlog, Rocksdb} is a natural order for two engine trx like

begin;
  insert into t_rocks  set a=1;
  insert into t_innodb set a=1;
rollback;

my point stands — Rollback should do full 2pc if it needs to write something to binlog.

For what reason actually?
If you mean crash-recovery of the transactional engine part, then past recovery either prepare()-d as you suggest or un-prepare()-d transaction must complete with rollback? The unprepared one always does so, regardless of what is in binlog.
The prepared one would have to be decided..

The policy of ROLLBACK-ended logging, no need to remind you, is merely to record side effects which of course are non-transactional by definition. And in that sense such logging is similar to any non-transactional logging.

PS. I read a key Kristian's note

use an explicit call into the binlog that does the operations needed

(rather than binlog_hton->rollback() that is) , as something similar to what has happened bo binlog_hton->commit().
Just we cannot "deprecate" the rollback method at least without changes around ha_rollback_to_savepoint.

vuvova · 2025-10-20T07:30:37Z

What do you mean, for what reason, that's how 2pc works. if there's one participant that cannot do 2pc, it should be in the middle. That's why the protocol is what it is — every engine prepares, binlog commits (writes to binlog, it's tc->log call), then everyone commits.

As for rollbacks, there's normally no need to do that, every participant (including binlog) simply throws transaction changes away.

But a "rollback" when a non-trans engine is involved is not really a rollback. That engine has the changes committed. And binlog needs to commit (write transaction down persistently) too, so it's a commit, of a sort. And must be subjected to full 2pc.

{Innodb, Binlog, Rocksdb} case is not a problem, it'll be a true rollback and nothing will be written to binlog. It doesn't matter whether nothing will be written to binlog first or last.

andrelkin · 2025-10-20T13:01:51Z

But a "rollback" when a non-trans engine is involved is not really a rollback. That engine has the changes committed.

What engine committed? None of the trans->all.ha_list does so.

As for rollbacks, there's normally no need to do that

just as you say the transaction is fine to rollback anyway it likes.

What happens is that along with the transactional changes there's a non-transactional modification which may be placed in the transactional cache. And the whole cache with ROLLBACK termination is logged to reflect that fact.
We might have designed instead a method to extract that modification from the cache and log it alone.
Obviously it need not any prepare().

{Innodb, Binlog, Rocksdb} case is not a problem

Not a problem without side effects. But we have ones in this bug... I should've mentioned that just in case in my last comments).

vuvova · 2025-10-21T13:14:14Z

What engine committed? None of the trans->all.ha_list does so.

Aria. Or MyISAM. Any non-transactional engine. "Committed" in the sense that changes made between START TRANSACTION and ROLLBACK were made persistent.

Not a problem without side effects. But we have ones in this bug...

what problem do you have in this bug with {Innodb, Binlog, Rocksdb} case?

andrelkin · 2025-10-21T16:26:21Z

Aria. Or MyISAM. Any non-transactional engine.

Right. So we can say it one-phase-committed.

We have two different ways to log such 1pc engine statement when they mingle a 2pc engine transaction.
And I have been pointing to that either the ROLLBACK-termination (e.g necessary for multi-table update) or
the second "standard" logging method ahead of the 2pc transaction (of a pure 1pc engine modification if it mingles the 2pc trx) need not any prepare by the 2pc trx.
It rolls back when it likes.
The ROLLBACK-termination can be conceived as if it's an extended standard method. Extended in the sense we log
some extra events whose properties only require roll-back wherever the whole event will be replayed.

what problem do you have in this bug with {Innodb, Binlog, Rocksdb} case?

When one of two 2pc transactions that depend on each other through Innodb engine, and the dependency parent is represented by the list of handlertons as { Innodb, Binlog, Rocksdb} also having a side effect, the 2 engine parent trx may end up in binlog as it happens in the bug. That is ROLLBACK-terminatedly logged after the child.
Indeed, Innodb branch is rolled back, the child Innodb updates and logs, the parent completes its entire work with rolling back Rocksdb and the logging.

andrelkin · 2025-10-22T16:34:45Z

@vuvova , just one little thing, there is a detail that seems to have been left out

For now I'm trying to understand a problem

but may be critical to the understanding.

And it is that is not the Memory table that produces the side effect in the test. It's DROP TEMPORARY TABLE which causes
trans_cannot_safely_rollback() to claim
rollback is unsafe. The Memory table role is to "simulate" a 2nd 2pc-capanle engine.

emoonrain added 2 commits September 15, 2025 18:34

andrelkin requested review from bnestere and knielsen September 15, 2025 17:31

MDEV-37541 annex: enforce an added test to run only when DEBUG_SYNC i…

24ea8f9

…s defined.

svoj added the MariaDB Corporation label Sep 16, 2025

bnestere reviewed Sep 19, 2025

View reviewed changes

andrelkin requested review from montywi and vuvova September 30, 2025 16:30

andrelkin mentioned this pull request Oct 17, 2025

10.11 patch for MDEV-37541 (Race of rolling back and committing transaction to binlog #4376

Open

4 tasks

Uh oh!

MDEV-37541 Race of rolling back and committing transaction to binlog #4301

Are you sure you want to change the base?

MDEV-37541 Race of rolling back and committing transaction to binlog #4301

Conversation

andrelkin commented Sep 15, 2025 • edited by vuvova Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Release Notes

How can this PR be tested?

Basing the PR against the correct MariaDB version

PR quality check

Uh oh!

CLAassistant commented Sep 15, 2025

Uh oh!

bnestere left a comment

Choose a reason for hiding this comment

Uh oh!

bnestere Sep 19, 2025

Choose a reason for hiding this comment

Uh oh!

andrelkin Sep 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

knielsen commented Sep 24, 2025 via email

Uh oh!

andrelkin commented Sep 29, 2025

Uh oh!

knielsen commented Sep 30, 2025 via email

Uh oh!

knielsen commented Sep 30, 2025 via email

Uh oh!

andrelkin commented Sep 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andrelkin commented Sep 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andrelkin commented Sep 30, 2025

Uh oh!

andrelkin commented Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andrelkin commented Oct 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vuvova commented Oct 17, 2025

Uh oh!

andrelkin commented Oct 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vuvova commented Oct 17, 2025

Uh oh!

andrelkin commented Oct 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vuvova commented Oct 20, 2025

Uh oh!

andrelkin commented Oct 20, 2025

Uh oh!

vuvova commented Oct 21, 2025

Uh oh!

andrelkin commented Oct 21, 2025

Uh oh!

andrelkin commented Oct 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

7 participants

andrelkin commented Sep 15, 2025 •

edited by vuvova

Loading

andrelkin Sep 19, 2025 •

edited

Loading

andrelkin commented Sep 30, 2025 •

edited

Loading

andrelkin commented Sep 30, 2025 •

edited

Loading

andrelkin commented Oct 8, 2025 •

edited

Loading

andrelkin commented Oct 17, 2025 •

edited

Loading

andrelkin commented Oct 17, 2025 •

edited

Loading

andrelkin commented Oct 18, 2025 •

edited

Loading