Skip to content

Conversation

@andrelkin
Copy link
Contributor

@andrelkin andrelkin commented Sep 15, 2025

This is the 1st of two commits to demonstrate the failure which is
that a rolling back transaction while serializes its completion in
Engine before a contender transaction the two are logged into binary
log in reverse order.

  • The Jira issue number for this PR is: MDEV-______

Description

TODO: fill description here

Release Notes

TODO: What should the release notes say about this change?
Include any changed system variables, status variables or behaviour. Optionally list any https://mariadb.com/kb/ pages that need changing.

How can this PR be tested?

TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended.
Consult the documentation on "Writing good test cases".

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

This is the 1st of two commits to demonstrate the failure which is
that a rolling back transaction while serializes its completion in
Engine before a contender transaction the two are logged into binary
log in reverse order.
The 2nd commit of two contains the fixes.

Two transactions could binlog their completions in opposite to how it
is done in Engine. That is is rare situations ROLLBACK in Engine could
be scheduled first by 2pc coordinator to leave by chance binlogging of
the transaction and whatever might be dependent on it in execution time.

This is fixed with ensuring the binlog rollback handlerton is always
executed first in 2pc.
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Contributor

@bnestere bnestere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @andrelkin ! Thanks for the patch!

I left a couple questions on a comment.


int err;
if (has_binlog_hton(ha_info) &&
(err= binlog_hton->rollback(binlog_hton, thd, all)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had time to fully investigate some of my questions/concerns before I leave for vacation, but I figured I'd at least voice them before I leave.

@knielsen mentions on JIRA

Hm. Normally this kind of wrong order is prevented by ensuring that binlogging always happens before releasing table and row locks.
Apparently, this is violated here for ROLLBACK. Not sure why or how the ROLLBACK is executed before binlogging happens.

Which I interpret as supporting your solution. But to me it looks a bit unconventional:

  • Why does the 2pc coordinator/orderer not consider rollback (or does it and I just misunderstand)?
  • Would it be better to integrate the rollback case with the 2pc coordinator?

Otherwise, it seems we may run other opposite binlog-order problems when rollback is BINLOG->ENGINE and commit can be ENGINE->BINLOG for direct writes.

Copy link
Contributor Author

@andrelkin andrelkin Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the 2pc coordinator/orderer not consider rollback

I am somewhat lost with the question.. Let me make it fully clear, the matter at hand is about the engine rollback
runs ahead of its order in the 2pc protocol. The engine completes (a similar would be with COMMIT, but it does not exist that thanks to the same measure as in the patch) the trx before binlog (as coordinator) gets informed.

@knielsen
Copy link
Member

knielsen commented Sep 24, 2025 via email

@andrelkin
Copy link
Contributor Author

@knielsen

do we really need to do this in 10.6?

Yes we unfortunately do, there's an active support case. And for that reason I suggest we keep the ugly style for 10.6 alone. After all its pattern is proven to be effectual and robust.

For 10.11 and up I am preparing much better shaped code that addresses the search for binlog hton.
It should be ready for review in a separate PR.

I hope the two branches compromise makes sense to you.

@knielsen
Copy link
Member

knielsen commented Sep 30, 2025 via email

@knielsen
Copy link
Member

knielsen commented Sep 30, 2025 via email

@andrelkin
Copy link
Contributor Author

andrelkin commented Sep 30, 2025

If we can ensure

Not that it's impossible but I've been pessimistic about it as there're a number of ways to screw this nice property.

Just one simple case. Let's have binlog_ignore_db=R and trx updates first Innodb in the database I to binlog and Rocksdb in R - so to not.
Then A two engine - like Innodb and Innodb - transaction's trans->ha_list would be built as RocksDB, Binlog, Innodb.

@andrelkin
Copy link
Contributor Author

andrelkin commented Sep 30, 2025

Well, the above RocksDB, Binlog, Innodb sequence of ha_info actually takes place without any binlog_ignore_db=R or if an Aria statement instead of RocksDB.
That is

Begin;
  insert into t_innodb...;
  insert into t_aria;
Commit;

Now

(gdb) p thd->transaction->all->ha_list->m_ht->commit
+p thd->transaction->all->ha_list->m_ht->commit
$1 = (int (*)(THD *, bool)) 0x5555583dba8a <maria_commit(THD*, bool)>

My feeling remains the same: it costs nothing to locate Binlog (to do that w/o list scan naturally, rather remembering its introduction to ha_list) and exclude hton->complete action on it in the loop (like it is currently). That has been my current plan B.

B because I had A which aimed to keep Binlog always in the head via slight changes in trans_register_ha().
This seemed to be a promising approach except ha_savepoint() and ha_rollback_to_savepoint() are defined to rely on the current always prepend method to trans->ha_list.
I turned to B, as I could not quickly find/decide-on how/whether the function could be adapted to A. E.g with SAVEPOINT *sv to contain a list of trans->ha_info copies. Or we may consider to tolerate some of my_error(ER_ERROR_DURING_ROLLBACK...inha_rollback_to_savepoint()`.

@andrelkin andrelkin requested review from montywi and vuvova September 30, 2025 16:30
@andrelkin
Copy link
Contributor Author

Added Monty and Serg to the discussion about

I just talked to Monty, he mentioned that the binlog is supposed to always
be first in the ha_info list.

@andrelkin
Copy link
Contributor Author

andrelkin commented Oct 8, 2025

I gave it some more thinking. Here is a summary of transaction::all::ha_list ordering state of affairs
and my suggestion how to proceed.
First of all, to remind the issue is incorrect binlogging on master spotted in a two transactions scenario with a specific partly non-transactional workload. When one of transactions rolls back it finds its binlog and engine branches ordered "unnaturally" (Binlog is not first in the transaction branches list) which ensues incorrect logging.

There exist few patterns when transaction::all::ha_list is (Engine_n, Binlog, Engine_n-1, ...) and which don't
have to be this way. Like

Begin;
  insert into t_innodb...;
  insert into t_aria ...;
Commit;

where it's (Aria, Binlog, Innodb). The precise "all" list at commit must be (Binlog, Innodb)
because the Aria statement is obviously outside the main transaction.
Its is clear how to improve these cases.

There also exist a harder pattern like mentioned two 2pc engines with one them
such that its updates are filtered out, on either master or slave, of binlog which leads to the binlog-in-the-middle list.
In order to handle the latter pattern I see only two main options available

  1. enforce binlog ha_info be always in the head of transaction->all.ha_list, or
  2. remember (or just search as the current server does) for it to execute its
    binlog_hton->"complete"() first in ha_{commit,rollback}_trans() loops.

If we go via p.1, we'll to fix ha_savepoint() etc, as this feature relies on
the prepared-only policy in growing the list.
If we go via p.2 we'd have to exempt the binlog ha_info from "complete"() inside the loop,
similarly to how it's currently done in commit_one_phase_2()

for (binlog_hton->"complete"() when present ; ha_info ; ha_info= ha_info->next)
    if (hton is not binlog)
         hton->complete()

I don't see why the 2nd, being clearly of lesser efforts, should be lesser favorable.

Dear reviewers, please respond with your ideas or voting.

@andrelkin
Copy link
Contributor Author

andrelkin commented Oct 17, 2025

Somewhat polished version of p.2 of my last comments is pushed to bb-10.11-andrei with the idea it to be 10.11+ version while bb-10.6-MDEV-37541 to cover 10.6 alone.

@vuvova
Copy link
Member

vuvova commented Oct 17, 2025

Disclaimer: I only looked at the first commit. For now I'm trying to understand a problem and what a solution should be, not to discuss a particular solution.

I see a few rather suspicious issues demonstrated by the first commit:

  • Why binlog starts a transaction for a memory table? It's not going to log anything into the trx cache. What does "transaction for memory table" conceptually mean in binlog? A bit more generally, to include Andrei's case 2, what does "transaction that is not going to be logged into the trx cache" mean?

  • There's no tc->prepare() call. Even though a transaction must be logged by a tc. This looks wrong. If a transaction needs to be written into binlog, it should go through tc->prepare() whether it's commit or rollback. Kristian wrote something similar above too.

@andrelkin
Copy link
Contributor Author

andrelkin commented Oct 17, 2025

The immediate issue is indeed that a Memory update caused the binlog branch registration at logging in ROW format and that's due to a legacy pre- (I did not track it further down) 9e62cf67b3cf commit that executes

if (in_multi_stmt_transaction_mode()))
       trans_register_ha(this, TRUE, binlog_hton);
     trans_register_ha(this, FALSE, binlog_hton);

So we do that for any engine modification. The practice introduced back in time may (does) need revising just as you and Kristian point.
As I mention above this dubious pattern extended further over engines like Aria that "all"-registers themselves too.

However as comments mention this issue is wider than how it appears here. See the {Rocks, Binlog, Innodb} Rocksdb's binlog-filtered updates last in transaction example.

@vuvova
Copy link
Member

vuvova commented Oct 17, 2025

binlog-filtered doesn't seem to matter, but, indeed, with two engines one can have binlog in the middle. Still, Kristian and my point stands — Rollback should do full 2pc if it needs to write something to binlog.

@andrelkin
Copy link
Contributor Author

andrelkin commented Oct 18, 2025

binlog-filtered doesn't seem to matter

:-(

I over-complicated it: {Innodb, Binlog, Rocksdb} is a natural order for two engine trx like

begin;
  insert into t_rocks  set a=1;
  insert into t_innodb set a=1;
rollback;

my point stands — Rollback should do full 2pc if it needs to write something to binlog.

For what reason actually?
If you mean crash-recovery of the transactional engine part, then past recovery either prepare()-d as you suggest or un-prepare()-d transaction must complete with rollback? The unprepared one always does so, regardless of what is in binlog.
The prepared one would have to be decided..

The policy of ROLLBACK-ended logging, no need to remind you, is merely to record side effects which of course are non-transactional by definition. And in that sense such logging is similar to any non-transactional logging.

PS. I read a key Kristian's note

use an explicit call into the binlog that does the operations needed

(rather than binlog_hton->rollback() that is) , as something similar to what has happened bo binlog_hton->commit().
Just we cannot "deprecate" the rollback method at least without changes around ha_rollback_to_savepoint.

@vuvova
Copy link
Member

vuvova commented Oct 20, 2025

What do you mean, for what reason, that's how 2pc works. if there's one participant that cannot do 2pc, it should be in the middle. That's why the protocol is what it is — every engine prepares, binlog commits (writes to binlog, it's tc->log call), then everyone commits.

As for rollbacks, there's normally no need to do that, every participant (including binlog) simply throws transaction changes away.

But a "rollback" when a non-trans engine is involved is not really a rollback. That engine has the changes committed. And binlog needs to commit (write transaction down persistently) too, so it's a commit, of a sort. And must be subjected to full 2pc.

{Innodb, Binlog, Rocksdb} case is not a problem, it'll be a true rollback and nothing will be written to binlog. It doesn't matter whether nothing will be written to binlog first or last.

@andrelkin
Copy link
Contributor Author

But a "rollback" when a non-trans engine is involved is not really a rollback. That engine has the changes committed.

What engine committed? None of the trans->all.ha_list does so.

As for rollbacks, there's normally no need to do that

just as you say the transaction is fine to rollback anyway it likes.

What happens is that along with the transactional changes there's a non-transactional modification which may be placed in the transactional cache. And the whole cache with ROLLBACK termination is logged to reflect that fact.
We might have designed instead a method to extract that modification from the cache and log it alone.
Obviously it need not any prepare().

{Innodb, Binlog, Rocksdb} case is not a problem

Not a problem without side effects. But we have ones in this bug... I should've mentioned that just in case in my last comments).

@vuvova
Copy link
Member

vuvova commented Oct 21, 2025

What engine committed? None of the trans->all.ha_list does so.

Aria. Or MyISAM. Any non-transactional engine. "Committed" in the sense that changes made between START TRANSACTION and ROLLBACK were made persistent.

Not a problem without side effects. But we have ones in this bug...

what problem do you have in this bug with {Innodb, Binlog, Rocksdb} case?

@andrelkin
Copy link
Contributor Author

Aria. Or MyISAM. Any non-transactional engine.

Right. So we can say it one-phase-committed.

We have two different ways to log such 1pc engine statement when they mingle a 2pc engine transaction.
And I have been pointing to that either the ROLLBACK-termination (e.g necessary for multi-table update) or
the second "standard" logging method ahead of the 2pc transaction (of a pure 1pc engine modification if it mingles the 2pc trx) need not any prepare by the 2pc trx.
It rolls back when it likes.
The ROLLBACK-termination can be conceived as if it's an extended standard method. Extended in the sense we log
some extra events whose properties only require roll-back wherever the whole event will be replayed.

what problem do you have in this bug with {Innodb, Binlog, Rocksdb} case?

When one of two 2pc transactions that depend on each other through Innodb engine, and the dependency parent is represented by the list of handlertons as { Innodb, Binlog, Rocksdb} also having a side effect, the 2 engine parent trx may end up in binlog as it happens in the bug. That is ROLLBACK-terminatedly logged after the child.
Indeed, Innodb branch is rolled back, the child Innodb updates and logs, the parent completes its entire work with rolling back Rocksdb and the logging.

@andrelkin
Copy link
Contributor Author

@vuvova , just one little thing, there is a detail that seems to have been left out

For now I'm trying to understand a problem

but may be critical to the understanding.

And it is that is not the Memory table that produces the side effect in the test. It's DROP TEMPORARY TABLE which causes
trans_cannot_safely_rollback() to claim
rollback is unsafe. The Memory table role is to "simulate" a 2nd 2pc-capanle engine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

7 participants