
[pull] master from torvalds:master #1874


Merged
merged 22 commits from torvalds:master into master on May 9, 2025

Conversation


@pull pull bot commented May 9, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

axboe and others added 22 commits May 4, 2025 09:15
There are a few spots where linked timeouts are armed, and not all of
them adhere to the pre-arm, attempt issue, post-arm pattern. This can
be problematic if the linked request returns that it will trigger a
callback later, and does so before the linked timeout is fully armed.

Consolidate all the linked timeout handling into __io_issue_sqe(),
rather than have it spread throughout the various issue entry points.

Cc: [email protected]
Link: axboe/liburing#1390
Reported-by: Chase Hiltz <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
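
A minimal sketch of the consolidated pre-arm/issue/post-arm flow this describes. The helper names follow io_uring's existing naming, but io_issue_sqe_op() is a stand-in for the opcode's issue handler and the overall shape is illustrative rather than the exact patch:

    /* Illustrative only: keep linked-timeout handling in one place. */
    static int __io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
    {
            struct io_kiocb *link = io_prep_linked_timeout(req); /* pre-arm */
            int ret;

            ret = io_issue_sqe_op(req, issue_flags);             /* attempt issue */

            /*
             * Post-arm: only start the timer once issue has returned, so a
             * request that completes via a callback cannot race with a
             * timeout that is still being set up.
             */
            if (link)
                    io_queue_linked_timeout(link);
            return ret;
    }
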
Add support for the Zcb extension's compressed half-word instructions
(C.LHU, C.LH, and C.SH) in the RISC-V misaligned access trap handler.

Signed-off-by: Zong Li <[email protected]>
Signed-off-by: Nylon Chen <[email protected]>
Fixes: 956d705 ("riscv: Unaligned load/store handling for M_MODE")
Reviewed-by: Alexandre Ghiti <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
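
A sketch of how one of the new halfword forms might be recognised in the emulation path, modeled on the style of the existing handler; the INSN_MATCH/INSN_MASK values below are placeholders, not the real Zcb encodings:

    /* Placeholder encodings, not the actual Zcb bit patterns. */
    #define INSN_MASK_C_LH  0xfc43
    #define INSN_MATCH_C_LH 0x8400

    if ((insn & INSN_MASK_C_LH) == INSN_MATCH_C_LH) {
            len = 2;                                   /* halfword access      */
            shift = 8 * (sizeof(unsigned long) - len); /* sign-extend the load */
            insn = RVC_RS2S(insn) << SH_RD;            /* map rd' to a full GPR */
    }
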
Some file systems, such as selinuxfs in this issue, do not support
read_iter/write_iter. So before calling these interfaces, first confirm
that they are supported.

This is relevant in that vfs_iter_read/write have this check, and removing
their use allowed syzbot to hit this issue.

Fixes: f2fed44 ("loop: stop using vfs_iter__{read,write} for buffered I/O")
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=6af973a3b8dfd2faefdc
Signed-off-by: Lizhi Xu <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
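
A minimal sketch of the sanity check being described, assuming it runs when the loop device is bound to its backing file:

    /*
     * Backing files on file systems without ->read_iter/->write_iter
     * (e.g. selinuxfs) cannot be used for buffered loop I/O.
     */
    if (!file->f_op->read_iter || !file->f_op->write_iter)
            return -EINVAL;
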
In case of a ZONE APPEND write, regardless of native ZONE APPEND or the
emulation layer in the zone write plugging code, the sector the data got
written to by the device needs to be updated in the bio.

At the moment, this is done for every native ZONE APPEND write and every
request that is flagged with 'BIO_ZONE_WRITE_PLUGGING'. But this
superfluously updates the sector for regular writes to a zoned block
device.

Check if a bio is a native ZONE APPEND write or if the bio is flagged as
'BIO_EMULATES_ZONE_APPEND', meaning the block layer's zone write plugging
code handles the ZONE APPEND and translates it into a regular write and
back. Only if one of these two criteria is met, update the sector in the
bio upon completion.

Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/dea089581cb6b777c1cd1500b38ac0b61df4b2d1.1746530748.git.jth@kernel.org
Signed-off-by: Jens Axboe <[email protected]>
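
A sketch of the narrowed completion-path update; the field names follow the block layer, but the exact placement is assumed:

    /*
     * Only ZONE APPEND writes (native, or emulated by the zone write
     * plugging code) need the sector the device wrote to copied back.
     */
    if (req_op(req) == REQ_OP_ZONE_APPEND ||
        bio_flagged(bio, BIO_EMULATES_ZONE_APPEND))
            bio->bi_iter.bi_sector = req->__sector;
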
The original nvme subsystem design didn't have a CONNECTING state; the
state machine allowed transitions from RESETTING to LIVE directly.

With the introduction of nvme fabrics, the CONNECTING state was
introduced. Over time, nvme-pci started to use the CONNECTING state as
well.

Eventually, a bug fix for nvme-fc made CONNECTING the only valid
transition to LIVE. However, this change didn't update the firmware
update handler, which still depended on the RESETTING to LIVE
transition.

The simplest way to address it for the time being is to switch into
CONNECTING state before going to LIVE state.

Fixes: d2fe192 ("nvme: only allow entering LIVE from CONNECTING state")
Reported-by: Guenter Roeck <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
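
A sketch of the workaround in the firmware activation path, using the standard state helper; the surrounding code is elided and the placement is assumed:

    /*
     * Firmware activation: a direct RESETTING -> LIVE transition is no
     * longer allowed, so pass through CONNECTING first.
     */
    if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
            dev_warn(ctrl->device, "failed to enter CONNECTING state\n");
            return;
    }
    nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE);
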
Multishot normally uses io_req_post_cqe() to post completions, but when
stopping it, it may finish up with a deferred completion. This is fine,
except if another multishot event triggers before the deferred completions
get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
as new multishot completions get posted before the deferred ones are
flushed. This can cause confusion on the application side, if strict
ordering is required for the use case.

When multishot posting via io_req_post_cqe(), flush any pending deferred
completions first, if any.

Cc: [email protected] # 6.1+
Reported-by: Norman Maurer <[email protected]>
Reported-by: Christian Mazakas <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
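
A sketch of the ordering fix; the deferred-completion list and flush helper follow io_uring naming, but the guard itself is illustrative:

    /*
     * In io_req_post_cqe(): flush deferred completions before posting a
     * new multishot CQE, so CQEs cannot be reordered in the CQ ring.
     */
    if (!wq_list_empty(&ctx->submit_state.compl_reqs))
            __io_submit_flush_completions(ctx);
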
In 'lookup_or_create_module_kobject()', an internal kobject is created
using 'module_ktype'. So a call to 'kobject_put()' on the error handling
path causes an attempt to use an uninitialized completion pointer in
'module_kobject_release()'. In this scenario, we just want to release
kobject without an extra synchronization required for a regular module
unloading process, so adding an extra check whether 'complete()' is
actually required makes 'kobject_put()' safe.

Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=7fb8a372e1f6add936dd
Fixes: 942e443 ("module: Fix mod->mkobj.kobj potentially freed too early")
Cc: [email protected]
Suggested-by: Petr Pavlu <[email protected]>
Signed-off-by: Dmitry Antipov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Petr Pavlu <[email protected]>
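
A sketch of the safer release callback; struct module_kobject and its completion pointer follow the description above, with details simplified:

    static void module_kobject_release(struct kobject *kobj)
    {
            struct module_kobject *mk = to_module_kobject(kobj);

            /*
             * Internal kobjects created by lookup_or_create_module_kobject()
             * never set kobj_completion, so only complete() when a regular
             * module unload is actually waiting on it.
             */
            if (mk->kobj_completion)
                    complete(mk->kobj_completion);
    }
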
Since both load/store and user/kernel accesses should use almost the same
path, and we are going to add some code around it, factor out the common
trap handling.

Signed-off-by: Clément Léger <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
We can safely reenable IRQs if coming from userspace. This allows
accessing user memory that could potentially trigger a page fault.

Fixes: b686ecd ("riscv: misaligned: Restrict user access to kernel memory")
Signed-off-by: Clément Léger <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
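
A minimal sketch of the idea, assuming the handler has the trapping pt_regs available:

    /*
     * Misaligned-access trap: when the fault came from userspace it is
     * safe to re-enable IRQs, which lets the fixup code touch user
     * memory that may itself fault.
     */
    if (user_mode(regs))
            local_irq_enable();
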
Now that we can safely handle user memory accesses while in the
misaligned access handlers, use get_user() instead of __get_user() to
have user memory access checks.

Signed-off-by: Clément Léger <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
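
For illustration, the change amounts to switching to the checked accessor (a sketch, not the handler's actual code):

    u16 val;

    /*
     * get_user() performs the access_ok() range check that __get_user()
     * skips, which matters now that the handler can see user addresses.
     */
    if (get_user(val, (u16 __user *)addr))
            return -EFAULT;
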
When userspace does PR_SET_TAGGED_ADDR_CTRL, but the Supm extension is not
available, the kernel crashes:

Oops - illegal instruction [#1]
    [snip]
epc : set_tagged_addr_ctrl+0x112/0x15a
 ra : set_tagged_addr_ctrl+0x74/0x15a
epc : ffffffff80011ace ra : ffffffff80011a30 sp : ffffffc60039be10
    [snip]
status: 0000000200000120 badaddr: 0000000010a79073 cause: 0000000000000002
    set_tagged_addr_ctrl+0x112/0x15a
    __riscv_sys_prctl+0x352/0x73c
    do_trap_ecall_u+0x17c/0x20c
    handle_exception+0x150/0x15c

Fix it by checking if Supm is available.

Fixes: 09d6775 ("riscv: Add support for userspace pointer masking")
Signed-off-by: Nam Cao <[email protected]>
Cc: [email protected]
Reviewed-by: Samuel Holland <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
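
A sketch of the guard, using the RISC-V extension-query helper; the exact extension identifier for Supm is assumed from the pointer masking series:

    /*
     * In set_tagged_addr_ctrl(): bail out before touching senvcfg when
     * the pointer masking (Supm) extension is not implemented, instead
     * of taking an illegal instruction trap.
     */
    if (!riscv_has_extension_unlikely(RISCV_ISA_EXT_SUPM))
            return -EINVAL;
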
The .rela.dyn section contains runtime relocations and is only emitted
for a relocatable kernel.

riscv uses this section to relocate the kernel at runtime but that section
is stripped from vmlinux. That prevents kexec from successfully loading
vmlinux since it does not contain the relocation info needed.

Fixes: 559d1e4 ("riscv: Use --emit-relocs in order to move .rela.dyn in init")
Tested-by: Björn Töpel <[email protected]>
Reviewed-by: Björn Töpel <[email protected]>
Acked-by: Ard Biesheuvel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
When the prctl() interface for pointer masking was added, it did not
check that the pointer masking ISA extension was supported, only the
individual submodes. Userspace could still attempt to disable pointer
masking and query the pointer masking state. commit 81de1af
("riscv: Fix kernel crash due to PR_SET_TAGGED_ADDR_CTRL") disallowed
the former, as the senvcfg write could crash on older systems.
PR_GET_TAGGED_ADDR_CTRL does not crash, because it reads only
kernel-internal state and not senvcfg, but it should still be disallowed
for consistency.

Fixes: 09d6775 ("riscv: Add support for userspace pointer masking")
Signed-off-by: Samuel Holland <[email protected]>
Reviewed-by: Nam Cao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexandre Ghiti <[email protected]>
Ever since commit eca2040 ("scsi: block: ioprio: Clean up interface
definition"), the macro IOPRIO_PRIO_LEVEL() masks the level value to a
value between 0 and 7, so level is necessarily always lower than
IOPRIO_NR_LEVELS (8).

Remove this obsolete check.

Reported-by: Kexin Wei <[email protected]>
Cc: Damien Le Moal <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Link: https://lore.kernel.org/r/20250508083018.GA769554@bytedance
Signed-off-by: Jens Axboe <[email protected]>
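
For reference, a simplified rendering of why the removed check could never trigger; this is sketched after include/uapi/linux/ioprio.h and the exact spellings may differ:

    #define IOPRIO_LEVEL_NR_BITS        3
    #define IOPRIO_NR_LEVELS            (1 << IOPRIO_LEVEL_NR_BITS)  /* 8   */
    #define IOPRIO_LEVEL_MASK           (IOPRIO_NR_LEVELS - 1)       /* 0x7 */
    #define IOPRIO_PRIO_LEVEL(ioprio)   ((ioprio) & IOPRIO_LEVEL_MASK)

    /* level = IOPRIO_PRIO_LEVEL(ioprio) is always < IOPRIO_NR_LEVELS. */
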
… block-6.15

Pull NVMe fix from Christoph:

"nvme fixes for linux 6.15

 - unblock ctrl state transition for firmware update (Daniel Wagner)"

* tag 'nvme-6.15-2025-05-08' of git://git.infradead.org/nvme:
  nvme: unblock ctrl state transition for firmware update
…/linux/kernel/git/alexghiti/linux into fixes

riscv fixes for 6.15-rc6

- A fix to handle misaligned accesses for compressed halfword load/store instructions
- A fix to allow user memory access while handling a misaligned access
- 2 fixes to return an error if the pointer masking extension is not implemented on the platform but userspace still tries to access it, which caused an oops on some early platforms
- A fix to prevent the stripping of .rela.dyn so that a vmlinux loaded by kexec can successfully boot

* tag 'riscv-fixes-6.15-rc6' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/alexghiti/linux:
  riscv: Disallow PR_GET_TAGGED_ADDR_CTRL without Supm
  scripts: Do not strip .rela.dyn section
  riscv: Fix kernel crash due to PR_SET_TAGGED_ADDR_CTRL
  riscv: misaligned: use get_user() instead of __get_user()
  riscv: misaligned: enable IRQs while handling misaligned accesses
  riscv: misaligned: factorize trap handling
  riscv: misaligned: Add handling for ZCB instructions
Our QA team reported a 10%-23% throughput reduction on an io_uring
sqpoll testcase doing IO to a null_blk, which I traced back to a
reduction of the device submission queue depth utilization. It turns out
that, after commit af5d68f ("io_uring/sqpoll: manage task_work
privately"), we capped the number of task_work entries that can be
completed from a single spin of sqpoll to only 8 entries, before the
sqpoll goes around to (potentially) sleep.  While this cap doesn't drive
the submission side directly, it impacts the completion behavior, which
affects the number of IOs queued by fio per sqpoll cycle on the
submission side, and io_uring ends up seeing fewer IOs per sqpoll cycle.
As a result, block layer plugging is less effective, and we see more
time spent inside the block layer in profiling charts, and increased
submission latency measured by fio.

There are other places that have increased overhead once sqpoll sleeps
more often, such as the sqpoll utilization calculation.  But, in this
microbenchmark, those were not representative enough in perf charts, and
their removal didn't yield measurable changes in throughput.  The major
overhead comes from the fact we plug less, and less often, when submitting
to the block layer.

My benchmark is:

fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
    --invalidate=1 --time_based  --ramp_time=10 --group_reporting=1 \
    --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
    --rw=randread --numjobs=1 --sqthread_poll

On one machine, tested on top of Linux 6.15-rc1, we have the following
baseline:
  READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec

With this patch:
  READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec

which is a 15% improvement in measured bandwidth.  The average
submission latency is noticeably lowered too.  As measured by
fio:

Baseline:
   lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
   lat (usec): min=26, max=226, avg=86.48, stdev=4.82

If we look at blktrace, we can also see the plugging behavior is
improved. In the baseline, we end up limited to plugging 8 requests in
the block layer regardless of the device queue depth size, while after
patching we can drive more io, and we manage to utilize the full device
queue.

In the baseline, after a stabilization phase, an ordinary submission
looks like:
  254,0    1    49942     0.016028795  5977  U   N [iou-sqp-5976] 7

After patching, I see consistently more requests per unplug.
  254,0    1     4996     0.001432872  3145  U   N [iou-sqp-3144] 32

Ideally, the cap size would at least be deep enough to fill the
device queue, but we can't predict that behavior, or assume all IO goes
to a single device, and thus can't guess the ideal batch size.  We also
don't want to let the tw run unbounded, though I'm not sure it would
really be a problem.  Instead, let's just give it a more sensible value
that will allow for more efficient batching.  I've tested with different
cap values, and initially proposed to increase the cap to 1024.  Jens
argued it is too big of a bump and I observed that, with 32, I'm no
longer able to observe this bottleneck in any of my machines.

Fixes: af5d68f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
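
The change itself is essentially one constant; a sketch with the symbol name assumed:

    /*
     * io_uring/sqpoll.c: task_work entries handled per SQPOLL iteration.
     * 8 starved block-layer plugging; 32 removes the bottleneck seen in
     * the benchmark above without letting task_work run unbounded.
     */
    #define IORING_TW_CAP_ENTRIES_PER_BATCH   32
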
tl;dr: There is a window in the mm switching code where the new CR3 is
set and the CPU should be getting TLB flushes for the new mm.  But
should_flush_tlb() has a bug and suppresses the flush.  Fix it by
widening the window where should_flush_tlb() sends an IPI.

Long Version:

=== History ===

There were a few things leading up to this.

First, updating mm_cpumask() was observed to be too expensive, so it was
made lazier.  But being lazy caused too many unnecessary IPIs to CPUs
due to the now-lazy mm_cpumask().  So code was added to cull
mm_cpumask() periodically[2].  But that culling was a bit too aggressive
and skipped sending TLB flushes to CPUs that need them.  So here we are
again.

=== Problem ===

The too-aggressive code in should_flush_tlb() strikes in this window:

	// Turn on IPIs for this CPU/mm combination, but only
	// if should_flush_tlb() agrees:
	cpumask_set_cpu(cpu, mm_cpumask(next));

	next_tlb_gen = atomic64_read(&next->context.tlb_gen);
	choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
	load_new_mm_cr3(need_flush);
	// ^ After 'need_flush' is set to false, IPIs *MUST*
	// be sent to this CPU and not be ignored.

        this_cpu_write(cpu_tlbstate.loaded_mm, next);
	// ^ Not until this point does should_flush_tlb()
	// become true!

should_flush_tlb() will suppress TLB flushes between load_new_mm_cr3()
and writing to 'loaded_mm', which is a window where they should not be
suppressed.  Whoops.

=== Solution ===

Thankfully, the fuzzy "just about to write CR3" window is already marked
with loaded_mm==LOADED_MM_SWITCHING.  Simply checking for that state in
should_flush_tlb() is sufficient to ensure that the CPU is targeted with
an IPI.

This will cause more TLB flush IPIs.  But the window is relatively small
and I do not expect this to cause any kind of measurable performance
impact.

Update the comment where LOADED_MM_SWITCHING is written since it grew
yet another user.

Peter Z also raised a concern that should_flush_tlb() might not observe
'loaded_mm' and 'is_lazy' in the same order that switch_mm_irqs_off()
writes them.  Add a barrier to ensure that they are observed in the
order they are written.

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Link: https://lore.kernel.org/oe-lkp/[email protected]/ [1]
Fixes: 6db2526 ("x86/mm/tlb: Only trim the mm_cpumask once a second") [2]
Reported-by: Stephen Dolan <[email protected]>
Cc: [email protected]
Acked-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
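
A sketch of the widened check; the signature, the flush_tlb_info argument, and the simplified return logic are assumptions based on the description above:

    static bool should_flush_tlb(int cpu, void *data)
    {
            struct flush_tlb_info *info = data;
            struct mm_struct *loaded_mm =
                    READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu));

            /*
             * The CPU may already be running on the new CR3 while still
             * showing LOADED_MM_SWITCHING, so it must get the flush IPI.
             */
            if (loaded_mm == LOADED_MM_SWITCHING)
                    return true;

            return loaded_mm == info->mm;
    }
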
…rnel/git/modules/linux

Pull modules fix from Petr Pavlu:
 "A single fix to prevent use of an uninitialized completion pointer
  when releasing a module_kobject in specific situations.

  This addresses a latent bug exposed by commit f95bbfe ("drivers:
  base: handle module_kobject creation")"

* tag 'modules-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
  module: ensure that kobject_put() is safe for module type kobjects
Pull io_uring fixes from Jens Axboe:

 - Fix for linked timeouts arming and firing wrt prep and issue of the
   request being managed by the linked timeout

 - Fix for a CQE ordering issue between requests with multishot and
   using the same buffer group. This is a dumbed down version for this
   release and for stable, it'll get improved for v6.16

 - Tweak the SQPOLL submit batch size. A previous commit made SQPOLL
   manage its own task_work and chose a tiny batch size, bump it from 8
   to 32 to fix a performance regression due to that

* tag 'io_uring-6.15-20250509' of git://git.kernel.dk/linux:
  io_uring/sqpoll: Increase task_work submission batch size
  io_uring: ensure deferred completions are flushed for multishot
  io_uring: always arm linked timeouts prior to issue
Pull block fixes from Jens Axboe:

 - Fix for a regression in this series for loop and read/write iterator
   handling

 - zone append block update tweak

 - remove a broken IO priority test

 - NVMe pull request via Christoph:
      - unblock ctrl state transition for firmware update (Daniel
        Wagner)

* tag 'block-6.15-20250509' of git://git.kernel.dk/linux:
  block: remove test of incorrect io priority level
  nvme: unblock ctrl state transition for firmware update
  block: only update request sector if needed
  loop: Add sanity check for read/write_iter
…linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

 - The compressed half-word misaligned access instructions (c.lhu, c.lh,
   and c.sh) from the Zcb extension are now properly emulated

 - A series of fixes to properly emulate permissions while handling
   userspace misaligned accesses

 - A pair of fixes for PR_GET_TAGGED_ADDR_CTRL to avoid accessing the
   envcfg CSR on systems that don't support that CSR, and to report
   those failures up to userspace

 - The .rela.dyn section is no longer stripped from vmlinux, as it is
   necessary to relocate the kernel under some conditions (including
   kexec)

* tag 'riscv-for-linus-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Disallow PR_GET_TAGGED_ADDR_CTRL without Supm
  scripts: Do not strip .rela.dyn section
  riscv: Fix kernel crash due to PR_SET_TAGGED_ADDR_CTRL
  riscv: misaligned: use get_user() instead of __get_user()
  riscv: misaligned: enable IRQs while handling misaligned accesses
  riscv: misaligned: factorize trap handling
  riscv: misaligned: Add handling for ZCB instructions
@pull pull bot added the ⤵️ pull label May 9, 2025
@pull pull bot merged commit 3013c33 into AwesomeGitHubRepos:master May 9, 2025