forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 3
[pull] master from torvalds:master #1837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
md_account_bio() is not called from raid10_handle_discard(), now that we handle bitmap inside md_account_bio(), also fix missing bitmap_startwrite for discard. Test whole disk discard for 20G raid10: Before: Device d/s dMB/s drqm/s %drqm d_await dareq-sz md0 48.00 16.00 0.00 0.00 5.42 341.33 After: Device d/s dMB/s drqm/s %drqm d_await dareq-sz md0 68.00 20462.00 0.00 0.00 2.65 308133.65 Link: https://lore.kernel.org/linux-raid/[email protected] Fixes: 528bc2c ("md/raid10: enable io accounting") Signed-off-by: Yu Kuai <[email protected]> Acked-by: Coly Li <[email protected]>
The bitmap_get_stats() function incorrectly returns -ENOENT for external bitmaps. Remove the external bitmap check as the statistics should be available regardless of bitmap storage location. Return -EINVAL only for invalid bitmap with no storage (neither in superblock nor in external file). Note: "bitmap_info.external" here refers to a bitmap stored in a separate file (bitmap_file), not to external metadata. Fixes: 8d28d0d ("md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime") Signed-off-by: Zheng Qixing <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
Add an SPDX license identifier line to blk-throttle.h Use 'GPL-2.0' as the identifier, since blk-throttle.c uses that, and blk.h (from which some material was copied when blk-throttle.h was created) also uses that identifier. Signed-off-by: Tim Bird <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/MW5PR13MB5632EE4645BCA24ED111EC0EFDB62@MW5PR13MB5632.namprd13.prod.outlook.com Signed-off-by: Jens Axboe <[email protected]>
When registering a queue fails after blk_mq_sysfs_register() is successful but the function later encounters an error, we need to clean up the blk_mq_sysfs resources. Add the missing blk_mq_sysfs_unregister() call in the error path to properly clean up these resources and prevent a memory leak. Fixes: 320ae51 ("blk-mq: new multi-queue block IO queueing mechanism") Signed-off-by: Zheng Qixing <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
krb_authenticate frees sess->user and does not set the pointer to NULL. It calls ksmbd_krb5_authenticate to reinitialise sess->user but that function may return without doing so. If that happens then smb2_sess_setup, which calls krb_authenticate, will be accessing free'd memory when it later uses sess->user. Cc: [email protected] Signed-off-by: Sean Heelan <[email protected]> Acked-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
wait_event_timeout() will set the state of the current task to TASK_UNINTERRUPTIBLE, before doing the condition check. This means that ksmbd_durable_scavenger_alive() will try to acquire the mutex while already in a sleeping state. The scheduler warns us by giving the following warning: do not call blocking ops when !TASK_RUNNING; state=2 set at [<0000000061515a6f>] prepare_to_wait_event+0x9f/0x6c0 WARNING: CPU: 2 PID: 4147 at kernel/sched/core.c:10099 __might_sleep+0x12f/0x160 mutex lock is not needed in ksmbd_durable_scavenger_alive(). Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
Move tcp_transport free to ksmbd_conn_free. If ksmbd connection is referenced when ksmbd server thread terminates, It will not be freed, but conn->tcp_transport is freed. __smb2_lease_break_noti can be performed asynchronously when the connection is disconnected. __smb2_lease_break_noti calls ksmbd_conn_write, which can cause use-after-free when conn->ksmbd_transport is already freed. Cc: [email protected] Reported-by: Norbert Szetei <[email protected]> Tested-by: Norbert Szetei <[email protected]> Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
There is a room in smb_break_all_levII_oplock that can cause racy issues when unlocking in the middle of the loop. This patch use read lock to protect whole loop. Cc: [email protected] Reported-by: Norbert Szetei <[email protected]> Tested-by: Norbert Szetei <[email protected]> Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
[ 2110.972290] ------------[ cut here ]------------ [ 2110.972301] WARNING: CPU: 3 PID: 735 at fs/read_write.c:599 __kernel_write_iter+0x21b/0x280 This patch doesn't allow writing to directory. Cc: [email protected] Reported-by: Norbert Szetei <[email protected]> Tested-by: Norbert Szetei <[email protected]> Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
The user can set any value for 'deadtime'. This affects the arithmetic expression 'req->deadtime * SMB_ECHO_INTERVAL', which is subject to overflow. The added check makes the server behavior more predictable. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 0626e66 ("cifsd: add server handler for central processing and tranport layers") Cc: [email protected] Signed-off-by: Denis Arefev <[email protected]> Acked-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
IORING_OP_RECV_ZC requests take a zcrx object id via sqe::zcrx_ifq_idx, which binds it to the corresponding if / queue. However, we don't return that id back to the user. It's fine as currently there can be only one zcrx and the user assumes that its id should be 0, but as we'll need multiple zcrx objects in the future let's explicitly pass it back on registration. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/8714667d370651962f7d1a169032e5f02682a73e.1744722517.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
It'll likely change how page pools store memory providers, so in preparation for that, keep accesses in one place in io_uring by introducing a helper. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/3522eb8fa9b4e21bcf32e7e9ae656c616b282210.1744722526.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
Set cmd->iocb.ki_ioprio to the ioprio of loop device's request. The purpose is to inherit the original request ioprio in the aio flow. Signed-off-by: Yunlong Xing <[email protected]> Signed-off-by: Zhiguo Niu <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
The original commit message and the wording "uncork" in the code comment indicate that it is expected that the suppressed event instances are automatically sent after unsuppressing. This is not the case, instead they are discarded. In effect this means that no "changed" events are emitted on the device itself by default. While each discovered partition does trigger a changed event on the device, devices without partitions don't have any event emitted. This makes udev miss the device creation and prompted workarounds in userspace. See the linked util-linux/losetup bug. Explicitly emit the events and drop the confusingly worded comments. Link: util-linux/util-linux#2434 Fixes: 498ef5c ("loop: suppress uevents while reconfiguring the device") Cc: [email protected] Signed-off-by: Thomas Weißschuh <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Remove the suppression of the uevents before scanning for partitions. The partitions inherit their suppression settings from their parent device, which lead to the uevents being dropped. This is similar to the same changes for LOOP_CONFIGURE done in commit bb430b6 ("loop: LOOP_CONFIGURE: send uevents for partitions"). Fixes: 498ef5c ("loop: suppress uevents while reconfiguring the device") Cc: [email protected] Signed-off-by: Thomas Weißschuh <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
vfs_iter_{read,write} always perform direct I/O when the file has the O_DIRECT flag set, which breaks disabling direct I/O using the LOOP_SET_STATUS / LOOP_SET_STATUS64 ioctls. This was recenly reported as a regression, but as far as I can tell was only uncovered by better checking for block sizes and has been around since the direct I/O support was added. Fix this by using the existing aio code that calls the raw read/write iter methods instead. Note that despite the comments there is no need for block drivers to ever call flush_dcache_page themselves, and the call is a left-over from prehistoric times. Fixes: ab1cb27 ("block: loop: introduce ioctl command of LOOP_SET_DIRECT_IO") Reported-by: Darrick J. Wong <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Ming Lei <[email protected]> Tested-by: Darrick J. Wong <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
The CONFIG_BLK_DEV_UBLK help text suggests setting the config option to Y so task_work_add() can be used to dispatch I/O, improving performance. However, this mechanism was removed in commit 29dc5d0 ("ublk: kill queuing request by task_work_add"). So remove this paragraph from the config help text. Signed-off-by: Caleb Sander Mateos <[email protected]> Reviewed-by: Uday Shankar <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Commit 62baf70 caused the ANA log page to be re-read, even on controllers that do not support ANA. While this should generally harmless, some controllers hang on the unsupported log page and never finish probing. Fixes: 62baf70 ("nvme: re-read ANA log page after ns scan completes") Signed-off-by: Hannes Reinecke <[email protected]> Tested-by: Srikanth Aithal <[email protected]> [hch: more detailed commit message] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]>
When rapidly rescanning for new namespaces nvme_mpath_add_sysfs_link() may be called for a block device not added to sysfs. But NVME_NS_SYSFS_ATTR_LINK had already been set, so when checking this device a second time we will fail to create the link. Fix this by exchanging the order of the block device check and the NVME_NS_SYSFS_ATTR_LINK bit check. Fixes: 4dbd2b2 ("nvme-multipath: Add visibility for round-robin io-policy") Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]>** Reviewed-by: Nilay Shroff <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
When compiling with C=1, the following sparse warning is generated: auth.c:243:23: warning: Using plain integer as NULL pointer Avoid this warning by using NULL to instead of 0 to set the sq tls_key pointer. Fixes: fa2e0f8 ("nvmet-tcp: support secure channel concatenation") Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
For a command that is normally processed through the command request execute() function, the completion entry for the command is initialized by __nvmet_req_complete() and nvmet_pci_epf_cq_work() only needs to set the status field and the phase of the completion entry before posting the entry to the completion queue. However, for commands that are failed due to an internal error (e.g. the command data buffer allocation fails), the command request execute() function is not called and __nvmet_req_complete() is never executed for the command, leaving the command completion entry uninitialized. For such command failed before calling req->execute(), the host ends up seeing completion entries with an invalid submission queue ID and command ID. Avoid such issue by always fully initilizing a command completion entry in nvmet_pci_epf_cq_work(), setting the entry submission queue head, ID and command ID. Fixes: 0faa0fe ("nvmet: New NVMe PCI endpoint function target driver") Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Niklas Cassel <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
When a host shuts down the controller when shutting down but does so without first disabling the controller, the enable bit remains set in the controller configuration register. When the host restarts and attempts to enable the controller again, the nvmet_pci_epf_poll_cc_work() function is unable to detect the change from 0 to 1 of the enable bit, and thus the controller is not enabled again, which result in a device scan timeout on the host. This problem also occurs if the host shuts down uncleanly or if the PCIe link goes down: as the CC.EN value is not reset, the controller is not enabled again when the host restarts. Fix this by introducing the function nvmet_pci_epf_clear_ctrl_config() to clear the CC and CSTS registers of the controller when the PCIe link is lost (nvmet_pci_epf_stop_ctrl() function), or when starting the controller fails (nvmet_pci_epf_enable_ctrl() fails). Also use this function in nvmet_pci_epf_init_bar() to simplify the initialization of the CC and CSTS registers. Furthermore, modify the function nvmet_pci_epf_disable_ctrl() to clear the CC.EN bit and write this updated value to the BAR register when the controller is shutdown by the host, to ensure that upon restart, we can detect the host setting CC.EN. Fixes: 0faa0fe ("nvmet: New NVMe PCI endpoint function target driver") Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Niklas Cassel <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
Since the link_up boolean field of struct nvmet_pci_epf_ctrl is always set to true when nvmet_pci_epf_start_ctrl() is called, assign true to this field in nvmet_pci_epf_start_ctrl(). Conversely, since this field is set to false when nvmet_pci_epf_stop_ctrl() is called, set this field to false directly inside that function. While at it, also add information messages to notify the user of the PCI link state changes to help troubleshoot any link stability issues without needing to enable debug messages. Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Niklas Cassel <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
During recovery/check operations, the process_checks function loops through available disks to find a 'primary' source with successfully read data. If no suitable source disk is found after checking all possibilities, the 'primary' index will reach conf->raid_disks * 2. Add an explicit check for this condition after the loop. If no source disk was found, print an error message and return early to prevent further processing without a valid primary source. Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Meir Elisha <[email protected]> Suggested-and-reviewed-by: Yu Kuai <[email protected]> Signed-off-by: Yu Kuai <[email protected]>
Placing multiple protection information buffers inside the same page can lead to oopses because set_page_dirty_lock() can't be called from interrupt context. Since a protection information buffer is not backed by a file there is no point in setting its page dirty, there is nothing to synchronize. Drop the call to set_page_dirty_lock() and remove the last argument to bio_integrity_unpin_bvec(). Cc: [email protected] Fixes: 492c5d4 ("block: bio-integrity: directly map user buffers") Signed-off-by: Martin K. Petersen <[email protected]> Reviewed-by: Keith Busch <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Bounds check for iterator variable `i` is missed, so add it and fix ublk_find_tgt(). Cc: Johannes Thumshirn <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Add io_uring UAPI header so that ublk can work with latest uapi definition. Fix the following build failure: stripe.c: In function ‘stripe_to_uring_op’: stripe.c:120:29: error: ‘IORING_OP_READV_FIXED’ undeclared (first use in this function); did you mean ‘IORING_OP_READ_FIXED’? 120 | return zc ? IORING_OP_READV_FIXED : IORING_OP_READV; | ^~~~~~~~~~~~~~~~~~~~~ | IORING_OP_READ_FIXED Reviewed-by: Johannes Thumshirn <[email protected]> Fixes: 57ed58c ("selftests: ublk: enable zero copy for stripe target") Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Use global array of $UBLK_BACKFILES for storing all backfile name, then clean them automatically. Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Detach ublk daemon from the starting process completely by double-fork and clearing its process group, so that `_add_ublk_dev` can return from sub-shell. Then it is more friendly for writing shell test script for adding/recovering ublk device. Prepare for running ublk test in parallel. Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Run stress tests in parallel, meantime add shell local function to simplify the two stress tests. Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Add stress_03 & stress_04 for covering zero copy feature. Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
…TUP_DEFER_TASKRUN It is observed that this way is more efficient for fast nvme backing file. Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
In NUMA machine, ublk IO performance is very sensitive with queue pthread's affinity setting. Retrieve queue's affinity and select the 1st cpu as queue thread's sched affinity, and it is observed that single cpu task affinity can get stable & good performance if client application is put on proper cpu. Dump this info when adding one ublk device. Use shmem to communicate queue's tid between parent and daemon. Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Increase max nr_queues to 32, and queue depth to 1024. Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Support target specific command line for making related command line code handling more readable & clean. Also helps for adding new features. Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Add user recovery feature. Meantime add user recovery test: generic_04 and generic_05(zero copy) Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Add test_stress_05.sh for covering removing device with recovery enabled. io-hang has been observed with the following patch: https://lore.kernel.org/linux-block/[email protected]/ Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
test may exit early because of missing program or not having required feature before calling _prep_test(), then $UBLK_TMP isn't cleaned. Fix it by moving creating $UBLK_TMP into _prep_test(), any resources created since _prep_test() will be cleaned by _cleanup_test(). Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Most uring_cmds issued against ublk character devices are serialized because each command affects only one queue, and there is an early check which only allows a single task (the queue's ubq_daemon) to issue uring_cmds against that queue. However, this mechanism does not work for FETCH_REQs, since they are expected before ubq_daemon is set. Since FETCH_REQs are only used at initialization and not in the fast path, serialize them using the per-ublk-device mutex. This fixes a number of data races that were previously possible if a badly behaved ublk server decided to issue multiple FETCH_REQs against the same qid/tag concurrently. Reported-by: Caleb Sander Mateos <[email protected]> Signed-off-by: Uday Shankar <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Add ublk_force_abort_dev() for handling ublk_nosrv_dev_should_queue_io() in ublk_stop_dev(). Then queue quiesce and unquiesce can be paired in single function. Meantime not change device state to QUIESCED any more, since the disk is going to be removed soon. Reviewed-by: Uday Shankar <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
…e_io Now ublk deals with ublk_nosrv_dev_should_queue_io() by keeping request queue as quiesced. This way is fragile because queue quiesce crosses syscalls or process contexts. Switch to rely on ubq->canceling for dealing with ublk_nosrv_dev_should_queue_io(), because it has been used for this purpose during io_uring context exiting, and it can be reused before recovering too. In ublk_queue_rq(), the request will be added to requeue list without kicking off requeue in case of ubq->canceling, and finally requests added in requeue list will be dispatched from either ublk_stop_dev() or ublk_ctrl_end_recovery(). Meantime we have to move reset of ubq->canceling from ublk_ctrl_start_recovery() to ublk_ctrl_end_recovery(), when IO handling can be recovered completely. Then blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() are always used in same context. Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Uday Shankar <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
ublk_ch_release() is called after ublk char device is closed, when all uring_cmd are done, so it is perfect fine to move ublk device reset to ublk_ch_release() from ublk_ctrl_start_recovery(). This way can avoid to grab the exiting daemon task_struct too long. However, reset of the following ublk IO flags has to be moved until ublk io_uring queues are ready: - ubq->canceling For requeuing IO in case of ublk_nosrv_dev_should_queue_io() before device is recovered - ubq->fail_io For failing IO in case of UBLK_F_USER_RECOVERY_FAIL_IO before device is recovered - ublk_io->flags For preventing using io->cmd With this way, recovery is simplified a lot. Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
There are currently two ways in which ublk server exit is detected by ublk_drv: 1. uring_cmd cancellation. If there are any outstanding uring_cmds which have not been completed to the ublk server when it exits, io_uring calls the uring_cmd callback with a special cancellation flag as the issuing task is exiting. 2. I/O timeout. This is needed in addition to the above to handle the "saturated queue" case, when all I/Os for a given queue are in the ublk server, and therefore there are no outstanding uring_cmds to cancel when the ublk server exits. There are a couple of issues with this approach: - It is complex and inelegant to have two methods to detect the same condition - The second method detects ublk server exit only after a long delay (~30s, the default timeout assigned by the block layer). This delays the nosrv behavior from kicking in and potential subsequent recovery of the device. The second issue is brought to light with the new test_generic_06 which will be added in following patch. It fails before this fix: selftests: ublk: test_generic_06.sh dev id is 0 dd: error writing '/dev/ublkb0': Input/output error 1+0 records in 0+0 records out 0 bytes copied, 30.0611 s, 0.0 kB/s DEAD dd took 31 seconds to exit (>= 5s tolerance)! generic_06 : [FAIL] Fix this by instead detecting and handling ublk server exit in the character file release callback. This has several advantages: - This one place can handle both saturated and unsaturated queues. Thus, it replaces both preexisting methods of detecting ublk server exit. - It runs quickly on ublk server exit - there is no 30s delay. - It starts the process of removing task references in ublk_drv. This is needed if we want to relax restrictions in the driver like letting only one thread serve each queue There is also the disadvantage that the character file release callback can also be triggered by intentional close of the file, which is a significant behavior change. Preexisting ublk servers (libublksrv) are dependent on the ability to open/close the file multiple times. To address this, only transition to a nosrv state if the file is released while the ublk device is live. This allows for programs to open/close the file multiple times during setup. It is still a behavior change if a ublk server decides to close/reopen the file while the device is LIVE (i.e. while it is responsible for serving I/O), but that would be highly unusual. This behavior is in line with what is done by FUSE, which is very similar to ublk in that a userspace daemon is providing services traditionally provided by the kernel. With this change in, the new test (and all other selftests, and all ublksrv tests) pass: selftests: ublk: test_generic_06.sh dev id is 0 dd: error writing '/dev/ublkb0': Input/output error 1+0 records in 0+0 records out 0 bytes copied, 0.0376731 s, 0.0 kB/s DEAD generic_04 : [PASS] Signed-off-by: Uday Shankar <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Remove __ublk_quiesce_dev() and open code for updating device state as QUIESCED. We needn't to drain inflight requests in __ublk_quiesce_dev() any more, because all inflight requests are aborted in ublk char device release handler. Also we needn't to set ->canceling in __ublk_quiesce_dev() any more because it is done unconditionally now in ublk_ch_release(). Reviewed-by: Uday Shankar <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Now ublk_abort_queue() is moved to ublk char device release handler, meantime our request queue is "quiesced" because either ->canceling was set from uring_cmd cancel function or all IOs are inflight and can't be completed by ublk server, things becomes easy much: - all uring_cmd are done, so we needn't to mark io as UBLK_IO_FLAG_ABORTED for handling completion from uring_cmd - ublk char device is closed, no one can hold IO request reference any more, so we can simply complete this request or requeue it for ublk_nosrv_should_reissue_outstanding. Reviewed-by: Uday Shankar <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Add one simple fault inject target, and verify if an application using ublk device sees an I/O error quickly after the ublk server dies. Signed-off-by: Uday Shankar <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
…kernel/git/mdraid/linux into block-6.15 Pull MD fixes from Yu: "- fix raid10 missing discard IO accounting (Yu Kuai) - fix bitmap stats for bitmap file (Zheng Qixing) - fix oops while reading all member disks failed during check/repair (Meir Elisha)" * tag 'md-6.15-20250416' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux: md/raid1: Add check for missing source disk in process_checks() md/md-bitmap: fix stats collection for external bitmaps md/raid10: fix missing discard IO accounting
… block-6.15 Pull NVMe fixes from Christoph: "nvme fixes for Linux 6.15 - fix scan failure for non-ANA multipath controllers (Hannes Reinecke) - fix multipath sysfs links creation for some cases (Hannes Reinecke) - PCIe endpoint fixes (Damien Le Moal) - use NULL instead of 0 in the auth code (Damien Le Moal)" * tag 'nvme-6.15-2025-04-17' of git://git.infradead.org/nvme: nvmet: pci-epf: cleanup link state management nvmet: pci-epf: clear CC and CSTS when disabling the controller nvmet: pci-epf: always fully initialize completion entries nvmet: auth: use NULL to clear a pointer in nvmet_auth_sq_free() nvme-multipath: sysfs links may not be created for devices nvme: fixup scan failure for non-ANA multipath controllers
Don't optimise for requests with offset=0. Large registered buffers are the preference and hence the user is likely to pass an offset, and the adjustments are not expensive and will be made even cheaper in following patches. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/1c2beb20470ee3c886a363d4d8340d3790db19f3.1744882081.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
Kernel registered buffers are special because segments are not uniform in size, and we have a bunch of optimisations based on that uniformity for normal buffers. Handle kbuf separately, it'll be cleaner this way. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/4e9e5990b0ab5aee723c0be5cd9b5bcf810375f9.1744882081.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
io_import_fixed is a mess. Even though we know the final len of the iterator, we still assign offset + len and do some magic after to correct for that. Do offset calculation first and finalise it with iov_iter_bvec at the end. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/2d5107fed24f8b23245ef2ede9a5a7f7c426df61.1744882081.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
Sending exact nr_segs, avoids bio split check and processing in block layer, which takes around 5%[1] of overall CPU utilization. In our setup, we see overall improvement of IOPS from 7.15M to 7.65M [2] and 5% less CPU utilization. [1] 3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at 1.42% io_uring [kernel.kallsyms] [k] bio_split_rw 0.62% io_uring [kernel.kallsyms] [k] bio_submit_split [2] sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1 Signed-off-by: Nitesh Shetty <[email protected]> [Pavel: fixed for kbuf, rebased and reworked on top of cleanups] Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/7a1a49a8d053bd617c244291d63dbfbc07afde36.1744882081.git.asml.silence@gmail.com [axboe: fold in fix factoring in buf reg offset] Signed-off-by: Jens Axboe <[email protected]>
kbuf imports have the front offset adjusted and segments removed, but the tail segments are still included in the segment count that gets passed in the iov_iter. As the segments aren't necessarily all the same size, move importing to a separate helper and iterate the mapped length to get an exact count. Reviewed-by: Nitesh Shetty <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
There is a problem with page pools not dma-unmapping immediately when the device is going down, and delaying it until the page pool is destroyed, which is not allowed (see links). That just got fixed for normal page pools, and we need to address memory providers as well. Unmap pages in the memory provider uninstall callback, and protect it with a new lock. There is also a gap between when a dma mapping is created and the mp is installed, so if the device is killed in between, io_uring would be holding on to dma mappings to a dead device with no one to call ->uninstall. Move it to page pool init and rely on ->is_mapped to make sure it's only done once. Link: https://lore.kernel.org/lkml/[email protected]/T/ Link: https://lore.kernel.org/all/[email protected]/ Fixes: 34a3e60 ("io_uring/zcrx: implement zerocopy receive pp memory provider") Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/ef9b7db249b14f6e0b570a1bb77ff177389f881c.1744965853.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
Pull io_uring fixes from Jens Axboe: - Correctly cap iov_iter->nr_segs for imports of registered buffers, both kbuf and normal ones. Three cleanups to make it saner first, then two fixes for each of the buffer types. This fixes a performance regression where partial buffer usage doesn't trim the tail number of segments, leading the block layer to iterate the IOs to check if it needs splitting. - Two patches tweaking the newly introduced zero-copy rx API, mostly to keep the API consistent once we add multiple interface queues per ring support in the 6.16 release. - zc rx unmapping fix for a dead device * tag 'io_uring-6.15-20250418' of git://git.kernel.dk/linux: io_uring/zcrx: fix late dma unmap for a dead dev io_uring/rsrc: ensure segments counts are correct on kbuf buffers io_uring/rsrc: send exact nr_segs for fixed buffer io_uring/rsrc: refactor io_import_fixed io_uring/rsrc: separate kbuf offset adjustments io_uring/rsrc: don't skip offset calculation io_uring/zcrx: add pp to ifq conversion helper io_uring/zcrx: return ifq id to the user
Pull block fixes from Jens Axboe: - MD pull via Yu: - fix raid10 missing discard IO accounting (Yu Kuai) - fix bitmap stats for bitmap file (Zheng Qixing) - fix oops while reading all member disks failed during check/repair (Meir Elisha) - NVMe pull via Christoph: - fix scan failure for non-ANA multipath controllers (Hannes Reinecke) - fix multipath sysfs links creation for some cases (Hannes Reinecke) - PCIe endpoint fixes (Damien Le Moal) - use NULL instead of 0 in the auth code (Damien Le Moal) - Various ublk fixes: - Slew of selftest additions - Improvements and fixes for IO cancelation - Tweak to Kconfig verbiage - Fix for page dirtying for blk integrity mapped pages - loop fixes: - buffered IO fix - uevent fixes - request priority inheritance fix - Various little fixes * tag 'block-6.15-20250417' of git://git.kernel.dk/linux: (38 commits) selftests: ublk: add generic_06 for covering fault inject ublk: simplify aborting ublk request ublk: remove __ublk_quiesce_dev() ublk: improve detection and handling of ublk server exit ublk: move device reset into ublk_ch_release() ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io ublk: add ublk_force_abort_dev() ublk: properly serialize all FETCH_REQs selftests: ublk: move creating UBLK_TMP into _prep_test() selftests: ublk: add test_stress_05.sh selftests: ublk: support user recovery selftests: ublk: support target specific command line selftests: ublk: increase max nr_queues and queue depth selftests: ublk: set queue pthread's cpu affinity selftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN selftests: ublk: add two stress tests for zero copy feature selftests: ublk: run stress tests in parallel selftests: ublk: make sure _add_ublk_dev can return in sub-shell selftests: ublk: cleanup backfile automatically selftests: ublk: add io_uring uapi header ...
Pull smb server fixes from Steve French: - Fix integer overflow in server disconnect deadtime calculation - Three fixes for potential use after frees: one for oplocks, and one for leases and one for kerberos authentication - Fix to prevent attempted write to directory - Fix locking warning for durable scavenger thread * tag 'v6.15-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: Prevent integer overflow in calculation of deadtime ksmbd: fix the warning from __kernel_write_iter ksmbd: fix use-after-free in smb_break_all_levII_oplock() ksmbd: fix use-after-free in __smb2_lease_break_noti() ksmbd: fix WARNING "do not call blocking ops when !TASK_RUNNING" ksmbd: Fix dangling pointer in krb_authenticate
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.1)
Can you help keep this open source service alive? 💖 Please sponsor : )