See https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/fcopysign.bf16.ll#L1233
The VOPD pair v_dual_mov_b32 v0, s2 :: v_dual_mov_b32 v1, s3
is treated like a single instruction that writes to both v0 and v1.
s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
says to wait for the VOPD pair to complete before the use of v0, and then to wait for the same VOPD pair again before the use of v1. The second wait is redundant, because the pair has already completed by the time v1 is used, and encoding it potentially decreases code density.
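
The redundancy can be illustrated with a small toy model. This is only a sketch and not the backend's actual logic: `needed_delays` and the producer labels are hypothetical names, and the model assumes that once one consumer has waited for a producer, later instructions in program order never need to wait for that same producer again. It ignores real latencies and the `s_delay_alu` encoding.

```python
def needed_delays(consumers):
    """Return the consumers that still need a wait.

    consumers: list of (consumer_name, producer_id) in program order.
    Assumption (toy model): a producer that has been waited on once
    is complete for every later consumer.
    """
    completed = set()  # producers some earlier consumer already waited for
    needed = []
    for name, producer in consumers:
        if producer not in completed:
            needed.append(name)
            completed.add(producer)
    return needed


# Both halves of the VOPD pair modeled as one producer: one wait suffices.
as_one = needed_delays([("use of v0", "vopd"), ("use of v1", "vopd")])

# Each half modeled as its own producer: a second wait is emitted.
as_two = needed_delays([("use of v0", "vopd.x"), ("use of v1", "vopd.y")])

print(as_one)
print(as_two)
```

With the pair modeled as a single producer, only the use of v0 needs a wait, which corresponds to dropping the `instskip(NEXT) | instid1(VALU_DEP_2)` part.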