KTO training works on 1 GPU but fails on 8 GPUs: Tensor dtypes: BFloat16 vs Float #3955

Open
qq941134965 opened this issue Apr 22, 2025 · 3 comments

Comments

@qq941134965

This looks like a communication problem between ranks. How can I fix it?

@qq941134965
Author

swift version 3.2.2
trl 0.16.1

@qq941134965
Author

The error appears to be raised before training actually starts:
[rank7]: RuntimeError: Detected mismatch between collectives on ranks. Rank 7 is running collective: CollectiveFingerPrint(SequenceNumber=5, OpType=_ALLGATHER_BASE, TensorShape=[1], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=5, OpType=_ALLGATHER_BASE, TensorShape=[1], TensorDtypes=BFloat16, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).Collectives differ in the following aspects: Tensor Tensor dtypes: Floatvs BFloat16
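
For anyone triaging a similar report, the message above can be broken down mechanically to see which fingerprint fields differ between the two ranks. The following is a minimal sketch, not part of swift, trl, or PyTorch: the names `FINGERPRINT_RE`, `parse_fingerprints`, and `differing_fields` are hypothetical, the regular expression assumes the message layout shown in this thread, and `SAMPLE` is a trimmed copy of that message.

```python
import re

# Trimmed copy of the error text from this thread (fields after TensorDtypes omitted).
SAMPLE = (
    "Rank 7 is running collective: CollectiveFingerPrint(SequenceNumber=5, "
    "OpType=_ALLGATHER_BASE, TensorShape=[1], TensorDtypes=Float, "
    "but Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=5, "
    "OpType=_ALLGATHER_BASE, TensorShape=[1], TensorDtypes=BFloat16, "
)

# Assumption: every fingerprint follows the field order seen in the message above.
FINGERPRINT_RE = re.compile(
    r"Rank (?P<rank>\d+) is running collective: CollectiveFingerPrint\("
    r"SequenceNumber=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"TensorShape=\[(?P<shape>[^\]]*)\], TensorDtypes=(?P<dtype>\w+)"
)


def parse_fingerprints(message):
    """Return {rank: {field: value}} for each fingerprint found in the message."""
    result = {}
    for match in FINGERPRINT_RE.finditer(message):
        fields = match.groupdict()
        rank = int(fields.pop("rank"))
        result[rank] = fields
    return result


def differing_fields(message):
    """Return {field: {rank: value}} for the fields whose values differ across ranks."""
    prints = parse_fingerprints(message)
    diffs = {}
    for key in ("seq", "op", "shape", "dtype"):
        values = {rank: fp[key] for rank, fp in prints.items()}
        if len(set(values.values())) > 1:
            diffs[key] = values
    return diffs


if __name__ == "__main__":
    print(differing_fields(SAMPLE))  # {'dtype': {7: 'Float', 0: 'BFloat16'}}
```

Here the sequence number, op type, and shape match on both ranks and only the dtype differs, so the two ranks seem to reach the same all-gather of a one-element tensor with different precisions (float32 on rank 7, bfloat16 on rank 0). Which tensor that is cannot be determined from the message alone.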

@hjh0119
Collaborator

hjh0119 commented Apr 23, 2025

Please provide a minimal reproducible script.
