Chunked DPO #721

cyr0930 · 2025-05-21T07:31:23Z

Summary

Attempts to fix #439.

As the issue above points out, chunking the hidden states along the batch dimension gives only limited benefit.
This PR therefore chunks the hidden states along the flattened (batch*seq_len) dimension instead.

Chunking this way requires a non-trivial online loss computation, so the fused forward-backward technique cannot be used here.
However, the memory footprint can still be reduced by slicing the hidden states into small chunks.
Although materializing all logits is unavoidable, the backward step can run chunk by chunk instead of all at once, which avoids the large memory spike.
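To make the idea concrete, here is a minimal NumPy sketch of chunking along the flattened (batch*seq_len) dimension. It is not the code in this PR: the function name `chunked_dpo`, the arguments `seq_id` and `ref_logp`, and the layout (first half of the sequences chosen, second half rejected) are assumptions for illustration, and the gradients are written by hand rather than produced by PyTorch autograd. It stores all logits in the forward pass, computes the loss once the per-sequence sums are known, and then runs the backward step one chunk at a time.

```python
import numpy as np


def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def chunked_dpo(h, W, y, seq_id, ref_logp, beta=0.1, chunk_size=4):
    """Hypothetical helper, not part of liger-kernel.

    h: (N, H) flattened hidden states, N = batch * seq_len
    W: (V, H) LM head weight
    y: (N,) target token ids
    seq_id: (N,) index of the sequence each token belongs to
    ref_logp: (B,) reference-model sequence log-probs (constants), B even
    Assumed layout: sequences [0, B/2) are chosen, [B/2, B) are rejected.
    """
    N, B = h.shape[0], ref_logp.shape[0]
    P = B // 2
    chunks = [(s, min(s + chunk_size, N)) for s in range(0, N, chunk_size)]

    # Forward: chunked matmul. All logits end up stored.
    logits = np.empty((N, W.shape[0]))
    tok_logp = np.empty(N)
    for s, e in chunks:
        logits[s:e] = h[s:e] @ W.T
        tok_logp[s:e] = log_softmax(logits[s:e])[np.arange(e - s), y[s:e]]

    # The loss needs whole-sequence sums, so it is computed only after
    # every chunk has been seen.
    seq_logp = np.bincount(seq_id, weights=tok_logp, minlength=B)
    margin = beta * ((seq_logp[:P] - seq_logp[P:]) - (ref_logp[:P] - ref_logp[P:]))
    loss = np.logaddexp(0.0, -margin).mean()  # mean of -log(sigmoid(margin))

    # Gradient of the loss w.r.t. each sequence sum, broadcast to its tokens.
    d_margin = -np.exp(-np.logaddexp(0.0, margin)) / P  # -sigmoid(-margin) / P
    g_tok = np.concatenate([beta * d_margin, -beta * d_margin])[seq_id]

    # Backward: one chunk at a time, so only a (chunk_size, V) gradient
    # buffer is alive at any moment.
    grad_h = np.empty_like(h)
    grad_W = np.zeros_like(W)
    for s, e in chunks:
        # d logp[y] / d logits = onehot(y) - softmax(logits)
        d_logits = -np.exp(log_softmax(logits[s:e])) * g_tok[s:e, None]
        d_logits[np.arange(e - s), y[s:e]] += g_tok[s:e]
        grad_h[s:e] = d_logits @ W
        grad_W += d_logits.T @ h[s:e]
    return loss, grad_h, grad_W


# Example call on random data: 2 chosen + 2 rejected sequences of 5 tokens.
rng = np.random.default_rng(0)
B, T, H, V = 4, 5, 8, 11
h = rng.normal(size=(B * T, H))
W = rng.normal(size=(V, H))
y = rng.integers(0, V, size=B * T)
seq_id = np.repeat(np.arange(B), T)
ref_logp = rng.normal(size=B)
loss, grad_h, grad_W = chunked_dpo(h, W, y, seq_id, ref_logp, chunk_size=3)
```

In this sketch a chunk may straddle a sequence boundary (with `chunk_size=3` and 5 tokens per sequence it does), which is the point of chunking over batch*seq_len rather than over the batch alone.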

I'm not sure the implementation in this PR is complete yet; for now I just want to check whether the idea is valid and aligned with the spirit of liger-kernel. Any feedback would be great. Thanks.

Testing Done

I haven't run the tests yet because I first want to check whether this concept is acceptable.

Hardware Type:
run make test to ensure correctness
run make checkstyle to ensure code style
run make test-convergence to ensure convergence

cyr0930 and others added 8 commits May 21, 2025 05:30

init

885a21a

[feat] chunked dpo loss

b442602

chunked matmul

cf8d269

Merge branch 'main' into feat/chunked_dpo

83855c4

Merge branch 'main' into feat/chunked_dpo

a1a4626

Merge branch 'main' into feat/chunked_dpo

8242a9b

Merge branch 'main' into feat/chunked_dpo

f61b2df

Merge branch 'main' into feat/chunked_dpo

bce43e7
