DeepNVMe update #966

Merged
merged 50 commits on Jun 9, 2025
Changes from 1 commit (of 50 commits)
8106bb8
Fast model checkpointing
tjruwase Dec 30, 2021
761e4e5
Support both legacy and serialized formats
tjruwase Dec 31, 2021
5967c79
Add io_buffer_mb option
tjruwase Jan 3, 2022
d96f1f6
Bug fix
tjruwase Jan 3, 2022
bbd96f2
Force flush
tjruwase Jan 3, 2022
3a16127
More model options; Refactor common codes
tjruwase Jan 4, 2022
c3df495
--gpu option
tjruwase Jan 5, 2022
315f02a
--half and more flexible options
tjruwase Jan 5, 2022
a41ba08
Add deepspeed.save_checkpoint()
tjruwase Jan 8, 2022
4fcb060
Free ds memory
tjruwase Jan 8, 2022
a49c542
Improve repro
tjruwase Jan 8, 2022
233b9e9
Double I/O buffer (#56)
tjruwase Feb 22, 2022
b1f02b2
Double I/O buffer (#60)
tjruwase Mar 11, 2022
a16ac9e
Add checkpoint comparison (#62)
jerryyangli Mar 15, 2022
b945adc
save_checkpoint perf monitoring
tjruwase Mar 19, 2022
2c7a5ed
Merge branch 'staging-fast-model-checkpoint-v2' of github.com:microso…
tjruwase Mar 19, 2022
64a8f75
Disable checkpoint save on exit
tjruwase Mar 22, 2022
44b8664
Perf statistics for save_checkpoint (#64)
tjruwase Mar 22, 2022
ff4bd69
add logs for a100-80
GuanhuaWang Sep 21, 2022
e4817a1
add torch* error log with half flag but without fused flag
GuanhuaWang Sep 22, 2022
b297e17
log for error
GuanhuaWang Sep 22, 2022
f05dab1
local rank arg
tjruwase Oct 5, 2022
fc4291f
Merge branch 'staging-fast-model-checkpoint-v2' of github.com:microso…
tjruwase Oct 5, 2022
db295f1
Merge branch 'staging-fast-model-checkpoint-v2' of github.com:microso…
tjruwase Oct 5, 2022
1aa971a
Handle local_rank arg (#78)
tjruwase Oct 5, 2022
98b2f8a
Single writer option
tjruwase Oct 5, 2022
2e42285
Single writer option (#79)
tjruwase Oct 5, 2022
09dbd8a
Merge branch 'staging-fast-model-checkpoint-v3' of github.com:microso…
tjruwase Oct 7, 2022
a567adf
Allow missing folder
tjruwase Oct 12, 2022
65793bd
DP writer refactor
tjruwase Feb 10, 2023
5bfdf04
Update for DS; Add GDS
tjruwase Feb 12, 2025
9a27914
Integrate GDS into deepspeed_model_save
tjruwase Feb 20, 2025
53572f8
Rebase fast persist
tjruwase Feb 25, 2025
515dded
Rebase fast persist (#184)
tjruwase Feb 25, 2025
d01aa27
Move folder
tjruwase Mar 26, 2025
e5a316f
Merge branch 'olruwase/fast_persist' of github.com:microsoft/DeepSpee…
tjruwase Mar 26, 2025
4059f80
Remove folder
tjruwase Mar 26, 2025
1c3a54c
More cleanup
tjruwase Mar 26, 2025
9a8540b
torch changes
tjruwase Mar 27, 2025
ee2f081
sglang+zero_inference
tjruwase Apr 7, 2025
ad81cec
Remove file
tjruwase Apr 7, 2025
dff5274
Add offload configs
tjruwase Apr 8, 2025
d84bb56
Add pin_memory
tjruwase Apr 8, 2025
db3b32b
Cleanup scripts
tjruwase Apr 8, 2025
6ee91cb
SGLang README
tjruwase Apr 12, 2025
e283b74
Remove file
tjruwase Apr 12, 2025
54872e1
Merge branch 'master' into olruwase/fast_persist
tjruwase Apr 14, 2025
d971d84
Merge branch 'master' into olruwase/fast_persist
loadams May 15, 2025
3decf3d
Merge branch 'master' into olruwase/fast_persist
hwchen2017 May 23, 2025
0512775
Merge branch 'master' into olruwase/fast_persist
PKUWZP Jun 9, 2025
log for error
GuanhuaWang committed Sep 22, 2022
commit b297e1776f8b9a266cc18852fd331e811b31f422
75 changes: 67 additions & 8 deletions fast_io/model_checkpoint/log_9_21_22/torch_star_half_error.txt
@@ -2,12 +2,71 @@ Performance test of deepspeed integration of fast model checkpointing.
torch version = 1.12.0+cu113
args = Namespace(cpu_offload=False, folder='/home/guanhuawang/eclipse', fused=False, gpu=False, half=True, io_buffer_mb=1024, legacy=True, model='gpt2-large', no_statistics=False, optimizer=False, single_io_buffer=True, zero_stage=0)
Model name = gpt2-large
[2022-09-22 01:22:52,520] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.4+74104af1, git-hash=74104af1, git-branch=staging-fast-model-checkpoint-v3
[2022-09-22 01:22:52,524] [INFO] [comm.py:617:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2022-09-22 01:22:53,396] [INFO] [comm.py:669:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.1.46, master_port=29500
[2022-09-22 01:22:53,397] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2022-09-22 01:22:53,400] [WARNING] [config_utils.py:63:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2022-09-22 01:29:33,721] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.4+74104af1, git-hash=74104af1, git-branch=staging-fast-model-checkpoint-v3
[2022-09-22 01:29:33,725] [INFO] [comm.py:617:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

Local host: azwuse57c00009D
Device name: mlx5_ib0
Device vendor ID: 0x02c9
Device vendor part ID: 4124

Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

Local host: azwuse57c00009D
Local adapter: mlx5_ib0
Local port: 1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

Local host: azwuse57c00009D
Local device: mlx5_ib4
--------------------------------------------------------------------------
[2022-09-22 01:29:34,587] [INFO] [comm.py:669:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.1.46, master_port=29500
[2022-09-22 01:29:34,587] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2022-09-22 01:29:34,591] [WARNING] [config_utils.py:63:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
NCCL version 2.10.3+cuda11.3
[2022-09-22 01:22:56,452] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2022-09-22 01:22:56,454] [INFO] [logging.py:68:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
[2022-09-22 01:22:56,482] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = {basic_optimizer.__class__.__name__}
[2022-09-22 01:29:38,429] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2022-09-22 01:29:38,430] [INFO] [logging.py:68:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
[2022-09-22 01:29:38,461] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = {basic_optimizer.__class__.__name__}
Traceback (most recent call last):
File "deepspeed_save_model.py", line 133, in <module>
main()
File "deepspeed_save_model.py", line 129, in main
run(model, model_name, ckpt_name, args)
File "deepspeed_save_model.py", line 106, in run
write_sec = test_save(tag, folder, model, args, writer_type)
File "deepspeed_save_model.py", line 76, in test_save
ds_engine = _get_ds_engine(model, ds_config)
File "deepspeed_save_model.py", line 52, in _get_ds_engine
ds_engine, _, _, _ = deepspeed.initialize(
File "/home/guanhuawang/DeepSpeed-internal/deepspeed/__init__.py", line 124, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/guanhuawang/DeepSpeed-internal/deepspeed/runtime/engine.py", line 322, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/guanhuawang/DeepSpeed-internal/deepspeed/runtime/engine.py", line 1178, in _configure_optimizer
self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
File "/home/guanhuawang/DeepSpeed-internal/deepspeed/runtime/engine.py", line 1314, in _configure_fp16_optimizer
or self.fp16_fused_mode() \
File "/home/guanhuawang/DeepSpeed-internal/deepspeed/runtime/engine.py", line 792, in fp16_fused_mode
return self._config.fp16_fused_mode
AttributeError: 'DeepSpeedConfig' object has no attribute 'fp16_fused_mode'
[azwuse57c00009D:37114] 4 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[azwuse57c00009D:37114] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[azwuse57c00009D:37114] 4 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
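The traceback in the log ends in `AttributeError: 'DeepSpeedConfig' object has no attribute 'fp16_fused_mode'`, raised when the engine is run with the `--half` flag but without `--fused`: the engine code reads a config attribute that was never set on the config object. A minimal sketch of that failure mode, using a hypothetical stand-in class (`DeepSpeedConfigSketch` is not the real `DeepSpeedConfig`; only the attribute name `fp16_fused_mode` comes from the log):

```python
class DeepSpeedConfigSketch:
    """Hypothetical stand-in for DeepSpeedConfig.

    fp16 is enabled (--half), but fp16_fused_mode was never
    defined on the object, mirroring the state in the log.
    """

    def __init__(self, fp16_enabled=True):
        self.fp16_enabled = fp16_enabled


config = DeepSpeedConfigSketch()

# Direct attribute access reproduces the crash seen in the traceback:
try:
    _ = config.fp16_fused_mode
except AttributeError as e:
    print(f"AttributeError: {e}")

# A defensive read with getattr() supplies a default instead of
# raising when the attribute is absent:
fused = getattr(config, "fp16_fused_mode", False)
print(fused)  # False
```

This only illustrates why the first access crashes; the actual fix in the PR branch would be to define the missing attribute on the config class rather than to paper over it at the call site.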