GRPO training error: Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension #3864
Comments
Can you try removing --move_model_batches 16? Will that cause any issues?
I have already tested with the --move_model_batches 16 parameter removed, but the bug still occurs.
Same error when using GRPO.
I have the same error using examples/train/grpo/train_72b_4gpu.sh
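For anyone trying to narrow this down: the fatal error in the title means the reference count of None reached zero. Pure Python code cannot normally cause that, so a C extension is probably releasing a reference to None that it does not own. A rough way to check whether the count drains gradually during training is to sample sys.getrefcount(None) every few steps. The sketch below is only a diagnostic idea. NoneRefcountMonitor and its methods are hypothetical names and are not part of ms-swift, trl, or vLLM. On Python 3.12+ None is immortal, so the numbers are only meaningful on older interpreters such as the 3.10 environment reported here.

```python
import sys


class NoneRefcountMonitor:
    """Hypothetical helper: sample sys.getrefcount(None) across training steps."""

    def __init__(self, every=50):
        self.every = every    # sampling interval, in steps
        self.samples = []     # list of (step, refcount) pairs

    def record(self, step):
        # Sample only every `self.every` steps to keep the overhead negligible.
        if step % self.every == 0:
            self.samples.append((step, sys.getrefcount(None)))

    def drift_per_step(self):
        # Average change in None's refcount per step, first sample to last.
        if len(self.samples) < 2:
            return 0.0
        (s0, r0), (s1, r1) = self.samples[0], self.samples[-1]
        return (r1 - r0) / (s1 - s0)


if __name__ == "__main__":
    monitor = NoneRefcountMonitor(every=1)
    for step in range(5):
        monitor.record(step)
    print(monitor.samples, monitor.drift_per_step())
```

One possible place to call record(step) is a transformers TrainerCallback.on_step_end hook, using state.global_step as the step. A drift that stays clearly negative over hundreds of steps would point toward a per-step over-decrement. A flat or noisy value would not rule out a one-off error in a rarely taken code path.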
I ran into a similar error after training for 6k steps.
Training launch arguments:
Which version of vLLM is producing the error?
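If it helps anyone answering this, the relevant versions can be collected in one go instead of pasting a full pip list. This is a small sketch using the standard library's importlib.metadata. The package list is my own pick of the libraries that show up in the crash stack, and collect_versions is a hypothetical helper name, not something shipped with ms-swift.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names chosen for illustration; adjust to your environment.
PACKAGES = ["vllm", "torch", "trl", "transformers", "ms_swift", "deepspeed"]


def collect_versions(packages=PACKAGES):
    """Return a dict mapping each package name to its version or 'not installed'."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    for name, ver in collect_versions().items():
        print(f"{name}=={ver}")
```

Run it inside the same conda environment that launches training, since the result depends on which interpreter executes it.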
Training fails after 1000 steps with the error below. It is 100% reproducible, and resuming training fails again at step 2000.
{'loss': 0.00618663, 'grad_norm': 0.67167018, 'learning_rate': 9.3e-07, 'memory(GiB)': 45.7, 'train_speed(iter/s)': 0.029422, 'completions/mean_length': 651.25, 'completions/min_length': 597.0, 'completions/max_length': 710.0, 'completions/clipped_ratio': 0.0, 'rewards/MultiModal_Iou_Shaped/mean': 0.91099524, 'rewards/MultiModal_Iou_Shaped/std': 0.02873231, 'rewards/Consistency_Reward/mean': 0.91666669, 'rewards/Consistency_Reward/std': 0.15430333, 'rewards/Multimodal_Format/mean': 1.0, 'rewards/Multimodal_Format/std': 0.0, 'reward': 2.82766199, 'reward_std': 0.11247847, 'kl': 0.00708008, 'clip_ratio': 0.0, 'epoch': 0.18, 'global_step/max_steps': '1095/5932', 'percentage': '18.46%', 'elapsed_time': '10h 18m 14s', 'remaining_time': '1d 21h 31m 1s'}
Train: 18%|█████████████████████▏ | 1095/5932 [10:18:14<28:13:59, 21.01s/it]
INFO 04-25 15:32:28 [executor_base.py:226] It took 0.365433 seconds to wake up tags {'kv_cache', 'weights'}.
INFO 04-25 15:32:28 [executor_base.py:226] It took 0.416451 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 04-25 15:32:28 [executor_base.py:226] It took 0.419207 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 04-25 15:32:28 [executor_base.py:226] It took 0.423997 seconds to wake up tags {'weights', 'kv_cache'}.
Fatal Python error: none_dealloc: deallocating None
Python runtime state: initialized
Thread 0x000071e302a00640 (most recent call first):
<no Python frame>
Thread 0x000071e2ffe00640 (most recent call first):
<no Python frame>
Thread 0x000071e2ff400640 (most recent call first):
<no Python frame>
Thread 0x000071e303400640 (most recent call first):
<no Python frame>
Thread 0x000071e303e00640 (most recent call first):
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 320 in wait
File "/data2/anaconda3/envs/chxm/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 953 in run
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x000071e306a00640 (most recent call first):
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 320 in wait
File "/data2/anaconda3/envs/chxm/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 953 in run
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x000071e307400640 (most recent call first):
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 320 in wait
File "/data2/anaconda3/envs/chxm/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 953 in run
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x000071e307e00640 (most recent call first):
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 320 in wait
File "/data2/anaconda3/envs/chxm/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 953 in run
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x000071e30aa00640 (most recent call first):
File "/data2/anaconda3/envs/chxm/lib/python3.10/selectors.py", line 416 in select
File "/data2/anaconda3/envs/chxm/lib/python3.10/multiprocessing/connection.py", line 931 in wait
File "/data2/anaconda3/envs/chxm/lib/python3.10/multiprocessing/connection.py", line 424 in _poll
File "/data2/anaconda3/envs/chxm/lib/python3.10/multiprocessing/connection.py", line 257 in poll
File "/data2/anaconda3/envs/chxm/lib/python3.10/multiprocessing/queues.py", line 113 in get
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 35 in do_one_step
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 59 in _pin_memory_loop
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 953 in run
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x000071e30b400640 (most recent call first):
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 324 in wait
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 607 in wait
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x000071e30be00640 (most recent call first):
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 324 in wait
File "/data2/anaconda3/envs/chxm/lib/python3.10/queue.py", line 180 in get
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 269 in _run
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 244 in run
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x000071e311000640 (most recent call first):
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/vllm/usage/usage_lib.py", line 220 in _report_continous_usage
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/vllm/usage/usage_lib.py", line 163 in _report_usage_worker
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 953 in run
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x000071e3dac00640 (most recent call first):
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 324 in wait
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 607 in wait
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/data2/anaconda3/envs/chxm/lib/python3.10/threading.py", line 973 in _bootstrap
Current thread 0x000071e51e4af740 (most recent call first):
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/vllm/core/scheduler.py", line 794 in _schedule_running
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1259 in _schedule_default
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1460 in _schedule
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1501 in schedule
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1375 in step
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1409 in _run_engine
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 470 in generate
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/vllm/utils.py", line 1134 in inner
File "/data2/chxm/ms-swift-main/swift/llm/infer/infer_engine/grpo_vllm_engine.py", line 149 in infer
File "/data2/chxm/ms-swift-main/swift/trainers/rlhf_trainer/grpo_trainer.py", line 599 in _infer_multi_turn
File "/data2/chxm/ms-swift-main/swift/trainers/rlhf_trainer/grpo_trainer.py", line 759 in _fast_infer
File "/data2/chxm/ms-swift-main/swift/trainers/rlhf_trainer/grpo_trainer.py", line 793 in _generate_completions
File "/data2/chxm/ms-swift-main/swift/trainers/rlhf_trainer/grpo_trainer.py", line 823 in _generate_and_score_completions
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/trl/trainer/grpo_trainer.py", line 647 in _prepare_inputs
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/trl/extras/profiling.py", line 87 in wrapper
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/transformers/trainer.py", line 3730 in training_step
File "/data2/chxm/ms-swift-main/swift/trainers/rlhf_trainer/grpo_trainer.py", line 1147 in training_step
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/transformers/trainer.py", line 2560 in _inner_training_loop
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/transformers/trainer.py", line 2245 in train
File "/data2/chxm/ms-swift-main/swift/trainers/mixin.py", line 294 in train
File "/data2/chxm/ms-swift-main/swift/llm/train/sft.py", line 204 in train
File "/data2/chxm/ms-swift-main/swift/llm/train/sft.py", line 144 in run
File "/data2/chxm/ms-swift-main/swift/llm/base.py", line 47 in main
File "/data2/chxm/ms-swift-main/swift/llm/train/rlhf.py", line 98 in rlhf_main
File "/data2/chxm/ms-swift-main/swift/cli/rlhf.py", line 5 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _brotli, zstandard.backend_c, charset_normalizer.md, simplejson._speedups, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, markupsafe._speedups, PIL._imaging, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, kiwisolver._cext, google._upb._message, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, 
yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, pyarrow._acero, pyarrow._csv, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, sklearn.__check_build._check_build, psutil._psutil_linux, psutil._psutil_posix, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, 
scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.utils, av.option, av.descriptor, av.format, av.stream, av.container.streams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, 
av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.pad, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, cuda_utils, msgpack._cmsgpack, regex._regex, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, zmq.backend.cython._zmq, msgspec._core, setproctitle, uvloop.loop, ray._raylet, vllm.cumem_allocator, sentencepiece._sentencepiece, __triton_launcher, PIL._imagingmath (total: 269)
W0425 15:32:31.625000 1224982 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1225055 closing signal SIGTERM
W0425 15:32:31.628000 1224982 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1225056 closing signal SIGTERM
W0425 15:32:31.629000 1224982 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1225057 closing signal SIGTERM
E0425 15:32:37.269000 1224982 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 1225054) of binary: /data2/anaconda3/envs/chxm/bin/python
Traceback (most recent call last):
File "/data2/anaconda3/envs/chxm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data2/anaconda3/envs/chxm/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in <module>
main()
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data2/anaconda3/envs/chxm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
/data2/chxm/ms-swift-main/swift/cli/rlhf.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-04-25_15:32:31
host : ubuntn
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 1225054)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1225054
========================================================
(chxm) (base) member@ubuntn:~$ pip list
Package Version Editable project location
---------------------------------------- -------------- -------------------------
absl-py 2.2.2
accelerate 1.6.0
addict 2.4.0
aiofiles 24.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.11.16
aiosignal 1.3.2
airportsdata 20250224
aliyun-python-sdk-core 2.16.0
aliyun-python-sdk-kms 2.16.5
altair 5.5.0
annotated-types 0.7.0
antlr4-python3-runtime 4.7.2
anyio 4.9.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
arxiv 2.2.0
astor 0.8.1
asttokens 3.0.0
async-lru 2.0.5
async-timeout 5.0.1
attrdict 2.0.1
attrs 25.3.0
auto_gptq 0.7.1
av 14.3.0
babel 2.17.0
beautifulsoup4 4.13.3
binpacking 1.5.2
bitsandbytes 0.45.5
blake3 1.0.4
bleach 6.2.0
blinker 1.9.0
boto3 1.37.32
botocore 1.37.32
Brotli 1.1.0
cachetools 5.5.2
certifi 2025.1.31
cffi 1.17.1
charset-normalizer 3.4.1
click 8.1.8
cloudpickle 3.1.1
colorama 0.4.6
comm 0.2.2
compressed-tensors 0.9.3
contourpy 1.3.1
cpm-kernels 1.0.11
crcmod 1.7
cryptography 43.0.3
cupy-cuda12x 13.4.1
cycler 0.12.1
dacite 1.9.2
datasets 3.2.0
debugpy 1.8.14
decorator 5.2.1
decord 0.6.0
deepspeed 0.16.5
defusedxml 0.7.1
Deprecated 1.2.18
depyf 0.18.0
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
dnspython 2.7.0
docker-pycreds 0.4.0
duckduckgo_search 5.3.1b1
einops 0.8.1
email_validator 2.2.0
et_xmlfile 2.0.0
evalscope 0.14.0
evaluate 0.4.3
exceptiongroup 1.2.2
executing 2.2.0
fastapi 0.115.12
fastapi-cli 0.0.7
fastjsonschema 2.21.1
fastrlock 0.8.3
feedparser 6.0.11
ffmpy 0.5.0
filelock 3.18.0
fire 0.7.0
flash_attn 2.7.4.post1
fonttools 4.57.0
fqdn 1.5.1
frozenlist 1.5.0
fsspec 2024.9.0
func_timeout 4.3.5
future 1.0.0
fuzzywuzzy 0.18.0
gekko 1.3.0
gguf 0.14.0
gitdb 4.0.12
GitPython 3.1.44
google-ai-generativelanguage 0.6.15
google-api-core 2.24.2
google-api-python-client 2.167.0
google-auth 2.39.0
google-auth-httplib2 0.2.0
google-generativeai 0.8.5
googleapis-common-protos 1.70.0
gradio 5.24.0
gradio_client 1.8.0
griffe 0.49.0
groovy 0.1.2
grpcio 1.71.0
grpcio-status 1.71.0
h11 0.14.0
h2 4.2.0
h5py 3.13.0
hf-xet 1.0.3
hjson 3.1.0
hpack 4.1.0
httpcore 1.0.7
httplib2 0.22.0
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.30.2
human-eval 1.0.3
hyperframe 6.1.0
idna 3.10
imageio 2.37.0
immutabledict 4.2.1
importlib_metadata 8.0.0
interegular 0.3.3
ipykernel 6.29.5
ipython 8.35.0
ipywidgets 8.1.6
isoduration 20.11.0
jedi 0.19.2
jieba 0.42.1
Jinja2 3.1.6
jiter 0.9.0
jmespath 0.10.0
joblib 1.4.2
json5 0.12.0
jsonlines 4.0.0
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
jupyter 1.1.1
jupyter_client 8.6.3
jupyter-console 6.6.3
jupyter_core 5.7.2
jupyter-events 0.12.0
jupyter-lsp 2.2.5
jupyter_server 2.15.0
jupyter_server_terminals 0.5.3
jupyterlab 4.4.0
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.3
jupyterlab_widgets 3.0.14
kiwisolver 1.4.8
lagent 0.2.4
langdetect 1.0.9
lark 1.2.2
latex2sympy2 1.9.1
latex2sympy2_extended 1.10.1
lazy_loader 0.4
Levenshtein 0.27.1
lightning-utilities 0.14.3
llguidance 0.7.16
llvmlite 0.44.0
lm-format-enforcer 0.10.11
lxml 5.3.2
Markdown 3.7
markdown-it-py 3.0.0
MarkupSafe 3.0.2
math-verify 0.7.0
matplotlib 3.10.1
matplotlib-inline 0.1.7
mdurl 0.1.2
mistral_common 1.5.4
mistune 3.1.3
mmengine 0.10.7
mmengine-lite 0.10.7
modelscope 1.25.0
mpmath 1.3.0
ms-opencompass 0.1.6
ms_swift 3.4.0.dev0 /data2/chxm/ms-swift-main
ms-vlmeval 0.0.16
msgpack 1.1.0
msgspec 0.19.0
multidict 6.4.3
multiprocess 0.70.16
narwhals 1.34.1
nbclient 0.10.2
nbconvert 7.16.6
nbformat 5.10.4
nest-asyncio 1.6.0
networkx 3.4.2
ninja 1.11.1.4
nltk 3.9.1
notebook 7.4.0
notebook_shim 0.2.4
numba 0.61.2
numpy 1.26.4
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-cusparselt-cu12 0.6.2
nvidia-ml-py 12.570.86
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
nvitop 1.4.2
omegaconf 2.0.0
openai 1.72.0
OpenCC 1.1.9
opencv-python 4.11.0.86
opencv-python-headless 4.11.0.86
openpyxl 3.1.5
opentelemetry-api 1.26.0
opentelemetry-exporter-otlp 1.26.0
opentelemetry-exporter-otlp-proto-common 1.26.0
opentelemetry-exporter-otlp-proto-grpc 1.26.0
opentelemetry-exporter-otlp-proto-http 1.26.0
opentelemetry-proto 1.26.0
opentelemetry-sdk 1.26.0
opentelemetry-semantic-conventions 0.47b0
opentelemetry-semantic-conventions-ai 0.4.3
orjson 3.10.16
oss2 2.19.1
outlines 0.1.11
outlines_core 0.1.26
overrides 7.7.0
packaging 24.2
pandas 2.2.3
pandocfilters 1.5.1
parso 0.8.4
partial-json-parser 0.2.1.1.post5
peft 0.14.0
pexpect 4.9.0
phx-class-registry 4.1.0
pillow 11.1.0
pip 25.0
platformdirs 4.3.7
portalocker 3.1.1
prettytable 3.16.0
prometheus_client 0.21.1
prometheus-fastapi-instrumentator 7.1.0
prompt_toolkit 3.0.50
propcache 0.3.1
proto-plus 1.26.1
protobuf 5.29.4
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pyarrow 19.0.1
pyasn1 0.6.1
pyasn1_modules 0.4.2
pycocotools 2.0.8
pycountry 24.6.1
pycparser 2.22
pycryptodome 3.22.0
pydantic 2.11.3
pydantic_core 2.33.1
pydeck 0.9.1
pydub 0.25.1
Pygments 2.19.1
pynvml 12.0.0
pyparsing 3.2.3
pypinyin 0.54.0
python-dateutil 2.9.0.post0
python-dotenv 1.1.0
python-json-logger 3.3.0
python-Levenshtein 0.27.1
python-multipart 0.0.20
pytz 2025.2
PyYAML 6.0.2
pyzmq 26.4.0
qwen-vl-utils 0.0.10
rank-bm25 0.2.2
RapidFuzz 3.13.0
ray 2.43.0
referencing 0.36.2
regex 2024.11.6
requests 2.32.3
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 14.0.0
rich-toolkit 0.14.1
rouge 1.0.1
rouge-chinese 1.0.3
rouge_score 0.1.2
rpds-py 0.24.0
rsa 4.9.1
ruff 0.11.5
s3transfer 0.11.4
sacrebleu 2.5.1
safehttpx 0.1.6
safetensors 0.5.3
scikit-image 0.25.2
scikit-learn 1.6.1
scipy 1.15.2
seaborn 0.13.2
semantic-version 2.10.0
Send2Trash 1.8.3
sentence-transformers 4.0.2
sentencepiece 0.2.0
sentry-sdk 2.26.1
setproctitle 1.3.5
setuptools 69.5.1
sgmllib3k 1.0.0
shellingham 1.5.4
simplejson 3.20.1
six 1.17.0
smmap 5.0.2
sniffio 1.3.1
socksio 1.0.0
sortedcontainers 2.4.0
soupsieve 2.6
stack-data 0.6.3
starlette 0.46.1
streamlit 1.44.1
sty 1.0.6
swankit 0.1.7
swanlab 0.5.5
sympy 1.13.1
tabulate 0.9.0
tenacity 9.1.2
tensorboard 2.19.0
tensorboard-data-server 0.7.2
termcolor 3.0.1
terminado 0.18.1
threadpoolctl 3.6.0
tifffile 2025.3.30
tiktoken 0.9.0
timeout-decorator 0.5.0
tinycss2 1.4.0
tokenizers 0.21.1
toml 0.10.2
tomli 2.2.1
tomlkit 0.13.2
torch 2.6.0
torchaudio 2.6.0
torchmetrics 1.7.1
torchvision 0.21.0
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
transformers 4.51.3
transformers-stream-generator 0.0.5
triton 3.2.0
trl 0.16.1
typer 0.15.2
types-python-dateutil 2.9.0.20241206
typing_extensions 4.13.2
typing-inspection 0.4.0
tzdata 2025.2
uri-template 1.3.0
uritemplate 4.1.1
urllib3 2.4.0
uvicorn 0.34.0
uvloop 0.21.0
validators 0.34.0
vllm 0.8.4
volcengine-python-sdk 1.1.5
wandb 0.19.9
watchdog 6.0.0
watchfiles 1.0.5
wcwidth 0.2.13
webcolors 24.11.1
webencodings 0.5.1
websocket-client 1.8.0
websockets 15.0.1
Werkzeug 3.1.3
wheel 0.45.1
widgetsnbextension 4.0.14
word2number 1.1
wrapt 1.17.2
xformers 0.0.29.post2
xgrammar 0.1.18
XlsxWriter 3.2.2
xtuner 0.1.23
xxhash 3.5.0
yapf 0.43.0
yarl 1.19.0
zipp 3.21.0
zstandard 0.23.0
(chxm) (base) member@ubuntn:~$ conda list
# packages in environment at /data2/anaconda3/envs/chxm:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 2.2.2 pypi_0 pypi
accelerate 1.6.0 pypi_0 pypi
addict 2.4.0 pypi_0 pypi
aiofiles 24.1.0 pypi_0 pypi
aiohappyeyeballs 2.6.1 pypi_0 pypi
aiohttp 3.11.16 pypi_0 pypi
aiosignal 1.3.2 pypi_0 pypi
airportsdata 20250224 pypi_0 pypi
aliyun-python-sdk-core 2.16.0 pypi_0 pypi
aliyun-python-sdk-kms 2.16.5 pypi_0 pypi
altair 5.5.0 pypi_0 pypi
annotated-types 0.7.0 pypi_0 pypi
antlr4-python3-runtime 4.7.2 pypi_0 pypi
anyio 4.9.0 pypi_0 pypi
argon2-cffi 23.1.0 pypi_0 pypi
argon2-cffi-bindings 21.2.0 pypi_0 pypi
arrow 1.3.0 pypi_0 pypi
arxiv 2.2.0 pypi_0 pypi
astor 0.8.1 pypi_0 pypi
asttokens 3.0.0 pypi_0 pypi
async-lru 2.0.5 pypi_0 pypi
async-timeout 5.0.1 pypi_0 pypi
attrdict 2.0.1 pypi_0 pypi
attrs 25.3.0 pypi_0 pypi
auto-gptq 0.7.1 pypi_0 pypi
av 14.3.0 pypi_0 pypi
babel 2.17.0 pypi_0 pypi
beautifulsoup4 4.13.3 pypi_0 pypi
binpacking 1.5.2 pypi_0 pypi
bitsandbytes 0.45.5 pypi_0 pypi
blake3 1.0.4 pypi_0 pypi
bleach 6.2.0 pypi_0 pypi
blinker 1.9.0 pypi_0 pypi
boto3 1.37.32 pypi_0 pypi
botocore 1.37.32 pypi_0 pypi
brotli 1.1.0 pypi_0 pypi
bzip2 1.0.8 h5eee18b_6
ca-certificates 2025.2.25 h06a4308_0
cachetools 5.5.2 pypi_0 pypi
certifi 2025.1.31 pypi_0 pypi
cffi 1.17.1 pypi_0 pypi
charset-normalizer 3.4.1 pypi_0 pypi
click 8.1.8 pypi_0 pypi
cloudpickle 3.1.1 pypi_0 pypi
colorama 0.4.6 pypi_0 pypi
comm 0.2.2 pypi_0 pypi
compressed-tensors 0.9.3 pypi_0 pypi
contourpy 1.3.1 pypi_0 pypi
cpm-kernels 1.0.11 pypi_0 pypi
crcmod 1.7 pypi_0 pypi
cryptography 43.0.3 pypi_0 pypi
cupy-cuda12x 13.4.1 pypi_0 pypi
cycler 0.12.1 pypi_0 pypi
dacite 1.9.2 pypi_0 pypi
datasets 3.2.0 pypi_0 pypi
debugpy 1.8.14 pypi_0 pypi
decorator 5.2.1 pypi_0 pypi
decord 0.6.0 pypi_0 pypi
deepspeed 0.16.5 pypi_0 pypi
defusedxml 0.7.1 pypi_0 pypi
deprecated 1.2.18 pypi_0 pypi
depyf 0.18.0 pypi_0 pypi
dill 0.3.8 pypi_0 pypi
diskcache 5.6.3 pypi_0 pypi
distro 1.9.0 pypi_0 pypi
dnspython 2.7.0 pypi_0 pypi
docker-pycreds 0.4.0 pypi_0 pypi
duckduckgo-search 5.3.1b1 pypi_0 pypi
einops 0.8.1 pypi_0 pypi
email-validator 2.2.0 pypi_0 pypi
et-xmlfile 2.0.0 pypi_0 pypi
evalscope 0.14.0 pypi_0 pypi
evaluate 0.4.3 pypi_0 pypi
exceptiongroup 1.2.2 pypi_0 pypi
executing 2.2.0 pypi_0 pypi
fastapi 0.115.12 pypi_0 pypi
fastapi-cli 0.0.7 pypi_0 pypi
fastjsonschema 2.21.1 pypi_0 pypi
fastrlock 0.8.3 pypi_0 pypi
feedparser 6.0.11 pypi_0 pypi
ffmpy 0.5.0 pypi_0 pypi
filelock 3.18.0 pypi_0 pypi
fire 0.7.0 pypi_0 pypi
flash-attn 2.7.4.post1 pypi_0 pypi
fonttools 4.57.0 pypi_0 pypi
fqdn 1.5.1 pypi_0 pypi
frozenlist 1.5.0 pypi_0 pypi
fsspec 2024.9.0 pypi_0 pypi
func-timeout 4.3.5 pypi_0 pypi
future 1.0.0 pypi_0 pypi
fuzzywuzzy 0.18.0 pypi_0 pypi
gekko 1.3.0 pypi_0 pypi
gguf 0.14.0 pypi_0 pypi
gitdb 4.0.12 pypi_0 pypi
gitpython 3.1.44 pypi_0 pypi
google-ai-generativelanguage 0.6.15 pypi_0 pypi
google-api-core 2.24.2 pypi_0 pypi
google-api-python-client 2.167.0 pypi_0 pypi
google-auth 2.39.0 pypi_0 pypi
google-auth-httplib2 0.2.0 pypi_0 pypi
google-generativeai 0.8.5 pypi_0 pypi
googleapis-common-protos 1.70.0 pypi_0 pypi
gradio 5.24.0 pypi_0 pypi
gradio-client 1.8.0 pypi_0 pypi
griffe 0.49.0 pypi_0 pypi
groovy 0.1.2 pypi_0 pypi
grpcio 1.71.0 pypi_0 pypi
grpcio-status 1.71.0 pypi_0 pypi
h11 0.14.0 pypi_0 pypi
h2 4.2.0 pypi_0 pypi
h5py 3.13.0 pypi_0 pypi
hf-xet 1.0.3 pypi_0 pypi
hjson 3.1.0 pypi_0 pypi
hpack 4.1.0 pypi_0 pypi
httpcore 1.0.7 pypi_0 pypi
httplib2 0.22.0 pypi_0 pypi
httptools 0.6.4 pypi_0 pypi
httpx 0.28.1 pypi_0 pypi
huggingface-hub 0.30.2 pypi_0 pypi
human-eval 1.0.3 pypi_0 pypi
hyperframe 6.1.0 pypi_0 pypi
idna 3.10 pypi_0 pypi
imageio 2.37.0 pypi_0 pypi
immutabledict 4.2.1 pypi_0 pypi
importlib-metadata 8.0.0 pypi_0 pypi
interegular 0.3.3 pypi_0 pypi
ipykernel 6.29.5 pypi_0 pypi
ipython 8.35.0 pypi_0 pypi
ipywidgets 8.1.6 pypi_0 pypi
isoduration 20.11.0 pypi_0 pypi
jedi 0.19.2 pypi_0 pypi
jieba 0.42.1 pypi_0 pypi
jinja2 3.1.6 pypi_0 pypi
jiter 0.9.0 pypi_0 pypi
jmespath 0.10.0 pypi_0 pypi
joblib 1.4.2 pypi_0 pypi
json5 0.12.0 pypi_0 pypi
jsonlines 4.0.0 pypi_0 pypi
jsonpointer 3.0.0 pypi_0 pypi
jsonschema 4.23.0 pypi_0 pypi
jsonschema-specifications 2024.10.1 pypi_0 pypi
jupyter 1.1.1 pypi_0 pypi
jupyter-client 8.6.3 pypi_0 pypi
jupyter-console 6.6.3 pypi_0 pypi
jupyter-core 5.7.2 pypi_0 pypi
jupyter-events 0.12.0 pypi_0 pypi
jupyter-lsp 2.2.5 pypi_0 pypi
jupyter-server 2.15.0 pypi_0 pypi
jupyter-server-terminals 0.5.3 pypi_0 pypi
jupyterlab 4.4.0 pypi_0 pypi
jupyterlab-pygments 0.3.0 pypi_0 pypi
jupyterlab-server 2.27.3 pypi_0 pypi
jupyterlab-widgets 3.0.14 pypi_0 pypi
kiwisolver 1.4.8 pypi_0 pypi
lagent 0.2.4 pypi_0 pypi
langdetect 1.0.9 pypi_0 pypi
lark 1.2.2 pypi_0 pypi
latex2sympy2 1.9.1 pypi_0 pypi
latex2sympy2-extended 1.10.1 pypi_0 pypi
lazy-loader 0.4 pypi_0 pypi
ld_impl_linux-64 2.40 h12ee557_0
levenshtein 0.27.1 pypi_0 pypi
libffi 3.4.4 h6a678d5_1
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
lightning-utilities 0.14.3 pypi_0 pypi
llguidance 0.7.16 pypi_0 pypi
llvmlite 0.44.0 pypi_0 pypi
lm-format-enforcer 0.10.11 pypi_0 pypi
lxml 5.3.2 pypi_0 pypi
markdown 3.7 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
markupsafe 3.0.2 pypi_0 pypi
math-verify 0.7.0 pypi_0 pypi
matplotlib 3.10.1 pypi_0 pypi
matplotlib-inline 0.1.7 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
mistral-common 1.5.4 pypi_0 pypi
mistune 3.1.3 pypi_0 pypi
mmengine 0.10.7 pypi_0 pypi
mmengine-lite 0.10.7 pypi_0 pypi
modelscope 1.25.0 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
ms-opencompass 0.1.6 pypi_0 pypi
ms-swift 3.4.0.dev0 dev_0 <develop>
ms-vlmeval 0.0.16 pypi_0 pypi
msgpack 1.1.0 pypi_0 pypi
msgspec 0.19.0 pypi_0 pypi
multidict 6.4.3 pypi_0 pypi
multiprocess 0.70.16 pypi_0 pypi
narwhals 1.34.1 pypi_0 pypi
nbclient 0.10.2 pypi_0 pypi
nbconvert 7.16.6 pypi_0 pypi
nbformat 5.10.4 pypi_0 pypi
ncurses 6.4 h6a678d5_0
nest-asyncio 1.6.0 pypi_0 pypi
networkx 3.4.2 pypi_0 pypi
ninja 1.11.1.4 pypi_0 pypi
nltk 3.9.1 pypi_0 pypi
notebook 7.4.0 pypi_0 pypi
notebook-shim 0.2.4 pypi_0 pypi
numba 0.61.2 pypi_0 pypi
numpy 1.26.4 pypi_0 pypi
nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
nvidia-cusparselt-cu12 0.6.2 pypi_0 pypi
nvidia-ml-py 12.570.86 pypi_0 pypi
nvidia-nccl-cu12 2.21.5 pypi_0 pypi
nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
nvitop 1.4.2 pypi_0 pypi
omegaconf 2.0.0 pypi_0 pypi
openai 1.72.0 pypi_0 pypi
opencc 1.1.9 pypi_0 pypi
opencv-python 4.11.0.86 pypi_0 pypi
opencv-python-headless 4.11.0.86 pypi_0 pypi
openpyxl 3.1.5 pypi_0 pypi
openssl 3.0.16 h5eee18b_0
opentelemetry-api 1.26.0 pypi_0 pypi
opentelemetry-exporter-otlp 1.26.0 pypi_0 pypi
opentelemetry-exporter-otlp-proto-common 1.26.0 pypi_0 pypi
opentelemetry-exporter-otlp-proto-grpc 1.26.0 pypi_0 pypi
opentelemetry-exporter-otlp-proto-http 1.26.0 pypi_0 pypi
opentelemetry-proto 1.26.0 pypi_0 pypi
opentelemetry-sdk 1.26.0 pypi_0 pypi
opentelemetry-semantic-conventions 0.47b0 pypi_0 pypi
opentelemetry-semantic-conventions-ai 0.4.3 pypi_0 pypi
orjson 3.10.16 pypi_0 pypi
oss2 2.19.1 pypi_0 pypi
outlines 0.1.11 pypi_0 pypi
outlines-core 0.1.26 pypi_0 pypi
overrides 7.7.0 pypi_0 pypi
packaging 24.2 pypi_0 pypi
pandas 2.2.3 pypi_0 pypi
pandocfilters 1.5.1 pypi_0 pypi
parso 0.8.4 pypi_0 pypi
partial-json-parser 0.2.1.1.post5 pypi_0 pypi
peft 0.14.0 pypi_0 pypi
pexpect 4.9.0 pypi_0 pypi
phx-class-registry 4.1.0 pypi_0 pypi
pillow 11.1.0 pypi_0 pypi
pip 25.0 py310h06a4308_0
platformdirs 4.3.7 pypi_0 pypi
portalocker 3.1.1 pypi_0 pypi
prettytable 3.16.0 pypi_0 pypi
prometheus-client 0.21.1 pypi_0 pypi
prometheus-fastapi-instrumentator 7.1.0 pypi_0 pypi
prompt-toolkit 3.0.50 pypi_0 pypi
propcache 0.3.1 pypi_0 pypi
proto-plus 1.26.1 pypi_0 pypi
protobuf 5.29.4 pypi_0 pypi
psutil 7.0.0 pypi_0 pypi
ptyprocess 0.7.0 pypi_0 pypi
pure-eval 0.2.3 pypi_0 pypi
py-cpuinfo 9.0.0 pypi_0 pypi
pyarrow 19.0.1 pypi_0 pypi
pyasn1 0.6.1 pypi_0 pypi
pyasn1-modules 0.4.2 pypi_0 pypi
pycocotools 2.0.8 pypi_0 pypi
pycountry 24.6.1 pypi_0 pypi
pycparser 2.22 pypi_0 pypi
pycryptodome 3.22.0 pypi_0 pypi
pydantic 2.11.3 pypi_0 pypi
pydantic-core 2.33.1 pypi_0 pypi
pydeck 0.9.1 pypi_0 pypi
pydub 0.25.1 pypi_0 pypi
pygments 2.19.1 pypi_0 pypi
pynvml 12.0.0 pypi_0 pypi
pyparsing 3.2.3 pypi_0 pypi
pypinyin 0.54.0 pypi_0 pypi
python 3.10.16 he870216_1
python-dateutil 2.9.0.post0 pypi_0 pypi
python-dotenv 1.1.0 pypi_0 pypi
python-json-logger 3.3.0 pypi_0 pypi
python-levenshtein 0.27.1 pypi_0 pypi
python-multipart 0.0.20 pypi_0 pypi
pytz 2025.2 pypi_0 pypi
pyyaml 6.0.2 pypi_0 pypi
pyzmq 26.4.0 pypi_0 pypi
qwen-vl-utils 0.0.10 pypi_0 pypi
rank-bm25 0.2.2 pypi_0 pypi
rapidfuzz 3.13.0 pypi_0 pypi
ray 2.43.0 pypi_0 pypi
readline 8.2 h5eee18b_0
referencing 0.36.2 pypi_0 pypi
regex 2024.11.6 pypi_0 pypi
requests 2.32.3 pypi_0 pypi
rfc3339-validator 0.1.4 pypi_0 pypi
rfc3986-validator 0.1.1 pypi_0 pypi
rich 14.0.0 pypi_0 pypi
rich-toolkit 0.14.1 pypi_0 pypi
rouge 1.0.1 pypi_0 pypi
rouge-chinese 1.0.3 pypi_0 pypi
rouge-score 0.1.2 pypi_0 pypi
rpds-py 0.24.0 pypi_0 pypi
rsa 4.9.1 pypi_0 pypi
ruff 0.11.5 pypi_0 pypi
s3transfer 0.11.4 pypi_0 pypi
sacrebleu 2.5.1 pypi_0 pypi
safehttpx 0.1.6 pypi_0 pypi
safetensors 0.5.3 pypi_0 pypi
scikit-image 0.25.2 pypi_0 pypi
scikit-learn 1.6.1 pypi_0 pypi
scipy 1.15.2 pypi_0 pypi
seaborn 0.13.2 pypi_0 pypi
semantic-version 2.10.0 pypi_0 pypi
send2trash 1.8.3 pypi_0 pypi
sentence-transformers 4.0.2 pypi_0 pypi
sentencepiece 0.2.0 pypi_0 pypi
sentry-sdk 2.26.1 pypi_0 pypi
setproctitle 1.3.5 pypi_0 pypi
setuptools 69.5.1 pypi_0 pypi
sgmllib3k 1.0.0 pypi_0 pypi
shellingham 1.5.4 pypi_0 pypi
simplejson 3.20.1 pypi_0 pypi
six 1.17.0 pypi_0 pypi
smmap 5.0.2 pypi_0 pypi
sniffio 1.3.1 pypi_0 pypi
socksio 1.0.0 pypi_0 pypi
sortedcontainers 2.4.0 pypi_0 pypi
soupsieve 2.6 pypi_0 pypi
sqlite 3.45.3 h5eee18b_0
stack-data 0.6.3 pypi_0 pypi
starlette 0.46.1 pypi_0 pypi
streamlit 1.44.1 pypi_0 pypi
sty 1.0.6 pypi_0 pypi
swankit 0.1.7 pypi_0 pypi
swanlab 0.5.5 pypi_0 pypi
sympy 1.13.1 pypi_0 pypi
tabulate 0.9.0 pypi_0 pypi
tenacity 9.1.2 pypi_0 pypi
tensorboard 2.19.0 pypi_0 pypi
tensorboard-data-server 0.7.2 pypi_0 pypi
termcolor 3.0.1 pypi_0 pypi
terminado 0.18.1 pypi_0 pypi
threadpoolctl 3.6.0 pypi_0 pypi
tifffile 2025.3.30 pypi_0 pypi
tiktoken 0.9.0 pypi_0 pypi
timeout-decorator 0.5.0 pypi_0 pypi
tinycss2 1.4.0 pypi_0 pypi
tk 8.6.14 h39e8969_0
tokenizers 0.21.1 pypi_0 pypi
toml 0.10.2 pypi_0 pypi
tomli 2.2.1 pypi_0 pypi
tomlkit 0.13.2 pypi_0 pypi
torch 2.6.0 pypi_0 pypi
torchaudio 2.6.0 pypi_0 pypi
torchmetrics 1.7.1 pypi_0 pypi
torchvision 0.21.0 pypi_0 pypi
tornado 6.4.2 pypi_0 pypi
tqdm 4.67.1 pypi_0 pypi
traitlets 5.14.3 pypi_0 pypi
transformers 4.51.3 pypi_0 pypi
transformers-stream-generator 0.0.5 pypi_0 pypi
triton 3.2.0 pypi_0 pypi
trl 0.16.1 pypi_0 pypi
typer 0.15.2 pypi_0 pypi
types-python-dateutil 2.9.0.20241206 pypi_0 pypi
typing-extensions 4.13.2 pypi_0 pypi
typing-inspection 0.4.0 pypi_0 pypi
tzdata 2025.2 pypi_0 pypi
uri-template 1.3.0 pypi_0 pypi
uritemplate 4.1.1 pypi_0 pypi
urllib3 2.4.0 pypi_0 pypi
uvicorn 0.34.0 pypi_0 pypi
uvloop 0.21.0 pypi_0 pypi
validators 0.34.0 pypi_0 pypi
vllm 0.8.4 pypi_0 pypi
volcengine-python-sdk 1.1.5 pypi_0 pypi
wandb 0.19.9 pypi_0 pypi
watchdog 6.0.0 pypi_0 pypi
watchfiles 1.0.5 pypi_0 pypi
wcwidth 0.2.13 pypi_0 pypi
webcolors 24.11.1 pypi_0 pypi
webencodings 0.5.1 pypi_0 pypi
websocket-client 1.8.0 pypi_0 pypi
websockets 15.0.1 pypi_0 pypi
werkzeug 3.1.3 pypi_0 pypi
wheel 0.45.1 py310h06a4308_0
widgetsnbextension 4.0.14 pypi_0 pypi
word2number 1.1 pypi_0 pypi
wrapt 1.17.2 pypi_0 pypi
xformers 0.0.29.post2 pypi_0 pypi
xgrammar 0.1.18 pypi_0 pypi
xlsxwriter 3.2.2 pypi_0 pypi
xtuner 0.1.23 pypi_0 pypi
xxhash 3.5.0 pypi_0 pypi
xz 5.6.4 h5eee18b_1
yapf 0.43.0 pypi_0 pypi
yarl 1.19.0 pypi_0 pypi
zipp 3.21.0 pypi_0 pypi
zlib 1.2.13 h5eee18b_1
zstandard 0.23.0 pypi_0 pypi |
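Since the maintainers ask which vLLM version triggers the crash, a shorter way to report the relevant versions than a full environment dump is sketched below. It relies only on the standard-library `importlib.metadata`; the package selection in `PACKAGES` and the helper name `collect_versions` are assumptions made for illustration, not part of ms-swift.

```python
from importlib import metadata

# Distribution names picked for illustration from the listing and commands in this thread.
PACKAGES = ("ms-swift", "vllm", "torch", "trl", "transformers", "deepspeed")


def collect_versions(packages=PACKAGES):
    """Return {distribution name: installed version, or 'not installed'}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Print one aligned line per package, suitable for pasting into an issue comment.
    for name, version in collect_versions().items():
        print(f"{name:<14}{version}")
```

Running it inside the training environment gives a compact list that can be pasted next to the crash log.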
Can't reproduce this issue on the main branch. Does anyone have a clean environment and a script that reproduces it reliably? Below is my repro script:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NPROC_PER_NODE=8
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-1.5B-Instruct \
--reward_funcs accuracy format \
--use_vllm true \
--vllm_device auto \
--vllm_gpu_memory_utilization 0.6 \
--vllm_max_model_len 2048 \
--train_type lora \
--torch_dtype bfloat16 \
--dataset AI-MO/NuminaMath-TIR \
--max_completion_length 2048 \
--num_train_epochs 3 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 1 \
--eval_steps 200 \
--save_steps 200 \
--save_total_limit 2 \
--logging_steps 5 \
--max_length 4096 \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--dataset_num_proc 4 \
--num_generations 4 \
--temperature 1.0 \
--log_completions true \
--beta 0.001 \
--sleep_level 1 \
--system swift/examples/train/grpo/prompt.txt \
--num_infer_workers 8 \
--deepspeed zero3
cuda: 12.1
|
I am using this image |
@winni0 do you have any reproducible scripts? (Please use open-source models and datasets to ensure experimental consistency) |
data_format:LLVIP
torch2.4 |
swift=3.4.0 |
I hit the same problem. It shows up every time after a little over 1000 training iterations and reproduces reliably. |
swift.version |
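We use a dataset rebuilt from an open-source dataset. What differs from ordinary training is that we pass the 2_images parameter to vLLM and train with multiple images. |
How did everyone end up solving this? I can reproduce this error reliably. My command script is as follows: |
I am also using --sleep_level 1 --deepspeed zero3 |
Still not solved. |
The dataset is an open-source dataset from ModelScope, and the model is the open-source Qwen2.5-7B-Instruct. |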
The refactoring of the internal vLLM codebase is currently a work in progress. |
Has this been solved? I reproduce it reliably: the error occurs at step 1890. I had assumed something was wrong with my data. |
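Several reports above describe the crash arriving after a fairly stable number of steps. That pattern would be consistent with the error message itself: `none_dealloc` fires when the reference count of `None` reaches zero, and on Python versions before 3.12 `None` is an ordinary reference-counted object, so an extension that drops one reference too many per step would take a while to exhaust it. The sketch below is a diagnostic idea only, not the fix applied in ms-swift. `NoneRefcountMonitor` and `drift_per_step` are hypothetical names, and the count also moves for harmless reasons, so only a steady downward trend is meaningful.

```python
import sys


class NoneRefcountMonitor:
    """Track how the reference count of None drifts between samples."""

    def __init__(self, sample=lambda: sys.getrefcount(None)):
        # `sample` is injectable so the logic can be exercised without a real leak.
        self._sample = sample
        self._baseline = sample()
        self.history = []  # list of (step, drift relative to the baseline)

    def record(self, step):
        """Store the drift observed at `step` and return it."""
        drift = self._sample() - self._baseline
        self.history.append((step, drift))
        return drift

    def drift_per_step(self):
        """Average change per step between the first and last recorded samples."""
        if len(self.history) < 2:
            return 0.0
        (first_step, first_drift), (last_step, last_drift) = self.history[0], self.history[-1]
        if last_step == first_step:
            return 0.0
        return (last_drift - first_drift) / (last_step - first_step)
```

One way to use it would be to create the monitor before training starts, call `record(step)` at each logging step, and watch whether `drift_per_step()` stays clearly negative. On Python 3.12 and later `None` is immortal, so the sampled value would not be expected to move.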
The code for GRPOTrainer has been refactored. Please try again using the main branch. |
Is this fixed in v3.4.1? |
Describe the bug
What the bug is, and how to reproduce, better with screenshots
Partial error output:
INFO 04-12 20:17:34 worker.py:133] Sleep mode freed 36.81 GiB memory, 18.94 GiB memory is still in use.
INFO 04-12 20:17:34 executor_base.py:208] It took 0.316247 seconds to fall asleep.
INFO 04-12 20:17:34 worker.py:133] Sleep mode freed 36.81 GiB memory, 18.35 GiB memory is still in use.
INFO 04-12 20:17:34 executor_base.py:208] It took 0.316826 seconds to fall asleep.
{'loss': 0.02363645, 'grad_norm': 0.0, 'learning_rate': 9.2e-07, 'memory(GiB)': 54.45, 'train_speed(iter/s)': 0.073447, 'completion_length': 900.875, 'response_clip_ratio': 0.375, 'rewards/Format': 0.475, 'rewards/RepetitionPenalty': 0.0, 'reward': 0.475, 'reward_std': 0.15773503, 'kl': 0.0, 'clip_ratio': 0.0, 'epoch': 0.23, 'global_step/max_steps': '5825/25499', 'percentage': '22.84%', 'elapsed_time': '22h 1m 1s', 'remaining_time': '3d 2h 21m 45s'}
Train: 23%|███████████████████████▉ | 5825/25499 [22:01:01<60:01:23, 10.98s/it]
INFO 04-12 20:17:37 executor_base.py:219] It took 0.193484 seconds to wake up.
INFO 04-12 20:17:37 executor_base.py:219] It took 0.209495 seconds to wake up.
INFO 04-12 20:17:37 executor_base.py:219] It took 0.194383 seconds to wake up.
INFO 04-12 20:17:37 executor_base.py:219] It took 0.206503 seconds to wake up.
INFO 04-12 20:17:37 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-12 20:17:37 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-12 20:17:37 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-12 20:17:37 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-12 20:17:37 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-12 20:17:37 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-12 20:17:37 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-12 20:17:37 prefix_caching_block.py:479] Successfully reset prefix cache
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
Python runtime state: initialized
Thread 0x00007f22fd2fb640 (most recent call first):
Thread 0x00007f22fdafc640 (most recent call first):
Thread 0x00007f22fcafa640 (most recent call first):
Thread 0x00007f22fe2fd640 (most recent call first):
Thread 0x00007f22d4fd1640 (most recent call first):
File "/usr/local/lib/python3.11/threading.py", line 327 in wait
File "/usr/local/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/usr/local/lib/python3.11/threading.py", line 982 in run
File "/usr/local/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/usr/local/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007f22d47d0640 (most recent call first):
File "/usr/local/lib/python3.11/threading.py", line 327 in wait
File "/usr/local/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/usr/local/lib/python3.11/threading.py", line 982 in run
File "/usr/local/lib/python3.11/threading.py", line 327 in wait
File "/usr/local/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/usr/local/lib/python3.11/threading.py", line 982 in run
File "/usr/local/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/usr/local/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007f22d37ce640 (most recent call first):
File "/usr/local/lib/python3.11/threading.py", line 327 in wait
File "/usr/local/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/usr/local/lib/python3.11/threading.py", line 982 in run
File "/usr/local/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/usr/local/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007f22ff0fe640 (most recent call first):
File "/usr/local/lib/python3.11/selectors.py", line 415 in select
File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 948 in wait
File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 257 in poll
File "/usr/local/lib/python3.11/multiprocessing/queues.py", line 113 in get
File "/usr/local/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 35 in do_one_step
File "/usr/local/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 59 in _pin_memory_loop
File "/usr/local/lib/python3.11/threading.py", line 982 in run
File "/usr/local/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/usr/local/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007f2511fff640 (most recent call first):
File "/usr/local/lib/python3.11/site-packages/vllm/usage/usage_lib.py", line 220 in _report_continous_usage
File "/usr/local/lib/python3.11/site-packages/vllm/usage/usage_lib.py", line 163 in _report_usage_worker
File "/usr/local/lib/python3.11/threading.py", line 982 in run
File "/usr/local/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/usr/local/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007f2520ff9640 (most recent call first):
File "/usr/local/lib/python3.11/threading.py", line 331 in wait
File "/usr/local/lib/python3.11/threading.py", line 629 in wait
File "/usr/local/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
File "/usr/local/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/usr/local/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007f27b5a6f640 (most recent call first):
File "/usr/local/lib/python3.11/threading.py", line 331 in wait
File "/usr/local/lib/python3.11/threading.py", line 629 in wait
File "/usr/local/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
File "/usr/local/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/usr/local/lib/python3.11/threading.py", line 1002 in _bootstrap
Current thread 0x00007f2b06257e00 (most recent call first):
File "/usr/local/lib/python3.11/site-packages/vllm/core/scheduler.py", line 779 in _schedule_running
File "/usr/local/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1244 in _schedule_default
File "/usr/local/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1445 in _schedule
File "/usr/local/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1486 in schedule
File "/usr/local/lib/python3.11/site-packages/swift/llm/infer/infer_engine/utils.py", line 612 in new_step
File "/usr/local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 1397 in _run_engine
File "/usr/local/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 469 in generate
File "/usr/local/lib/python3.11/site-packages/vllm/utils.py", line 1057 in inner
File "/usr/local/lib/python3.11/site-packages/swift/llm/infer/infer_engine/grpo_vllm_engine.py", line 149 in infer
File "/usr/local/lib/python3.11/site-packages/swift/trainers/rlhf_trainer/grpo_trainer.py", line 637 in _infer_multi_turn
File "/usr/local/lib/python3.11/site-packages/swift/trainers/rlhf_trainer/grpo_trainer.py", line 775 in _fast_infer
File "/usr/local/lib/python3.11/site-packages/swift/trainers/rlhf_trainer/grpo_trainer.py", line 817 in _generate_and_score_completions
File "/usr/local/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 647 in _prepare_inputs
File "/usr/local/lib/python3.11/site-packages/trl/extras/profiling.py", line 87 in wrapper
File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 3669 in training_step
File "/usr/local/lib/python3.11/site-packages/swift/trainers/rlhf_trainer/grpo_trainer.py", line 1047 in training_step
File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 2531 in _inner_training_loop
File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 2171 in train
File "/usr/local/lib/python3.11/site-packages/swift/trainers/mixin.py", line 289 in train
File "/usr/local/lib/python3.11/site-packages/swift/llm/train/sft.py", line 202 in train
File "/usr/local/lib/python3.11/site-packages/swift/llm/train/sft.py", line 142 in run
File "/usr/local/lib/python3.11/site-packages/swift/llm/base.py", line 47 in main
File "/usr/local/lib/python3.11/site-packages/swift/llm/train/rlhf.py", line 98 in rlhf_main
File "/usr/local/lib/python3.11/site-packages/swift/cli/rlhf.py", line 5 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common,
numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random.
_sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, tor
ch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, zstandard.backend_c, charset_normalizer.md, simplejson._speed
ups, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, markupsafe._speedups, PIL._imaging, pyarrow.lib, pandas._libs.tslibs.cca
lendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, panda
s._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.t
slibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vector
ized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pa
ndas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.
index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pand
as._libs.json, pandas._libs.parsers, pandas._libs.testing, kiwisolver._cext, google._upb._message, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._azuref
s, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohtt
p._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, pyarrow._acero, pyarrow._csv, pyarrow._substrait, pyarrow
._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, sklearn.__check_build._check_build,
psutil._psutil_linux, psutil._psutil_posix, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas,
scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.li
nalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg
._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linal
root@k8s-gpu:~# tmux attach -t wangjuan1
frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.pad, av[10/1868]
ink, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.
codeccontext, regex._regex, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, msgspec._core, sentencepiece._sentencepiece, zmq.
backend.cython._zmq, msgpack._cmsgpack, setproctitle, uvloop.loop, ray._raylet, vllm.cumem_allocator, cuda_utils, __triton_launcher (total: 268)
W0412 20:17:38.097000 20231 usr/local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 20296 closing signal SIGTERM
W0412 20:17:38.098000 20231 usr/local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 20298 closing signal SIGTERM
W0412 20:17:38.100000 20231 usr/local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 20299 closing signal SIGTERM
E0412 20:17:38.104000 20231 usr/local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 1 (pid: 20297) of binary: /usr/local/bin/python
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 923, in <module>
main()
File "/usr/local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/usr/local/lib/python3.11/site-packages/swift/cli/rlhf.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-04-12_20:17:38
host : k8s-gpu
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 20297)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 20297
/usr/local/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up
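For triage, the part of a dump like the one above that matters most is the "Current thread" section, since that is the thread that was running when the interpreter aborted (here it is inside vLLM's scheduler, reached from the GRPO trainer). The sketch below pulls those frames out of a pasted dump. The helper name `current_thread_frames` and the regular expression are assumptions written against the log format shown in this issue, not a general faulthandler parser.

```python
import re

# Matches lines such as:  File "/path/file.py", line 779 in _schedule_running
FRAME_RE = re.compile(r'File "(?P<path>[^"]*)", line (?P<line>\d+) in ?(?P<func>\S*)')


def current_thread_frames(dump):
    """Return (path, line, func) tuples listed under the 'Current thread' header."""
    frames = []
    in_current = False
    for raw in dump.splitlines():
        text = raw.strip()
        if text.startswith("Current thread"):
            in_current = True
            continue
        if in_current:
            match = FRAME_RE.match(text)
            if match is None:
                break  # the first non-frame line ends the section
            frames.append((match["path"], int(match["line"]), match["func"]))
    return frames
```

The first tuple returned is the innermost frame, which is usually the most useful line to quote when comparing reports from different users.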
Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here
Using the official image: modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.5.1-modelscope1.25.0-swift3.2.2
cuda12.4 linux H100 torch2.5.1 4*80G
Additional context
Add any other context about the problem here
The model used is the one obtained after SFT with LoRA.
Command executed:
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
WANDB_API_KEY=XXX \
swift rlhf \
--rlhf_type grpo \
--model /nfs/largemodel/wangjuan/outputs/Qwen2.5-7B-110K-sft5-0408/v1-20250408-062835/checkpoint-7360-merged \
--train_type lora \
--dataset '/nfs/largemodel/wangjuan/data/law_chinese1/DISC-Law-SFT-Pair-QA-released_alpaca.json' '/nfs/largemodel/wangjuan/data/law_chinese1/DISC-Law-SFT-Triplet-QA-released_alpaca.json' \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--max_length 1024 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--eval_steps 1000 \
--save_steps 1000 \
--learning_rate 1e-6 \
--save_total_limit 2 \
--logging_steps 5 \
--output_dir /nfs/largemodel/wangjuan/outputs/Qwen2.5-7B-swift-GRPO-0411 \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--max_completion_length 1024 \
--reward_funcs format repetition \
--num_generations 4 \
--system '回复的格式如下:
...
...
' \
--use_vllm true \
--vllm_gpu_memory_utilization 0.5 \
--vllm_max_model_len 2048 \
--deepspeed zero3 \
--temperature 1.0 \
--top_p 1.0 \
--top_k 80 \
--log_completions true \
--num_infer_workers 4 \
--tensor_parallel_size 2 \
--async_generate false \
--move_model_batches 16 \
--offload_optimizer true \
--offload_model true \
--gc_collect_after_offload true \
--report_to 'wandb' \
--sleep_level 1