GRPO training fails with an error at an intermediate step #4867

Open
@HongyongZZZ

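I am using the Colocate (Internal) Mode deployment. Training runs normally in the early stage, but after a little over 1,000 steps the error below appears. I have retried several times, and the error always occurs somewhere between steps 1190 and 1300.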

Training command
`export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NPROC_PER_NODE=8

swift rlhf \
    --rlhf_type grpo \
    --model /home/user/models/qwen2.5-audio-7B-ins \
    --model_type qwen2_audio \
    --reward_funcs external_my_reward \
    --reward_weights 1 \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.35 \
    --sleep_level 1 \
    --offload_optimizer false \
    --offload_model false \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset '/home/user/data/30000_RL.json' \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --eval_steps 300 \
    --save_steps 300 \
    --save_total_limit 25 \
    --logging_steps 5 \
    --max_length 8192 \
    --output_dir output/GRPO_Qwen-Audio \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --num_generations 32 \
    --temperature 1. \
    --deepspeed zero3 \
    --log_completions true \
    --report_to wandb \
    --num_iterations 1 \
    --beta 0.001 `

Error
`Fatal Python error: none_dealloc: deallocating None
Python runtime state: initialized

Thread 0x00007f4e64ffd700 (most recent call first):

Thread 0x00007f53aaffd700 (most recent call first):

Thread 0x00007f549ffff700 (most recent call first):

Thread 0x00007f4e657fe700 (most recent call first):

Thread 0x00007f4e65fff700 (most recent call first):

Thread 0x00007f5320dff700 (most recent call first):

Thread 0x00007f53d4bd6700 (most recent call first):

Thread 0x00007f5362ad3700 (most recent call first):

Thread 0x00007f54b1b1c700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 320 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 953 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f54b231d700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 320 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 953 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f54b2b1e700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 320 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 953 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f54bcbfd700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 320 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 953 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f54bd5fe700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/selectors.py", line 416 in select
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/connection.py", line 931 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/connection.py", line 424 in _poll
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/connection.py", line 257 in poll
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/queues.py", line 113 in get
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 35 in do_one_step
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 59 in _pin_memory_loop
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 953 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f54be9ff700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 324 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 607 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f55b6ffd700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 324 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 607 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f55dd3fe700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 53 in _recv_msg
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 161 in _read_thread
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 953 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f56e58844c0 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/device_allocator/cumem.py", line 80 in unmap_and_release
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/device_allocator/cumem.py", line 205 in sleep
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 76 in sleep
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/utils.py", line 2216 in run_method
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56 in collective_rpc
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 205 in sleep
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 251 in sleep
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 186 in sleep
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 233 in sleep
File "/data/user/zhy/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 837 in _fast_infer
File "/data/user/zhy/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 854 in _generate_completions
File "/data/user/zhy/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 873 in _generate_and_score_completions
File "/data/user/zhy/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 348 in _prepare_inputs
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/trl/extras/profiling.py", line 87 in wrapper
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 3669 in training_step
File "/data/user/zhy/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 1380 in training_step
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2531 in _inner_training_loop
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2171 in train
File "/data/user/zhy/ms-swift/swift/trainers/mixin.py", line 411 in train
File "/data/user/zhy/ms-swift/swift/llm/train/sft.py", line 183 in train
File "/data/user/zhy/ms-swift/swift/llm/train/sft.py", line 122 in run
File "/data/user/zhy/ms-swift/swift/llm/base.py", line 49 in main
File "/data/user/zhy/ms-swift/swift/llm/train/rlhf.py", line 172 in rlhf_main
File "/data/user/zhy/ms-swift/swift/cli/rlhf.py", line 5 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, zstandard.backend_c, charset_normalizer.md, simplejson._speedups, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, markupsafe._speedups, PIL._imaging, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, propcache._helpers_c, 
aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, pyarrow._acero, pyarrow._csv, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, sklearn.__check_build._check_build, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, psutil._psutil_linux, psutil._psutil_posix, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, 
scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, PIL._imagingft, lxml._elementpath, lxml.etree, regex._regex, sklearn.feature_extraction._hashing_fast, sklearn.utils._random, sklearn.utils._seq_dataset, sklearn.linear_model._cd_fast, _loss, sklearn._loss._loss, sklearn.utils.arrayfuncs, sklearn.svm._liblinear, sklearn.svm._libsvm, sklearn.svm._libsvm_sparse, sklearn.linear_model._sag_fast, sklearn.utils._weight_vector, 
sklearn.linear_model._sgd_fast, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, sklearn.datasets._svmlight_format_fast, rapidfuzz._feature_detector_cpp, rapidfuzz.distance._initialize_cpp, rapidfuzz.distance.metrics_cpp_avx2, rapidfuzz.fuzz_cpp_avx2, rapidfuzz.process_cpp_impl, rapidfuzz.utils_cpp, Levenshtein.levenshtein_cpp, cuda_utils, msgpack._cmsgpack, google._upb._message, msgspec._core, zmq.backend.cython._zmq, setproctitle, uvloop.loop, ray._raylet, sentencepiece._sentencepiece, vllm.cumem_allocator, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, __triton_launcher, scipy.signal._sigtools, scipy.signal._max_len_seq_inner, scipy.signal._upfirdn_apply, scipy.signal._spline, scipy.signal._sosfilt, scipy.signal._peak_finding_utils (total: 257)
INFO 07-06 04:11:51 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 39.16 GiB memory is still in use.
INFO 07-06 04:11:51 [executor_base.py:208] It took 0.820670 seconds to fall asleep.
INFO 07-06 04:11:51 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 39.73 GiB memory is still in use.
INFO 07-06 04:11:51 [executor_base.py:208] It took 0.884788 seconds to fall asleep.
INFO 07-06 04:11:52 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 38.77 GiB memory is still in use.
INFO 07-06 04:11:52 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 39.09 GiB memory is still in use.
INFO 07-06 04:11:52 [executor_base.py:208] It took 1.376163 seconds to fall asleep.
INFO 07-06 04:11:52 [executor_base.py:208] It took 0.898365 seconds to fall asleep.
INFO 07-06 04:11:52 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 38.79 GiB memory is still in use.
INFO 07-06 04:11:52 [executor_base.py:208] It took 1.367942 seconds to fall asleep.
INFO 07-06 04:11:52 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 38.88 GiB memory is still in use.
INFO 07-06 04:11:52 [executor_base.py:208] It took 1.440262 seconds to fall asleep.
INFO 07-06 04:11:52 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 39.25 GiB memory is still in use.
INFO 07-06 04:11:52 [executor_base.py:208] It took 1.400168 seconds to fall asleep.
W0706 04:11:57.021000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245966 closing signal SIGTERM
W0706 04:11:57.033000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245967 closing signal SIGTERM
W0706 04:11:57.037000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245968 closing signal SIGTERM
W0706 04:11:57.042000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245970 closing signal SIGTERM
W0706 04:11:57.056000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245971 closing signal SIGTERM
W0706 04:11:57.081000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245972 closing signal SIGTERM
W0706 04:11:57.087000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245973 closing signal SIGTERM
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
E0706 04:12:22.443000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 3 (pid: 3245969) of binary: /home/user/miniconda3/envs/swift/bin/python3.10
Traceback (most recent call last):
File "/home/user/miniconda3/envs/swift/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/user/miniconda3/envs/swift/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in <module>
main()
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: `
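
For triage, the log above can be summarized with a small script. The following is only a sketch: the helper names `summarize_sleep_lines` and `describe_exitcode` and the `sample` text are hypothetical and are not part of ms-swift or vLLM. It counts the "Sleep mode freed" lines and decodes the negative exit code reported by the launcher. In the log above there are seven such lines for eight processes, and `exitcode: -6` on local_rank 3 corresponds to SIGABRT, which is consistent with the `Fatal Python error` abort inside `unmap_and_release`.

```python
import re
import signal

# Matches the vLLM sleep-mode log line format seen in the log above.
SLEEP_RE = re.compile(
    r"Sleep mode freed (?P<freed>[\d.]+) GiB memory, "
    r"(?P<in_use>[\d.]+) GiB memory is still in use"
)


def summarize_sleep_lines(log_text):
    """Return (count, min_in_use, max_in_use) in GiB for the sleep log lines."""
    in_use = [float(m.group("in_use")) for m in SLEEP_RE.finditer(log_text)]
    if not in_use:
        return 0, None, None
    return len(in_use), min(in_use), max(in_use)


def describe_exitcode(exitcode):
    """A negative exit code means the process was killed by signal -exitcode."""
    if exitcode < 0:
        return signal.Signals(-exitcode).name
    return f"exit status {exitcode}"


# Two lines copied from the log above, used as sample input.
sample = (
    "INFO 07-06 04:11:51 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, "
    "39.73 GiB memory is still in use.\n"
    "INFO 07-06 04:11:52 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, "
    "38.77 GiB memory is still in use.\n"
)

if __name__ == "__main__":
    print(summarize_sleep_lines(sample))  # (2, 38.77, 39.73)
    print(describe_exitcode(-6))          # SIGABRT (on Linux)
```

Running it on the full log of a failed run shows how many ranks completed the sleep call before the abort and how much GPU memory each still held.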
