Description
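I am using the Colocate (Internal) Mode deployment. Training runs normally at first, but after a little over 1,000 steps it crashes with the error below. I have retried several times, and every run fails somewhere between steps 1190 and 1300.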
Training command
`export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NPROC_PER_NODE=8
swift rlhf \
    --rlhf_type grpo \
    --model /home/user/models/qwen2.5-audio-7B-ins \
    --model_type qwen2_audio \
    --reward_funcs external_my_reward \
    --reward_weights 1 \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.35 \
    --sleep_level 1 \
    --offload_optimizer false \
    --offload_model false \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset '/home/user/data/30000_RL.json' \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --eval_steps 300 \
    --save_steps 300 \
    --save_total_limit 25 \
    --logging_steps 5 \
    --max_length 8192 \
    --output_dir output/GRPO_Qwen-Audio \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --num_generations 32 \
    --temperature 1. \
    --deepspeed zero3 \
    --log_completions true \
    --report_to wandb \
    --num_iterations 1 \
    --beta 0.001 `
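
For context on the workload per step, here is a rough sketch of the rollout arithmetic implied by the command. It assumes the common GRPO convention that the number of completions per optimizer step is `per_device_train_batch_size` × number of processes × `gradient_accumulation_steps`; I have not verified that this is exactly how my ms-swift version counts them, so treat the formula as an assumption.

```python
# Values copied from the training command; the batch formula itself is an assumption.
num_processes = 8                  # NPROC_PER_NODE
per_device_train_batch_size = 8    # --per_device_train_batch_size
gradient_accumulation_steps = 2    # --gradient_accumulation_steps
num_generations = 32               # --num_generations

completions_per_step = (
    per_device_train_batch_size * num_processes * gradient_accumulation_steps
)

# GRPO groups completions by prompt, so the total should divide evenly.
assert completions_per_step % num_generations == 0
unique_prompts_per_step = completions_per_step // num_generations

print(completions_per_step, unique_prompts_per_step)  # prints: 128 4
```

Under that assumption, each step generates 128 completions for 4 distinct prompts, and vLLM is put to sleep and woken up around every one of those generation phases.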
Error
`Fatal Python error: none_dealloc: deallocating None
Python runtime state: initialized
Thread 0x00007f4e64ffd700 (most recent call first):
Thread 0x00007f53aaffd700 (most recent call first):
Thread 0x00007f549ffff700 (most recent call first):
Thread 0x00007f4e657fe700 (most recent call first):
Thread 0x00007f4e65fff700 (most recent call first):
Thread 0x00007f5320dff700 (most recent call first):
Thread 0x00007f53d4bd6700 (most recent call first):
Thread 0x00007f5362ad3700 (most recent call first):
Thread 0x00007f54b1b1c700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 320 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 953 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f54b231d700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 320 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 953 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f54b2b1e700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 320 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 953 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f54bcbfd700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 320 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 953 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f54bd5fe700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/selectors.py", line 416 in select
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/connection.py", line 931 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/connection.py", line 424 in _poll
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/connection.py", line 257 in poll
File "/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/queues.py", line 113 in get
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 35 in do_one_step
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 59 in _pin_memory_loop
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 953 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f54be9ff700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 324 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 607 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f55b6ffd700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 324 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 607 in wait
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f55dd3fe700 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 53 in _recv_msg
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 161 in _read_thread
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 953 in run
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/home/user/miniconda3/envs/swift/lib/python3.10/threading.py", line 973 in _bootstrap
Current thread 0x00007f56e58844c0 (most recent call first):
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/device_allocator/cumem.py", line 80 in unmap_and_release
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/device_allocator/cumem.py", line 205 in sleep
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 76 in sleep
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/utils.py", line 2216 in run_method
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56 in collective_rpc
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 205 in sleep
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 251 in sleep
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 186 in sleep
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 233 in sleep
File "/data/user/zhy/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 837 in _fast_infer
File "/data/user/zhy/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 854 in _generate_completions
File "/data/user/zhy/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 873 in _generate_and_score_completions
File "/data/user/zhy/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 348 in _prepare_inputs
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/trl/extras/profiling.py", line 87 in wrapper
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 3669 in training_step
File "/data/user/zhy/ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py", line 1380 in training_step
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2531 in _inner_training_loop
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2171 in train
File "/data/user/zhy/ms-swift/swift/trainers/mixin.py", line 411 in train
File "/data/user/zhy/ms-swift/swift/llm/train/sft.py", line 183 in train
File "/data/user/zhy/ms-swift/swift/llm/train/sft.py", line 122 in run
File "/data/user/zhy/ms-swift/swift/llm/base.py", line 49 in main
File "/data/user/zhy/ms-swift/swift/llm/train/rlhf.py", line 172 in rlhf_main
File "/data/user/zhy/ms-swift/swift/cli/rlhf.py", line 5 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, zstandard.backend_c, charset_normalizer.md, simplejson._speedups, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, markupsafe._speedups, PIL._imaging, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, propcache._helpers_c, 
aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, pyarrow._acero, pyarrow._csv, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, sklearn.__check_build._check_build, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, psutil._psutil_linux, psutil._psutil_posix, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, 
scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, PIL._imagingft, lxml._elementpath, lxml.etree, regex._regex, sklearn.feature_extraction._hashing_fast, sklearn.utils._random, sklearn.utils._seq_dataset, sklearn.linear_model._cd_fast, _loss, sklearn._loss._loss, sklearn.utils.arrayfuncs, sklearn.svm._liblinear, sklearn.svm._libsvm, sklearn.svm._libsvm_sparse, sklearn.linear_model._sag_fast, sklearn.utils._weight_vector, 
sklearn.linear_model._sgd_fast, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, sklearn.datasets._svmlight_format_fast, rapidfuzz._feature_detector_cpp, rapidfuzz.distance._initialize_cpp, rapidfuzz.distance.metrics_cpp_avx2, rapidfuzz.fuzz_cpp_avx2, rapidfuzz.process_cpp_impl, rapidfuzz.utils_cpp, Levenshtein.levenshtein_cpp, cuda_utils, msgpack._cmsgpack, google._upb._message, msgspec._core, zmq.backend.cython._zmq, setproctitle, uvloop.loop, ray._raylet, sentencepiece._sentencepiece, vllm.cumem_allocator, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, __triton_launcher, scipy.signal._sigtools, scipy.signal._max_len_seq_inner, scipy.signal._upfirdn_apply, scipy.signal._spline, scipy.signal._sosfilt, scipy.signal._peak_finding_utils (total: 257)
INFO 07-06 04:11:51 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 39.16 GiB memory is still in use.
INFO 07-06 04:11:51 [executor_base.py:208] It took 0.820670 seconds to fall asleep.
INFO 07-06 04:11:51 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 39.73 GiB memory is still in use.
INFO 07-06 04:11:51 [executor_base.py:208] It took 0.884788 seconds to fall asleep.
INFO 07-06 04:11:52 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 38.77 GiB memory is still in use.
INFO 07-06 04:11:52 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 39.09 GiB memory is still in use.
INFO 07-06 04:11:52 [executor_base.py:208] It took 1.376163 seconds to fall asleep.
INFO 07-06 04:11:52 [executor_base.py:208] It took 0.898365 seconds to fall asleep.
INFO 07-06 04:11:52 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 38.79 GiB memory is still in use.
INFO 07-06 04:11:52 [executor_base.py:208] It took 1.367942 seconds to fall asleep.
INFO 07-06 04:11:52 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 38.88 GiB memory is still in use.
INFO 07-06 04:11:52 [executor_base.py:208] It took 1.440262 seconds to fall asleep.
INFO 07-06 04:11:52 [gpu_worker.py:81] Sleep mode freed 20.97 GiB memory, 39.25 GiB memory is still in use.
INFO 07-06 04:11:52 [executor_base.py:208] It took 1.400168 seconds to fall asleep.
W0706 04:11:57.021000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245966 closing signal SIGTERM
W0706 04:11:57.033000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245967 closing signal SIGTERM
W0706 04:11:57.037000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245968 closing signal SIGTERM
W0706 04:11:57.042000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245970 closing signal SIGTERM
W0706 04:11:57.056000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245971 closing signal SIGTERM
W0706 04:11:57.081000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245972 closing signal SIGTERM
W0706 04:11:57.087000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3245973 closing signal SIGTERM
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/user/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
E0706 04:12:22.443000 3245876 /data/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 3 (pid: 3245969) of binary: /home/user/miniconda3/envs/swift/bin/python3.10
Traceback (most recent call last):
File "/home/user/miniconda3/envs/swift/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/user/miniconda3/envs/swift/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in <module>
main()
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: `
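
To make the log easier to read, here is a small sketch of how I interpret it. The two helpers (`describe_exit_code` and `current_thread_frames`) are hypothetical names written for this issue, not part of ms-swift, vLLM, or PyTorch. The first maps the negative exit code reported by the launcher to a signal name; the second pulls out the frames of the `Current thread` section of the dump, which is the thread that hit the fatal error.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Return the signal name for a negative exit code, else the plain status."""
    if exit_code < 0:
        # A negative exit code means the process was terminated by that signal.
        return signal.Signals(-exit_code).name
    return f"exit status {exit_code}"


def current_thread_frames(dump: str) -> list[str]:
    """Collect the 'File ...' lines that follow the 'Current thread' header."""
    frames = []
    in_current = False
    for line in dump.splitlines():
        stripped = line.strip()
        if stripped.startswith("Current thread"):
            in_current = True
            continue
        if in_current:
            if stripped.startswith("File "):
                frames.append(stripped)
            else:
                break  # first non-frame line ends the current thread's stack
    return frames


# Shortened excerpt of the dump above, used only to exercise the helpers.
sample_dump = """\
Thread 0x00007f55dd3fe700 (most recent call first):
File ".../torch/_inductor/compile_worker/subproc_pool.py", line 53 in _recv_msg
Current thread 0x00007f56e58844c0 (most recent call first):
File ".../vllm/device_allocator/cumem.py", line 80 in unmap_and_release
File ".../vllm/device_allocator/cumem.py", line 205 in sleep
Extension modules: numpy.core._multiarray_umath (total: 1)
"""

print(describe_exit_code(-6))
for frame in current_thread_frames(sample_dump):
    print(frame)
```

On Linux, exit code -6 corresponds to `SIGABRT`, which matches the interpreter aborting on `Fatal Python error`. The innermost frame of the crashing thread is `unmap_and_release` in vLLM's `cumem.py`, reached from `sleep` called by `_fast_infer` in `grpo_trainer.py`, so the abort happens while vLLM is releasing GPU memory on entering sleep mode.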