When we change bsz in bmk_comm_latency_multiserver.py, we get the following error:
RuntimeError: [15:17:41] /infrawaves/StepMesh/fserver/csrc/./public.hpp:103: Check failed: (tensors.size()) == (reqmeta.pull_tensors.size())
After digging into the code, we found that the hard-coded line below uses only the first tensor in input_tensors.
elif is_server:
    ret_buffer = torch.rand([65535, dim], dtype=torch.bfloat16, device='cuda')
    count = 0
    f.barrier(True, False)
    def server():
        global count
        iter_count = 0
        while True:
            batches = f.get_batch()
            if len(batches) != 0:
                iter_count += 1
                # hard code
-               recv_tensor_list = [batches[i][1][0] for i in range(worker_count)]
+               recv_tensor_list = [batches[i][1] for i in range(worker_count)]
                comm_id_list = [batches[i][0] for i in range(worker_count)]
                f.respond_vec(ret_buffer, recv_tensor_list, comm_id_list)
                if iter_count == num_iters:
                    break
    server()
    f.stop()
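To make the shape change concrete, here is a minimal sketch. It assumes, as the indexing above suggests, that each entry returned by f.get_batch() is a (comm_id, tensor_list) pair. The strings stand in for tensors, and fake_batches is a hypothetical placeholder, not real StepMesh output.

```python
# Hypothetical stand-in for f.get_batch(): one (comm_id, tensor_list) pair per worker.
fake_batches = [
    (101, ["w0_t0", "w0_t1"]),
    (102, ["w1_t0", "w1_t1"]),
]
worker_count = len(fake_batches)

# Original indexing: keeps only the first tensor of each worker.
first_only = [fake_batches[i][1][0] for i in range(worker_count)]

# Modified indexing: keeps every tensor, giving one list per worker.
all_tensors = [fake_batches[i][1] for i in range(worker_count)]

# Unchanged: one communication id per worker.
comm_id_list = [fake_batches[i][0] for i in range(worker_count)]
```

With the modified indexing, recv_tensor_list becomes a list of lists, which is why respond_vec also needs a nested vector on the C++ side.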
Besides changing the hard-coded line above, we also changed the related code in fserver/csrc/public.hpp as follows:
void respond_vec(torch::Tensor& ret_buffer,
                 std::vector<std::vector<torch::Tensor>>& tensors_vec,
                 std::vector<uint64_t>& handler_vec) {
  PS_CHECK_EQ(tensors_vec.size(), handler_vec.size());
  for (size_t i = 0; i < handler_vec.size(); i++) {
    std::vector<torch::Tensor> sliced_buffer_list;
    int64_t tensor_shape_0 = tensors_vec[i][0].size(0);
    for (int j = 0; j < tensors_vec[i].size() - 1; j++) {
      sliced_buffer_list.push_back(
          ret_buffer.slice(0, j * tensor_shape_0, tensor_shape_0));
    }
    // std::vector<torch::Tensor> sliced_buffer_list = {
    //   ret_buffer.slice(0, 0, tensor_shape_0)
    // };
    respond(sliced_buffer_list, handler_vec[i], i == 0);
  }
}
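One detail that may be worth double-checking in the snippet above: torch::Tensor::slice(dim, start, end) takes an exclusive end index, not a length. The sketch below only illustrates the row ranges involved, assuming every tensor from a worker has the same first dimension. The helper names are hypothetical, and this is not a confirmed diagnosis of the errors reported here.

```python
def slice_bounds(rows_per_tensor, num_tensors):
    """Return (start, end) row ranges for consecutive, equal-sized slices.

    Follows the slice(dim, start, end) convention: `end` is an
    exclusive index, not a length.
    """
    return [(j * rows_per_tensor, (j + 1) * rows_per_tensor)
            for j in range(num_tensors)]


def slice_bounds_as_posted(rows_per_tensor, num_tensors):
    """Ranges produced when the slice length is passed as `end`."""
    return [(j * rows_per_tensor, rows_per_tensor)
            for j in range(num_tensors)]


def num_rows(bounds):
    # A slice is empty whenever start >= end.
    return [max(0, end - start) for start, end in bounds]


contiguous = num_rows(slice_bounds(4, 3))
as_posted = num_rows(slice_bounds_as_posted(4, 3))
```

Under these assumptions, only the first slice in the posted form is non-empty. Separately, the inner loop bound tensors_vec[i].size() - 1 skips the last tensor, which may or may not be intended.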
Unfortunately, with these changes we then hit a further error:
[01:55:14] server /gpfs/Stepmesh/include/dmlc/logging.h:301: [01:55:14] /infrawaves/StepMesh/src/./rdma_van.h:875: Check failed: (temp_mr) != (mem_mr_.end())
Stack trace returned 6 entries:
[bt] (0) /gpfs/Stepmesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(+0x5422f) [0x7f897146822f]
[bt] (1) /gpfs/Stepmesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(+0x54563) [0x7f8971468563]
[bt] (2) /gpfs/Stepmesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(ps::RDMAVan::PollCQ()+0x1a81) [0x7f89714d8951]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f8a8b6b0253]
[bt] (4) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8aeb334ac3]
[bt] (5) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f8aeb3c6850]
Have you encountered the problem above, and is there a known solution? Any feedback would be highly appreciated.