[Bug] Crash when changing batch size in bmk_comm_latency_multiserver.py #28

Description

@Zhangmj0621

When we change bsz in bmk_comm_latency_multiserver.py, we get an error like the following:
RuntimeError: [15:17:41] /infrawaves/StepMesh/fserver/csrc/./public.hpp:103: Check failed: (tensors.size()) == (reqmeta.pull_tensors.size())
After digging into the code, we found that the hard-coded line below uses only the first tensor in input_tensors.

elif is_server:
    ret_buffer = torch.rand([65535, dim], dtype=torch.bfloat16, device='cuda')
    count = 0
    f.barrier(True, False)
    def server():
        global count
        iter_count = 0
        while True:
            batches = f.get_batch()
            if len(batches) != 0:
                iter_count += 1
                # hard-coded: the original line (-) keeps only the first tensor
                # of each batch; the change (+) keeps the full tensor list
                - recv_tensor_list = [batches[i][1][0] for i in range(worker_count)]
                + recv_tensor_list = [batches[i][1] for i in range(worker_count)]
                comm_id_list = [batches[i][0] for i in range(worker_count)]

                f.respond_vec(ret_buffer, recv_tensor_list, comm_id_list)
                if iter_count == num_iters:
                    break
    server()

f.stop()
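
For context, the loop above treats each entry of f.get_batch() as a (comm_id, tensor_list) pair, so the original [1][0] indexing keeps only one tensor per worker. Below is a minimal Python sketch (plain torch with made-up sizes, not the StepMesh API) of why that trips the pull_tensors check once a larger bsz makes each request carry several tensors:

import torch

# Hypothetical stand-in for what the server loop receives: one entry per worker,
# each assumed to be (comm_id, list_of_received_tensors), as indexed above.
dim, bsz, worker_count = 1024, 4, 2
batches = [
    (wid, [torch.empty(8, dim, dtype=torch.bfloat16) for _ in range(bsz)])
    for wid in range(worker_count)
]

hard_coded = [batches[i][1][0] for i in range(worker_count)]  # one tensor per worker
full_lists = [batches[i][1] for i in range(worker_count)]     # bsz tensors per worker

# public.hpp:103 checks tensors.size() == reqmeta.pull_tensors.size(); with the
# hard-coded indexing the response would carry a single tensor where bsz are expected.
print(torch.is_tensor(hard_coded[0]), len(full_lists[0]))  # True 4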

Besides changing the hard-coded line above, we also changed the related code in fserver/csrc/public.hpp as follows:

void respond_vec(torch::Tensor& ret_buffer,
                 std::vector<std::vector<torch::Tensor>>& tensors_vec,
                 std::vector<uint64_t>& handler_vec) {
  PS_CHECK_EQ(tensors_vec.size(), handler_vec.size());
  for (size_t i = 0; i < handler_vec.size(); i++) {
    std::vector<torch::Tensor> sliced_buffer_list;
    int64_t tensor_shape_0 = tensors_vec[i][0].size(0);
    // slice one row block of ret_buffer per pull tensor
    for (size_t j = 0; j + 1 < tensors_vec[i].size(); j++) {
      sliced_buffer_list.push_back(
          ret_buffer.slice(0, j * tensor_shape_0, (j + 1) * tensor_shape_0));
    }
    // original single-slice version:
    // std::vector<torch::Tensor> sliced_buffer_list = {
    //     ret_buffer.slice(0, 0, tensor_shape_0)
    // };
    respond(sliced_buffer_list, handler_vec[i], i == 0);
  }
}
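
For reference, the slicing above is meant to hand the j-th pull tensor the rows [j * tensor_shape_0, (j + 1) * tensor_shape_0) of ret_buffer. A rough Python sketch of that layout (CPU tensors and made-up sizes, illustration only):

import torch

# Illustration of the intended row-block layout, not the actual server path.
dim, n, bsz = 1024, 8, 3                       # n stands in for tensor_shape_0
ret_buffer = torch.zeros(65535, dim, dtype=torch.bfloat16)

sliced = [ret_buffer.narrow(0, j * n, n) for j in range(bsz)]
for j, t in enumerate(sliced):
    # Every slice is a view; its data pointer sits j * n * dim * 2 bytes past
    # ret_buffer's base (bfloat16 is 2 bytes per element).
    print(j, t.data_ptr() - ret_buffer.data_ptr())

One guess on our side: if mem_mr_ in rdma_van.h is looked up by the exact data pointer that was registered, only the j == 0 slice would match ret_buffer's registered address, which would be consistent with the Check failed: (temp_mr) != (mem_mr_.end()) trace below.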

Unfortunately, we then see a further error:

[01:55:14] server /gpfs/Stepmesh/include/dmlc/logging.h:301: [01:55:14] /infrawaves/StepMesh/src/./rdma_van.h:875: Check failed: (temp_mr) != (mem_mr_.end()) 
Stack trace returned 6 entries:
[bt] (0) /gpfs/Stepmesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(+0x5422f) [0x7f897146822f]
[bt] (1) /gpfs/Stepmesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(+0x54563) [0x7f8971468563]
[bt] (2) /gpfs/Stepmesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(ps::RDMAVan::PollCQ()+0x1a81) [0x7f89714d8951]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f8a8b6b0253]
[bt] (4) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8aeb334ac3]
[bt] (5) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f8aeb3c6850]

So I wonder whether you have encountered this problem and whether there is any known solution. Any feedback would be highly appreciated.
