
Conversation

CAROLZXYZXY
Contributor

@CAROLZXYZXY CAROLZXYZXY commented May 16, 2025

Make the TPU CI pipeline so that:

  1. Tests run sequentially, because the TPU can only serve one process at a time.
  2. If any test fails, the command exits with a non-zero code (see the sketch below).
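
A minimal sketch of that behavior, assuming the suite is a list of pytest files run one at a time (the file list below is illustrative; the actual change lives in the Buildkite pipeline config):

# run_tpu_tests.py -- illustrative sketch only; the test list is a placeholder.
import subprocess
import sys

TEST_FILES = [
    "tests/v1/tpu/worker/test_tpu_model_runner.py",
    # ... remaining TPU test files, one entry per file
]

exit_code = 0
for test_file in TEST_FILES:
    # Run each file in its own process so only one process touches the TPU.
    result = subprocess.run(["pytest", "-s", "-v", test_file])
    if result.returncode != 0:
        exit_code = result.returncode  # remember the failure but keep running

# Exit non-zero if any file failed, so the CI step is marked as failed.
sys.exit(exit_code)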


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label May 16, 2025
@yaochengji yaochengji added the ready ONLY add when PR is ready to merge/full CI is needed label May 17, 2025
@CAROLZXYZXY CAROLZXYZXY force-pushed the cazheng/fix-tpu-ci branch from 9c2a526 to 8976aa4 Compare May 17, 2025 01:00
@yaochengji
Collaborator

I saw a lot of

RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0

These tests cannot run in parallel because two processes cannot use the TPU at the same time.
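
(For illustration only, not what this PR does: if strictly sequential invocation were not an option, the same constraint could be enforced with an exclusive file lock around each test process, so two pytest runs never open /dev/vfio/0 at the same time. The lock path below is a made-up example.)

# tpu_lock.py -- illustrative sketch, not part of this PR.
import fcntl
import subprocess
import sys

LOCK_PATH = "/tmp/tpu_ci.lock"  # hypothetical lock file

def run_with_tpu_lock(cmd: list[str]) -> int:
    # Hold an exclusive lock for the lifetime of the command so only one
    # process can initialize the TPU at a time.
    with open(LOCK_PATH, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the TPU is free
        try:
            return subprocess.run(cmd).returncode
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

if __name__ == "__main__":
    sys.exit(run_with_tpu_lock(["pytest", "-s", "-v"] + sys.argv[1:]))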

@CAROLZXYZXY CAROLZXYZXY force-pushed the cazheng/fix-tpu-ci branch 2 times, most recently from 819dc10 to 1f42686 Compare May 17, 2025 21:57
@CAROLZXYZXY
Contributor Author

Done. Changed the pipeline to run the tests sequentially. With the current setup, I see code-level errors:

PASSED
WARNING 05-17 22:37:41 [parallel_state.py:1229] torch._C._host_emptyCache() only available in Pytorch >=2.5

=================================== FAILURES ===================================
________________________ test_update_states_new_request ________________________

model_runner = <vllm.v1.worker.tpu_model_runner.TPUModelRunner object at 0x7bd274e80070>

    def test_update_states_new_request(model_runner):
        req_id = "req_0"

        # new req
        scheduler_output = _schedule_new_request(req_id)

>       model_runner._update_states(scheduler_output)

tests/v1/tpu/worker/test_tpu_model_runner.py:131:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <vllm.v1.worker.tpu_model_runner.TPUModelRunner object at 0x7bd274e80070>
scheduler_output = SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=req_0,prompt_token_ids=[1, 2, 3],mm_inputs=[],mm_hashes=[],m...s=set(), free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=None, kv_connector_metadata=None)

    def _update_states(self, scheduler_output: "SchedulerOutput") -> bool:
        """Update the cached states and the persistent batch with the scheduler
        output.

        The updated states are used by the `_prepare_inputs` function to create
        the input GPU tensors for the model.

        Returns:
            True if there is a new/resumed/paused/finished request.
            If False, we can skip copying SamplingMetadata to the GPU.
        """
        # Remove finished requests from the cached states.
        for req_id in scheduler_output.finished_req_ids:
            self.requests.pop(req_id, None)
            self.encoder_cache.pop(req_id, None)

        # Remove the finished requests from the persistent batch.
        # NOTE(woosuk): There could be an edge case where finished_req_ids and
        # scheduled_req_ids overlap. This happens when a request is aborted and
        # then resubmitted with the same ID. In this case, we treat them as two
        # distinct requests - clearing the cached states for the first request
        # and handling the second as a new request.
        removed_req_indices: list[int] = []
        for req_id in scheduler_output.finished_req_ids:
            req_index = self.input_batch.remove_request(req_id)
            if req_index is not None:
                removed_req_indices.append(req_index)

        # Free the cached encoder outputs.
        for req_id, input_id in scheduler_output.free_encoder_input_ids:
            encoder_outputs = self.encoder_cache.get(req_id)
            if encoder_outputs is not None:
                encoder_outputs.pop(input_id, None)
                if not encoder_outputs:
                    self.encoder_cache.pop(req_id, None)

        # Remove the unscheduled requests from the persistent batch.
        # NOTE(woosuk): The unscheduled requests are either preempted requests
        # or running requests that are not scheduled in this step. We remove
        # them from the persistent batch but keep their cached states since
        # they will be scheduled again sometime in the future.
        scheduled_req_ids = scheduler_output.num_scheduled_tokens.keys()
>       cached_req_ids = self.input_batch.req_id_to_index.keys()
E       AttributeError: 'TPUModelRunner' object has no attribute 'input_batch'

vllm/v1/worker/tpu_model_runner.py:327: AttributeError
_____________________ test_update_states_request_finished ______________________

model_runner = <vllm.v1.worker.tpu_model_runner.TPUModelRunner object at 0x7bd274d48520>

    def test_update_states_request_finished(model_runner):
        req_id = "req_0"

        # new req
        scheduler_output = _schedule_new_request(req_id)

>       model_runner._update_states(scheduler_output)

tests/v1/tpu/worker/test_tpu_model_runner.py:144:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <vllm.v1.worker.tpu_model_runner.TPUModelRunner object at 0x7bd274d48520>
scheduler_output = SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=req_0,prompt_token_ids=[1, 2, 3],mm_inputs=[],mm_hashes=[],m...s=set(), free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=None, kv_connector_metadata=None)

    def _update_states(self, scheduler_output: "SchedulerOutput") -> bool:
        """Update the cached states and the persistent batch with the scheduler
        output.

        The updated states are used by the `_prepare_inputs` function to create
        the input GPU tensors for the model.

        Returns:
            True if there is a new/resumed/paused/finished request.
            If False, we can skip copying SamplingMetadata to the GPU.
        """
        # Remove finished requests from the cached states.
        for req_id in scheduler_output.finished_req_ids:
            self.requests.pop(req_id, None)
            self.encoder_cache.pop(req_id, None)

        # Remove the finished requests from the persistent batch.
        # NOTE(woosuk): There could be an edge case where finished_req_ids and
        # scheduled_req_ids overlap. This happens when a request is aborted and
        # then resubmitted with the same ID. In this case, we treat them as two
        # distinct requests - clearing the cached states for the first request
        # and handling the second as a new request.
        removed_req_indices: list[int] = []
        for req_id in scheduler_output.finished_req_ids:
            req_index = self.input_batch.remove_request(req_id)
            if req_index is not None:
                removed_req_indices.append(req_index)

        # Free the cached encoder outputs.
        for req_id, input_id in scheduler_output.free_encoder_input_ids:
            encoder_outputs = self.encoder_cache.get(req_id)
            if encoder_outputs is not None:
                encoder_outputs.pop(input_id, None)
                if not encoder_outputs:
                    self.encoder_cache.pop(req_id, None)

        # Remove the unscheduled requests from the persistent batch.
        # NOTE(woosuk): The unscheduled requests are either preempted requests
        # or running requests that are not scheduled in this step. We remove
        # them from the persistent batch but keep their cached states since
        # they will be scheduled again sometime in the future.
        scheduled_req_ids = scheduler_output.num_scheduled_tokens.keys()
>       cached_req_ids = self.input_batch.req_id_to_index.keys()
E       AttributeError: 'TPUModelRunner' object has no attribute 'input_batch'

vllm/v1/worker/tpu_model_runner.py:327: AttributeError
______________________ test_update_states_request_resumed ______________________

model_runner = <vllm.v1.worker.tpu_model_runner.TPUModelRunner object at 0x7bd274e92500>

    def test_update_states_request_resumed(model_runner):
        req_id = "req_0"

        # new req
        scheduler_output = _schedule_new_request(req_id)

>       model_runner._update_states(scheduler_output)

tests/v1/tpu/worker/test_tpu_model_runner.py:174:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <vllm.v1.worker.tpu_model_runner.TPUModelRunner object at 0x7bd274e92500>
scheduler_output = SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=req_0,prompt_token_ids=[1, 2, 3],mm_inputs=[],mm_hashes=[],m...s=set(), free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=None, kv_connector_metadata=None)

    def _update_states(self, scheduler_output: "SchedulerOutput") -> bool:
        """Update the cached states and the persistent batch with the scheduler
        output.

        The updated states are used by the `_prepare_inputs` function to create
        the input GPU tensors for the model.

        Returns:
            True if there is a new/resumed/paused/finished request.
            If False, we can skip copying SamplingMetadata to the GPU.
        """
        # Remove finished requests from the cached states.
        for req_id in scheduler_output.finished_req_ids:
            self.requests.pop(req_id, None)
            self.encoder_cache.pop(req_id, None)

        # Remove the finished requests from the persistent batch.
        # NOTE(woosuk): There could be an edge case where finished_req_ids and
        # scheduled_req_ids overlap. This happens when a request is aborted and
        # then resubmitted with the same ID. In this case, we treat them as two
        # distinct requests - clearing the cached states for the first request
        # and handling the second as a new request.
        removed_req_indices: list[int] = []
        for req_id in scheduler_output.finished_req_ids:
            req_index = self.input_batch.remove_request(req_id)
            if req_index is not None:
                removed_req_indices.append(req_index)

        # Free the cached encoder outputs.
        for req_id, input_id in scheduler_output.free_encoder_input_ids:
            encoder_outputs = self.encoder_cache.get(req_id)
            if encoder_outputs is not None:
                encoder_outputs.pop(input_id, None)
                if not encoder_outputs:
                    self.encoder_cache.pop(req_id, None)

        # Remove the unscheduled requests from the persistent batch.
        # NOTE(woosuk): The unscheduled requests are either preempted requests
        # or running requests that are not scheduled in this step. We remove
        # them from the persistent batch but keep their cached states since
        # they will be scheduled again sometime in the future.
        scheduled_req_ids = scheduler_output.num_scheduled_tokens.keys()
>       cached_req_ids = self.input_batch.req_id_to_index.keys()
E       AttributeError: 'TPUModelRunner' object has no attribute 'input_batch'

vllm/v1/worker/tpu_model_runner.py:327: AttributeError

The code-level failures could be addressed in follow-up PRs.


@CAROLZXYZXY CAROLZXYZXY force-pushed the cazheng/fix-tpu-ci branch from 1f42686 to 6f8edc8 Compare May 18, 2025 18:19
@yaochengji
Collaborator

For the error

AttributeError: 'TPUModelRunner' object has no attribute 'input_batch'

The TPU model runner in v0 doesn't have input_batch. Did you happen to use v0 instead of v1?
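
(If that is the cause, one possible fix is to pin these tests to the V1 engine explicitly, e.g. with an autouse fixture in the tests' conftest.py. This is only a sketch and assumes VLLM_USE_V1 is the environment switch that selects the V1 engine.)

# conftest.py sketch: force the V1 engine for the TPU worker tests.
import pytest

@pytest.fixture(autouse=True)
def use_v1_engine(monkeypatch):
    # Ensure the V1 TPUModelRunner (which defines `input_batch`) is exercised.
    monkeypatch.setenv("VLLM_USE_V1", "1")
    yield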

@CAROLZXYZXY CAROLZXYZXY force-pushed the cazheng/fix-tpu-ci branch 5 times, most recently from afa4372 to 4dfd0a7 Compare May 23, 2025 00:45
@CAROLZXYZXY CAROLZXYZXY force-pushed the cazheng/fix-tpu-ci branch 3 times, most recently from a5ee5ff to d66369f Compare May 27, 2025 17:51
Signed-off-by: Carol Zheng <[email protected]>
@CAROLZXYZXY CAROLZXYZXY force-pushed the cazheng/fix-tpu-ci branch from ba6479d to bc9fc3c Compare May 27, 2025 18:13
@yaochengji
Copy link
Collaborator

There are 12 tests in total.

For the 11th test, it printed "# Received cancellation signal, interrupting", and the 12th test didn't run. Is that intended?

Collaborator

@yaochengji yaochengji left a comment


LGTM, thanks!

The 12th test didn't run because of the 3-hour timeout (BUILDKITE_TIMEOUT="180").
We can separate these tests in a follow-up PR.
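
(Until the tests are split, one possible stopgap, sketched below with an assumed 20-minute budget per file, is to give each test file its own timeout so a single slow file cannot consume the whole 180-minute job.)

# Illustrative per-file timeout wrapper; the 20-minute budget is an assumption.
import subprocess
import sys

PER_FILE_TIMEOUT_S = 20 * 60  # assumed budget per test file

def run_file(test_file: str) -> int:
    try:
        return subprocess.run(
            ["pytest", "-s", "-v", test_file],
            timeout=PER_FILE_TIMEOUT_S,
        ).returncode
    except subprocess.TimeoutExpired:
        print(f"TIMEOUT after {PER_FILE_TIMEOUT_S}s: {test_file}")
        return 1

if __name__ == "__main__":
    # Exit non-zero if any of the given test files failed or timed out.
    sys.exit(max((run_file(f) for f in sys.argv[1:]), default=0))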

@yaochengji yaochengji merged commit b48d5cc into vllm-project:main May 27, 2025
43 checks passed
amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025
