The docstring of the method SamPredictor.set_image
describes the input dimensions as follows:
Arguments:
image (np.ndarray): The image for calculating masks. Expects an
image in HWC uint8 format, with pixel values in [0, 255].
The output of the method SamPredictor.predict_torch,
in turn, is described as follows:
Returns:
(torch.Tensor): The output masks in BxCxHxW format, where C is the
number of masks, and (H, W) is the original image size.
While running the code from the example notebook predictor_example.ipynb,
I noticed that this is not the case.
Here is an MWE to reproduce the issue:
import sys
import cv2
import numpy as np
import torch
sys.path.append("..")
from segment_anything import SamPredictor, sam_model_registry
image = cv2.imread("images/truck.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
sam_checkpoint = "sam_vit_h_4b8939.pth"
model_type = "vit_h"
device = "cuda"
sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)
predictor = SamPredictor(sam)
predictor.set_image(image)
# Example with predict
input_point = np.array([[500, 375]])
input_label = np.array([1])
masks_predict, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,
)
# Example with predict_torch
input_boxes = torch.tensor(
    [
        [75, 275, 1725, 850],
        [425, 600, 700, 875],
        [1375, 550, 1650, 800],
        [1240, 675, 1400, 750],
    ],
    device=predictor.device,
)
transformed_boxes = predictor.transform.apply_boxes_torch(input_boxes, image.shape[:2])
masks_predict_torch, _, _ = predictor.predict_torch(
    point_coords=None,
    point_labels=None,
    boxes=transformed_boxes,
    multimask_output=False,
)
print("Image Shape: ", image.shape) # expected shape: (1200, 1800, 3); returned shape: (1200, 1800, 3)
print("Masks Predict Shape: ", masks_predict.shape) # expected shape: (3, 1200, 1800); returned shape: (3, 1200, 1800)
print("Masks Predict Torch Shape: ", masks_predict_torch.shape) # expected shape: (1, 4, 1200, 1800); returned shape: (4, 1, 1200, 1800)
From the output, it seems that the number of masks moved to the batch position.
On further thought, I conclude that this is probably correct, since the batch dimension B most likely refers to the number of bounding boxes (one entry per box prompt), while C is the number of masks returned per box.
If that is the case, however, the docstring is not very clear about it. At the very least, the B dimension should be explicitly described.
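To illustrate this reading, here is a minimal sketch of how the returned shape can be interpreted, assuming B indexes the box prompts and C the masks per prompt. It uses a NumPy stand-in array instead of the real model output so it runs without a checkpoint; the variable names are only illustrative.

```python
import numpy as np

# Stand-in for the tensor returned by predict_torch in the MWE above:
# 4 box prompts, multimask_output=False, image of size 1200 x 1800.
# (Assumption: B = number of box prompts, C = masks per prompt.)
num_boxes, masks_per_box, height, width = 4, 1, 1200, 1800
masks = np.zeros((num_boxes, masks_per_box, height, width), dtype=bool)

# Axis 0 (B) selects the prompt, axis 1 (C) selects the mask for that prompt.
mask_for_second_box = masks[1, 0]  # shape (H, W)

# Drop the singleton C axis to get exactly one mask per box.
# For a torch tensor the equivalent call would be masks.squeeze(1).
per_box_masks = masks.squeeze(axis=1)  # shape (B, H, W)

# Layout I originally expected from the docstring: masks along the second axis.
# For a torch tensor the equivalent call would be masks.permute(1, 0, 2, 3).
swapped = np.transpose(masks, (1, 0, 2, 3))  # shape (1, B, H, W)

print(mask_for_second_box.shape, per_box_masks.shape, swapped.shape)
```

With this interpretation, the shape returned in the MWE matches the documented BxCxHxW format, and only the meaning of B is missing from the docstring.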