Docstrings about dimensions are unclear. #810

Open
@MaKaNu

The docstring of the method SamPredictor.set_image describes the input dimensions as follows:

Arguments:
          image (np.ndarray): The image for calculating masks. Expects an
            image in HWC uint8 format, with pixel values in [0, 255].

The output of the method SamPredictor.predict_torch, in turn, is described as follows:

Returns:
          (torch.Tensor): The output masks in BxCxHxW format, where C is the
            number of masks, and (H, W) is the original image size.

While running the code from the example notebook predictor_example.ipynb, I noticed that the shape returned by predict_torch does not match this description.

Here is an MWE to reproduce the issue:

import sys

import cv2
import numpy as np
import torch

sys.path.append("..")
from segment_anything import SamPredictor, sam_model_registry

image = cv2.imread("images/truck.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

sam_checkpoint = "sam_vit_h_4b8939.pth"
model_type = "vit_h"

device = "cuda"

sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)

predictor = SamPredictor(sam)

predictor.set_image(image)

# Example with predict
input_point = np.array([[500, 375]])
input_label = np.array([1])

masks_predict, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,
)

# Example with predict_torch
input_boxes = torch.tensor(
    [
        [75, 275, 1725, 850],
        [425, 600, 700, 875],
        [1375, 550, 1650, 800],
        [1240, 675, 1400, 750],
    ],
    device=predictor.device,
)

transformed_boxes = predictor.transform.apply_boxes_torch(input_boxes, image.shape[:2])
masks_predict_torch, _, _ = predictor.predict_torch(
    point_coords=None,
    point_labels=None,
    boxes=transformed_boxes,
    multimask_output=False,
)

print("Image Shape: ", image.shape)  # expected shape: (1200, 1800, 3); returned shape: (1200, 1800, 3)
print("Masks Predict Shape: ", masks_predict.shape)  # expected shape: (3, 1200, 1800); returned shape: (3, 1200, 1800)
print("Masks Predict Torch Shape: ", masks_predict_torch.shape)  # expected shape: (1, 4, 1200, 1800); returned shape: (4, 1, 1200, 1800)

From the output, it looks as if the number of masks has moved to the batch position.
On further thought, I conclude that this is probably correct, since "batch" most likely refers to the number of bounding boxes.
If that is the case, however, the docstring is not clear about it. At the very least, the B dimension should be described explicitly.
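To make that reading concrete, here is a minimal sketch that names the four axes of the shape returned in the MWE. It is plain Python with no model required. The helper describe_mask_shape and its key names are hypothetical and not part of segment_anything; it assumes the interpretation above, namely that B counts the box prompts and C counts the masks returned per prompt.

```python
from typing import Dict, Sequence


def describe_mask_shape(shape: Sequence[int]) -> Dict[str, int]:
    """Label the axes of a BxCxHxW mask shape.

    Assumption (not confirmed by the docstring): B is the number of
    prompts (here, boxes) and C is the number of masks per prompt.
    """
    if len(shape) != 4:
        raise ValueError(f"expected 4 dimensions (B, C, H, W), got {len(shape)}")
    b, c, h, w = shape
    return {
        "num_prompts": b,       # B: one entry per input box
        "masks_per_prompt": c,  # C: 1 when multimask_output=False
        "height": h,            # H: original image height
        "width": w,             # W: original image width
    }


# Shape observed in the MWE: 4 boxes, multimask_output=False, 1200x1800 image.
info = describe_mask_shape((4, 1, 1200, 1800))
print(info)
```

With the real tensor, masks_predict_torch.shape can be passed directly, since torch.Size behaves like a tuple. Under this reading, the docstring's phrase "C is the number of masks" refers to masks per prompt, which would explain why 4 appears in the first position.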
