Docstrings about dimensions are unclear. #810

The docstring of the method `SamPredictor.set_image` describes the input dimensions as follows:

```
Arguments:
          image (np.ndarray): The image for calculating masks. Expects an
            image in HWC uint8 format, with pixel values in [0, 255].
```

The output of the method `SamPredictor.predict_torch`, on the other hand, is described as follows:

```
Returns:
          (torch.Tensor): The output masks in BxCxHxW format, where C is the
            number of masks, and (H, W) is the original image size.
```

While running the code from the example notebook `predictor_example.ipynb`, I noticed that the returned shape does not match this description.

Here is an MWE to reproduce the issue:

```python
import sys

import cv2
import numpy as np
import torch

sys.path.append("..")
from segment_anything import SamPredictor, sam_model_registry

image = cv2.imread("images/truck.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

sam_checkpoint = "sam_vit_h_4b8939.pth"
model_type = "vit_h"

device = "cuda"

sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)

predictor = SamPredictor(sam)

predictor.set_image(image)

# Example with predict
input_point = np.array([[500, 375]])
input_label = np.array([1])

masks_predict, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,
)

# Example with predict_torch
input_boxes = torch.tensor(
    [
        [75, 275, 1725, 850],
        [425, 600, 700, 875],
        [1375, 550, 1650, 800],
        [1240, 675, 1400, 750],
    ],
    device=predictor.device,
)

transformed_boxes = predictor.transform.apply_boxes_torch(input_boxes, image.shape[:2])
masks_predict_torch, _, _ = predictor.predict_torch(
    point_coords=None,
    point_labels=None,
    boxes=transformed_boxes,
    multimask_output=False,
)

print("Image Shape: ", image.shape)  # expected: (1200, 1800, 3); returned: (1200, 1800, 3)
print("Masks Predict Shape: ", masks_predict.shape)  # expected: (3, 1200, 1800); returned: (3, 1200, 1800)
print("Masks Predict Torch Shape: ", masks_predict_torch.shape)  # expected: (1, 4, 1200, 1800); returned: (4, 1, 1200, 1800)
```

From the output, it looks as if the number of masks has moved to the batch position.
On reflection, I think the returned shape is actually correct, since `B` (batch) probably refers to the number of bounding boxes passed in.
If that is the case, however, the docstring is not very clear about it. At the very least, the meaning of the `B` dimension should be stated explicitly.
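
To make the interpretation I am suggesting concrete, here is a minimal sketch in plain Python. It is not part of `segment_anything`: `expected_mask_shape` is a hypothetical helper, and the mask counts (3 with `multimask_output=True`, 1 with `multimask_output=False`) are assumptions taken from the shapes printed by the MWE above. It reads `B` as the number of prompts (here: boxes) and `C` as the number of masks per prompt.

```python
from typing import Tuple

# Assumption: these counts mirror the shapes printed by the MWE,
# they are not read from the segment_anything source.
MASKS_PER_PROMPT_MULTI = 3   # multimask_output=True
MASKS_PER_PROMPT_SINGLE = 1  # multimask_output=False


def expected_mask_shape(
    num_prompts: int,
    multimask_output: bool,
    original_size: Tuple[int, int],
) -> Tuple[int, int, int, int]:
    """Hypothetical helper: shape of the masks under the B = prompts reading.

    Returns (B, C, H, W), where B is the number of prompts (e.g. boxes),
    C is the number of masks per prompt, and (H, W) is the original
    image size.
    """
    if num_prompts < 1:
        raise ValueError("num_prompts must be at least 1")
    height, width = original_size
    masks_per_prompt = (
        MASKS_PER_PROMPT_MULTI if multimask_output else MASKS_PER_PROMPT_SINGLE
    )
    return (num_prompts, masks_per_prompt, height, width)


# Four boxes, one mask each, on the 1200 x 1800 truck image from the MWE.
print(expected_mask_shape(4, False, (1200, 1800)))  # (4, 1, 1200, 1800)
```

Under this reading the helper reproduces the `(4, 1, 1200, 1800)` shape returned by `predict_torch` in the MWE, which is why I think the behaviour is fine and only the docstring needs to say what `B` stands for.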
