-You are seeing video frames from an egocentric view of a person. The person is interacting with objects in a kitchen.
-Describe the action the person is performing but do not say you see the person as you can only see the person's hands.
-You can say something that the video is showing the egocentric view of person doing something.
-Pay attention to the objects the person's hands are interacting.
-The true ground-truth action is {gt_answer}. However, I want you to come to your ownconclusion from your own observation and show your reasoning steps. Make sure it matches the ground-truth action.
-Your reasoning steps should include supporting evidences for the action. Useful evidences include the duration of the video, the objects the person is interacting with, and the context of the video.
+You are viewing video frames from an egocentric perspective of a person interacting with objects in a kitchen. Describe the video frames in detail and reason about the actions the person is performing. You will be provided with the human-annotated ground-truth for the action, but you should independently come to your own conclusion.
+If you disagree with the human annotation, indicate "true" in the "disagree_with_human_annotation" field of your response, and provide your reasoning without mentioning the ground-truth answer. This will keep your reasoning clean. If you agree with the human annotation, indicate "false" in the "disagree_with_human_annotation" field and provide your reasoning without referencing the ground-truth to maintain a clean description.
+Pay close attention to the objects the person's hands are interacting with.
+The true ground-truth action is {gt_answer}.
+Your reasoning steps should include supporting evidence for the action, such as the duration of the video, the sequence of actions the person performs, the objects they interact with, and the overall context of the video.
+As a general guideline, for videos longer than 3 seconds, provide detailed reasoning steps, and for videos shorter than 3 seconds, generate less detailed reasoning.
 The video duration is {end_second-start_second:.3f} seconds.
 """
     print (prompt)
     return prompt
 
-
-class GT_Agnostic_Response(BaseModel):
-    """
-    The GT was not known. The response is to generate a new answer
-    """
-    explanation: str
-    answer: str
-
 class GT_Augmentation_Response(BaseModel):
     """
     The GT was known. The response is to add more information to the GT
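The updated prompt asks the model to fill a "disagree_with_human_annotation" field and to scale its reasoning around a 3-second threshold. The following is a minimal sketch of how a caller might consume such a reply. It is not the repository's code: it swaps a stdlib dataclass in for pydantic's BaseModel so it runs without dependencies, and the class and function names (AnnotationReviewResponse, parse_response, format_duration, reasoning_detail) are hypothetical. The explanation/answer fields are an assumption borrowed from the removed GT_Agnostic_Response, since the diff is cut off before the fields of GT_Augmentation_Response.

```python
import json
from dataclasses import dataclass


@dataclass
class AnnotationReviewResponse:
    """Hypothetical response schema (stdlib stand-in for a pydantic BaseModel).

    `disagree_with_human_annotation` is the field named in the prompt;
    `explanation` and `answer` are assumed, borrowed from the removed
    GT_Agnostic_Response.
    """
    explanation: str
    answer: str
    disagree_with_human_annotation: bool


def parse_response(raw: str) -> AnnotationReviewResponse:
    """Parse a JSON reply, accepting the flag as a JSON bool or as "true"/"false"."""
    data = json.loads(raw)
    flag = data["disagree_with_human_annotation"]
    if isinstance(flag, str):
        # The prompt says to 'indicate "true"', so a model may return a string.
        normalized = flag.strip().lower()
        if normalized not in ("true", "false"):
            raise ValueError(f"unexpected flag value: {flag!r}")
        flag = normalized == "true"
    elif not isinstance(flag, bool):
        raise TypeError(f"flag must be a bool or string, got {type(flag).__name__}")
    return AnnotationReviewResponse(
        explanation=str(data["explanation"]),
        answer=str(data["answer"]),
        disagree_with_human_annotation=flag,
    )


def format_duration(start_second: float, end_second: float) -> str:
    """Same formatting as the prompt's {end_second-start_second:.3f}."""
    return f"{end_second - start_second:.3f}"


def reasoning_detail(duration_s: float, threshold_s: float = 3.0) -> str:
    """Prompt guideline: detailed reasoning above 3 s, brief below.

    The prompt leaves exactly 3 s unspecified; this sketch treats it as brief.
    """
    return "detailed" if duration_s > threshold_s else "brief"
```

Normalizing the string form is a defensive choice: the prompt phrases the flag as the word "true" or "false", which a model may emit either as a JSON boolean or as a quoted string.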