You are seeing video frames from an egocentric view of a person. Pretend that you are the person. Your task is to describe what action you are performing.
- To assist you for how to describe the action, the video's start time is {start_second} and the end time is {end_second:.3f} and the duration is {duration:.3f} seconds.
- You were given multiple choice options {option_text}. Pick the correct one and put that into the answer. Note in the answer do not include the option letter, just the name of the action.
  system_prompt += f"""To further assist you, we mark hands and object when they are visible. The left hand is marked with a bounding box that contains letter L and the right hand's bounding box contains letter R. The object is marked as 'O'."""

- system_prompt += f"""Before giving the answer, explain why the correct answer is correct and why the other options are incorrect. You must pay attention to the hands and objects to support your reasoning when they are present."""
We are thinking about tweaking the prompt based on the action representation.
  """
  if perspective == "first_person":
-     perspective_prefix = "You are seeing this video from egocentric view and your hands are sometimes interacting with obects. What action are you performing? "
+     perspective_prefix = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with obects. What action are you performing? "
  elif perspective == "third_person":
      perspective_prefix = "The video is taken from egocentric view. What action is the person performing? "
  if question_type.startswith("mc_"):
      action_rep_suffix = "Given multiple choices, format your answer briefly such as 'A. move knife'. "
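The branch logic in this hunk can be read as a small, self-contained function. The following is only a sketch under stated assumptions: the function name `build_question`, the empty default for `action_rep_suffix`, the `ValueError` on an unknown perspective, and the way the two pieces are concatenated are all hypothetical and not shown in the commit. The sketch also spells "objects" correctly, whereas the committed string reads "obects".

```python
def build_question(perspective: str, question_type: str) -> str:
    """Hypothetical helper: join a perspective prefix with an answer-format suffix."""
    if perspective == "first_person":
        # Wording follows the "+" line of the diff (typo corrected here).
        perspective_prefix = (
            "You are seeing this video from egocentric view and you are the person. "
            "Your hands are sometimes interacting with objects. "
            "What action are you performing? "
        )
    elif perspective == "third_person":
        perspective_prefix = (
            "The video is taken from egocentric view. "
            "What action is the person performing? "
        )
    else:
        # Assumption: the commit does not show how other values are handled.
        raise ValueError(f"unknown perspective: {perspective!r}")

    # Assumption: no suffix is added for non multiple-choice question types.
    action_rep_suffix = ""
    if question_type.startswith("mc_"):
        action_rep_suffix = (
            "Given multiple choices, format your answer briefly such as 'A. move knife'. "
        )

    return perspective_prefix + action_rep_suffix
```

Calling `build_question("first_person", "mc_example")` would return the first-person prefix followed by the multiple-choice formatting hint; `"mc_example"` is a made-up question type used only to trigger the `startswith("mc_")` branch.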
- prefix = f"You are seeing a video taken from egocentric view. The video lasts for {video_duration:.3f} seconds, and {n_frames} frames are uniformly sampled from it."
+ prefix = f"The provided video lasts for {video_duration:.3f} seconds, and {n_frames} frames are uniformly sampled from it."
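To make the `:.3f` format spec in the new `prefix` concrete, here is a minimal sketch that wraps the same f-string in a function. The wrapper name `format_video_prefix` and its signature are hypothetical; only the string template comes from the diff.

```python
def format_video_prefix(video_duration: float, n_frames: int) -> str:
    """Hypothetical wrapper around the f-string on the "+" line of the diff."""
    # ":.3f" renders the duration as fixed-point with exactly three decimals.
    return (
        f"The provided video lasts for {video_duration:.3f} seconds, "
        f"and {n_frames} frames are uniformly sampled from it."
    )


print(format_video_prefix(1.5, 8))
```

With these inputs the duration is rendered as `1.500`, so the sentence stays the same length in its numeric part regardless of how many decimals the raw float carries.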