-prompt = f"""Your job is to create 3 question and answer pairs based on the text below.
-{reason_mc_string}
-Example questions you can ask include. Note you are not limited to these questions:
-What object the person is interacting with?
-What objects are visible in the video?
-What is the sequence of the atomic actions that the person is performing?
-Make sure your only ask questions that can be answered with enough grounding in the text.
+prompt = f"""Your job is to create 3 question-answer pairs based on the text below. The text contains a first-person narrative of video frames from an egocentric perspective of a person interacting with objects in a kitchen.
+{reason_mc_string}
+You can ask questions such as:
+What object am I interacting with?
+What objects are visible in the video?
+What is the sequence of the atomic actions I am performing?
+
+Make sure your questions can be answered based on the information provided in the text. Do not ask questions that require additional context or information beyond what is given.
You are seeing video frames from an egocentric view of a person.
-Please talk as if you are the person in the video and describe what action you are performing.
-To assist you for how to describe the action, the video's start time is {start_second} and the end time is {end_second} and the duration is {end_second-start_second} seconds.
-
-To further assist you for how to describe the action, note that in a multi-choice video question answering, you were given following options {option_text} and the correct answer is {gt_answer}.
-In addition to describe what you see, describe why wrong answers were wrong and why right answer was right.
-When you explain why wrong answers were wrong and why right answer was right, you should use the following flow of reasoning:
-
-The flow of reasoning:
-1. What objects need to be visible to support the answer?
-2. Whether the duration in time supports that answer?
-
-Based on the answers above, why right answer is right and why wrong answers were wrong."""
You are seeing video frames from an egocentric view of a person. The person is interacting with objects in a kitchen.
-Describe the action the person is performing. Pay attention to the objects the person's hands are interacting.
-Explain in details what are the supporting evidences for the action. Useful evidences include the duration of the video, the objects the person is interacting with, and the context of the video.
You are a helpful AI assistant, and you will assist in creating question-answer pairs.
+I will provide you with the state of the left hand and the right hand, as well as the ground-truth narration.
+For the hand states:
+- -1 denotes the hand is not visible
+- 0 denotes the hand is visible but not interacting with objects
+- 1 denotes the hand is interacting with another hand
+- 3 denotes the hand is interacting with a portable object
+- 4 denotes the hand is interacting with a stationary object
+
+The state for the left hand is {left_hand_state}, and the state for the right hand is {right_hand_state}.
+Using this information, create 3 question-answer pairs. Pretend you are seeing an image from a first-person perspective and can see your hands and the objects you are interacting with.
+Do not ask questions about the action, as you are viewing an image and not a video.
+Do not describe what the object is, only mention whether it's portable or stationary.
+Ask and answer the questions in the first-person perspective.
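
The hand-state codes listed in the added prompt can be made explicit with a small lookup. The sketch below is illustrative only: `HAND_STATE_DESCRIPTIONS` and `describe_hand_states` are hypothetical names that do not appear in the commit. It covers only the five codes the prompt lists; code 2 is not documented there, so it is rejected rather than guessed.

```python
# Hypothetical helper, not part of the commit: maps the hand-state codes
# listed in the prompt to the descriptions given there.
HAND_STATE_DESCRIPTIONS = {
    -1: "the hand is not visible",
    0: "the hand is visible but not interacting with objects",
    1: "the hand is interacting with another hand",
    3: "the hand is interacting with a portable object",
    4: "the hand is interacting with a stationary object",
}


def describe_hand_states(left_hand_state: int, right_hand_state: int) -> str:
    """Return a readable summary of both hand states.

    Raises ValueError for any code the prompt does not define.
    """
    parts = []
    for side, state in (("left", left_hand_state), ("right", right_hand_state)):
        if state not in HAND_STATE_DESCRIPTIONS:
            raise ValueError(f"unknown {side} hand state: {state}")
        parts.append(f"{side} hand: {HAND_STATE_DESCRIPTIONS[state]}")
    return "; ".join(parts)
```

Validating the codes before they are interpolated into `{left_hand_state}` and `{right_hand_state}` would keep an undefined value from reaching the model silently.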
Your reasoning steps should include supporting evidence for the action, such as the duration of the video, the sequence of actions the person performs, the objects they interact with, and the overall context of the video.
 As a general guideline, for videos longer than 3 seconds, provide detailed reasoning steps, and for videos shorter than 3 seconds, generate less detailed reasoning.
 The video duration is {end_second-start_second:.3f} seconds.
+Make sure you use the first-person perspective in your reasoning.
 """
 print (prompt)
 return prompt
@@ -107,6 +94,17 @@ class GT_Augmentation_Response(BaseModel):
     disagree_with_human_annotation: bool


+class GPTHandObjectResponse(BaseModel):
+    """
+    The response for the GPTHandObjectPrompt
+    """
+    first_question: str
+    first_answer: str
+    second_question: str
+    second_answer: str
+    third_question: str
+    third_answer: str
+
 class ExpandReasonMCResponse(BaseModel):
     """
     The response for the ExpandReasonMCPrompt
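
Code that consumes the `GPTHandObjectResponse` model added in this hunk will probably want its six flat fields as three pairs. The following is a minimal sketch under stated assumptions: a standard-library dataclass stands in for the pydantic model so the example runs without dependencies, and `to_qa_pairs` is a hypothetical helper that is not part of the commit.

```python
from dataclasses import dataclass


@dataclass
class HandObjectResponseStandIn:
    """Stand-in mirroring the fields of GPTHandObjectResponse.

    Assumption: the real class is a pydantic BaseModel with the same
    attribute names, as shown in the diff.
    """
    first_question: str
    first_answer: str
    second_question: str
    second_answer: str
    third_question: str
    third_answer: str


def to_qa_pairs(response) -> list[tuple[str, str]]:
    """Collect the flat fields into (question, answer) tuples, in order."""
    return [
        (
            getattr(response, f"{ordinal}_question"),
            getattr(response, f"{ordinal}_answer"),
        )
        for ordinal in ("first", "second", "third")
    ]
```

Because the helper relies only on attribute access, it should also accept an instance of the pydantic model itself.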
@@ -119,12 +117,14 @@ class ExpandReasonMCResponse(BaseModel):
 Task related prompt is impacted by the question_type.
 We currently support mc_{action_representation} and gpt-gt-reason
 We are thinking about tweaking the prompt based on the action representation.
 """
-
+if perspective == "first_person":
+    perspective_prefix = "You are seeing this video from egocentric view and your hands are sometimes interacting with obects. What action are you performing? "
+elif perspective == "third_person":
+    perspective_prefix = "The video is taken from egocentric view. What action is the person performing? "
 if question_type.startswith("mc_"):
-    action_rep_suffix = "Given multiple choices, format your answer briefly such as 'A. move knife'"
-    prefix = f"The video is taken from egocentric view. What action is the person performing? {action_rep_suffix}\n"
+    action_rep_suffix = "Given multiple choices, format your answer briefly such as 'A. move knife'. "
-    suffix = "Here are the options you are tasked:\n" + suffix
+    suffix = "Here are the options you are tasked:\n" + suffix
     ret = prefix + suffix
 elif question_type == "gpt-gt-reason":
-    ret = "The video is taken from egocentric view. What action is the person performing? Please explain your reasoning steps before reaching to your answer."
+    ret = f"{perspective_prefix}Please explain your reasoning steps before reaching to your answer."
 elif question_type == "gpt-gt-instruct-reason":
     ret = question
+elif question_type == "gpt-hand-object":
+    ret = question
 elif question_type == "cot_mc":
     """
     Explain the reasoning first and do the multiple-choice.
     """
-    action_rep_suffix = "Given multiple choices, explain your reasoning steps before you reach to your answer."
-    prefix = f"The video is taken from egocentric view. What action is the person performing?{action_rep_suffix}\n"
+    action_rep_suffix = "Given multiple choices, explain your reasoning steps before you reach to your answer."
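
In the lines shown, the added `if`/`elif` on `perspective` has no `else` branch, so any value other than the two handled would leave `perspective_prefix` unassigned unless it is set elsewhere, which the diff does not show. One defensive variant is sketched below with a dictionary lookup. `PERSPECTIVE_PREFIXES`, `get_perspective_prefix` and `build_reason_prompt` are hypothetical names; this is not what the commit does. The prompt text is copied from the diff, with the "obects" typo corrected.

```python
# Hypothetical helpers, not part of the commit.
PERSPECTIVE_PREFIXES = {
    "first_person": (
        "You are seeing this video from egocentric view and your hands are "
        "sometimes interacting with objects. What action are you performing? "
    ),
    "third_person": (
        "The video is taken from egocentric view. "
        "What action is the person performing? "
    ),
}


def get_perspective_prefix(perspective: str) -> str:
    """Return the prompt prefix for a perspective, failing loudly otherwise."""
    try:
        return PERSPECTIVE_PREFIXES[perspective]
    except KeyError:
        raise ValueError(f"unsupported perspective: {perspective!r}") from None


def build_reason_prompt(perspective: str) -> str:
    """Mirror the 'gpt-gt-reason' branch from the diff using the lookup."""
    return (
        f"{get_perspective_prefix(perspective)}"
        "Please explain your reasoning steps before reaching to your answer."
    )
```

Failing at lookup time gives a clearer error than a `NameError` raised later when the prefix is first used.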
-prefix = f"You are seeing a video taken from egocentric view. The video lasts for {video_duration:.2f} seconds, and {n_frames} frames are uniformly sampled from it."
+prefix = f"You are seeing a video taken from egocentric view. The video lasts for {video_duration:.3f} seconds, and {n_frames} frames are uniformly sampled from it."
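
The only change in this last pair of lines is the format specifier, `:.2f` to `:.3f`, which brings this prompt in line with the `{end_second-start_second:.3f}` used in the reasoning prompt. A small illustration of the difference follows, using made-up values for `video_duration` and `n_frames` and a hypothetical `duration_sentence` helper that is not in the commit.

```python
def duration_sentence(video_duration: float, n_frames: int, digits: int) -> str:
    """Format the duration clause with a configurable number of decimals."""
    return (
        f"The video lasts for {video_duration:.{digits}f} seconds, "
        f"and {n_frames} frames are uniformly sampled from it."
    )


# Made-up example values, chosen only to show the rounding difference.
old_style = duration_sentence(1.23456, 8, digits=2)
new_style = duration_sentence(1.23456, 8, digits=3)
```

With these values, `old_style` reports 1.23 seconds and `new_style` reports 1.235 seconds.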