
Commit 04e9427

Shaokai/add colab (#22)
* removed the project.urls to prevent build failure
* Added colab notebook
* w/button
* Update llavaction_video_demo.ipynb

---------

Co-authored-by: Mackenzie Mathis <[email protected]>
1 parent 438256a commit 04e9427

File tree

1 file changed: +339 -0 lines changed

Lines changed: 339 additions & 0 deletions
@@ -0,0 +1,339 @@
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "machine_shape": "hm",
      "gpuType": "A100",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/AdaptiveMotorControlLab/LLaVAction/blob/release_iccv/llavaction_video_demo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# LLaVAction: Evaluating and Training Multi-Modal Large Language Models for Action Recognition\n",
        "\n",
        "- This repository contains the implementation for our ICCV 2025 submission on evaluating and training multi-modal large language models for action recognition.\n",
        "\n"
      ],
      "metadata": {
        "id": "moPRHYOWkOKg"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Please download the shared folder https://drive.google.com/drive/folders/1ql8MSWTK-2_uGH1EzPOrifauwUNg4E6i?usp=sharing to your Google Drive and name it llavaction_demo_data.**"
      ],
      "metadata": {
        "id": "wwfcD1VYBvzU"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "WNc4iT0Rj87z"
      },
      "outputs": [],
      "source": [
        "from google.colab import drive\n",
        "drive.mount('/content/drive')"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Install flash attention, which is important for fast inference.\n",
        "\n"
      ],
      "metadata": {
        "id": "rot5HYWoHoBl"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install ninja\n",
        "!pip install flash-attn --no-build-isolation"
      ],
      "metadata": {
        "id": "zMgB6_Kkv26W"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Create a folder for caching the library files."
      ],
      "metadata": {
        "id": "gNA2nB1xHwfX"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!mkdir -p /content/drive/MyDrive/python_packages"
      ],
      "metadata": {
        "id": "KgFjdpidpi9J"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Install LLaVAction from GitHub."
      ],
      "metadata": {
        "id": "K_sfl08eH2Nr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from getpass import getpass\n",
        "\n",
        "GITHUB_TOKEN = getpass(\"Enter your GitHub token: \")  # Hidden input\n",
        "REPO_URL = f\"https://{GITHUB_TOKEN}@github.com/AdaptiveMotorControlLab/LLaVAction.git@release_iccv\"\n",
        "\n",
        "!pip install git+{REPO_URL}\n"
      ],
      "metadata": {
        "id": "mCZ6EN8yQXID"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Install decord for efficient video reading."
      ],
      "metadata": {
        "id": "3PN6JYRtH3mU"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install decord\n"
      ],
      "metadata": {
        "id": "Aohr8FcXpquN"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Add the cached package folder to the system path."
      ],
      "metadata": {
        "id": "Y1198wQPIHJO"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import sys\n",
        "sys.path.append('/content/drive/MyDrive/python_packages')"
      ],
      "metadata": {
        "id": "n4l4xSdipu0w"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Import the inference and visualization functions from LLaVAction."
      ],
      "metadata": {
        "id": "e6SsiJzDIJt8"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from llavaction.action.selective_inference import SelectiveInferencer\n",
        "from llavaction.action.make_visualizations import visualize_with_uid\n",
        "import os"
      ],
      "metadata": {
        "id": "tK7pnU99qJzI"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Specify the locations of the EPIC-KITCHENS-100 videos and the LLaVAction checkpoint used for inference.\n",
        "You can increase n_frames for better performance (we observe this empirically), at the cost of more compute.\n"
      ],
      "metadata": {
        "id": "wvopr62aIM94"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "data_root = '/content/drive/MyDrive/llavaction_demo_data/EK100_512/EK100'\n",
        "checkpoint_folder = '/content/drive/MyDrive/llavaction_demo_data/checkpoint/dev_ov_0.5b_16f_top5_full'\n",
        "inferencer = SelectiveInferencer(data_root,\n",
        "                                 checkpoint_folder,\n",
        "                                 include_time_instruction = False,\n",
        "                                 n_frames = 16,\n",
        "                                 use_flash_attention = True)"
      ],
      "metadata": {
        "id": "HntA8BHGqRb2"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Define a helper for the 'caption' inference mode."
      ],
      "metadata": {
        "id": "C0BPzu3PIRIP"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def get_caption(inferencer,\n",
        "                uid,\n",
        "                checkpoint_folder):\n",
        "    caption = inferencer.inference('',\n",
        "                                   uid,\n",
        "                                   'caption')\n",
        "    return caption"
      ],
      "metadata": {
        "id": "qp0xYBYZvFgs"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Define the video id and the timestamps within that video for visual inspection.\n",
        "Note that P01-P01_01 is the video id, and 3.00_4.00 denotes the start and end times in seconds, respectively."
      ],
      "metadata": {
        "id": "OW0bOMPrITT3"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "uid = 'P01-P01_01_3.00_4.00'\n",
        "\n",
        "visualize_with_uid(data_root, uid, 'vis_folder')\n",
        "\n",
        "import IPython.display as display\n",
        "from PIL import Image\n",
        "import os\n",
        "import matplotlib.pyplot as plt\n",
        "import cv2\n",
        "\n",
        "folder_path = f\"vis_folder/{uid}\"  # Change this to your actual folder\n",
        "\n",
        "# List all image files\n",
        "image_files = sorted([f for f in os.listdir(folder_path) if f.endswith((\".jpg\", \".png\", \".jpeg\"))])\n",
        "\n",
        "# Set grid dimensions\n",
        "cols = 4  # Adjust this for the number of images per row\n",
        "rows = (len(image_files) + cols - 1) // cols  # Calculate the required number of rows\n",
        "\n",
        "# Create a figure with subplots\n",
        "fig, axes = plt.subplots(rows, cols, figsize=(12, 3 * rows))  # Adjust figure size\n",
        "plt.subplots_adjust(wspace=0.05, hspace=0.05)  # Reduce horizontal & vertical spacing\n",
        "\n",
        "# Loop through images and display them in the grid\n",
        "for ax, img_file in zip(axes.flatten(), image_files):\n",
        "    img_path = os.path.join(folder_path, img_file)\n",
        "    img = cv2.imread(img_path)\n",
        "    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # Convert BGR to RGB for proper display\n",
        "    ax.imshow(img)\n",
        "    ax.set_title(img_file, fontsize=8)  # Display filename in smaller font\n",
        "    ax.axis(\"off\")  # Hide axis labels\n",
        "\n",
        "# Hide unused subplots (if any)\n",
        "for ax in axes.flatten()[len(image_files):]:\n",
        "    ax.axis(\"off\")\n",
        "\n",
        "plt.show()\n"
      ],
      "metadata": {
        "id": "HNXNoBmmyeCL"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Run caption inference using LLaVAction on the video (with the specified timestamps)."
      ],
      "metadata": {
        "id": "w5IQMfSYIXHp"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "caption = get_caption(inferencer, uid, checkpoint_folder)\n",
        "caption"
      ],
      "metadata": {
        "id": "8LhrLRGk8jTo"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "d_NY6yJ67eAl"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}
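
For quick reference, the added notebook boils down to the inference flow below. This is a minimal sketch, assuming the same environment the notebook sets up (packages installed, Google Drive mounted at /content/drive, demo data in llavaction_demo_data) and using only the SelectiveInferencer and visualize_with_uid APIs as they appear in the diff above; it is not part of the commit.

# Minimal sketch of the demo flow; paths follow the notebook's demo-data layout, adjust to your own Drive.
from llavaction.action.selective_inference import SelectiveInferencer
from llavaction.action.make_visualizations import visualize_with_uid

data_root = '/content/drive/MyDrive/llavaction_demo_data/EK100_512/EK100'
checkpoint_folder = '/content/drive/MyDrive/llavaction_demo_data/checkpoint/dev_ov_0.5b_16f_top5_full'

# uid encodes the video id and the clip boundaries: here, seconds 3.00-4.00 of video P01_01.
uid = 'P01-P01_01_3.00_4.00'

inferencer = SelectiveInferencer(data_root,
                                 checkpoint_folder,
                                 include_time_instruction=False,
                                 n_frames=16,               # more frames tends to help, at higher compute cost
                                 use_flash_attention=True)

visualize_with_uid(data_root, uid, 'vis_folder')    # dump the sampled frames for visual inspection
caption = inferencer.inference('', uid, 'caption')  # 'caption' mode returns a free-form description
print(caption)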

0 commit comments
