Shaokai/add colab (#22)

yeshaokai · MMathisLab · web-flow · commit 04e942712db7 · 2025-03-07T13:42:18.000+01:00
* removed the project.urls to prevent build failure

* Added colab notebook

* w/button

* Update llavaction_video_demo.ipynb

---------

Co-authored-by: Mackenzie Mathis <mackenzie.mathis@epfl.ch>
diff --git a/example/llavaction_video_demo.ipynb b/example/llavaction_video_demo.ipynb
@@ -0,0 +1,339 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": [],
+      "machine_shape": "hm",
+      "gpuType": "A100",
+      "include_colab_link": true
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    },
+    "accelerator": "GPU"
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/AdaptiveMotorControlLab/LLaVAction/blob/release_iccv/llavaction_video_demo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# LLaVAction: Evaluating and Training Multi-Modal Large Language Models for Action Recognition\n",
+        "\n",
+        "- This repository contains the implementation for our ICCV 2025 submission on evaluating and training multi-modal large language models for action recognition.\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "moPRHYOWkOKg"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "**Please add the shared folder to your Google Drive and name it `llavaction_demo_data`:** https://drive.google.com/drive/folders/1ql8MSWTK-2_uGH1EzPOrifauwUNg4E6i?usp=sharing"
+      ],
+      "metadata": {
+        "id": "wwfcD1VYBvzU"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "WNc4iT0Rj87z"
+      },
+      "outputs": [],
+      "source": [
+        "from google.colab import drive\n",
+        "drive.mount('/content/drive')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Install FlashAttention (`flash-attn`), which is important for fast inference.\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "rot5HYWoHoBl"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install ninja\n",
+        "!pip install flash-attn --no-build-isolation"
+      ],
+      "metadata": {
+        "id": "zMgB6_Kkv26W"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Create a folder on Google Drive for caching the library files."
+      ],
+      "metadata": {
+        "id": "gNA2nB1xHwfX"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!mkdir -p /content/drive/MyDrive/python_packages"
+      ],
+      "metadata": {
+        "id": "KgFjdpidpi9J"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Install LLaVAction from GitHub (the `release_iccv` branch). You will be prompted for a GitHub token."
+      ],
+      "metadata": {
+        "id": "K_sfl08eH2Nr"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from getpass import getpass\n",
+        "\n",
+        "GITHUB_TOKEN = getpass(\"Enter your GitHub token: \")  # Hidden input\n",
+        "REPO_URL = f\"https://{GITHUB_TOKEN}@github.com/AdaptiveMotorControlLab/LLaVAction.git@release_iccv\"\n",
+        "\n",
+        "!pip install git+{REPO_URL}\n"
+      ],
+      "metadata": {
+        "id": "mCZ6EN8yQXID"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Install decord for efficient video reading"
+      ],
+      "metadata": {
+        "id": "3PN6JYRtH3mU"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install decord\n"
+      ],
+      "metadata": {
+        "id": "Aohr8FcXpquN"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Add the cached package folder to the Python path."
+      ],
+      "metadata": {
+        "id": "Y1198wQPIHJO"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import sys\n",
+        "sys.path.append('/content/drive/MyDrive/python_packages')"
+      ],
+      "metadata": {
+        "id": "n4l4xSdipu0w"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Import inference and visualization functions from LLaVAction"
+      ],
+      "metadata": {
+        "id": "e6SsiJzDIJt8"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from llavaction.action.selective_inference import SelectiveInferencer\n",
+        "from llavaction.action.make_visualizations import visualize_with_uid\n",
+        "import os"
+      ],
+      "metadata": {
+        "id": "tK7pnU99qJzI"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Specify where to load the EPIC-KITCHENS-100 videos and the LLaVAction checkpoint from.\n",
+        "Increasing `n_frames` tends to improve performance (we observed this empirically), at the cost of more compute.\n"
+      ],
+      "metadata": {
+        "id": "wvopr62aIM94"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "data_root = '/content/drive/MyDrive/llavaction_demo_data/EK100_512/EK100'\n",
+        "checkpoint_folder = '/content/drive/MyDrive/llavaction_demo_data/checkpoint/dev_ov_0.5b_16f_top5_full'\n",
+        "inferencer = SelectiveInferencer(data_root,\n",
+        "                                 checkpoint_folder,\n",
+        "                                 include_time_instruction=False,\n",
+        "                                 n_frames=16,\n",
+        "                                 use_flash_attention=True)"
+      ],
+      "metadata": {
+        "id": "HntA8BHGqRb2"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Define a helper that runs inference in 'caption' mode."
+      ],
+      "metadata": {
+        "id": "C0BPzu3PIRIP"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "def get_caption(inferencer,\n",
+        "                uid,\n",
+        "                checkpoint_folder):  # currently unused in this helper\n",
+        "    caption = inferencer.inference('',\n",
+        "                                   uid,\n",
+        "                                   'caption')\n",
+        "    return caption"
+      ],
+      "metadata": {
+        "id": "qp0xYBYZvFgs"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Define the video id and the timestamps within that video for visual inspection.\n",
+        "In the uid below, `P01-P01_01` is the video id, and `3.00_4.00` gives the start and end times in seconds."
+      ],
+      "metadata": {
+        "id": "OW0bOMPrITT3"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "uid = 'P01-P01_01_3.00_4.00'\n",
+        "\n",
+        "visualize_with_uid(data_root, uid, 'vis_folder')\n",
+        "\n",
+        "import IPython.display as display\n",
+        "from PIL import Image\n",
+        "import os\n",
+        "import matplotlib.pyplot as plt\n",
+        "import cv2\n",
+        "\n",
+        "folder_path = f\"vis_folder/{uid}\"  # Folder of frames for this uid; change it if you use a different output folder\n",
+        "\n",
+        "\n",
+        "# List all image files\n",
+        "image_files = sorted([f for f in os.listdir(folder_path) if f.endswith((\".jpg\", \".png\", \".jpeg\"))])\n",
+        "\n",
+        "# Set grid dimensions\n",
+        "cols = 4  # Adjust this for the number of images per row\n",
+        "rows = (len(image_files) + cols - 1) // cols  # Calculate the required number of rows\n",
+        "\n",
+        "# Create a figure with subplots\n",
+        "fig, axes = plt.subplots(rows, cols, figsize=(12, 3 * rows))  # Adjust figure size\n",
+        "plt.subplots_adjust(wspace=0.05, hspace=0.05)  # Reduce horizontal & vertical spacing\n",
+        "\n",
+        "# Loop through images and display them in the grid\n",
+        "for ax, img_file in zip(axes.flatten(), image_files):\n",
+        "    img_path = os.path.join(folder_path, img_file)\n",
+        "    img = cv2.imread(img_path)\n",
+        "    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # Convert BGR to RGB for proper display\n",
+        "    ax.imshow(img)\n",
+        "    ax.set_title(img_file, fontsize=8)  # Display filename in smaller font\n",
+        "    ax.axis(\"off\")  # Hide axis labels\n",
+        "\n",
+        "# Hide unused subplots (if any)\n",
+        "for ax in axes.flatten()[len(image_files):]:\n",
+        "    ax.axis(\"off\")\n",
+        "\n",
+        "plt.show()\n",
+        "\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "HNXNoBmmyeCL"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Run caption inference with LLaVAction on the selected video clip (between the specified timestamps)."
+      ],
+      "metadata": {
+        "id": "w5IQMfSYIXHp"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "caption = get_caption(inferencer, uid, checkpoint_folder)\n",
+        "caption"
+      ],
+      "metadata": {
+        "id": "8LhrLRGk8jTo"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [],
+      "metadata": {
+        "id": "d_NY6yJ67eAl"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}
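
The notebook describes the uid format in prose: the video id (for example `P01-P01_01`) followed by the start and end times in seconds (`3.00_4.00`). As a minimal sketch of that convention, the helper below splits a uid into its three parts. `ClipUid` and `parse_uid` are hypothetical names introduced only for illustration; they are not part of the LLaVAction package, and the code assumes the two timestamps are always the last two underscore-separated fields.

```python
from typing import NamedTuple


class ClipUid(NamedTuple):
    """Hypothetical container for the parts of a clip uid."""
    video_id: str
    start_s: float
    end_s: float


def parse_uid(uid: str) -> ClipUid:
    """Split a uid such as 'P01-P01_01_3.00_4.00' into id and timestamps.

    Assumption: the last two underscore-separated fields are the start and
    end times in seconds; everything before them is the video id, which may
    itself contain underscores.
    """
    # rsplit from the right so underscores inside the video id are preserved.
    video_id, start, end = uid.rsplit("_", 2)
    clip = ClipUid(video_id, float(start), float(end))
    if clip.end_s <= clip.start_s:
        raise ValueError(f"end time must be after start time in uid {uid!r}")
    return clip


clip = parse_uid("P01-P01_01_3.00_4.00")
print(clip.video_id, clip.start_s, clip.end_s)
```

A check like this could be run before calling `visualize_with_uid` or `get_caption`, so that a malformed uid fails early with a clear message instead of somewhere inside video loading.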