Commit 064b21f

phi-4 multimodal demo added (NVIDIA#276)
* phi-4 multimodal demo added
* Update Phi-4 multimodal podcast assistant notebook
* Update repository URL in ai-podcast-assistant README
* Remove .DS_Store file
* Added instructions to setup API key
* removed outputs
1 parent a39128d commit 064b21f

File tree

3 files changed (+450, −1 lines)


community/README.md

Lines changed: 5 additions & 1 deletion
@@ -68,6 +68,10 @@

  This tool demonstrates how to use a user-friendly interface to interact with NVIDIA NIMs, including those available in the API catalog, self-deployed NIM endpoints, and NIMs hosted on Hugging Face. It also provides settings to integrate RAG pipelines with either local and temporary vector stores or self-hosted search engines. Developers can use this tool to design system prompts, few-shot prompts, and configure LLM settings.

* [Chatbot with RAG and Glean](./chat-and-rag-glean/)

  This tool shows how to build a chat interface that uses NVIDIA NIMs along with the Glean Search API to enable internal knowledge base search, chat, and retrieval.

* [AI Podcast Assistant](./ai-podcast-assistant/)

  This example demonstrates a comprehensive workflow for processing podcast audio using the Phi-4-Multimodal LLM through NVIDIA NIM Microservices. It includes functionality for generating detailed notes from audio content, creating concise summaries, and translating both transcriptions and summaries into different languages. The implementation handles long audio files by automatically chunking them for efficient processing and preserves formatting during translation.
community/ai-podcast-assistant/README.md

Lines changed: 92 additions & 0 deletions

@@ -0,0 +1,92 @@
# AI Podcast Assistant

A comprehensive toolkit for generating detailed notes, summaries, and translations of podcast content using the Phi-4-Multimodal LLM with NVIDIA NIM Microservices.

## Overview

This repository contains a Jupyter notebook that demonstrates a complete workflow for processing podcast audio:

1. **Notes generation**: Convert spoken podcast content into detailed text notes
2. **Summarization**: Generate concise summaries of the transcribed content
3. **Translation**: Translate both the transcription and summary into other languages

The implementation leverages the Phi-4-Multimodal LLM (5.6B parameters) through NVIDIA NIM Microservices, enabling efficient processing of long-form audio content.

Learn more about the model [here](https://developer.nvidia.com/blog/latest-multimodal-addition-to-microsoft-phi-slms-trained-on-nvidia-gpus/).
## Features

- **Long Audio Processing**: Automatically chunks long audio files for processing
- **Detailed Notes Generation**: Creates well-formatted, detailed notes from audio content
- **Summarization**: Generates concise summaries capturing the key points
- **Translation**: Translates content into multiple languages while preserving formatting
- **File Export**: Saves results as text files for easy sharing and reference
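The "Long Audio Processing" feature splits long recordings into chunks before they are sent to the model. The notebook relies on pydub for this; the sketch below illustrates the same idea with only the standard-library `wave` module, so the function name and default chunk length are assumptions, not the notebook's code.

```python
# Dependency-free sketch of fixed-length audio chunking. Each chunk is
# re-wrapped as a standalone WAV file so it can be encoded and sent
# to the model independently.
import io
import wave

def chunk_wav(wav_bytes: bytes, chunk_seconds: int = 600) -> list[bytes]:
    """Split an in-memory WAV file into chunks of at most chunk_seconds each."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames_per_chunk = params.framerate * chunk_seconds
        chunks = []
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                dst.setparams(params)  # same channels/sample width/rate as the source
                dst.writeframes(frames)  # header frame count is fixed up on close
            chunks.append(buf.getvalue())
    return chunks
```

A ten-minute default keeps each chunk comfortably inside the model's context once transcribed; pydub offers the same slicing by milliseconds (`audio[start:end]`) for formats beyond WAV.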
## Requirements

- Python 3.10+
- Jupyter Notebook or JupyterLab
- NVIDIA API key (see the [Installation](#installation) section for setup instructions)
- Required Python packages:
  - requests
  - pydub
  - Pillow (PIL)
  - base64 (Python standard library; no installation needed)

## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/NVIDIA/GenerativeAIExamples.git
   cd GenerativeAIExamples/community/ai-podcast-assistant
   ```

2. Set up your NVIDIA API key:
   - Sign up for [NVIDIA NIM Microservices](https://build.nvidia.com/explore/discover?signin=true)
   - Generate an [API key](https://build.nvidia.com/microsoft/phi-4-multimodal-instruct?api_key=true)
   - Replace the placeholder in the notebook with your API key
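Step 2 says to replace a placeholder in the notebook with your key. A common alternative, sketched here, is to read the key from an environment variable (`NVIDIA_API_KEY` is an assumed name, not one the notebook mandates) so the key never lands in a committed notebook:

```python
# Read the API key from the environment instead of hard-coding it.
import os

api_key = os.environ.get("NVIDIA_API_KEY", "nvapi-REPLACE_ME")  # placeholder fallback

# Bearer-token headers in the form the NIM REST endpoints expect.
headers = {
    "Authorization": f"Bearer {api_key}",
    "Accept": "application/json",
}
```

Set the variable once per shell (`export NVIDIA_API_KEY=nvapi-...`) before launching Jupyter.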
## Usage

1. Open the Jupyter notebook:

   ```bash
   jupyter notebook ai-podcast-assistant-phi4-mulitmodal.ipynb
   ```

2. Update the `podcast_audio_path` variable with the path to your audio file.

3. Run the notebook cells sequentially to:
   - Process the audio file
   - Generate detailed notes
   - Create a summary
   - Translate the content (optional)
   - Save the results to text files
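The per-chunk request that the steps above imply can be sketched as follows. The endpoint URL, model name, and the way the audio is embedded in the message follow the OpenAI-style chat-completions convention used on build.nvidia.com, but treat the payload shape as an assumption and check the notebook and the API docs for the exact format:

```python
# Hedged sketch: base64-encode one audio chunk and ask the model for notes.
import base64
import requests

INVOKE_URL = "https://integrate.api.nvidia.com/v1/chat/completions"  # assumed endpoint

def build_payload(audio_b64: str, prompt: str) -> dict:
    """Assemble a chat-completions payload with inline base64 audio (assumed format)."""
    return {
        "model": "microsoft/phi-4-multimodal-instruct",
        "messages": [{
            "role": "user",
            "content": f'{prompt} <audio src="data:audio/mp3;base64,{audio_b64}" />',
        }],
        "max_tokens": 1024,
    }

def notes_for_chunk(audio_bytes: bytes, api_key: str) -> str:
    payload = build_payload(
        base64.b64encode(audio_bytes).decode(),
        "Generate detailed, well-formatted notes for this audio.",
    )
    resp = requests.post(
        INVOKE_URL,
        headers={"Authorization": f"Bearer {api_key}", "Accept": "application/json"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

Running this over each chunk and concatenating the results gives the full set of notes; summarization and translation reuse the same call with different prompts.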
## Example Output

The notebook generates:

1. **Detailed Notes**: Bullet-pointed notes capturing the main content of the podcast
2. **Summary**: A concise paragraph summarizing the key points
3. **Translation**: The notes and summary translated into your chosen language

All outputs are saved as text files for easy reference and sharing.
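The export step above amounts to writing each result to its own text file. A minimal sketch, in which the output file names are assumptions rather than the notebook's actual names:

```python
# Write the generated notes and summary to plain-text files.
from pathlib import Path

def save_outputs(notes: str, summary: str, out_dir: str = ".") -> list[Path]:
    """Save each result as UTF-8 text and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, text in [("podcast_notes.txt", notes), ("podcast_summary.txt", summary)]:
        p = out / name
        p.write_text(text, encoding="utf-8")
        paths.append(p)
    return paths
```

Translated variants can reuse the same helper with a language suffix in the file name.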
## Model Details

The Phi-4-Multimodal LLM used in this project has the following specifications:

- **Parameters**: 5.6B
- **Inputs**: Text, image, audio
- **Context Length**: 128K tokens
- **Training Data**: 5T text tokens, 2.3M hours of speech, 1.1T image-text tokens
- **Supported Languages**: Multilingual text and audio (English, Chinese, German, French, and others)

## Acknowledgments

- Microsoft, for providing access to the Phi-4-Multimodal model
- NVIDIA, for providing access to the NIM Microservices preview

## Contact

For questions, feedback, or collaboration opportunities, please open an issue in this repository.
