An LLM-powered robot control system that enables natural language interaction with industrial robots through voice commands or text input.
The system follows a 5-step pipeline architecture that processes human voice commands into precise robot actions:
- Voice Activity Detection (VAD): Captures the audio stream and filters it down to segments containing human speech.
- Speech-to-Text Transcription: Uses Gemini 2.5 Flash for automatic speech recognition, converting the filtered audio to text.
- AI Agent Processing: The agent processes text input, maintains conversation memory, and generates appropriate robot tasks.
- Robot Control Interface: Python-based control tools execute tasks and provide real-time feedback to the agent.
- Text-to-Speech Response: Azure TTS converts the agent's responses back to speech for seamless human interaction.
This architecture enables bidirectional communication between human and robot through natural language, with persistent memory for context-aware conversations.
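
The pseudocode below sketches how these five steps could compose into a single control loop with persistent memory. All function bodies are placeholder stubs (the real implementations live in voice_agent.py and the src/ modules); it only illustrates the data flow, not the project's actual API.

```python
# Illustrative sketch of the 5-step pipeline loop; every function below is a
# placeholder stub, not the project's real implementation.

def capture_speech() -> bytes:
    """Step 1 (VAD): block until speech is detected, return the audio segment."""
    return b""  # stub

def transcribe(audio: bytes) -> str:
    """Step 2 (STT): convert the audio segment to text (e.g. via Gemini 2.5 Flash)."""
    return "move the robot to the home position"  # stub

def run_agent(text: str, memory: list) -> tuple[str, list[str]]:
    """Step 3 (Agent): plan robot tasks from the command and conversation memory."""
    return "Moving to the home position.", ["go_home"]  # stub

def execute_task(task: str) -> str:
    """Step 4 (Robot control): send the task to the robot and return feedback."""
    return f"{task}: done"  # stub

def speak(text: str) -> None:
    """Step 5 (TTS): speak the agent's reply back to the operator (e.g. Azure TTS)."""
    print(f"[TTS] {text}")

def control_loop(turns: int = 1) -> None:
    memory: list = []  # persistent context for multi-turn conversations
    for _ in range(turns):
        audio = capture_speech()
        command = transcribe(audio)
        reply, tasks = run_agent(command, memory)
        feedback = [execute_task(t) for t in tasks]
        memory.append({"command": command, "reply": reply, "feedback": feedback})
        speak(reply)

if __name__ == "__main__":
    control_loop()
```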
```bibtex
@article{KADRI2025106660,
  title = {LLM-driven agent for speech-enabled control of industrial robots: A case study in snow-crab quality inspection},
  journal = {Results in Engineering},
  volume = {27},
  pages = {106660},
  year = {2025},
  issn = {2590-1230},
  doi = {https://doi.org/10.1016/j.rineng.2025.106660},
  url = {https://www.sciencedirect.com/science/article/pii/S2590123025027276},
  author = {Ibrahim Kadri and Sid Ahmed Selouani and Mohsen Ghribi and Rayen Ghali and Sabrina Mekhoukh},
  keywords = {Large language models (LLMs), Voice interface, KUKA industrial robot, Human-robot interaction, Autonomous robotic planning, Computer vision},
}
```
- Clone the repository:
```bash
git clone https://github.com/Rayen023/RobotVoiceControl.git
cd RobotVoiceControl
```
- Project Structure:
```
RobotVoiceControl/
├── voice_agent.py                 # Voice-controlled interface
├── text_agent.py                  # Text-based interface
├── src/
│   ├── agent_common.py            # Shared agent configuration
│   ├── agent_tools.py             # Robot control tools and functions
│   ├── robot_control.py           # Core robot communication
│   ├── tts.py                     # Text-to-speech implementation
│   ├── simple_emotion_detector.py # Emotion recognition
│   └── tech_doc.md                # RAG technical documentation
└── pyproject.toml                 # Project dependencies
```
- Install dependencies using uv:
uv sync
- Configure environment variables: Create a .env file in the root directory with the following API keys:
```
# Azure Speech Services (for Text-to-Speech)
AZURE_SUBSCRIPTION_KEY=your_azure_subscription_key
AZURE_REGION=your_azure_region

# Google GenAI (for Speech-to-Text)
GOOGLE_API_KEY=your_google_api_key

# OpenRouter (for the main AI agent)
OPENROUTER_API_KEY=your_openrouter_api_key
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
```
Voice Control Mode:
python voice_agent.py
Text Control Mode:
python text_agent.py
- Natural Language Processing: Understands robot commands in conversational language.
- Context Awareness: Maintains conversation history and robot state.
- Multi-modal Input: Supports both voice and text commands.
- Command Translation: Converts natural language to robot instructions.
- Error Recovery: Robust error handling and automatic retry mechanisms.
- Modular Architecture: Easily integrates new tools and capabilities (see the sketch after this list).
- Real-time Feedback: Provides live position monitoring and status updates.
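
To illustrate the modular tool design, the sketch below shows the general shape a new robot-control tool might take. The function, its docstring-as-description convention, and the AGENT_TOOLS list are hypothetical placeholders; the actual registration mechanism is defined in src/agent_tools.py and may differ.

```python
# Hypothetical example of a new tool; names and registration style are
# illustrative only, not the repository's real API.

def rotate_joint(joint: int, degrees: float) -> str:
    """Rotate a single robot joint by the given angle in degrees.

    The docstring doubles as the description the LLM agent reads
    when deciding which tool to call.
    """
    # In the real project this would send a KRL command over the robot link.
    return f"Joint A{joint} rotated by {degrees:.1f} degrees."

# A tool typically becomes available to the agent by adding it to the
# collection of callables the agent is configured with.
AGENT_TOOLS = [rotate_joint]
```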
The system establishes communication with the robot using:
- Primary Communication: Telnet connection for sending KRL (KUKA Robot Language) commands.
- Monitoring: py-openshowvar library for reading robot variables and positions.
- Vision Integration: HTTP/FTP communication with Cognex vision systems.
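
As a minimal sketch of the monitoring path, the snippet below reads the robot's live Cartesian pose and axis angles with py-openshowvar, assuming a KUKAVARPROXY server is running on the controller. The IP address and port are placeholders, and the actual read logic in src/robot_control.py may differ.

```python
from py_openshowvar import openshowvar

# Placeholder address; KUKAVARPROXY conventionally listens on port 7000.
ROBOT_IP = "192.168.1.10"
ROBOT_PORT = 7000

client = openshowvar(ROBOT_IP, ROBOT_PORT)
if not client.can_connect:
    raise ConnectionError(f"Cannot reach robot controller at {ROBOT_IP}:{ROBOT_PORT}")

# Read the live Cartesian pose and axis angles from the KRC system variables.
cartesian_pose = client.read("$POS_ACT", debug=False)  # {X ..., Y ..., Z ..., A ..., B ..., C ...}
axis_angles = client.read("$AXIS_ACT", debug=False)    # {A1 ..., A2 ..., ..., A6 ...}

print("Cartesian pose:", cartesian_pose)
print("Axis angles:", axis_angles)
```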
The system generates and executes KRL commands for:
- Position Control: Cartesian and joint space movements (linear and point-to-point).
- Joint Movements: Precise angular positioning with real-time feedback.
- Pick and Place Operations: Automated object manipulation with gripper control.
- Home Position Initialization: Safe startup and reference positioning.
- Real-time Position Monitoring: Continuous status updates and position tracking.
- Safety Monitoring: Real-time position feedback and collision avoidance.
- Vision Integration: Object detection via Cognex cameras.
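
For illustration, the hypothetical helpers below show the shape of the KRL motion commands described above (PTP for joint-space moves, LIN for linear Cartesian moves). The command templates actually generated in src/robot_control.py may differ.

```python
# Hypothetical helpers showing the kind of KRL command strings the control
# tools might generate; not the project's actual formatting code.

def krl_ptp_joints(a1: float, a2: float, a3: float, a4: float, a5: float, a6: float) -> str:
    """Point-to-point move in joint space (axis angles in degrees)."""
    return f"PTP {{A1 {a1}, A2 {a2}, A3 {a3}, A4 {a4}, A5 {a5}, A6 {a6}}}"

def krl_lin_pose(x: float, y: float, z: float, a: float, b: float, c: float) -> str:
    """Linear move to a Cartesian pose (positions in mm, angles in degrees)."""
    return f"LIN {{X {x}, Y {y}, Z {z}, A {a}, B {b}, C {c}}}"

print(krl_ptp_joints(0, -90, 90, 0, 0, 0))  # PTP {A1 0, A2 -90, A3 90, A4 0, A5 0, A6 0}
print(krl_lin_pose(450, 0, 600, 0, 90, 0))  # LIN {X 450, Y 0, Z 600, A 0, B 90, C 0}
```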