
Voice Input

Replace typing with speaking - everywhere on your computer.

Two powerful use cases:

  1. Talk to DeskAgent - Give instructions by voice instead of typing, for example "Reply to this email professionally" or "Create an offer for this customer".

  2. Dictate in any application - Use voice input in Word, your browser, email clients, chat apps, or anywhere else you can type. Your speech is transcribed and inserted at the cursor.

Both use OpenAI's Whisper for professional-grade transcription that handles technical terms, names, and multiple languages with high accuracy.


Overview

DeskAgent supports voice input in two ways:

| Method | Use Case |
|---|---|
| WebUI Microphone | Click the mic button in the chat input area |
| System Hotkey | Press a hotkey from any application, even with DeskAgent minimized |

Both methods use OpenAI's Whisper for accurate speech recognition.
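For orientation, this is roughly what a single Whisper request looks like from Python. It is a sketch, not DeskAgent's implementation: it assumes the official `openai` package (v1 or later) and the `whisper-1` model, and `transcribe_file` is a hypothetical helper. The client is passed in as a parameter so the function can be tried without network access.

```python
from typing import Any, Dict, Optional


def transcribe_file(client: Any, path: str, language: str = "de",
                    prompt: Optional[str] = None) -> str:
    """Send one audio file to Whisper and return the transcript text.

    `client` is expected to behave like an `openai.OpenAI` instance
    (hypothetical wiring; DeskAgent's internals may differ).
    """
    kwargs: Dict[str, Any] = {"model": "whisper-1", "language": language}
    if prompt:
        # Optional hint text, e.g. product names or acronyms.
        kwargs["prompt"] = prompt
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(file=audio_file, **kwargs)
    return result.text
```

Usage, assuming a valid key: `from openai import OpenAI`, then `transcribe_file(OpenAI(api_key="sk-..."), "recording.wav", language="en")`.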


Requirements

OpenAI API Key Required

Voice input requires an OpenAI API key for the Whisper transcription service.

Cost: ~$0.006 per minute of audio (about 0.6 cents per minute)
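Because billing is per minute of audio, monthly costs are easy to estimate. The sketch below hard-codes the rate quoted above, which may change, and assumes cost scales linearly with recording length; the names are illustrative.

```python
WHISPER_USD_PER_MINUTE = 0.006  # rate quoted above; verify against current OpenAI pricing


def estimate_cost_usd(seconds_of_audio: float) -> float:
    """Estimate Whisper transcription cost for a given amount of audio."""
    return seconds_of_audio / 60.0 * WHISPER_USD_PER_MINUTE


# Example: 20 dictations of 30 seconds each per day, over 22 working days
daily = estimate_cost_usd(20 * 30)
print(f"per day: ${daily:.3f}, per month: ${daily * 22:.2f}")
```

With 20 thirty-second dictations a day over 22 working days, this comes to about $1.32 per month.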

Setup

  1. Get an API key from the OpenAI Platform
  2. Add it to config/backends.json, inside the ai_backends section:
"ai_backends": {
  "openai": {
    "type": "openai_api",
    "api_key": "sk-your-api-key-here"
  }
}
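To confirm that the key sits where the troubleshooting section expects it (`ai_backends.openai.api_key`), a small check like the one below can help. It is a sketch that assumes `config/backends.json` is plain JSON with a top-level `ai_backends` object; `find_openai_key` is a hypothetical helper, not a DeskAgent command.

```python
import json
from pathlib import Path
from typing import Optional


def find_openai_key(config_path: str = "config/backends.json") -> Optional[str]:
    """Return the OpenAI key from backends.json, or None if it is missing."""
    path = Path(config_path)
    if not path.is_file():
        return None
    data = json.loads(path.read_text(encoding="utf-8"))
    key = data.get("ai_backends", {}).get("openai", {}).get("api_key")
    # Treat the placeholder value from this page as "not configured".
    if not key or key == "sk-your-api-key-here":
        return None
    return key
```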

WebUI Voice Input

How to Use

  1. Click the microphone button (🎤) next to the text input
  2. Speak your request - the button pulses while recording
  3. Click again to stop - your speech is transcribed and, if auto-submit is enabled, sent
Figure: The microphone button (🎤) next to the input field
Figure: While recording, a red stop button ends the recording

Keyboard Shortcuts

| Shortcut | Action |
|---|---|
| Ctrl+M | Start/stop recording |
| Esc | Cancel recording |
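The two shortcuts behave like a toggle with a cancel path. The sketch below models that documented behaviour (Ctrl+M starts or stops, Esc discards) as a small state machine; it illustrates the behaviour only, is not DeskAgent's source, and the class and method names are hypothetical.

```python
class RecordingToggle:
    """Models Ctrl+M (start/stop) and Esc (cancel) for the WebUI recorder."""

    def __init__(self) -> None:
        self.recording = False
        self.completed = 0  # recordings stopped normally (would be transcribed)

    def on_ctrl_m(self) -> str:
        if not self.recording:
            self.recording = True
            return "started"
        # Second press stops; the audio would now go to transcription.
        self.recording = False
        self.completed += 1
        return "stopped"

    def on_esc(self) -> str:
        if self.recording:
            # Cancel discards the audio instead of transcribing it.
            self.recording = False
            return "cancelled"
        return "ignored"
```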

Auto-Submit

By default, the transcribed text is submitted automatically. To review and edit it before sending, set auto_submit to false:

config/system.json
"voice_input": {
  "auto_submit": false
}
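Conceptually, the setting decides where the transcript goes once Whisper returns it. This sketch assumes a WebUI that either sends the text straight away or leaves it in the input field for editing; `handle_transcript`, `submit` and `fill_input` are hypothetical names.

```python
from typing import Callable


def handle_transcript(text: str, auto_submit: bool,
                      submit: Callable[[str], None],
                      fill_input: Callable[[str], None]) -> None:
    """Route a transcript according to voice_input.auto_submit."""
    text = text.strip()
    if not text:
        return  # nothing recognised; leave the input untouched
    if auto_submit:
        submit(text)      # default: send immediately
    else:
        fill_input(text)  # auto_submit false: let the user review first
```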

Agent Input Dialogs

Voice input also works in agent pre-prompt dialogs. When an agent requires text input before starting (like a description or instructions), you can use the microphone button to dictate instead of typing.

This is especially useful for agents like:

  • Archive Files - Dictate the description for documents
  • Create Offer - Speak special requirements or notes
  • Any agent with text inputs - Look for the 🎤 button next to text fields

System-Wide Hotkeys

The real power comes from system-wide hotkeys. Use them from any application - Outlook, your browser, Word, anywhere.

Available Hotkeys

| Hotkey | Name | Action |
|---|---|---|
| Ctrl+Shift+Space | Dictate | Record → paste text into active app |
| Ctrl+Shift+Enter | Dictate + Enter | Record → paste text → press Enter |
| Ctrl+Shift+Backspace | Agent | Record → start email reply agent |
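Each of these hotkeys corresponds to a key in the voice_input section of config/system.json (see Configuration). A sketch of turning that section into a hotkey-to-action lookup, with the defaults from the table and a guard against assigning one combination twice; the action labels and function name are illustrative, not DeskAgent identifiers.

```python
from typing import Dict

# Config key -> (default hotkey, action), matching the hotkey table.
HOTKEY_ACTIONS = {
    "dictate_hotkey": ("Ctrl+Shift+Space", "dictate"),
    "dictate_hotkey_enter": ("Ctrl+Shift+Enter", "dictate_enter"),
    "agent_hotkey": ("Ctrl+Shift+Backspace", "agent"),
}


def build_hotkey_map(voice_input: Dict[str, str]) -> Dict[str, str]:
    """Return {hotkey: action}, honouring any overrides from system.json."""
    mapping: Dict[str, str] = {}
    for key, (default, action) in HOTKEY_ACTIONS.items():
        hotkey = voice_input.get(key, default)
        if hotkey in mapping:
            raise ValueError(f"{hotkey} is assigned twice")
        mapping[hotkey] = action
    return mapping
```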

Dictate Mode

Dictate into any application:

1. Click in a text field (Word, browser, Notepad, chat, etc.)
2. Press Ctrl+Shift+Space → 🎤 Recording starts
3. Dictate your text
4. Press Ctrl+Shift+Space again → Text is pasted

Tip: Use Ctrl+Shift+Enter to paste and press Enter automatically - perfect for chat apps like Teams or Slack.
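Pasting into another application is commonly done through the clipboard: copy the text, then send Ctrl+V to the focused window. The sketch below shows that sequence, assuming DeskAgent works this way (the troubleshooting hint about pyperclip suggests it). The clipboard and key-sending functions are injected so the flow runs without a desktop session; `paste_transcript` is a hypothetical name.

```python
from typing import Callable


def paste_transcript(text: str, press_enter: bool,
                     copy: Callable[[str], None],
                     send: Callable[[str], None]) -> None:
    """Paste dictated text into the active app, optionally pressing Enter.

    In a real session `copy` could be pyperclip.copy and `send` could be
    keyboard.send; both are parameters here so the logic is testable.
    """
    if not text:
        return
    copy(text)        # put the transcript on the clipboard
    send("ctrl+v")    # paste into whatever field has focus
    if press_enter:   # Ctrl+Shift+Enter variant, e.g. for Teams or Slack
        send("enter")
```

With the real libraries this becomes `paste_transcript(text, False, pyperclip.copy, keyboard.send)`. Note that this approach overwrites the clipboard; a fuller version might save and restore its previous contents.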

Agent Mode

Start the email reply agent with voice instructions:

1. Select an email in Outlook
2. Press Ctrl+Shift+Backspace → 🎤 Recording starts
3. Say: "Please reply professionally, mention our 30-day trial"
4. Press Ctrl+Shift+Backspace again → Recording stops
5. DeskAgent starts the reply agent with your instructions

The agent reads the selected email, drafts a reply based on your instructions, and opens it in Outlook for review.


Configuration

Full configuration options in config/system.json:

config/system.json
"voice_input": {
  "enabled": true,
  "language": "de",
  "auto_submit": true,
  "hotkey": "Ctrl+M",
  "dictate_hotkey": "Ctrl+Shift+Space",
  "dictate_hotkey_enter": "Ctrl+Shift+Enter",
  "agent_hotkey": "Ctrl+Shift+Backspace",
  "outlook_agent": "reply_email"
}
| Option | Default | Description |
|---|---|---|
| enabled | true | Enable/disable voice input globally |
| language | "de" | Transcription language (de, en, fr, etc.) |
| auto_submit | true | Auto-send after transcription in WebUI |
| hotkey | "Ctrl+M" | WebUI recording hotkey |
| dictate_hotkey | "Ctrl+Shift+Space" | Dictate hotkey (paste text) |
| dictate_hotkey_enter | "Ctrl+Shift+Enter" | Dictate + Enter hotkey |
| agent_hotkey | "Ctrl+Shift+Backspace" | Agent hotkey (starts outlook_agent) |
| outlook_agent | "reply_email" | Agent to start with agent hotkey |
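A loader that fills in the documented defaults for any option missing from config/system.json might look like this. The default values come from the options table; the function name and the choice to silently ignore unknown keys are assumptions.

```python
import json
from pathlib import Path
from typing import Any, Dict

VOICE_INPUT_DEFAULTS: Dict[str, Any] = {
    "enabled": True,
    "language": "de",
    "auto_submit": True,
    "hotkey": "Ctrl+M",
    "dictate_hotkey": "Ctrl+Shift+Space",
    "dictate_hotkey_enter": "Ctrl+Shift+Enter",
    "agent_hotkey": "Ctrl+Shift+Backspace",
    "outlook_agent": "reply_email",
}


def load_voice_input(path: str = "config/system.json") -> Dict[str, Any]:
    """Merge the voice_input section of system.json over the defaults."""
    settings = dict(VOICE_INPUT_DEFAULTS)
    file = Path(path)
    if file.is_file():
        data = json.loads(file.read_text(encoding="utf-8"))
        section = data.get("voice_input", {})
        # Only known options are taken over; anything else is ignored.
        settings.update({k: v for k, v in section.items() if k in settings})
    return settings
```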

Improving Recognition

Whisper works well out of the box, but you can improve accuracy for specialized terms.

Create knowledge/whisper_keywords.md with terms Whisper should recognize:

knowledge/whisper_keywords.md
realvirtual GmbH, game4automation, DeskAgent, Digital Twin, Unity
OPC UA, PLC, Siemens, Beckhoff, MQTT
Professional Edition, Research & Education Bundle
Thomas Strigl, Kranya

Include:

  • Company and product names
  • Industry terms and acronyms
  • People's names
  • Unusual spellings

Tip: Keep it to ~20 keywords for best performance.
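Whisper's transcription API accepts an optional `prompt` text that can steer the spelling of unusual words, and a keyword list like this is a natural fit for it. Assuming DeskAgent passes the file that way, it could be turned into a prompt as sketched below: split on commas and line breaks, drop duplicates, and cap the list at the suggested ~20 terms. The function name and exact cap are illustrative.

```python
import re
from typing import List


def keywords_to_prompt(text: str, limit: int = 20) -> str:
    """Build a comma-separated Whisper prompt from whisper_keywords.md text."""
    terms: List[str] = []
    for raw in re.split(r"[,\n]", text):
        term = raw.strip().lstrip("-*").strip()  # tolerate Markdown bullets
        if term and term not in terms:
            terms.append(term)
    return ", ".join(terms[:limit])
```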

Automatic Extraction

If you don't create a keywords file, DeskAgent automatically extracts terms from:

  1. knowledge/company.md
  2. knowledge/products.md
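How the terms are extracted is not documented here. Purely as an illustration, one plausible heuristic is to collect Markdown headings and bold phrases from those files when no keywords file exists. The file paths follow the list above; the heuristic, the 20-term cap and the function name are assumptions.

```python
import re
from pathlib import Path
from typing import List

FALLBACK_FILES = ["knowledge/company.md", "knowledge/products.md"]


def keyword_source_terms(base_dir: str = ".", limit: int = 20) -> List[str]:
    """Prefer whisper_keywords.md; otherwise scan the company/products notes."""
    base = Path(base_dir)
    explicit = base / "knowledge" / "whisper_keywords.md"
    if explicit.is_file():
        terms = re.split(r"[,\n]", explicit.read_text(encoding="utf-8"))
    else:
        terms = []
        for name in FALLBACK_FILES:
            path = base / name
            if path.is_file():
                text = path.read_text(encoding="utf-8")
                # Heuristic: Markdown headings and **bold** phrases.
                terms += re.findall(r"^#+\s+(.+)$", text, flags=re.MULTILINE)
                terms += re.findall(r"\*\*(.+?)\*\*", text)
    unique: List[str] = []
    for term in terms:
        term = term.strip()
        if term and term not in unique:
            unique.append(term)
    return unique[:limit]
```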

Audio Feedback

DeskAgent provides audio feedback so you know what's happening:

| Sound | Meaning |
|---|---|
| High beep (800 Hz) | Recording started |
| Low beep (400 Hz) | Recording stopped |
| Soft ticks | Processing/transcribing |
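The start and stop signals are plain sine tones. To reproduce them, for example to check that your speakers are audible, this sketch writes them to WAV files using only the standard library. The frequencies come from the table; the 150 ms duration, 16 kHz sample rate and volume are assumptions, not DeskAgent's actual values.

```python
import math
import struct
import wave


def write_beep(path: str, freq_hz: float, duration_s: float = 0.15,
               sample_rate: int = 16000, volume: float = 0.3) -> int:
    """Write a mono 16-bit sine beep to `path`; return the number of frames."""
    n_frames = round(duration_s * sample_rate)
    amplitude = int(volume * 32767)
    frames = b"".join(
        struct.pack("<h", int(amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return n_frames
```

For example, `write_beep("start.wav", 800)` and `write_beep("stop.wav", 400)` produce the "recording started" and "recording stopped" tones.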

Outlook Web Support

The agent hotkey also works with Outlook Web (Office 365 in the browser):

  1. Open Outlook Web in Chrome/Edge
  2. Click on an email to select it
  3. Press Ctrl+Shift+Backspace to record your instructions, then press it again to stop
  4. DeskAgent extracts the message ID from the URL
  5. The reply agent processes the email just as it does in desktop Outlook
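Outlook on the web typically puts the selected item's ID in the address, in the path segment after `/id/` (for example `https://outlook.office.com/mail/inbox/id/AAQk...`). Assuming that URL shape, which Microsoft does not guarantee and which DeskAgent may handle differently, the ID can be extracted like this; the function name is hypothetical.

```python
from typing import Optional
from urllib.parse import unquote, urlparse


def message_id_from_owa_url(url: str) -> Optional[str]:
    """Return the URL-decoded item ID from an Outlook Web URL, if present."""
    parts = urlparse(url).path.split("/")
    if "id" in parts:
        idx = parts.index("id")
        if idx + 1 < len(parts) and parts[idx + 1]:
            return unquote(parts[idx + 1])
    return None
```

If this returns None, the browser is probably not on an email detail page, which matches the troubleshooting hint below.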

Browser Integration

On first use, DeskAgent may show a consent dialog for browser integration. Browser integration starts the browser with remote debugging enabled so that DeskAgent can read the current URL.
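Chromium-based browsers started with `--remote-debugging-port` list their open tabs as JSON at `http://localhost:<port>/json`, each entry carrying a `type` and a `url`. Reading the Outlook tab's address can therefore be sketched as below. The port 9222, the filter on `outlook.` hosts and picking the first matching page are assumptions; the parser takes the JSON as text so it can be tested offline.

```python
import json
from typing import Optional
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_targets(port: int = 9222) -> str:
    """Fetch the DevTools target list (needs a browser with remote debugging)."""
    with urlopen(f"http://localhost:{port}/json", timeout=2) as resp:
        return resp.read().decode("utf-8")


def find_outlook_url(targets_json: str) -> Optional[str]:
    """Return the URL of the first Outlook Web page among the targets."""
    for target in json.loads(targets_json):
        if target.get("type") != "page":
            continue  # skip service workers, extensions, etc.
        host = urlparse(target.get("url", "")).hostname or ""
        if host.startswith("outlook."):
            return target["url"]
    return None
```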


Troubleshooting

Voice button not showing

Check: Is the OpenAI API key configured?

# In DeskAgent chat, ask:
"Is voice input available?"

"OpenAI API key not configured"

Add your API key to config/backends.json under ai_backends.openai.api_key.

Recording doesn't start

Check dependencies:

pip install sounddevice soundfile numpy pyperclip keyboard pynput
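To see which of these packages are actually missing, a quick check with `importlib` works; the import names match the pip names for all six packages. Run it with the same Python interpreter that DeskAgent uses. This is a generic helper, not a DeskAgent command.

```python
import importlib.util
from typing import Iterable, List

REQUIRED = ["sounddevice", "soundfile", "numpy", "pyperclip", "keyboard", "pynput"]


def missing_modules(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the modules from `names` that this Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("All voice dependencies found." if not missing
          else "Missing: " + " ".join(missing))
```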

Text not pasting (Dictate mode)

  • Make sure a text field is focused
  • Try clicking in the target field before pressing the hotkey
  • Check that pyperclip is installed

Agent hotkey not starting agent

  • Make sure an email is selected in Outlook (single click, not opened)
  • Check that outlook_agent is configured in config/system.json
  • On Outlook Web: Make sure you're on an email detail page (URL contains message ID)

Poor transcription quality

  1. Create a knowledge/whisper_keywords.md file
  2. Speak clearly and at normal pace
  3. Reduce background noise
  4. Check your microphone settings

API Details

For developers integrating voice input:

| Endpoint | Method | Description |
|---|---|---|
| /transcribe/status | GET | Check availability and config |
| /transcribe | POST | Transcribe audio file (multipart/form-data) |

Transcription cost: $0.006 per minute (tracked in cost statistics)
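A client for these endpoints can be written with the standard library. This sketch assumes you know the base URL of your DeskAgent server (it is not stated here), that the upload field is named `file`, and that both endpoints reply with JSON; all three are assumptions to check against your installation. The multipart body is assembled by hand.

```python
import json
import os
import uuid
from typing import Any, Dict, Tuple
from urllib.request import Request, urlopen


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "audio/wav") -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def get_status(base_url: str) -> Dict[str, Any]:
    """GET /transcribe/status and decode the (assumed) JSON reply."""
    with urlopen(f"{base_url}/transcribe/status", timeout=5) as resp:
        return json.load(resp)


def transcribe(path: str, base_url: str) -> Dict[str, Any]:
    """POST an audio file to /transcribe (field name "file" is an assumption)."""
    with open(path, "rb") as f:
        body, ctype = build_multipart("file", os.path.basename(path), f.read())
    req = Request(f"{base_url}/transcribe", data=body,
                  headers={"Content-Type": ctype}, method="POST")
    with urlopen(req, timeout=60) as resp:
        return json.load(resp)
```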


Next Steps

  • Keyboard Shortcuts - Learn all the shortcuts for efficient work (see Your Assistant)
  • Email Automation - Automate your email workflows (see Email Guide)