Voice Input¶
Replace typing with speaking - everywhere on your computer.
Two powerful use cases:
- Talk to DeskAgent - Give instructions by voice instead of typing. "Reply to this email professionally" or "Create an offer for this customer"
- Dictate in any application - Use voice input in Word, your browser, email clients, chat apps - anywhere you can type. Your speech is accurately transcribed and inserted instantly.
Both use OpenAI's Whisper for professional-grade transcription that handles technical terms, names, and multiple languages with high accuracy.
Overview¶
DeskAgent supports voice input in two ways:
| Method | Use Case |
|---|---|
| WebUI Microphone | Click the mic button in the chat input area |
| System Hotkey | Press a hotkey from any application - even with DeskAgent minimized |
Both methods use OpenAI's Whisper for accurate speech recognition.
Requirements¶
OpenAI API Key Required
Voice input requires an OpenAI API key for the Whisper transcription service.
Cost: ~$0.006 per minute of audio (~0.6 cents per minute)
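To get a feel for the numbers, the rate above can be turned into a quick estimate. This is a minimal sketch that assumes cost scales linearly with audio length; the function name is hypothetical and the actual billing granularity may differ.

```python
# Rate quoted in this guide: about $0.006 per minute of audio.
WHISPER_USD_PER_MINUTE = 0.006


def estimate_cost_usd(audio_seconds: float) -> float:
    """Approximate transcription cost in US dollars (assumes linear billing)."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return round(audio_seconds / 60 * WHISPER_USD_PER_MINUTE, 6)


if __name__ == "__main__":
    # A 30-second dictation and a 10-minute recording.
    print(estimate_cost_usd(30))
    print(estimate_cost_usd(600))
```

Under this assumption, a typical short dictation costs a fraction of a cent.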
Setup¶
- Get an API key from OpenAI Platform
- Add it to config/backends.json under ai_backends.openai.api_key.
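If you prefer to script this step, the sketch below merges a key into the file without touching other entries. The key path ai_backends.openai.api_key is the one named in the Troubleshooting section; the helper names are hypothetical, and any other fields that backends.json may need are not covered here.

```python
import json
from pathlib import Path


def set_openai_key(config: dict, api_key: str) -> dict:
    """Return a copy of config with ai_backends.openai.api_key set."""
    updated = json.loads(json.dumps(config))  # deep copy of JSON-like data
    openai_section = updated.setdefault("ai_backends", {}).setdefault("openai", {})
    openai_section["api_key"] = api_key
    return updated


def write_openai_key(path: Path, api_key: str) -> None:
    """Read the file if it exists, merge the key, and write it back."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    merged = set_openai_key(config, api_key)
    path.write_text(json.dumps(merged, indent=2), encoding="utf-8")


if __name__ == "__main__":
    # Show the resulting structure for an empty config; the key is a placeholder.
    print(json.dumps(set_openai_key({}, "sk-your-key-here"), indent=2))
```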
WebUI Voice Input¶
How to Use¶
- Click the microphone button (🎤) next to the text input
- Speak your request - the button pulses while recording
- Click again to stop - your speech is transcribed and optionally sent
Keyboard Shortcut¶
| Shortcut | Action |
|---|---|
| Ctrl+M | Start/stop recording |
| Esc | Cancel recording |
Auto-Submit¶
By default, the transcribed text is submitted automatically. To review it before sending, set auto_submit to false in the voice_input section of config/system.json.
Agent Input Dialogs¶
Voice input also works in agent pre-prompt dialogs. When an agent requires text input before starting (like a description or instructions), you can use the microphone button to dictate instead of typing.
This is especially useful for agents like:
- Archive Files - Dictate the description for documents
- Create Offer - Speak special requirements or notes
- Any agent with text inputs - Look for the 🎤 button next to text fields
System-Wide Hotkeys¶
The real power comes from system-wide hotkeys. Use them from any application - Outlook, your browser, Word, anywhere.
Available Hotkeys¶
| Hotkey | Name | Action |
|---|---|---|
| Ctrl+Shift+Space | Dictate | Record → paste text into active app |
| Ctrl+Shift+Enter | Dictate + Enter | Record → paste text → press Enter |
| Ctrl+Shift+Backspace | Agent | Record → start email reply agent |
Dictate Mode¶
Dictate into any application:
1. Click in a text field (Word, browser, Notepad, chat, etc.)
2. Press Ctrl+Shift+Space → 🎤 Recording starts
3. Dictate your text
4. Press Ctrl+Shift+Space again → Text is pasted
Tip: Use Ctrl+Shift+Enter to paste and press Enter automatically - perfect for chat apps like Teams or Slack.
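The press-to-start, press-again-to-paste behaviour described above amounts to a two-state toggle. The sketch below illustrates that idea only; it is not DeskAgent's implementation, and the class name and the injected transcribe, paste, and press_enter callables are hypothetical stand-ins for the real recording and clipboard logic.

```python
from typing import Callable, Optional


class DictationToggle:
    """Two-state toggle: first hotkey press records, second press pastes."""

    def __init__(
        self,
        transcribe: Callable[[], str],
        paste: Callable[[str], None],
        press_enter: Optional[Callable[[], None]] = None,
    ) -> None:
        self.transcribe = transcribe
        self.paste = paste
        self.press_enter = press_enter
        self.recording = False

    def on_hotkey(self, with_enter: bool = False) -> Optional[str]:
        """Handle one hotkey press; return the pasted text when recording stops."""
        if not self.recording:
            self.recording = True  # first press: start recording
            return None
        self.recording = False  # second press: stop, transcribe, paste
        text = self.transcribe()
        self.paste(text)
        if with_enter and self.press_enter is not None:
            self.press_enter()  # "Dictate + Enter" variant
        return text


if __name__ == "__main__":
    toggle = DictationToggle(transcribe=lambda: "Hello team", paste=print)
    toggle.on_hotkey()  # start recording
    toggle.on_hotkey()  # stop and paste (here: print)
```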
Agent Mode¶
Start the email reply agent with voice instructions:
1. Select an email in Outlook
2. Press Ctrl+Shift+Backspace → 🎤 Recording starts
3. Say: "Please reply professionally, mention our 30-day trial"
4. Press Ctrl+Shift+Backspace again → Recording stops
5. DeskAgent starts the reply agent with your instructions
The agent reads the selected email, drafts a reply based on your instructions, and opens it in Outlook for review.
Configuration¶
Full configuration options in config/system.json:
```json
"voice_input": {
  "enabled": true,
  "language": "de",
  "auto_submit": true,
  "hotkey": "Ctrl+M",
  "dictate_hotkey": "Ctrl+Shift+Space",
  "dictate_hotkey_enter": "Ctrl+Shift+Enter",
  "agent_hotkey": "Ctrl+Shift+Backspace",
  "outlook_agent": "reply_email"
}
```
| Option | Default | Description |
|---|---|---|
| enabled | true | Enable/disable voice input globally |
| language | "de" | Transcription language (de, en, fr, etc.) |
| auto_submit | true | Auto-send after transcription in WebUI |
| hotkey | "Ctrl+M" | WebUI recording hotkey |
| dictate_hotkey | "Ctrl+Shift+Space" | Dictate hotkey (paste text) |
| dictate_hotkey_enter | "Ctrl+Shift+Enter" | Dictate + Enter hotkey |
| agent_hotkey | "Ctrl+Shift+Backspace" | Agent hotkey (starts outlook_agent) |
| outlook_agent | "reply_email" | Agent to start with agent hotkey |
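The table of defaults can be read as "anything you omit falls back to the default". The sketch below shows that merge for a parsed config/system.json; it assumes voice_input is a top-level key and that omitted options use the defaults listed above. The function name is hypothetical.

```python
from typing import Any, Dict

# Defaults as listed in the options table above.
VOICE_INPUT_DEFAULTS: Dict[str, Any] = {
    "enabled": True,
    "language": "de",
    "auto_submit": True,
    "hotkey": "Ctrl+M",
    "dictate_hotkey": "Ctrl+Shift+Space",
    "dictate_hotkey_enter": "Ctrl+Shift+Enter",
    "agent_hotkey": "Ctrl+Shift+Backspace",
    "outlook_agent": "reply_email",
}


def resolve_voice_config(system_config: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay the user's voice_input section on top of the defaults."""
    overrides = system_config.get("voice_input", {})
    return {**VOICE_INPUT_DEFAULTS, **overrides}


if __name__ == "__main__":
    # English transcription, review text before sending in the WebUI.
    user_config = {"voice_input": {"language": "en", "auto_submit": False}}
    for option, value in resolve_voice_config(user_config).items():
        print(f"{option}: {value}")
```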
Improving Recognition¶
Whisper works well out of the box, but you can improve accuracy for specialized terms.
Keywords File (Recommended)¶
Create knowledge/whisper_keywords.md with terms Whisper should recognize:
realvirtual GmbH, game4automation, DeskAgent, Digital Twin, Unity
OPC UA, PLC, Siemens, Beckhoff, MQTT
Professional Edition, Research & Education Bundle
Thomas Strigl, Kranya
Include:
- Company and product names
- Industry terms and acronyms
- People's names
- Unusual spellings
Tip: Keep it to ~20 keywords for best performance.
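As an illustration of how such a file could be consumed, the sketch below splits the text on commas and line breaks, removes duplicates, and caps the list at 20 terms. Whether DeskAgent parses the file this way, and whether it passes the result to Whisper as a prompt, are assumptions; the function names are hypothetical.

```python
import re
from typing import List

MAX_KEYWORDS = 20  # matches the ~20 keyword tip above


def parse_keywords(text: str, limit: int = MAX_KEYWORDS) -> List[str]:
    """Split on commas and newlines, drop blanks and duplicates, keep order."""
    keywords: List[str] = []
    for raw in re.split(r"[,\n]", text):
        term = raw.strip()
        if term and term not in keywords:
            keywords.append(term)
    return keywords[:limit]


def build_prompt(keywords: List[str]) -> str:
    """Join keywords into one comma-separated hint string."""
    return ", ".join(keywords)


if __name__ == "__main__":
    sample = "OPC UA, PLC, Siemens, Beckhoff, MQTT\nThomas Strigl, Kranya"
    print(build_prompt(parse_keywords(sample)))
```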
Automatic Extraction¶
If you don't create a keywords file, DeskAgent automatically extracts terms from:
- knowledge/company.md
- knowledge/products.md
Audio Feedback¶
DeskAgent provides audio feedback so you know what's happening:
| Sound | Meaning |
|---|---|
| High beep (800 Hz) | Recording started |
| Low beep (400 Hz) | Recording stopped |
| Soft ticks | Processing/transcribing |
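For readers curious what these tones look like as data, the sketch below generates 800 Hz and 400 Hz sine beeps with the Python standard library. Only the two frequencies come from the table; the duration, volume, sample rate, and function names are assumptions, and this is not how DeskAgent necessarily produces its sounds.

```python
import io
import math
import struct
import wave
from typing import List

SAMPLE_RATE = 44100  # assumed sample rate in Hz


def beep_samples(freq_hz: float, duration_s: float = 0.15, volume: float = 0.5) -> List[int]:
    """Return 16-bit PCM samples of a sine tone."""
    count = int(SAMPLE_RATE * duration_s)
    amplitude = volume * 32767
    return [
        int(amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(count)
    ]


def to_wav_bytes(samples: List[int]) -> bytes:
    """Pack samples into an in-memory mono 16-bit WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer.getvalue()


if __name__ == "__main__":
    start_beep = to_wav_bytes(beep_samples(800))  # recording started
    stop_beep = to_wav_bytes(beep_samples(400))   # recording stopped
    print(len(start_beep), len(stop_beep))
```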
Outlook Web Support¶
The system hotkey also works with Outlook Web (Office 365 in browser):
- Open Outlook Web in Chrome/Edge
- Click on an email to select it
- Press Ctrl+Shift+Backspace (the agent hotkey) to record
- DeskAgent extracts the message ID from the URL
- The reply agent processes it like desktop Outlook
Browser Integration
First use may trigger a consent dialog for browser integration. This starts a browser with remote debugging to read the current URL.
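The ID extraction step can be pictured as simple URL parsing. The sketch below assumes the Outlook Web URL carries the message ID as the path segment following "/id/"; that URL shape and the function name are assumptions for illustration, not a description of DeskAgent's actual parser.

```python
from typing import Optional
from urllib.parse import unquote, urlparse


def extract_message_id(url: str) -> Optional[str]:
    """Return the path segment after 'id', URL-decoded, or None if absent."""
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    if "id" in segments:
        index = segments.index("id")
        if index + 1 < len(segments):
            return unquote(segments[index + 1])
    return None


if __name__ == "__main__":
    # Example URL with a made-up, shortened ID.
    print(extract_message_id("https://outlook.office.com/mail/inbox/id/AAQkAGI2%3D"))
    print(extract_message_id("https://outlook.office.com/mail/inbox"))
```

A None result corresponds to the case where no email detail page is open.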
Troubleshooting¶
Voice button not showing¶
Check: Is the OpenAI API key configured?
"OpenAI API key not configured"¶
Add your API key to config/backends.json under ai_backends.openai.api_key.
Recording doesn't start¶
Check that the required dependencies are installed.
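One way to check from Python is to test whether a module can be imported. The sketch below is generic; pyperclip is the only dependency this guide names, so extend the list with whatever your installation actually requires. The function name is hypothetical.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Only pyperclip is mentioned in this guide; add other modules as needed.
    missing = missing_modules(["pyperclip"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are installed.")
```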
Text not pasting (Dictate mode)¶
- Make sure a text field is focused
- Try clicking in the target field before pressing the hotkey
- Check if pyperclip is installed
Agent hotkey not starting agent¶
- Make sure an email is selected in Outlook (single click, not opened)
- Check that outlook_agent is configured in config/system.json
- On Outlook Web: Make sure you're on an email detail page (URL contains message ID)
Poor transcription quality¶
- Create a knowledge/whisper_keywords.md file
- Speak clearly and at normal pace
- Reduce background noise
- Check your microphone settings
API Details¶
For developers integrating voice input:
| Endpoint | Method | Description |
|---|---|---|
| /transcribe/status | GET | Check availability and config |
| /transcribe | POST | Transcribe audio file (multipart/form-data) |
Transcription cost: $0.006 per minute (tracked in cost statistics)
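A client for these two endpoints might look like the sketch below, which uses only the Python standard library. The base URL, the form field name "file", and the assumption that the status endpoint returns JSON are all hypothetical; only the paths, methods, and the multipart/form-data encoding come from the table above.

```python
import json
import urllib.request
import uuid

BASE_URL = "http://localhost:8000"  # hypothetical; use your DeskAgent address


def endpoint_url(base_url: str, path: str) -> str:
    """Join base URL and path with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_status(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """GET /transcribe/status (assumes a JSON response body)."""
    url = endpoint_url(base_url, "/transcribe/status")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def build_multipart(field_name: str, filename: str, payload: bytes, boundary: str) -> bytes:
    """Encode a single file as a multipart/form-data body."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail


def build_transcribe_request(
    base_url: str,
    audio: bytes,
    filename: str = "audio.wav",
    field_name: str = "file",  # assumed field name
) -> urllib.request.Request:
    """Prepare (but do not send) a POST /transcribe request."""
    boundary = uuid.uuid4().hex
    body = build_multipart(field_name, filename, audio, boundary)
    return urllib.request.Request(
        endpoint_url(base_url, "/transcribe"),
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


if __name__ == "__main__":
    request = build_transcribe_request(BASE_URL, b"\x00\x01")
    print(request.get_method(), request.full_url)
```

To send the prepared request, pass it to urllib.request.urlopen; check /transcribe/status first to confirm that transcription is available.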
Next Steps¶
- Keyboard Shortcuts - Learn all the shortcuts for efficient work
- Email Automation - Automate your email workflows