Text-to-speech (TTS)
OpenClaw can convert outbound replies into audio using ElevenLabs, Microsoft, or OpenAI. It works anywhere OpenClaw can send audio.
Supported services
- ElevenLabs (primary or fallback provider)
- Microsoft (primary or fallback provider; current bundled implementation uses node-edge-tts)
- OpenAI (primary or fallback provider; also used for summaries)
Microsoft speech notes
The bundled Microsoft speech provider currently uses Microsoft Edge’s online neural TTS service via the node-edge-tts library. It’s a hosted service (not
local), uses Microsoft endpoints, and does not require an API key.
node-edge-tts exposes speech configuration options and output formats, but
not all options are supported by the service. Legacy config and directive input
using edge still works and is normalized to microsoft.
Because this path is a public web service without a published SLA or quota,
treat it as best-effort. If you need guaranteed limits and support, use OpenAI
or ElevenLabs.
Optional keys
If you want OpenAI or ElevenLabs:
- ELEVENLABS_API_KEY (or XI_API_KEY)
- OPENAI_API_KEY
Auto-summary uses summaryModel (or agents.defaults.model.primary),
so that provider must also be authenticated if you enable summaries.
Service links
- OpenAI Text-to-Speech guide
- OpenAI Audio API reference
- ElevenLabs Text to Speech
- ElevenLabs Authentication
- node-edge-tts
- Microsoft Speech output formats
Is it enabled by default?
No. Auto-TTS is off by default. Enable it in config with messages.tts.auto or per session with /tts always (alias: /tts on).
When messages.tts.provider is unset, OpenClaw picks the first configured
speech provider in registry auto-select order.
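The selection rule can be pictured with a short Python sketch. It is an illustration only: the registry order passed in, the notion of a "configured" set, and the function names are assumptions made for this example, not OpenClaw's actual implementation.

```python
from typing import Iterable, Optional, Set

# Legacy ids are normalized before use (edge -> microsoft, as noted above).
LEGACY_ALIASES = {"edge": "microsoft"}


def normalize_provider(provider_id: str) -> str:
    """Map a legacy provider id to its current name."""
    return LEGACY_ALIASES.get(provider_id, provider_id)


def pick_provider(
    explicit: Optional[str],
    configured: Set[str],
    registry_order: Iterable[str],
) -> Optional[str]:
    """Return the explicit provider if set, else the first configured one.

    `registry_order` is supplied by the caller because the real
    auto-select order is not documented on this page.
    """
    if explicit:
        return normalize_provider(explicit)
    for provider_id in registry_order:
        if provider_id in configured:
            return provider_id
    return None


# Hypothetical order, used only for the demonstration below.
order = ["openai", "elevenlabs", "microsoft"]
print(pick_provider(None, {"elevenlabs", "microsoft"}, order))
print(pick_provider("edge", set(), order))
```

With the hypothetical order above, an unset provider resolves to the first entry that is also in the configured set, and an explicit legacy `edge` value resolves to `microsoft`.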
Config
TTS config lives under messages.tts in openclaw.json.
Full schema is in Gateway configuration.
Minimal config (enable + provider)
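A minimal config of this kind might look like the JSON printed by the sketch below. Only the messages.tts.auto and messages.tts.provider keys are taken from this page; the values chosen are one illustrative combination, not a recommended default, and the Gateway configuration schema remains the authority.

```python
import json

# Illustrative values only; see the field notes for the accepted options.
minimal_config = {
    "messages": {
        "tts": {
            "auto": "always",          # one of: off, always, inbound, tagged
            "provider": "elevenlabs",  # or "microsoft" / "openai"
        }
    }
}

# Print the fragment as it could appear inside openclaw.json.
print(json.dumps(minimal_config, indent=2))
```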
OpenAI primary with ElevenLabs fallback
Microsoft primary (no API key)
Disable Microsoft speech
Custom limits + prefs path
Only reply with audio after an inbound voice message
Disable auto-summary for long replies
Notes on fields
- auto: auto-TTS mode (off, always, inbound, tagged).
  - inbound only sends audio after an inbound voice message.
  - tagged only sends audio when the reply includes [[tts]] tags.
- enabled: legacy toggle (doctor migrates this to auto).
- mode: "final" (default) or "all" (includes tool/block replies).
- provider: speech provider id such as "elevenlabs", "microsoft", or "openai" (fallback is automatic).
  - If provider is unset, OpenClaw uses the first configured speech provider in registry auto-select order.
  - Legacy provider: "edge" still works and is normalized to microsoft.
- summaryModel: optional cheap model for auto-summary; defaults to agents.defaults.model.primary.
  - Accepts provider/model or a configured model alias.
- modelOverrides: allow the model to emit TTS directives (on by default).
  - allowProvider defaults to false (provider switching is opt-in).
- providers.<id>: provider-owned settings keyed by speech provider id.
  - Legacy direct provider blocks (messages.tts.openai, messages.tts.elevenlabs, messages.tts.microsoft, messages.tts.edge) are auto-migrated to messages.tts.providers.<id> on load.
- maxTextLength: hard cap for TTS input (chars). /tts audio fails if exceeded.
- timeoutMs: request timeout (ms).
- prefsPath: override the local prefs JSON path (provider/limit/summary).
- apiKey values fall back to env vars (ELEVENLABS_API_KEY/XI_API_KEY, OPENAI_API_KEY).
- providers.elevenlabs.baseUrl: override ElevenLabs API base URL.
- providers.openai.baseUrl: override the OpenAI TTS endpoint.
  - Resolution order: messages.tts.providers.openai.baseUrl -> OPENAI_TTS_BASE_URL -> https://api.openai.com/v1
  - Non-default values are treated as OpenAI-compatible TTS endpoints, so custom model and voice names are accepted.
- providers.elevenlabs.voiceSettings:
  - stability, similarityBoost, style: 0..1
  - useSpeakerBoost: true|false
  - speed: 0.5..2.0 (1.0 = normal)
- providers.elevenlabs.applyTextNormalization: auto|on|off
- providers.elevenlabs.languageCode: 2-letter ISO 639-1 (e.g. en, de)
- providers.elevenlabs.seed: integer 0..4294967295 (best-effort determinism)
- providers.microsoft.enabled: allow Microsoft speech usage (default true; no API key).
- providers.microsoft.voice: Microsoft neural voice name (e.g. en-US-MichelleNeural).
- providers.microsoft.lang: language code (e.g. en-US).
- providers.microsoft.outputFormat: Microsoft output format (e.g. audio-24khz-48kbitrate-mono-mp3).
  - See Microsoft Speech output formats for valid values; not all formats are supported by the bundled Edge-backed transport.
- providers.microsoft.rate / providers.microsoft.pitch / providers.microsoft.volume: percent strings (e.g. +10%, -5%).
- providers.microsoft.saveSubtitles: write JSON subtitles alongside the audio file.
- providers.microsoft.proxy: proxy URL for Microsoft speech requests.
- providers.microsoft.timeoutMs: request timeout override (ms).
- edge.*: legacy alias for the same Microsoft settings.
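The OpenAI base URL resolution order listed above (config value, then OPENAI_TTS_BASE_URL, then the public default) can be sketched in Python as follows. The function names and the trailing-slash handling are assumptions made for this example; OpenClaw's own resolution code may differ in detail.

```python
import os
from typing import Any, Mapping, Optional

DEFAULT_OPENAI_BASE_URL = "https://api.openai.com/v1"


def resolve_openai_tts_base_url(
    config: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Apply the documented order: config -> env var -> default."""
    env = os.environ if env is None else env
    providers = config.get("messages", {}).get("tts", {}).get("providers", {})
    from_config = providers.get("openai", {}).get("baseUrl")
    return from_config or env.get("OPENAI_TTS_BASE_URL") or DEFAULT_OPENAI_BASE_URL


def is_custom_endpoint(base_url: str) -> bool:
    """Non-default URLs are treated as OpenAI-compatible endpoints.

    Ignoring a trailing slash is an assumption of this sketch.
    """
    return base_url.rstrip("/") != DEFAULT_OPENAI_BASE_URL


cfg = {"messages": {"tts": {"providers": {"openai": {}}}}}
url = resolve_openai_tts_base_url(cfg, env={"OPENAI_TTS_BASE_URL": "http://localhost:8880/v1"})
print(url, is_custom_endpoint(url))
```

Passing `env` explicitly keeps the sketch easy to exercise without touching the real process environment.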
Model-driven overrides (default on)
By default, the model can emit TTS directives for a single reply. When messages.tts.auto is tagged, these directives are required to trigger audio.
When enabled, the model can emit [[tts:...]] directives to override the voice
for a single reply, plus an optional [[tts:text]]...[[/tts:text]] block to
provide expressive tags (laughter, singing cues, etc) that should only appear in
the audio.
provider=... directives are ignored unless modelOverrides.allowProvider: true.
Example reply payload:
Available directive keys:
- provider (registered speech provider id, for example openai, elevenlabs, or microsoft; requires allowProvider: true)
- voice (OpenAI voice) or voiceId (ElevenLabs)
- model (OpenAI TTS model or ElevenLabs model id)
- stability, similarityBoost, style, speed, useSpeakerBoost
- applyTextNormalization (auto|on|off)
- languageCode (ISO 639-1)
- seed
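To make the directive handling concrete, here is a rough Python sketch of how a reply carrying such directives could be split into visible text, overrides, and audio-only text. The exact wire syntax is an assumption: this sketch supposes whitespace-separated key=value pairs inside [[tts:...]], and the voiceId in the sample is a made-up placeholder. The function name is hypothetical as well.

```python
import re
from typing import Dict, Optional, Tuple

# [[tts:text]]...[[/tts:text]] carries text meant only for the audio.
TEXT_BLOCK = re.compile(r"\[\[tts:text\]\](.*?)\[\[/tts:text\]\]", re.DOTALL)
# [[tts:key=value key=value]] carries per-reply overrides (assumed syntax).
DIRECTIVE = re.compile(r"\[\[tts:([^\]]*)\]\]")


def parse_tts_directives(
    reply: str, allow_provider: bool = False
) -> Tuple[str, Dict[str, str], Optional[str]]:
    """Return (visible_text, overrides, audio_only_text)."""
    match = TEXT_BLOCK.search(reply)
    audio_text = match.group(1).strip() if match else None
    visible = TEXT_BLOCK.sub("", reply)

    overrides: Dict[str, str] = {}
    for body in DIRECTIVE.findall(visible):
        for pair in body.split():
            key, sep, value = pair.partition("=")
            if not sep:
                continue  # not a key=value pair
            if key == "provider" and not allow_provider:
                continue  # provider switching is opt-in
            overrides[key] = value  # values are kept as raw strings

    visible = DIRECTIVE.sub("", visible).strip()
    return visible, overrides, audio_text


sample = (
    "Here you go.\n\n"
    "[[tts:voiceId=abc123 speed=1.1 provider=openai]]\n"
    "[[tts:text]](laughs) Read it once more.[[/tts:text]]"
)
print(parse_tts_directives(sample))
```

Because `allow_provider` defaults to false here, the `provider=openai` pair in the sample is dropped, mirroring the rule that provider directives are ignored unless modelOverrides.allowProvider is true.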
Per-user preferences
Slash commands write local overrides to prefsPath (default:
~/.openclaw/settings/tts.json, override with OPENCLAW_TTS_PREFS or
messages.tts.prefsPath).
Stored fields:
- enabled
- provider
- maxLength (summary threshold; default 1500 chars)
- summarize (default true)
These override messages.tts.* for that host.
Output formats (fixed)
- Feishu / Matrix / Telegram / WhatsApp: Opus voice message (opus_48000_64 from ElevenLabs, opus from OpenAI).
  - 48kHz / 64kbps is a good voice message tradeoff.
- Other channels: MP3 (mp3_44100_128 from ElevenLabs, mp3 from OpenAI).
  - 44.1kHz / 128kbps is the default balance for speech clarity.
- Microsoft: uses microsoft.outputFormat (default audio-24khz-48kbitrate-mono-mp3).
  - The bundled transport accepts an outputFormat, but not all formats are available from the service.
  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
  - Telegram sendVoice accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need guaranteed Opus voice messages.
  - If the configured Microsoft output format fails, OpenClaw retries with MP3.
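The channel-to-format mapping above can be summarized in a small Python lookup. The format strings are taken from this section; the lowercase channel ids and the function name are assumptions made for the example, and the Microsoft MP3 retry is not modeled.

```python
from typing import Optional

# Channels that receive an Opus voice message instead of an MP3 file.
VOICE_MESSAGE_CHANNELS = {"feishu", "matrix", "telegram", "whatsapp"}

# Fixed formats per provider, as listed above.
FIXED_FORMATS = {
    "elevenlabs": {"voice": "opus_48000_64", "file": "mp3_44100_128"},
    "openai": {"voice": "opus", "file": "mp3"},
}

DEFAULT_MICROSOFT_FORMAT = "audio-24khz-48kbitrate-mono-mp3"


def pick_output_format(
    provider: str,
    channel: str,
    microsoft_output_format: Optional[str] = None,
) -> str:
    """Return the output format string for a provider/channel pair."""
    if provider == "microsoft":
        # Microsoft uses its configured outputFormat regardless of channel.
        return microsoft_output_format or DEFAULT_MICROSOFT_FORMAT
    kind = "voice" if channel.lower() in VOICE_MESSAGE_CHANNELS else "file"
    return FIXED_FORMATS[provider][kind]


print(pick_output_format("elevenlabs", "telegram"))
print(pick_output_format("openai", "discord"))
print(pick_output_format("microsoft", "telegram"))
```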
Auto-TTS behavior
When enabled, OpenClaw:
- skips TTS if the reply already contains media or a MEDIA: directive.
- skips very short replies (< 10 chars).
- summarizes long replies when enabled using agents.defaults.model.primary (or summaryModel).
- attaches the generated audio to the reply.
If the reply exceeds maxLength and summary is off (or no API key for the
summary model), audio
is skipped and the normal text reply is sent.
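The rules above amount to a small decision procedure, sketched here in Python. The thresholds (10 chars, 1500-char default maxLength) come from this page; the Reply type, the substring check for MEDIA:, and the returned labels are assumptions invented for the sketch.

```python
from dataclasses import dataclass

MIN_REPLY_CHARS = 10       # replies shorter than this are skipped
DEFAULT_MAX_LENGTH = 1500  # default summary threshold (chars)


@dataclass
class Reply:
    text: str
    has_media: bool = False


def plan_auto_tts(
    reply: Reply,
    max_length: int = DEFAULT_MAX_LENGTH,
    summarize: bool = True,
    summary_model_authenticated: bool = True,
) -> str:
    """Return 'skip', 'speak', or 'summarize_then_speak'."""
    # Replies that already carry media (or a MEDIA: directive) get no audio.
    if reply.has_media or "MEDIA:" in reply.text:
        return "skip"
    # Very short replies are not worth synthesizing.
    if len(reply.text) < MIN_REPLY_CHARS:
        return "skip"
    # Long replies need a summary; without one, only text is sent.
    if len(reply.text) > max_length:
        if summarize and summary_model_authenticated:
            return "summarize_then_speak"
        return "skip"
    return "speak"


print(plan_auto_tts(Reply("ok")))
print(plan_auto_tts(Reply("A normal sentence.")))
print(plan_auto_tts(Reply("x" * 2000), summarize=False))
```

In every "skip" case the normal text reply is still delivered; only the audio attachment is omitted.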
Flow diagram
Slash command usage
There is a single command: /tts.
See Slash commands for enablement details.
Discord note: /tts is a built-in Discord command, so OpenClaw registers
/voice as the native command there. Text /tts ... still works.
- Commands require an authorized sender (allowlist/owner rules still apply).
- commands.text or native command registration must be enabled.
- off|always|inbound|tagged are per-session toggles (/tts on is an alias for /tts always).
- limit and summary are stored in local prefs, not the main config.
- /tts audio generates a one-off audio reply (does not toggle TTS on).
- /tts status includes fallback visibility for the latest attempt:
  - success fallback: Fallback: <primary> -> <used> plus Attempts: ...
  - failure: Error: ... plus Attempts: ...
  - detailed diagnostics: Attempt details: provider:outcome(reasonCode) latency
- OpenAI and ElevenLabs API failures now include parsed provider error detail and request id (when returned by the provider), which is surfaced in TTS errors/logs.
Agent tool
The tts tool converts text to speech and returns an audio attachment for
reply delivery. When the channel is Feishu, Matrix, Telegram, or WhatsApp,
the audio is delivered as a voice message rather than a file attachment.
Gateway RPC
Gateway methods:
- tts.status
- tts.enable
- tts.disable
- tts.convert
- tts.setProvider
- tts.providers