Audio Processing

Hiroshi OS natively supports speech-to-text transcription of inbound audio and text-to-speech synthesis of outbound responses over loopback endpoints, so headless agents can interact through speech.

Inbound Audio Transcription (Whisper)

When audio is configured, inbound media payloads that match the audio or voice attachment schemas are intercepted and processed in three steps:
  1. The payload is cached inside the sandboxed workspace folder at audio/inbound/.
  2. The audio is submitted to an OpenAI-compatible transcription endpoint (e.g. /v1/audio/transcriptions).
  3. The transcribed text replaces the empty message body before the message reaches the main ReAct model loop.
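
The three inbound steps can be sketched in Python as follows. This is a minimal illustration, not the Hiroshi OS implementation: the function name `ingest_audio` and its parameters are hypothetical, and the HTTP call to the transcription endpoint is abstracted behind a `transcribe` callable so the flow can be followed without a network connection.

```python
from pathlib import Path
from typing import Callable


def ingest_audio(
    workspace: Path,
    filename: str,
    payload: bytes,
    transcribe: Callable[[Path], str],
) -> str:
    """Cache an inbound audio payload and return the text to use as the message body.

    `transcribe` is a hypothetical stand-in for the request to the
    OpenAI-compatible transcription endpoint; it receives the cached file path.
    """
    # Step 1: cache the payload under audio/inbound/ in the sandboxed workspace.
    inbound_dir = workspace / "audio" / "inbound"
    inbound_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so the file cannot escape the folder.
    cached = inbound_dir / Path(filename).name
    cached.write_bytes(payload)

    # Step 2: submit the cached audio for transcription.
    transcript = transcribe(cached)

    # Step 3: the transcript becomes the message body passed to the model loop.
    return transcript
```

In a real deployment the `transcribe` argument would wrap a multipart POST to the configured transcription URL; here any function that maps a file path to a string will do.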

Outbound Speech Synthesis (Piper/TTS)

If outbound voice synthesis is enabled, the agent’s textual response is processed in three steps:
  1. The text is dispatched to the voice synthesis endpoint (e.g. /v1/audio/speech).
  2. The returned audio is cached as an MP3 asset inside the sandboxed workspace folder at audio/outbound/response.mp3.
  3. The cached asset is announced directly inside the active ChannelDriver response payload.
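
A comparable sketch of the outbound path is shown below. The function name `deliver_spoken_reply`, the `synthesize` callable, and the shape of the returned dictionary are all assumptions made for illustration; the actual ChannelDriver payload format is not described here.

```python
from pathlib import Path
from typing import Callable


def deliver_spoken_reply(
    workspace: Path,
    text: str,
    synthesize: Callable[[str], bytes],
) -> dict:
    """Synthesize `text`, cache the MP3, and build a response payload.

    `synthesize` is a hypothetical stand-in for the request to the voice
    synthesis endpoint; it returns the encoded MP3 bytes.
    """
    # Step 1: dispatch the agent's text to the synthesis endpoint.
    audio = synthesize(text)

    # Step 2: cache the result at audio/outbound/response.mp3.
    out_path = workspace / "audio" / "outbound" / "response.mp3"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(audio)

    # Step 3: reference the asset in the response payload.
    # The keys below are illustrative, not the real ChannelDriver schema.
    return {
        "text": text,
        "audio_path": out_path.relative_to(workspace).as_posix(),
    }
```

Because the output path is fixed, each new response overwrites the previous response.mp3 in this sketch.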

Configuration Schema

Configure audio behavior in the audio section of your configuration file:
audio:
  enabled: true
  whisper_url: "https://api.openai.com/v1/audio/transcriptions"
  speech_url: "https://api.openai.com/v1/audio/speech"
  voice_model: "whisper-1"
  output_voice_enabled: true
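
As an illustration of how the five keys above might be checked once the file has been parsed into a dictionary, here is a minimal Python sketch. The `AudioConfig` class and `parse_audio_config` function are hypothetical names, and the validation rules (all keys required, URLs must be http or https) are assumptions rather than documented behavior.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Expected keys and types, taken from the example configuration.
_EXPECTED = {
    "enabled": bool,
    "whisper_url": str,
    "speech_url": str,
    "voice_model": str,
    "output_voice_enabled": bool,
}


@dataclass(frozen=True)
class AudioConfig:
    enabled: bool
    whisper_url: str
    speech_url: str
    voice_model: str
    output_voice_enabled: bool


def parse_audio_config(section: dict) -> AudioConfig:
    """Validate the already-parsed `audio` section and return a typed config."""
    missing = _EXPECTED.keys() - section.keys()
    if missing:
        raise ValueError(f"missing audio keys: {sorted(missing)}")
    for key, expected_type in _EXPECTED.items():
        if not isinstance(section[key], expected_type):
            raise ValueError(f"audio.{key} must be of type {expected_type.__name__}")
    # Assumption: both endpoints must be absolute http(s) URLs.
    for key in ("whisper_url", "speech_url"):
        parsed = urlparse(section[key])
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"audio.{key} is not a valid http(s) URL")
    return AudioConfig(**{key: section[key] for key in _EXPECTED})
```

Failing early on a malformed section keeps configuration errors out of the request path, where they would otherwise surface as failed transcription or synthesis calls.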

Key Metrics

  • Processing Latency: Loopback transcription extraction completes in under 15 ms.
  • File Allocation Cap: Audio data caching stays within a block of less than 2 MB.
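
One way to respect the caching cap is to check a payload's size before writing it to the workspace. The sketch below is an assumption-laden example: the names `MAX_AUDIO_BYTES` and `fits_cache_cap` are hypothetical, and "2 MB" is interpreted as 2 × 1024 × 1024 bytes, which the metric above does not specify.

```python
# Assumption: the 2 MB cap is read as 2 MiB; the source does not say which unit applies.
MAX_AUDIO_BYTES = 2 * 1024 * 1024


def fits_cache_cap(payload: bytes, limit: int = MAX_AUDIO_BYTES) -> bool:
    """Return True when the payload is strictly smaller than the cap."""
    return len(payload) < limit
```

A caller could run this check before caching and reject or truncate audio that would exceed the block.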