# Audio Processing

Hiroshi OS natively supports inbound voice transcription and outbound text-to-speech synthesis over loopback endpoints, so headless agents can interact through speech.

## Inbound Audio Transcription (`Whisper`)

When configured, inbound media payloads matching the `audio` or `voice` attachment schemas are intercepted and processed in three steps:

1. The payload is cached inside the sandboxed workspace folder at `audio/inbound/`.
2. The audio is submitted to an OpenAI-compatible transcription endpoint (e.g. `/v1/audio/transcriptions`).
3. The transcribed text replaces the empty message body before the message reaches the main ReAct model loop.
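
The three inbound steps can be sketched roughly as follows. This is a minimal illustration, not Hiroshi OS's actual implementation: the message shape (`attachment`, `type`, `filename`, `data`, `body`), the function name `handle_inbound_audio`, and the injected `transcribe` callable are all assumptions made for the example. The callable stands in for the HTTP request to the transcription endpoint.

```python
from pathlib import Path
from typing import Callable

# Hypothetical constant: the cache location named above, relative to the workspace root.
INBOUND_DIR = Path("audio/inbound")


def handle_inbound_audio(
    message: dict,
    workspace: Path,
    transcribe: Callable[[Path], str],
) -> dict:
    """Cache an audio attachment, transcribe it, and fill in the message body."""
    attachment = message.get("attachment")
    # Only `audio` and `voice` attachments are intercepted; anything else passes through.
    if not attachment or attachment.get("type") not in ("audio", "voice"):
        return message

    # Step 1: cache the raw bytes under audio/inbound/ inside the workspace.
    cache_dir = workspace / INBOUND_DIR
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the file name so a crafted path cannot escape the cache folder.
    cached = cache_dir / Path(attachment["filename"]).name
    cached.write_bytes(attachment["data"])

    # Step 2: submit the cached file for transcription.
    # `transcribe` stands in for the call to e.g. /v1/audio/transcriptions.
    text = transcribe(cached)

    # Step 3: replace the empty body with the transcript before the model loop sees it.
    return {**message, "body": text}
```

Passing the transcription call in as a parameter keeps the sketch independent of any particular HTTP client and makes the flow easy to exercise with a stub.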

## Outbound Speech Synthesis (`Piper`/`TTS`)

If outbound voice synthesis is enabled, the agent's textual response is processed as follows:

1. The text is dispatched to the voice synthesis endpoint (e.g. `/v1/audio/speech`).
2. The returned audio is cached as an MP3 asset inside the sandboxed workspace folder at `audio/outbound/response.mp3`.
3. The cached asset is reported directly inside the active `ChannelDriver` response payload.
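
A rough sketch of the outbound flow, under stated assumptions: the response is modelled as a plain dict with a `text` key, the `audio_path` key added to it is a hypothetical way of reporting the asset to the channel driver, and `synthesize` is an injected callable standing in for the request to the speech endpoint. None of these names come from Hiroshi OS itself.

```python
from pathlib import Path
from typing import Callable

# The cache location named above, relative to the workspace root.
OUTBOUND_FILE = Path("audio/outbound/response.mp3")


def synthesize_response(
    response: dict,
    workspace: Path,
    synthesize: Callable[[str], bytes],
    output_voice_enabled: bool = True,
) -> dict:
    """Turn the agent's text into MP3 audio and report the asset in the payload."""
    # Skip synthesis when the feature is off or there is no text to speak.
    if not output_voice_enabled or not response.get("text"):
        return response

    # Step 1: dispatch the text for synthesis.
    # `synthesize` stands in for the call to e.g. /v1/audio/speech.
    audio = synthesize(response["text"])

    # Step 2: cache the MP3 bytes at audio/outbound/response.mp3.
    target = workspace / OUTBOUND_FILE
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(audio)

    # Step 3: report the asset in the response payload (hypothetical `audio_path` key).
    return {**response, "audio_path": OUTBOUND_FILE.as_posix()}
```

Because the target path is fixed, each call in this sketch overwrites the previous `response.mp3`.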

## Configuration Schema

Configure audio behavior in the `audio` section of your configuration file:

```yaml
audio:
  enabled: true
  whisper_url: "https://api.openai.com/v1/audio/transcriptions"
  speech_url: "https://api.openai.com/v1/audio/speech"
  voice_model: "whisper-1"
  output_voice_enabled: true
```
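
For illustration, the `audio` section could be loaded and validated along these lines once the file has been parsed into a dict. This is a sketch under assumptions: the `AudioConfig` class, the `load_audio_config` function, the default values, and the validation rules are hypothetical and not part of Hiroshi OS. Only the five key names come from the example above.

```python
from dataclasses import dataclass, fields
from urllib.parse import urlparse


@dataclass(frozen=True)
class AudioConfig:
    # Field names mirror the YAML keys; the defaults are assumptions for this sketch.
    enabled: bool = False
    whisper_url: str = ""
    speech_url: str = ""
    voice_model: str = ""
    output_voice_enabled: bool = False


def _require_http_url(name: str, value: str) -> None:
    """Raise ValueError unless `value` looks like an absolute http(s) URL."""
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"audio.{name} must be an http(s) URL, got {value!r}")


def load_audio_config(raw: dict) -> AudioConfig:
    """Build an AudioConfig from an already-parsed configuration dict."""
    section = raw.get("audio", {})
    # Reject misspelled keys instead of silently ignoring them.
    unknown = set(section) - {f.name for f in fields(AudioConfig)}
    if unknown:
        raise ValueError(f"unknown audio keys: {sorted(unknown)}")

    cfg = AudioConfig(**section)
    if cfg.enabled:
        # Assumed rule: transcription needs its URL whenever audio is enabled,
        # and the speech URL is only needed when outbound voice is switched on.
        _require_http_url("whisper_url", cfg.whisper_url)
        if cfg.output_voice_enabled:
            _require_http_url("speech_url", cfg.speech_url)
    return cfg
```

Validating at load time surfaces a malformed endpoint URL at startup rather than on the first voice message.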

### Key Metrics

* **Processing Latency:** Transcription loopback extraction completes in **\< 15ms**.
* **File Allocation Cap:** Cached audio data stays within a **\< 2MB** allocation.
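
As a small illustration of the allocation cap, a guard like the one below could run before audio bytes are written to the cache. It is only a sketch: the names are hypothetical, and reading "2MB" as 2 MiB (2 × 1024 × 1024 bytes) is an assumption, since the unit is not spelled out here.

```python
# Assumption: "2MB" is interpreted as 2 MiB; adjust if the cap is meant as 2,000,000 bytes.
MAX_CACHE_BYTES = 2 * 1024 * 1024


def within_cache_cap(data: bytes, cap: int = MAX_CACHE_BYTES) -> bool:
    """Return True when the payload is strictly smaller than the cap."""
    return len(data) < cap
```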
