Multi-Channel Visual Media Understanding

Hiroshi OS includes a high-performance visual media ingestion pipeline. It processes image attachments that arrive through external chat gateways (Telegram, Discord, Slack, etc.) and converts them into base64 blocks that are passed to multimodal models.

Inbound Lifecycle Architecture

 [ Chat Message Event ] -> [ Inbound Ingestion Loop ] -> [ Temp Local Storage ] -> [ Multimodal Base64 Block ] -> [ Vision Model Stream ]
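The middle stages of the flow above (local storage, then base64 conversion) can be sketched roughly as follows. This is a minimal illustration, not the Hiroshi OS implementation: the function names `store_attachment`, `to_base64`, and `ingest` are hypothetical, and the gateway event handling and the model call are left out.

```python
import base64
import uuid
from pathlib import Path


def store_attachment(data: bytes, suffix: str, media_dir: Path) -> Path:
    """Write the raw attachment bytes to local storage under a unique name."""
    media_dir.mkdir(parents=True, exist_ok=True)
    path = media_dir / f"{uuid.uuid4().hex}{suffix}"
    path.write_bytes(data)
    return path


def to_base64(path: Path) -> str:
    """Read a stored file back and return its base64 text form."""
    return base64.b64encode(path.read_bytes()).decode("ascii")


def ingest(data: bytes, suffix: str, media_dir: Path) -> str:
    """Chain the two stages: store the bytes, then produce the base64 payload."""
    return to_base64(store_attachment(data, suffix, media_dir))
```

The returned string is what a provider-specific message builder would embed in the request sent to the vision model.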

Configuration & Parameter Bounds

Media settings are managed in AppConfig under the media key:
media:
  enabled: true
  max_file_size_bytes: 10485760 # 10 MiB (10 * 1024 * 1024), default limit
  allowed_mime_types:
    - image/png
    - image/jpeg
    - image/webp

  • Max image payload size: capped at 10 MiB (10,485,760 bytes) by max_file_size_bytes.
  • Storage directory: image files are cached under ~/.hiroshi/workspace/media/ with UUID/epoch-based file names.
  • Multimodal conversion footprint: base64 conversion buffers use under 4 MB of memory.
  • Ingestion latency: disk writes and byte parsing complete in under 8 ms.
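As a rough illustration of how these bounds might be enforced before a file is cached, the sketch below mirrors the media settings in a small dataclass. Everything beyond the three config keys is an assumption: `MediaConfig`, `check_attachment`, and `cache_name` are hypothetical names, the size limit is treated as inclusive, and the `<uuid>_<epoch>` file name pattern is only one possible reading of "UUID/epoch-based".

```python
import time
import uuid
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical extension lookup for the MIME types allowed by default.
EXTENSIONS = {"image/png": ".png", "image/jpeg": ".jpg", "image/webp": ".webp"}


@dataclass(frozen=True)
class MediaConfig:
    """Mirror of the media settings shown above (class name is an assumption)."""
    enabled: bool = True
    max_file_size_bytes: int = 10_485_760
    allowed_mime_types: Tuple[str, ...] = ("image/png", "image/jpeg", "image/webp")


def check_attachment(cfg: MediaConfig, mime_type: str, size_bytes: int) -> Optional[str]:
    """Return a rejection reason, or None when the attachment is acceptable."""
    if not cfg.enabled:
        return "media ingestion is disabled"
    if mime_type not in cfg.allowed_mime_types:
        return f"unsupported MIME type: {mime_type}"
    if size_bytes > cfg.max_file_size_bytes:  # assumption: the limit is inclusive
        return f"file too large: {size_bytes} > {cfg.max_file_size_bytes} bytes"
    return None


def cache_name(mime_type: str, now: Optional[float] = None) -> str:
    """Build an illustrative '<uuid>_<epoch><ext>' file name."""
    epoch = int(time.time() if now is None else now)
    return f"{uuid.uuid4().hex}_{epoch}{EXTENSIONS.get(mime_type, '')}"
```

Returning a reason string rather than raising keeps the check easy to log and lets the caller decide whether to skip the attachment or reject the whole message.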

Provider Integration

When media is enabled and a message carries image attachments, the content of the last user turn is rewritten to include the images as inline base64 data. The exact shape depends on the provider:

OpenAI Vision Block

{
  "role": "user",
  "content": [
    { "type": "text", "text": "Observe the attached snapshot." },
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,..." } }
  ]
}
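A block of this shape could be produced along the following lines. This is a sketch under assumptions: the helper names `image_part` and `attach_images_openai` are hypothetical, and the way the last user turn is located is a guess at the described behavior rather than the actual Hiroshi OS code.

```python
import base64
from typing import Any, Dict, List, Tuple


def image_part(image_bytes: bytes, mime_type: str) -> Dict[str, Any]:
    """Encode one image as an image_url content part holding a data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{b64}"}}


def attach_images_openai(
    messages: List[Dict[str, Any]],
    images: List[Tuple[bytes, str]],  # (raw bytes, MIME type) pairs
) -> List[Dict[str, Any]]:
    """Return a copy of messages with the images added to the last user turn."""
    out = [dict(m) for m in messages]
    for msg in reversed(out):
        if msg.get("role") != "user":
            continue
        content = msg.get("content")
        # Plain-string content becomes a single text part; arrays are extended.
        if isinstance(content, str):
            parts = [{"type": "text", "text": content}]
        else:
            parts = list(content or [])
        parts.extend(image_part(data, mime) for data, mime in images)
        msg["content"] = parts
        break
    return out
```

The input list is copied rather than mutated, so the original text-only history stays available if the request has to be retried without images.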

Ollama Multimodal Array

{
  "role": "user",
  "content": "Observe the attached snapshot.",
  "images": ["/9j/4AAQSkZJRgABAQEA..."]
}
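Unlike the data URL form, the Ollama images array carries bare base64 strings with no `data:` prefix. A minimal sketch of building such a message is shown below; the function name `ollama_message` is hypothetical.

```python
import base64
from typing import Any, Dict, List


def ollama_message(text: str, images: List[bytes]) -> Dict[str, Any]:
    """Build a user message whose images are bare base64 strings (no data URL prefix)."""
    return {
        "role": "user",
        "content": text,
        "images": [base64.b64encode(img).decode("ascii") for img in images],
    }
```

JPEG files begin with the bytes FF D8 FF, which is why their base64 form starts with `/9j/` as in the example above.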