DesktopVisionEngine) that lets multi-modal agent personae interact with active displays and input devices.
🖼️ 1. Frame Buffer Capture
The desktop_screenshot tool takes a snapshot of the primary display and encodes it in memory as a .png image. In isolated sandbox environments, where no real display is available, the engine returns a simulated offline mock buffer instead.
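The sandbox fallback described above can be sketched as follows. This is a minimal illustration, not the engine's actual code: make_mock_png, capture_screenshot, and the grab callback are hypothetical names, and the mock is assumed to be a small solid-colour PNG built with the standard library only.

```python
import struct
import zlib


def _png_chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, then CRC-32 over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def make_mock_png(width: int = 4, height: int = 4, rgb=(128, 128, 128)) -> bytes:
    """Return a valid solid-colour PNG, usable as an offline mock buffer (hypothetical helper)."""
    # IHDR: width, height, bit depth 8, colour type 2 (RGB),
    # compression 0, filter 0, interlace 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Every scanline starts with filter byte 0 (None), followed by RGB triples.
    row = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(row * height)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", ihdr)
        + _png_chunk(b"IDAT", idat)
        + _png_chunk(b"IEND", b"")
    )


def capture_screenshot(sandboxed: bool, grab=None) -> bytes:
    """Return PNG bytes from `grab`, or a mock buffer when sandboxed or no grabber is given.

    `grab` is an assumed zero-argument callable that returns PNG bytes
    from the real display; it stands in for the platform capture backend.
    """
    if sandboxed or grab is None:
        return make_mock_png()
    return grab()
```

Returning a well-formed PNG from the mock path means downstream code that decodes the image does not need a separate sandbox branch.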
🖱️ 2. Absolute Screen Coordinates Click
- Interactive Clicking: Using coordinate scaling grids (X, Y) mapped from vision observations, the agent executes programmatic cursor positioning via desktop_click.
- Keyboard Ingestion: The desktop_type command maps textual sequences into raw keystrokes directly on the active operating system container interface.
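As a rough sketch of the two steps above: a point seen in a (possibly downscaled) observation image is rescaled to absolute screen pixels before clicking, and a string is expanded into per-character key events before typing. The function names, the linear scaling, the clamping, and the (action, character) event format are all assumptions made for illustration; the source does not specify how desktop_click or desktop_type work internally.

```python
def scale_to_screen(x, y, observed_size, screen_size):
    """Map a point from observation-image pixels to absolute screen pixels.

    observed_size and screen_size are (width, height) tuples. Linear
    scaling is an assumption; the real engine may use a different grid.
    """
    obs_w, obs_h = observed_size
    scr_w, scr_h = screen_size
    if obs_w <= 0 or obs_h <= 0:
        raise ValueError("observed_size must be positive")
    sx = round(x * scr_w / obs_w)
    sy = round(y * scr_h / obs_h)
    # Clamp so the result is always a valid pixel on the screen.
    return (min(max(sx, 0), scr_w - 1), min(max(sy, 0), scr_h - 1))


def text_to_key_events(text):
    """Expand text into a flat list of (action, character) key events.

    The ("press", ch) / ("release", ch) tuple format is hypothetical.
    """
    events = []
    for ch in text:
        events.append(("press", ch))
        events.append(("release", ch))
    return events
```

For example, a model that observes a 1280x720 downscale of a 2560x1440 display would call scale_to_screen(640, 360, (1280, 720), (2560, 1440)) to obtain the absolute click target.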
⚙️ 3. Cross-Platform OS Requirements
- macOS: Requires enabling Accessibility and Screen Recording permissions for the parent terminal or IDE hosting the Hiroshi binary.
- Linux: Requires an active display server context. For X11, ensure the DISPLAY environment variable is set; for Wayland, use a compatible PipeWire screenshot portal.
- Windows: Operates natively via the standard Win32 API window hooks.
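The requirements above could be checked before the engine starts with a small preflight routine like the one below. This is a sketch under assumptions: display_preflight is a hypothetical helper, not part of the engine, and using WAYLAND_DISPLAY to detect a Wayland session is a common convention rather than something the source states. macOS permissions cannot be read from the environment, so that branch only returns a reminder.

```python
import os
import platform


def display_preflight(system=None, env=None):
    """Return (ok, message) describing whether a display context looks usable.

    `system` defaults to platform.system(); `env` defaults to os.environ.
    Both are injectable so the check can be exercised without a real display.
    """
    system = system or platform.system()
    env = os.environ if env is None else env

    if system == "Linux":
        # Wayland sessions often set DISPLAY too (XWayland), so check Wayland first.
        if env.get("WAYLAND_DISPLAY"):
            return (True, "Wayland session detected; screenshots need a compatible portal.")
        if env.get("DISPLAY"):
            return (True, "X11 display detected: " + env["DISPLAY"])
        return (False, "No display server detected: set DISPLAY (X11) or run under Wayland.")
    if system == "Darwin":
        return (True, "Grant Accessibility and Screen Recording to the hosting terminal or IDE.")
    if system == "Windows":
        return (True, "No extra setup expected.")
    return (False, "Unsupported platform: " + system)
```

Running this check early lets a headless Linux host fail with a clear message instead of an opaque capture error.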