
# Codex Computer Use & Vision

> Screenshot capture, coordinate-mapped mouse clicks, and OS-level keyboard input.

Hiroshi features a modular vision engine (`DesktopVisionEngine`) that lets multi-modal agent personae interact with active displays and input devices.

### 🖼️ 1. Frame Buffer Capture

The `desktop_screenshot` tool takes a snapshot of the primary display and encodes it in memory as a PNG image. In isolated sandbox environments, where no real display is available, the engine returns a simulated mock buffer instead.
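
The sandbox fallback can be pictured with a minimal sketch. It is not Hiroshi's implementation: `mock_png` and `capture_or_mock` are hypothetical names, and the solid-colour image is only an assumption about what a mock buffer might contain. It shows how a valid PNG can be built entirely in memory using the Python standard library.

```python
import struct
import zlib


def _png_chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, then CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def mock_png(width: int, height: int, rgb=(32, 32, 32)) -> bytes:
    """Return a solid-colour RGB PNG encoded entirely in memory."""
    # IHDR: width, height, bit depth 8, colour type 2 (RGB),
    # then default compression, filter, and interlace methods.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Every scanline starts with filter byte 0 (None), followed by RGB triples.
    row = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(row * height)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", ihdr)
        + _png_chunk(b"IDAT", idat)
        + _png_chunk(b"IEND", b"")
    )


def capture_or_mock(sandboxed: bool, capture=None, size=(64, 48)) -> bytes:
    """Hypothetical wrapper: call a real capture function unless sandboxed."""
    if sandboxed or capture is None:
        return mock_png(*size)
    return capture()
```

Passing the capture function in as a parameter keeps the fallback logic testable without a display attached.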

### 🖱️ 2. Absolute-Coordinate Clicks and Keyboard Input

* **Interactive Clicking:** Using `(X, Y)` coordinates scaled from vision observations onto the screen grid, the agent positions the cursor and clicks programmatically via `desktop_click`.
* **Keyboard Input:** The `desktop_type` command converts a text sequence into raw keystrokes delivered directly to the active operating system interface.
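
The coordinate scaling step can be sketched as follows, assuming the model observes a screenshot whose pixel size may differ from the physical display. The function name `scale_to_screen` and its parameters are illustrative and are not part of the Hiroshi API.

```python
def scale_to_screen(x, y, image_size, screen_size):
    """Map a point from screenshot pixel space to absolute screen coordinates."""
    img_w, img_h = image_size
    scr_w, scr_h = screen_size
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image_size must be positive")
    # Scale each axis independently by the screen-to-image ratio.
    sx = int(x * scr_w / img_w)
    sy = int(y * scr_h / img_h)
    # Clamp so the result never falls outside the display bounds.
    return (min(max(sx, 0), scr_w - 1), min(max(sy, 0), scr_h - 1))
```

For example, a point at `(640, 360)` in a 1280x720 screenshot maps to `(1280, 720)` on a 2560x1440 display; the resulting pair is what would then be handed to `desktop_click`.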

### ⚙️ 3. Cross-Platform OS Requirements

* **macOS:** Requires enabling Accessibility and Screen Recording permissions for the parent terminal or IDE hosting the Hiroshi binary.
* **Linux:** Requires an active display server. For **X11**, ensure the `DISPLAY` environment variable is set; for **Wayland**, use a compatible PipeWire-based screenshot portal.
* **Windows:** Operates natively via the standard Win32 API window hooks.
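
A rough preflight check for the requirements above might look like this. It is a sketch under stated assumptions: `display_preflight` is a hypothetical helper, not something Hiroshi ships, and it relies on the standard `DISPLAY` and `WAYLAND_DISPLAY` environment variables to detect a Linux display server. macOS permissions cannot be read from the environment, so the sketch only emits a reminder.

```python
import os
import sys


def display_preflight(platform, env):
    """Return a list of notes to review; an empty list means nothing was flagged."""
    notes = []
    if platform.startswith("linux"):
        has_x11 = bool(env.get("DISPLAY"))
        has_wayland = bool(env.get("WAYLAND_DISPLAY"))
        if not (has_x11 or has_wayland):
            notes.append(
                "no display server detected: set DISPLAY (X11) "
                "or run inside a Wayland session"
            )
    elif platform == "darwin":
        # Permission state is not exposed through environment variables.
        notes.append(
            "confirm Accessibility and Screen Recording permissions "
            "for the host terminal or IDE"
        )
    # Other platforms (for example win32) are not checked in this sketch.
    return notes


if __name__ == "__main__":
    for note in display_preflight(sys.platform, os.environ):
        print("preflight:", note)
```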
