Web Scraper Node

Hiroshi OS features a dynamic web scraper tool (web_scrape) enabling agents to retrieve, clean, and convert raw HTML bodies from arbitrary target URLs into structured, readable markdown blocks.

Execution tag format

Agents call the scraper via self-closing or explicit tag layouts:

<web_scrape url="https://example.com" />

Or:

<web_scrape>https://example.com</web_scrape>

Internal Workflow

Authentication Check: The tool checks for Firecrawl (firecrawl_api_key) and Exa (exa_api_key) API tokens inside the global configurations.
Third-Party Routing: If api keys are configured, it routes the payload to Firecrawl’s /v1/scrape or Exa’s /contents endpoints.
Resilient Local Fallback: If API keys are unconfigured, a fallback downloader retrieves raw HTML, strips script and style blocks, formats header elements (h1-h6), structures bold/strong tags (**), and compresses paragraph blocks into unified, clean markdown.

Configurations Schema

Configurations are stored in AppConfig under scraper:

scraper:
  enabled: true
  firecrawl_api_key: "fc-..."
  exa_api_key: "exa-..."

Latency Footprint: Fallback HTML parser completes string sanitization in < 3ms.
Memory Compression: Stripping formatting blocks compresses document footprint by ~65-80%.

​Web Scraper Node

​Execution tag format

​Internal Workflow

​Configurations Schema

Web Scraper Node

Execution tag format

Internal Workflow

Configurations Schema