
Web Scraper Node

Hiroshi OS provides a web scraper tool (web_scrape) that lets agents fetch the raw HTML of an arbitrary target URL, clean it, and convert it into structured, readable Markdown.

Execution tag format

Agents call the scraper with either a self-closing tag that carries the URL in a url attribute, or an explicit open/close tag pair that wraps the URL:
<web_scrape url="https://example.com" />
Or:
<web_scrape>https://example.com</web_scrape>
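
As an illustration of how the two layouts above might be recognized in agent output, here is a minimal sketch. The function name extract_scrape_urls and the regular expressions are assumptions made for this example; the actual Hiroshi OS parser is not shown in this document and may behave differently.

```python
import re
from typing import List

# Attribute form: <web_scrape url="..." /> (single or double quotes).
_ATTR_FORM = re.compile(r'<web_scrape\s+url=(["\'])(?P<url>.*?)\1\s*/>')
# Body form: <web_scrape>...</web_scrape>
_BODY_FORM = re.compile(r"<web_scrape>\s*(?P<url>.*?)\s*</web_scrape>", re.DOTALL)


def extract_scrape_urls(text: str) -> List[str]:
    """Return the URLs from both tag layouts, in order of appearance.

    Hypothetical helper: not part of the documented Hiroshi OS API.
    """
    matches = []
    for pattern in (_ATTR_FORM, _BODY_FORM):
        for m in pattern.finditer(text):
            matches.append((m.start(), m.group("url").strip()))
    # Sort by position so the result follows the order in the agent output.
    return [url for _, url in sorted(matches)]
```

Calling extract_scrape_urls on a message that contains both forms returns each URL once, in the order the tags appear.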

Internal Workflow

  1. Authentication Check: The tool looks for Firecrawl (firecrawl_api_key) and Exa (exa_api_key) API keys in the global configuration.
  2. Third-Party Routing: If an API key is configured, the tool routes the request to the matching service: Firecrawl’s /v1/scrape endpoint or Exa’s /contents endpoint.
  3. Resilient Local Fallback: If no API key is configured, a fallback downloader retrieves the raw HTML, strips script and style blocks, converts header elements (h1-h6) to Markdown headings, renders bold/strong tags as **, and compresses paragraph blocks into unified, clean Markdown.
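
The local fallback in step 3 can be pictured with the sketch below. It is a regex-based approximation written for this page, not the tool's actual parser: the function name html_to_markdown and every pattern in it are assumptions, and the download step is left out so the example stays self-contained.

```python
import html
import re


def html_to_markdown(raw_html: str) -> str:
    """Rough HTML-to-Markdown conversion (illustrative, not the real parser)."""
    # Drop <script> and <style> blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", raw_html)

    # Turn <h1>..<h6> into "#".."######" headings on their own lines.
    def _heading(m: "re.Match[str]") -> str:
        level = int(m.group(1))
        return "\n\n" + "#" * level + " " + m.group(2).strip() + "\n\n"

    text = re.sub(r"(?is)<h([1-6])\b[^>]*>(.*?)</h\1\s*>", _heading, text)

    # Render <b> and <strong> as **bold**.
    text = re.sub(r"(?is)<(b|strong)\b[^>]*>(.*?)</\1\s*>", r"**\2**", text)

    # Paragraph ends and line breaks become blank lines.
    text = re.sub(r"(?i)</p\s*>|<br\s*/?>", "\n\n", text)

    # Remove every remaining tag, then decode entities such as &amp;.
    text = re.sub(r"(?s)<[^>]+>", "", text)
    text = html.unescape(text)

    # Collapse runs of spaces and surplus blank lines.
    lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in text.splitlines()]
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return text.strip()
```

Regular expressions are enough for a sketch like this, but they mishandle nested or malformed markup; a production fallback would more likely rely on a real HTML parser.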

Configuration Schema

Scraper settings are stored in AppConfig under the scraper key:
scraper:
  enabled: true
  firecrawl_api_key: "fc-..."
  exa_api_key: "exa-..."
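
To show how the scraper block above could drive the routing decision, the following sketch models it as a small dataclass. ScraperConfig, from_mapping, and choose_backend are hypothetical names; treating enabled as true by default, returning "disabled" when it is false, and preferring Firecrawl over Exa when both keys are present are all assumptions, since this page does not specify them.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ScraperConfig:
    """Mirror of the scraper config block (hypothetical type)."""

    enabled: bool = True  # assumed default
    firecrawl_api_key: Optional[str] = None
    exa_api_key: Optional[str] = None

    @classmethod
    def from_mapping(cls, app_config: Mapping[str, Any]) -> "ScraperConfig":
        # Tolerate a missing or empty "scraper" section.
        section = app_config.get("scraper") or {}
        return cls(
            enabled=bool(section.get("enabled", True)),
            # Treat empty strings as "not configured".
            firecrawl_api_key=section.get("firecrawl_api_key") or None,
            exa_api_key=section.get("exa_api_key") or None,
        )


def choose_backend(cfg: ScraperConfig) -> str:
    """Pick a backend name; the Firecrawl-first order is an assumption."""
    if not cfg.enabled:
        return "disabled"
    if cfg.firecrawl_api_key:
        return "firecrawl"
    if cfg.exa_api_key:
        return "exa"
    return "local"
```

With no keys set, choose_backend returns "local", which corresponds to the local fallback path.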

  • Latency Footprint: The fallback HTML parser completes string sanitization in < 3 ms.
  • Memory Compression: Stripping formatting blocks reduces the document footprint by ~65-80%.