
LLM Orchestration

BrowseGenius reuses and extends the Taxy action loop to run entire test cases rather than one-off commands. This page explains the prompt design, action grammar, and safeguards that keep the GPT-4 agent aligned.

Prompt anatomy

Each iteration sends a ChatCompletion request with:

  1. System message
    Defines the assistant as a browser automation agent and enumerates available actions from availableActions.ts.
  2. User message
    Includes:
    • Test case metadata (narrative, priority, expectations).
    • The simplified + templatized DOM snapshot.
    • Serialized history of prior actions in the run.
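The message layout above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `buildMessages`, `TestCaseMeta`, `ChatMessage`, and the exact wording of the prompt text are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the real project types may differ.
interface TestCaseMeta {
  narrative: string;
  priority: string;
  expectations: string[];
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Assemble the two messages sent on each iteration of the loop.
function buildMessages(
  actionDocs: string[], // one signature per available action
  testCase: TestCaseMeta,
  domSnapshot: string, // simplified + templatized DOM
  history: string[], // serialized prior actions in this run
): ChatMessage[] {
  const system = [
    "You are a browser automation agent.",
    "Available actions:",
    ...actionDocs.map((doc) => `- ${doc}`),
  ].join("\n");

  const user = [
    `Test case: ${testCase.narrative}`,
    `Priority: ${testCase.priority}`,
    `Expectations: ${testCase.expectations.join("; ")}`,
    "Prior actions:",
    history.length > 0 ? history.join("\n") : "(none)",
    "Current page:",
    domSnapshot,
  ].join("\n");

  return [
    { role: "system", content: system },
    { role: "user", content: user },
  ];
}
```

Keeping the action list in the system message and the per-iteration state in the user message means only the user message changes between iterations.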

Example snippet:

<Thought>I should click the Create Invoice button</Thought>
<Action>click(223)</Action>

The response must adhere to the <Thought></Thought><Action></Action> schema. determineNextAction enforces this grammar and appends a closing </Action> if needed to simplify parsing.
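A minimal sketch of this kind of schema check is shown below. The function name `parseResponse` and the error messages are hypothetical; only the tag names and the "append a closing tag if needed" behaviour come from the description above.

```typescript
interface ParsedResponse {
  thought: string;
  action: string;
}

// Extract the Thought and Action bodies from a raw model reply.
// Returns an error object instead of throwing so the caller can
// mark the test case as failed with a descriptive message.
function parseResponse(raw: string): ParsedResponse | { error: string } {
  let text = raw.trim();

  // If the reply opens an Action but never closes it, append the
  // closing tag so a single regular expression covers both cases.
  if (text.includes("<Action>") && !text.includes("</Action>")) {
    text += "</Action>";
  }

  const thought = text.match(/<Thought>([\s\S]*?)<\/Thought>/);
  const action = text.match(/<Action>([\s\S]*?)<\/Action>/);

  if (!thought) return { error: "Missing <Thought> block" };
  if (!action) return { error: "Missing <Action> block" };

  return { thought: thought[1].trim(), action: action[1].trim() };
}
```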

Action grammar

Current actions (see src/helpers/availableActions.ts):

  • click(elementId: number)
  • setValue(elementId: number, value: string)
  • finish()
  • fail()

Element IDs map back to DOM nodes via the annotated DOM generated in the content script. The orchestrator translates actions into Chrome DevTools protocol calls in domActions.ts.
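For illustration, the four actions can be modelled as a discriminated union and parsed from the text inside `<Action>`. This is a sketch under assumptions: the `Action` type and `parseAction` function are hypothetical names, and the string argument of `setValue` is assumed to arrive as a double-quoted, JSON-style literal.

```typescript
type Action =
  | { name: "click"; elementId: number }
  | { name: "setValue"; elementId: number; value: string }
  | { name: "finish" }
  | { name: "fail" };

// Parse a call such as `click(223)` or `setValue(12, "hello")`.
// Returns null for unknown actions or malformed arguments.
function parseAction(source: string): Action | null {
  const call = source.trim().match(/^(\w+)\(([\s\S]*)\)$/);
  if (!call) return null;
  const [, name, args] = call;

  switch (name) {
    case "click": {
      const m = args.match(/^\s*(\d+)\s*$/);
      return m ? { name: "click", elementId: Number(m[1]) } : null;
    }
    case "setValue": {
      // Numeric ID, comma, then a double-quoted string (assumption).
      const m = args.match(/^\s*(\d+)\s*,\s*("(?:[^"\\]|\\.)*")\s*$/);
      if (!m) return null;
      try {
        const value = JSON.parse(m[2]) as string;
        return { name: "setValue", elementId: Number(m[1]), value };
      } catch (_err) {
        return null; // invalid escape sequence in the string literal
      }
    }
    case "finish":
      return args.trim() === "" ? { name: "finish" } : null;
    case "fail":
      return args.trim() === "" ? { name: "fail" } : null;
    default:
      return null;
  }
}
```

Rejecting anything outside the grammar at parse time keeps the later DOM-action layer free of validation logic.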

Loop control

  • Max attempts: 40 iterations per test case to prevent infinite loops.
  • Retry window: determineNextAction will retry up to three times on server errors.
  • Wait heuristic: A fixed 1.5 s sleep between actions gives pages time to stabilise. A planned improvement is to wait on DOM mutation observers instead.
  • Debugger lifecycle: Attach before the suite and detach after completion, including on error paths.
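The iteration cap, retry window, and wait heuristic can be combined as in the sketch below. All names (`runCase`, `withRetry`, `StepResult`) are hypothetical. Two simplifications are assumptions of the sketch rather than documented behaviour: "up to three times" is read as three attempts in total, and any thrown error triggers a retry, not only server errors.

```typescript
const MAX_ATTEMPTS = 40; // iterations per test case
const MAX_RETRIES = 3; // attempts per model call (assumed: total, not extra)
const SETTLE_MS = 1500; // pause between actions

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Call fn until it succeeds or the attempt budget is used up.
async function withRetry<T>(
  fn: () => Promise<T>,
  retries: number = MAX_RETRIES,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

type StepResult = "continue" | "finish" | "fail";

// Run one test case: ask for the next step, stop on finish/fail,
// otherwise wait for the page to settle and iterate again.
async function runCase(
  step: () => Promise<StepResult>,
  settleMs: number = SETTLE_MS,
  maxAttempts: number = MAX_ATTEMPTS,
): Promise<"passed" | "failed"> {
  for (let i = 0; i < maxAttempts; i++) {
    const result = await withRetry(step);
    if (result === "finish") return "passed";
    if (result === "fail") return "failed";
    await sleep(settleMs);
  }
  return "failed"; // iteration budget exhausted
}
```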

Error handling

  • Parsing errors: Immediately mark the case failed with a descriptive message.
  • Model failure: If no response is returned, the suite records a failed state and stops.
  • Unexpected exceptions: The catch block in runGeneratedSuite logs the message, updates the report, and rethrows for visibility.
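The "log, update the report, rethrow" pattern, together with a debugger that is always detached, maps naturally onto `try`/`catch`/`finally`. The sketch below is illustrative only: `runSuiteSafely` and `SuiteReport` are hypothetical names, and the attach, run, and detach steps are passed in as callbacks so the example does not depend on any browser API.

```typescript
interface SuiteReport {
  status: "running" | "passed" | "failed";
  error?: string;
}

async function runSuiteSafely(
  attach: () => Promise<void>,
  runCases: () => Promise<void>,
  detach: () => Promise<void>,
  report: SuiteReport,
): Promise<void> {
  await attach();
  try {
    await runCases();
    report.status = "passed";
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    console.error(message); // log for visibility
    report.status = "failed"; // update the report
    report.error = message;
    throw err; // rethrow so callers also see the failure
  } finally {
    await detach(); // runs on both success and error paths
  }
}
```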

Recording system

First Run: Recording Mode

During the initial test execution, BrowseGenius operates in recording mode:

  1. Action Recorder captures every action with:

    • Multiple element selectors (ID, data-testid, name, CSS, XPath, aria-label, text, coordinates)
    • Input values and interaction details
    • Screenshots before and after actions
    • Execution method (vision vs DOM)
    • Success/failure status
  2. Network Monitor uses Chrome DevTools Protocol to capture:

    • All HTTP requests/responses
    • Request/response headers and bodies
    • Timing information
    • Association with specific actions
  3. JSON Export stores the complete recording:

    {
      "testCaseId": "tc_login",
      "actions": [
        {
          "actionType": "input",
          "target": {
            "id": "email",
            "css": "input[type='email']",
            "xpath": "//input[@id='email']",
            "coordinates": { "x": 450, "y": 300 }
          },
          "value": "user@example.com",
          "networkRequests": [...]
        }
      ]
    }
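As a rough guide to the shape of this file, the recording could be typed as below. Only `testCaseId`, `actions`, `actionType`, `target` (with `id`, `css`, `xpath`, `coordinates`), `value`, and `networkRequests` appear in the example above. Every other field name (`testId`, `name`, `ariaLabel`, `text`, and the whole of `RecordedRequest`) is an assumption, as is the `loadRecording` helper.

```typescript
interface ElementTarget {
  id?: string;
  testId?: string; // assumed name for the data-testid selector
  name?: string;
  css?: string;
  xpath?: string;
  ariaLabel?: string; // assumed name
  text?: string;
  coordinates?: { x: number; y: number };
}

// Assumed minimal shape; the real entries likely carry headers,
// bodies, and timing information as well.
interface RecordedRequest {
  url: string;
  method: string;
  status: number;
}

interface RecordedAction {
  actionType: string;
  target: ElementTarget;
  value?: string;
  networkRequests: RecordedRequest[];
}

interface Recording {
  testCaseId: string;
  actions: RecordedAction[];
}

// Parse a stored recording and check the top-level fields only.
function loadRecording(json: string): Recording {
  const data: unknown = JSON.parse(json);
  if (typeof data !== "object" || data === null) {
    throw new Error("Recording must be a JSON object");
  }
  const { testCaseId, actions } = data as Partial<Recording>;
  if (typeof testCaseId !== "string") throw new Error("Missing testCaseId");
  if (!Array.isArray(actions)) throw new Error("Missing actions array");
  return { testCaseId, actions };
}
```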

Subsequent Runs: Replay Mode

Replay mode uses stored JSON recordings for fast, deterministic execution:

  1. Selector Matching: Tries selectors in priority order

    • ID (most stable)
    • data-testid (test-specific)
    • CSS selector
    • XPath
    • Coordinates (vision fallback)
  2. Action Execution: Performs recorded actions without LLM calls

  3. Network Validation: Compares actual network requests against recording

    • URL and method matching
    • Status code validation
    • Response body comparison (optional)
  4. Differential Reporting: Highlights any deviations from the recorded baseline
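The selector fallback and the network comparison in the steps above can be sketched as two small functions. The names (`resolveTarget`, `diffRequests`, `Target`, `Req`) are hypothetical, the element lookup is injected as an `exists` callback so the example runs outside a browser, and response body comparison is left out.

```typescript
type SelectorKind = "id" | "data-testid" | "css" | "xpath";

interface Target {
  id?: string;
  testId?: string;
  css?: string;
  xpath?: string;
  coordinates?: { x: number; y: number };
}

type Resolved =
  | { via: SelectorKind; selector: string }
  | { via: "coordinates"; x: number; y: number };

// Try each recorded selector in priority order; fall back to
// coordinates only when no selector matches an element.
function resolveTarget(
  target: Target,
  exists: (kind: SelectorKind, selector: string) => boolean,
): Resolved | null {
  const candidates: Array<[SelectorKind, string | undefined]> = [
    ["id", target.id],
    ["data-testid", target.testId],
    ["css", target.css],
    ["xpath", target.xpath],
  ];
  for (const [via, selector] of candidates) {
    if (selector !== undefined && exists(via, selector)) {
      return { via, selector };
    }
  }
  if (target.coordinates) {
    return { via: "coordinates", ...target.coordinates };
  }
  return null;
}

interface Req {
  url: string;
  method: string;
  status: number;
}

// Compare observed requests with the baseline by URL and method,
// then by status code. Returns one line per deviation.
function diffRequests(recorded: Req[], actual: Req[]): string[] {
  const deviations: string[] = [];
  for (const expected of recorded) {
    const match = actual.find(
      (r) => r.url === expected.url && r.method === expected.method,
    );
    if (!match) {
      deviations.push(`missing ${expected.method} ${expected.url}`);
    } else if (match.status !== expected.status) {
      deviations.push(
        `${expected.method} ${expected.url}: expected status ` +
          `${expected.status}, got ${match.status}`,
      );
    }
  }
  return deviations;
}
```

An empty array from `diffRequests` means the run matched the baseline; any entries would feed the differential report.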

Execution Modes

Vision-Based (Primary)

  • Uses OpenAI's computer use model
  • Visually locates elements on screen
  • Works when DOM structure changes
  • Slower but more resilient

DOM-Based (Fallback)

  • Traditional selector-based automation
  • Fast and deterministic
  • Requires stable selectors
  • Preferred for replay

Enhancements roadmap

  • ✅ captureScreenshot action implemented
  • ✅ Network request monitoring via DevTools Protocol
  • ✅ Deterministic replays using recorded JSON
  • 🔄 Visual screenshot comparison for validation
  • 🔄 Smart selector healing (auto-find similar elements)
  • 🔄 Dynamic wait strategies based on network activity
  • 🔄 Shared execution context between test cases

Released under the MIT License.