LLM Orchestration

BrowseGenius reuses and extends the Taxy action loop to run entire test cases rather than one-off commands. This page explains the prompt design, action grammar, and safeguards that keep the GPT-4 agent aligned.

Prompt anatomy

Each iteration sends a ChatCompletion request with:

System message
Defines the assistant as a browser automation agent and enumerates available actions from availableActions.ts.
User message
Includes:
- Test case metadata (narrative, priority, expectations).
- The simplified + templatized DOM snapshot.
- Serialized history of prior actions in the run.
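
The assembly of these two messages can be sketched as follows. This is a minimal illustration rather than the project's actual code: `buildMessages`, `TestCaseMeta`, `PriorAction`, and the exact wording of the prompt are assumptions, and the action list is hard-coded here where the project derives it from availableActions.ts.

```typescript
// Hypothetical shapes; the real BrowseGenius types may differ.
interface TestCaseMeta {
  narrative: string;
  priority: string;
  expectations: string[];
}

interface PriorAction {
  thought: string;
  action: string;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Assumed rendering of the action list for the system message.
const ACTION_SIGNATURES: string[] = [
  "click(elementId: number)",
  "setValue(elementId: number, value: string)",
  "finish()",
  "fail()",
];

function buildMessages(
  testCase: TestCaseMeta,
  simplifiedDom: string,
  history: PriorAction[],
): ChatMessage[] {
  // System message: role definition plus the enumerated actions.
  const system = [
    "You are a browser automation agent.",
    "Available actions:",
    ...ACTION_SIGNATURES.map((sig) => `- ${sig}`),
  ].join("\n");

  // Prior steps are serialized in the same tag format the model must emit.
  const serializedHistory = history
    .map((h) => `<Thought>${h.thought}</Thought>\n<Action>${h.action}</Action>`)
    .join("\n\n");

  // User message: test case metadata, history, and the current DOM snapshot.
  const user = [
    `Test case: ${testCase.narrative}`,
    `Priority: ${testCase.priority}`,
    `Expectations: ${testCase.expectations.join("; ")}`,
    `Prior actions:\n${serializedHistory || "(none)"}`,
    `Current page:\n${simplifiedDom}`,
  ].join("\n\n");

  return [
    { role: "system", content: system },
    { role: "user", content: user },
  ];
}
```

Serializing the history in the response format itself keeps the prompt and the expected output visually consistent, which is one plausible reason to choose it.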

Example snippet:

<Thought>I should click the Create Invoice button</Thought>
<Action>click(223)</Action>

Every response must follow the <Thought></Thought><Action></Action> schema. determineNextAction enforces this grammar and, when the closing </Action> tag is missing, appends it so that parsing stays simple.
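
One way to picture this enforcement is the sketch below. It is not the body of determineNextAction: `parseResponse` and its error messages are hypothetical, and the regular expressions are one possible reading of the schema.

```typescript
interface ParsedResponse {
  thought: string;
  action: string;
}

interface ParseFailure {
  error: string;
}

// Hypothetical helper: extracts the thought and the action from a raw completion.
function parseResponse(raw: string): ParsedResponse | ParseFailure {
  let text = raw.trim();

  // Restore the closing tag when the completion stops right before it.
  if (text.includes("<Action>") && !text.includes("</Action>")) {
    text += "</Action>";
  }

  const thought = text.match(/<Thought>([\s\S]*?)<\/Thought>/);
  const action = text.match(/<Action>([\s\S]*?)<\/Action>/);

  if (!thought) return { error: "Missing <Thought> block" };
  if (!action) return { error: "Missing <Action> block" };

  return { thought: thought[1].trim(), action: action[1].trim() };
}
```

Returning a failure value instead of throwing lets the caller turn a malformed response into a descriptive test-case failure.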

Action grammar

Current actions (see src/helpers/availableActions.ts):

click(elementId: number)
setValue(elementId: number, value: string)
finish()
fail()

Element IDs map back to DOM nodes via the annotated DOM generated in the content script. The orchestrator translates actions into Chrome DevTools protocol calls in domActions.ts.
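
As an illustration of the grammar, the sketch below turns an action string into a typed value before it would be dispatched. `parseAction` and the `Action` union are hypothetical, and the sketch assumes that the `value` argument of `setValue` arrives as a double-quoted, JSON-style string, which the page does not specify.

```typescript
type Action =
  | { name: "click"; elementId: number }
  | { name: "setValue"; elementId: number; value: string }
  | { name: "finish" }
  | { name: "fail" };

// Hypothetical parser for the four actions listed above.
// Returns null for anything outside the grammar.
function parseAction(source: string): Action | null {
  const call = source.trim().match(/^(\w+)\(([\s\S]*)\)$/);
  if (!call) return null;
  const name = call[1];
  const args = call[2].trim();

  switch (name) {
    case "click": {
      if (!/^\d+$/.test(args)) return null;
      return { name: "click", elementId: Number(args) };
    }
    case "setValue": {
      // Assumption: a numeric id, a comma, then a double-quoted string.
      const m = args.match(/^(\d+)\s*,\s*("(?:[^"\\]|\\.)*")$/);
      if (!m) return null;
      try {
        const value = JSON.parse(m[2]) as string;
        return { name: "setValue", elementId: Number(m[1]), value };
      } catch {
        return null; // invalid escape sequence inside the string
      }
    }
    case "finish":
      return args === "" ? { name: "finish" } : null;
    case "fail":
      return args === "" ? { name: "fail" } : null;
    default:
      return null;
  }
}
```

A discriminated union like this lets the dispatcher in domActions.ts, or any equivalent, switch on `name` with exhaustive type checking.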

Loop control

Max attempts: Each test case is capped at 40 iterations to prevent infinite loops.
Retry window: determineNextAction retries up to three times on server errors.
Wait heuristic: A 1.5 s sleep between actions gives pages time to stabilise. Future improvements include DOM mutation observers.
Debugger lifecycle: The debugger is attached before the suite and detached after completion, even on error paths.
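
These four rules combine roughly as in the sketch below. All names (`LoopDeps`, `runCase`, `runSuite`, `withRetries`) are hypothetical, the dependencies are injected so the sketch runs without a browser, and "up to three times" is read here as three attempts in total, which is an assumption.

```typescript
type StepResult = "continue" | "finish" | "fail";
type CaseStatus = "passed" | "failed";

// Hypothetical dependency bundle standing in for the debugger and the model.
interface LoopDeps {
  attach: () => Promise<void>;
  detach: () => Promise<void>;
  nextAction: () => Promise<string>; // may throw on server errors
  perform: (action: string) => Promise<StepResult>;
  sleep: (ms: number) => Promise<void>;
}

const MAX_ATTEMPTS = 40; // iterations per test case
const MAX_TRIES = 3;     // assumed: three attempts in total per model call
const SETTLE_MS = 1500;  // fixed wait between actions

async function withRetries<T>(fn: () => Promise<T>, tries: number): Promise<T> {
  let lastError: unknown = new Error("no attempts made");
  for (let i = 0; i < tries; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

async function runCase(deps: LoopDeps): Promise<CaseStatus> {
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const action = await withRetries(deps.nextAction, MAX_TRIES);
    const result = await deps.perform(action);
    if (result === "finish") return "passed";
    if (result === "fail") return "failed";
    await deps.sleep(SETTLE_MS); // let the page stabilise
  }
  return "failed"; // attempt budget exhausted
}

async function runSuite(caseCount: number, deps: LoopDeps): Promise<CaseStatus[]> {
  await deps.attach();
  try {
    const results: CaseStatus[] = [];
    for (let i = 0; i < caseCount; i++) {
      results.push(await runCase(deps));
    }
    return results;
  } finally {
    await deps.detach(); // runs on success and on error paths
  }
}
```

Placing the detach call in `finally` is what guarantees the debugger is released even when a test case throws.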

Error handling

Parsing errors: Immediately mark the case failed with a descriptive message.
Model failure: If no response is returned, the suite records a failed state and stops.
Unexpected exceptions: The catch block in runGeneratedSuite logs the message, updates the report, and rethrows for visibility.
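
The log, update, and rethrow behaviour described for unexpected exceptions might look like the following sketch. `SuiteReport` and `runWithReport` are hypothetical stand-ins, not the actual contents of runGeneratedSuite.

```typescript
// Hypothetical report shape; the real report likely carries more detail.
interface SuiteReport {
  status: "running" | "passed" | "failed";
  error?: string;
}

async function runWithReport(
  report: SuiteReport,
  body: () => Promise<void>,
  log: (message: string) => void = console.error,
): Promise<void> {
  try {
    await body();
    report.status = "passed";
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    log(message);             // log the message
    report.status = "failed"; // update the report
    report.error = message;
    throw err;                // rethrow for visibility
  }
}
```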

Recording system

First Run: Recording Mode

During the initial test execution, BrowseGenius operates in recording mode:

Action Recorder captures every action with:
- Multiple element selectors (ID, data-testid, name, CSS, XPath, aria-label, text, coordinates)
- Input values and interaction details
- Screenshots before and after actions
- Execution method (vision vs DOM)
- Success/failure status
Network Monitor uses Chrome DevTools Protocol to capture:
- All HTTP requests/responses
- Request/response headers and bodies
- Timing information
- Association with specific actions

JSON Export stores the complete recording:

{
  "testCaseId": "tc_login",
  "actions": [
    {
      "actionType": "input",
      "target": {
        "id": "email",
        "css": "input[type='email']",
        "xpath": "//input[@id='email']",
        "coordinates": { "x": 450, "y": 300 }
      },
      "value": "user@example.com",
      "networkRequests": [...]
    }
  ]
}

Subsequent Runs: Replay Mode

Replay mode uses stored JSON recordings for fast, deterministic execution:

Selector Matching: Tries selectors in priority order
- ID (most stable)
- data-testid (test-specific)
- CSS selector
- XPath
- Coordinates (vision fallback)
Action Execution: Performs recorded actions without LLM calls
Network Validation: Compares actual network requests against recording
- URL and method matching
- Status code validation
- Response body comparison (optional)
Differential Reporting: Highlights any deviations from the recorded baseline
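
The selector fallback and the network comparison can be sketched as two small pure functions. Everything here is illustrative: the names are hypothetical, `dataTestId` is an assumed key (the JSON example above does not show one), the `matches` callback stands in for a real DOM lookup, and requests are paired by method and URL, which is only one possible matching rule.

```typescript
type Strategy = "id" | "data-testid" | "css" | "xpath" | "coordinates";

interface RecordedTarget {
  id?: string;
  dataTestId?: string; // assumed key name
  css?: string;
  xpath?: string;
  coordinates?: { x: number; y: number };
}

interface Candidate {
  strategy: Strategy;
  query: string;
}

// Lists the recorded selectors in the priority order given above.
function candidatesFor(target: RecordedTarget): Candidate[] {
  const out: Candidate[] = [];
  if (target.id) out.push({ strategy: "id", query: target.id });
  if (target.dataTestId) out.push({ strategy: "data-testid", query: target.dataTestId });
  if (target.css) out.push({ strategy: "css", query: target.css });
  if (target.xpath) out.push({ strategy: "xpath", query: target.xpath });
  if (target.coordinates) {
    const { x, y } = target.coordinates;
    out.push({ strategy: "coordinates", query: `${x},${y}` });
  }
  return out;
}

// Returns the first candidate that resolves; coordinates need no lookup.
function resolveTarget(
  target: RecordedTarget,
  matches: (candidate: Candidate) => boolean,
): Candidate | null {
  for (const candidate of candidatesFor(target)) {
    if (candidate.strategy === "coordinates" || matches(candidate)) return candidate;
  }
  return null;
}

interface NetworkEntry {
  method: string;
  url: string;
  status: number;
}

// Reports recorded requests that are missing or that returned another status.
function diffNetwork(recorded: NetworkEntry[], actual: NetworkEntry[]): string[] {
  const deviations: string[] = [];
  const remaining = [...actual];
  for (const want of recorded) {
    const idx = remaining.findIndex(
      (got) => got.method === want.method && got.url === want.url,
    );
    if (idx === -1) {
      deviations.push(`missing ${want.method} ${want.url}`);
      continue;
    }
    const [got] = remaining.splice(idx, 1);
    if (got.status !== want.status) {
      deviations.push(
        `${want.method} ${want.url}: expected status ${want.status}, got ${got.status}`,
      );
    }
  }
  return deviations;
}
```

An empty array from `diffNetwork` means the run matched the recorded baseline for the fields compared; the optional response body comparison is left out of this sketch.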

Execution Modes

Vision-Based (Primary)

Uses OpenAI's computer use model
Visually locates elements on screen
Works when DOM structure changes
Slower but more resilient

DOM-Based (Fallback)

Traditional selector-based automation
Fast and deterministic
Requires stable selectors
Preferred for replay

Enhancements roadmap

✅ captureScreenshot action implemented
✅ Network request monitoring via DevTools Protocol
✅ Deterministic replays using recorded JSON
🔄 Visual screenshot comparison for validation
🔄 Smart selector healing (auto-find similar elements)
🔄 Dynamic wait strategies based on network activity
🔄 Shared execution context between test cases
