LLM Orchestration
BrowseGenius reuses and extends the Taxy action loop to run entire test cases rather than one-off commands. This page explains the prompt design, action grammar, and safeguards that keep the GPT-4 agent aligned.
Prompt anatomy
Each iteration sends a ChatCompletion request with:
- System message
Defines the assistant as a browser automation agent and enumerates available actions from availableActions.ts.
- User message
Includes:
- Test case metadata (narrative, priority, expectations).
- The simplified + templatized DOM snapshot.
- Serialized history of prior actions in the run.
Example snippet:
<Thought>I should click the Create Invoice button</Thought>
<Action>click(223)</Action>
The response must adhere to the <Thought></Thought><Action></Action> schema. determineNextAction enforces this grammar and appends a closing </Action> if needed to simplify parsing.
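The schema check described above could be sketched as follows. This is a minimal illustration, not the actual determineNextAction code: `ParsedResponse` and `parseResponse` are hypothetical names, and returning an error object rather than throwing is an assumption.

```typescript
// Hypothetical shape for a parsed model reply; not taken from the BrowseGenius source.
interface ParsedResponse {
  thought: string;
  action: string;
}

// Extracts the <Thought> and <Action> bodies from a raw completion.
// Returns an error object instead of throwing so the caller can mark the case failed.
function parseResponse(raw: string): ParsedResponse | { error: string } {
  let text = raw.trim();
  // Mirror the documented safeguard: add the closing tag when it is missing.
  if (text.includes("<Action>") && !text.includes("</Action>")) {
    text += "</Action>";
  }
  const thought = text.match(/<Thought>([\s\S]*?)<\/Thought>/);
  const action = text.match(/<Action>([\s\S]*?)<\/Action>/);
  if (!thought) return { error: "Missing <Thought> block" };
  if (!action) return { error: "Missing <Action> block" };
  return {
    thought: (thought[1] ?? "").trim(),
    action: (action[1] ?? "").trim(),
  };
}
```

A reply that stops right after `click(223)` still parses, because the closing tag is added before matching.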
Action grammar
Current actions (see src/helpers/availableActions.ts):
- click(elementId: number)
- setValue(elementId: number, value: string)
- finish()
- fail()
Element IDs map back to DOM nodes via the annotated DOM generated in the content script. The orchestrator translates each action into Chrome DevTools Protocol calls in domActions.ts.
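To make the grammar concrete, here is one possible way to turn an action string into a typed value. `AgentAction` and `parseAction` are hypothetical names. The sketch also assumes that `setValue` receives its value as a double-quoted string, which the text above does not specify.

```typescript
// Hypothetical typed form of the four documented actions.
type AgentAction =
  | { name: "click"; elementId: number }
  | { name: "setValue"; elementId: number; value: string }
  | { name: "finish" }
  | { name: "fail" };

// Parses strings such as `click(223)`; returns null for anything outside the grammar.
function parseAction(source: string): AgentAction | null {
  const call = source.trim().match(/^(\w+)\(([\s\S]*)\)$/);
  if (!call) return null;
  const args = (call[2] ?? "").trim();
  switch (call[1]) {
    case "finish":
      return args === "" ? { name: "finish" } : null;
    case "fail":
      return args === "" ? { name: "fail" } : null;
    case "click":
      return /^\d+$/.test(args) ? { name: "click", elementId: Number(args) } : null;
    case "setValue": {
      // Assumption: the value is a double-quoted string with backslash escapes.
      const m = args.match(/^(\d+)\s*,\s*"((?:[^"\\]|\\[\s\S])*)"$/);
      if (!m) return null;
      const value = (m[2] ?? "").replace(/\\([\s\S])/g, "$1");
      return { name: "setValue", elementId: Number(m[1]), value };
    }
    default:
      return null;
  }
}
```

Rejecting unknown names and malformed arguments at this stage keeps invalid actions from reaching the DOM layer.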
Loop control
- Max attempts: 40 iterations per test case to prevent infinite loops.
- Retry window: determineNextAction will retry up to three times on server errors.
- Wait heuristic: A 1.5s sleep helps pages stabilise between actions. Future improvements include DOM mutation observers.
- Debugger lifecycle: Attach before the suite, detach after completion, even on error paths.
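The four controls above might fit together roughly as in the sketch below. Every name in it (`LoopDeps`, `withRetry`, `runCase`, `runSuite`) is hypothetical, and the dependencies are injected so the loop can be exercised without a browser. Reading "up to three times" as three attempts in total is also an assumption.

```typescript
// Hypothetical names throughout; only the three constants come from the text above.
type StepResult = "continue" | "finish" | "fail";
type CaseStatus = "passed" | "failed";

interface LoopDeps {
  attachDebugger: () => Promise<void>;
  detachDebugger: () => Promise<void>;
  nextAction: (caseId: string) => Promise<StepResult>; // rejects on server errors
  sleep: (ms: number) => Promise<void>;
}

const MAX_ATTEMPTS = 40; // iterations per test case
const MAX_TRIES = 3; // assumption: three attempts in total per model call
const SETTLE_MS = 1500; // wait between actions

// Calls fn until it resolves or the tries are used up, then rethrows the last error.
async function withRetry<T>(fn: () => Promise<T>, tries: number): Promise<T> {
  let lastError: unknown = new Error("withRetry called with tries < 1");
  for (let i = 0; i < tries; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

async function runCase(caseId: string, deps: LoopDeps): Promise<CaseStatus> {
  for (let i = 0; i < MAX_ATTEMPTS; i++) {
    const result = await withRetry(() => deps.nextAction(caseId), MAX_TRIES);
    if (result === "finish") return "passed";
    if (result === "fail") return "failed";
    await deps.sleep(SETTLE_MS); // let the page settle before the next snapshot
  }
  return "failed"; // iteration budget exhausted
}

async function runSuite(caseIds: string[], deps: LoopDeps): Promise<Map<string, CaseStatus>> {
  const results = new Map<string, CaseStatus>();
  await deps.attachDebugger();
  try {
    for (const id of caseIds) results.set(id, await runCase(id, deps));
  } finally {
    await deps.detachDebugger(); // runs on success and on error paths alike
  }
  return results;
}
```

Putting the detach call in `finally` is what guarantees the debugger is released even when a case throws.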
Error handling
- Parsing errors: Immediately mark the case failed with a descriptive message.
- Model failure: If no response is returned, the suite records a failed state and stops.
- Unexpected exceptions: The catch block in runGeneratedSuite logs the message, updates the report, and rethrows for visibility.
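As a rough illustration of these three paths, the sketch below maps each outcome onto a report entry. The `CaseReport` shape and the `applyOutcome` and `guarded` helpers are assumptions made for this example. They are not the actual runGeneratedSuite code.

```typescript
// Hypothetical outcome and report shapes, for illustration only.
type StepOutcome =
  | { kind: "action" }
  | { kind: "parseError"; detail: string }
  | { kind: "noResponse" };

interface CaseReport {
  status: "running" | "failed";
  message?: string;
}

// Updates the report for one step; returns true when the run should stop.
function applyOutcome(report: CaseReport, outcome: StepOutcome): boolean {
  switch (outcome.kind) {
    case "parseError":
      report.status = "failed";
      report.message = `Could not parse model response: ${outcome.detail}`;
      return true;
    case "noResponse":
      report.status = "failed";
      report.message = "Model returned no response";
      return true;
    case "action":
      return false;
  }
}

// Wraps a body so unexpected exceptions are logged, recorded, and rethrown.
function guarded<T>(
  report: CaseReport,
  body: () => T,
  log: (message: string) => void = console.error
): T {
  try {
    return body();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    log(message);
    report.status = "failed";
    report.message = message;
    throw err; // rethrow so the failure stays visible to the caller
  }
}
```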
Recording system
First Run: Recording Mode
During the initial test execution, BrowseGenius operates in recording mode:
Action Recorder captures every action with:
- Multiple element selectors (ID, data-testid, name, CSS, XPath, aria-label, text, coordinates)
- Input values and interaction details
- Screenshots before and after actions
- Execution method (vision vs DOM)
- Success/failure status
Network Monitor uses Chrome DevTools Protocol to capture:
- All HTTP requests/responses
- Request/response headers and bodies
- Timing information
- Association with specific actions
JSON Export stores the complete recording:
```json
{
  "testCaseId": "tc_login",
  "actions": [
    {
      "actionType": "input",
      "target": {
        "id": "email",
        "css": "input[type='email']",
        "xpath": "//input[@id='email']",
        "coordinates": { "x": 450, "y": 300 }
      },
      "value": "user@example.com",
      "networkRequests": [...]
    }
  ]
}
```
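For readers consuming these files from code, the fields visible in the export could be typed along these lines. The interface names and the `isRecording` guard are hypothetical. Only fields shown in the example are modelled, and `networkRequests` is left as `unknown[]` because its shape is elided there.

```typescript
// Hypothetical types mirroring only the fields visible in the example export.
interface ElementTarget {
  id?: string;
  css?: string;
  xpath?: string;
  coordinates?: { x: number; y: number };
}

interface RecordedAction {
  actionType: string;
  target: ElementTarget;
  value?: string;
  networkRequests: unknown[]; // shape not shown in the example, so left opaque
}

interface Recording {
  testCaseId: string;
  actions: RecordedAction[];
}

function isObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Shallow structural check before replaying a file loaded from disk.
function isRecording(data: unknown): data is Recording {
  if (!isObject(data)) return false;
  if (typeof data.testCaseId !== "string") return false;
  if (!Array.isArray(data.actions)) return false;
  return data.actions.every(
    (action: unknown) =>
      isObject(action) &&
      typeof action.actionType === "string" &&
      isObject(action.target) &&
      Array.isArray(action.networkRequests)
  );
}
```

A guard like this lets replay reject a truncated or hand-edited recording before any action runs.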
Subsequent Runs: Replay Mode
Replay mode uses stored JSON recordings for fast, deterministic execution:
Selector Matching: Tries selectors in priority order
- ID (most stable)
- data-testid (test-specific)
- CSS selector
- XPath
- Coordinates (vision fallback)
Action Execution: Performs recorded actions without LLM calls
Network Validation: Compares actual network requests against recording
- URL and method matching
- Status code validation
- Response body comparison (optional)
Differential Reporting: Highlights any deviations from the recorded baseline
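The selector fallback and network comparison steps above could look something like the following sketch. All names are hypothetical. The element lookup is passed in as a function so the example stays independent of any DOM or DevTools API, and matching requests by URL plus method is an assumption about how "association" works.

```typescript
// Hypothetical names; the priority order follows the list above.
type SelectorKind = "id" | "testId" | "css" | "xpath" | "coordinates";
const PRIORITY: SelectorKind[] = ["id", "testId", "css", "xpath", "coordinates"];

interface ReplayTarget {
  id?: string;
  testId?: string;
  css?: string;
  xpath?: string;
  coordinates?: { x: number; y: number };
}

// Tries each recorded selector in priority order and returns the first hit.
function resolveTarget<E>(
  target: ReplayTarget,
  find: (kind: SelectorKind, target: ReplayTarget) => E | null
): { kind: SelectorKind; element: E } | null {
  for (const kind of PRIORITY) {
    if (target[kind] === undefined) continue; // selector was not recorded
    const element = find(kind, target);
    if (element !== null) return { kind, element };
  }
  return null;
}

interface NetworkEntry {
  url: string;
  method: string;
  status: number;
  body?: string;
}

// Lists deviations between the recorded baseline and the requests seen during replay.
function diffNetwork(
  expected: NetworkEntry[],
  actual: NetworkEntry[],
  compareBodies = false
): string[] {
  const deviations: string[] = [];
  for (const exp of expected) {
    const label = `${exp.method} ${exp.url}`;
    const match = actual.find((a) => a.url === exp.url && a.method === exp.method);
    if (!match) {
      deviations.push(`${label}: request missing`);
      continue;
    }
    if (match.status !== exp.status) {
      deviations.push(`${label}: status ${match.status}, expected ${exp.status}`);
    }
    if (compareBodies && match.body !== exp.body) {
      deviations.push(`${label}: response body differs`);
    }
  }
  return deviations;
}
```

An empty array from `diffNetwork` means the replay matched the baseline. Any entries can feed straight into the differential report.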
Execution Modes
Vision-Based (Primary)
- Uses OpenAI's computer use model
- Visually locates elements on screen
- Works when DOM structure changes
- Slower but more resilient
DOM-Based (Fallback)
- Traditional selector-based automation
- Fast and deterministic
- Requires stable selectors
- Preferred for replay
Enhancements roadmap
- ✅ captureScreenshot action implemented
- ✅ Network request monitoring via DevTools Protocol
- ✅ Deterministic replays using recorded JSON
- 🔄 Visual screenshot comparison for validation
- 🔄 Smart selector healing (auto-find similar elements)
- 🔄 Dynamic wait strategies based on network activity
- 🔄 Shared execution context between test cases