ROIBench
Agentic Activation Engineering: optimizing time-to-aha with synthetic users.
ROIBench is an evaluation framework for app generation where the optimization target is first-session value (activation / time-to-aha), not just “builds” or “passes tests”.
Core idea
The unit of evaluation is a synthetic first session. Each run tries to reach a machine-verified "aha moment" within strict effort budgets, while ROIBench measures (a per-run record is sketched after this list):
- Activation rate (did the user reach minimum value?)
- Time-to-aha
- Friction (errors, loops, retries, dead ends)
- Full success rate (to deter gaming the minimum-value bar)
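Concretely, each run yields a small record covering these four metrics. A minimal sketch, assuming field names that are illustrative rather than a published ROIBench schema:

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    """One synthetic first session (illustrative fields, not a fixed API)."""
    activated: bool         # reached the minimum-value "aha" validator
    time_to_aha_s: float    # wall-clock seconds to aha (float("inf") if never)
    friction_events: int    # errors + loops + retries + dead ends
    full_success: bool      # also satisfied the full-value validator
```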
Value Contracts
Each task defines a Value Contract: a machine-checkable definition of what the app must deliver and what "good first-session ROI" means. This makes the target measurable and reproducible; a minimal encoding is sketched after the field lists below.
Required (v1)
- Goal
- Minimum value definition + validator
- Budgets (time / steps / errors)
- Constraints
Optional (v1)
- Full value definition + validator
- Cost budget
- ROI weights
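One way such a contract could be encoded. All field and type names here are assumptions for illustration; `AppState` is a placeholder for whatever state snapshot validators inspect:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

AppState = dict[str, Any]  # placeholder: the state a validator inspects

@dataclass
class ValueContract:
    # Required (v1)
    goal: str
    min_value_validator: Callable[[AppState], bool]
    max_time_s: float                 # time budget
    max_steps: int                    # step budget
    max_errors: int                   # error budget
    constraints: list[str] = field(default_factory=list)
    # Optional (v1)
    full_value_validator: Optional[Callable[[AppState], bool]] = None
    max_cost_usd: Optional[float] = None
    roi_weights: Optional[dict[str, float]] = None
```

A task's minimum-value validator might, for example, check that a saved artifact exists and is non-empty; the full-value validator would check the complete journey.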
Scoring (v1)
A simple, bounded scoring form that rewards activation while smoothly penalizing time and friction:
score_r = A_r * exp(-T_r / τ) * exp(-F_r / φ)
ActivationROIScore = mean_r(score_r)

Here A_r ∈ {0, 1} marks activation for run r, T_r is its time-to-aha, F_r its friction count, and τ and φ are decay scales for time and friction, so each score lies in [0, 1]. The score is always reported alongside its breakdown metrics (activation rate, median time-to-aha, friction, and full success).
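A direct transcription of this formula (the function names are mine; τ and φ map to `tau` and `phi`, and runs are RunResult-like records as sketched earlier):

```python
import math
from statistics import mean

def run_score(activated: bool, time_to_aha_s: float, friction: float,
              tau: float, phi: float) -> float:
    """score_r = A_r * exp(-T_r / tau) * exp(-F_r / phi), in [0, 1]."""
    if not activated:
        return 0.0  # A_r = 0 zeroes the score regardless of time/friction
    return math.exp(-time_to_aha_s / tau) * math.exp(-friction / phi)

def activation_roi_score(runs, tau: float, phi: float) -> float:
    """Mean of score_r over all runs r."""
    return mean(run_score(r.activated, r.time_to_aha_s, r.friction_events,
                          tau, phi) for r in runs)
```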
End-to-end workflow
- Discover: mine problem signals and generate candidate task ideas.
- Specify: formalize tasks with Value Contracts and canonical journeys.
- Generate: build candidate apps from the spec.
- Evaluate (synthetic): run synthetic users and score Activation ROI.
- Optimize: iterate edits to increase Activation ROI under constraints (a loop sketch follows this list).
- Calibrate: periodically compare synthetic rankings to human studies.
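Stripped of ROIBench specifics, the Optimize step is a score-guided edit loop. A minimal hill-climb sketch, where `evaluate` and `propose_edit` are caller-supplied stand-ins for the synthetic evaluation and the edit generator (neither is a real ROIBench API):

```python
from typing import Callable, TypeVar

App = TypeVar("App")  # opaque candidate-app representation

def optimize(app: App,
             evaluate: Callable[[App], float],     # returns ActivationROIScore
             propose_edit: Callable[[App], App],   # returns a mutated candidate
             n_iters: int = 20) -> tuple[App, float]:
    """Greedy hill-climb: keep only edits that raise Activation ROI."""
    best_score = evaluate(app)
    for _ in range(n_iters):
        candidate = propose_edit(app)
        score = evaluate(candidate)
        if score > best_score:
            app, best_score = candidate, score
    return app, best_score
```

In a real loop, the Value Contract's budgets and constraints would additionally gate whether a candidate edit is accepted.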
Evaluation modes
graph_sim
Fast iteration by executing the underlying ValueGraph directly. Useful for tight optimization loops.
ui_script
Deterministic, replayable UI evaluation (e.g., Playwright) that measures real wall-clock time-to-aha.
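For ui_script mode, a scripted journey with Playwright might time the aha moment like this. The URL, selectors, and journey steps are hypothetical placeholders for what a task's Value Contract specifies; only the timing pattern matters:

```python
import time
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    start = time.monotonic()
    page.goto("http://localhost:3000")        # candidate app under test
    page.fill("#query", "first search")       # canonical journey steps...
    page.click("#submit")
    page.wait_for_selector("#results .item")  # ...until the machine-verified aha
    time_to_aha_s = time.monotonic() - start
    browser.close()

print(f"time-to-aha: {time_to_aha_s:.2f}s")
```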
ui_agent
Most realistic but noisier: a computer-use model interacts with arbitrary UI layouts.