In progress · Evaluation framework

ROIBench

Agentic Activation Engineering: optimizing time-to-aha with synthetic users.

ROIBench is an evaluation framework for app generation where the optimization target is first-session value (activation / time-to-aha), not just “builds” or “passes tests”.

Core idea

The unit of evaluation is a synthetic first session. Each run tries to reach a machine-verified “aha moment” within strict effort budgets, while ROIBench measures:

  • Activation rate (did the user reach minimum value?)
  • Time-to-aha
  • Friction (errors, loops, retries, dead ends)
  • Full success rate (did the user reach full value? Reported so that optimizing for minimum value alone cannot game the result)
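
As a rough illustration of how the four metrics above could be aggregated over a batch of synthetic sessions, here is a minimal sketch. The SessionResult record and its field names are hypothetical and not part of ROIBench; friction is assumed to be a simple count of friction events per session.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class SessionResult:
    """One synthetic first session (hypothetical record, field names assumed)."""
    activated: bool                 # reached the minimum-value "aha moment" within budget
    time_to_aha_s: Optional[float]  # seconds until aha; None if the session never activated
    friction_events: int            # errors + loops + retries + dead ends
    full_success: bool              # reached full value, not just minimum value


def summarize(sessions: List[SessionResult]) -> Dict[str, Optional[float]]:
    """Aggregate the breakdown metrics over a batch of sessions."""
    if not sessions:
        raise ValueError("need at least one session")
    n = len(sessions)
    # Time-to-aha is only defined for sessions that actually activated.
    aha_times = [
        s.time_to_aha_s
        for s in sessions
        if s.activated and s.time_to_aha_s is not None
    ]
    return {
        "activation_rate": sum(s.activated for s in sessions) / n,
        "median_time_to_aha_s": median(aha_times) if aha_times else None,
        "mean_friction": sum(s.friction_events for s in sessions) / n,
        "full_success_rate": sum(s.full_success for s in sessions) / n,
    }


sessions = [
    SessionResult(True, 10.0, 1, True),
    SessionResult(True, 30.0, 3, False),
    SessionResult(False, None, 5, False),
    SessionResult(True, 20.0, 0, True),
]
summary = summarize(sessions)
```

Taking the median only over activated sessions keeps failed sessions from distorting time-to-aha; their cost shows up in the activation rate instead.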

Value Contracts

Each task defines a Value Contract: a machine-checkable definition of what the app must deliver and what “good first-session ROI” means. This makes the target measurable and reproducible.

Required (v1)

  • Goal
  • Minimum value definition + validator
  • Budgets (time / steps / errors)
  • Constraints

Optional (v1)

  • Full value definition + validator
  • Cost budget
  • ROI weights
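
To make the field list concrete, here is one way a Value Contract might be represented in code. This is a sketch under assumptions: the class names, field names, and the idea of validators as plain callables over a session state dictionary are all illustrative, not the ROIBench schema.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

# Assumed validator shape: takes the observed app/session state, returns pass/fail.
Validator = Callable[[Dict[str, Any]], bool]


@dataclass
class Budgets:
    max_seconds: float
    max_steps: int
    max_errors: int


@dataclass
class ValueContract:
    # Required (v1)
    goal: str
    minimum_value: str
    minimum_validator: Validator
    budgets: Budgets
    constraints: List[str]
    # Optional (v1)
    full_value: Optional[str] = None
    full_validator: Optional[Validator] = None
    cost_budget: Optional[float] = None
    roi_weights: Dict[str, float] = field(default_factory=dict)


def is_activated(contract: ValueContract, state: Dict[str, Any],
                 seconds: float, steps: int, errors: int) -> bool:
    """Activation = minimum value validated AND every effort budget respected."""
    b = contract.budgets
    within_budget = (
        seconds <= b.max_seconds and steps <= b.max_steps and errors <= b.max_errors
    )
    return within_budget and contract.minimum_validator(state)


# Hypothetical example task: a personal expense tracker.
contract = ValueContract(
    goal="Track personal expenses",
    minimum_value="One expense saved and a running total shown",
    minimum_validator=lambda s: s.get("expenses_saved", 0) >= 1
    and s.get("total_visible", False),
    budgets=Budgets(max_seconds=120.0, max_steps=15, max_errors=2),
    constraints=["no account creation required"],
)

state = {"expenses_saved": 1, "total_visible": True}
ok = is_activated(contract, state, seconds=45.0, steps=6, errors=0)
too_slow = is_activated(contract, state, seconds=300.0, steps=6, errors=0)
```

Keeping the validator a pure function of observed state is what would make the contract machine-checkable and replayable across runs.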

Scoring (v1)

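A simple, bounded scoring form that rewards activation while smoothly penalizing time and friction. For run r, A_r is the activation outcome, T_r the time-to-aha, and F_r the friction; τ and φ set how quickly the time and friction penalties decay: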

score_r = A_r * exp(-T_r / τ) * exp(-F_r / φ)
ActivationROIScore = mean_r(score_r)

The score is always reported alongside its breakdown metrics (activation rate, median time-to-aha, friction, and full success rate).
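
The formula translates almost directly into code. The sketch below assumes that A_r is a 0/1 activation indicator, that T_r is measured in seconds, and that F_r is a friction count; the function names and the example values of τ and φ are illustrative only.

```python
import math
from typing import Iterable, Tuple


def run_score(activated: bool, time_to_aha: float, friction: float,
              tau: float, phi: float) -> float:
    """score_r = A_r * exp(-T_r / tau) * exp(-F_r / phi), with A_r assumed to be 0 or 1."""
    if tau <= 0 or phi <= 0:
        raise ValueError("tau and phi must be positive")
    if not activated:
        # A_r = 0, so the run scores zero regardless of time or friction.
        return 0.0
    return math.exp(-time_to_aha / tau) * math.exp(-friction / phi)


def activation_roi_score(runs: Iterable[Tuple[bool, float, float]],
                         tau: float, phi: float) -> float:
    """ActivationROIScore = mean over runs of score_r."""
    scores = [run_score(a, t, f, tau, phi) for a, t, f in runs]
    if not scores:
        raise ValueError("need at least one run")
    return sum(scores) / len(scores)


# Worked example: activated after 30 s with 2 friction events, tau = 60 s, phi = 4.
# exp(-30/60) * exp(-2/4) = exp(-1), roughly 0.368.
example = run_score(True, 30.0, 2, tau=60.0, phi=4.0)

# A perfect run (score 1.0) averaged with a failed run (score 0.0) gives 0.5.
batch = activation_roi_score([(True, 0.0, 0, ), (False, 0.0, 0)], tau=60.0, phi=4.0)
```

Because each factor lies in (0, 1] and A_r is at most 1, every run score stays in [0, 1], which is what makes the form bounded.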

End-to-end workflow

  1. Discover: mine problem signals and generate candidate task ideas.
  2. Specify: formalize tasks with Value Contracts and canonical journeys.
  3. Generate: build candidate apps from the spec.
  4. Evaluate (synthetic): run synthetic users and score Activation ROI.
  5. Optimize: iterate edits to increase Activation ROI under constraints.
  6. Calibrate: periodically compare synthetic rankings to human studies.
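
For the calibration step, one possible check is a rank correlation between synthetic scores and human-study outcomes for the same set of apps. The sketch below uses Spearman's rank correlation; that choice, the function names, and the sample numbers are assumptions made for illustration, not something the workflow prescribes.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(vx * vy)


# Made-up numbers: synthetic ActivationROIScore vs. human activation rate for three apps.
synthetic = [0.80, 0.50, 0.30]
human = [0.70, 0.60, 0.20]
rho = spearman(synthetic, human)
```

A correlation near 1 would suggest the synthetic evaluation orders apps the way human users do; a drop over time would be a signal to revisit the synthetic users or the scoring parameters.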

Evaluation modes

graph_sim

Fast iteration by executing the underlying ValueGraph directly. Useful for tight optimization loops.
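
The page does not define the ValueGraph structure, so the following is purely a hypothetical sketch of what graph-level simulation could look like: app states as nodes, user actions as edges with a simulated time cost, and a scripted action sequence walked under step and time budgets. Every name and the graph encoding are assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding: state -> {action: (next_state, simulated_seconds)}
ValueGraph = Dict[str, Dict[str, Tuple[str, float]]]


def simulate(graph: ValueGraph, start: str, aha_state: str, actions: List[str],
             max_steps: int, max_seconds: float) -> Optional[float]:
    """Walk the graph with a scripted action list.

    Returns the simulated time-to-aha in seconds, or None if the session hits a
    dead end or exhausts a budget before reaching the aha state.
    """
    state, elapsed = start, 0.0
    for step, action in enumerate(actions, start=1):
        if step > max_steps:
            return None  # step budget exhausted
        edges = graph.get(state, {})
        if action not in edges:
            return None  # dead end: action not available in this state
        state, cost = edges[action]
        elapsed += cost
        if elapsed > max_seconds:
            return None  # time budget exhausted
        if state == aha_state:
            return elapsed
    return None  # ran out of scripted actions before reaching aha


graph: ValueGraph = {
    "landing": {"sign_up": ("onboarding", 5.0)},
    "onboarding": {"add_item": ("aha", 8.0)},
}
t = simulate(graph, "landing", "aha", ["sign_up", "add_item"],
             max_steps=10, max_seconds=60.0)
```

No browser or rendering is involved, which is where the speed advantage over the UI-based modes would come from.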

ui_script

Deterministic, replayable UI evaluation (e.g., Playwright) that measures real wall-clock time-to-aha.

ui_agent

Most realistic but noisier: a computer-use model interacts with arbitrary UI layouts.