ROIBench
Agentic Activation Engineering: optimizing time-to-aha with synthetic users.
ROIBench is an evaluation framework for app generation where the optimization target is first-session value (activation / time-to-aha), not just “builds” or “passes tests”.
Core idea
The unit of evaluation is a synthetic first session. Each run tries to reach a machine-verified "aha moment" within strict effort budgets, while ROIBench measures (a per-run record is sketched after this list):
- Activation rate (did the user reach minimum value?)
- Time-to-aha
- Friction (errors, loops, retries, dead ends)
- Full success rate (to deter gaming the minimum-value bar)
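Concretely, each run yields a small record covering these four metrics. A minimal sketch, assuming field names that are illustrative rather than a published ROIBench schema:

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    """One synthetic first session (illustrative fields, not a fixed API)."""
    activated: bool         # reached the minimum-value "aha" validator
    time_to_aha_s: float    # wall-clock seconds to aha (float("inf") if never)
    friction_events: int    # errors + loops + retries + dead ends
    full_success: bool      # also satisfied the full-value validator
```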
Value Contracts
Each task defines a Value Contract: a machine-checkable definition of what the app must deliver and what "good first-session ROI" means. This makes the target measurable and reproducible; a minimal encoding is sketched after the field lists below.
Required (v1)
- Goal
- Minimum value definition + validator
- Budgets (time / steps / errors)
- Constraints
Optional (v1)
- Full value definition + validator
- Cost budget
- ROI weights
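One way such a contract could be encoded. All field and type names here are assumptions for illustration; `AppState` is a placeholder for whatever state snapshot validators inspect:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

AppState = dict[str, Any]  # placeholder: the state a validator inspects

@dataclass
class ValueContract:
    # Required (v1)
    goal: str
    min_value_validator: Callable[[AppState], bool]
    max_time_s: float                 # time budget
    max_steps: int                    # step budget
    max_errors: int                   # error budget
    constraints: list[str] = field(default_factory=list)
    # Optional (v1)
    full_value_validator: Optional[Callable[[AppState], bool]] = None
    max_cost_usd: Optional[float] = None
    roi_weights: Optional[dict[str, float]] = None
```

A task's minimum-value validator might, for example, check that a saved artifact exists and is non-empty; the full-value validator would check the complete journey.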
Scoring (v1)
A simple, bounded scoring form that rewards activation while smoothly penalizing time and friction:
score_r = A_r * exp(-T_r / τ) * exp(-F_r / φ)
ActivationROIScore = mean_r(score_r)

Here A_r ∈ {0, 1} marks activation for run r, T_r is its time-to-aha, F_r its friction count, and τ and φ are decay scales for time and friction, so each score lies in [0, 1]. The score is always reported alongside its breakdown metrics (activation rate, median time-to-aha, friction, and full success).
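A direct transcription of this formula (the function names are mine; τ and φ map to `tau` and `phi`, and runs are RunResult-like records as sketched earlier):

```python
import math
from statistics import mean

def run_score(activated: bool, time_to_aha_s: float, friction: float,
              tau: float, phi: float) -> float:
    """score_r = A_r * exp(-T_r / tau) * exp(-F_r / phi), in [0, 1]."""
    if not activated:
        return 0.0  # A_r = 0 zeroes the score regardless of time/friction
    return math.exp(-time_to_aha_s / tau) * math.exp(-friction / phi)

def activation_roi_score(runs, tau: float, phi: float) -> float:
    """Mean of score_r over all runs r."""
    return mean(run_score(r.activated, r.time_to_aha_s, r.friction_events,
                          tau, phi) for r in runs)
```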
End-to-end workflow
- Discover: mine problem signals and generate candidate task ideas.
- Specify: formalize tasks with Value Contracts and canonical journeys.
- Generate: build candidate apps from the spec.
- Evaluate (synthetic): run synthetic users and score Activation ROI.
- Optimize: iterate edits to increase Activation ROI under constraints (a loop sketch follows this list).
- Calibrate: periodically compare synthetic rankings to human studies.
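Stripped of ROIBench specifics, the Optimize step is a score-guided edit loop. A minimal hill-climb sketch, where `evaluate` and `propose_edit` are caller-supplied stand-ins for the synthetic evaluation and the edit generator (neither is a real ROIBench API):

```python
from typing import Callable, TypeVar

App = TypeVar("App")  # opaque candidate-app representation

def optimize(app: App,
             evaluate: Callable[[App], float],     # returns ActivationROIScore
             propose_edit: Callable[[App], App],   # returns a mutated candidate
             n_iters: int = 20) -> tuple[App, float]:
    """Greedy hill-climb: keep only edits that raise Activation ROI."""
    best_score = evaluate(app)
    for _ in range(n_iters):
        candidate = propose_edit(app)
        score = evaluate(candidate)
        if score > best_score:
            app, best_score = candidate, score
    return app, best_score
```

In a real loop, the Value Contract's budgets and constraints would additionally gate whether a candidate edit is accepted.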
Evaluation modes
graph_sim
Fast iteration by executing the underlying ValueGraph directly. Useful for tight optimization loops.
ui_script
Deterministic, replayable UI evaluation (e.g., Playwright) that measures real wall-clock time-to-aha.
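For ui_script mode, a scripted journey with Playwright might time the aha moment like this. The URL, selectors, and journey steps are hypothetical placeholders for what a task's Value Contract specifies; only the timing pattern matters:

```python
import time
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    start = time.monotonic()
    page.goto("http://localhost:3000")        # candidate app under test
    page.fill("#query", "first search")       # canonical journey steps...
    page.click("#submit")
    page.wait_for_selector("#results .item")  # ...until the machine-verified aha
    time_to_aha_s = time.monotonic() - start
    browser.close()

print(f"time-to-aha: {time_to_aha_s:.2f}s")
```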
ui_agent
Most realistic but noisier: a computer-use model interacts with arbitrary UI layouts.