
Concept / project study

Behavior Bench

How might teams review AI behavior like a designed product surface?

Shows product thinking about AI quality, not just interface polish: how behavior is designed, evaluated, and decided on as a team.

Opening

Behavior Bench treats AI behavior as something designed, tested, and reviewed. The interface makes model responses visible as product decisions rather than mysterious outputs.

Scenario

A product team is refining an AI feature. They have multiple prompt versions and model outputs, but they need to compare quality across realistic user scenarios before shipping.

Design challenge

AI behavior can feel invisible until it fails. Teams need a practical way to compare outputs and make product-level decisions before launch.

Design decisions

What the interface needs to make clear.

01

Compare behavior side by side

The core layout makes differences easy to see: same scenario, different outputs, clear evaluation criteria.

02

Make quality discussable

Scoring is lightweight and tied to product goals so teams can discuss whether a response is useful, safe, clear, or on-brand.

03

Preserve the reason behind decisions

Accepted and rejected versions need notes. This keeps future changes from repeating old mistakes.
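The three decisions above imply a small data model: versions under review, an accept/reject outcome, and the preserved reason. A minimal Python sketch of that record, with all names hypothetical:

```python
from dataclasses import dataclass

@dataclass
class BehaviorDecision:
    """One reviewed output version, with the reason for the verdict kept."""
    version: str    # prompt or model version label
    accepted: bool
    notes: str      # why this behavior was accepted or rejected

decisions = [
    BehaviorDecision("prompt-v1", False, "Hedges too much; no clear next step for the user."),
    BehaviorDecision("prompt-v2", True, "Direct answer, on-brand tone, safe refusal path."),
]

# Future reviewers can look up why a version was rejected instead of re-testing it.
rejected = {d.version: d.notes for d in decisions if not d.accepted}
print(rejected["prompt-v1"])
```

Keeping the notes on the record itself, rather than in a separate doc, is what stops future changes from repeating old mistakes.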

Prototype flow

How someone would move through it.

01

Choose a scenario

The team selects a realistic user situation to test against.

02

Compare outputs

Prompt and model versions appear side by side with visible differences.

03

Record a decision

Reviewers score quality, write notes, and preserve why one behavior should ship.

Interaction model

How the experience works.

  • Scenarios define the behavior the product should support
  • Side-by-side outputs reveal differences across versions
  • Quality criteria make review less subjective
  • Decision notes preserve why a behavior was accepted or rejected
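The interaction model above can be sketched as data: one scenario, candidate outputs scored against lightweight criteria, and side-by-side deltas that make the review discussable. A minimal sketch; the scenario, version labels, criteria, and scores are all hypothetical:

```python
# One scenario, two candidate outputs, scored 1-5 on product-level criteria.
scenario = "User asks to cancel a subscription mid-billing-cycle."

scores = {
    "prompt-v1": {"useful": 3, "safe": 5, "clear": 2, "on_brand": 4},
    "prompt-v2": {"useful": 5, "safe": 5, "clear": 4, "on_brand": 4},
}

def total(version):
    """Sum a version's criterion scores into a single comparable number."""
    return sum(scores[version].values())

# Per-criterion deltas reveal *where* the versions differ, not just which wins.
for criterion in scores["prompt-v1"]:
    delta = scores["prompt-v2"][criterion] - scores["prompt-v1"][criterion]
    print(f"{criterion}: v2 {'+' if delta >= 0 else ''}{delta}")

best = max(scores, key=total)
print(best)  # prompt-v2 (18 vs 14)
```

The per-criterion breakdown matters more than the total: it turns "v2 feels better" into "v2 is clearer and more useful, equally safe and on-brand."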

Prototype plan

How it becomes real.

  1. Comparison dashboard with sample scenarios
  2. Prompt and model version history
  3. Lightweight scoring rubric and evaluator notes
  4. Behavior change log for release review
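Step 4, the behavior change log, could be as simple as an append-only list of decision entries that release review reads back. A minimal sketch, with a hypothetical entry shape:

```python
import json
from datetime import date

# Append-only behavior change log for release review (hypothetical shape).
log = []

def record_change(version, scenario, decision, notes):
    """Append one reviewed behavior change; nothing is ever edited in place."""
    log.append({
        "date": date.today().isoformat(),
        "version": version,
        "scenario": scenario,
        "decision": decision,  # "accepted" or "rejected"
        "notes": notes,
    })

record_change("prompt-v2", "cancel-subscription", "accepted",
              "Clearer refund explanation than v1.")
print(json.dumps(log, indent=2))
```

An append-only log keeps the release review honest: the history of what shipped, and why, cannot be quietly rewritten.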

What to notice

Why this project matters for Wanling's profile.

The next build should start with a small set of scenarios and prompt versions, then layer in scoring and reviewer notes.