Protocol 001 / published

Business ideas under invariants.

This is the first SampleLens benchmark protocol. It asks a practical question: which decoding settings produce business ideas that are specific, plausible, and still meaningfully exploratory when the model has to stay inside hard constraints?

Question

What are we actually testing?

The benchmark asks a model to generate business ideas under explicit invariants. The goal is not to maximize novelty for its own sake. The goal is to understand which decoding settings preserve useful exploration while still respecting the constraints that make an idea credible and actionable.

Target question: which decoding ranges produce ideas that are specific, plausible, and operationally coherent without collapsing into repetitive safe answers or drifting outside the stated constraints?

Setup

The reusable task shape behind the benchmark.

This protocol publishes the question class rather than a one-off magic prompt. That makes the benchmark more reusable, easier to critique, and more useful to anyone working on similar ideation tasks.

Prompt shape

The model is asked to propose one business idea that fits a fixed set of invariants, explain the offer, define the buyer, and justify why the idea is commercially viable. The output format is constrained enough to support scoring and side-by-side comparison.
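
To make that shape concrete, here is one possible Python rendering of the template; the wording, the output labels, and the `{invariants}` slot are illustrative assumptions, not the protocol's published prompt.

```python
# Hypothetical prompt template. The protocol publishes the question class,
# not this exact wording; treat every label below as illustrative.
PROMPT_TEMPLATE = """\
Propose exactly ONE business idea that satisfies every invariant below.

Invariants:
{invariants}

Answer in this format:
Offer: <one paragraph describing the product or service>
Buyer: <one sentence naming a narrow, legible buyer>
Pain point: <the concrete problem being solved>
Why it is viable: <the commercial wedge and monetization path>
"""
```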

Core invariants

  • The idea must be software-first or software-amplified, not a capital-heavy physical business.
  • The buyer must be legible and narrow enough to describe in one sentence.
  • The pain point must be concrete rather than a broad AI aspiration.
  • The business must look operable by a lean human-plus-agent team.
  • The commercialization path must be plausible without requiring massive upfront distribution power.
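
As a sketch, these invariants could travel with the benchmark as a small structured checklist that fills the `{invariants}` slot of the template above and doubles as a reference for scorers; the keys and exact phrasings are assumptions.

```python
# Illustrative encoding of the invariants; keys and wording are assumptions.
CORE_INVARIANTS = {
    "software_first": "Software-first or software-amplified; no capital-heavy physical operations.",
    "narrow_buyer": "A buyer legible and narrow enough to describe in one sentence.",
    "concrete_pain": "A concrete pain point, not broad AI aspiration.",
    "lean_team": "Operable by a lean human-plus-agent team.",
    "plausible_path": "A commercialization path that does not require massive upfront distribution.",
}

# Fill the template's slot with the checklist, one bullet per invariant.
prompt = PROMPT_TEMPLATE.format(
    invariants="\n".join(f"- {text}" for text in CORE_INVARIANTS.values())
)
```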

Sweep plan

The protocol has to vary enough to show real behavior change.

One good answer proves almost nothing. This benchmark is meant to surface the tradeoff between exploration, discipline, and repeatability, so it needs a small but meaningful grid rather than a single preset.

Dimension 01

Temperature

Test low, medium, and mildly high ranges to separate rigid compliance from productive exploration.
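
As an illustration, the tiers might be pinned to values like these; the numbers are a sweep choice assumed for this sketch, not part of the protocol.

```python
# Illustrative temperature tiers; the exact values are assumptions.
TEMPERATURES = {
    "low": 0.3,          # rigid-compliance end of the range
    "medium": 0.7,       # common default territory
    "mildly_high": 1.0,  # exploratory without being chaotic
}
```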

Dimension 02

Tail control

Compare nucleus-heavy and Min-P-heavy regimes so the benchmark can see whether a relative probability floor improves constraint adherence without crushing ideation.
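
A minimal sketch of the two regimes, expressed with the `top_p` and `min_p` parameters that several open inference stacks expose; the specific cutoffs are assumptions.

```python
# Two illustrative tail-control regimes. top_p keeps the smallest set of
# tokens whose cumulative probability reaches the cutoff; min_p drops any
# token whose probability falls below a fraction of the top token's
# probability, i.e. a floor that scales with the model's own confidence.
TAIL_REGIMES = {
    "nucleus_heavy": {"top_p": 0.9, "min_p": 0.0},   # fixed-mass truncation only
    "min_p_heavy":   {"top_p": 1.0, "min_p": 0.05},  # relative floor only
}
```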

Dimension 03

Seed variance

Multiple seeds per configuration help distinguish true configuration effects from one lucky or unlucky sample.
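
Continuing the sketches above, the full sweep is just the cross product of temperature tiers, tail regimes, and seeds, so no configuration is ever judged on a single sample.

```python
import itertools

SEEDS = [0, 1, 2]  # illustrative seed policy; more seeds buy more confidence

# One run spec per (temperature, tail regime, seed) combination:
# 3 temperatures x 2 regimes x 3 seeds = 18 runs per model/runtime.
SWEEP = [
    {"temp_label": t_label, "temperature": t,
     "regime_label": r_label, **regime, "seed": seed}
    for (t_label, t), (r_label, regime), seed in itertools.product(
        TEMPERATURES.items(), TAIL_REGIMES.items(), SEEDS
    )
]
```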

Dimension 04

Runtime notes

The protocol records model, runtime, and date so future runs do not pretend provider or version drift never happened.
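
A minimal per-run provenance record might look like the sketch below; the field names are assumptions.

```python
from dataclasses import dataclass
from datetime import date

# Provenance per run, so provider or version drift stays visible when
# results are compared across time. Field names are assumptions.
@dataclass
class RunRecord:
    model: str      # exact model identifier, including version
    runtime: str    # serving stack or provider, with its version
    run_date: date  # when the run happened
    config: dict    # one entry from the SWEEP grid above
    output: str     # the generated idea, verbatim
```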

Rubric

The scores should reward useful ideas, not flashy ones.

The benchmark is only as good as its scoring. The rubric should make it obvious why a configuration is preferred, rejected, or treated as promising but unstable.

Constraint adherence

  • Did the answer stay inside the declared invariants?
  • Did it avoid drifting into invalid business shapes or vague futurism?

Specificity

  • Is the buyer narrow and concrete?
  • Is the problem described sharply enough to sell or build against?

Commercial plausibility

  • Does the idea have a believable wedge and monetization path?
  • Would a lean operator believe this can be launched?

Distinctiveness

  • Does the idea avoid generic AI wrapper boilerplate?
  • Does the answer feel meaningfully differentiated without becoming unrealistic?
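
To keep that reasoning auditable, each generation could carry a fixed score record like the sketch below; the 0-4 scale and field names are assumptions, not the protocol's spec.

```python
from dataclasses import dataclass

# One rubric record per generated idea; scale and fields are assumptions.
@dataclass
class RubricScore:
    constraint_adherence: int     # 0-4: stayed inside the declared invariants?
    specificity: int              # 0-4: narrow buyer, sharply stated problem?
    commercial_plausibility: int  # 0-4: believable wedge and monetization path?
    distinctiveness: int          # 0-4: differentiated without turning unrealistic?
    rationale: str                # the reasoning behind the numbers, not just totals
```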

Artifact plan

What the eventual results page should let you compare.

When this protocol gets a real result page, it should not just announce a winner. It should make the behavior shift visible enough that a reader could adopt a better default or challenge the conclusion.

Required outputs

  • The sanitized prompt or prompt template.
  • The parameter grid and seed policy.
  • Side-by-side generated ideas grouped by configuration.
  • Rubric scores and the reasoning behind them.
  • A practical default such as "use this range when you want exploration without constraint drift."
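
As a sketch, the results page could be assembled from a small manifest that mirrors this list; every file name here is hypothetical.

```python
# Hypothetical artifact manifest for the results page; names are assumptions.
RESULTS_MANIFEST = {
    "prompt_template": "prompt.txt",           # the sanitized prompt or template
    "parameter_grid": "sweep.json",            # the grid and seed policy
    "generations": "ideas_by_config.json",     # side-by-side ideas per configuration
    "scores": "rubric_scores.json",            # rubric scores with written rationale
    "default_recommendation": "takeaway.md",   # the bounded practical default
}
```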

What not to overclaim

This benchmark will be useful, but it is still one task family on one model/runtime setup at a time. The right posture is to publish a bounded takeaway, not a universal law about creativity settings.