Protocol 001 / published
This is the first SampleLens benchmark protocol. It asks a practical question: which decoding settings produce business ideas that are specific, plausible, and still meaningfully exploratory when the model has to stay inside hard constraints?
Question
The benchmark asks a model to generate business ideas under explicit invariants. The goal is not to maximize novelty for its own sake. The goal is to understand which decoding settings preserve useful exploration while still respecting the constraints that make an idea credible and actionable.
Target question: which decoding ranges produce ideas that are specific, plausible, and operationally coherent without collapsing into repetitive safe answers or drifting outside the stated constraints?
Setup
This protocol publishes the question class rather than a one-off magic prompt. That makes the benchmark more reusable, easier to critique, and more useful to anyone working on similar ideation tasks.
The model is asked to propose one business idea that fits a fixed set of invariants, explain the offer, define the buyer, and justify why the idea is commercially viable. The output format is constrained enough to support scoring and side-by-side comparison.
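To make that constraint concrete, here is a minimal sketch of one way the output could be structured for scoring and side-by-side comparison; the field names and the dictionary shape are illustrative assumptions, not the protocol's published schema.

```python
# Illustrative assumption: fixed fields covering the elements the protocol asks for.
RESPONSE_TEMPLATE = {
    "idea": "One-sentence statement of the proposed business.",
    "offer": "What is sold and how it is delivered.",
    "buyer": "Who pays, and why that buyer is reachable.",
    "viability": "Why the idea is commercially credible under the stated invariants.",
    "constraint_check": "How each fixed invariant is satisfied.",
}

def is_well_formed(response: dict) -> bool:
    """Structural gate: reject malformed outputs before any qualitative scoring."""
    return all(key in response and response[key] for key in RESPONSE_TEMPLATE)
```

Fixed keys like these make it cheap to discard malformed outputs before judging and keep comparisons between configurations aligned field by field.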
Sweep plan
One good answer proves almost nothing. This benchmark is meant to surface the tradeoffs among exploration, discipline, and repeatability, so it needs a small but meaningful grid rather than a single preset.
Dimension 01
Temperature
Test low, medium, and mildly high ranges to separate rigid compliance from productive exploration.
Dimension 02
Truncation strategy
Compare nucleus-heavy and Min-P-heavy regimes so the benchmark can see whether relative probability floors improve constraint adherence without crushing ideation.
Dimension 03
Seeds
Multiple seeds per configuration help distinguish true configuration effects from one lucky or unlucky sample.
Dimension 04
Provenance
The protocol records model, runtime, and date so future runs do not pretend provider or version drift never happened. A minimal sweep-grid sketch follows this list.
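As a concrete illustration of the grid above, the sketch below enumerates temperature, truncation regime, and seed combinations and stamps each run with provenance. The specific temperatures, top-p and Min-P values, seed list, and the model and runtime identifiers are placeholder assumptions; the protocol fixes only the qualitative ranges. The small min_p_keep_mask helper shows what a floor set relative to the most likely token means in practice.

```python
import itertools
from dataclasses import dataclass, field
from datetime import date

# Illustrative values only: the protocol names the ranges, not these exact numbers.
TEMPERATURES = [0.3, 0.7, 1.0]  # low, medium, mildly high
REGIMES = [
    {"name": "nucleus-heavy", "top_p": 0.9, "min_p": 0.0},
    {"name": "min-p-heavy", "top_p": 1.0, "min_p": 0.1},
]
SEEDS = [11, 23, 47]  # several seeds per configuration

@dataclass
class RunConfig:
    temperature: float
    regime: dict
    seed: int
    # Provenance fields so provider or version drift stays visible in later analysis.
    model: str = "example-model"      # placeholder identifier, not a real endpoint
    runtime: str = "example-runtime"  # placeholder
    run_date: str = field(default_factory=lambda: date.today().isoformat())

def min_p_keep_mask(probs: list[float], min_p: float) -> list[bool]:
    """Min-P keeps tokens whose probability clears a floor set relative to the
    most likely token, rather than an absolute or cumulative cutoff."""
    floor = min_p * max(probs)
    return [p >= floor for p in probs]

def build_grid() -> list[RunConfig]:
    """Enumerate every temperature x truncation-regime x seed combination."""
    return [
        RunConfig(temperature=t, regime=r, seed=s)
        for t, r, s in itertools.product(TEMPERATURES, REGIMES, SEEDS)
    ]

if __name__ == "__main__":
    print(len(build_grid()), "runs")  # 3 temperatures x 2 regimes x 3 seeds = 18
```

Building the grid explicitly also makes the run count obvious up front, which keeps the budget of the sweep honest before anything is scored.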
Rubric
The benchmark is only as good as its scoring. The rubric should make it obvious why a configuration is being preferred, rejected, or treated as promising but unstable.
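One hedged way to operationalize that: the axes below restate the criteria this protocol already names (specificity, plausibility, constraint adherence, exploration), while the 0-4 scale and the mean/spread summary are assumptions added for illustration rather than a published scoring standard.

```python
from statistics import mean, pstdev

# Axes drawn from the criteria stated in this protocol; scale and aggregation are assumed.
RUBRIC_AXES = ["specificity", "plausibility", "constraint_adherence", "exploration"]

def score_idea(ratings: dict[str, int]) -> float:
    """Average the per-axis ratings (each 0-4) for one generated idea."""
    return mean(ratings[axis] for axis in RUBRIC_AXES)

def summarize_config(per_seed_scores: list[float]) -> dict[str, float]:
    """Report level and spread so 'promising but unstable' stays visible."""
    return {
        "mean": round(mean(per_seed_scores), 2),
        "spread": round(pstdev(per_seed_scores), 2),
    }
```

Reporting spread alongside the mean is what lets a configuration be labeled promising but unstable instead of quietly averaging the instability away.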
Artifact plan
When this protocol gets a real result page, it should not just announce a winner. It should make the behavior shift visible enough that a reader could adopt a better default or challenge the conclusion.
This benchmark will be useful, but it is still one task family on one model/runtime setup at a time. The right posture is to publish a bounded takeaway, not a universal law about creativity settings.