Evaluating Phylax Preview
How we plan to evaluate Phylax Preview, an earlier preview model: the framework we use, what each test can and cannot tell us, and why benchmark numbers are only one signal among several.
Phylax Preview is an earlier preview generation. It is not the upcoming Phylax 1.0, and it is not yet available in the product. Before we publish results for any model, we want to be clear about how we measure it, what those measurements mean, and where they stop being useful.
This note describes the evaluation framework we apply to Phylax Preview. It does not contain scores. We would rather publish a careful method now and the numbers later than the other way round.
Benchmark numbers for Phylax Preview are not yet published. We are completing internal and public evaluation first. When results are ready, they will appear here with their methodology and caveats attached.
How we think about it
A benchmark is a set of questions with known answers, asked at scale. That makes it useful and also narrow. A good score tells you the model handled that distribution of questions. It does not tell you how the model behaves on your work, in your language, under your constraints. We hold both ideas at once.
Measure before claiming
A capability is only worth describing once we can observe it on held-out tasks the model has not seen during development.
Prefer honest gaps
Where a model is weaker, we want the evaluation to surface it. A method that only flatters the model is not a method.
Separate signal from story
A single number is easy to quote and easy to misread. We report it with the conditions that produced it.
The evaluation framework
We move through stages, from broad capability checks to behaviour that only shows up in real use. Each stage answers a different question, and a model can pass one while failing the next.
Capability suites
Established public tasks for reasoning, knowledge, instruction following, and coding. These give a comparable, repeatable starting point.
Held-out tasks
Problems written in-house and kept out of any training or tuning data, to check whether a score reflects ability rather than exposure.
Behavioural review
How the model handles ambiguity, refuses safely, asks for missing context, and admits uncertainty. Much of this is read by people, not scored by a script.
Safety and policy review
Careful model selection means checking conduct on sensitive prompts and edge cases as part of evaluation, not after it.
Product signals
Latency, retrieval quality, prompting, and the surrounding interface shape the experience as much as the model. We watch these together, not in isolation.
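To make the held-out stage concrete, here is a minimal sketch of one way to check whether a held-out prompt has leaked into a training corpus. It is an illustration under stated assumptions, not a description of any actual pipeline: the function names `fingerprint` and `find_exposed` and the sample prompts are hypothetical, and an exact-match hash like this will miss paraphrased or partial overlap.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a prompt after normalising case and whitespace."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_exposed(held_out: list[str], training_corpus: list[str]) -> list[str]:
    """Return held-out prompts whose fingerprint also appears in the corpus."""
    seen = {fingerprint(doc) for doc in training_corpus}
    return [task for task in held_out if fingerprint(task) in seen]


# Hypothetical sample data, for illustration only.
held_out = ["Sum the digits of 4096.", "Name the capital of Peru."]
training_corpus = ["name the  capital of peru.", "Translate 'hello' to French."]

exposed = find_exposed(held_out, training_corpus)
print(exposed)  # prompts that should be removed from the held-out set
```

A prompt flagged this way no longer tests ability rather than exposure, so it would be dropped or rewritten before scoring.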
What we test, and what we do not claim
Being explicit about the boundary is part of the method. The first list is what an evaluation can support. The second is what a score, on its own, cannot.
What we test
- Reasoning and knowledge on public, repeatable suites
- Instruction following on held-out, in-house tasks
- Calibrated uncertainty: whether the model knows when it does not know
- Safe refusal and handling of sensitive prompts
- Stability of results across repeated runs
What we do not claim
- That a benchmark score predicts quality on your specific work
- Any ranking against other providers or products
- That a preview model reflects the upcoming Phylax 1.0
- That a single run is a guarantee of consistent behaviour
- Any number we cannot both reproduce and clearly attribute
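The point about stability across repeated runs can be sketched in a few lines. The example below summarises several runs of the same suite as a mean and a spread instead of quoting one run. The function name `summarise_runs` is hypothetical and the scores are made-up placeholders, not Phylax Preview results.

```python
from statistics import mean, stdev


def summarise_runs(scores: list[float]) -> dict[str, float]:
    """Summarise repeated runs of one suite as a centre and a spread."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "runs": len(scores),
        "mean": mean(scores),
        "stdev": stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


# Placeholder scores for illustration only; not real results.
run_scores = [0.70, 0.72, 0.68, 0.71, 0.69]
summary = summarise_runs(run_scores)
print(summary)
```

Reporting the spread alongside the mean is what keeps a single favourable run from being read as a guarantee.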
Why a benchmark is only one signal
Imagine two models with the same headline score. One asks a clarifying question when a request is ambiguous and says when it is unsure. The other answers confidently every time, including when it is wrong. The score does not separate them. The experience of using them is not close.
A score measures the answer. It does not measure the judgement that decides whether to answer at all. We care about both, and only one of them fits in a table.
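A small worked example shows how two models can share a headline score and still differ. It assumes, hypothetically, that each model reports a confidence between 0 and 1 with every answer, and it uses the Brier score (the mean squared gap between stated confidence and outcome) as one possible second measure. The data is invented, and this captures only a slice of judgement: it says nothing about clarifying questions or refusals.

```python
def accuracy(correct: list[bool]) -> float:
    """Fraction of answers that were right: the headline score."""
    return sum(correct) / len(correct)


def brier_score(confidence: list[float], correct: list[bool]) -> float:
    """Mean squared gap between stated confidence and outcome (lower is better)."""
    gaps = [(c - float(ok)) ** 2 for c, ok in zip(confidence, correct)]
    return sum(gaps) / len(gaps)


# Invented data: both models get the same three of four answers right.
correct = [True, True, True, False]
hedging_confidence = [0.9, 0.9, 0.9, 0.3]        # unsure on the one it gets wrong
overconfident_confidence = [1.0, 1.0, 1.0, 1.0]  # certain every time

headline = accuracy(correct)
hedging_brier = brier_score(hedging_confidence, correct)
overconfident_brier = brier_score(overconfident_confidence, correct)
print(headline, hedging_brier, overconfident_brier)
```

Both models score the same on accuracy, while the model that signals doubt on its wrong answer has the lower Brier score. The headline number alone hides that difference.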
Real quality depends on parts of the system that no single benchmark captures: how requests are routed to the right model, how prompts are constructed, how retrieval brings in the right context, how latency feels in practice, and how the interface helps a person stay in control. We treat the model as one component of that whole.
Reading results carefully
When the numbers are ready, we will report them with three things attached, so they can be read for what they are.
Conditions
The exact suite, version, and settings used, so a result can be reproduced rather than taken on trust.
Scope
What the test covers and, just as important, what it leaves out.
Drift
A note that results can change as models, prompts, and evaluation sets change over time.
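One way to keep conditions, scope, and drift attached to a number is to store them in the same record, so the score cannot be quoted without them. The sketch below is an assumption about how that might look, not a published schema: the class `ReportedResult`, its field names, and every value shown are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ReportedResult:
    """A score bundled with the conditions, scope, and date that produced it."""
    suite: str
    suite_version: str
    settings: dict[str, str]
    covers: list[str]
    leaves_out: list[str]
    measured_on: date  # results can drift, so the date travels with the score
    score: float

    def __post_init__(self) -> None:
        # Refuse to build a record that could not be reproduced or scoped.
        if not self.suite or not self.suite_version:
            raise ValueError("a result needs its suite and version")
        if not self.leaves_out:
            raise ValueError("state what the test leaves out")

    def summary_line(self) -> str:
        settings = ", ".join(f"{k}={v}" for k, v in sorted(self.settings.items()))
        return (
            f"{self.score:.3f} on {self.suite} v{self.suite_version} "
            f"({settings}; measured {self.measured_on.isoformat()}; "
            f"excludes: {', '.join(self.leaves_out)})"
        )


# Placeholder values for illustration only; not a real result.
result = ReportedResult(
    suite="example-suite",
    suite_version="0.1",
    settings={"temperature": "0", "shots": "0"},
    covers=["short-form reasoning"],
    leaves_out=["multilingual prompts"],
    measured_on=date(2024, 1, 1),
    score=0.5,
)
line = result.summary_line()
print(line)
```

The validation step is the design choice that matters here: a record with no version or no stated exclusions fails at construction instead of reaching a table.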
Next step
We are completing evaluation for Phylax Preview now. When results are ready, this page will carry them, with their methodology and caveats kept alongside the numbers rather than separate from them.