Previewing Guardian Lite — Fast, Small, Private
A lightweight model optimized for speed. Sub-200ms median latency in early benchmarks, with the same privacy guarantees.
Today we are sharing an early preview of Guardian Lite — the second model in the Guardian family, designed for workloads where speed and cost-efficiency matter as much as capability.
The core insight behind Lite is that most AI tasks do not require deep reasoning. Classification, extraction, summarization, real-time chat completions, intent detection — these tasks benefit far more from low latency than from extended chain-of-thought. Guardian 1.0 Thinking is exceptional at complex reasoning. Guardian Lite is exceptional at everything else.
Early benchmarks show Guardian Lite achieving sub-200ms median response times on standard completion tasks, roughly 8–12x faster than Guardian 1.0 Thinking at its low reasoning setting. For applications where users expect immediate responses, such as chat interfaces, autocomplete, and voice assistants, that difference is felt directly.
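If you want to check median latency against your own workload, a minimal sketch of the measurement might look like the following. Everything here is an assumption for illustration: `fake_completion` is a hypothetical stand-in for a real request (no Guardian client API is shown or implied), and the helper names are ours. The sketch times a callable repeatedly, takes the median, and computes a speedup ratio between two medians.

```python
import statistics
import time
from typing import Callable, List


def measure_latencies_ms(call: Callable[[], object], runs: int = 50) -> List[float]:
    """Time `call` repeatedly and return each wall-clock duration in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def median_ms(samples: List[float]) -> float:
    """Median of the samples; less sensitive to slow outliers than the mean."""
    return statistics.median(samples)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def fake_completion() -> str:
    # Hypothetical stand-in for a real request; replace with your own client call.
    time.sleep(0.005)
    return "ok"


if __name__ == "__main__":
    samples = measure_latencies_ms(fake_completion, runs=20)
    print(f"median latency: {median_ms(samples):.1f} ms")
    # Illustrative numbers only: a 1600 ms baseline against a 200 ms candidate.
    print(f"speedup: {speedup(1600.0, 200.0):.1f}x")
```

The median is used rather than the mean because a handful of slow requests can skew an average without reflecting what most users experience.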
Lite inherits the full privacy architecture of the Guardian platform. Minimal retention by default and encryption in transit and at rest both carry over without compromise. Speed does not mean cutting corners on what matters.
The model is optimized for high-concurrency environments. Our inference runtime handles Lite workloads at significantly higher throughput than our larger models, which translates directly to lower pricing for users. We expect Lite to be the default choice for most production applications once it ships.
Guardian Lite is currently in internal testing. We are targeting a public release later this year. Preview access will be available to Onora users ahead of the general release. If you are building something that needs fast, private inference at scale, reach out.