Benchmarks
March 31, 2026

Guardian 2.0 Benchmark Report

Full performance analysis of the Guardian 2.0 family across reasoning, code, instruction following, and multimodal benchmarks.


Today we are publishing the full benchmark report for the Guardian 2.0 family: Guardian 2.0 Thinking, G-2.0-Lite, G-2.0-Code, and Guardian Pulse 2.0. This report covers evaluation methodology, score breakdowns, and context for interpreting each result.

Guardian 2.0 Thinking scores 76.5 on IFBench (instruction following). It also scores 86 on GPQA Diamond (graduate-level science reasoning), 85 on MMLU-Pro (professional knowledge), and 93 on AIME 2026. The model supports 200+ languages natively.

G-2.0-Lite is our compact, fast model. It scores 97 on MATH-500, 93 on AIME 2026, and 86 on GPQA Diamond. The context window expands substantially, and native vision support ships for the first time. These results represent the largest generational improvement in our lineup.

G-2.0-Code scores 72 on BFCL-V4 (function calling) and 77% on SWE-bench Verified, placing it among capable models for real-world software engineering. The sparse Mixture-of-Experts architecture delivers strong code quality at efficient inference cost, with long context for repository-scale understanding.

Guardian Pulse 2.0 unifies audio, video, and text in one model. ASR operates across 113 languages. TTS covers 36 languages, with strong voice quality in blind human evaluation. It handles a broad range of audio-visual tasks in a single endpoint.
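For readers who want to compare the figures side by side, the scores reported above for the three text models can be collected into a small lookup structure. The following is a minimal Python sketch, not an official tool: the names `REPORTED_SCORES`, `scores_for`, and `format_table` are illustrative assumptions, and the values are copied directly from this post.

```python
# Scores exactly as reported in this post. The dictionary layout and all
# function names here are illustrative assumptions, not part of any Guardian API.
REPORTED_SCORES = {
    "Guardian 2.0 Thinking": {
        "IFBench": 76.5,
        "GPQA Diamond": 86,
        "MMLU-Pro": 85,
        "AIME 2026": 93,
    },
    "G-2.0-Lite": {
        "MATH-500": 97,
        "AIME 2026": 93,
        "GPQA Diamond": 86,
    },
    "G-2.0-Code": {
        "BFCL-V4": 72,
        "SWE-bench Verified": 77,  # reported as a percentage
    },
}


def scores_for(benchmark):
    """Return {model: score} for every model that reports the given benchmark."""
    return {
        model: scores[benchmark]
        for model, scores in REPORTED_SCORES.items()
        if benchmark in scores
    }


def format_table():
    """Build one text line per benchmark, listing each model that reports it."""
    benchmarks = sorted({b for scores in REPORTED_SCORES.values() for b in scores})
    lines = []
    for bench in benchmarks:
        cells = ", ".join(f"{m}: {v}" for m, v in scores_for(bench).items())
        lines.append(f"{bench}: {cells}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table())
```

Grouping by benchmark rather than by model makes it easier to see which results are directly comparable, since not every model reports every benchmark.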

Where competitor scores are cited, we use the figures published by the model providers. Our own figures reflect our internal evaluation. Methodology notes are available on each model's detail page.