March 31, 2026 · 3 min read

Introducing Guardian Pulse 2.0

OQENYX's multimodal model. Speech recognition in 113 languages, text-to-speech in 36 — audio, video, and text unified in one Guardian endpoint.

Model Release · Benchmarks · Guardian · Multimodal

By OQENYX Research

Today we're releasing Guardian Pulse 2.0 — OQENYX's multimodal model: audio, video, and text unified in a single architecture, with a privacy-first design.

Audio, video, and text in one model — speech recognition across 113 languages and text-to-speech in 36, with strong voice quality in blind human evaluation.

Native: Audio · Video · Text
20 Voice Languages (TTS)
113 ASR Languages
36 TTS Languages
10h Max Audio Input
Unified Single Endpoint

Performance Overview

Voice Quality — TTS Human Evaluation (20 languages)

Model	Score
Guardian Pulse 2.0	92%
ElevenLabs v3	88.5%
Google TTS	82%
Azure TTS	79.5%

Speech Recognition Accuracy — ASR (113 languages)

Model	Accuracy
Guardian Pulse 2.0	94.5%
Whisper Large v4	91.2%
Google ASR	89%
Azure ASR	86.8%

[Chart: Audio-Visual Benchmark Wins (of 36 total), comparing Guardian Pulse 2.0, GPT-5 Vision, Gemini 3 Pro AV, and Claude AV]

Audio-Visual Benchmark Overview

Benchmark	Guardian Pulse 2.0	GPT-4o Audio	Gemini 3.1 Flash	ElevenLabs v3
ASR Languages Supported	113	-	-	-
TTS Languages Supported	36	-	-	-

What Guardian Pulse 2.0 Can Do

Audio — Up to 10 Hours Per Request

Pulse 2.0 processes audio inputs up to 10 hours in a single request within the 256K context window. Entire meetings, full-day recordings, and long-form podcasts can be transcribed, summarised, and analysed in one call — with speaker diarisation, timestamp accuracy, and multilingual switching handled natively.

Video — 400 Seconds at 720p

Video inputs up to 400 seconds at 720p. Pulse 2.0 understands scene content, spoken dialogue, on-screen text, and visual context simultaneously — enabling use cases like meeting recording analysis, video captioning, and content moderation that require joint audio-visual understanding.
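The input limits quoted above (10 hours of audio, 400 seconds of video at 720p) lend themselves to a client-side check before a request is sent. The sketch below is only an illustration under stated assumptions: the `MediaInput` class and `check_limits` function are hypothetical names, not part of any published Guardian SDK, and it assumes that "720p" is an upper bound on frame height rather than a required resolution.

```python
from dataclasses import dataclass

# Limits quoted in this post. The names here are illustrative and do not
# come from any official Guardian SDK.
MAX_AUDIO_SECONDS = 10 * 60 * 60  # 10 hours of audio per request
MAX_VIDEO_SECONDS = 400           # 400 seconds of video per request
MAX_VIDEO_HEIGHT_PX = 720         # assumption: "720p" is an upper bound


@dataclass
class MediaInput:
    kind: str                # "audio" or "video"
    duration_seconds: float
    height_px: int = 0       # only used for video


def check_limits(media: MediaInput) -> list:
    """Return a list of problems; an empty list means the input fits."""
    problems = []
    if media.kind == "audio":
        if media.duration_seconds > MAX_AUDIO_SECONDS:
            problems.append(
                f"audio is {media.duration_seconds:.0f}s, "
                f"limit is {MAX_AUDIO_SECONDS}s"
            )
    elif media.kind == "video":
        if media.duration_seconds > MAX_VIDEO_SECONDS:
            problems.append(
                f"video is {media.duration_seconds:.0f}s, "
                f"limit is {MAX_VIDEO_SECONDS}s"
            )
        if media.height_px > MAX_VIDEO_HEIGHT_PX:
            problems.append(
                f"video height is {media.height_px}px, "
                f"limit is {MAX_VIDEO_HEIGHT_PX}px"
            )
    else:
        problems.append(f"unsupported media kind: {media.kind!r}")
    return problems


if __name__ == "__main__":
    meeting = MediaInput(kind="audio", duration_seconds=3 * 60 * 60)
    clip = MediaInput(kind="video", duration_seconds=450, height_px=1080)
    print(check_limits(meeting))  # within the audio limit
    print(check_limits(clip))     # exceeds both video limits
```

Returning a list of problems rather than raising on the first one lets a caller report every violation at once, for example when a video is both too long and too high-resolution.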

Text-to-Speech in 36 Languages

The TTS engine in Pulse 2.0 covers 36 languages and delivers strong voice quality across 20+ of them in blind human evaluation. Prosody, intonation, and natural pacing are modelled jointly with semantic content rather than post-processed. Output is natural across European, Asian, and Middle Eastern language families.

ASR in 113 Languages

Automatic speech recognition in 113 languages, with 94.5% accuracy in the comparison reported under Performance Overview. Pulse 2.0 handles accent variation, code-switching (mixing languages mid-sentence), and noisy audio without degrading to a fallback model.

Audio-Visual Understanding

Across a broad suite of audio-visual tasks, Pulse 2.0 performs strongly on the evaluations that matter most for meeting intelligence, multilingual voice synthesis, and video summarisation.

Privacy and Infrastructure

All audio and video processing follows the same privacy-first architecture as every Guardian model: encrypted in transit and at rest, with minimal retention by default. Voice data, meeting recordings, and video content are handled with transparent processing and clear policies. For enterprise deployments handling sensitive voice data, minimal retention means no audio is retained after the request completes unless you consent.

Availability

Guardian Pulse 2.0 is available now through:

Onora — Voice mode in the Onora App is powered by Pulse 2.0

The Full 2.0 Release

Guardian Pulse 2.0 launches alongside the complete Guardian 2.0 family:

Guardian 2.0 Thinking — Frontier reasoning, IFBench 76.5, 200+ languages
G-2.0-Lite — Compact, fast model, AIME 93, native vision, long context
G-2.0-Code — Agentic coding, BFCL-V4 72, SWE-bench 77%
