OQENYX

Introducing Guardian Pulse 2.0

OQENYX's multimodal model. Speech recognition in 113 languages, text-to-speech in 36 — audio, video, and text unified in one Guardian endpoint.

Model Release · Benchmarks · Guardian · Multimodal
By OQENYX Research

Today we're releasing Guardian Pulse 2.0 — OQENYX's multimodal model: audio, video, and text unified in a single architecture, with a privacy-first design.

Audio, video, and text in one model — speech recognition across 113 languages and text-to-speech in 36, with strong voice quality in blind human evaluation.

  • Native: Audio · Video · Text
  • Voice Languages (TTS): 20
  • ASR Languages: 113
  • TTS Languages: 36
  • Max Audio Input: 10h
  • Single Endpoint: Unified

Performance Overview

Voice Quality — TTS Human Evaluation (20 languages)

  • Guardian Pulse 2.0: 92%
  • ElevenLabs v3: 88.5%
  • Google TTS: 82%
  • Azure TTS: 79.5%

Speech Recognition Accuracy — ASR (113 languages)

  • Guardian Pulse 2.0: 94.5%
  • Whisper Large v4: 91.2%
  • Google ASR: 89%
  • Azure ASR: 86.8%

Audio-Visual Benchmark Wins (of 36 total)

  • Guardian Pulse 2.0: 22
  • GPT-5 Vision: 17
  • Gemini 3 Pro AV: 14
  • Claude AV: 10

Audio-Visual Benchmark Overview

Compared models: Guardian Pulse 2.0, GPT-4o Audio, Gemini 3.1 Flash, ElevenLabs v3. Only the Guardian Pulse 2.0 values are listed:

  • ASR Languages Supported: 113
  • TTS Languages Supported: 36

What Guardian Pulse 2.0 Can Do

Audio — Up to 10 Hours Per Request

Pulse 2.0 processes audio inputs up to 10 hours in a single request within the 256K context window. Entire meetings, full-day recordings, and long-form podcasts can be transcribed, summarised, and analysed in one call — with speaker diarisation, timestamp accuracy, and multilingual switching handled natively.
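The two figures in this paragraph (10 hours of audio, a 256K context window) imply a ceiling on how many tokens each second of audio can occupy. The sketch below works through that arithmetic. It assumes "256K" means 256 × 1024 tokens and that the whole window could be spent on audio; neither is stated in this post, and the function names are hypothetical helpers, not part of any Guardian API.

```python
# Back-of-the-envelope arithmetic from the figures in this post.
# Assumption: "256K" means 256 * 1024 tokens (the post does not say).
CONTEXT_TOKENS = 256 * 1024
MAX_AUDIO_SECONDS = 10 * 60 * 60  # 10 hours, as stated in the post


def implied_tokens_per_second(context_tokens: int = CONTEXT_TOKENS,
                              audio_seconds: int = MAX_AUDIO_SECONDS) -> float:
    """Upper bound on tokens per second of audio if 10 hours must fit in the window."""
    return context_tokens / audio_seconds


def audio_seconds_that_fit(reserved_tokens: int,
                           context_tokens: int = CONTEXT_TOKENS,
                           audio_seconds: int = MAX_AUDIO_SECONDS) -> float:
    """Audio duration that still fits after reserving tokens for prompt and output.

    Uses the same implied rate as above; the real encoding rate is unknown.
    """
    if not 0 <= reserved_tokens <= context_tokens:
        raise ValueError("reserved_tokens must be between 0 and context_tokens")
    return (context_tokens - reserved_tokens) * audio_seconds / context_tokens


if __name__ == "__main__":
    print(f"implied rate: {implied_tokens_per_second():.2f} tokens/s")
    print(f"audio that fits with 8192 tokens reserved: "
          f"{audio_seconds_that_fit(8192) / 3600:.2f} h")
```

Under these assumptions the implied rate is roughly 7.3 tokens per second of audio, so any tokens reserved for instructions or output reduce the audio that fits proportionally.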

Video — 400 Seconds at 720p

Video inputs up to 400 seconds at 720p. Pulse 2.0 understands scene content, spoken dialogue, on-screen text, and visual context simultaneously — enabling use cases like meeting recording analysis, video captioning, and content moderation that require joint audio-visual understanding.
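Because the limits above are hard numbers (10 hours of audio, 400 seconds of video at 720p), a client can check inputs before uploading. The following is a minimal sketch of such a pre-flight check. The `MediaInput` type and `limit_violations` function are hypothetical, and treating 720p as a maximum frame height is an assumption; the post does not say how larger inputs are handled.

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits stated in this post.
MAX_AUDIO_SECONDS = 10 * 60 * 60  # 10 hours
MAX_VIDEO_SECONDS = 400
# Assumption: "720p" is treated here as a maximum frame height in pixels.
MAX_VIDEO_HEIGHT_PX = 720


@dataclass(frozen=True)
class MediaInput:
    """Hypothetical description of a file about to be sent to the model."""
    kind: str                        # "audio" or "video"
    duration_seconds: float
    height_px: Optional[int] = None  # only meaningful for video


def limit_violations(media: MediaInput) -> List[str]:
    """Return human-readable problems; an empty list means the input is within limits."""
    problems: List[str] = []
    if media.kind == "audio":
        if media.duration_seconds > MAX_AUDIO_SECONDS:
            problems.append(f"audio exceeds {MAX_AUDIO_SECONDS} s")
    elif media.kind == "video":
        if media.duration_seconds > MAX_VIDEO_SECONDS:
            problems.append(f"video exceeds {MAX_VIDEO_SECONDS} s")
        if media.height_px is not None and media.height_px > MAX_VIDEO_HEIGHT_PX:
            problems.append(f"video height exceeds {MAX_VIDEO_HEIGHT_PX} px")
    else:
        problems.append(f"unknown media kind: {media.kind!r}")
    return problems


if __name__ == "__main__":
    clip = MediaInput(kind="video", duration_seconds=450, height_px=1080)
    for problem in limit_violations(clip):
        print(problem)
```

Returning a list rather than raising on the first failure lets a caller report every problem at once, for example that a clip needs both trimming and downscaling.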

Text-to-Speech in 36 Languages

The TTS engine in Pulse 2.0 supports 36 languages and delivers strong voice quality in blind human evaluation across 20+ of them. Prosody, intonation, and natural pacing are modelled jointly with semantic content rather than post-processed. Output is natural across European, Asian, and Middle Eastern language families.

ASR in 113 Languages

Automatic speech recognition covers 113 languages, scoring 94.5% in the speech recognition accuracy comparison in the Performance Overview. Pulse 2.0 handles accent variation, code-switching (mixing languages mid-sentence), and noisy audio without falling back to a secondary model.

Audio-Visual Understanding

Pulse 2.0 wins 22 of the 36 audio-visual benchmarks in the comparison above, with its strongest results on the evaluations most relevant to meeting intelligence, multilingual voice synthesis, and video summarisation.

Privacy and Infrastructure

All audio and video processing follows the same privacy-first architecture as every Guardian model: data is encrypted in transit and at rest, with minimal retention by default. Voice data, meeting recordings, and video content are handled with transparent processing and clear policies. For enterprise deployments handling sensitive voice data, minimal retention means no audio is kept after the request completes unless you consent.

Availability

Guardian Pulse 2.0 is available now through:

  • Onora — Voice mode in the Onora App is powered by Pulse 2.0

The Full 2.0 Release

Guardian Pulse 2.0 launches alongside the complete Guardian 2.0 family:

  • Guardian 2.0 Thinking — Frontier reasoning, IFBench 76.5, 200+ languages
  • G-2.0-Lite — Compact, fast model, AIME 93, native vision, long context
  • G-2.0-Code — Agentic coding, BFCL-V4 72, SWE-bench 77%