Guardian Pulse 2.0 — Native Multimodal
Our multimodal model — audio, video and text in one endpoint. Speech recognition in 113 languages, text-to-speech in 36.
Guardian Pulse 2.0 is our multimodal model — a step forward in privacy-first audio and video AI. It processes audio, video, and text as first-class inputs in a single unified architecture, without adapters, without pre-processing, without quality loss from modality translation.
The capabilities: speech recognition across 113 languages, text-to-speech in 36 languages, and joint audio-visual understanding — all in one model.
Pulse 2.0 handles audio inputs up to 10 hours and video up to 400 seconds at 720p within a single long context window. That opens up use cases that were previously impractical with language-only models: meeting transcription, video summarisation, multilingual dubbing, and real-time voice assistants.
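As a rough illustration of those limits, a client could check a file's duration and resolution before submitting it. The sketch below is not part of any Guardian SDK: `MediaInput` and `check_limits` are hypothetical names, and treating 720p as an upper bound on video resolution is an assumption, since the announcement only states the figures.

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits quoted in the announcement.
MAX_AUDIO_SECONDS = 10 * 60 * 60  # audio: up to 10 hours
MAX_VIDEO_SECONDS = 400           # video: up to 400 seconds
MAX_VIDEO_HEIGHT_PX = 720         # assumption: 720p treated as a ceiling


@dataclass
class MediaInput:
    """Hypothetical description of a file about to be submitted."""
    kind: str                        # "audio" or "video"
    duration_seconds: float
    height_px: Optional[int] = None  # only meaningful for video


def check_limits(media: MediaInput) -> List[str]:
    """Return a list of human-readable problems; empty means within limits."""
    problems: List[str] = []
    if media.kind == "audio":
        if media.duration_seconds > MAX_AUDIO_SECONDS:
            problems.append("audio exceeds the 10 hour limit")
    elif media.kind == "video":
        if media.duration_seconds > MAX_VIDEO_SECONDS:
            problems.append("video exceeds the 400 second limit")
        if media.height_px is not None and media.height_px > MAX_VIDEO_HEIGHT_PX:
            problems.append("video resolution is above 720p")
    else:
        problems.append(f"unsupported media kind: {media.kind}")
    return problems


if __name__ == "__main__":
    # A two-hour meeting recording fits; a ten-minute 1080p clip does not.
    print(check_limits(MediaInput("audio", 2 * 60 * 60)))
    print(check_limits(MediaInput("video", 600, height_px=1080)))
```

A check like this only catches oversized inputs early; how the service itself responds to them is not described in the announcement.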
A note on availability: Pulse 2.0 launches with certain capabilities restricted at the consumer level. Voice cloning and speaker personalisation are technically implemented but not yet enabled. We are working through the regulatory framework — verified consent flows, audit trails, and GDPR-aligned biometric data handling — before releasing these features. We expect to open a limited early access programme for enterprise customers in the coming months.
All audio and video processing follows Guardian's privacy-first architecture: data is encrypted in transit and at rest, with minimal retention by default. Pulse 2.0 is available now through the Onora App.