Executive Summary
Speech-to-speech (S2S) models represent a fundamental architectural shift in conversational AI. By processing audio directly without intermediate text representation, these systems achieve latency and naturalness characteristics that approach human conversation dynamics for the first time.
This white paper examines the technical foundations of S2S architectures, compares their capabilities to traditional pipeline approaches, and provides guidance for organizations evaluating these emerging technologies for voice agent deployments.
Key Findings
- S2S architectures achieve end-to-end latencies of 300-500ms, compared to 800-1200ms for traditional pipelines
- Native audio processing preserves paralinguistic information lost in text-mediated approaches
- Current S2S systems show trade-offs in controllability and interpretability compared to pipeline architectures
- Hybrid approaches combining S2S speed with pipeline reliability are emerging as practical deployment patterns
The Traditional Pipeline Architecture
Conventional voice agent systems process speech through a sequential pipeline of specialized components, each optimized for a specific transformation.
Automatic Speech Recognition (ASR)
The ASR component converts audio waveforms to text transcriptions. Modern ASR systems achieve word error rates below 5% for clear speech in supported languages. However, ASR introduces latency as the system must accumulate sufficient audio context before producing reliable transcriptions—typically 500-1000ms of audio before initial results.
Natural Language Understanding (NLU)
The NLU layer interprets transcribed text to extract intent, entities, and semantic meaning. This processing adds 50-150ms of latency depending on model complexity and input length.
Dialogue Management
Dialogue managers maintain conversation state and determine appropriate responses based on extracted intents and business logic. Well-optimized dialogue systems add minimal latency (10-50ms) but represent a critical control point for conversation flow.
Text-to-Speech (TTS)
Finally, TTS systems synthesize audio responses from generated text. Neural TTS systems achieve high naturalness but require 200-400ms to generate initial audio, with streaming capabilities reducing perceived latency.
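The per-stage figures above can be summed into a rough end-to-end budget. The sketch below uses the illustrative ranges cited in this section (not measurements); streaming overlap between stages is what pulls real deployments down toward the 800-1200ms range.

```python
# Rough end-to-end latency budget for a traditional pipeline, using the
# illustrative per-stage ranges cited in this section.
STAGES_MS = {
    "asr": (500, 1000),    # audio accumulation before reliable transcription
    "nlu": (50, 150),      # intent/entity extraction
    "dialogue": (10, 50),  # state tracking and response selection
    "tts": (200, 400),     # time to first synthesized audio
}

def latency_budget(stages):
    """Sum per-stage (min, max) latencies into an end-to-end range."""
    lo = sum(r[0] for r in stages.values())
    hi = sum(r[1] for r in stages.values())
    return lo, hi

lo, hi = latency_budget(STAGES_MS)
print(f"Pipeline end-to-end: {lo}-{hi} ms before streaming overlap")
```

A strictly sequential pipeline lands at 760-1600ms by this arithmetic; overlapping stages (streaming ASR into NLU, streaming TTS output) is how optimized implementations reach the lower figures quoted above.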
Speech-to-Speech Architecture
Speech-to-speech models take a fundamentally different approach: processing audio input directly to audio output without explicit text intermediaries.
Audio Tokenization
Modern S2S systems begin by converting continuous audio waveforms into discrete token sequences using neural audio codecs. These codecs, exemplified by approaches like EnCodec and SoundStream, learn to represent audio as sequences of tokens from a learned vocabulary—similar to how text tokenizers represent words as token sequences.
This tokenization enables audio to be processed by transformer architectures originally designed for text, while preserving acoustic information including prosody, emotion, and speaker characteristics.
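The core idea can be illustrated with a toy vector quantizer: each short frame of audio features is mapped to the ID of its nearest codebook vector, turning continuous audio into a discrete token sequence. This is a deliberately simplified sketch — real codecs such as EnCodec and SoundStream learn the codebook end-to-end and use residual quantization across multiple codebooks.

```python
import numpy as np

# Toy neural-codec-style tokenization: map each feature frame to the ID of
# its nearest codebook entry. The codebook here is random; real codecs
# learn it jointly with an encoder/decoder.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 8))   # 1024-entry "vocabulary" of 8-dim codes

def tokenize(frames, codebook):
    """Return one discrete token (codebook index) per feature frame."""
    # Pairwise distances: (n_frames, n_codes)
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

frames = rng.normal(size=(50, 8))       # 50 frames of 8-dim audio features
tokens = tokenize(frames, codebook)
print(tokens.shape)                     # (50,) — a discrete token sequence
```

The resulting integer sequence is what the transformer operates on, exactly as it would operate on text token IDs.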
Audio Language Models
The core of S2S systems is a language model trained to predict audio token sequences conditioned on input audio tokens (and optionally text). These models learn patterns of natural conversation including:
Turn-taking dynamics: When to begin responding, how to handle interruptions, appropriate pause lengths.
Prosodic continuity: Matching response intonation and energy to the conversational context.
Paralinguistic understanding: Interpreting and responding to emotional cues, emphasis, and non-verbal sounds.
Streaming Generation
S2S systems generate responses in streaming fashion, beginning audio output while still processing input. This streaming capability is fundamental to achieving conversational latency—the system doesn't wait for complete input processing before beginning response generation.
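The interleaving can be sketched with a generator: a stub "model" begins emitting output once a small window of input context has arrived, rather than waiting for end-of-input. All names and the context threshold are illustrative assumptions.

```python
# Minimal sketch of interleaved streaming generation: output begins after a
# small input context window, while input is still arriving.
MIN_CONTEXT = 4  # input tokens required before generation may start (assumed)

def stream_s2s(input_tokens, respond):
    """Yield output tokens interleaved with incremental input consumption."""
    context = []
    for tok in input_tokens:          # input arrives token by token
        context.append(tok)
        if len(context) >= MIN_CONTEXT:
            yield respond(context)    # emit output while input continues

# Stub "model": a trivial transform of the latest context token.
outputs = list(stream_s2s(range(10), respond=lambda ctx: ctx[-1] * 2))
print(outputs)  # generation started after 4 input tokens, not after all 10
```

A pipeline system, by contrast, would consume all ten input tokens, then run each stage to completion before the first output token appeared.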
"The shift from pipeline to S2S is analogous to the shift from phrase-based to neural machine translation—a fundamental architectural change that unlocks new capability levels."
Technical Comparison
Latency Analysis
S2S systems achieve their latency advantages through parallel processing and elimination of intermediate representations. Where pipeline systems must complete each stage before beginning the next, S2S systems process input and generate output in an interleaved fashion.
Our benchmarking shows S2S systems achieving consistent end-to-end latencies of 300-500ms—approaching the ~200ms response latency typical of human conversation. This represents a 50-60% reduction compared to optimized pipeline implementations.
Naturalness and Prosody
By operating directly on audio representations, S2S systems preserve and generate paralinguistic information that text-mediated systems struggle to capture. Human evaluation studies show S2S responses are rated as more natural and contextually appropriate, particularly in emotionally nuanced conversations.
However, TTS systems have also improved significantly. The naturalness gap between S2S and high-quality TTS is narrower than the latency gap, meaning naturalness alone may not justify S2S adoption.
Controllability
Pipeline architectures offer clear intervention points: transcriptions can be reviewed, intents can be logged, and responses can be templated or filtered. S2S systems are more opaque—the mapping from input audio to output audio is learned end-to-end without explicit intermediate representations.
This presents challenges for applications requiring:
Auditability: Regulated industries may require logging of what the system "understood" from user input.
Content control: Ensuring responses stay within approved boundaries is more complex without text intermediaries.
Debugging: Diagnosing conversation failures requires different tooling when there's no transcript to analyze.
Multilingual Capabilities
S2S systems trained on multilingual data show promising cross-lingual capabilities, including code-switching and accent preservation that pipeline systems handle poorly. However, language coverage for S2S systems currently lags behind mature ASR/TTS components.
Hybrid Architectures
Given the trade-offs between S2S and pipeline approaches, hybrid architectures are emerging as practical deployment patterns.
S2S with Text Supervision
Some implementations use S2S for primary processing while maintaining a parallel ASR pathway for logging and monitoring. This preserves S2S latency advantages while providing the auditability required for enterprise deployments.
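One way to structure this is to run ASR off the latency-critical path: the S2S response is returned immediately, while a background transcription populates the audit log when it completes. The sketch below assumes placeholder `s2s_respond` and `asr_transcribe` functions — they stand in for real components, not any specific API.

```python
from concurrent.futures import ThreadPoolExecutor

audit_log = []

def s2s_respond(audio):
    return f"<response-audio for {audio}>"     # placeholder S2S path

def asr_transcribe(audio):
    return f"transcript({audio})"              # placeholder ASR path

def handle_turn(audio, pool):
    """Answer via S2S immediately; log an ASR transcript in the background."""
    transcript_future = pool.submit(asr_transcribe, audio)
    response = s2s_respond(audio)              # latency-critical path
    transcript_future.add_done_callback(
        lambda f: audit_log.append(f.result()))  # logged whenever ready
    return response

with ThreadPoolExecutor(max_workers=2) as pool:
    resp = handle_turn("utt-1", pool)
print(resp, audit_log)
```

The key property is that the user-facing response never blocks on transcription; the audit record trails it by whatever the ASR pass takes.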
Adaptive Routing
More sophisticated architectures route conversations dynamically between S2S and pipeline processing based on conversation characteristics. Simple, common interactions use fast S2S paths, while complex or sensitive conversations fall back to more controllable pipeline processing.
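A minimal router might use cheap per-turn features to pick a path. The topic set, turn-count threshold, and feature names below are illustrative assumptions, not a prescribed policy.

```python
# Sketch of adaptive routing: simple per-turn features decide between the
# fast S2S path and the controllable pipeline path.
SENSITIVE_TOPICS = {"payment", "medical", "legal"}   # assumed policy

def route(turn):
    """Return 's2s' for simple turns, 'pipeline' for complex/sensitive ones."""
    if turn.get("topic") in SENSITIVE_TOPICS:
        return "pipeline"              # auditability and control required
    if turn.get("turn_count", 0) > 20:
        return "pipeline"              # long conversation, likely complex
    return "s2s"                       # fast path for the common case

print(route({"topic": "weather", "turn_count": 2}))   # s2s
print(route({"topic": "payment", "turn_count": 2}))   # pipeline
```

In production the routing signal would come from a lightweight classifier rather than hand-written rules, but the fallback structure is the same.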
S2S with Guardrails
Another pattern applies content safety checks to S2S outputs before delivery, adding modest latency but ensuring response appropriateness. This approach is particularly relevant for customer-facing deployments where inappropriate responses carry significant risk.
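A guardrail of this kind can be sketched as a transcript check gating delivery: the S2S output is transcribed, a text-level safety check runs, and a safe fallback replaces any response that fails. The `transcribe` stub and blocklist classifier below are stand-ins for real components (a production system would run ASR on the output audio and a proper safety model).

```python
# Sketch of an output guardrail: check a transcript of the S2S response
# before releasing audio to the user. All components are placeholders.
BLOCKLIST = {"refund guarantee", "legal advice"}     # assumed policy

def transcribe(audio_response):
    return audio_response["text"]     # placeholder: real systems run ASR here

def is_safe(text):
    return not any(phrase in text.lower() for phrase in BLOCKLIST)

def deliver(audio_response, fallback="Let me connect you with an agent."):
    """Release the S2S response only if its transcript passes the check."""
    if is_safe(transcribe(audio_response)):
        return audio_response["text"]
    return fallback

print(deliver({"text": "Your order ships tomorrow."}))
print(deliver({"text": "We offer a refund guarantee on everything."}))
```

The added latency is the transcription plus classification time on the first output chunk, which is the "modest latency" trade-off described above.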
Implementation Considerations
Infrastructure Requirements
S2S models are computationally intensive, requiring significant GPU resources for inference. Organizations should plan for dedicated GPU infrastructure or cloud GPU allocation. Latency-sensitive deployments may require GPU resources positioned close to end users.
Training Data
Custom S2S model development requires paired audio conversation data—significantly more challenging to collect and curate than text training data. Most deployments will use pre-trained S2S models, potentially with domain-specific fine-tuning.
Evaluation Frameworks
Traditional NLU evaluation metrics don't apply directly to S2S systems. Organizations need new evaluation frameworks that assess audio-native properties including response appropriateness, prosodic matching, and turn-taking behavior.
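One concrete audio-native metric is the response gap: the silence between the end of a user turn and the start of the agent's reply, computed from turn timestamps. The segment format and the 500ms target below are illustrative assumptions.

```python
# Sketch of a turn-taking metric: response gaps between user turn end and
# agent turn start, from timestamped conversation segments (assumed format).
def response_gaps_ms(turns):
    """Gaps (ms) between each user turn end and the following agent turn start."""
    gaps = []
    for prev, nxt in zip(turns, turns[1:]):
        if prev["speaker"] == "user" and nxt["speaker"] == "agent":
            gaps.append(nxt["start_ms"] - prev["end_ms"])
    return gaps

turns = [
    {"speaker": "user",  "start_ms": 0,    "end_ms": 1800},
    {"speaker": "agent", "start_ms": 2150, "end_ms": 3900},
    {"speaker": "user",  "start_ms": 4200, "end_ms": 5000},
    {"speaker": "agent", "start_ms": 5420, "end_ms": 7000},
]
gaps = response_gaps_ms(turns)
print(gaps, max(gaps) <= 500)  # both gaps within a 500 ms latency target
```

Similar timestamp-based measures can capture interruption handling (negative gaps) and pause distributions, complementing human ratings of prosodic appropriateness.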
Market Landscape
Several organizations are actively developing S2S capabilities:
Research labs at major technology companies have published S2S architectures demonstrating state-of-the-art capabilities, though production availability varies.
Specialized startups are emerging with S2S-native platforms designed specifically for voice agent applications.
Existing voice AI vendors are incorporating S2S capabilities into their platforms, often as premium tiers or beta features.
We expect S2S capabilities to become standard offerings from major voice AI platforms within 12-18 months.
Recommendations
For organizations evaluating S2S technology:
Assess latency sensitivity: S2S provides greatest value where sub-500ms latency is critical to user experience. Applications tolerant of 800ms+ latency may not justify S2S complexity.
Evaluate controllability requirements: Highly regulated industries or applications requiring strict content control should carefully assess S2S limitations and plan for hybrid architectures.
Plan for infrastructure: GPU compute requirements for S2S inference should be factored into total cost of ownership calculations.
Start with pilots: Given the emerging nature of S2S technology, pilot deployments in controlled environments allow organizations to evaluate real-world performance before broad rollout.
Conclusion
Speech-to-speech architectures represent a significant advancement in voice agent technology, enabling conversation dynamics that approach human-to-human interaction for the first time. While current systems present trade-offs in controllability and infrastructure requirements, these challenges are being actively addressed.
For organizations building next-generation voice experiences, S2S technology merits serious evaluation—whether as a primary architecture or as part of a hybrid approach that balances speed with control.
Technical References
This white paper synthesizes findings from published research on audio language models, neural audio codecs, and conversational AI systems. Key architectural concepts are derived from work on speech-to-speech translation, audio generation, and multimodal language models. Benchmarking data is based on our internal testing of available S2S implementations as of November 2025.