The holy grail of text-to-speech is deceptively simple: make it sound human. Not robotic. Not uncanny. Human. ElevenLabs has cracked this code with proprietary magic, but for those of us who prefer open weights and local inference, the question becomes: how close can we get?
This isn’t a tutorial. It’s a lab notebook—a record of experiments, failures, and breakthroughs in the quest to generate studio-quality voices from models like XTTS, Bark, StyleTTS2, and the new wave of codec-based synthesizers.
Key Takeaways
- Temperature sweet spot is 0.6–0.8 — adaptive scheduling by phoneme type yields best results
- CFG scale of 2.0–2.5 balances faithfulness and expression
- The 3-second rule for voice cloning: quality over quantity in reference audio
- DAC at 16 kbps produces the most natural codec results
- Open-source can achieve 90% of ElevenLabs quality with careful tuning
The Anatomy of “Authentic”
Before we touch any code, we need to define what makes a voice sound real. It’s not just clarity. It’s:
- Prosody — The rhythm, stress, and intonation patterns that convey meaning beyond words.
- Micro-variations — Subtle pitch wobbles, breath sounds, and hesitations that scream “human.”
- Timbre consistency — A voice that doesn’t waver unnaturally between phonemes.
- Emotional range — The ability to convey different affects without sounding like a different person.
ElevenLabs excels at all four. Most open-source models nail clarity but stumble on micro-variations and emotional range. That’s where inference tuning comes in.
The Temperature Paradox
Temperature controls the randomness of sampling. Higher temperature = more variation = more “human”… right?
Not quite.
- Too low (0.1–0.3): Robotic, monotone, predictable
- Sweet spot (0.6–0.8): Natural variation, maintained coherence
- Too high (1.0+): Chaotic, slurred, unstable
But here’s what I discovered: optimal temperature varies by phonetic context.
Vowels benefit from higher temperature (more tonal variation), while consonants need lower temperature (precision matters). This led me to experiment with adaptive temperature scheduling:
```python
def adaptive_temperature(phoneme_type, base_temp=0.7):
    """Adjust temperature based on phoneme characteristics."""
    modifiers = {
        'vowel': 1.15,        # More expression
        'fricative': 0.85,    # Precision for s, f, sh
        'plosive': 0.90,      # Clarity for p, t, k
        'nasal': 1.05,        # Slight warmth for m, n
        'approximant': 1.10,  # Natural flow for r, l, w
    }
    return base_temp * modifiers.get(phoneme_type, 1.0)
```
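Applied per token, with a hypothetical phonemizer that tags each phoneme (the tagging step itself is outside this sketch):

```python
# Illustrative tagged phonemes for the word "cast"
phonemes = [('k', 'plosive'), ('a', 'vowel'), ('s', 'fricative'), ('t', 'plosive')]
temps = [adaptive_temperature(ptype) for _, ptype in phonemes]
# -> [0.63, 0.805, 0.595, 0.63]
```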
The CFG Scale: Balancing Faithfulness and Expression
Classifier-Free Guidance (CFG) is borrowed from image generation but applies beautifully to TTS. It controls how closely the model follows the conditioning (voice prompt, emotion tags, etc.) versus free generation.
The math is elegant:
output = uncond + cfg_scale × (cond - uncond)
Where:
- `uncond` = what the model generates with no conditioning
- `cond` = what the model generates with full conditioning
- `cfg_scale` = how much to push toward the conditioned output
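In code, the guidance step is a one-liner over the two forward passes; a minimal sketch, assuming the model exposes both conditioned and unconditioned logits (the names here are mine):

```python
def apply_cfg(cond_logits, uncond_logits, cfg_scale=2.0):
    """Push the output toward the conditioned prediction."""
    # Works on NumPy arrays and torch tensors alike
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)
```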
My experiments:
| CFG Scale | Result |
|---|---|
| 1.0 | Flat, generic voice |
| 2.0 | Good balance, natural |
| 3.0 | Highly expressive, slight artifacts |
| 5.0+ | Over-saturated, distorted |
The sweet spot for most models: 2.0–2.5.
But here’s the trick: CFG scale should increase with sentence length. Short utterances need less guidance; longer passages benefit from stronger conditioning to prevent drift.
```python
def dynamic_cfg(text_length, base_cfg=2.0):
    """Scale CFG based on utterance length."""
    # length_factor runs 0 to 2.0, capping at ~100 chars
    length_factor = min(text_length / 50, 2.0)
    # Multiplier runs 0.8x (very short) to 1.6x (long passages)
    return base_cfg * (0.8 + 0.4 * length_factor)
```
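A few sample values, computed straight from the function above:

```python
for n in (25, 50, 150):
    print(n, dynamic_cfg(n))
# 25 -> 2.0, 50 -> 2.4, 150 -> 3.2 (the cap)
```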
Top-k, Top-p, and the Repetition Penalty Trifecta
Three parameters that are often cargo-culted without understanding:
1. Top-k Sampling — Limits vocabulary to the k most probable tokens. For TTS:
   - k = 50: Safe, stable, slightly less varied
   - k = 200: More natural variation
   - k = 500+: Diminishing returns, risk of artifacts
2. Top-p (Nucleus) Sampling — Dynamically selects tokens until cumulative probability reaches p.
   - p = 0.7: Conservative, stable
   - p = 0.9: Natural speech patterns
   - p = 0.95+: Risky, occasional stumbles
3. Repetition Penalty — Crucial for preventing the “stuck in a loop” failure mode.
   - 1.0: No penalty (can cause repetition artifacts)
   - 1.1–1.2: Sweet spot for natural speech
   - 1.5+: Forced variation, sounds unnatural
The winning combination I’ve settled on:
```python
sampling_config = {
    'temperature': 0.7,
    'top_k': 200,
    'top_p': 0.9,
    'repetition_penalty': 1.15,
    'length_penalty': 1.0,
}
```
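How these get passed in depends on the model wrapper; as a sketch, assuming a `synthesize` call that accepts sampling kwargs (the same hypothetical API the A/B rig below uses):

```python
audio = model.synthesize("Testing the winning combination.", **sampling_config)
```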
Voice Cloning: The 3-Second Rule
Most open-source models support voice cloning from reference audio. But not all reference audio is created equal.
The 3-Second Rule: A 3-6 second clip with:
- Clear speech (no background noise)
- Varied intonation (not monotone reading)
- Representative timbre (the “average” of the target voice)
…outperforms a 30-second clip of flat reading every time.
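A cheap gate worth running before cloning, as a minimal sketch (assumes soundfile is installed; it checks only the duration half of the rule, not noise or intonation):

```python
import soundfile as sf

def check_reference(path):
    """Reject reference clips outside the 3-6 second window."""
    audio, sample_rate = sf.read(path)
    seconds = len(audio) / sample_rate
    if not 3.0 <= seconds <= 6.0:
        raise ValueError(f"clip is {seconds:.1f}s; aim for 3-6s of varied speech")
```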
Embedding Interpolation
Want to blend voices or create variations? Linear interpolation in embedding space works surprisingly well:
```python
import numpy as np

def blend_voices(embed_a, embed_b, ratio=0.5):
    """Blend two voice embeddings."""
    blended = embed_a * (1 - ratio) + embed_b * ratio
    # Normalize to maintain magnitude
    return blended / np.linalg.norm(blended) * np.linalg.norm(embed_a)
```
This lets you create “adjacent” voices—similar enough to be family, different enough to be distinct characters.
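For example, sweeping the ratio produces that family directly (`embed_a` and `embed_b` are placeholders for two extracted speaker embeddings):

```python
# Four adjacent voices between speaker A and speaker B
family = [blend_voices(embed_a, embed_b, ratio=r) for r in (0.2, 0.4, 0.6, 0.8)]
```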
The Codec War: Encodec vs. DAC vs. SoundStream
Modern TTS increasingly relies on neural audio codecs. The choice matters:
| Codec | Bitrate | Quality | Latency | Notes |
|---|---|---|---|---|
| Encodec | 1.5-24 kbps | Good | Low | Meta’s workhorse |
| DAC | 8-16 kbps | Excellent | Medium | Descript’s challenger |
| SoundStream | 3-18 kbps | Very Good | Low | Google’s entry |
My take: DAC at 16 kbps produces the most natural results, but Encodec at 6 kbps is the sweet spot for real-time applications.
The codec choice affects inference parameters! DAC is more forgiving of temperature variation; Encodec requires tighter sampling.
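One way to encode that observation is a per-codec preset table; the numbers here are illustrative starting points extrapolated from the sampling section, not measured optima:

```python
codec_presets = {
    'dac':     {'temperature': 0.75, 'top_p': 0.90},  # tolerates wider sampling
    'encodec': {'temperature': 0.60, 'top_p': 0.85},  # needs tighter sampling
}
```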
A/B Testing Framework
You can’t improve what you can’t measure. I built a simple A/B testing rig:
```python
import random

class TTSExperiment:
    def __init__(self, model, configs):
        self.model = model
        self.configs = configs  # List of parameter dicts
        self.results = {i: {'wins': 0, 'total': 0} for i in range(len(configs))}

    def generate_pair(self, text):
        """Generate two samples with different configs."""
        idx_a, idx_b = random.sample(range(len(self.configs)), 2)
        audio_a = self.model.synthesize(text, **self.configs[idx_a])
        audio_b = self.model.synthesize(text, **self.configs[idx_b])
        return (idx_a, audio_a), (idx_b, audio_b)

    def record_preference(self, winner_idx, loser_idx):
        """Record human preference."""
        self.results[winner_idx]['wins'] += 1
        self.results[winner_idx]['total'] += 1
        self.results[loser_idx]['total'] += 1

    def get_rankings(self):
        """Return configs sorted by win rate."""
        rankings = []
        for idx, stats in self.results.items():
            rate = stats['wins'] / stats['total'] if stats['total'] > 0 else 0
            rankings.append((idx, rate, self.configs[idx]))
        return sorted(rankings, key=lambda x: -x[1])
```
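One comparison round looks like this (playback and collecting the listener's pick happen outside the sketch):

```python
exp = TTSExperiment(model, configs=[
    {'temperature': 0.6, 'top_p': 0.85},
    {'temperature': 0.8, 'top_p': 0.90},
])
(idx_a, audio_a), (idx_b, audio_b) = exp.generate_pair("The quick brown fox.")
# ...play both clips blind, then record which one the listener preferred...
exp.record_preference(winner_idx=idx_a, loser_idx=idx_b)
print(exp.get_rankings())
```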
After ~100 comparisons, patterns emerge. Document everything.
The Latency-Quality Tradeoff
Real-time applications demand compromises. Here’s my latency budget for streaming TTS:
| Component | Target | Actual |
|---|---|---|
| Text preprocessing | <10ms | 5ms |
| Model inference | <200ms | 180ms |
| Audio decoding | <20ms | 15ms |
| Buffer/network | <50ms | variable |
| Total | <300ms | ~200ms |
To hit these numbers:
- Use ONNX or TensorRT for inference
- Batch phoneme processing
- Stream audio chunks (don’t wait for full synthesis)
- Pre-warm the model (first inference is always slow)
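The last point is the cheapest win. A throwaway synthesis at startup (reusing the hypothetical `synthesize` call from earlier) absorbs kernel compilation and cache misses before real traffic arrives:

```python
# Pre-warm: pay the first-inference cost before serving requests
_ = model.synthesize("Warm-up utterance.", **sampling_config)
```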
The Unsolved Problems
Even with optimal tuning, gaps remain:
- Laughter and non-speech vocalizations — Most models can’t produce natural laughter.
- Whispering — Either too loud or unintelligible.
- Singing — A different beast entirely (see: Bark’s attempts).
- Code-switching — Seamless language transitions within utterances.
These are active research areas. The next generation of models (likely combining LLM reasoning with audio synthesis) may crack them.
Closing Thoughts: The 90% Problem
Getting to 90% of ElevenLabs quality is achievable with open-source tools and careful tuning. It’s that last 10% that separates “good enough” from “indistinguishable.”
That 10% is likely:
- Training data quality — ElevenLabs has a lot of studio recordings
- Proprietary post-processing — Noise removal, EQ, dynamic compression
- Architecture innovations — Techniques we haven’t seen yet
But 90% is often enough. For internal tools, prototypes, and applications where cost or privacy matter, open-source TTS is ready.
The lab stays open. Experiments continue.
Next in The Lab: Fine-tuning TTS models on custom voices without catastrophic forgetting.