The holy grail of text-to-speech is deceptively simple: make it sound human. Not robotic. Not uncanny. Human. ElevenLabs has cracked this code with proprietary magic, but for those of us who prefer open weights and local inference, the question becomes: how close can we get?
This isn’t a tutorial. It’s a lab notebook—a record of experiments, failures, and breakthroughs in the quest to generate studio-quality voices from models like XTTS, Bark, StyleTTS2, and the new wave of codec-based synthesizers.
Key Takeaways
- Temperature sweet spot is 0.6–0.8 — adaptive scheduling by phoneme type yields best results
- CFG scale of 2.0–2.5 balances faithfulness and expression
- The 3-second rule for voice cloning: quality over quantity in reference audio
- DAC at 16 kbps produces the most natural codec results
- Open-source can achieve 90% of ElevenLabs quality with careful tuning
The Anatomy of “Authentic”
Before we touch any code, we need to define what makes a voice sound real. It’s not just clarity. It’s:
- Prosody — The rhythm, stress, and intonation patterns that convey meaning beyond words.
- Micro-variations — Subtle pitch wobbles, breath sounds, and hesitations that scream “human.”
- Timbre consistency — A voice that doesn’t waver unnaturally between phonemes.
- Emotional range — The ability to convey different affects without sounding like a different person.
ElevenLabs excels at all four. Most open-source models nail clarity but stumble on micro-variations and emotional range. That’s where inference tuning comes in.
The Temperature Paradox
Temperature controls the randomness of sampling. Higher temperature = more variation = more “human”… right?
Not quite.
- Too low (0.1–0.3): Robotic, monotone, predictable
- Sweet spot (0.6–0.8): Natural variation, maintained coherence
- Too high (1.0+): Chaotic, slurred, unstable
But here’s what I discovered: optimal temperature varies by phonetic context.
Vowels benefit from higher temperature (more tonal variation), while consonants need lower temperature (precision matters). This led me to experiment with adaptive temperature scheduling:
```python
def adaptive_temperature(phoneme_type, base_temp=0.7):
    """Adjust temperature based on phoneme characteristics."""
    modifiers = {
        'vowel': 1.15,        # More expression
        'fricative': 0.85,    # Precision for s, f, sh
        'plosive': 0.90,      # Clarity for p, t, k
        'nasal': 1.05,        # Slight warmth for m, n
        'approximant': 1.10,  # Natural flow for r, l, w
    }
    return base_temp * modifiers.get(phoneme_type, 1.0)
```
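Applied per token, with a hypothetical phonemizer that tags each phoneme (the tagging step itself is outside this sketch):

```python
# Illustrative tagged phonemes for the word "cast"
phonemes = [('k', 'plosive'), ('a', 'vowel'), ('s', 'fricative'), ('t', 'plosive')]
temps = [adaptive_temperature(ptype) for _, ptype in phonemes]
# -> [0.63, 0.805, 0.595, 0.63]
```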
The CFG Scale: Balancing Faithfulness and Expression
Classifier-Free Guidance (CFG) is borrowed from image generation but applies beautifully to TTS. It controls how closely the model follows the conditioning (voice prompt, emotion tags, etc.) versus free generation.
The math is elegant:
output = uncond + cfg_scale × (cond - uncond)
Where:
- `uncond` = what the model generates with no conditioning
- `cond` = what the model generates with full conditioning
- `cfg_scale` = how much to push toward the conditioned output
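In code, the guidance step is a one-liner over the two forward passes; a minimal sketch, assuming the model exposes both conditioned and unconditioned logits (the names here are mine):

```python
def apply_cfg(cond_logits, uncond_logits, cfg_scale=2.0):
    """Push the output toward the conditioned prediction."""
    # Works on NumPy arrays and torch tensors alike
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)
```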
My experiments:
| CFG Scale | Result |
|---|---|
| 1.0 | Flat, generic voice |
| 2.0 | Good balance, natural |
| 3.0 | Highly expressive, slight artifacts |
| 5.0+ | Over-saturated, distorted |
The sweet spot for most models: 2.0–2.5.
But here’s the trick: CFG scale should increase with sentence length. Short utterances need less guidance; longer passages benefit from stronger conditioning to prevent drift.
```python
def dynamic_cfg(text_length, base_cfg=2.0):
    """Scale CFG based on utterance length."""
    # length_factor runs 0 to 2.0, capping at ~100 chars
    length_factor = min(text_length / 50, 2.0)
    # Multiplier runs 0.8x (very short) to 1.6x (long passages)
    return base_cfg * (0.8 + 0.4 * length_factor)
```
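A few sample values, computed straight from the function above:

```python
for n in (25, 50, 150):
    print(n, dynamic_cfg(n))
# 25 -> 2.0, 50 -> 2.4, 150 -> 3.2 (the cap)
```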
Top-k, Top-p, and the Repetition Penalty Trifecta
Three parameters that are often cargo-culted without understanding:
1. Top-k Sampling — Limits vocabulary to the k most probable tokens. For TTS:
   - k = 50: Safe, stable, slightly less varied
   - k = 200: More natural variation
   - k = 500+: Diminishing returns, risk of artifacts
2. Top-p (Nucleus) Sampling — Dynamically selects tokens until cumulative probability reaches p.
   - p = 0.7: Conservative, stable
   - p = 0.9: Natural speech patterns
   - p = 0.95+: Risky, occasional stumbles
3. Repetition Penalty — Crucial for preventing the “stuck in a loop” failure mode.
   - 1.0: No penalty (can cause repetition artifacts)
   - 1.1–1.2: Sweet spot for natural speech
   - 1.5+: Forced variation, sounds unnatural
The winning combination I’ve settled on:
```python
sampling_config = {
    'temperature': 0.7,
    'top_k': 200,
    'top_p': 0.9,
    'repetition_penalty': 1.15,
    'length_penalty': 1.0,
}
```
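How these get passed in depends on the model wrapper; as a sketch, assuming a `synthesize` call that accepts sampling kwargs (the same hypothetical API the A/B rig below uses):

```python
audio = model.synthesize("Testing the winning combination.", **sampling_config)
```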
Voice Cloning: The 3-Second Rule
Most open-source models support voice cloning from reference audio. But not all reference audio is created equal.
The 3-Second Rule: A 3-6 second clip with:
- Clear speech (no background noise)
- Varied intonation (not monotone reading)
- Representative timbre (the “average” of the target voice)
…outperforms a 30-second clip of flat reading every time.
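A cheap gate worth running before cloning, as a minimal sketch (assumes soundfile is installed; it checks only the duration half of the rule, not noise or intonation):

```python
import soundfile as sf

def check_reference(path):
    """Reject reference clips outside the 3-6 second window."""
    audio, sample_rate = sf.read(path)
    seconds = len(audio) / sample_rate
    if not 3.0 <= seconds <= 6.0:
        raise ValueError(f"clip is {seconds:.1f}s; aim for 3-6s of varied speech")
```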
Embedding Interpolation
Want to blend voices or create variations? Linear interpolation in embedding space works surprisingly well:
```python
import numpy as np

def blend_voices(embed_a, embed_b, ratio=0.5):
    """Blend two voice embeddings."""
    blended = embed_a * (1 - ratio) + embed_b * ratio
    # Normalize to maintain magnitude
    return blended / np.linalg.norm(blended) * np.linalg.norm(embed_a)
```
This lets you create “adjacent” voices—similar enough to be family, different enough to be distinct characters.
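For example, sweeping the ratio produces that family directly (`embed_a` and `embed_b` are placeholders for two extracted speaker embeddings):

```python
# Four adjacent voices between speaker A and speaker B
family = [blend_voices(embed_a, embed_b, ratio=r) for r in (0.2, 0.4, 0.6, 0.8)]
```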
The Codec War: Encodec vs. DAC vs. SoundStream
Modern TTS increasingly relies on neural audio codecs. The choice matters:
| Codec | Bitrate | Quality | Latency | Notes |
|---|---|---|---|---|
| Encodec | 1.5-24 kbps | Good | Low | Meta’s workhorse |
| DAC | 8-16 kbps | Excellent | Medium | Descript’s challenger |
| SoundStream | 3-18 kbps | Very Good | Low | Google’s entry |
My take: DAC at 16 kbps produces the most natural results, but Encodec at 6 kbps is the sweet spot for real-time applications.
The codec choice affects inference parameters! DAC is more forgiving of temperature variation; Encodec requires tighter sampling.
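One way to encode that observation is a per-codec preset table; the numbers here are illustrative starting points extrapolated from the sampling section, not measured optima:

```python
codec_presets = {
    'dac':     {'temperature': 0.75, 'top_p': 0.90},  # tolerates wider sampling
    'encodec': {'temperature': 0.60, 'top_p': 0.85},  # needs tighter sampling
}
```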
A/B Testing Framework
You can’t improve what you can’t measure. I built a simple A/B testing rig:
```python
import random

class TTSExperiment:
    def __init__(self, model, configs):
        self.model = model
        self.configs = configs  # List of parameter dicts
        self.results = {i: {'wins': 0, 'total': 0} for i in range(len(configs))}

    def generate_pair(self, text):
        """Generate two samples with different configs."""
        idx_a, idx_b = random.sample(range(len(self.configs)), 2)
        audio_a = self.model.synthesize(text, **self.configs[idx_a])
        audio_b = self.model.synthesize(text, **self.configs[idx_b])
        return (idx_a, audio_a), (idx_b, audio_b)

    def record_preference(self, winner_idx, loser_idx):
        """Record human preference."""
        self.results[winner_idx]['wins'] += 1
        self.results[winner_idx]['total'] += 1
        self.results[loser_idx]['total'] += 1

    def get_rankings(self):
        """Return configs sorted by win rate."""
        rankings = []
        for idx, stats in self.results.items():
            rate = stats['wins'] / stats['total'] if stats['total'] > 0 else 0
            rankings.append((idx, rate, self.configs[idx]))
        return sorted(rankings, key=lambda x: -x[1])
```
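One comparison round looks like this (playback and collecting the listener's pick happen outside the sketch):

```python
exp = TTSExperiment(model, configs=[
    {'temperature': 0.6, 'top_p': 0.85},
    {'temperature': 0.8, 'top_p': 0.90},
])
(idx_a, audio_a), (idx_b, audio_b) = exp.generate_pair("The quick brown fox.")
# ...play both clips blind, then record which one the listener preferred...
exp.record_preference(winner_idx=idx_a, loser_idx=idx_b)
print(exp.get_rankings())
```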
After ~100 comparisons, patterns emerge. Document everything.
The Latency-Quality Tradeoff
Real-time applications demand compromises. Here’s my latency budget for streaming TTS:
| Component | Target | Actual |
|---|---|---|
| Text preprocessing | <10ms | 5ms |
| Model inference | <200ms | 180ms |
| Audio decoding | <20ms | 15ms |
| Buffer/network | <50ms | variable |
| Total | <300ms | ~200ms |
To hit these numbers:
- Use ONNX or TensorRT for inference
- Batch phoneme processing
- Stream audio chunks (don’t wait for full synthesis)
- Pre-warm the model (first inference is always slow)
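The last point is the cheapest win. A throwaway synthesis at startup (reusing the hypothetical `synthesize` call from earlier) absorbs kernel compilation and cache misses before real traffic arrives:

```python
# Pre-warm: pay the first-inference cost before serving requests
_ = model.synthesize("Warm-up utterance.", **sampling_config)
```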
The Unsolved Problems
Even with optimal tuning, gaps remain:
- Laughter and non-speech vocalizations — Most models can’t produce natural laughter.
- Whispering — Either too loud or unintelligible.
- Singing — A different beast entirely (see: Bark’s attempts).
- Code-switching — Seamless language transitions within utterances.
These are active research areas. The next generation of models (likely combining LLM reasoning with audio synthesis) may crack them.
Closing Thoughts: The 90% Problem
Getting to 90% of ElevenLabs quality is achievable with open-source tools and careful tuning. It’s that last 10% that separates “good enough” from “indistinguishable.”
That 10% is likely:
- Training data quality — ElevenLabs has a lot of studio recordings
- Proprietary post-processing — Noise removal, EQ, dynamic compression
- Architecture innovations — Techniques we haven’t seen yet
But 90% is often enough. For internal tools, prototypes, and applications where cost or privacy matter, open-source TTS is ready.
The lab stays open. Experiments continue.
Next in The Lab: Fine-tuning TTS models on custom voices without catastrophic forgetting.