Sample Rate · Audio · Speech Recognition · TTS · Voice AI · Telephony · Nyquist

Sample Rates Are Engineering Tradeoffs

Sample rates are not quality settings. They are engineering decisions shaped by whether a system is reproducing music, understanding speech, or carrying a phone call as cheaply as possible.

Gokul JS · 4 min read

A lot of people think sample rates are about audio quality. 48kHz sounds better than 16kHz, and 16kHz sounds better than 8kHz. I used to think that too. But sample rates are not really quality settings. They are engineering decisions.

The mistake is assuming all audio systems are trying to do the same thing. They are not. Music playback tries to reproduce sound faithfully. Speech recognition tries to understand words. Telephone systems try to make conversation possible as cheaply as possible. Once you realize this, the sample rates stop looking arbitrary.

The Nyquist theorem gives the basic rule. To represent frequencies up to some limit, you need to sample at more than twice that limit. Human hearing extends to roughly 20kHz, which is why high-fidelity audio systems operate above 40kHz. CD audio ended up at 44.1kHz. Modern realtime media systems often use 48kHz. At first glance this makes higher sample rates seem obviously superior.

[Figure: Chart showing the Nyquist rule: telephony at 8 kHz for 3.4 kHz max, ASR at 16 kHz for 8 kHz max, and hi-fi at 44.1 or 48 kHz for 20 kHz hearing]
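The rule is easy to check numerically. The sketch below is a minimal illustration, not taken from any audio library: `nyquist_floor_hz` and `apparent_frequency_hz` are hypothetical helper names, and the second one applies the standard folding rule for a pure tone to show where content above half the sample rate ends up.

```python
def nyquist_floor_hz(max_freq_hz):
    """Rate that sampling must exceed to represent content up to max_freq_hz."""
    return 2 * max_freq_hz


def apparent_frequency_hz(tone_hz, sample_rate_hz):
    """Frequency a pure tone appears at after sampling (folding around fs/2)."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


# Band limits and chosen rates taken from the post.
for name, max_hz, chosen_hz in [
    ("telephony", 3_400, 8_000),
    ("hi-fi (CD)", 20_000, 44_100),
    ("hi-fi (realtime)", 20_000, 48_000),
]:
    floor = nyquist_floor_hz(max_hz)
    print(f"{name}: must exceed {floor} Hz, chosen {chosen_hz} Hz")

# A 5 kHz tone sampled at 8 kHz is indistinguishable from a 3 kHz tone.
print(apparent_frequency_hz(5_000, 8_000))  # 3000
```

Content above half the sample rate does not disappear when you sample too slowly. It folds back into the band you kept, which is why it has to be filtered out first.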

But speech recognition systems are solving a different problem. They do not need to preserve every detail humans can hear. They only need enough information to recover language. Most of the information required to understand speech exists below about 8kHz, so many speech recognition systems standardize on 16kHz audio. Compared with 48kHz capture, that is a third of the samples, which reduces compute, bandwidth, and storage while preserving almost everything needed for transcription. The audio sounds less rich to humans, but the model mostly does not care.
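To put a rough number on the savings, here is a small sketch that compares raw audio size at 48kHz and 16kHz. It assumes uncompressed 16-bit mono PCM, which is my assumption for illustration and not something every system uses; `pcm_bytes` is a hypothetical helper.

```python
BYTES_PER_SAMPLE = 2  # assumption: 16-bit mono PCM, uncompressed


def pcm_bytes(sample_rate_hz, seconds):
    """Size in bytes of raw PCM audio for the given duration."""
    return sample_rate_hz * BYTES_PER_SAMPLE * seconds


one_hour = 3600
for rate in (48_000, 16_000):
    megabytes = pcm_bytes(rate, one_hour) / 1_000_000
    print(f"{rate} Hz: {megabytes:.1f} MB per hour")

# The number of samples a model has to process shrinks by the same factor.
print(48_000 / 16_000)  # 3.0
```

Under these assumptions an hour of audio drops from about 346 MB to about 115 MB before any compression is applied.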

Telephone engineers discovered this long before modern AI existed. They found that humans could still understand speech reasonably well when audio was restricted to roughly 300Hz to 3400Hz. That meant the highest important speech frequency was around 3400Hz, which implied a minimum sample rate slightly above 6800Hz. Engineers standardized on 8kHz because it was cheap, simple, and safe. That decision shaped the global phone network for decades.

[Figure: Frequency spectrum comparing telephone passband at 300 to 3400 Hz, speech ASR content below 8 kHz, and human hearing up to 20 kHz]
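One plausible reading of "safe" is filter headroom. This is an interpretation, not a quote from the telephone standards: after sampling, a mirrored copy of the band appears just below the sample rate, and the gap between the top of the wanted band and the bottom of that mirror image is the room an anti-aliasing filter has to roll off. A sketch with a hypothetical helper name:

```python
def transition_band_hz(max_freq_hz, sample_rate_hz):
    """Gap between the wanted band's top edge and its first mirror image.

    The image of max_freq_hz lands at sample_rate_hz - max_freq_hz, so the
    anti-aliasing filter has (sample_rate_hz - 2 * max_freq_hz) to roll off.
    """
    return sample_rate_hz - 2 * max_freq_hz


print(transition_band_hz(3_400, 8_000))    # 1200
print(transition_band_hz(3_400, 6_800))    # 0
print(transition_band_hz(20_000, 44_100))  # 4100
```

Sampling at exactly 6800Hz would require a filter with no transition band at all, which cannot be built. 8kHz leaves 1200Hz of slack for a 3400Hz band edge.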

What makes this more interesting now is that modern voice AI systems often contain all these tradeoffs simultaneously. Audio may be captured at 48kHz in the browser, converted to 16kHz for speech recognition, generated at 24kHz by a TTS model, and finally transmitted at 8kHz through a telephony interface. None of these sample rates is the "real" one. They are simply local decisions made by different parts of the system.

[Figure: Pipeline showing sample rates at each stage: browser capture at 48 kHz, STT at 16 kHz, variable-rate TTS audio, and telephony at 8 kHz]
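The audio-to-audio hops in that pipeline can be written down as resampling ratios. This is a minimal sketch using only the rates mentioned in the post; the 20 ms frame size is an assumption (a common choice in realtime audio), and `resample_ratio` and `samples_per_frame` are hypothetical helpers, not functions from a real resampling library.

```python
from math import gcd

FRAME_MS = 20  # assumption: 20 ms frames, a common size in realtime audio


def resample_ratio(src_hz, dst_hz):
    """Smallest integer (up, down) pair with dst_hz == src_hz * up / down."""
    divisor = gcd(src_hz, dst_hz)
    return dst_hz // divisor, src_hz // divisor


def samples_per_frame(sample_rate_hz, frame_ms=FRAME_MS):
    """Number of samples in one frame at the given rate."""
    return sample_rate_hz * frame_ms // 1000


# The two audio-to-audio hops in the pipeline described in the post.
hops = [
    ("browser capture -> STT", 48_000, 16_000),
    ("TTS output -> telephony", 24_000, 8_000),
]
for name, src, dst in hops:
    up, down = resample_ratio(src, dst)
    print(f"{name}: up={up}, down={down}, "
          f"{samples_per_frame(src)} -> {samples_per_frame(dst)} samples per frame")
```

Both hops come out as a clean 3:1 reduction. A real resampler also low-pass filters before dropping samples, so that content above the new Nyquist limit does not alias into the kept band; that step is omitted here.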

This becomes even clearer once you look at text-to-speech systems. A TTS model can generate speech at many different sample rates. The choice is usually not about truth or fidelity. It is about constraints. A phone system may intentionally use 8kHz because the network cannot carry more. A realtime conversational model may use 24kHz because it sounds natural enough while keeping inference costs manageable. A music system may use 48kHz because high-frequency detail matters more there.
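The "constraints, not fidelity" idea can be sketched as a small selection rule. Everything here is hypothetical: `pick_output_rate`, the list of supported rates, and the idea of expressing inference cost as a rate ceiling are illustrative assumptions, not how any particular TTS system is configured.

```python
def pick_output_rate(supported_hz, sink_limit_hz, cost_limit_hz=None):
    """Highest supported rate allowed by both the sink and the cost ceiling.

    If no supported rate fits, fall back to the lowest supported rate; the
    audio then still has to be downsampled before it reaches the sink.
    """
    ceiling = sink_limit_hz
    if cost_limit_hz is not None:
        ceiling = min(ceiling, cost_limit_hz)
    usable = [rate for rate in supported_hz if rate <= ceiling]
    return max(usable) if usable else min(supported_hz)


supported = [8_000, 16_000, 24_000, 48_000]  # hypothetical model options

print(pick_output_rate(supported, sink_limit_hz=8_000))    # phone call: 8000
print(pick_output_rate(supported, sink_limit_hz=48_000,
                       cost_limit_hz=24_000))              # realtime voice: 24000
print(pick_output_rate(supported, sink_limit_hz=48_000))   # music: 48000
```

The same model lands on three different rates, and in each case the deciding input is a property of the surrounding system rather than of the speech being generated.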

Conclusion

So what am I actually saying? Sample rates are not quality settings you crank up until things sound good. They are engineering decisions, each one chosen for a specific job under specific constraints.

Telephone systems use 8kHz because speech stays intelligible in a narrow band and bandwidth is expensive. Speech recognition models use 16kHz because most of what language needs lives below 8kHz, and doubling the rate would double compute for almost no accuracy gain. Music and browser capture use 48kHz because human hearing extends to roughly 20kHz and fidelity matters there. TTS might land anywhere in between depending on whether the bottleneck is inference cost, network capacity, or how natural the output needs to sound.

None of these choices contradict each other. A voice AI pipeline that captures at 48kHz, transcribes at 16kHz, synthesizes at 24kHz, and transmits at 8kHz is not broken. It is a stack of local optima, each layer solving its own problem.

The sample rate is not really a property of the sound itself. It is a property of the system surrounding the sound. When you evaluate a sample rate, ask what that part of the system is trying to accomplish, not whether a higher number would sound better to your ears.