Decoding and playing audio files in Linux
Overview
I was playing with various media libraries recently and have prepared several snippets demonstrating how one can decode and play an audio file in two separate steps.
The source code is available on GitHub here and there.
The following libraries are used:
- FFmpeg
- SoX
- ALSA (libasound)
- PulseAudio
- libsndfile
Each snippet is a small program. There are two kinds of them:
- decoders demonstrate reading an audio file and decoding raw PCM samples from it
- players demonstrate sending the raw PCM samples to the sound card
Since all snippets use the same sample format and use stdin or stdout, any decoder may be combined with any player via a pipe, for example:
$ ./ffmpeg_decode foo.mp3 | ./alsa_play_tuned
$ ./sox_decode_chain foo.mp3 | ./ffmpeg_play
$ ./sndfile_decode foo.flac | ./sox_play
Below you can find a brief description of every snippet and some side notes.
FFmpeg
ffmpeg_decode
This snippet decodes a file using FFmpeg (with automatic resampling and channel mapping).
Initialization:
- open the input file context (AVFormatContext) and look for an audio stream in it
- find a decoder (AVCodec) for the audio stream
- initialize the decoder context (AVCodecContext) for the decoder (AVCodec)
- initialize the swresample context (SwrContext) that will convert from decoder output to our PCM format and perform resampling if necessary
Decoding loop:
- read an audio packet (AVPacket) from the input file (AVFormatContext)
- decode an audio frame (AVFrame) from the audio packet (AVPacket) using the decoder context (AVCodecContext)
- convert the decoded frame (AVFrame) to raw samples (byte buffer) using the swresample context (SwrContext)
- write the raw samples to stdout
Notes:
- the swresample context (SwrContext) buffers converted samples internally when they do not fit into the output buffer passed to it, so we also drain any buffered data before processing the next frame. The buffering can be avoided by choosing the output buffer size carefully.
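To make the note about buffer size concrete, here is a minimal sketch of the arithmetic, assuming the goal is an output buffer large enough that nothing has to stay buffered in the resampler. The function name and parameters are hypothetical and the code does not call FFmpeg; with libswresample the same bound is usually computed from swr_get_delay() and av_rescale_rnd() with rounding up.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on the number of output samples (per channel) produced when
 * converting one frame.
 *   delay      - samples already buffered in the resampler (input-rate units)
 *   in_samples - samples in the frame about to be converted */
int64_t max_out_samples(int64_t delay, int64_t in_samples,
                        int in_rate, int out_rate)
{
    assert(in_rate > 0 && out_rate > 0);
    assert(delay >= 0 && in_samples >= 0);

    /* ceil((delay + in_samples) * out_rate / in_rate) in integer math */
    int64_t total = delay + in_samples;
    return (total * out_rate + in_rate - 1) / in_rate;
}
```

The byte size of the buffer is this value multiplied by the number of output channels and the size of one sample.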
ffmpeg_play
This snippet plays decoded samples using FFmpeg.
Initialization:
- open an output device (AVOutputFormat)
- create the output device context (AVFormatContext) for the output device
- add an audio stream (AVStream) for the output device context, which contains the encoder context (AVCodecContext)
- initialize the encoder context (AVCodecContext) parameters for our PCM format
Playback loop:
- read raw samples (byte buffer) from stdin
- construct an audio packet (AVPacket) that references our buffer with raw samples
- write the audio packet to the output device context (AVFormatContext)
ffmpeg_play_encoder
This snippet is a slightly more complicated version of the previous one, demonstrating encoder usage.
Initialization:
- open an output device (AVOutputFormat)
- create the output device context (AVFormatContext) for the output device
- add an audio stream (AVStream) for the output device context, which contains the encoder context (AVCodecContext)
- initialize the encoder context (AVCodecContext) parameters for our PCM format
- open an encoder (AVCodec) for our output format and attach it to the encoder context (AVCodecContext)
Playback loop:
- read raw samples (byte buffer) from stdin
- construct an audio frame (AVFrame) that references our buffer with raw samples
- encode the audio frame (AVFrame) into an audio packet (AVPacket) using the encoder context (AVCodecContext)
- write the audio packet to the output device context (AVFormatContext)
SoX
sox_decode_simple
This snippet decodes a file using SoX (without resampling and channel mapping).
Initialization:
- open an input file (sox_format_t)
Decoding loop:
- read samples from the input file
- write samples to stdout
sox_decode_chain
This snippet also decodes a file using SoX, but it uses an effects chain and supports resampling and channel mapping.
It opens the input file and constructs the effects chain:
- the input effect reads samples from the input file
- the gain and rate effects are added if the input file rate differs from the output rate, to perform resampling
- the channels effect is added if the input file channel set differs from the output channel set, to perform channel mapping
- the stdout effect writes samples to stdout
When the effects chain is constructed, it is executed using sox_flow_effects().
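The decision about which effects go into the chain can be sketched without SoX itself. The helper below is hypothetical and only produces the effect names in chain order, following the rules listed above; creating and adding the actual effects is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Fills names[] with the effects to add, in chain order, and returns how
 * many there are. names must have room for 5 entries. */
size_t plan_effects(double in_rate, unsigned in_channels,
                    double out_rate, unsigned out_channels,
                    const char *names[5])
{
    assert(names != NULL);
    size_t n = 0;

    names[n++] = "input";           /* reads samples from the input file */
    if (in_rate != out_rate) {
        names[n++] = "gain";        /* added together with rate */
        names[n++] = "rate";        /* performs the resampling */
    }
    if (in_channels != out_channels)
        names[n++] = "channels";    /* performs the channel mapping */
    names[n++] = "stdout";          /* writes samples to stdout */

    return n;
}
```

For a file that already matches the output format, the plan is just input followed by stdout, so the samples pass through unchanged.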
sox_play
This snippet plays decoded samples using SoX.
Initialization:
- open an output device (sox_format_t)
Playback loop:
- read samples from stdin
- write samples to the output device
ALSA (libasound)
alsa_play_simple
This snippet plays decoded samples using ALSA with the default parameters.
Initialization:
- open an output device (snd_pcm_t)
- set the output format
- get the period size
Playback loop:
- read samples from stdin (full period)
- write samples to the output device
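The "read a full period" step can be sketched as below. This is an illustration rather than the snippet's actual code: read_period is a hypothetical helper, and zero-filling the last partial period is an assumption about how the end of the stream might be handled (zero bytes are silence for float and signed integer PCM).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads one period (period_bytes bytes) from in into buf.
 * Returns the number of bytes actually read, 0 at end of input.
 * If the input ends mid-period, the rest of buf is zero-filled so the
 * caller can still hand a full period to the output device. */
size_t read_period(FILE *in, unsigned char *buf, size_t period_bytes)
{
    assert(in != NULL && buf != NULL);

    /* fread() returns a short count only on end of file or on error */
    size_t got = fread(buf, 1, period_bytes, in);
    memset(buf + got, 0, period_bytes - got);
    return got;
}
```

A playback loop would call read_period(stdin, buf, period_bytes) until it returns 0 and pass each filled buffer to the output device.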
alsa_play_tuned
This snippet also uses ALSA to play decoded samples, but with a non-default configuration.
The most important thing here is the ring buffer parameters.
When a program plays sound using libasound, samples are actually written to the internal ring buffer. ALSA reads samples from the buffer every timer tick.
During playback, if ALSA tries to read from an empty buffer, a buffer underrun occurs. During capture, if ALSA has to write to a buffer that is already full, a buffer overrun occurs. Both events are called xruns. As a result, the user hears glitches in the sound and sees “alsa xrun” messages in the console.
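How the buffer size affects playback underruns can be illustrated with a toy model. The sketch below does not use ALSA; the function and its parameters are hypothetical, and the model is deliberately simplified (one write per tick, and samples that do not fit are dropped instead of blocking the writer).

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a playback ring buffer. On every tick the program first
 * writes written[i] samples (whatever does not fit is dropped here; a real
 * blocking write would wait instead), then the device consumes period_size
 * samples. A tick that finds fewer than period_size samples in the buffer
 * counts as an underrun. Returns the number of underruns. */
unsigned count_underruns(size_t buffer_size, size_t period_size,
                         const size_t *written, size_t n_ticks)
{
    assert(period_size > 0 && period_size <= buffer_size);

    size_t fill = 0;
    unsigned underruns = 0;

    for (size_t i = 0; i < n_ticks; i++) {
        size_t room = buffer_size - fill;
        fill += written[i] < room ? written[i] : room;

        if (fill < period_size) {
            underruns++;
            fill = 0;               /* the buffer runs dry */
        } else {
            fill -= period_size;
        }
    }
    return underruns;
}
```

In this model, a program that writes two periods on every second tick underruns on every other tick with a one-period buffer, while a two-period buffer absorbs the same write pattern without underruns.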
To avoid xruns, you can tweak four ring buffer parameters:
- buffer_size - the number of samples in the ring buffer
- buffer_time - duration of the whole buffer in microseconds
- period_size - the number of samples that ALSA reads from the buffer every timer tick (no more than buffer_size)
- period_time - duration of the timer tick in microseconds (no more than buffer_time)
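The size and time variants describe the same quantity in different units, so one can be derived from the other once the sample rate is known. A minimal sketch with hypothetical helper names; note that ALSA counts sizes in frames, that is, one sample per channel:

```c
#include <assert.h>
#include <stdint.h>

/* Duration in microseconds of n samples (per channel) at the given rate. */
uint64_t samples_to_us(uint64_t n, unsigned rate)
{
    assert(rate > 0);
    return n * 1000000u / rate;
}

/* Number of samples (per channel) that cover us microseconds at the given rate. */
uint64_t us_to_samples(uint64_t us, unsigned rate)
{
    assert(rate > 0);
    return us * rate / 1000000u;
}
```

For example, a buffer_size of 4410 at 44100 Hz corresponds to a buffer_time of 100000 microseconds, which is 100 ms of audio.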
And also two related parameters:
- start_threshold - playback starts automatically once the program has written at least start_threshold samples to the buffer
- avail_min - a program waiting for the device is woken up only when at least avail_min samples can be written to the buffer
These parameters are always a compromise between robustness and latency.
Here are my recommendations:
- period_size should be set to the number of samples that the program writes to the PCM device at a time; the larger it is, the higher the latency, but the lower the probability of an xrun
- buffer_size should be a multiple of period_size and several times larger
- start_threshold should be set to buffer_size
- avail_min should be set to period_size
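These recommendations can be written down as a small helper. The struct and function below are hypothetical and are not ALSA types; in libasound the resulting values would typically be applied with snd_pcm_hw_params_set_period_size_near(), snd_pcm_hw_params_set_buffer_size_near(), snd_pcm_sw_params_set_start_threshold() and snd_pcm_sw_params_set_avail_min(). The number of periods per buffer is left to the caller.

```c
#include <assert.h>

/* Hypothetical holder for the four values; not an ALSA type. */
struct ring_params {
    unsigned long period_size;
    unsigned long buffer_size;
    unsigned long start_threshold;
    unsigned long avail_min;
};

/* write_size - samples the program writes to the device at a time
 * n_periods  - how many periods the ring buffer should hold */
struct ring_params recommend(unsigned long write_size, unsigned n_periods)
{
    assert(write_size > 0 && n_periods >= 2);

    struct ring_params p;
    p.period_size = write_size;
    p.buffer_size = write_size * n_periods;   /* a multiple of period_size */
    p.start_threshold = p.buffer_size;        /* start with a full buffer */
    p.avail_min = p.period_size;              /* wake up once per period */
    return p;
}
```

With 441 samples per write and 4 periods, this gives a 1764-sample buffer, which is 40 ms at 44100 Hz.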
PulseAudio
pa_play_simple
This snippet plays decoded samples using PulseAudio Simple API.
Initialization:
- create a pa_simple object, which represents a connection to the PulseAudio server plus a playback stream
Playback loop:
- read samples from stdin
- write samples to PulseAudio server
pa_play_async_cb
This snippet plays decoded samples using PulseAudio Asynchronous API.
Initialization:
- create a pa_mainloop object that will be used to run our program
- create a pa_context object that represents a connection to the server
- set up a callback for context state updates
- run the mainloop
When the connection is established, our callback is invoked. Now we should create a stream:
- create a pa_stream object that represents a playback stream
- configure buffer parameters and stream flags
- set up a callback to be invoked when the server wants more samples
There are four parameters of the server-side stream buffer:
- maxlength - maximum buffer length (in bytes)
- tlength - desired buffer length, i.e. the target latency (in bytes)
- prebuf - start threshold, i.e. the minimum number of bytes to be accumulated in the buffer before starting the stream
- minreq - minimum number of bytes to be requested from the client each time
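Since all four parameters are expressed in bytes, a latency given in milliseconds has to be converted first. The helper below is a hypothetical sketch of that conversion and does not call PulseAudio; libpulse offers pa_usec_to_bytes() for the same purpose.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to hold ms milliseconds of audio, rounded down to a whole
 * number of frames (a frame is one sample for every channel). */
uint32_t ms_to_bytes(unsigned ms, unsigned rate,
                     unsigned channels, unsigned bytes_per_sample)
{
    assert(rate > 0 && channels > 0 && bytes_per_sample > 0);

    uint64_t frames = (uint64_t)rate * ms / 1000;
    return (uint32_t)(frames * channels * bytes_per_sample);
}
```

For example, 100 ms of 44100 Hz stereo audio with 4-byte samples takes 35280 bytes, which could serve as a tlength value.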
We also enable three stream flags:
- PA_STREAM_AUTO_TIMING_UPDATE
  Automatically update the current latency (stream buffer size) from the server. We just print the current latency to stderr.
- PA_STREAM_INTERPOLATE_TIMING
  Interpolate reported latency values between timing updates. We want the latency values printed to stderr to change smoothly.
- PA_STREAM_ADJUST_LATENCY
  With this flag, tlength becomes the target size for the stream buffer plus the device buffer, instead of just the stream buffer. PulseAudio will do two things:
  - adjust the device buffer size (ALSA ring buffer size) to be the minimum tlength value among all connected streams
  - request samples from the client in such a way that there are always about tlength - dlength bytes remaining in the stream buffer, where dlength is the device buffer size
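The two rules for PA_STREAM_ADJUST_LATENCY boil down to simple arithmetic. The sketch below models only what is stated above, with hypothetical function names; the real server takes more factors into account.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Device buffer size (dlength): the minimum tlength among all connected
 * streams. */
uint32_t device_length(const uint32_t *tlengths, size_t n)
{
    assert(tlengths != NULL && n > 0);

    uint32_t min = tlengths[0];
    for (size_t i = 1; i < n; i++)
        if (tlengths[i] < min)
            min = tlengths[i];
    return min;
}

/* Bytes kept in one stream's buffer: tlength - dlength, never negative. */
uint32_t stream_fill(uint32_t tlength, uint32_t dlength)
{
    return tlength > dlength ? tlength - dlength : 0;
}
```

With three streams requesting 35280, 17640 and 70560 bytes, the device buffer becomes 17640 bytes and the first stream keeps about 17640 bytes in its own buffer.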
When the PulseAudio server wants more samples, it invokes our callback, which does the following:
- ask PulseAudio to allocate memory for the next batch of samples
- read the requested number of samples from stdin into the buffer
- write the buffer to PulseAudio
We could also allocate the memory ourselves. However, delegating this to PulseAudio spares us an unnecessary copy when PulseAudio uses the zero-copy mode. Zero copy is usually the default for local clients.
pa_play_async_poll
This snippet is just like the previous one, but it uses polling instead of callbacks. Polling is performed between mainloop iterations.
This approach could also be used with a threaded mainloop. In this case, polling may be performed from another thread after obtaining a lock.
libsndfile
sndfile_decode
This snippet decodes a file using libsndfile.
Initialization:
- open an input file (
SNDFILE
)
Decoding loop:
- read samples from the input file
- write samples to stdout
Other libraries
There are no snippets for these libraries, but they may also be useful.
Portable audio I/O:
Media frameworks:
Notes
SoX can use libsndfile for reading files. FFmpeg can use the SoX resampler library (libsoxr) for more precise resampling.
Both SoX and FFmpeg can use ALSA and PulseAudio for audio output. But note that FFmpeg tools use SDL for audio output instead.
PulseAudio usually uses ALSA for audio output. It may also use the SoX resampler library (libsoxr) for high-quality resampling.
All libraries listed here (with or without snippets) are cross-platform.