Decoding and playing audio files in Linux

07 Jul 2016

linux

audio

Table of contents

Overview

I was playing with various media libraries recently and have prepared several snippets demonstrating how one can decode and play an audio file in two separate steps.

The source code is available on GitHub here and there.

The following libraries are used:

Each snippet is a small program. There are two kinds of them:

decoders demonstrate reading an audio file and decoding raw PCM samples from it
players demonstrate sending the raw PCM samples to the sound card

Since all snippets use the same sample format and use stdin or stdout, any decoder may be combined with any player via a pipe, for example:

$ ./ffmpeg_decode     foo.mp3   |  ./alsa_play_tuned
$ ./sox_decode_chain  foo.mp3   |  ./ffmpeg_play
$ ./sndfile_decode    foo.flac  |  ./sox_play

Below you can find a brief description of every snippet and some side notes.

FFmpeg

WARNING

FFMpeg examples below are heavily outdated due to changes in API.

`ffmpeg_decode`

This snippet decodes a file using FFmpeg (with automatic resampling and channel mapping).

Initialization:

open the input file context (AVFormatContext) and look for an audio stream in it
find a decoder (AVCodec) for the audio stream
initialize the decoder context (AVCodecContext) for the decoder (AVCodec)
initialize the swresample context (SwrContext) that will convert from decoder output to our PCM format and perform resampling if necessary

Decoding loop:

read an audio packet (AVPacket) from the input file (AVFormatContext)
decode an audio frame (AVFrame) from the the audio packet (AVPacket) using decoder context (AVCodecContext)
convert the decoded frame (AVFrame) to raw samples (byte buffer) using swresample context (SwrContext)
write the raw samples to stdout

Notes:

swresample context (SwrContext) enables buffering if input data is larger than the passed buffer, so we also read all buffered data (if any) before processing the next frame. The buffering can be avoided by carefully choosing the buffer size.

`ffmpeg_play`

This snippet plays decoded samples using FFmpeg.

Initialization:

open an output device (AVOutputFormat)
create the output device context (AVFormatContext) for the output device
add an audio stream (AVStream) for the output device context, which contains the encoder context (AVCodecContext)
initialize the encoder context (AVCodecContext) parameters for our PCM format

Playback loop:

read raw samples (byte buffer) from stdin
construct an audio packet (AVPacket) that references our buffer with raw samples
write the audio packet to the output device context (AVFormatContext)

`ffmpeg_play_encoder`

This snippet is a bit complicated version of the previous one, demonstrating encoder usage.

Initialization:

open an output device (AVOutputFormat)
create the output device context (AVFormatContext) for the output device
add an audio stream (AVStream) for output device context, which contains the encoder context (AVCodecContext)
initialize the encoder context (AVCodecContext) parameters for our PCM format
open an encoder (AVCodec) for our output format and attach it to the encoder context (AVCodecContext)

Playback loop:

read raw samples (byte buffer) from stdin
construct an audio frame (AVFrame) that references our buffer with raw samples
encode the audio frame (AVFrame) into audio packet (AVPacket) using encoder context (AVCodecContext)
write the audio packet to the output device context (AVFormatContext)

SoX

`sox_decode_simple`

This snippet decodes a file using SoX (without resampling and channel mapping).

Initialization:

open an input file (sox_format_t)

Decoding loop:

read samples from the input file
write samples to stdout

`sox_decode_chain`

This snippet also decodes file using SoX, but uses effects chain and supports resampling and channel mapping.

It opens input file and constructs effects chain:

input effect reads samples from the input file
gain and rate effects are added if the input file rate differs from the output rate, to perform resampling
channels effect is added if the input file channel set differs from the output channel set, to perform channel mapping
stdout effect writes samples to stdout

When the effects chain is constructed, it is executed using sox_flow_effects().

`sox_play`

This snippet plays decoded samples using SoX.

Initialization:

open an output device (sox_format_t)

Playback loop:

read samples from stdin
write samples to the output device

ALSA (libasound)

`alsa_play_simple`

This snippet plays decoded samples using ALSA with the default parameters.

Initialization:

open an output device (snd_pcm_t)
set the output format
get the period size

Playback loop:

read samples from stdin (full period)
write samples to the output device

`alsa_play_tuned`

This snippet also uses ALSA to play decoded samples but with non-default configuration.

The most important thing here is ring buffer parameters.

When a program plays sound using libasound, samples are actually written to the internal ring buffer. ALSA reads samples from the buffer every timer tick.

If a program tries to write to a full buffer, a buffer overrun occurs. If ALSA tries to read from an empty buffer, a buffer underrun occurs. These two events are also called xruns. As the result, the user hears sound lags and sees “alsa xrun” messages in console.

To avoid xruns, you can tweak four ring buffer parameters:

buffer_size - the number of samples in the ring buffer
buffer_time - duration of the whole buffer in microseconds
period_size - the number of samples that ALSA reads from the buffer every timer tick (no more than the buffer_size)
period_time - duration of the timer tick in microseconds (no more than the buffer_time)

And also two related parameters:

start_threshold - before reading the very first sample from the buffer, ALSA waits until there are start_threshold samples in it
avail_min - before reading the next batch of samples from the buffer (every timer tick), ALSA waits until there are at least avail_min samples in it

These parameters are always a compromise between the robustness and latency.

Here are my recommendations:

period_size should be set to the number of samples that the program writes to pcm at a time; the higher it is, the higher the latency is, but the lesser the probability of xrun is
buffer_size should be a multiple of period_size and several times more
start_threshold should be set to buffer_size
avail_min should be set to period_size

PulseAudio

`pa_play_simple`

This snippet plays decoded samples using PulseAudio Simple API.

Initialization:

create pa_simple object which represents a connection to PulseAudio server plus playback stream

Playback loop:

read samples from stdin
write samples to PulseAudio server

`pa_play_async_cb`

This snippet plays decoded samples using PulseAudio Asynchronous API.

Initialization:

create pa_mainloop object that will be used to run our program
create pa_context object that represents a connection to the server
setup callback for context state updates
run the mainloop

When the connection is established, our callback is invoked. Now we should create a stream:

create pa_stream object that represents a playback stream
configure buffer parameters and stream flags
setup callback to be invoked when the server wants more samples

There are four parameters of the server-side stream buffer:

maxlength - maximum buffer length (in bytes)
tlength - desired buffer length, i.e. the target latency (in bytes)
prebuf - start threshold, i.e. the minimum number of bytes to be accumulated in buffer before starting the stream (in bytes)
minreq - minimum number of samples to be requested from the client each time (in bytes)

We also enable three stream flags:

PA_STREAM_AUTO_TIMING_UPDATE

Automatically update current latency (stream buffer size) from the server. We just print current latency to stderr.
PA_STREAM_INTERPOLATE_TIMING

Interpolate reported latency values between timing updates. We want the latency values printed to stderr to change smoothly.
PA_STREAM_ADJUST_LATENCY

With this flag, tlength becomes the target size for the stream buffer plus device buffer, instead of just stream buffer. PulseAudio will do two things:
- adjust the device buffer size (ALSA ring buffer size) to be the minimum tlength value among of the all connected streams
- request samples from the client in such way that there is always about tlength - dlength bytes remaining in the stream buffer, where dlength is the device buffer size

When PulseAudio server wants more samples, it invokes our callback which does the following:

ask PulseAudio to allocate a memory for next batch of samples
read requested amount of samples from stdin to the buffer
write the buffer to PulseAudio

We could also do memory allocation by ourselves. However, delegating this function to PulseAudio prevents us from unnecessary copying when PulseAudio uses the zero-copy mode. Zero copy is usually a default for local clients.

`pa_play_async_poll`

This snippet is just like previous one, but it uses polling instead of callbacks. Polling is performed between mainloop iterations.

This approach could be also used with a threaded mainloop. In this case, polling may be performed from another thread after obtaining a lock.

libsndfile

`sndfile_decode`

This snippet decodes a file using libsndfile.

Initialization:

open an input file (SNDFILE)

Decoding loop:

read samples from the input file
write samples to stdout

Other libraries

There are no snippets for these libraries, but they may be also useful.

Portable audio I/O:

Media frameworks:

Notes

SoX can use libsndfile for reading files. FFmpeg can use SoX for more precise resampling.

Both SoX and FFmpeg can use ALSA and PulseAudio for audio output. But note that FFmpeg tools use SDL for audio output instead.

PulseAudio usually uses ALSA for audio output. It may also use SoX for high-quality resampling.

All libraries listed here (with or without snippets) are cross-platform.