Recording and Processing Spoken Word

What happens if you listen to 60 seconds of your favourite radio station / audiobook and then 60 seconds of a random non-professional podcast? The former will be pleasant to listen to, with good intelligibility — it’s easy to understand everything the speakers say. The latter, however, is often somewhat unpleasant to listen to and has poor intelligibility. To add insult to injury, different episodes of the same podcast will often vary in quality and volume.

Fortunately, it turns out that most of us can get reasonably close to “professional” sounding recordings without much effort or money. In this post I’m going to break this down into two parts: recording and processing. My experience is that most of us eventually work out some recording techniques, but many fewer dive into processing. I’m going to show how one can use widely available command-line tools to process recordings, producing good quality mixes (i.e. the output we release to listeners).

I’m going to show how we can go from recordings that sound like this [1]:

to this, without spending too much money or time on tricky audio editing:

To stop this post getting too big, I’m going to focus solely on spoken word (i.e. speech) (rather than, say, music), and on pre-recordings rather than live broadcasts [2].

Recording

Evolution has made the human ear incredibly sensitive to audio frequencies generated by the human voice, which is what allows us to accurately identify hundreds of people simply from listening to a brief snippet of their voice. We need to record those frequencies fairly well if we want to stand a chance of people finding our eventual audio pleasant and intelligible.

Equipment

Many of us assume that what we need to get a good recording is very expensive equipment. While expensive equipment rarely hurts, it turns out to be much less necessary than in many other areas [3]. The law of diminishing returns on audio equipment kicks in far faster than for, say, cameras.

Most obviously we need a microphone. It’s tempting to look at video of our favourite podcasts or radio stations and assume we need to spend £400 on a Shure SM7B — but then we need a microphone arm and a surprisingly powerful audio interface. You’re not going to get much change from £600 (~750USD) by going down that route.

Fortunately, far cheaper microphones are really rather good. There are two things to avoid:

  1. Don’t use a laptop microphone.
  2. Don’t use a Bluetooth microphone.

Most laptop microphones are terrible [4] and they’re also too far away from your mouth (more on that below). Surprisingly, Bluetooth can receive decent quality audio but generally sends low quality audio that’s worryingly close in quality to an old-fashioned landline. This situation seems to be slowly improving, but it’ll be a while before it’s universally solved, so a blanket ban still seems to be the best advice.

If you bear those two rules in mind, most other microphones are at least somewhat acceptable. I’ve heard tolerable audio from the cheap wired headsets that (used to) come for free with mobile phones. USB microphones, which have the advantage that you don’t need a separate audio interface, seem to become usable at around the £50 mark and good by £100. If you want to explore this in depth, Podcastage has highly consistent reviews, and frequent comparisons, of every microphone under the sun. My suggestion is to start with something reasonably priced and only spend more later if you really get deeply into things [5].

Room and technique

No matter how good our microphone, where and how we record makes a huge difference. There are three main factors:

  1. Don’t record in a reverberant space.
  2. Record a quiet enough signal so that there is no digital clipping.
  3. Put the microphone about a hand’s width away from your mouth when recording.

“Reverberant space” is a fancy way of saying “somewhere with lots of echo”. Imagine you’re in the middle of a space with lots of hard surfaces, such as a cathedral or a brick factory building. When someone speaks, you’ll hear a “trail of echoes”, possibly a couple of seconds long, of what they’re saying. This can sound quite nice for music, but it seriously hampers intelligibility [6].

For example, if I record on my laptop in my kitchen – which, like most kitchens, has lots of hard surfaces – my recording is smothered in reverberations:

What I want to do is record in a “dead” space where there are as few reverberations as possible. In my case, simply moving my laptop to my office, which happens to have a number of soft furnishings, lessens reverberations significantly:

As this shows, many of us can improve our recordings simply by moving to a more suitable room! To do even better, we need to actively dampen the room we’re in, adding further soft coverings to absorb sound waves and lessen reverberations. Where should we place such coverings?

Most microphones are directional such that they pick up most of the sound from the direction they’re pointing at — including reverberations. Since you’ll be pointing your microphone at your mouth, putting dampening right behind you solves the majority of reverberation problems. If you want to go the extra mile, you’ll find that some dampening in any right-angled corners of your room makes your recording that little bit better [7].

In my case, I dampened the room in under a minute by closing the curtains, and putting a duvet on a clothes rack right behind my head. This improves the situation a bit further:

I’ve now hit the limits of the recording quality I can expect to get from a laptop’s microphone — remember that I strongly advised against using laptop microphones earlier!

If I record in the same room, using the same dampening, but swap from my laptop’s microphone to a “proper” (i.e. stand-alone) microphone I get the following:

Part of the (significant!) improvement between this mix and the last is the use of a better microphone, but probably as much of the improvement is because the microphone is now much closer to my mouth.

Most people record their audio with their microphone too far away or pointing in the wrong direction. You want the tip of the microphone to be about five fingers’ (i.e. your hand’s!) width away from your mouth, with the microphone pointing at your mouth. Any further away and the recording will sound thin and pick up many more reverberations; any closer and we’ll have a problem with the proximity effect and plosives.

We then need to record the signal into our computer. First we have to select a recording format, which is a combination of file format (e.g. flac / mp3) and sample rate (e.g. 48KHz). For music these two factors require careful consideration, but spoken word is more forgiving. I recommend recording in flac or wav format if you can (but not obsessing about it) and using a sample rate of 48KHz [8].
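
If your recording software saves in a different format or sample rate, ffmpeg makes the conversion trivial. A minimal sketch (the file names here are placeholders):

ffmpeg -i raw_recording.wav -ar 48000 recording.flac

The -ar 48000 option resamples to 48KHz; the .flac extension is enough for ffmpeg to pick the flac encoder.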

We have to find the right level to record at: if you record far too quiet a signal you can lose some important details; but if you record even slightly too loud, you’ll get digital clipping. The latter sounds truly awful, so it’s best to aim for a quieter rather than a louder recording [9]. How awful is awful? Let’s take the most recent recording from above, and amplify it so that it clips:

Notice how my voice now sounds “buzzy” and “tinny”: that’s what happens when digital clipping is frequent. It can sound even weirder when clipping only happens occasionally, often being experienced as a “click” in the recording.

The easiest way to avoid recording yourself at a level that clips is to record yourself talking at your loudest practical level which, for most people, is when they laugh — once you’ve found a recording level where laughter doesn’t clip you’re done. Personally, I load Audacity, press the record button, then use the level monitor highlighted below:

[Screenshot: Audacity’s recording level meter]

The further to the right the green bar is, the louder the recording at this point in time: the blue vertical bar shows you the loudest part of the recording so far. To be safe, I tend to make sure the loudest part of my recording is about -12dB (in the screenshot the blue line is at about -13dB).
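
If you’d rather check a finished recording from the command line than eyeball Audacity’s meter, ffmpeg’s volumedetect filter reports the mean and maximum volume of a file (in.flac is a placeholder; this analyses a recording after the fact, so it complements rather than replaces watching the meter while you set your level):

ffmpeg -i in.flac -af volumedetect -f null -

The max_volume line in the output tells you how close your loudest peak got to 0dB.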

We need to think about background noise while we’re recording. In an ideal world we would of course like no background noise, but that’s often not practical. However, we can often reduce background noise by recording at appropriate times. For example, if I lived right next to a school, I’d avoid drop-off/pickup times when lots of people are revving engines. As that example suggests, continuous background noises are less troublesome for listeners than intermittent background noises.

Finally, there is how we speak. To avoid our voice sounding raspy, we need to be well hydrated before we start recording and to take regular sips while we’re recording to maintain hydration. Some people turn into a robotic monotone when a microphone is put in front of them, which isn’t fun to listen to. In general, we want to speak as naturally as possible, though most of us will do well to speak a little slower than our normal conversational speed. In my experience, with a little bit of practice, most of us soon find a speaking cadence and tone that we’re happy with.

Processing

Many people release their initial spoken word recordings as final mixes. If we process (i.e. change) the audio we can make it both more pleasant to listen to and more intelligible. In this section I’m going to lead you through the two key aspects of this: dynamic range compression and normalisation. Those of us comfortable with command-line tools are fortunate that we can achieve professional standards simply by installing ffmpeg and ffmpeg-normalize.

Doing a good job of audio processing does require some understanding of how audio is represented and what the human ear likes to hear. If you don’t have time for that, then I recommend at least running this command:

ffmpeg-normalize <input file> -o normalised.mp3 \
  -c:a mp3 -b:a 96 -pr -t -16 -tp -1

This will take in your recording <input file> and produce an output normalised.mp3. In nearly all cases the output will be more pleasant to listen to than the original, and different output files will have a consistent volume level — in other words, it makes it easy for people to listen to multiple things without continually fiddling with the volume.

How much of a difference does this make? First try listening to the raw recording, without any processing (warning: this will be rather quiet):

Then try listening to that raw recording run through ffmpeg-normalize (warning: this will be much louder):

When we visualise the two wave forms one atop the other [10], the difference between them becomes obvious (“raw” recording in blue / teal, “normalised” in orange):

[Figure: the raw (blue / teal) and normalised (orange) waveforms overlaid]

Notice how much “smaller” (i.e. quieter) the raw recording is relative to the normalised recording, but also notice that the overall shape has been retained in the latter.
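
The waveform figures in this post were produced separately, but if you want a quick look at your own recordings, ffmpeg’s showwavespic filter can render one to a PNG (the file names and image size here are arbitrary):

ffmpeg -i in.flac -filter_complex "showwavespic=s=1200x300" \
  -frames:v 1 waveform.png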

As simple as it is, that single ffmpeg-normalize command will make your audio more consistently good than most people manage. If you want to do even better, keep on reading.

Dynamic range

At this point we have to start understanding something about sound. Volume in audio is measured in decibels (dB).

There are three general surprises about volume. First, in our context, we’ll always use 0 to represent the loudest value, with negative numbers being progressively quieter (e.g. -9dB is quieter than -6dB). Second, the scale is non-linear because our ears do not perceive volume linearly: we tend to perceive a 10dB difference between two signals as a doubling in loudness, even though the underlying signal has increased by far more than a factor of two [11]. Third, our initial reaction is nearly always to prefer louder over quieter things: if we’re comparing two sounds for anything other than loudness, we need to make sure that the two sounds are as loud as each other to avoid fooling ourselves.
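
For the curious, the relationship behind that second surprise is the logarithmic definition of the decibel, in terms of a signal power P1 relative to a reference power P0:

\[ \Delta L = 10 \log_{10}\!\left(\frac{P_1}{P_0}\right) \,\mathrm{dB} \]

A 10dB step therefore corresponds to a tenfold change in power, even though we hear it as roughly a doubling in loudness (see footnote [11]).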

We can now introduce the concept of dynamic range: what is the difference between the loudest and quietest part of a recording? Audio with too narrow a dynamic range can be unpleasant to listen to [12], but audio with too wide a dynamic range can be difficult to understand. The latter is particularly problematic with spoken word, which we often listen to in less than ideal acoustic circumstances.

For example, if I’m on a train and I find a comfortable volume level for the quieter parts of speech and then the volume jumps, it will hurt my ears; if I pick a comfortable volume level for the loud parts, I won’t hear the quieter parts over the rumble of the train. Unlike a conversation in normal life, I can’t easily ask the recording to repeat just the last part again.

Dynamic range compression

You might be tempted to imagine that the solution to a too-wide dynamic range is to record your voice at a consistent level, but it’s impossible for us to keep within the ideal dynamic range for any length of time [13]. Instead, what we want to do is process our recording so that the dynamic range is reduced.

The main technique we’ll use is dynamic range compression (and, no, this has nothing to do with “zip file” style compression!) which is almost always referred to as just compression. Compressors – that is, tools which perform compression on audio – make the louder parts of a recording quieter and keep the quieter parts unchanged, thus reducing the dynamic range of the recording. We then need to raise the overall volume level after compression: in our situation, we already know the tool (normalisation!) we’ll use to do this.

Most of us find it easier at first to see what compression and normalisation do rather than hear it. The three figures below show (from left to right, with “higher” on the y-axis meaning “louder”):

  1. The original recording, which has a wide dynamic range (i.e. the delta between the highest and lowest values on the y-axis), starting loud and then getting much quieter.
  2. The recording after compression. The dynamic range has been reduced by making the loud parts quieter (note that the “shape” of the loud parts hasn’t changed!). However, this has the side-effect of making the recording much quieter overall.
  3. The compressed recording after normalisation. Note that this has the same dynamic range as the second figure, as all parts of it have been consistently increased in volume.

[Figure: from left to right, the original, compressed, and compressed-then-normalised waveforms]

In my opinion, compression is the single greatest tool in an audio engineer’s toolbox. Modern music is unimaginable without it, as are radio and other spoken word recordings. Compression is also conceptually quite simple. Fundamentally, compressors have two main settings:

  1. Above what threshold (i.e. how close to 0dB) must a signal be before it’s reduced in volume?
  2. How much must a loud signal be reduced by?

One challenge with working out the “right” settings is that professionals have a tendency to say “you just need to use your ears”. While absolutely true, this advice is very hard to follow for beginners. Until we develop good taste, most of us tend to overdo audio processing. If in doubt, my suggestion for audio processing is that you “turn an effect up” until you can clearly hear it changing the audio, and then dial it back down until the point that you can’t quite hear it changing things.

Let’s start with the ratio, which is the easiest to explain: when a signal has exceeded the threshold we reduce it by a ratio. For example, a “4:1” ratio says that for every 4dB of loudness above the threshold, the compressor lets only 1dB through. For spoken word, “2:1” is a conservative ratio; “3:1” is a reasonable bet; “4:1” can work if you have a good recording; and the further above that you go, the more you will sound like a shouty sports presenter.
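
To put numbers on that, a simplified textbook view of a compressor (ignoring attack, release and knee) is that, for input and output levels in dB, a threshold T and a ratio R:

\[ \mathrm{out} = T + \frac{\mathrm{in} - T}{R} \]

For example, with the -24dB threshold and 3:1 ratio used later in this post, an input peak at -12dB (12dB over the threshold) comes out at -24 + 12/3 = -20dB.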

What the “right” threshold is will vary per recording. I use a conservative rule of thumb: I play back the recording, look at the approximate average level using Audacity’s meter, and subtract 6dB from it. If my peaks are at -12dB, the average tends to be around -18dB, which would then lead to a threshold of -24dB.

However, simply setting a level in dB isn’t fine-grained enough. Signals quite often very briefly exceed or drop below the threshold, and if the compressor turns on and off too quickly it produces an unpleasant sound effect called pumping. Thus, in addition to a threshold, we also specify an attack time – how long does a signal have to exceed the threshold for the compressor to start? – and a release time – how long does the signal have to fall below the threshold for the compressor to stop? Fortunately, for spoken word, a fast attack time of around 2-3ms and a moderately slow release time of around 15ms tend to work well [14].

With this knowledge, we can then fire some audio at a compressor. Let’s use ffmpeg’s built-in compressor acompressor:

ffmpeg -i in.flac \
  -af "acompressor=threshold=-24dB:attack=2.5:release=15:ratio=3" \
  out.flac

This command takes in a file in.flac, compresses it, and writes the output to out.flac. We’re using an audio filter (-af) with a single stage (acompressor). The arguments to acompressor are a threshold (note the dB units and be aware that “dB” is case-sensitive in ffmpeg); the attack time (2.5ms); the release time (15ms); and the ratio (3:1).

Although I won’t go into further detail, compression is an area one can spend a lot of time on. For example, different compressors not only have different settings but change volume / sound in subtly different ways [15]. One can also do fun things like chain compressors (i.e. compress audio multiple times) [16].
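
As a sketch of that chaining idea, ffmpeg lets you join filter stages with commas; the thresholds and ratios below are purely illustrative, not a recommendation:

ffmpeg -i in.flac \
  -af "acompressor=threshold=-12dB:attack=2.5:release=15:ratio=2.5,acompressor=threshold=-24dB:attack=2.5:release=15:ratio=2" \
  out.flac

The first stage only touches the very loudest parts; the second works on merely loud parts; together they behave a little like a single compressor with a higher ratio, but more smoothly.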

Normalisation

Because a compressor makes the louder bits quieter and keeps the quieter bits unchanged, it reduces the overall volume level of a recording. This exacerbates a common problem: when listeners move between different recordings they can be startled by the difference in volume. To solve this problem we can use “normalisation” to homogenise the perceived volume level between different recordings.

In its most basic form, normalisation simply uniformly increases, or decreases, the volume of an entire recording. Imagine I have one recording whose “overall volume level” is -20dB and another of -10dB. If I want both to have an overall volume level of -13dB, I’ll increase the first recording’s volume by 7dB and reduce the second by 3dB.
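
In ffmpeg terms, that most basic form is just a fixed gain change; for example, raising the first recording’s volume by 7dB could look like the following (purely to illustrate the idea; the tools below measure and apply the adjustment for you):

ffmpeg -i in.flac -af "volume=7dB" out.flac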

Conceptually, normalisation is thus simple. In practice, there are two difficulties that a good normaliser needs to consider.

First, what does “overall volume level” mean? There are lots of factors that influence how we perceive volume, from the effects of different frequencies, to peaks versus averages, and so on. Simple measurements of volume don’t match those perceptions. Instead, people have come up with blends of various measures to better approximate our perceptions. By far the most common now is the EBU R128 standard, which is what we’ll use.

Second, what do we do with occasional peaks (i.e. short, loud, spikes in volume)? If a recording is mostly quiet but has a brief period of someone laughing loudly, can we only increase the volume level until the laugh would clip? Or should we allow clipping? Or …?

Since the human ear hates digital clipping with a passion, you’ll be unsurprised to hear that we don’t want to allow clipping in normalisation. That seems to leave us in the unhappy position that we can only increase a recording’s volume until the point that the loudest peak would start to clip, even if that doesn’t allow us to increase the overall volume as much as we want.

Fortunately many normalisers also perform compression! They tend not to call it “compression”, though, preferring the term “limiting”, but for our purposes we can think of it as a compressor with fixed settings [17].
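
ffmpeg also exposes a standalone limiter as the alimiter filter. A sketch, with the ceiling given as a linear amplitude (0.891 is roughly -1dB) since I’m not certain the filter accepts dB values; in practice, loudnorm’s built-in limiter (below) is what we’ll rely on:

ffmpeg -i in.flac -af "alimiter=limit=0.891" out.flac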

As earlier, it’s easier to see than to hear this. The three figures below show (from left to right, with “higher” on the y-axis meaning “louder”):

  1. The original recording, which has a mostly narrow dynamic range with one very obvious peak.
  2. Naive normalisation, increasing the recording’s volume level but causing the peak to be clipped: notice that it has turned into a flat horizontal line. This will sound horrible!
  3. Normalisation and limiting, increasing the recording’s volume level but compressing the peak so that its “shape” is retained but it is no longer as relatively loud as before.

[Figure: from left to right, the original recording, naive normalisation with clipping, and normalisation with limiting]

ffmpeg has the excellent loudnorm filter to perform normalisation and limiting. Alas, as is common with ffmpeg, the documentation of loudnorm is woefully incomplete, making it almost impossible for normal users to understand and use. Fortunately, the external ffmpeg-normalize tool does all the tricky stuff for us.

loudnorm is both a normaliser and a limiter, not that the documentation really explains that: I once had to perform some experiments to convince myself that it’s doing limiting [18]! In fact, the limiter works so well on spoken word that it underlies my suggestion from earlier in this post for those in a hurry to use ffmpeg-normalize and not worry about explicit compression. Manual compression does lead to a better mix, but the difference often isn’t as stark as you might expect: indeed, if all you take away from this post is “use ffmpeg-normalize” then I have not wasted my time entirely!

Let’s quickly look at the ffmpeg-normalize command-line from earlier:

ffmpeg-normalize <input file> -o normalised.mp3 \
  -c:a mp3 -b:a 96 -pr -t -16 -tp -1

In order, the options are:

  1. -o normalised.mp3 specifies the output file.
  2. -c:a mp3 specifies the output format (in this case MP3) and -b:a 96 specifies a bitrate of 96 (for spoken word, a reasonable trade-off between file size and audio quality).
  3. -pr shows a progress bar (normalisation is not a fast job, in part because it makes multiple passes over the audio).
  4. -t -16 specifies a target loudness of -16 (in LUFS, which we can think of as dB; the default is -23, which was probably appropriate for television back in the day, but -16 is recommended for podcasts and the like).
  5. -tp -1 specifies a maximum peak (in dBFS, which we can think of as dB; the default is -2, but I find that unduly conservative).

The last of these options (-tp -1) I would consider optional, but for most of us -t -16 is recommended.
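
If you want to check what one of your own mixes measures, ffmpeg’s ebur128 filter prints a running EBU R128 analysis and a summary, including the integrated loudness in LUFS (in.flac is a placeholder):

ffmpeg -i in.flac -af ebur128 -f null -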

Using compression and normalisation together

As you’ve probably realised, I believe that the best recordings are first compressed and then normalised. This can seem a bit surprising, because I’ve said that normalisation also does compression: what’s the difference?

Let’s start by taking our raw recording and normalising it (without a separate compression stage):

Compare that with the raw recording when we first compress it and then normalise it:

Most of us will tend to slightly prefer the sound of the latter, though I suspect few of us would be able to easily articulate why. If we again visualise the two waveforms (orange is normalised only; blue / teal is compressed and normalised) atop each other we can retrospectively understand the difference:

[Figure: the normalised-only (orange) and compressed-then-normalised (blue / teal) waveforms overlaid]

The compressed and normalised (blue / teal) waveform is more uniform, with slightly lower peaks (i.e. the loud parts of speech aren’t quite as loud) and slightly higher troughs (i.e. quieter words are slightly less quiet). In nearly all situations this will lead to a slightly smoother listen, with better intelligibility. The differences in this case are relatively subtle, but depending on who’s speaking, the differences between the two can be more pronounced.
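
Putting the two stages together, the full pipeline is just the two commands from earlier run back to back (the intermediate and output file names are arbitrary):

ffmpeg -i raw.flac \
  -af "acompressor=threshold=-24dB:attack=2.5:release=15:ratio=3" \
  compressed.flac
ffmpeg-normalize compressed.flac -o final.mp3 \
  -c:a mp3 -b:a 96 -pr -t -16 -tp -1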

Wrapping up

This post is rather long, but I’ve still had to sweep quite a lot of details under the carpet in order to focus on something close to the “fundamentals”. There is a lot more I could have said, from multi-person recording to noise gates to further automation: I’ll probably do at least one follow-up post on the latter.

Summarising the post as best I can, I’d say the following points will get you most of the way to good quality spoken word mixes:

  1. Even fairly modest recording equipment will get you good results, but don’t use the microphone built into your laptop!
  2. Avoid reverberations by using a room with soft furnishings; adding further soft furnishings will improve the situation further.
  3. Keep the microphone a hand’s width away from your mouth.
  4. Always use ffmpeg-normalize.
  5. For the best possible results, use compression too.

Of course, all of this might make your audio sound better, but you still have to make the recordings in the first place, a lesson I should learn for my own somewhat neglected podcast…

Acknowledgments: Thanks to Dan Luu for comments.

2024-08-21 12:30

Footnotes

[1]

When comparing two mixes, we can easily fool ourselves, as our ears initially prefer slightly louder recordings. I’ve normalised (except where it doesn’t make sense) all the recordings to avoid fooling ourselves in this way.

[2]

Nearly all of what I say can be applied to live broadcasts, but there are some subtleties, chiefly to do with the need for low latency, that muddy the waters.

[3]

Indeed, microphone technology has advanced more slowly than any other comparable area I can think of: or, alternatively, it was perfected earlier than one might think. The two microphones most of us recognise are the Shure SM57 (used widely for music vocals) and the SM7B (for radio). The former was released in 1965 and the latter in 1973: both were relatively simple evolutions of previous models.

[4]

My favourite are those laptop microphones that pick up the laptop’s fan noise. In fairness, some laptops now do some clever tricks to make them sound decent, but even the best don’t yet match a half-decent microphone placed closer to your mouth.

[5]

Personally, if you have deep pockets I’d suggest a capacitor / condenser microphone, but most of us will be very happy with a dynamic microphone.

[6]

This problem is compounded by the fact that listeners will often listen in a reverberant space: adding their reverberations to ours is a recipe for unintelligibility.

[7]

Treating corners is less about reverberations than “frequency traps”. The more bass-heavy a sound is, the more that this becomes a problem.

[8]

44.1KHz is also OK. Don’t be tempted by higher sample rates than 48KHz: none of us can hear the difference for spoken word, and bigger files waste space and processing time.

[9]

Helpfully, affordable preamps have really improved in quality in recent years, which means that they a) introduce less noise into your recording and b) cause fewer problems if you record a bit too quiet. This means that you have to worry less about “am I recording loud enough?” than in the recent past.

[10]

Note that I downsampled to 4KHz first: without that, the SVGs take too long to transfer and too long to render.

[11]

Our ears are, in a sense, tricking us: producing a 10dB difference requires the underlying sound source to be using about 10x more power! There are other surprises to “what does a doubling in dB mean?” but we don’t need to worry about them.

[12]

The loudness war has made a great deal of music unnecessarily unpleasant to listen to, simply so that people can exploit the human tendency to initially prefer a slightly louder over a slightly quieter thing.

[13]

The human voice has a dynamic range of around 70-80dB from a whisper to a shout.

[14]

If you’re an exceptionally fast, or a particularly slow, speaker, you might get slightly better results by adjusting those values up or down.

[15]

Personally – and this is at best subjective and at worst placebo – I don’t like the sound of acompressor a great deal when used with a fast attack and moderate ratio, at least in the version of ffmpeg I’m using (4.4.5). I prefer Steve Harris’s mono sc4m compressor, but it’s not part of ffmpeg, and is somewhat harder to use, so I went with acompressor for this post.

[16]

I tend to compress very loud sounds first, then run a more normal compressor over medium loud sounds: the two compressors have low-ish ratios (2–2.5x) that collectively give a similar, but slightly smoother, effect to a single compressor with a higher ratio.

[17]

Typically fast attack, fast release, and a very high ratio.

[18]

Eventually I found a blogpost from loudnorm’s author explaining how loudnorm works in more detail, including its limiter.
