Here’s the typical conversation we have when we meet fellow audio friends.
A: You guys do spatial audio? What is that?
G: You can hear sound in VR as if you were experiencing it in the real world. Try this demo.
A: Using headphones? Then isn’t it just stereo? Or panning? (Tries the demo...) Oh, I see.
Once you hear it for yourself, you can easily tell the difference between regular stereo audio and spatial audio. However, it’s difficult to explain verbally how spatial audio is different, because spatialized content is still delivered over regular headphones. This article aims to differentiate between stereo, panning, and spatial audio.
Stereo, short for stereophonic sound, is the most popular method of sound reproduction. Using two or more loudspeakers, it gives the perception of a wide audio image with sound coming from various directions. Although the term stereophonic can refer to any multi-loudspeaker setup, stereo commonly refers to just two channels, left and right; a setup with more than two channels is usually called surround sound. An electronic device that reproduces stereo sound is also called a stereo.
From a capture, mixing, and rendering perspective, stereo sound can be broken up into two types. The first type is natural stereo, or true stereo. In this case, the recording is captured using more than one microphone. Typically, two microphones are placed left and right at a certain distance from one another, with each microphone capturing subtle differences in arrival time and sound pressure. This captured sound then gives the perception of width and depth to a mix. For playback, two speakers or a pair of headphones can be used. Natural stereo is typically used for acoustic and classical music.
The other type of stereo sound is artificial stereo, or pan-pot stereo. Here, a virtual stereo image is created by mixing multiple sound and instrument tracks that are captured individually. During this process, panning is used to place each track at a specific position. In conventional stereo mixing, panning is the basic method of controlling the sound image from left to right; this is part of the reason why the pan-pot sits right above the input fader on a mixing console, where it can be adjusted easily. Unlike natural stereo, each sound is captured individually and then positioned somewhere between the two speakers during the mixing stage.
As stated above, panning is important in channel-based rendering. The most widely used panning method is amplitude panning. Depending on where you want a sound to be placed in a mix, you adjust the amplitude of that sound in the left and right channels. The resulting difference in level between the two ears is called the interaural level difference (ILD), one of the cues we use to tell left from right: if a sound is coming from the left, it is louder in your left ear, and vice versa.
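The amplitude-panning idea above can be sketched in a few lines of Python. This is a minimal illustration using the common constant-power pan law; the function name is our own, and real mixing consoles offer several different pan laws:

```python
import math

def constant_power_pan(sample, pan):
    """Place a mono sample between left and right via amplitude panning.

    pan ranges from -1.0 (hard left) to +1.0 (hard right). Constant-power
    panning keeps left^2 + right^2 equal to the input power, so the sound
    does not dip in loudness as it moves across the image.
    """
    # Map pan to an angle from 0 (hard left) to pi/2 (hard right).
    theta = (pan + 1.0) * math.pi / 4.0
    left = sample * math.cos(theta)   # louder as the source moves left
    right = sample * math.sin(theta)  # louder as the source moves right
    return left, right
```

At center (`pan=0.0`), each channel carries the sample scaled by about 0.707 (`1/sqrt(2)`), which is exactly the constant-power property at work.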
Panning does have its limitations. Since loudspeakers are usually placed on the same horizontal plane, sound can only be heard in that flat plane. To place sound vertically, the speaker layout itself must include an elevation aspect, as in 3D vector base amplitude panning (VBAP). Additionally, the distance between the listener and the virtual sound source is not easy to control, as the sound image is fixed to a virtual arc between the left and the right channels. Instead of feeling that the sound is closer or farther, you may feel that the sound is bigger or smaller within that loudspeaker area. The critical limit here is that panning cannot place the sound image beyond the angle of the loudspeaker layout.
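To make the VBAP mention above concrete, here is a sketch of the simplest two-speaker (2D) case: the source direction is expressed as a combination of the two speaker direction vectors, and the resulting gains are normalized for constant power. The function name and angles are illustrative; a full VBAP implementation also selects the active speaker pair (or triplet, in 3D) from a larger layout:

```python
import math

def vbap_2d(source_deg, spk1_deg, spk2_deg):
    """Compute amplitude gains for a source between two loudspeakers
    using 2D vector base amplitude panning (VBAP).

    All angles are in degrees on the horizontal plane. The source
    direction should lie between the two speaker directions.
    """
    def unit(deg):
        rad = math.radians(deg)
        return (math.cos(rad), math.sin(rad))

    p = unit(source_deg)              # source direction vector
    l1, l2 = unit(spk1_deg), unit(spk2_deg)
    # Solve [l1 l2] * g = p for the gain vector g (2x2 matrix inverse).
    det = l1[0] * l2[1] - l2[0] * l1[1]
    g1 = (p[0] * l2[1] - l2[0] * p[1]) / det
    g2 = (l1[0] * p[1] - p[0] * l1[1]) / det
    # Normalize so g1^2 + g2^2 = 1 (constant power).
    norm = math.sqrt(g1 * g1 + g2 * g2)
    return g1 / norm, g2 / norm
```

For a standard stereo pair at -30° and +30°, a source at 0° yields equal gains of about 0.707 in each speaker, matching ordinary center panning.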
This limitation is magnified in the headphone listening environment, where the speakers sit right over each ear. In headphones, no matter how hard you pan, the sound image simply moves between the two speakers, resulting in what’s called in-head localization. For this reason, panning is not suitable for VR content, where sound needs to be placed anywhere in three-dimensional space.
While the process used to create stereo sound is known as panning, the process used to create spatial audio is known as binaural rendering. Binaural rendering lets one hear sounds as if they were emanating from outside of the headphones, with audio sources sounding like they are coming from somewhere in the three-dimensional environment, independent of the headphones. Binaural rendering relies on the Head-Related Transfer Function (HRTF), which contains both monaural and binaural cues. An HRTF is a set of measurements describing how a sound is filtered by the listener’s head, ears, and torso, and these measurements differ depending on the location of the sound source. In a broader sense, this rendering process is called spatial audio. Because spatial audio uses the HRTF, sound can be placed anywhere in a 3D space, with elevation and distance also taken into account. Using binaural rendering and the HRTF, even if spatial audio is consumed through headphones, it’s possible to hear sounds as if they were coming from external sources.
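Conceptually, binaural rendering boils down to convolving a mono source with a pair of head-related impulse responses (HRIRs, the time-domain form of the HRTF), one per ear, for the source’s direction. A minimal sketch assuming NumPy and toy impulse responses rather than measured HRTFs; real renderers interpolate between measured HRTFs as the source or the listener’s head moves:

```python
import numpy as np

def binaural_render(mono, hrir_left, hrir_right):
    """Render a mono signal binaurally by convolving it with a pair of
    head-related impulse responses for one source direction.

    The HRIRs encode the level, time, and spectral differences between
    the two ears, which is what creates the out-of-head impression.
    """
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return left, right

# Toy HRIRs for a source to the listener's left (illustrative values,
# not measured data): the near ear gets a strong, early response, the
# far ear a delayed and attenuated one.
hrir_near = np.array([0.9, 0.3])
hrir_far = np.array([0.0, 0.4, 0.1])
```

Feeding any mono signal through `binaural_render(signal, hrir_near, hrir_far)` produces a left channel that is louder and earlier than the right, which the brain reads as a source off to the left.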
In VR, the position and orientation of the listener are constantly changing (six degrees of freedom, or 6DOF), so panning is not suitable. The first reason is that panning is limited by a physical speaker layout: it is very difficult to install speakers in every direction to reflect yaw, pitch, and roll correctly, and even then, distance would still not be conveyed properly. Another reason is that even if headphones are used for interactive panning, in-head localization will hinder a realistic listening experience, because the audio would sound like it is coming from inside your head rather than from the external world. With HRTF-based binaural rendering, however, sound can be placed anywhere in a virtual space, allowing the listener to hear sounds as they should be heard in a 6DOF environment.