Innovation Inspiration — Why We Created Our Own Audio Format


If you’ve attended any VR event, odds are high that you’ve heard some familiar phrases: “We’re still figuring out what works and what doesn’t,” or “There’s still no uniform, good audio format.” It might be comforting to hear that you’re not the only one beating yourself up over these things, but these are still vague complaints with no actionable solutions. The desire for solutions we can act on gets echoed again and again at events and throughout the community.

Recently, more people have been experimenting with Ambisonics as their VR audio format. If, however, you’ve ever heard how much sound quality degrades when listening in First-Order Ambisonics (FOA), you know immediately that it isn’t what you want.

The logic behind combining Ambisonics and object approaches for VR audio.

Ambisonics a.k.a. The Format on the Rise — Ambisonics has been around for a while but is only now becoming popular with the rise of VR. Essentially a snapshot of spherical sound, Ambisonics has also been referred to as “scene-based” audio. It lends itself nicely to 360 video and cinematic VR, and it creates believable ambiance from a recording perspective. Unfortunately, this alone doesn’t make it perfect for VR across the board. One of the first problems people run into is that individual sound sources can’t be manipulated. Localization accuracy is sacrificed, and many listeners perceive sound sources as part of a blurred whole with some general directivity. Localization becomes comparable to object-based audio only above Third-Order Ambisonics, but since many Digital Audio Workstations only support up to 8 channels per input track, Higher-Order Ambisonics isn’t even compatible with those authoring tools. An Ambisonics approach is also inappropriate for delivering non-diegetic stereo such as background music or narration. Most importantly for current VR trends, it doesn’t work well in 6DOF (Six Degrees of Freedom) settings: as illustrated in the image below, it’s difficult to accurately reflect the movement of the VR user when the recording captures the scene from only one position.

3DOF and 6DOF experiences require different audio approaches
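To put the channel-count problem above in concrete terms: a full-sphere Ambisonics stream of order N carries (N + 1)² channels, so anything beyond First-Order already overflows an 8-channel DAW track. A quick sketch (plain Python, standard Ambisonics math, nothing product-specific):

```python
# Channel count for a full-sphere Ambisonics signal of order N is (N + 1)^2.
def ambisonic_channel_count(order: int) -> int:
    return (order + 1) ** 2

for order in range(1, 5):
    channels = ambisonic_channel_count(order)
    print(f"Order {order}: {channels} channels "
          f"({'fits' if channels <= 8 else 'exceeds'} an 8-channel DAW track)")
```

Even Second-Order needs 9 channels, and Third-Order, where localization starts to rival objects, needs 16.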

Object a.k.a. The Perfect Partner — Object-based audio has emerged as the optimal way to complement Ambisonics. When sound sources are represented as individual object (or mono) tracks, localization is possible at a more precise level than ever before.

It’s difficult to capture a dry signal of each individual object during a live recording, but if you manage to do so, you no longer need to worry about mapping it to a loudspeaker position. Instead, you can place sound objects at any point in three-dimensional space, providing an unprecedented level of freedom. When paired with a renderer that supports an object-based mix, there is no downmixing between the project and the end-user listening environment, which guarantees that the creator’s original vision is preserved at every point in the workflow chain. Additionally, with the independent control that object-based audio affords, mixing and mastering an object signal yields a more natural result than doing the same to an Ambisonic signal. What about those tricky 6DOF environments that were so tough for Ambisonics? Object-based audio easily reflects the user’s gameplay and interactions, which is why it has already been used for an extensive variety of game content. Individual objects tend to do a poor job of portraying scene-sized ambiance, but their raw characteristics can be fully leveraged in VR, especially when paired with Ambisonics.
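A minimal sketch of why objects make 6DOF tractable: each source is just a dry stem plus a position, and tracking the user reduces to geometry. All names here are illustrative, not taken from any shipping SDK:

```python
import math
from dataclasses import dataclass

@dataclass
class AudioObject:
    # A dry mono stem plus a position in the scene; purely illustrative fields.
    samples: list            # mono PCM samples
    position: tuple          # (x, y, z) in metres, scene coordinates

def direction_from_listener(obj, listener_pos, listener_yaw):
    """Where the object sits relative to a listener who has moved and turned.

    Because each source stays a discrete object, 6DOF support is just
    per-frame geometry: no re-recording or downmixing is required.
    """
    dx = obj.position[0] - listener_pos[0]
    dy = obj.position[1] - listener_pos[1]
    azimuth = math.atan2(dy, dx) - listener_yaw    # radians, relative to gaze
    distance = math.hypot(dx, dy)
    return azimuth, distance

# The listener walks forward; the object's apparent direction simply updates.
obj = AudioObject(samples=[0.0] * 480, position=(2.0, 1.0, 0.0))
print(direction_from_listener(obj, listener_pos=(0.0, 1.0, 0.0), listener_yaw=0.0))
```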

Channel a.k.a. The Familiar Friend — If an object-plus-Ambisonics approach is already good enough, why does channel even need to be considered? Channel signals have been used for a long time and are still widely used in conventional audio workflows, and many players and devices still recognize and support only stereo, with no support for Ambisonics or object. But while channel is the most familiar audio format, it has critical shortcomings for VR applications. Channel inherently carries a discrepancy between where a sound is generated (like someone talking) and where the user perceives it (from various speaker locations). No matter how much you increase the number of channels, this approach relies on spreading energy across speakers, resulting in blurred, inaccurate localization. It also requires a two-step rendering conversion, illustrated with the quadraphonic layout in the image below, which causes a substantial loss of quality. (This conversion to binaural output occurs for any channel count above 2.0, not just a quadraphonic layout.) First, a source is mapped to speakers in a process called amplitude panning; the speaker feeds then go through a binaural rendering process and are mapped to each headphone driver. Most adversely for VR, channel relies on fixed virtual speaker locations. So while it is potentially possible to use channel for 3DOF, the two-stage rendering process and blurred directionality significantly harm the audio quality.


Sound sources are mapped to speakers first and then binaurally rendered to headphones
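A toy version of that two-stage pipeline, with naive cosine panning standing in for a real panner and unit impulses standing in for measured HRIRs (both are placeholders, not how any shipping renderer works):

```python
import numpy as np

speaker_azimuths = np.radians([45, 135, 225, 315])   # fixed quadraphonic layout

def pan_gains(source_azimuth):
    """Step 1, amplitude panning: weight speakers by angular proximity.

    Real panners (e.g. pairwise/VBAP) are smarter, but the effect is the same:
    one source's energy is spread across several speakers, blurring direction.
    """
    gains = np.clip(np.cos(speaker_azimuths - source_azimuth), 0.0, None)
    return gains / (np.linalg.norm(gains) + 1e-12)

def render_binaural(mono, source_azimuth, hrirs):
    """Step 2, binaural rendering: convolve each speaker feed with the HRIR
    pair for that speaker's fixed position. hrirs: (n_speakers, 2, hrir_len)."""
    gains = pan_gains(source_azimuth)
    out = np.zeros((2, len(mono) + hrirs.shape[2] - 1))
    for spk, g in enumerate(gains):
        for ear in (0, 1):
            out[ear] += np.convolve(mono * g, hrirs[spk, ear])
    return out

# Dummy data: white-noise source, unit-impulse "HRIRs" (no real filtering).
mono = np.random.randn(4800)
hrirs = np.zeros((4, 2, 128)); hrirs[:, :, 0] = 1.0
binaural = render_binaural(mono, np.radians(30), hrirs)
```

Note that the source's true direction survives only as a set of speaker gains after step 1; step 2 can only render those fixed speaker positions, which is exactly the quality loss described above.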
MPEG-H 3D Audio does support the combination of object, channel, and Ambisonics

Fortunately, the shortcomings of each format can be offset by the strengths of another. Is it even possible, however, to combine support for all three? The 3D Audio codec established by MPEG does just that. Core members of G’Audio Lab have actually been key participants in international audio standardization meetings going all the way back to 2005.

While MPEG-H 3D Audio was developed to support channel, object, and/or Ambisonic signals, it is not optimal for VR. At its inception, VR applications and the accompanying headphone delivery methods weren’t really on the radar. MPEG-H 3D Audio was created as a standard for situations in which many loudspeakers may be used in the audio presentation. It was meant to address High-Efficiency Coding and Media Delivery in heterogeneous environments, and the focus became UHDTV and multi-channel configurations like 22.2 surround sound. It’s important to note that MPEG-H 3D Audio is both a format and a codec. As a codec, its essential functions revolve around receiving the appropriate signal and then delivering or playing it. As such, it neglects integral parts of the entire VR audio workflow — namely, recording/capturing as well as post-production. Additionally, since it is an entirely new codec, the current market is not readily prepared to adopt it, which discourages creators from using the format even more. An ideal VR audio format would be codec-agnostic and easily adopted by the market. All of this returns us to the original complaints and the quest for a unified VR audio format.

Not stopping at the creation of an entirely new audio format

After contributing to MPEG-H 3D Audio and drawing inspiration from its success, the G’Audio team decided to dedicate our binaural rendering technology wholeheartedly to the VR medium. We quickly realized that it was imperative to create a new format that could incorporate the strengths of all three audio formats. “Just as stereo is played by mp3, VR audio should be played by GAO!” The rest is history. GAO = G’Audio + Ambisonics + Object.

This proprietary format developed by G’Audio Lab delivers a superior sense of localization and sound quality compared to existing industry offerings. It supports all three audio signal types, allowing object tracks to provide pinpointed sound while Ambisonics supplies meaningful (and believable) ambiance to the scene. GAO metadata contains the positional information for playing and rendering the respective object, channel, and Ambisonics signals; it is accompanied by an audio wave file and the desired video container upon export. No existing audio format has supported this combination, so it marks a big step forward for VR audio.
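We can only guess at the on-disk details of a proprietary format, but conceptually the metadata has to associate each region of the wave file with a signal type and, for objects, time-stamped positions. A purely hypothetical sketch (every field name below is an assumption for illustration, not the actual GAO spec):

```python
# Hypothetical positional metadata accompanying a multichannel wave file.
# G'Audio has not published the GAO spec; all field names are invented here.
gao_metadata_sketch = {
    "tracks": [
        {"type": "object", "wav_channels": [0],             # dry mono stem
         "keyframes": [{"t": 0.0, "pos": [1.0, 0.0, 0.2]},  # time-stamped
                       {"t": 2.5, "pos": [0.0, 1.0, 0.2]}]},#   3D positions
        {"type": "ambisonics", "order": 1, "wav_channels": [1, 2, 3, 4]},
        {"type": "channel", "layout": "stereo",
         "wav_channels": [5, 6], "head_locked": True},      # non-diegetic bed
    ],
}
```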

Creating an A to Z audio workflow is no easy task

Almost immediately after developing GAO, it became clear that the format alone was not enough. Final sound quality is determined not only by how a track is recorded, mixed/mastered, and exported, but also by how it is rendered and played. To date, no object-based player platform on the market supports HRTF rendering. So it seemed that GAO, and any promise of combining objects and Ambisonics, had hit a dead end — it couldn’t be rendered and played on any existing publishing platform. It didn’t take much to see that somebody needed to build an entirely new renderer… and that’s precisely what the G’Audio team did.

Facebook has started to support a combination of Ambisonics and stereo, while YouTube supports FOA. When a project is exported in FOA, all channels are inescapably downmixed into a four-channel output, which is portrayed as 8 virtual speakers. This inevitably deteriorates sound quality and localization accuracy. By comparison, an object-based mix can be portrayed as a potentially unlimited number of virtual sources, preserving both sound quality and localization accuracy. G’Audio Sol, the spatial audio renderer SDK, was built to support GAO and provide a solution to this issue. The Sol SDK can be integrated into any web player, HMD, or standalone app in a simple and straightforward process. Sol also supports FOA, and can even improve the quality of FOA projects thanks to its superior binaural rendering engine.
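For contrast with the object path, here is roughly what the FOA path forces on every source: four channels in, a fixed set of virtual speakers out. This is a simplified sampling decode; the actual decoders used by YouTube or Sol differ in their gains and normalization:

```python
import numpy as np

# Sketch of a basic "sampling" decode from FOA (W, X, Y, Z) to a cube of 8
# virtual speakers, the fixed intermediate layout described above.
corners = np.array([(x, y, z) for x in (1, -1) for y in (1, -1) for z in (1, -1)],
                   dtype=float) / np.sqrt(3.0)             # 8 unit direction vectors

def decode_foa_to_cube(W, X, Y, Z):
    """Every source in the scene has already been squeezed into these four
    signals, so no decode can recover per-source positions afterwards."""
    return np.stack([0.5 * (W + x * X + y * Y + z * Z) for x, y, z in corners])

# Four FOA channels in, eight fixed virtual-speaker feeds out.
W = X = Y = Z = np.zeros(480)
print(decode_foa_to_cube(W, X, Y, Z).shape)    # (8, 480)
```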

Yes, we even built a Sol reference player…
G’Player can be downloaded for free from the Oculus Store

When Pro Tools projects are exported in GAO and played through Sol — on any player or platform that has the Sol SDK integrated — creators can deliver their original vision in its purest form. Those complaints that plague panels and conventions today can finally be silenced. Still not convinced? Go ahead and bounce your project in GAO format, listen to it from G’Player, and hear the difference.
