Audio Moderation: How It Works and Why It Matters

April 27, 2023 | By Jeff Meyer | Profanity Filter

Audio moderation is the process of reviewing spoken content in voice messages, podcasts, live streams, and video audio to detect harmful, unsafe, or policy-violating material. Because audio can include multiple speakers, background noise, accents, and tone, it often requires both AI transcription and human review to moderate accurately.

Did you know that more than a billion hours of video are watched on YouTube every day? And a growing proportion of this content is audio-only. From social media to comment sections, audio content is quickly becoming an essential tool for brands to communicate with their audience. However, audio moderation presents unique challenges for online communities that you might not have considered.

With the explosive growth of user-generated content on social media and other online platforms, many brands are finding it increasingly difficult to keep track of what their users are uploading, particularly audio files (whether standalone MP3 files such as podcasts or the audio tracks of video files). This can leave platforms vulnerable to all manner of risks, from legal challenges over copyrighted material to damage to brand reputation from harmful or inappropriate audio content. Without the right tools and expertise to protect them, businesses leave themselves open to serious consequences.

When it comes to audio moderation, threats take many forms. Consider that audio includes: music, podcasts, live streams, direct message (DM) sound bites, background and foreground voices in video, and even stock sound effect files. Of these, video and DM audio pose the biggest risks, with DMs being a particular issue due to their potential for use in spamming. Brands that use audio as a marketing channel or communication option for their customers must be aware of and mitigate said risks, just as they would with images or written content.

WebPurify’s audio moderation service combines human moderators with AI to ensure accurate and efficient moderation. Moderators check audio in the language they speak natively, while AI transcribes the audio file into text and then runs that text through WebPurify’s profanity and intent filters. The combination of human expertise and AI technology ensures review is comprehensive but scalable, and fast but not at the expense of accuracy.

Of course, depending on desired turnaround time and budget, some companies need to tailor their approach accordingly. For example, they might opt to use only AI audio moderation unless a submission contains certain key phrases that have dual meanings or can be malicious in certain contexts, in which case the entire submission is escalated to a human. This shortens turnaround times and reduces human moderator workloads, but still affords the ability to closely review nuanced content that would be a challenge for AI alone.
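As a rough illustration of that kind of escalation rule, here is a minimal sketch in Python. The phrase list, the function name `route_transcript` and the two route labels are all hypothetical and chosen for illustration; this is not WebPurify’s implementation or API, and a real list of ambiguous phrases would be far longer and client-specific.

```python
import re

# Hypothetical examples of phrases whose meaning depends on context.
# A real deployment would maintain its own, much longer list.
AMBIGUOUS_PHRASES = ("kill me now", "just shoot me", "i wish i was dead")


def route_transcript(transcript: str) -> str:
    """Return "human_review" if an ambiguous phrase appears, else "ai_only"."""
    # Lowercase and collapse whitespace so spacing quirks in the
    # transcript do not hide a phrase.
    normalized = re.sub(r"\s+", " ", transcript.lower())
    for phrase in AMBIGUOUS_PHRASES:
        # Word boundaries keep the phrase from matching inside longer words.
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
            return "human_review"
    return "ai_only"
```

Under this sketch, a transcript such as "Ugh, Monday again. Just shoot me." would be routed to a human, while one with no listed phrase would stay on the AI-only path.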

WebPurify’s audio moderation service has helped brands across a spectrum of industries avoid bruised reputations, damaged user experiences and the myriad legal issues that follow, not to mention adverse effects to company bottom lines. This is particularly noteworthy because audio moderation often gets short shrift relative to other content types. Brands know it’s important, but it’s a bit less approachable and more complicated than image, text or video review, and thus often deprioritized and victim to budget constraints.

Below we answer some of the most common questions about audio moderation, and hopefully, illuminate one of the less discussed but no less important parts of a truly comprehensive UGC review.

What types of audio need moderating most?

DM (direct message) audio and audio in video (particularly background or overlapping voices) pose the biggest risks. They’re found in abundance online, posted almost as quickly as they’re reviewed, and aren’t always of the highest quality (as opposed to a podcast or recorded music). Especially in the era of smartphones, most video isn’t a sound-engineered feature film but an amateur clip made with social media in mind. The subjects’ voices can be difficult to isolate or untangle from those of their friends, street noise, or music, and this makes it easier for a bad word or phrase to go undetected. Voice DMs are a fun, useful feature but can be a boon for spam if not carefully checked.

Unlike text DMs, they’re far more difficult for AI to peg as inappropriate without human help. They can even be intentionally inscrutable as part of a harassment ploy and, when meant to be upsetting, are more successful than written messages, which are less visceral. For example, WebPurify has encountered more than a few instances on social and dating apps wherein a user will leave voice memos that have no foul language, but involve heavy breathing, suggestive sounds, or simply silence. None of these examples will trigger an AI filter, but all are clearly disturbing. Human moderation (and/or user reporting features) is required.

How does WebPurify’s audio moderation service work?

Any audio, be it from a video, podcast or live chat, is run through an AI tool that transcribes it very quickly (not quite real-time, but close to it). Said transcription is done in the file’s native language to account for colloquialisms. That’s step one. This transcription is then fed through WebPurify’s industry-leading profanity filter to instantly detect any obvious offenses. As with any use of the profanity filter, the client’s custom block/allow lists will be enforced for audio as they are for written text.

Similar to WebPurify’s photo and video moderation, this quick process eliminates the majority of offensive content. Anything on WebPurify’s standard NSFW list and any obvious violations of the client’s defined rules are quickly filtered out. But anything that isn’t clear-cut or gets lost in translation is delegated to the human audio moderation team. These teams can directly review audio in English, Spanish and several major Indian languages. If moderation in other languages, such as French, is required, WebPurify has longstanding global partnerships that achieve this quickly and efficiently.
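The flow described above (transcribe, filter against block/allow lists, hand anything unclear to humans) can be sketched roughly as follows. This is an illustration under stated assumptions, not WebPurify’s actual service or API: the names `Verdict` and `moderate_transcript`, the `stt_confidence` score (assumed to come from whatever speech-to-text tool produced the transcript) and the 0.8 threshold are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    decision: str   # "reject", "escalate", or "approve"
    matched: tuple  # blocked terms found in the transcript


def moderate_transcript(transcript, block_list, allow_list=frozenset(),
                        stt_confidence=1.0, min_confidence=0.8):
    """Filter a transcript, escalating unclear audio to human review."""
    # Split the transcript into lowercase words.
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    # The client's allow list overrides the block list.
    blocked = {w.lower() for w in block_list} - {w.lower() for w in allow_list}
    matched = tuple(sorted(words & blocked))
    if matched:
        # Obvious violation: filter it out automatically.
        return Verdict("reject", matched)
    if stt_confidence < min_confidence:
        # The transcript may be unreliable, so a human should listen.
        return Verdict("escalate", ())
    return Verdict("approve", ())
```

In this sketch a clean transcript from noisy audio is escalated rather than approved, reflecting the idea that a transcript is only as trustworthy as the transcription behind it.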

How do WebPurify’s human moderators level up audio moderation?

Outside of audio that features multiple languages or is difficult to transcribe as a consequence of low quality, tonality is far and away the toughest challenge for AI. In fact, it’s near impossible to accurately discern without humans. From sarcasm to local vernacular, the nuance of speech matters. Put into real terms, almost everyone has at some point in their lives sarcastically said “Kill me now,” “Just shoot me,” or “I wish I was dead.” Usually, it’s in jest, but AI doesn’t know that. This is why WebPurify believes it’s so important to have a balance of AI and human content moderators. AI filters out the egregious examples, while humans preserve the integrity and user experience of your platform by understanding when someone is using black humor.

Humans are also relied upon where copyright infringement is a concern. While some AI certainly can identify well-known lyrics as a “match” for intellectual property, this is usually reserved for better-known artists. Speeches, ebooks, articles, impersonations and undiscovered music aren’t as straightforward, and AI can’t cross-reference them with a quick Google search in the way a human might when doing due diligence on a suspected copyright violation.

What are the challenges when moderating audio files?

In a perfect world, every audio file that needs scrutinizing is recorded in a quiet space and in one language. But life doesn’t work that way. At campaign events, for instance, politicians often speak over one another or members of an audience will shout at the same time, competing to be heard. At multicultural events, especially sporting occasions, you’ll hear multiple languages in the crowd. Conversely, some individuals might be speaking in an otherwise sterile environment, with little background noise, but they have a thick accent when not talking in their mother tongue. The lyrics to many songs, in any language, are hard to decipher on first listen. And, lastly, sometimes audio involves discussion of very unpleasant or controversial topics in an effort to repudiate or analyze them, not support them.

All of the above can be tough for AI alone to comprehend, especially when you consider that automated moderation starts with transcription. If an uploaded file’s sound is “messy,” things can quite literally get lost in transcription, at which point the AI model is being fed bad data. This is where human teams are key: they speak multiple languages and are trained to take context into consideration, and to listen for something inappropriate shouted in the background of a video or said under someone’s breath in a podcast.

How can brands reduce their audio moderation costs?

If you’re a brand that requires audio moderation but needs to be extra mindful of budget, the best thing you can do to lower costs while increasing accuracy is to include closed captioning on your videos wherever possible. Not only is this a helpful nod to ADA accessibility, but subtitles save time on transcription, allowing AI to jump straight into using OCR for review and allowing human moderators to both read and listen, making it easier to get things right the first time. The efficiency gain is such that if you approach a moderation company like WebPurify with videos that mostly include closed captioning, you will likely get more competitive pricing, since they represent an overall lighter AI “lift” and/or human workload.
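To make the time saving concrete, here is a small sketch of pulling reviewable text out of captions. It assumes the captions are available as a sidecar file in the common SubRip (.srt) format rather than burned into the picture, which is a different case from the OCR route mentioned above; the function name `captions_to_text` is hypothetical.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:02,500".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def captions_to_text(srt: str) -> str:
    """Return the spoken text of an SRT caption file as one string."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        # Skip blank separators, cue counters and timing lines.
        # Caveat: a caption consisting only of digits is also skipped.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        kept.append(line)
    return " ".join(kept)
```

The resulting text can go straight to a text filter with no transcription step, which is where the saving comes from.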

In conclusion

Audio is absolutely one of the four legs of the user-generated content stool, the others being images, video and text. Too often, though, keeping it safe is thought of as “one more expense” and less pressing than, say, catching nude photos on social media. It’s also admittedly tough to do right. Since sound can’t be “seen,” it’s by nature trickier to moderate, especially when multiple voices, tonality, background noise, accents and languages come into play. A video might be slightly grainy, or text misspelled to camouflage something NSFW, but these are modest challenges in comparison to reviewing the spoken (or shouted!) word in a podcast, protest or piece of music, for example.

Even so, or perhaps especially so, it doesn’t pay to skimp on audio moderation. It’s the UGC medium most likely to have something inappropriate slip through the cracks, and brands need to hedge accordingly. Fortunately, as we’ve illustrated here, with a bit of guidance and the prudent blending of human and AI review, companies can keep costs down while keeping their guard up.

For more information on our service, contact us or visit our video moderation page.
