Think like a bad guy: how to adopt the mindset of bad actors to test your content moderation defenses
August 8, 2024 | UGC
WebPurify is best known for its versatile suite of content moderation services, comprising a combination of human and AI content review that can be used standalone, or together in a hybrid fashion. Complementing this core competency is our Trust and Safety Consultancy, which helps customers across many industries and use cases with their more holistic moderation concerns: from drafting playbooks and defining brand criteria to best practices for hiring and scaling their in-house teams.
Excellent moderation tools, well-laid rules of engagement and defined brand criteria: it all sounds good, and it is, but as the adage goes, even the best-laid plans need reworking after first contact with the enemy. It’s for this reason that WebPurify advises our clients to assume the mindset of a bad actor when designing, implementing (and testing) their content moderation defenses.
After all, it’s far preferable to stress-test your setup and discover gaps in protection when you’re playing the adversary versus dealing with a real one.
Even though this is an incredibly useful exercise, for many it doesn’t come naturally. They don’t know where to start beyond basic concerns like use of crude language or flagging explicit photos.
Fortunately, the WebPurify team is here to help. We’re well acquainted with many ways folks acting in bad faith try to bypass content moderation, and we help our customers peer into these characters’ psyches in order to prepare accordingly.
Yes, implementing solutions like WebPurify is more than half the battle, but how those solutions are folded into your platform overall also matters, as does thinking of threats that fall well outside “the norm”. This is especially true when deciding between using AI on its own or in combination with humans, depending on your risk tolerance.
How to test your content moderation defenses
Here are some tips to help you assume a bad actor’s mindset and truly put your content moderation setup through its paces:
1. Forget the basics, embrace edge cases
Don’t waste too much time checking if your AI or human teams catch variations on basic things like weapons or nudity. Consider instead:
- If it’s likely your users might upload artistic, non-photorealistic but nonetheless inappropriate content (which can fool even the best AI models and necessitate human oversight).
- What types of online scams your industry, customer base and platform are most susceptible to.
- Whether your users can combine (caption) images with text, thereby giving an otherwise harmless photo and text snippet new meaning when used together. For more on this, see our section on ‘Context’ below, and the short test-suite sketch that follows this list.
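On that last point, it helps to maintain a small suite of adversarial image-and-caption pairs and run them through whatever moderation call your platform makes, rather than testing images in isolation. Here’s a minimal sketch, assuming a hypothetical submit_for_moderation() stand-in for your real integration (WebPurify’s or otherwise):

```python
# Minimal sketch of an image + caption "combination" test suite.
# submit_for_moderation() is a hypothetical stand-in for your real pipeline.

TEST_CASES = [
    # (image URL, caption, decision your playbook expects)
    ("https://yourcdn.example.com/whale.jpg", "your mom", "escalate"),
    ("https://yourcdn.example.com/whale.jpg", "amazing sighting today", "approve"),
    ("https://yourcdn.example.com/stylized-illustration.png", "", "escalate"),
]

def submit_for_moderation(image_url: str, caption: str) -> str:
    # Replace with your actual moderation integration; this placeholder
    # approves everything so the gaps show up when you run the suite.
    return "approve"

for image_url, caption, expected in TEST_CASES:
    decision = submit_for_moderation(image_url, caption)
    status = "OK" if decision == expected else "GAP"
    print(f"[{status}] {image_url} + {caption!r}: got {decision}, expected {expected}")
```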
2. Walk the line of your criteria
Purposely toe the line of your own criteria and see how your playbook or Standard Operating Procedure (SOP) for moderation holds up under scrutiny. Is it too vague? Too strict? Are processes in place to make amendments when new, unconsidered challenges arise? For example, you might have a rule that reads, “No alcohol use.” Would a photo of wine being used in a religious service violate this? How about a racecar driver spraying champagne to celebrate a win?
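One way to make this exercise repeatable is to encode each rule alongside its known exceptions and a default action for ambiguous cases, so borderline examples like those above become explicit test inputs rather than on-the-spot judgment calls. The structure below is purely illustrative (the field names are our own invention, not a prescribed WebPurify schema):

```python
# Illustrative only: encoding an SOP rule with explicit exceptions and a
# default action for ambiguous cases. Field names are hypothetical.

ALCOHOL_RULE = {
    "rule": "No alcohol use",
    "exceptions": ["religious or ceremonial use", "podium/celebration imagery"],
    "when_ambiguous": "escalate_to_human",
}

# Borderline cases to run past your moderators (or vendor) periodically,
# paired with the outcome your playbook says you expect.
BORDERLINE_CASES = [
    ("wine being used in a religious service", "allow"),
    ("racecar driver spraying champagne after a win", "allow"),
    ("user chugging from a bottle as part of a drinking game", "remove"),
]

for description, expected in BORDERLINE_CASES:
    print(f"Rule: {ALCOHOL_RULE['rule']!r} | Case: {description} | Expected: {expected}")
```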
3. Sweep for points of entry
It’s easy to get fixated on the places within your website or app where user-generated content (UGC) is obviously going to be making a regular appearance. This can leave other parts of your platform relatively unsupervised and tempting for ill-intentioned users to exploit.
- Do you have content moderation safeguards in place for your customer service portal? Can you protect your employees from bad actors on chat support and, conversely, can you protect your customers from an irate employee?
- Is live content (if applicable) as thoroughly moderated as pre-recorded or static content?
- If you’re in an industry less commonly associated with content moderation, or thought to “not really need it as much,” have you scrutinized this assumption? For example, many food and beverage companies use WebPurify to flag inappropriate order/ticket names provided by customers for food pickup, or even internal notes printed on order tickets and written by staff that may be disparaging of a customer. Similarly, delivery apps often moderate direct messages between drivers and recipients but fail to moderate “delivered” confirmation photos, which can be highly inappropriate on rare (but very upsetting) occasions.
4. Stay vigilant for spikes in traffic and content submissions, and for rare violations that are easy to miss
We’ve touched a bit on this in other blog entries, but it bears repeating. If you’re using an AI moderation solution, be sure it’s built to scale in the case of malicious attempts to overwhelm with high content volumes (or more benign spikes during holiday traffic, for that matter).
Likewise, don’t neglect to train your human moderators on what to do if they’re falling behind in their queue. Never retain a third-party vendor that doesn’t have processes in place to ensure it delivers on its SLA, and be sure to test these protocols from time to time, checking that your moderators don’t freeze up in the face of big swings in volume.
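If you want to simulate a malicious (or seasonal) surge before one happens for real, a simple burst test against a staging environment can be revealing. The sketch below is a generic illustration: submit_one() is a placeholder for whatever staging call your moderation pipeline exposes, and the numbers are arbitrary. Never point a test like this at production.

```python
# Generic burst-test sketch: fire a batch of submissions concurrently at a
# STAGING endpoint and report latency percentiles. submit_one() is a
# placeholder for your real staging call; the numbers are arbitrary.
import time
from concurrent.futures import ThreadPoolExecutor

BURST_SIZE = 200

def submit_one(_: int) -> float:
    start = time.perf_counter()
    # Replace with a real staging call, e.g. an HTTP POST of a test image.
    time.sleep(0.05)  # placeholder standing in for network + processing time
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=50) as pool:
    latencies = sorted(pool.map(submit_one, range(BURST_SIZE)))

print(f"p50={latencies[len(latencies) // 2]:.3f}s  p95={latencies[int(len(latencies) * 0.95)]:.3f}s")
```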
5. Develop a quality assurance plan
A healthy QA process is always advisable, and should include dropping “test” images into moderators’ queues: images featuring violative but less commonly seen content that breaks your less frequently enforced rules. Don’t let something slip through your defenses because it’s a rare offense and your moderators (or vendors’ moderators) have become complacent.
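One common way to operationalize this is a “golden set”: a pool of known-violative (but safe-to-handle) test items that get quietly injected into moderator queues, with decisions later compared against the expected labels. The sketch below is a generic illustration with hypothetical structures, not a description of WebPurify’s internal tooling:

```python
# Generic golden-set sketch (hypothetical structures, not WebPurify internals):
# quietly inject known-violative test items into the review queue, then
# measure how often moderators catch them.
import random
from dataclasses import dataclass

@dataclass
class GoldenItem:
    item_id: str
    expected_label: str  # e.g. "reject"

GOLDEN_SET = [
    GoldenItem("golden-001", "reject"),  # rarely enforced rule violation
    GoldenItem("golden-002", "reject"),
]

def maybe_inject(queue: list, injection_rate: float = 0.02) -> None:
    """Append a random golden item roughly once per 1/injection_rate real items."""
    if random.random() < injection_rate:
        queue.append(random.choice(GOLDEN_SET))

def catch_rate(decisions: dict) -> float:
    """decisions maps item_id -> the label a moderator actually applied."""
    hits = sum(1 for g in GOLDEN_SET if decisions.get(g.item_id) == g.expected_label)
    return hits / len(GOLDEN_SET)
```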
6. Play the troll (and read about them, too): what bad behavior do your products lend themselves to?
Strange as it sounds, it helps to dream up ways you could cause problems for your platform and read about how bad guys caused trouble for the other players in your industry (there’s unfortunately no shortage of reporting on PR headaches caused by sloppy or insufficient moderation). It pays to brainstorm what you would do were you intent on causing your own company harm, especially in ways specific to your UX and product features. Don’t be afraid to get creative: this aids in covering the less obvious bases. To illustrate:
- If you’re in the video game industry, you might consider how users could maliciously exploit a bug or glitch to ruin the user experience for everyone. Will you depend on a player to report this to mods, or is it better to have a team of moderators secret shopping the game for bad behavior?
- If you’re an ecommerce company, what’s your strategy for detecting and taking down listings that represent Intellectual Property infringement? Do you have measures in place that prevent buyers and sellers from completing the transaction off-platform, cutting you out? Have you put these measures to the test? (A toy example of one such check follows this list.)
- If you’re in any number of industries, you might offer a refund request function. Have you attempted to falsely claim refunds, perhaps even going so far as to provide a photoshopped receipt? What was the outcome?
- Is your platform something built for creators in the vein of Fansly or Patreon? If so, you likely vet their accounts (and the user’s identity) when they’re first created. Do you subsequently vet these accounts again? If a person entirely different than the initial accountholder started uploading content, would this be detected?
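On the ecommerce bullet above, one of the simplest self-tests is checking whether lightly obfuscated contact details sail through your messaging moderation. The toy detector below is our own illustration, not production logic; a motivated seller will try to defeat exactly this kind of check, which is what makes it a useful baseline to test against:

```python
# Toy check for off-platform contact sharing in buyer/seller messages.
# Illustrative only; real evasion (spacing, emoji, imagery) gets far more creative.
import re

PATTERNS = [
    re.compile(r"\b[\w.+-]+\s*(?:@|\(at\)|\[at\])\s*[\w-]+\s*(?:\.|dot)\s*\w{2,}\b", re.I),  # emails, incl. "at"/"dot"
    re.compile(r"\b(?:\+?\d[\s\-.]?){7,15}\b"),                                              # phone-number-like digit runs
    re.compile(r"\b(?:whats\s*app|telegram|venmo|cash\s*app)\b", re.I),                      # common off-platform channels
]

def looks_like_offplatform_pitch(message: str) -> bool:
    return any(p.search(message) for p in PATTERNS)

print(looks_like_offplatform_pitch("msg me on whatsapp, five five five 0199"))  # True (keyword match)
print(looks_like_offplatform_pitch("Is this still available?"))                 # False
```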
7. Submit your own images to see how your AI performs
Testing your content moderation setup with real examples from your platform is a very good way to see how well your defenses stack up. WebPurify’s new image moderation demo tool makes this process of testing our AI straightforward and insightful. By submitting your own images, you can see how we handle various types of content, identifying what gets flagged and what requires escalation to humans.
This demo is invaluable for understanding the limits of AI moderation. Artistic or non-photorealistic images, for instance, can often be tricky for AI to classify correctly. By testing with your specific content, you can fine-tune your moderation rules and the measures you take to better suit your platform’s needs and stay ahead of potential issues.
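If you’d rather script this kind of testing than (or in addition to) using the demo page, the snippet below shows the general shape of such a call. The endpoint URL and parameter names are placeholders, not WebPurify’s documented values: substitute the details from your provider’s current API documentation and your own API key.

```python
# Sketch of scripting image tests against an image moderation API. The
# ENDPOINT and parameter names are placeholders: substitute the values from
# your provider's API documentation (e.g. WebPurify's image moderation API)
# and your own API key before running.
import requests

ENDPOINT = "https://example.invalid/moderation/imgcheck"  # placeholder URL
API_KEY = "YOUR_API_KEY"

SAMPLE_IMAGES = [
    "https://yourcdn.example.com/uploads/stylized-illustration.png",
    "https://yourcdn.example.com/uploads/product-photo.jpg",
]

for url in SAMPLE_IMAGES:
    resp = requests.get(
        ENDPOINT,
        params={"api_key": API_KEY, "imgurl": url, "format": "json"},  # placeholder params
        timeout=30,
    )
    print(url, "->", resp.json())
```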
8. Ensure account integrity
For any platform that allows users to create accounts, ensuring that the person using an account is the same one who created and verified it is crucial for maintaining the safety and integrity of your brand. KYC (Know Your Customer) processes are a start, but they need to be part of an ongoing strategy to prevent misuse. For example:
- Initial Verification vs. Ongoing Monitoring: verification shouldn’t stop at account creation. Continuous monitoring is essential to ensure account integrity, and regular re-verification, including biometric checks, can help maintain this.
- Multiple Accounts, One User: this poses a challenge to platform integrity, as individuals create multiple accounts to manipulate systems or evade bans. Implementing protocols such as monitoring IP addresses and analyzing behavior patterns can help detect and prevent these activities, ensuring that each user maintains a unique and verified identity.
- Behavioral Analysis: implementing behavioral analysis helps identify discrepancies in account usage. Sudden changes in posting patterns or content can signal different users on the same account (see the sketch after this list).
- Enhanced KYC Procedures: go beyond basic ID checks. Use additional verification layers like live selfies, video verification, and cross-referencing social media profiles to make it harder for accounts to be misused.
- Regular Audits: conduct regular audits to spot anomalies and identify compromised accounts. Being proactive helps maintain account integrity across your platform.
- Educate Users: inform your users about the importance of maintaining account integrity and the risks of sharing accounts. Encourage them to report any suspicious activity and make sure to provide easy re-verification options if needed.
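To make the behavioral-analysis bullet a little more concrete, here is a deliberately simple sketch: compare an account’s recent posting-hour pattern against its historical one and flag large shifts for review. Real systems weigh many more signals (devices, IP ranges, content style), and the metric and threshold here are arbitrary illustrations:

```python
# A deliberately simple behavioral-drift check: compare an account's recent
# posting-hour distribution with its historical one and flag big shifts.
# Threshold and metric are arbitrary illustrations, not production values.
from collections import Counter

def hour_distribution(post_hours: list) -> list:
    counts = Counter(post_hours)
    total = max(len(post_hours), 1)
    return [counts.get(h, 0) / total for h in range(24)]

def drift(historical_hours: list, recent_hours: list) -> float:
    """Total variation distance between the two hour-of-day distributions (0..1)."""
    p, q = hour_distribution(historical_hours), hour_distribution(recent_hours)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def flag_for_review(historical_hours: list, recent_hours: list, threshold: float = 0.6) -> bool:
    return drift(historical_hours, recent_hours) >= threshold

# Example: an account that historically posted mid-afternoon suddenly posts overnight.
print(flag_for_review([14, 15, 15, 16, 14, 15], [2, 3, 3, 2, 4, 3]))  # True
```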
9. Address the GenAI challenge
By now, everyone is aware that generative AI can create realistic images and videos, and that this capability carries vast potential for misuse. It’s a problem for platforms in every industry because generative AI has become so sophisticated that many of its images and videos are difficult to detect. Frankly, it’s a challenge many industries weren’t prepared for, and it’s only going to get worse.
WebPurify’s advanced AI image moderation model is part of our larger toolset for generative AI content moderation and is designed to accurately detect synthetic media, spotting subtle cues that indicate whether an image or video has been AI-generated. Our model ensures any potentially deceptive content is flagged for human review.
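On the platform side, the plumbing can be as simple as thresholding whatever confidence score a synthetic-media detector returns and routing uncertain cases to people. The sketch below is generic: synthetic_score() and the thresholds are hypothetical, not WebPurify’s actual model interface.

```python
# Generic routing sketch: synthetic_score() and the thresholds are
# hypothetical, not WebPurify's actual model interface.

def synthetic_score(image_bytes: bytes) -> float:
    """Hypothetical detector returning 0.0 (likely real) .. 1.0 (likely AI-generated)."""
    raise NotImplementedError("call your synthetic-media detection model here")

def route(image_bytes: bytes) -> str:
    score = synthetic_score(image_bytes)
    if score >= 0.85:
        return "block_pending_review"   # very likely synthetic: hold it
    if score >= 0.40:
        return "human_review"           # uncertain: a person decides
    return "auto_pass"                  # likely authentic: let it through
```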
10. Context is king – and devious
When it comes to effective content moderation, context is everything. Being able to understand nuance and context is integral to preventing an unwanted number of false positives and negatives, and this balance is best struck with a combination of human and AI moderation. As touched on earlier, an innocent image or phrase can become harmful or offensive when combined with the right (or wrong) text. This is why understanding and moderating contextual nuances is so important.
Let’s consider a juvenile but effective example: An innocuous picture of a whale paired with the caption “your mom” suddenly turns into a derogatory joke, but this might slip by a purely AI-powered moderation system that struggles with contextual analysis. Moderation programs must be equipped to analyze both images and accompanying text to catch such combinations, and this almost always means weaving human oversight into the mix.
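As a rough illustration of what weaving that human oversight into the mix can look like, the rule below escalates any post where the image labels are benign on their own but the caption adds targeting language. The label names and phrase list are invented for the example; a production system would lean on far richer text and vision signals.

```python
# Rough illustration: escalate when a benign image is paired with targeting
# language in the caption. Labels and phrases are invented for the example.
TARGETING_PHRASES = {"your mom", "looks like you", "this is you"}

def needs_human_review(image_labels: set, caption: str) -> bool:
    image_is_benign = not image_labels & {"nudity", "weapon", "gore"}
    caption_targets_someone = any(p in caption.lower() for p in TARGETING_PHRASES)
    return image_is_benign and caption_targets_someone

print(needs_human_review({"whale", "ocean"}, "your mom"))          # True: escalate
print(needs_human_review({"whale", "ocean"}, "amazing sighting"))  # False
```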
Certain phrases and images can also become inappropriate due to current events. For instance, you may be familiar with the phrase “Let’s go, Brandon,” which in some corners of the internet has become a way to deride President Biden while avoiding profanity filters. Keeping up with current events and understanding their potential impact on your platform’s content is vital in this age of fast-moving news cycles.
The important point here is that content moderation systems need to continuously learn and adapt to new contexts and trends. What was once considered acceptable might change overnight, and systems must be able to evolve with these changes to remain effective. At WebPurify, our Moderation as a Service solution is continuously updated – as frequently as every week – to respond to new challenges.
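A small, practical pattern that supports this kind of rapid iteration is keeping trend-driven phrases in a list your moderation service reloads whenever it changes, rather than baked into code. The file name below is arbitrary; the point is that updating a text file (or a config service) is all it should take to respond to a new phrase.

```python
# Minimal hot-reload sketch: trend-driven phrases live in a plain text file
# that the service re-reads whenever the file changes. File name is arbitrary.
import os

PHRASE_FILE = "trend_phrases.txt"   # one phrase per line, updated as trends emerge
_cache = {"mtime": 0.0, "phrases": set()}

def current_phrases() -> set:
    try:
        mtime = os.path.getmtime(PHRASE_FILE)
    except OSError:
        return _cache["phrases"]
    if mtime != _cache["mtime"]:
        with open(PHRASE_FILE, encoding="utf-8") as f:
            _cache["phrases"] = {line.strip().lower() for line in f if line.strip()}
        _cache["mtime"] = mtime
    return _cache["phrases"]

def contains_trending_phrase(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in current_phrases())
```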
In conclusion
To maintain a secure platform over time, brands must think ahead and anticipate the tactics bad actors might use to bypass their content moderation. By adopting a proactive mindset, you can identify vulnerabilities and strengthen your defenses before new challenges overwhelm them.
Whether it’s ensuring ongoing account verification, understanding the nuances of context, or utilizing advanced tools like WebPurify’s AI image model, preparation is key. By imagining how your systems could be exploited and thinking like a “bad guy”, you can better protect your platform and create a safer, more reliable space for your users.