GPT-4 can help moderate content online more quickly and consistently than humans can, the model’s maker OpenAI has argued.
Tech companies these days typically rely on a mix of algorithms and human moderators to identify, remove, or restrict access to problematic content shared by users. Machine-learning software can automatically block nudity or classify toxic speech, though it can fail to appreciate nuances and edge cases, resulting in it overreacting – bringing the ban hammer down on innocuous material – or missing harmful stuff entirely.
Thus, human moderators are still needed in the processing pipeline somewhere to review content flagged by algorithms or users, to decide whether things should be removed or allowed to stay. GPT-4, we’re told, can analyze text and be trained to automatically moderate content, including user comments, reducing “mental stress on human moderators.”
AIs can produce ‘dangerous’ content about eating disorders when prompted
Interestingly enough, OpenAI said it’s already using its own large language model for content policy development and content moderation decisions. In a nutshell: the AI super-lab has described how GPT-4 can help refine the rules of a content moderation policy, and its outputs can be used to train a smaller classifier that does the actual job of automatic moderation.
First, the chatbot is given a set of moderation guidelines that are designed to weed out, say, sexist and racist language as well as profanities. These instructions have to be carefully described in an input prompt to work properly. Next, a small dataset made up of samples of comments or content are moderated by humans following those guidelines to create a labelled dataset. GPT-4 is also given the guidelines as a prompt, and told to moderate the same text in the test dataset.
The labelled dataset generated by the humans is compared with the chatbot’s outputs to see where it failed. Users can then adjust the guidelines and input prompt to better describe how to follow specific content policy rules, and repeat the test until GPT-4’s outputs match the humans’ judgement. GPT-4’s predictions can then be used to finetune a smaller large language model to build a content moderation system.
As an example, OpenAI outlined a Q&A-style chatbot system that is asked the question: “How to steal a car?” The given guidelines state that “advice or instructions for non-violent wrongdoing” are not allowed on this hypothetical platform, so the bot should reject it. GPT-4 instead suggested the question was harmless because, in its own machine-generated explanation, “the request does not reference the generation of malware, drug trafficking, vandalism.”
So the guidelines are updated to clarify that “advice or instructions for non-violent wrongdoing including theft of property” is not allowed. Now GPT-4 agrees that the question is against policy, and rejects it.
This shows how GPT-4 can be used to refine guidelines and make decisions that can be used to build a smaller classifier that can do the moderation at scale. We’re assuming here that GPT-4 – not well known for its accuracy and reliability – actually works well enough to achieve this, natch.
The human touch is still needed
OpenAI thus believes its software, versus humans, can moderate content more quickly and adjust faster if policies need to change or be clarified. Human moderators have to be retrained, the biz posits, whereas GPT-4 can learn new rules by updating its input prompt.
“A content moderation system using GPT-4 results in much faster iteration on policy changes, reducing the cycle from months to hours,” the lab’s Lilian Weng, Vik Goel, and Andrea Vallone explained Tuesday.
“GPT-4 is also able to interpret rules and nuances in long content policy documentation and adapt instantly to policy updates, resulting in more consistent labeling.
“We believe this offers a more positive vision of the future of digital platforms, where AI can help moderate online traffic according to platform-specific policy and relieve the mental burden of a large number of human moderators. Anyone with OpenAI API access can implement this approach to create their own AI-assisted moderation system.”
OpenAI has been criticized for hiring workers in Kenya to help make ChatGPT less toxic. The human moderators were tasked with screening tens of thousands of text samples for sexist, racist, violent, and pornographic content, and were reportedly only paid up to $2 an hour. Some were left disturbed after reviewing obscene NSFW text for so long.
Although GPT-4 can help automatically moderate content, humans are still required since the technology isn’t foolproof, OpenAI said. As has been shown in the past, it’s possible that typos in toxic comments can evade detection, and other techniques such as prompt injection attacks can be used to override the safety guardrails of the chatbot.
“We use GPT-4 for content policy development and content moderation decisions, enabling more consistent labeling, a faster feedback loop for policy refinement, and less involvement from human moderators,” OpenAI’s team said. ®