A study published earlier this week by Surge AI seems to expose one of the AI industry’s biggest problems: bullshit, exploitative data labeling practices.
Last year, Google built a dataset called ‘GoEmotions’. It was billed as a “fine-grained emotion dataset” — basically a turnkey dataset for building AI that can recognize emotional sentiment in text.
Per a Google blog post:
In “GoEmotions: A Dataset of Fine-Grained Emotions,” we describe GoEmotions, a human-annotated dataset of 58,000 Reddit comments, extracted from popular English-language subreddits and labeled with 27 emotion categories. As the largest fully annotated English language fine-grained emotion dataset to date, we designed the GoEmotions taxonomy with both psychology and data applicability in mind.
Here’s another way to put it: Google scraped 58,000 Reddit comments and then shipped those files off to a third-party company for labeling. More on that later.
Surge AI examined a sample of 1,000 labeled comments from the GoEmotions dataset and found that a significant portion of them were mislabeled.
According to the study:
As much as 30% of the dataset is seriously mislabeled! (We tried to train a model on the dataset ourselves, but noticed serious quality issues. So we took 1000 random comments, asked Surgers if the original emotion was reasonably accurate, and found strong errors in 308 of them.)
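Surge’s 30% figure is a point estimate from a 1,000-comment sample. As a quick sanity check on its precision – the 308-of-1,000 numbers come from the study, while the confidence-interval code is just an illustrative sketch:

```python
import math

# Figures reported in the Surge AI study
sample_size = 1000
errors = 308

error_rate = errors / sample_size  # 0.308

# Normal-approximation 95% confidence interval for a proportion
se = math.sqrt(error_rate * (1 - error_rate) / sample_size)
lower, upper = error_rate - 1.96 * se, error_rate + 1.96 * se

print(f"estimated mislabel rate: {error_rate:.1%} "
      f"(95% CI roughly {lower:.1%} to {upper:.1%})")
```

Even at the low end of that interval, more than a quarter of the sampled labels were wrong – the exact percentage matters less than the scale of the problem.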
It goes on to point out some of the major issues with the dataset, including this doozy:
Problem #1: “Reddit comments were presented without additional metadata”
First, language does not live in a vacuum! Why present a comment without additional metadata? In particular, the subreddit and the parent post being responded to are important context.
Imagine seeing the comment “his traps hide the damn sun” on its own. Do you have any idea what it means? Probably not – maybe that’s why Google mislabeled it.
But what if you were told it came from /r/nattyorjuice, a subreddit devoted to bodybuilding? Would you then realize that “traps” refers to someone’s trapezius muscles?
This kind of data cannot be labeled correctly without context. Sticking with the “his traps hide the damn sun” comment as an example, it’s impossible to imagine a single person on the planet who could understand every fringe case of human sentiment.
It’s not that the specific labelers didn’t get it right; it’s that they were given an impossible task.
There are no shortcuts to understanding human communication. We are not stupid the way machines are. We can incorporate our entire environment and lived history into the context of our communication and, through even the tamest expression of our masterful grasp of semantic manipulation, turn nonsense into philosophy (“shit happens”) or turn a truly mundane statement into the punch line of a timeless joke (“to get to the other side”).
What these Google researchers have done is spend who knows how much time and money developing a crappy digital version of a Magic 8-Ball. Sometimes it’s right, sometimes it’s wrong, and there’s no way to be sure which is which.
This particular form of AI development is a grift. It’s a scam. And it’s one of the oldest ones in the book.
Here’s how it works: the researchers took an impossible problem – “how to determine human sentiment in text, at massive scale, without context” – and used the magic of nonsense to turn it into a relatively simple problem that any AI can solve: “how to match keywords to labels.”
The reason it’s a scam is that you don’t need AI to match keywords to labels. Hell, you could have done that in Microsoft Excel 20 years ago.
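To illustrate the point – keyword-to-label matching really is spreadsheet-era technology. A minimal sketch (the keyword lists here are invented for illustration, not taken from GoEmotions):

```python
# A trivial keyword-to-label matcher -- no AI required.
# These keyword lists are made up for illustration; GoEmotions' real
# taxonomy has 27 emotion categories.
KEYWORDS = {
    "joy": {"happy", "glad", "awesome"},
    "anger": {"hate", "furious", "damn"},
    "sadness": {"sad", "lonely", "cry"},
}

def label_comment(text: str) -> list[str]:
    """Return every emotion label whose keyword set overlaps the text."""
    words = set(text.lower().split())
    return [label for label, kws in KEYWORDS.items() if words & kws]

print(label_comment("I hate how happy he looks"))  # matches both "joy" and "anger"
```

Like the model described above, this matcher is confidently wrong whenever context matters: it has no idea whether a comment is sarcastic, affectionate, or about bodybuilding.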
A little deeper
You know the dataset the AI was trained on contains mislabeled data. The only way to be absolutely sure that any given result it returns is correct is to verify it yourself – you need the so-called human-in-the-loop. But what about all the results it should have returned but didn’t?
This isn’t like trying to find all the red cars in a dataset of car images. These models make decisions about people.
If the AI messes up and misses some red cars, those cars are unlikely to suffer negative outcomes. And if it accidentally labels a few blue cars as red, those blue cars will be fine.
But this particular dataset is built specifically for making decisions about human outcomes.
Per Google’s blog post:
It is a long-term goal of the research community to enable machines to understand context and emotion, which in turn would enable a variety of applications, including empathetic chatbots, models to detect harmful online behavior and enhanced customer support interactions.
Again, we can be sure that any AI model trained on this dataset will produce erroneous results. That means every time the AI makes a decision that rewards or punishes a human, it causes demonstrable harm to other humans.
If the AI’s output can be used to reward humans – for example, by surfacing every resume in a pile that displays “positive sentiment” – we have to assume that some of the resumes it didn’t surface were unfairly discriminated against.
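To make the worry about silent false negatives concrete, here is a toy simulation – every number is invented except the roughly 30% error rate Surge reported:

```python
import random

random.seed(0)  # deterministic, purely for the sake of the example

N = 10_000          # resumes screened (invented figure)
ERROR_RATE = 0.30   # per-item chance the model is wrong (Surge's estimate)

# Suppose half the pile genuinely deserves to surface
truly_positive = [True] * (N // 2) + [False] * (N // 2)

# The model flips each true answer with probability ERROR_RATE
surfaced = [flag != (random.random() < ERROR_RATE) for flag in truly_positive]

# Deserving resumes the model never showed anyone
missed = sum(1 for truth, shown in zip(truly_positive, surfaced)
             if truth and not shown)
print(f"{missed} of {N // 2} deserving resumes were silently excluded")
```

No human-in-the-loop reviewing the surfaced pile ever sees those exclusions; catching them would mean re-reading everything the model rejected, which defeats the purpose of using the model at all.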
That’s something humans-in-the-loop can’t help with. It would take a person reviewing every single file that wasn’t selected.
And if the AI’s output can be used to punish humans – for example, by removing content flagged as “hate speech” – we can be sure that sentiments which objectively don’t deserve punishment will unjustly be flagged, and people will be harmed as a result.
Worst of all, study after study shows that these systems are inherently riddled with human bias, and that minority groups are almost always disproportionately affected.
There’s only one way to deal with this kind of research: toss it in the trash.
It is our position here at Neural that it is completely unethical to train an AI on human-created content without the express individual consent of the people who created it.
Whether it’s legal or not doesn’t matter. When I post on Reddit, I do so in good faith that my words are intended for other people. Google doesn’t reimburse me for my data, so it shouldn’t use it, even if the terms of service allow it.
It is also our position that it is unethical to deploy AI models trained on data that hasn’t been verified to be error-free when the output of those models has the potential to influence human outcomes.
Google’s researchers aren’t stupid. They know that feeding a generic keyword-matching algorithm a dataset full of randomly mislabeled Reddit posts can’t turn an AI model into a human-level expert in psychology, sociology, pop culture, and semantics.
You can draw your own conclusions about their motivations.
But no amount of talent and technology can turn a bag full of nonsense into a useful AI model when human results are at stake.