Need another indicator that the generative artificial intelligence industry is real and making progress? Look at the booming business of data labeling and annotation, an essential step in training the models that power AI products ranging from what’s currently in vogue in the industry (chatbots!) to ongoing projects such as self-driving vehicles and tools that diagnose diseases.
During the data labeling step, a team of humans usually identifies data points, whether that’s the severity of the damage in 100,000 photos of different cars for an insurance company, or the sentiments of people who interact with support agents for a customer service company. Data annotation is a critical step in the training of large language models (LLMs) like OpenAI’s GPT because it makes AI models more accurate.
Since OpenAI’s release of ChatGPT last November, data annotation companies have seen so much demand that some of them are raising prices.
Realeyes is a London-based company that uses computer vision to read and understand human behavior; that data is then used to improve advertising effectiveness or to minimize identity fraud. Because Realeyes was already collecting and labeling data for its own computer vision algorithms, it decided two years ago to offer data labeling as a service to other companies, said Mihkel Jäätma, the CEO of Realeyes, which works with over 200 companies across media, technology, and advertising.
The data labeling service began generating revenue last year, with the business getting “very big, very quickly,” he said. Jäätma estimates that 80% of the business comes from companies essentially looking to make avatars less cartoonish. “It’s really kind of exploded to be a very substantial part of our business only in the last two years and keeps going that way,” he said.
Among big tech companies and well-funded AI startups, “[t]he investment that we see is that this is going to be overlaid with very human-like [features],” he said. In other words, the work now is to make these avatars (bots that exhibit personalities based on made-up characters or real people) understand users and talk in a more human way.
Since the launch of its data labeling service, Realeyes has raised prices at least twice. Jäätma said he has had to tell customers that if they weren’t willing to pay up, Realeyes would not complete the full request.
Making avatars more human-like
Labeling audio and visual recordings is complex. It’s not just data scraped from the Internet. Human annotators assess people’s emotions, for example, and as that work gets more nuanced, the annotators have to be paid more. (Realeyes was reportedly hired by Meta, which rolled out its own AI avatars in September, to make the tech giant’s avatars more human.)
Meanwhile, Snorkel AI, a company specializing in data labeling, said that the number of inquiries it received in the past three months was more than five times the total it received in the entire previous year, with requests coming from early-stage startups building large language models (LLMs), as well as government agencies and IT companies.
The Redwood City, California-based company has not raised prices, but it has rolled out additional service offerings around AI training since customers’ needs have diversified.
Data labeling is already a $2.2 billion industry
The growth in data labeling shows that generative AI applications are making progress. “With ChatGPT and other developments, the applications of AI are not out of reach,” said Devang Sachdev, vice president of marketing at Snorkel AI. The surge in AI products comes as LLMs from the likes of Google and OpenAI have also become much more accessible.
The global data collection and labeling market hit $2.2 billion in 2022 and is expected to grow at a compound annual rate of nearly 30% from 2023 to 2030, according to market research firm Grand View Research.