Large language models (LLMs) are one of the hottest innovations today. With companies like OpenAI and Microsoft working on releasing impressive new NLP systems, no one can deny the importance of having access to large amounts of quality data.
However, according to recent research by Epoch, we might soon run out of data for training AI models. The team investigated the amount of high-quality data available on the internet. (“High quality” indicates resources like Wikipedia, as opposed to low-quality data, such as social media posts.)
The analysis projects that high-quality data will be exhausted soon, likely before 2026. While the sources of low-quality data will be exhausted only decades later, the current trend of endlessly scaling models to improve results may well slow down soon.
Machine learning (ML) models are known to improve their performance as the amount of data they are trained on increases. However, simply feeding more data to a model is not always an option. This is especially true in the case of rare events or niche applications. For example, if we want to train a model to detect a rare disease, we may have very little data to work with. But we still want the models to get more accurate over time.
This suggests that if we want to keep technological development from slowing down, we need to develop other paradigms for building machine learning models that depend far less on the sheer amount of data.
In this article, we will look at what these approaches are and weigh their pros and cons.
The limitations of scaling AI models
One of the most significant challenges of scaling machine learning models is the diminishing returns of increasing model size. As a model keeps growing, each further increase tends to yield a smaller improvement in performance. This is because the more complex the model becomes, the harder it is to optimize and the more prone it is to overfitting. Moreover, larger models require more computational resources and time to train, making them less practical for real-world applications.
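To see what diminishing returns can look like in numbers, here is a minimal sketch. It assumes, purely for illustration, that a model’s loss falls as a power law of its parameter count. The function `hypothetical_loss` and the constants `a` and `alpha` are made up for this example and are not measurements of any real model.

```python
def hypothetical_loss(n_params, a=10.0, alpha=0.1):
    """Assumed power-law link between model size and loss (illustrative only)."""
    return a * n_params ** (-alpha)

# Double the model size five times, starting from one million parameters.
sizes = [1_000_000 * 2**k for k in range(6)]
losses = [hypothetical_loss(n) for n in sizes]

# Improvement bought by each doubling. Under the assumed power law it
# shrinks every time, even though each doubling adds more parameters.
gains = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]

for n, gain in zip(sizes[1:], gains):
    print(f"{n:>12,} parameters: loss improved by {gain:.4f}")
```

Under this assumption every doubling buys less than the one before it, while the cost of training the larger model keeps rising.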
Another significant limitation of scaling models is the difficulty in ensuring their robustness and generalizability. Robustness refers to a model’s ability to perform well even when faced with noisy or adversarial inputs. Generalizability refers to a model’s ability to perform well on data that it has not seen during training. As models become more complex, they can become more susceptible to adversarial attacks, making them less robust. Additionally, larger models can end up memorizing the training data rather than learning the underlying patterns, resulting in poor generalization performance.
Interpretability and explainability are essential for understanding how a model makes predictions. However, as models become more complex, their inner workings become increasingly opaque, making their decisions difficult to interpret and explain. This lack of transparency can be problematic in critical applications such as healthcare or finance, where the decision-making process must be explainable and transparent.
Alternative approaches to building machine learning models
One approach to overcoming the problem would be to rethink what we consider high-quality and low-quality data. According to Swabha Swayamdipta, a University of Southern California ML professor, creating more diversified training datasets could help overcome the limitations without reducing the quality. Moreover, according to Swayamdipta, training the model on the same data more than once could help to reduce costs and reuse the data more efficiently.
These approaches could postpone the problem, but the more times we use the same data to train our model, the more prone it becomes to overfitting. We need effective strategies to overcome the data problem in the long run. So, what are some alternatives to simply feeding more data to a model?
JEPA (Joint Embedding Predictive Architecture) is a machine learning approach proposed by Yann LeCun that differs from traditional generative methods in that it makes its predictions in an abstract representation space rather than in the space of raw inputs.
In many traditional approaches, a model is trained to reconstruct or generate its input in full detail, for example by predicting every missing pixel or the next token. In JEPA, the model encodes one part of the input (the context) and another part (the target) into embeddings, and a predictor learns to predict the target’s embedding from the context’s. Because the prediction is made on representations, the model is free to ignore unpredictable low-level detail and focus on the underlying structure of the data. The hope is that this lets models learn useful representations more efficiently.
Another approach is to use data augmentation techniques. These techniques create new training examples by modifying the existing data. For images, this can be done by flipping, rotating, cropping or adding noise. Data augmentation can reduce overfitting and improve a model’s performance.
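The image transformations mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production pipeline: the function name `augment` and the parameters `crop_size` and `noise_std` are hypothetical, and the image is assumed to be a float array with values in [0, 1] that is at least `crop_size` pixels on each side.

```python
import numpy as np

def augment(image, rng, crop_size=24, noise_std=0.05):
    """Return a randomly flipped, rotated, cropped and noised copy of `image`.

    `image` is assumed to have shape (H, W) or (H, W, C) with values in [0, 1].
    """
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                        # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotate by 0, 90, 180 or 270 degrees
    h, w = out.shape[:2]
    top = int(rng.integers(0, h - crop_size + 1))   # random crop position
    left = int(rng.integers(0, w - crop_size + 1))
    out = out[top:top + crop_size, left:left + crop_size]
    out = out + rng.normal(0.0, noise_std, size=out.shape)  # Gaussian noise
    return np.clip(out, 0.0, 1.0)                   # keep values in the valid range

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                     # stand-in for a real photo
variants = [augment(image, rng) for _ in range(8)]  # eight new samples from one image
```

Each call produces a slightly different version of the same picture, so a single labeled image can serve as several training examples.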
Finally, you can use transfer learning. This involves taking a pre-trained model and fine-tuning it for a new task. This can save time and resources, as the model has already learned valuable features from a large dataset. The pre-trained model can be fine-tuned using a small amount of data, making it a good solution when data is scarce.
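Here is a toy sketch of one common variant of this idea: pre-train a small network on a large dataset, then freeze its feature layer and train only a new output layer on a handful of examples. Everything in it is an assumption made for illustration, including the synthetic source and target tasks, the tiny network and the helper `train`. In practice you would start from a real pre-trained model rather than one trained on random data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, W1, w2, freeze_features, lr=0.5, steps=300):
    """Gradient descent on a tiny one-hidden-layer classifier (logistic loss)."""
    W1, w2 = W1.copy(), w2.copy()
    for _ in range(steps):
        H = np.tanh(X @ W1)                    # hidden features, shape (n, hidden)
        err = (sigmoid(H @ w2) - y) / len(y)   # gradient of the mean loss w.r.t. the logits
        if not freeze_features:                # update the feature layer only when allowed
            grad_H = np.outer(err, w2) * (1.0 - H**2)
            W1 -= lr * (X.T @ grad_H)
        w2 -= lr * (H.T @ err)                 # the output layer is always trained
    return W1, w2

rng = np.random.default_rng(0)
n_features, n_hidden = 10, 16

# 1) Pre-training: plenty of labeled data for a (synthetic) source task.
X_src = rng.normal(size=(5000, n_features))
y_src = (X_src[:, 0] + X_src[:, 1] > 0).astype(float)
W1_init = rng.normal(scale=0.3, size=(n_features, n_hidden))
w2_init = rng.normal(scale=0.3, size=n_hidden)
W1_pre, _ = train(X_src, y_src, W1_init, w2_init, freeze_features=False)

# 2) Fine-tuning: only 40 labeled examples for a related (synthetic) target task.
X_tgt = rng.normal(size=(40, n_features))
y_tgt = (X_tgt[:, 0] - X_tgt[:, 2] > 0).astype(float)
new_head = np.zeros(n_hidden)                  # fresh output layer for the new task
W1_ft, w2_ft = train(X_tgt, y_tgt, W1_pre, new_head, freeze_features=True)

probs = sigmoid(np.tanh(X_tgt @ W1_ft) @ w2_ft)
train_acc = float(np.mean((probs > 0.5) == (y_tgt == 1.0)))
print(f"target-task training accuracy: {train_acc:.2f}")
```

Freezing the feature layer means only 16 weights are learned from the 40 target examples, which is what makes the approach workable when data is scarce. Unfreezing some or all of the pre-trained layers is the other common choice when more target data is available.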
Today we can still use data augmentation and transfer learning, but these methods don’t solve the problem once and for all. That is why we need to keep looking for methods that could help us overcome the issue in the future. We don’t know yet exactly what the solution might be. After all, for a human, it’s enough to observe just a couple of examples to learn something new. Maybe one day, we’ll invent AI that will be able to do that too.
What is your opinion? What would your company do if it ran out of data to train its models?
Ivan Smetannikov is data science team lead at Serokell.