Synthetic data is the secure, low-cost alternative to real data we need
Content provided by IBM and TNW.
Babies learn to talk by hearing other people — usually their parents — making sounds repeatedly. Slowly, through repetition and discovering patterns, babies begin to connect those sounds with meaning. With a lot of practice, they eventually manage to produce similar sounds that people around them can understand.
Machine learning algorithms work much the same way, but instead of having a few parents to copy from, they use data painstakingly categorized by thousands of people who must manually review the data and tell the machine what it means.
However, this tedious and time-consuming process isn’t the only problem with real-world data used to train machine learning algorithms.
Take fraud detection in insurance claims. If an algorithm is to accurately distinguish fraud from legitimate claims, it needs to see both. Thousands and thousands of both. And because AI systems are often provided by third parties, not by the insurance company itself, those third parties must have access to all that sensitive data. You can see where this is going, because the same applies to health records and financial data.
More esoteric but equally worrisome are all the algorithms trained on text, images and videos. Apart from questions about copyright, a lot of creators have voiced their disagreement with their work being sucked into a dataset to train a machine that could eventually take over (part of) their job. And that’s assuming their creations aren’t racist or otherwise problematic, which in turn could lead to problematic outcomes.
And what if there is simply not enough data available to train an AI on all eventualities? In a 2016 RAND Corporation report, the authors calculated how long “a fleet of 100 autonomous vehicles traveling 24 hours a day, 365 days a year at an average speed of 25 miles per hour” would need to drive to show that their failure rate (resulting in fatalities or injuries) was reliably lower than that of humans. Their answer? 500 years and 11 billion miles.
You don’t have to be a super-brain genius to see that the current process isn’t ideal. So, what can we do? How can we create sufficient, privacy-respecting, non-problematic, event-covering, accurately labeled data? You guessed it: more AI.
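The two RAND figures fit together with simple arithmetic. The short sketch below only uses the numbers quoted above (fleet size, hours, speed and the 11 billion miles); the function names are made up for illustration and are not taken from the report.

```python
# Figures as quoted from the RAND report in the text above.
FLEET_SIZE = 100
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365
AVERAGE_SPEED_MPH = 25
REQUIRED_MILES = 11_000_000_000


def fleet_miles_per_year() -> int:
    """Miles the whole fleet covers in one year of non-stop driving."""
    return FLEET_SIZE * HOURS_PER_DAY * DAYS_PER_YEAR * AVERAGE_SPEED_MPH


def years_to_cover(miles: int) -> float:
    """Years the fleet needs to accumulate the given number of miles."""
    return miles / fleet_miles_per_year()


miles_per_year = fleet_miles_per_year()
years_needed = years_to_cover(REQUIRED_MILES)
print(f"{miles_per_year:,} miles per year, about {years_needed:.0f} years in total")
```

The fleet covers 21.9 million miles a year, so 11 billion miles takes a little over 500 years, which matches the report's rounded answer.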
Fake data could help AIs deal with real data
Even before the RAND report, it was perfectly clear to companies working on autonomous driving that they were woefully under-equipped to collect enough data to reliably train algorithms to drive safely in any condition or circumstance.
Take Waymo, Alphabet’s autonomous driving company. Instead of relying solely on its real vehicles, it created a fully simulated world, where simulated cars with simulated sensors could drive around endlessly, collecting real data in their simulated way. According to the company, by 2020 it had collected data on 15 billion miles of simulated driving, compared to a measly 20 million miles in the real world.
In AI parlance, this is called synthetic data, or “data applicable to a particular situation that is not obtained by direct measurement,” if you want to get technical. Or less technically, AIs produce fake data so that other AIs can learn about the real world at a faster pace.
One example is Task2Sim, an AI model built by the MIT-IBM Watson AI Lab that creates synthetic data for training classifiers. Rather than teaching the classifier to recognize one object at a time, the model creates images that can be used to teach multiple tasks. The scalability of this type of model makes data collection less time-consuming and cheaper for companies that need a lot of data.
As Rogerio Feris, an IBM researcher who co-authored the paper on Task2Sim, said:
The beauty of synthetic images is that you can control their parameters: the background, lighting, and the way objects are posed.
Thanks to all the concerns mentioned above, the production of synthetic data of all kinds has exploded in recent years, with dozens of startups in the field thriving and raising hundreds of millions of dollars in investment.
The synthetic data generated ranges from “human data,” such as health or financial records and synthesized images of a wide variety of human faces, to more abstract data sets such as genomic data, which mimics the structure of DNA.
How to make real fake data
There are several ways to generate synthetic data, the most common and well-established of which are generative adversarial networks, or GANs.
In a GAN, two AIs are pitted against each other. One AI produces a synthetic dataset, while the other tries to determine whether the generated data is real. The latter’s feedback loops back into the former, “training” it to become more accurate at producing convincing fake data. You have probably seen one of the many this-X-does-not-exist websites, ranging from people to cats to buildings, that generate their images with GANs.
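To make that tug-of-war concrete, here is a deliberately tiny sketch in Python with NumPy. Everything in it is an assumption chosen for illustration: the "real" data is just numbers drawn around 4, the generator is a straight line, and the discriminator is a single logistic unit. Production GANs use deep networks and a framework that computes the gradients automatically.

```python
import numpy as np


def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))


def train_toy_gan(steps=2000, batch=64, lr=0.05, seed=0):
    """Toy 1-D GAN with hand-written gradients (illustrative only)."""
    rng = np.random.default_rng(seed)
    a, b = 1.0, 0.0  # generator: x_fake = a * z + b
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        x_real = rng.normal(4.0, 1.0, size=batch)  # stand-in for real data
        z = rng.normal(size=batch)                 # random input to the generator
        x_fake = a * z + b

        # Discriminator step: push D(x_real) toward 1 and D(x_fake) toward 0.
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        grad_w = np.mean(-(1.0 - d_real) * x_real) + np.mean(d_fake * x_fake)
        grad_c = np.mean(-(1.0 - d_real)) + np.mean(d_fake)
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: use the discriminator's feedback to make
        # D(x_fake) larger, i.e. to look more "real".
        d_fake = sigmoid(w * x_fake + c)
        grad_a = np.mean(-(1.0 - d_fake) * w * z)
        grad_b = np.mean(-(1.0 - d_fake) * w)
        a -= lr * grad_a
        b -= lr * grad_b
    return (a, b), (w, c)


(gen_scale, gen_shift), (disc_weight, disc_bias) = train_toy_gan()
print("generator shift after training:", gen_shift)
```

Because this discriminator is so simple, it can only notice whether the fakes sit too low or too high, so the generator mostly learns to shift its output toward the real values. Training of this kind tends to oscillate rather than settle, which is one reason real GANs are known to be tricky to tune.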
Recently, more methods of producing synthetic data have been gaining ground. The best known are diffusion models, in which AIs are trained to reconstruct certain types of data while more and more noise (random values that gradually corrupt the training data) is added to the real-world data. Eventually, the AI can be fed pure random data, which it works back into the format it was originally trained on.
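The "adding more and more noise" half of that process is simple enough to sketch. The snippet below assumes the linear noise schedule commonly used in the diffusion literature (1,000 steps, noise levels rising from 0.0001 to 0.02); those values and the function names are illustrative assumptions, not something the article specifies. The denoising network, which is the part that actually gets trained, is left out.

```python
import numpy as np


def make_noise_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Return, for each step, the fraction of the original signal that survives."""
    betas = np.linspace(beta_start, beta_end, num_steps)  # noise added per step
    alpha_bars = np.cumprod(1.0 - betas)                   # cumulative signal kept
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: blend the clean sample with Gaussian noise."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise


rng = np.random.default_rng(0)
alpha_bars = make_noise_schedule()
x0 = np.ones(8)  # stand-in for one real training example

x_early, _ = add_noise(x0, 0, alpha_bars, rng)                    # almost clean
x_late, _ = add_noise(x0, len(alpha_bars) - 1, alpha_bars, rng)   # almost pure noise
print("signal kept at first step:", alpha_bars[0])
print("signal kept at last step: ", alpha_bars[-1])
```

During training, the model sees pairs like `x_t` and `noise` and learns to predict the noise that was added. Generation then runs the process in reverse, starting from pure noise and removing a little of it at each step.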
Fake data is like real data without, well, the authenticity
Synthetic data, however produced, offers some very concrete advantages over real-world data. First of all, it is easier to collect a lot more of it, because you are not dependent on people to create it. Second, synthetic data comes perfectly labeled, so you don’t have to rely on labor-intensive manual labeling, which is sometimes wrong. Third, it can protect privacy and copyright because the data is, well, synthetic. And finally, and perhaps most importantly, it can reduce biased outcomes.
With AI playing an increasingly important role in technology and society, expectations around synthetic data are quite optimistic. Gartner famously estimated that 60% of training data will be synthetic by 2024. Market analyst Cognilytica valued the synthetic data generation market at $110 million in 2021, growing to $1.15 billion by 2027.
Data has been called the most valuable asset in the digital age. Big tech has been sitting on mountains of user data that gave it an edge over smaller contenders in the AI space. Synthetic data can give smaller players the ability to turn the tables.
As you might suspect, the big question regarding synthetic data is its so-called fidelity, or how closely it matches real-world data. The jury is still out, but research seems to show that combining synthetic data with real data produces statistically sound results. This year, researchers from MIT and the MIT-IBM Watson AI Lab showed that an image classifier pre-trained on synthetic data combined with real data performed as well as an image classifier trained exclusively on real data.
All in all, both synthetic and real traffic lights seem to be green for synthetic data to dominate the training of privacy-friendly and safer AI models in the near future. And with that, a possible future of smarter AIs is just over the horizon.