Making it up carefully

Banks, governments and researchers are using synthetic data to protect privacy and save lives. But combining anonymity with accuracy remains a challenge.

In August 2024, the United States National Science Foundation gave Dr. Wei Zhai USD 75,000 to meticulously photograph, document and track the interiors of people’s houses.

He and his team of researchers, mostly his colleagues at the University of Texas at San Antonio, planned to gather personal data about the interiors and exteriors of private residences in San Antonio’s Westside neighborhood and then build digital twins, or hyper-realistic simulations, of these homes.

Zhai is researching extreme heat in Westside, the hottest neighborhood in one of the hottest cities in America. With those digital twins, he and his colleagues hope to gather enough information about how houses trap heat to develop newer and more cost-effective methods of cooling. This, Zhai hopes, will help save lives in Texas, where at least 334 people and possibly hundreds more died from heat in 2023.

When Zhai pitched the project to Westside residents in 2025, they were understandably wary of the significant intrusion on their privacy this would entail. Because there simply isn’t much data on how homes are constructed and used in under-resourced communities like Westside, Zhai’s team needed to collect data on the homes’ interior layouts and properties. That required constant camera monitoring; temperature data, which called for the team’s own finely tuned sensors; and information about the construction of the homes themselves, which in turn required highly sophisticated LIDAR sensors to produce an accurate 3D model of the buildings from the foundation up.

Gathering the volume of real-world data necessary to build a realistic heat model of an entire neighborhood would be a challenge in any setting. It was especially daunting for a team that needed to collect extensive detail about the most private areas of people’s homes, armed only with a USD 75,000 grant and a homespun PR campaign. Their problem, then, was twofold: convincing an understandably reluctant population to fork over private information, and then figuring out how to build a usable, highly detailed model based on what data they could glean.

To solve both problems, Zhai and his colleagues turned to an AI-powered solution: synthetic data, or artificial sets of training data that statistically replicate real-world sets that are too sensitive, or too meager, to use in live AI tools.

“We already have at least 20 homes where we’ve installed sensors, but the community is still kind of a data desert,” Zhai told IBM Think. “That’s why we’re using synthetic data to simulate data for other homes where they might not have the resources to install the sensors.”

Zhai has become an evangelist of sorts for the use of synthetic data for research in low-income communities, writing in a November 2025 essay about how it can ensure not just accuracy, but privacy for neighborhood residents. He also co-wrote an October 2025 article for the Journal of Planning Education and Research about how synthetic data can address “key challenges of privacy, reproducibility and technical feasibility.”
