The pitch for synthetic consumers is seductive. Skip recruitment. Skip screening. Skip the logistical complexity of real panels. Just generate a thousand AI personas that match your target demographic, run your concept through them, and get results in seconds. The approach is faster, cheaper, and always available. The problem is that synthetic personas cannot tell you what real consumers think, and making decisions as if they could is dangerous.

The synthetic data seduction

There are legitimate use cases for synthetic data in research: privacy preservation, sample augmentation, and stress-testing data collection instruments. These are technical applications where the data doesn't need to represent real opinion; it just needs to be statistically plausible. Using synthetic data to stand in for actual consumer opinion is a different claim entirely, and it is one that current AI systems cannot support.

A language model generating synthetic consumer responses is doing pattern matching across training data. It knows what survey respondents typically say about products in a given category. It knows the vocabulary of consumer opinion, the shape of preference distributions, the rhetorical moves of focus group participants. What it does not know is what your specific target consumer actually thinks about your specific product concept right now.

Why simulation can't replace truth

Consumer opinion is contextual, situated, and volatile. It changes with news cycles, competitor launches, personal circumstances, and social pressures that no training dataset can fully capture in real time. When a respondent tells you they're concerned about a product's environmental impact, that concern is shaped by something they read last week, a conversation they had at work, and a purchasing decision they made yesterday. Synthetic data reflects none of that. It reflects the average of what people have historically said about similar concerns.

This is not a rounding error. Brand tracking research built on synthetic data will miss inflection points. Product concept testing run on AI personas will over-index on consensus and miss the edge cases that drive market differentiation. Pricing research against synthetic consumers will miss the real psychological thresholds that only emerge when a real person with real financial constraints thinks about actually spending money.

"A language model doesn't know what your specific consumer thinks right now. It knows what consumers have typically said. That distinction is the difference between insight and noise."

The feedback loop problem

There is a compounding risk that has not been widely discussed yet. As brands begin using AI-generated synthetic data to train their own internal AI systems — product recommendation engines, personalisation algorithms, brand health dashboards — they create a feedback loop. The AI learns from fake consumer opinion. It makes decisions based on that learning. Those decisions shape real consumer experiences. And eventually, when real consumer behaviour diverges from what the models predicted, nobody can trace it back to the original data problem.

This is the long tail of the synthetic data risk. The immediate harm is a bad product decision. The structural harm is an AI stack built on fabricated foundations.
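
The loop described above can be made concrete with a toy simulation. This is a minimal sketch, not a model of any real system: the function name, the numbers, and the assumption that synthetic respondents simply echo the model's current belief are all hypothetical, chosen only to show the mechanism.

```python
def track_preference(real_signal, synthetic_share, learning_rate=0.5):
    """Update a preference estimate each period from blended data.

    real_signal: per-period share of real consumers who favour the product
                 (toy numbers, not measured data).
    synthetic_share: fraction of each period's input that is synthetic.
    """
    estimate = real_signal[0]
    history = []
    for real in real_signal:
        # Assumption: synthetic respondents echo the model's current belief.
        synthetic = estimate
        observed = synthetic_share * synthetic + (1 - synthetic_share) * real
        # Move the estimate part of the way toward what was observed.
        estimate += learning_rate * (observed - estimate)
        history.append(estimate)
    return history


# Hypothetical scenario: real preference drops from 0.6 to 0.2 after period 3.
real_signal = [0.6] * 3 + [0.2] * 7

fed_by_real = track_preference(real_signal, synthetic_share=0.0)
fed_by_synthetic = track_preference(real_signal, synthetic_share=1.0)

print("real-data estimate:     ", round(fed_by_real[-1], 3))
print("synthetic-data estimate:", round(fed_by_synthetic[-1], 3))
```

Under these assumptions, the estimate fed by real responses follows the drop, while the estimate fed entirely by its own synthetic output never moves from its starting value. Nothing in the synthetic data signals that anything has changed, which is why the divergence is so hard to trace once it surfaces.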

What authentic data costs — less than you think

The reason synthetic data is attractive is cost. Real panels are expensive, slow, and administratively complex. But the premise that synthetic data is the only affordable alternative is false. AI-native research platforms that use voice screening to verify human respondents at the point of recruitment are now competitive on cost with traditional panels — and significantly faster than legacy research operations. Authentic consumer insight is no longer a premium product. It is table stakes for brands serious about making real decisions.