Synthetic Data in Qualitative Research: Promise, Peril, and Practice

Synthetic data — information generated by AI models rather than collected from human participants — is no longer a speculative technology. Research teams are already using large language models to simulate interview responses, generate synthetic personas, and pilot-test research instruments before deploying them with real participants. Reports suggest that a majority of research teams experimenting with synthetic respondents find the outputs useful for at least some purposes.

This raises difficult questions for qualitative researchers. Can an AI model produce responses that meaningfully represent human experience? Where does synthetic data help, and where does it fundamentally undermine what qualitative research is supposed to do? This post works through the practical applications, the genuine risks, and the methodological questions you need to answer before incorporating synthetic data into your work.

What Synthetic Data Actually Is

In the qualitative context, synthetic data refers to text generated by AI models that is designed to resemble data you might collect from human participants. This can take several forms:

Synthetic interview responses. A researcher prompts an AI model to respond as if it were a specific type of participant — a first-generation college student, a rural primary care physician, a mid-career software engineer — and uses the generated text as data.

Synthetic personas. Rather than generating responses to specific questions, the model produces detailed biographical profiles that represent a particular demographic or psychographic segment. These personas inform research design, marketing strategy, or product development.

Synthetic pilot data. A researcher generates artificial responses to their interview protocol before recruiting real participants, using the synthetic data to identify weak questions, test their coding framework, or train research assistants.

The quality of synthetic data depends heavily on the model, the prompt, and the data the model was trained on. A model trained on a large corpus of actual interview transcripts from a specific population will generally produce more realistic output than a general-purpose model prompted with a brief demographic description.

Where Synthetic Data Can Help

Piloting and Instrument Development

This is the least controversial and most immediately practical application. Before recruiting real participants, you can use synthetic responses to stress-test your interview protocol. Do your questions elicit the kind of data you need? Are any questions ambiguous or leading? Does the flow of your protocol make sense?

Using synthetic pilot data does not replace piloting with real participants, but it can help you arrive at the pilot stage with a stronger instrument. It is essentially a more sophisticated version of asking a colleague to role-play a participant — something researchers have done informally for decades.
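
A minimal sketch of how that preparation step might be organized, assuming you keep your protocol questions and participant descriptions as plain lists. The personas, the questions, and the build_pilot_prompts helper are all hypothetical, and the sketch deliberately stops short of calling any model: it only assembles the prompts you would send and labels each one as synthetic from the start.

```python
from itertools import product

# Hypothetical participant descriptions and protocol questions, for illustration only.
PERSONAS = [
    "a first-generation college student",
    "a rural primary care physician",
]
QUESTIONS = [
    "Can you describe a typical week in your role?",
    "What has been the hardest part of the past year?",
]


def build_pilot_prompts(personas, questions):
    """Return one prompt record per (persona, question) pair."""
    prompts = []
    for persona, question in product(personas, questions):
        prompts.append({
            "persona": persona,
            "question": question,
            "prompt": (
                f"Respond as {persona} taking part in a research interview. "
                f"Answer in the first person.\n\nInterviewer: {question}"
            ),
            # Label at creation so pilot material is never mistaken for human data.
            "synthetic": True,
        })
    return prompts


pilot_prompts = build_pilot_prompts(PERSONAS, QUESTIONS)
print(len(pilot_prompts), "prompts ready for piloting")
```

Keeping the prompt text alongside the persona and question also means you already have the record you will need when you describe the piloting step in your methods section.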

Training and Education

Synthetic data offers a valuable training tool for new qualitative researchers. Students can practice coding and thematic analysis on synthetic transcripts before working with real participant data. This avoids the ethical complications of using actual human data for training exercises and allows instructors to create datasets that illustrate specific analytical challenges — ambiguous passages, contradictory statements, emotionally charged content.

Supplementing Small Samples

Qualitative research frequently works with small samples, and for good reason. But small samples create vulnerability. If one of your eight participants drops out, you lose an eighth of your data, or 12.5 percent. Some researchers are experimenting with generating synthetic responses to supplement small datasets, not to replace missing participants but to explore whether the patterns they have identified hold under different conditions.

This application is far more contentious than piloting or training. The synthetic responses are only as good as the model's understanding of the population, which is inevitably shaped by whatever biases exist in its training data.

Exploring Sensitive Topics

For research on stigmatized or traumatic topics, synthetic data can serve a preliminary function. Researchers can use synthetic responses to explore the landscape of a sensitive topic, identify potential themes, and develop more informed interview protocols — before asking real people to share painful experiences. This does not replace human data, but it can help researchers approach sensitive fieldwork with greater preparation.

Where Synthetic Data Falls Apart

It Cannot Capture Lived Experience

This is the fundamental limitation, and no amount of model improvement will resolve it. Qualitative research is built on the premise that human experience is irreducible — that understanding what it means to live with chronic pain, to navigate institutional racism, or to lose a child requires actually talking to people who have lived those experiences. An AI model can produce text that resembles what a grieving parent might say. It cannot grieve.

Phenomenological research, in particular, is incompatible with synthetic data at the data collection stage. The entire methodological framework depends on accessing the first-person lived experience of the participant. A synthetic response is, by definition, not a lived experience.

It Reproduces and Amplifies Biases

Large language models are trained on existing text, which means they reproduce the patterns, assumptions, and biases present in that text. When a model generates a synthetic response from a "low-income single mother," it draws on whatever representations of low-income single mothers exist in its training data — representations that are often stereotyped, reductive, or shaped by dominant cultural narratives.

If you use synthetic data without critically examining these biases, you risk producing research that confirms cultural stereotypes rather than challenging them. This is the opposite of what good qualitative research is supposed to do.

It Lacks the Unexpected

One of the most valuable aspects of qualitative data is surprise. Participants say things you did not anticipate. They reframe your questions. They introduce concepts that reshape your understanding of the phenomenon. This is how grounded theory works — the theory emerges from data that challenges your preconceptions.

Synthetic data tends to produce expected, coherent, well-structured responses. It rarely surprises because it is drawing on patterns that already exist. The messy, contradictory, revelatory quality of real human speech — the "ums" and tangents and sudden moments of insight — is precisely what makes qualitative data valuable, and it is precisely what synthetic data lacks.

It Creates Verification Problems

How do you member-check synthetic data? How do you establish trustworthiness when your "participants" do not exist? The quality criteria for qualitative research — credibility, transferability, dependability, confirmability — all assume that the data originated from real human beings who can be consulted, quoted, and verified. Synthetic data undermines these foundations in ways that no methodological workaround can fully address.

Ethical Considerations

Transparency and Disclosure

If you use synthetic data in any part of your research process, you must disclose it. This applies even if you use synthetic data only for piloting or training. Your methods section should clearly state what role synthetic data played, how it was generated, and how it was distinguished from human-generated data.

Failing to disclose the use of synthetic data is a form of research misconduct. It misrepresents the nature of your evidence and undermines the reader's ability to evaluate your findings.
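
One way to make disclosure routine is to write a small provenance record every time you generate synthetic material. The sketch below is illustrative only: the SyntheticDataRecord class, its field names, and the to_methods_appendix helper are assumptions rather than an established reporting standard, and the angle-bracket values are placeholders you would fill in yourself.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SyntheticDataRecord:
    """Hypothetical provenance record for one batch of synthetic data."""
    role: str            # what the synthetic data was used for
    model: str           # model name and version used to generate it
    prompt: str          # the full prompt text
    generated_on: str    # date of generation
    handling_note: str   # how it was kept apart from human data
    synthetic: bool = True


def to_methods_appendix(records):
    """Serialize records as JSON suitable for a methods appendix."""
    return json.dumps([asdict(r) for r in records], indent=2)


record = SyntheticDataRecord(
    role="pilot testing of the interview protocol",
    model="<model name and version>",
    prompt="<full prompt text>",
    generated_on="<date>",
    handling_note="Stored separately and excluded from analysis of participant data.",
)

print(to_methods_appendix([record]))
```

The fields mirror what the methods section needs to state: the role the synthetic data played, how it was generated, and how it was distinguished from human-generated data.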

Consent and Representation

Synthetic data raises questions about representation without consent. When a model generates a response "as" a specific type of person, it is producing a representation of that group without the knowledge or consent of anyone in that group. This is particularly concerning for marginalized populations, whose voices are already frequently spoken for rather than spoken by.

Institutional Review

Most institutional review boards (IRBs) have not yet developed clear policies on synthetic data. If you plan to use synthetic data in your research, consult your IRB early. Some boards may determine that synthetic data studies do not require full review (since no human subjects are involved), while others may raise concerns about indirect representation and potential harms.

A Practical Framework

If you are considering synthetic data for your qualitative project, ask yourself these questions:

What role will synthetic data play? Using it for piloting is methodologically defensible. Using it as primary data in a phenomenological study is not. Be clear about the function.

Can I verify the outputs? If you have no way to check whether the synthetic responses are realistic — because you have no real data to compare them against — you are building on an unverifiable foundation.

Am I replacing voices or supplementing understanding? Synthetic data should never substitute for the perspectives of real people, especially those from marginalized groups. If your study claims to represent the experiences of a particular population, those experiences must come from members of that population.

Have I disclosed everything? Transparency is not optional. Document how you generated synthetic data, what prompts you used, what model you used, and how synthetic data was handled differently from human data in your analysis.

What would my participants think? If the people your synthetic data is meant to represent would object to being represented by an AI model, that objection matters and should inform your decision.
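
For the verification question above, a first-pass check is possible once you hold even a little real data. The sketch below compares the vocabulary of real and synthetic responses. It is a deliberately crude lexical comparison, not a measure of validity, and the function names are hypothetical. Its main use is to surface terms your real participants use that the synthetic responses never produce, which you would then read in context.

```python
import re


def vocabulary(texts):
    """Collect the set of lowercase word tokens across a list of texts."""
    words = set()
    for text in texts:
        words.update(re.findall(r"[a-z']+", text.lower()))
    return words


def jaccard_overlap(real_texts, synthetic_texts):
    """Share of the combined vocabulary that appears in both sets of texts."""
    real, synth = vocabulary(real_texts), vocabulary(synthetic_texts)
    union = real | synth
    return len(real & synth) / len(union) if union else 0.0


def only_in_real(real_texts, synthetic_texts):
    """Words participants used that never appear in the synthetic responses."""
    return sorted(vocabulary(real_texts) - vocabulary(synthetic_texts))


# Made-up snippets for illustration; not drawn from any study.
real = ["Honestly I just, um, stopped going to the clinic."]
synthetic = ["I stopped going to the clinic because of the cost."]

print(round(jaccard_overlap(real, synthetic), 2))
print(only_in_real(real, synthetic))
```

A high overlap score does not show that the synthetic responses are realistic. A low one, or a long list of real-only terms, is a reasonable prompt to look more closely before relying on the synthetic material for anything.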

Where This Is Heading

Synthetic data is not going away. The models will improve, the outputs will become more realistic, and the temptation to use them as a shortcut will intensify — especially as research timelines compress and budgets shrink. The qualitative research community needs to develop clear norms now, before the technology outpaces the ethics.

The most defensible position is that synthetic data is a tool for preparation, not a substitute for human encounter. Use it to sharpen your instruments, train your team, and explore the landscape. Then do the real work of sitting across from another person, asking them about their life, and listening carefully to what they say. That is something no model can do for you.
