Augmented Data Science: Hypothesis search agent


tl;dr (never ai;dr)

Generating testable hypotheses has mostly relied on the data scientist's experience and a literature review (in line with guidance from the business team). An LLM agent skill can structure and expand that search.
We built a three-step skill (context gathering, causal vs. predictive framing, and evidence-backed hypothesis table) and tested it on two retail business questions using Claude Opus 4.6 and GPT 5.4.
Framing the problem and providing the agent with available variables did most of the work: with minimal context, the causal/predictive distinction produced useful, literature-backed hypotheses.
86% of references checked out; directional claims were reliable, but effect sizes were not. Consistent with prior work, we found that LLMs exaggerate the findings in existing research.
This confirms our earlier conclusion: data science teams need to establish intent and validate the output. The skill expands the search space, but applying domain expertise in the validation step remains essential.


Note: This joint post combines the Academic’s Take and the Director’s Cut.

In our previous post, we proposed the Augmented Data Science framework and highlighted search-space expansion as a strength of LLMs in data science. In this post, we develop and test an agent skill (which you can download below) that helps data scientists generate useful hypotheses for business questions. A hypothesis here is a testable claim about the relationship between a feature and an outcome of interest.

Say our business problem is customer churn. In the past, finding plausible, testable relationships relied on the knowledge, experience, and intuition of the data science and business teams. LLM agents can now expand that search; however, when asked directly, they generate generic, context-free hypotheses. As we found in our literature review, LLMs produce more useful hypotheses once intent is established by the data science team. So what's better than an agent skill to make the search focused and repeatable?

The hypothesis search agent

The hypothesis search agent skill (click here to get the skill file to use in Claude, Codex, or other LLMs) instructs an LLM agent on how to guide the data scientist through a structured search for evidence-backed hypotheses. It first asks about the problem's context and methodological framing, then produces hypotheses grounded in literature.

The agent works in three major steps:

Step 1 – Context gathering: The agent asks for the specific outcome of interest, business environment, available data sources, and any prior literature or reports that the data scientist already knows. This context makes the search meaningful – without it, the LLM is broadly guessing.

Step 2 – Framing: The agent asks the data scientist to choose between a causal and a predictive framing of the problem and explains how the search differs under each. This step sets the boundary of the search by fixing the target: a causal effect or predictive performance.

Step 3 – Hypotheses and evidence table: Based on the context and framing, the agent proceeds to produce candidate hypotheses and presents them in a structured table:

Hypothesis: A testable claim about the relationship between a feature and the outcome
Feature(s): The variable(s) involved
Direction: The expected direction of the effect (when causal)
Effect Size / Importance: Prior estimate from the literature when available
Framing: Whether the relationship is causal or predictive
Source: Format customized to the type of source
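To make the table structure concrete, here is a minimal sketch of one row as a Python dataclass. The column names follow the list above; the class name, the helper function, and the example row are hypothetical illustrations and are not taken from the skill file or from our results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HypothesisRow:
    """One row of the hypotheses-and-evidence table (illustrative only)."""
    hypothesis: str                     # testable claim about feature -> outcome
    features: List[str]                 # variable(s) involved
    framing: str                        # e.g., "causal" or "predictive"
    source: str                         # citation, formatted by source type
    direction: Optional[str] = None     # expected sign, mainly for causal rows
    effect_size: Optional[str] = None   # prior estimate, if the literature has one


def missing_fields(row: HypothesisRow) -> List[str]:
    """Return the optional columns the agent left empty, to flag for review."""
    optional = {"direction": row.direction, "effect_size": row.effect_size}
    return [name for name, value in optional.items() if not value]


# Hypothetical example row; the claim and the citation are placeholders.
example = HypothesisRow(
    hypothesis="Stockouts of the add-on item lower the attachment rate",
    features=["inventory"],
    framing="causal",
    source="(placeholder citation)",
    direction="negative",
)

print(missing_fields(example))
```

Keeping the effect size optional mirrors the table definition, where a prior estimate is included only when the literature provides one.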

Testing the agent

We tested the agent on two problems using both Claude Opus 4.6 and OpenAI GPT 5.4. Each test was a single run: we started a fresh session, loaded only the skill file, and walked through the three steps once per model per question. This reflects what a data scientist would get on a first pass – no iteration, no refinement. The evaluation is subjective; we assessed the relevance, specificity, and actionability of the hypotheses generated based on our methodological and domain expertise.

We selected the following questions because they span two common problem types (diagnosing a change vs. predicting behavior) and because we can validate the results using our own experience in these domains:

Question 1: Why is the attachment rate for cross-sell items lower than it was last quarter?

Question 2: What are the primary predictors of our customers' next purchase?

To establish a baseline, we answered every context-gathering question with 'None' except for the type of available data. For Q1, we provided "transactions, pricing, promotions, demographics, product catalog, inventory, and store traffic". For Q2, we provided "clickstream, transactions, customer demographics and loyalty, pricing, promotions, recommended products".

The skill includes a validation step; for this test, we verified the results using Claude Cowork (a separate session with Opus 4.6). Using an LLM to validate LLM-generated output introduces circularity. We accepted this limitation for the scope of this test, but in practice, validation should be done by the data science team.

Summary of the results

For Q1 (why the cross-sell attachment rate dropped), Claude Opus 4.6 generated 10 hypotheses grounded in marketing science literature. Each hypothesis names a specific identification strategy (panel data with store fixed effects, structural demand model, randomized field experiments) and flags endogeneity or selection bias in the caveats column. GPT 5.4 produced 10 hypotheses for the same question, drawing from a wider range of journals (including operations research, information systems, and consumer psychology outlets beyond the core marketing journals that Claude concentrated on).

Links to the results tables: Table Q1-C for Opus 4.6 — Table Q1-G for GPT 5.4

For Q2 (predicting a customer's next purchase), Claude Opus 4.6 generated 16 hypotheses spanning purchase prediction, recommendation systems, and customer lifetime value. GPT 5.4 generated eight hypotheses focusing on temporal dynamics (purchase regularity, preference drift) and session-level signals, naming specific engineered features (e.g., mean gap, std gap, cv gap, clumpiness index). Claude sourced from literature directly related to the domain; GPT 5.4 also drew from adjacent areas.

Links to the results tables: Table Q2-C for Opus 4.6 — Table Q2-G for GPT 5.4
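For readers unfamiliar with the gap features named in the Q2 results, the sketch below shows one plausible way to compute them from a customer's purchase dates. The definitions are our assumptions for illustration (gaps in days, sample standard deviation, cv as std divided by mean), and the function name and example dates are hypothetical; the clumpiness index is left out because its definition varies across papers.

```python
from datetime import date
from statistics import mean, stdev


def gap_features(purchase_dates):
    """Inter-purchase gap features for one customer (assumed definitions)."""
    if len(purchase_dates) < 3:
        # Two gaps are the minimum needed for a sample standard deviation.
        raise ValueError("need at least three purchases")
    ordered = sorted(purchase_dates)
    # Days elapsed between consecutive purchases.
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    mean_gap = mean(gaps)
    std_gap = stdev(gaps)
    # Coefficient of variation: dispersion of gaps relative to their mean.
    cv_gap = std_gap / mean_gap if mean_gap > 0 else float("nan")
    return {"mean_gap": mean_gap, "std_gap": std_gap, "cv_gap": cv_gap}


# Hypothetical purchase history for a single customer.
history = [date(2025, 1, 1), date(2025, 1, 8), date(2025, 1, 15), date(2025, 1, 29)]
features = gap_features(history)
print(features)
```

A low cv gap indicates regular purchasing and a high value indicates irregular purchasing, which is why it is a natural candidate feature for next-purchase prediction.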

Takeaways

The framing step, together with the list of available variables, did most of the work. Asking the right question (seeking causality or predictive improvement) defines the bounds of the search. With only the list of available variables as context, the causal/predictive distinction produced meaningfully different outputs. GPT's hybrid framing labels in the results ("Predictive - relevant to causal framing as a mechanism candidate") show that the framing step surfaced genuine ambiguity rather than just sorting the LLM's search into bins.
Minimal context sets the floor. We answered every context question with "None" (no prior models, no known confounders) and only provided a set of variable names. Richer context (e.g., business constraints, known confounders) would likely improve specificity and reduce the need for post-hoc validation.
Effect sizes are seldom reported and often incorrect. Although populating effect sizes is a stated goal of the agent, neither model did so reliably. Part of the reason is that full-text research sits behind paywalls and abstracts do not report effect sizes. When effect sizes are reported, they may be exaggerated. For example, the effect reported in Table Q1-C (Claude), "Corsten & Gruen — 40–50% lost," overstates the actual finding. We also noticed other incorrect reports of effect sizes (e.g., Montgomery et al., Zhang & Breugelmans in Q2-C (Claude)).
Different models, different strengths. Identical instructions produced meaningfully different outputs. Claude excelled at methodological specificity: identification strategies, explicit causal caveats, concentrated peer-reviewed sources. GPT excelled at feature operationalization and source diversity. The bibliographies are almost completely non-overlapping. Running both is not redundant; combining multiple models for the same search may provide complementarity benefits, which we plan to test next.
Citations are reliable but not perfect. Using Claude Cowork (Opus 4.6), we checked the 44 hypothesis-citation pairs that were reported: 38 (86%) are valid. No papers were fabricated, and most hypotheses align with the cited sources. We did not run an unstructured baseline, so we cannot quantify the improvement, but the 86% rate is consistent with structured prompting reducing hallucination. The skill did not eliminate errors: the remaining 6 pairs (14%) did not hold up.
Mechanism mismatch is the most common weakness. Both models occasionally cite a paper for a general topic but attribute a specific causal mechanism that the paper does not directly test (more exaggeration?).
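As a rough sense of how much weight the 86% figure can bear, the sketch below recomputes it from the counts reported above (38 valid out of 44 pairs) and adds a 95% Wilson score interval. The interval is our illustrative addition, not part of the original evaluation, and it treats the 44 pairs as independent draws, which is an assumption.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


valid_pairs, total_pairs = 38, 44   # counts reported in the takeaways above
rate = valid_pairs / total_pairs
low, high = wilson_interval(valid_pairs, total_pairs)
print(f"validation rate: {rate:.0%}, 95% Wilson interval: {low:.0%} to {high:.0%}")
```

With only 44 pairs from a single run per model, the interval is wide, which is one more reason to treat the rate as indicative rather than precise.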

These patterns point to the same conclusion: the agent skill structures the handoff between what LLMs know and what the data scientist needs to test, but validation remains essential. Consistent with prior work (e.g., Peters and Chin-Yee, 2025),¹ we find that LLMs exaggerate when reporting existing research and cannot fully replace domain expertise, but they do usefully expand the search space.

Next, we will develop an agent skill for the method selection step. Eventually, we plan to combine agent skills under the supervision of an orchestrator agent in a full data science workflow.

References

[1] Peters, U., & Chin-Yee, B. (2025). Generalization bias in large language model summarization of scientific research. R. Soc. Open Sci., 12(4), 241776.
