Augmented Data Science: Hypothesis search agent

tl;dr (never ai;dr)
  • Finding testable hypotheses has traditionally relied on the data scientist's experience and literature review. An agent skill can structure and expand that search.
  • We built a three-step skill (context gathering, causal vs. predictive framing, and evidence-backed hypothesis table) and tested it on two retail business questions using Claude Opus 4.6 and GPT 5.4.
  • The framing step did most of the work: with minimal context, the causal/predictive distinction produced useful, literature-backed hypotheses. Running two different models was complementary, not redundant.
  • 86% of references checked out; directional claims were reliable, but effect sizes were not. The skill expands the search space, but it does not replace the domain expertise that makes the search meaningful.

In our previous post, we proposed the Augmented Data Science framework and highlighted search-space expansion as a strength of LLMs in data science. In this post, we develop and test an agent skill that helps data scientists generate useful hypotheses for business questions. A hypothesis here is a testable claim about the relationship between a feature and an outcome of interest.

Say our business problem is customer churn. In the past, finding plausible, testable relationships relied on the knowledge, experience, and intuition of the data science and business teams. LLM agents can now expand that search; however, when asked directly, they generate generic, context-free hypotheses. As we found in our literature review, LLMs produce more useful hypotheses once the data science team establishes intent. An agent skill is a natural way to build that intent into the search.


The hypothesis search agent

The hypothesis search agent skill (click here to get the skill file to use in Claude, Codex, or other LLMs) instructs an LLM agent on how to guide the data scientist through a structured search for evidence-backed hypotheses. It first asks about the problem's context and methodological framing, then produces hypotheses grounded in literature.

The agent works in three major steps:

Step 1 – Context gathering: The agent asks for the specific outcome of interest, business environment, available data sources, and any prior literature or reports that the data scientist already knows. This context makes the search meaningful – without it, the LLM is broadly guessing.

Step 2 – Framing: The agent asks the data scientist to choose between a causal or predictive framing of the problem and explains how the search can be done differently under each framing. This step defines the boundary of the search for the target effect (causal impact vs. predictions).

Step 3 – Hypotheses and evidence table: Based on the context and framing, the agent proceeds to produce candidate hypotheses and presents them in a structured table:

  • Hypothesis: A testable claim about the relationship between a feature and the outcome
  • Feature(s): The variable(s) involved
  • Direction: The expected direction of the effect (when causal)
  • Effect Size / Importance: Prior estimate from the literature when available
  • Framing: Whether the relationship is causal or predictive
  • Source: Format customized to the type of source
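As a concrete sketch, one row of the hypotheses-and-evidence table can be represented as a simple record. The class and the example values below are hypothetical, chosen only to mirror the columns listed above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hypothesis:
    """One row of the agent's hypotheses-and-evidence table."""
    hypothesis: str             # testable claim linking feature(s) to the outcome
    features: list[str]         # variable(s) involved
    direction: Optional[str]    # expected sign of the effect (causal framing only)
    effect_size: Optional[str]  # prior estimate from the literature, if available
    framing: str                # "causal" or "predictive"
    source: str                 # citation; format depends on source type

# Illustrative row (made up for this sketch, not from the test runs):
row = Hypothesis(
    hypothesis="Out-of-stocks on anchor items reduce cross-sell attachment",
    features=["anchor_item_stockout_rate"],
    direction="negative",
    effect_size=None,
    framing="causal",
    source="peer-reviewed journal article",
)
```

Keeping the row typed like this makes it easy to filter by framing or drop rows with no literature-backed effect size before handing the table to the team.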

Testing the agent

We tested the agent on two problems using both Claude Opus 4.6 and OpenAI GPT 5.4. Each test was a single run: we started a fresh session, loaded only the skill file, and walked through the three steps once per model per question (what a data scientist would get on a first pass). The evaluation is subjective; we assessed the relevance, specificity, and actionability of the generated hypotheses against our own methodological and domain expertise.

We selected the following questions because they span two common problem types (diagnosing a change vs. predicting behavior) and because we can validate the results using our own experience in these domains:

Question 1: Why is the attachment rate for cross-sell items lower than last quarter?

Question 2: What are the predictors of a customer's next purchase?

To establish a baseline, we answered every context-gathering question with 'None' except for the type of available data. For Q1, we provided "transactions, pricing, promotions, demographics, product catalog, inventory, and store traffic". For Q2, we provided "clickstream, transactions, customer demographics and loyalty, pricing, promotions, recommended products".
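For reference, the baseline inputs can be written down as a small config. The dictionary keys are illustrative; the values are exactly what we supplied, with every other context-gathering question answered "None":

```python
# Minimal context given to the agent in each run; all other
# context-gathering questions were answered "None".
baseline_context = {
    "Q1": {
        "question": "Why is the attachment rate for cross-sell items "
                    "lower than last quarter?",
        "available_data": ["transactions", "pricing", "promotions",
                           "demographics", "product catalog",
                           "inventory", "store traffic"],
    },
    "Q2": {
        "question": "What are the predictors of a customer's next purchase?",
        "available_data": ["clickstream", "transactions",
                           "customer demographics and loyalty",
                           "pricing", "promotions", "recommended products"],
    },
}
```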

The skill includes a validation step; for this test, we verified the results using Claude Cowork (a separate session with Opus 4.6). Using an LLM to validate LLM-generated output introduces circularity. We accepted this limitation for the scope of this test, but in practice, validation should be done by the data science team.


Summary of the results

For Q1 (why the cross-sell attachment rate dropped), Claude Opus 4.6 generated 10 hypotheses grounded in marketing science literature. Each hypothesis named a specific identification strategy (panel data with store fixed effects, structural demand model, randomized field experiments) and flagged endogeneity or selection bias in the caveats column. GPT 5.4 produced 10 hypotheses for the same question, drawing from a wider range of journals (including operations research, information systems, and consumer psychology outlets beyond the core marketing journals that Claude concentrated on).

Links to the results tables: Table Q1-C for Opus 4.6, Table Q1-G for GPT 5.4

For Q2 (predicting a customer's next purchase), Claude Opus 4.6 generated 16 hypotheses spanning purchase prediction, recommendation systems, and customer lifetime value. GPT 5.4 generated eight hypotheses focusing on temporal dynamics (purchase regularity, preference drift) and session-level signals, naming specific engineered features (e.g., mean gap, std gap, cv gap, clumpiness index). Claude sourced from literature directly related to the domain; GPT 5.4 also drew from adjacent areas.

Links to the results tables: Table Q2-C for Opus 4.6, Table Q2-G for GPT 5.4


Takeaways

  • The framing step is already doing most of the work. Asking the right question (seeking causality or predictive improvement) defines the bounds of the search. Even with no domain context provided, the causal/predictive distinction produced meaningfully different outputs. GPT's hybrid framing labels ("Predictive - relevant to causal framing as a mechanism candidate") show the step surfaced genuine ambiguity rather than just sorting into bins.
  • Minimal context sets the floor. We answered every context question with "None" — no prior studies, no known confounders — and only provided a set of variable names. These results represent what the structured framing alone contributes. Richer context (e.g., business constraints, known confounders) would likely improve specificity and reduce the need for post-hoc filtering.
  • Effect sizes are seldom reported and often incorrect. Despite this being a stated goal of the agent, neither model reliably populated quantitative effect sizes. Part of the reason is that many of the cited papers sit behind paywalls and their abstracts do not report effect sizes. When effect sizes were reported, they were sometimes wrong. For example, the effect reported in Table Q1-C (Claude), "Corsten & Gruen — 40–50% lost," overstates the actual finding. We also noticed other incorrect effect sizes (e.g., Montgomery et al. and Zhang & Breugelmans in Q2-C (Claude)). The Direction column, on the other hand, was reliable.
  • Different models, different strengths. Identical instructions produced meaningfully different outputs. Claude excelled at methodological specificity: identification strategies, explicit causal caveats, concentrated peer-reviewed sources. GPT excelled at feature operationalization and source diversity. The bibliographies are almost completely non-overlapping. Running both is not redundant; combining multiple models for the same search may provide complementary benefits, which we plan to test next.
  • Citations are reliable but not perfect. Using Claude Cowork (Opus 4.6), we verified the 44 hypothesis-citation pairs that were reported: 38 (86%) are valid. No paper was fabricated, and most hypotheses align with the cited source. An 86% validation rate suggests the structured prompting meaningfully reduces hallucination compared to unstructured queries.
  • Mechanism mismatch is the most common weakness. Both models occasionally cite a paper for a general topic but attribute a specific causal mechanism that the paper does not directly test.
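The 86% figure in the citation takeaway above is simply the share of verified pairs that held up:

```python
# Tally from the Claude Cowork validation pass over both models' tables:
total_pairs = 44   # hypothesis-citation pairs reported across the runs
valid_pairs = 38   # pairs where the claim matched the cited source
validation_rate = valid_pairs / total_pairs
print(f"validation rate: {validation_rate:.0%}")  # prints "validation rate: 86%"
```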

These patterns point to the same conclusion: the agent structures the handoff between what LLMs know and what the data scientist needs to test. It does not replace the domain expertise that makes the handoff meaningful, but it does expand the search space in ways that manual literature review alone would not. LLM output remains a starting point for hypothesis search, not a substitute for it.

Next, we will develop an agent skill for the method selection step. Eventually, we plan to combine agent skills under the supervision of an orchestrator agent in a full data science workflow. Stay tuned!
