Augmented Data Science: Method selection agent
- Method selection is a stage where human judgment matters most: the hard part is often not fitting a model, but choosing a method whose assumptions match the business question and the data-generating process.
- We built a four-step agent skill (context gathering; causal, predictive, prescriptive, or descriptive framing; branch-specific clarifying questions; and an assumption-aware evidence table) to structure that choice.
- Testing the skill on price elasticity showed how the agent's questions shape the identification strategy.
- The two recent models we tested diverged: Claude Opus 4.7 emphasized instruments and design choices, while GPT-5.5 emphasized dose-response and machine-learning approaches.
- The skill provides directions, not decisions. The data scientist must decide which assumptions are plausible and which claims stakeholders actually need.
- The result is a more trustworthy method-selection workflow: LLMs produce the candidate method set and organize assumptions, while the data scientist still owns the final method choice.
Note: This joint post combines the Academic’s Take and the Director’s Cut.
We proposed the Augmented Data Science framework based on a literature review, and since then we have published a hypothesis-search agent. In this post, we develop and test an agent skill that data scientists can use for method selection (you can download it below).
The skill takes a modeling objective as input, then guides the data scientist through a sequence of steps: first, choosing a framing (causal, predictive, prescriptive, or descriptive / inferential); next, answering branch-specific clarifying questions about the data and the modeling objective; and finally, reviewing each candidate method's assumptions, split into data-testable ones the skill can verify and judgment-based ones the data scientist must defend. The skill grounds its method recommendations in a literature search and produces a table of plausible methods, each with its assumptions and trade-offs explained.
Our goal is to structure the LLM's role as an assistant to data scientists during method selection for business problems. In doing so, we hope to keep the data scientist in the driver's seat: the skill automates only what we know an LLM can succeed at (based on our literature review), while requiring the data scientist's input for everything else, ultimately creating a collaborative workflow in which human intent drives the LLM's execution. Our results show that the structured approach we propose makes method selection more trustworthy.
The method selection agent
The method selection agent skill (get the skill file to use in Claude, Codex, or other LLMs) instructs an LLM agent to walk a data scientist through a structured method search for a modeling objective. The skill produces a table of candidate methods, each accompanied by the assumptions it carries and the trade-offs it imposes.
The agent works in four steps:
Step 1 – Context gathering: The agent asks for the modeling objective: outcome, decision object, and time horizon. This statement anchors every subsequent step.
Step 2 – Framing: The agent asks the data scientist to choose one of four mutually exclusive framings: causal, predictive, prescriptive, or descriptive / inferential. Unlike in the hypothesis-search skill, hybrid labels are not allowed; method selection requires a single identification target.
Step 3 – Branch-specific clarifying questions: The agent asks five questions tailored to the chosen framing, one at a time. The causal branch asks about assignment mechanism, panel structure, treatment timing, instruments, and controls. The predictive branch asks about task type, label availability, data shape, sample size, and deployment constraints. The prescriptive branch routes through five sub-settings (static, online, offline logs, targeting, simulation) with their own follow-ups. The descriptive branch asks about the quantity of interest, sampling design, independence, distributional assumptions, and decision threshold. As the data scientist answers these questions, the agent further constrains the candidate set.
Step 4 – Assumption taxonomy and evidence table: The agent splits every recommended method's assumptions into data-testable (the skill can verify these or produce code that does: pre-trend parallelism, covariate overlap, positivity, stationarity, multicollinearity, and so on) and judgment-based (the data scientist must defend these: SUTVA, the exclusion restriction, no unobserved confounders, exogeneity). The agent then searches the literature and returns candidate methods in a structured table:
- Method: A named method with a one-sentence description
- When it fits: Why this method is a candidate given the user's context
- Data-testable assumptions: Each paired with its specific check (e.g., VIF for multicollinearity, first-stage F for instrument relevance)
- Judgment-based assumptions: Each paired with what the data scientist must defend
- Trade-offs: Bias-variance, data needs, interpretability, compute
- Source: Peer-reviewed paper, arXiv preprint, or methodology textbook chapter
The skill optionally runs or produces code (in Python or R) for the data-testable checks; judgment-based assumptions are always presented as an open checklist.
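To make the data-testable side concrete, below is a minimal Python sketch of two of the checks named above: a VIF screen for multicollinearity and a first-stage F-statistic for instrument relevance. The column names and thresholds are illustrative assumptions, not part of the skill file.

```python
# Minimal sketch of two data-testable checks the skill could run or emit.
# Column names are illustrative; adapt them to the actual panel.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df: pd.DataFrame, covariates: list[str]) -> pd.Series:
    """Variance inflation factors for a set of candidate controls."""
    X = sm.add_constant(df[covariates])
    vifs = {col: variance_inflation_factor(X.values, i)
            for i, col in enumerate(X.columns) if col != "const"}
    return pd.Series(vifs, name="VIF")  # common rule of thumb: flag values above ~10

def first_stage_f(df: pd.DataFrame, endog: str, instrument: str,
                  controls: list[str]):
    """F-test on the excluded instrument in the first-stage regression."""
    X = sm.add_constant(df[controls + [instrument]])
    first_stage = sm.OLS(df[endog], X).fit()
    # Weak-instrument concern if this F is well below the usual benchmark of ~10.
    return first_stage.f_test(f"{instrument} = 0").fvalue
```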
Testing the agent
We tested the agent using Claude Opus 4.7 and OpenAI GPT-5.5. As in the previous post, each test was a single run: a fresh session, only the skill file loaded, the four-step walkthrough conducted once per model. The evaluation is subjective; we assessed the relevance, specificity, and actionability of the recommended methods based on our methodological and domain expertise.
We focused on a single question in this test: Estimate the price elasticity of demand. Pricing is a setting where identification, not modeling, is the binding constraint, and where the framing-and-clarifying-question framework is likely most useful. To both models, we described an SKU-store-week panel spanning roughly 104 weeks, with regional list-price changes at common dates, recurring promotions whose realized depth depends on a centrally set price point hitting locally varying baselines, and observed wholesale cost and competitor prices. We deferred all data-testable checks to the data scientist in our runs, though the agent is able to handle them if asked.
The skill includes a self-audit step; for citation verification we additionally used Claude Cowork (Opus 4.7). Of the 26 method-citation pairs across both tables, every citation resolved to a real publication, and none were fabricated. As noted in the hypothesis-search post, using an LLM to validate LLM-generated output introduces circularity, so in practice, validation should be done by the data science team.
Summary of the results
For our price-elasticity question, each model recommended eight methods. Claude Opus 4.7 leaned toward identification design (two 2SLS variants, an event study, a regional DiD); GPT-5.5 leaned toward dose-response and machine learning (continuous-treatment DiD, generalized propensity score, causal forest, event-study/DiD); the two tables share four methods (panel fixed effects, synthetic control, double ML, structural demand).
Links to the results tables: Table P-O for Opus 4.7 and Table P-G for GPT-5.5
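To show what a recommendation like the 2SLS entries implies in practice, here is a minimal sketch we wrote ourselves (it is not copied from either table): a log-log demand regression with wholesale cost as the excluded instrument for price, using the `linearmodels` package. The variable names, the bare-bones specification, and the omission of fixed effects are all simplifying assumptions.

```python
# Minimal 2SLS sketch (illustrative; not the agent's output): log-log demand
# with wholesale cost as the excluded instrument for price.
import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS

def own_price_elasticity(panel: pd.DataFrame):
    """panel: SKU-store-week rows; the column names here are assumptions."""
    df = panel.assign(
        log_units=np.log(panel["units"]),
        log_price=np.log(panel["price"]),
        log_cost=np.log(panel["wholesale_cost"]),
        log_comp=np.log(panel["competitor_price"]),
    )
    # The bracket term marks price as endogenous, instrumented by wholesale cost;
    # competitor price enters as an exogenous control. Fixed effects and
    # seasonality would be added in a fuller specification.
    model = IV2SLS.from_formula(
        "log_units ~ 1 + log_comp + [log_price ~ log_cost]", data=df
    )
    res = model.fit(cov_type="robust")
    return res.params["log_price"]  # own-price elasticity estimate
```

The first-stage F check sketched earlier is the natural companion diagnostic for this specification.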
Takeaways
- Structured questions change what identification looks like. On the instruments question, Opus 4.7 extracted three candidates from our answers (wholesale cost, competitor prices, centrally set price points) and built two distinct two-stage least squares methods around them. GPT-5.5, given functionally similar information, interpreted competitors as controls and excluded standard instrumental-variables methods from its table.
- Opus probed where GPT didn't. When we answered "Unknown" to the control-group question, GPT accepted it and let it propagate silently into method exclusions; Opus broke the unknown into sub-questions, deferred one to a data-testable check, and listed the rest as cross-cutting open items. Opus also asked an unscripted scope question: own-price only, or own- and cross-price elasticities? We answered "both," which pinned a demand system (AIDS or random-coefficients logit) to the table as the only method capable of delivering cross-price elasticities. GPT did not ask, included a structural demand model anyway, and never flagged that it alone could answer the cross-price half.
- Each model fills a different gap. Both tables list eight methods with four in common (panel fixed effects, synthetic control, double ML, structural demand); a minimal double-ML sketch follows this list. Opus's other four lean toward explicit identification designs (two 2SLS variants, an event study, a regional DiD), consistent with its identification-first posture of extracting instruments and scope questions from our answers. GPT's other four focus on dose-response and machine learning (continuous-treatment DiD, generalized propensity score, causal forest, event-study/DiD), consistent with its methodological-skeptic posture, which cited a recent paper (Bray et al., 2024) [1] showing that observational elasticities can diverge sharply from experimental ones even after standard adjustments. A pricing team running both would get a broader set of options and better guidance on which ones to implement.
- The skill provides directions, not decisions. The skill does not pick the method. What it produces is a defensible starting inventory: methods that fit the setup, assumptions to defend, claims to make or avoid. The data scientist still decides which assumptions are plausible, which methods are worth the implementation cost, and what claim stakeholders actually need.
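As referenced in the third takeaway, below is a minimal sketch of the double-ML idea that appears in both tables: partial out the controls from log demand and log price with out-of-fold (cross-fitted) predictions, then regress residual on residual. It is a bare-bones partialling-out estimator with assumed column names, not the agent's output, and it omits the inference refinements a full double-ML implementation would include.

```python
# Minimal double-ML sketch (illustrative): partial out controls from both
# log demand and log price with out-of-fold random-forest predictions, then
# regress residual on residual to estimate the elasticity.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def dml_elasticity(df: pd.DataFrame, controls: list[str]) -> float:
    y = np.log(df["units"].to_numpy())      # log demand
    t = np.log(df["price"].to_numpy())      # log price (continuous "treatment")
    X = df[controls].to_numpy()

    # Out-of-fold (cross-fitted) nuisance predictions for outcome and treatment.
    y_hat = cross_val_predict(RandomForestRegressor(n_estimators=200), X, y, cv=5)
    t_hat = cross_val_predict(RandomForestRegressor(n_estimators=200), X, t, cv=5)

    y_res, t_res = y - y_hat, t - t_hat
    # Final-stage regression of residualized demand on residualized price.
    return float(np.dot(t_res, y_res) / np.dot(t_res, t_res))
```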
Consistent with our Augmented Data Science framework, LLMs remain assistants for method selection, not substitutes for it; the human data scientist stays the decision maker. By clearly separating data-testable from judgment-based assumptions, the skill guides the data scientist through the method-selection process. We'll next build on this skill and develop a data-cleaning agent skill to add to the workflow.
References
[1] Bray, R. L., Sanders, N. R., & Stamatopoulos, I. (2024). Marketing science and field experiments: Do observational elasticities match experimental elasticities? SSRN.
