Augmented Data Science: Human Intent, AI Execution
- LLMs are now part of data science workflows. The question is no longer if we use them, but how and where.
- We reviewed emerging research to create a framework defining the synergy between human and machine: human intent followed (and bounded) by LLM execution.
- Effective AI-assisted workflows successfully decouple intent (the "what" and "why") from execution (the "how"). In this model, the data scientist acts as the orchestrator, while the LLM serves as the execution engine.
- Success hinges on assigning intent to the correct tasks. The data scientist sets goals and validates assumptions. The LLM searches well-defined spaces and executes code subject to human validation.
- This post sets the stage for a new series where we hope to introduce a task-level guide for data science leaders embedding LLMs into core business processes: The Augmented Data Science Decision Matrix.
A scenario that should worry data science leaders
The data science team at a large retailer uses LLM agents to build a weekly SKU-level demand forecasting model. The agents engineer features from promotional calendars, holiday flags, and historical transaction data. They select a gradient boosting pipeline, tune the hyperparameters, and write clean, well-documented code. Cross-validation metrics look strong. The pipeline passes code review. A few weeks post-deployment, model performance drops off a cliff. The downstream inventory system starts pushing excess stock to stores that don't need it while leaving others understocked. Market share and sales targets are missed.
What might have happened? The agents rapidly and iteratively improve the pipeline based on test-set performance, effectively training on the evaluation data. They also select a method that assumes independent observations, missing the temporal dependence in the weekly demand series. The code is syntactically perfect. The metrics are real. The model is wrong.
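The temporal-dependence failure above has a standard remedy: evaluate forecasts with splits that respect time order, so the model is never scored on weeks that precede its training data. A minimal sketch of such a rolling-origin (expanding-window) splitter; the function name and the 104-week / 4-fold / 8-week parameters are illustrative, not from the scenario:

```python
# Rolling-origin (expanding-window) splits for a weekly demand series:
# every training index precedes every test index, so the evaluation never
# leaks future observations into the training window.

def rolling_origin_splits(n_obs, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs that respect temporal order."""
    for k in range(n_splits):
        test_end = n_obs - (n_splits - 1 - k) * test_size
        test_start = test_end - test_size
        yield list(range(0, test_start)), list(range(test_start, test_end))

# 104 weeks of history, 4 folds, forecasting 8 weeks ahead per fold.
for train, test in rolling_origin_splits(104, n_splits=4, test_size=8):
    assert max(train) < min(test)  # no temporal leakage
    print(f"train weeks 0..{max(train)}, test weeks {min(test)}..{max(test)}")
```

Whether this splitter (or an off-the-shelf equivalent) is the right evaluation design is exactly the kind of intent decision the post argues should stay with the data scientist.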
Or consider a subtler version. The agents, scanning for predictive features, find that the discount depth is strongly correlated with sales volume. The agents include it as a top feature. The problem is, promotional prices aren't set independently of demand expectations. Category managers set deeper discounts on items that are already expected to move slowly. The feature doesn't capture the sensitivity to the discount depths; it captures the category manager's perception of upcoming demand. The model ends up learning that low expected demand is correlated with deep discounts, which it interprets as "deep discounts predict low demand." In production, the model underforecasts demand for heavily promoted items, triggering understocking during the exact weeks the promotion was supposed to drive traffic.
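The confounded-discount story can be reproduced in a few lines of simulation. In the toy data-generating process below, discounts causally *increase* demand, yet because managers discount the items they already expect to move slowly, the naive correlation between discount depth and realized demand comes out negative. All coefficients and noise scales are illustrative:

```python
# Toy simulation of the confounded-discount scenario: deep discounts are
# assigned to expected slow movers, so discount depth correlates negatively
# with realized demand even though the true causal effect is positive.
import random

random.seed(0)
rows = []
for _ in range(5000):
    expected_demand = random.gauss(100, 20)
    # Managers discount slow movers: depth rises as expected demand falls.
    discount = max(0.0, 0.5 - 0.004 * expected_demand + random.gauss(0, 0.02))
    # True data-generating process: a discount genuinely lifts demand.
    realized = expected_demand * (1 + 0.8 * discount) + random.gauss(0, 5)
    rows.append((discount, realized))

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

discounts, demand = zip(*rows)
print(f"naive corr(discount, demand) = {corr(discounts, demand):.2f}")
# Negative, despite discounts causally increasing demand in the simulator.
```

A feature-scanning agent sees only the negative correlation; knowing that discount depth is set by category managers who anticipate demand is domain knowledge that lives with the data scientist.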
These are just two examples of the vibe modeling trap: building models whose inputs, methods, and outputs look right (clean code, strong metrics, professional documentation) while the underlying assumptions are wrong.
Value of keeping the intent with the data scientist
Treating a data science workflow purely as an automation task is dangerous because it delegates critical decisions to the LLM. A recent benchmark study captures the consequences of such unguided automation. The NeurIPS 2024 StatQA benchmark of 11,623 examples is designed to evaluate LLMs on statistical analysis tasks, specifically, whether they can select the right statistical method and assess whether that method's requirements are met by the data (i.e., do the assumptions hold?). The benchmark covers five categories: correlation analysis, contingency table tests, distribution compliance tests, variance tests, and descriptive statistics. An interesting pattern emerges here. While GPT-4o achieves 64.83% accuracy with domain-knowledge prompting, it achieves only 44-49% without it (Zhu et al., 2024).11 This is another example of the importance of domain knowledge in data science tasks. Instead of forcing LLMs to succeed at a task they inherently fail at, the data scientist needs to exercise intent.
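The kind of assumption check that StatQA probes, and that LLMs frequently skip, is often mechanically simple once a human decides it matters. As one hypothetical example, before applying any method that assumes independent observations, the data scientist can measure lag-1 autocorrelation in the series; the 0.3 threshold below is an illustrative rule of thumb, not a formal test:

```python
# Sanity check for the independence assumption: estimate lag-1
# autocorrelation of the series before applying an i.i.d.-assuming method.

def lag1_autocorr(series):
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1))
    den = sum((x - mean) ** 2 for x in series)
    return num / den

weekly_demand = [100.0 + 2.0 * t for t in range(52)]  # trending series
r1 = lag1_autocorr(weekly_demand)
if abs(r1) > 0.3:  # illustrative rule of thumb, not a formal test
    print(f"lag-1 autocorrelation {r1:.2f}: independence assumption is suspect")
```

The check itself is trivial to execute; deciding that it must be run, and what to do when it fires, is the intent the benchmark shows LLMs failing to supply.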
Similar patterns emerge in other data science tasks that require judgment and intentional decisions – treating the LLM as a replacement yields syntactically correct pipelines that fail to solve the actual problem. The failure is due to a lack of domain knowledge, package hallucinations, or flawed logic (Spracklen et al., 2025; Zhu et al., 2024).7,11
We find that a better alternative is what we call intentional augmentation. In a successful orchestration model, the data scientist must explicitly establish the intent before the LLM acts. The data scientist sets up the context, provides domain knowledge, formulates the hypothesis, selects methods that fit the data and assumptions, defines the constraints, and designs the validation metrics and strategy. Only after such critical decisions are made should the LLM take over the execution-intensive subtasks. The distinction matters because it determines who owns the problem's boundaries.
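One way to make intent explicit before any LLM execution is to write it down as a structured spec that the data scientist authors and that travels with every delegated call. A minimal sketch; the class, field names, and example values are our own illustration, not a standard API:

```python
# A structured "intent spec" the data scientist authors before delegating
# execution: it bounds what the LLM may search and how results are judged.
from dataclasses import dataclass, field

@dataclass
class IntentSpec:
    hypothesis: str                                       # what we believe and want to test
    target: str                                           # the modeling objective
    allowed_methods: list = field(default_factory=list)   # data-scientist-vetted methods
    validation: str = "rolling-origin CV"                 # leakage-safe evaluation strategy
    constraints: list = field(default_factory=list)       # domain rules the LLM must not violate

spec = IntentSpec(
    hypothesis="Promo depth lifts demand after controlling for expected demand",
    target="weekly SKU-level demand",
    allowed_methods=["gradient boosting", "seasonal ARIMA"],
    constraints=["no post-outcome features", "respect temporal ordering"],
)
print(spec.validation)
```

The point of the artifact is ownership: every field records a decision the data scientist made, so any pipeline the LLM proposes can be checked against it.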
When execution follows intentional decisions, LLMs become powerful engines. They can expand the search space in directions the data scientist couldn't have manually mapped. For example, when LLM agents are combined with tree search to explore pipeline configurations, they achieve 65-80% win rates against traditional AutoML by experimenting with a more diverse set of approaches based on the data scientist's defined goal (Chi et al., 2024).2 LLM-driven feature engineering shows the same pattern: LLMs propose transformations a data scientist wouldn't think of, guided by data-driven feedback, and outperform baselines through intensive combinatorial search (Abhyankar et al., 2025).1
The same synergy between human judgment and LLMs extends to hypothesis generation. Consider the challenge of identifying whether an online review is genuine or a fake written to manipulate rankings. This task requires a data scientist to detect subtle linguistic cues and generate a hypothesis to explain the reasoning behind their classification. When data scientists are assisted by an LLM that integrates literature-based insights with data-driven pattern detection, the success rate in these tasks improves by 7-14% (Liu et al., 2024).5 That improvement depended on the LLM receiving the researcher-provided context; without that context, a cold-start model would have no basis for generating meaningful hypotheses. Ultimately, the LLM's value lies in its ability to discover and refine candidate hypotheses under human-defined constraints, augmenting human judgment rather than replacing it.
These are just a few examples. We find similar patterns in the other stages of the data science workflow, such as data cleaning, feature engineering, model selection, and model validation. In what follows, we distill our learning from the existing research and discuss which parts of the data science workflow LLMs can execute effectively under data scientist orchestration. We present the Augmented Data Science Decision Matrix to guide data science leaders in selecting the appropriate LLM role.
The augmented data science decision matrix
The following matrix maps data science tasks to recommended LLM roles based on emerging research. Ultimately, we find that we can construct the matrix by asking a simple question: Does this task involve establishing intent (the "what/why") or executing it (the "how")?
If the task is to generate a new feature or optimize a defined metric, that's execution: the LLM can expand the search space under the data scientist's direction while the data scientist continues to review the results. If the task is to construct a hypothesis or select a statistical method, that's establishing intent: the data scientist needs to exercise judgment in making assumptions that aren't learnable from the data. In other words, where the task is establishing intent, the data scientist must act as the orchestrator; where the task is execution, the LLM thrives.
Across a range of tasks, we observe that the data scientist's intent bounds the LLM's execution. The LLM does not autonomously define the execution space. It can only execute within the bounds defined by the data scientist, which, we find, are shaped by the data scientist's experience, domain knowledge, and business context, all of which are part of the intent.
| Task | Data Scientist's Role | LLM's Role | Supporting Evidence |
|---|---|---|---|
| Hypothesis formulation | Establishes Intent. Formulates testable, domain-grounded hypotheses; defines the modeling objective and business context. | Surfaces Candidates. Can retrieve prior work to generate candidate hypotheses and broaden search space. | LLMs can produce plausible but ungrounded claims (Liu et al., 2025; Xiong et al., 2025).4,9 |
| Method selection | Establishes Intent. Selects and validates the method; checks assumptions against the data, mapping data to constraints. | Proposes Candidates. Can suggest methods with documented assumptions and implement a method once selected and tested. | LLMs prove to be poor statisticians when tasked with validating the applicability of underlying methodological assumptions (Zhu et al., 2024).11 |
| Data cleaning | Establishes Intent. Defines cleaning heuristics; selects imputation strategies; validates the risk of bias or leakage in automated fixes. | Executes Search. Standardizes formats, imputes missing values, and identifies potential outliers or data leakage. | LLMs can automate standardization tasks (Qi et al., 2024), but require human judgment to prevent data leakage and ensure alignment with domain constraints (Wang et al., 2025).6,8 |
| Feature engineering | Reviews Execution. Evaluates new features on holdout data; selects the ones that improve the model. | Executes Search. Proposes transformations, generates candidate features using domain knowledge from training data to optimize a metric. | LLMs excel at proposing novel, data-driven transformations when guided by combinatorial search and evaluation feedback (Abhyankar et al., 2025).1 |
| Modeling & Hyperparameter Tuning | Establishes Intent. Reviews the assumptions and data-method fit: sample characteristics, distributional fit, temporal structure, class balance. | Executes Search. Generates and ranks candidate pipelines; searches the configuration space. | Tree-search-enhanced agents beat AutoML at generating pipeline candidates, but frequently fail to verify underlying statistical assumptions (Chi et al., 2024; Zhu et al., 2024).2,11 |
| Debugging | Reviews Execution. Confirms the fix is correct and doesn't introduce new issues. | Executes Review. Diagnoses errors, inspects program state, proposes fixes to known problems. | LLM-driven tools are highly effective at correctly diagnosing and resolving implementation errors (Levin et al., 2025).3 |
| Interpretation | Establishes Intent. Evaluates causal validity; identifies confounders, endogeneity, spurious correlations using domain expertise. | Generates Candidate Narratives. Produces interpretation drafts and flags potential patterns for the data scientist to validate and stress-test. | LLMs exhibit critical failure modes when attempting to independently construct valid causal narratives from data (Yamin et al., 2024).10 |
| Monitoring & Drift Detection | Establishes Intent. Defines success metrics and failure modes; interprets anomalies; decides when to retrain. | Executes Monitoring. Monitors data streams for statistical drift, identifies anomalies, and tracks performance metrics. | LLM agents can automate drift detection, but require human judgment to differentiate between meaningful distributional shifts and temporary environment noise (Wang et al., 2025).8 |
The framework explained
We can collapse the matrix into five steps.
Step 1. The data scientist establishes intent (Hypothesis Formulation & Method Selection). This is the first step because it defines what is being modeled and why. A common mistake is asking an LLM to "suggest hypotheses for why sales dropped in Region A." LLMs tend to offer generic, context-free answers. If the information is not part of the internal data provided to the LLM, the LLM doesn't know that a new competitor opened three stores in Q3. Existing benchmarks confirm our intuition: the best LLM-based hypothesis generation methods recover only 38.8% of ground-truth hypotheses, and performance degrades sharply with problem complexity (Liu et al., 2025).4 A similar study on scientific hypothesis generation finds that LLMs routinely produce hypotheses that sound plausible but lack grounding (Xiong et al., 2025).9 While asking LLMs to produce hypotheses from scratch is not a good idea, they can still expand the search space with ideas after the initial hypothesis is formulated, especially when supplied with context-specific direction, relevant literature, and domain knowledge.
Step 2. The LLM executes the search (Bounded Exploration & Search). Data cleaning, feature engineering, and hyperparameter tuning represent the steps where LLMs thrive. The LLM uses the intent from the data scientist as guidance, searching a combinatorial space for the best implementation.
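Step 2 can be sketched in code: the data scientist fixes the search bounds, and the search procedure (here a plain random search standing in for the LLM-driven one) only explores inside them. The bounds and the toy scoring function are illustrative assumptions, not recommendations:

```python
# Bounded exploration: the human fixes BOUNDS; the search only samples
# configurations inside them and never redefines its own space.
import random

random.seed(42)
BOUNDS = {  # set by the data scientist, not by the search
    "learning_rate": (0.01, 0.3),
    "max_depth": (2, 8),
}

def score(params):
    # Stand-in for cross-validated model quality; peaks at lr=0.1, depth=5.
    return -(params["learning_rate"] - 0.1) ** 2 - (params["max_depth"] - 5) ** 2

def bounded_random_search(n_trials=200):
    best, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": random.uniform(*BOUNDS["learning_rate"]),
            "max_depth": random.randint(*BOUNDS["max_depth"]),
        }
        s = score(params)
        if s > best_score:
            best, best_score = params, s
    return best

best = bounded_random_search()
assert BOUNDS["learning_rate"][0] <= best["learning_rate"] <= BOUNDS["learning_rate"][1]
print(best)
```

Swapping the random sampler for an LLM-proposed candidate generator changes the quality of the exploration, not the ownership of the bounds.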
Step 3. The data scientist reviews execution (Assumption Validation). Is the model appropriate for this dataset? Does it account for seasonality? These are the questions where LLMs fail most (Zhu et al., 2024).11 The data scientist ensures the pipeline matches the intent.
Step 4. The LLM executes refinements (Debugging). Once the model is approved, the LLM can help diagnose and fix implementation issues, repairing the execution without altering the intent (Levin et al., 2025).3
Step 5. The data scientist generates insights and enforces intent (Interpretation & Monitoring). Post-deployment, the data scientist's role shifts to interpretation and continued enforcement of intent. The interpretation depends on the business context and the specific problem being solved, so the data scientist leads. The LLM acts as a sentinel, monitoring the production data stream for statistical drift or performance degradation. A sharp drop in accuracy? It doesn't always mean the model is "broken" – it might mean the environment changed in a way that requires a new analytical approach (revision of intent). The data scientist evaluates these signals to determine whether the model should be retrained, updated, or retired, ensuring alignment between the execution engine and the business goal. The LLM then takes over again, executing the retraining or updating process. The data scientist still decides the final model that ships.
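A minimal drift sentinel in the spirit of Step 5 might compute a Population Stability Index (PSI) between the training and live distributions of a feature; the monitor runs automatically, while the alert threshold and the retrain decision stay with the data scientist. The bin edges and the conventional 0.2 threshold below are illustrative choices:

```python
# Drift monitoring via Population Stability Index (PSI): compare the live
# feature distribution against the training-time distribution over shared
# bins; large PSI flags a shift a human should triage.
import math

def psi(expected, actual, edges):
    """Population Stability Index over shared bin edges."""
    def frac(xs, lo, hi):
        return max(sum(lo <= x < hi for x in xs) / len(xs), 1e-6)  # avoid log(0)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        e, a = frac(expected, lo, hi), frac(actual, lo, hi)
        total += (a - e) * math.log(a / e)
    return total

train = [i % 10 for i in range(1000)]                # uniform over 0..9
live = [min(9, (i % 10) + 3) for i in range(1000)]   # shifted distribution
edges = [0, 2, 4, 6, 8, 10]
drift = psi(train, live, edges)
print(f"PSI = {drift:.2f}")  # > 0.2 conventionally flags significant drift
```

Whether a flagged shift is noise, a data-quality incident, or a genuine change in the environment that demands a revision of intent is precisely the judgment the framework keeps with the data scientist.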
Bottom line: Data science leaders must draw the line
The studies we have reviewed show that the line between intentional augmentation and hasty delegation determines the success of AI-assisted data science. Tasks like hypothesis formulation, method selection, and interpretation establish intent and define what separates good performance from bad, instructing the LLM agents about bounds and goals. On the other hand, data cleaning, feature engineering, modeling, hyperparameter tuning, debugging, and monitoring are execution engines: they search a space and optimize a defined goal. Ultimately, success in embedding LLMs into the data science workflow requires the data science leader to position the data science team as the orchestrator, ensuring that the LLM agents' execution remains firmly anchored in the team's intentional decisions.
Implications for data centricity
Data centricity is staying true to the data by strengthening the assumptions that link data to models and then to decisions. The augmented data science framework draws the same line: the data scientist's intent is where those assumptions are made, validated, and stress-tested against domain knowledge and business context. Hypothesis formulation encodes the assumptions about what drives the outcome; method selection encodes the assumptions about the data-generating process; interpretation validates whether the model's output is consistent with what the data actually show. These steps preserve the integrity of the data-to-decision path. When intent is delegated to the LLM agents, the assumption chain breaks: the model may look right, the code may run, and the metrics may be strong, but the assumptions that anchor the decisions to reality go unchecked. Our framework suggests that staying data-centric in an AI-assisted data science workflow requires the data scientist to own the assumptions at each stage, using the LLMs as the execution engine that searches within those bounds rather than defining them.
References
[1] Abhyankar, N., Shojaee, P., & Reddy, C. K. (2025). LLM-FE: Automated feature engineering for tabular data with LLMs as evolutionary optimizers. arXiv preprint arXiv:2503.14434. ↩
[2] Chi, Y., Lin, Y., Hong, S., Pan, D., Fei, Y., Mei, G., ... & Wu, C. (2024). Sela: Tree-search enhanced llm agents for automated machine learning. arXiv preprint arXiv:2410.17238. ↩
[3] Levin, K. H., Van Kempen, N., Berger, E. D., & Freund, S. N. (2025). Chatdbg: Augmenting debugging with large language models. Proceedings of the ACM on Software Engineering, 2(FSE), 1892-1913. ↩
[4] Liu, H., Huang, S., Hu, J., Zhou, Y., & Tan, C. (2025). Hypobench: Towards systematic and principled benchmarking for hypothesis generation. arXiv preprint arXiv:2504.11524. ↩
[5] Liu, H., Zhou, Y., Li, M., Yuan, C., & Tan, C. (2024). Literature meets data: A synergistic approach to hypothesis generation. arXiv preprint arXiv:2410.17309. ↩
[6] Qi, D., Miao, Z., & Wang, J. (2024). CleanAgent: Automating Data Standardization with LLM-based Agents. arXiv preprint arXiv:2403.08291. ↩
[7] Spracklen, J., Wijewickrama, R., Sakib, A. N., Maiti, A., & Viswanath, B. (2025). We have a package for you! A comprehensive analysis of package hallucinations by code generating LLMs. In 34th USENIX Security Symposium (USENIX Security 25) (pp. 3687-3706). ↩
[8] Wang, P., et al. (2025). Large Language Model-based Data Science Agent: A Survey. arXiv preprint arXiv:2508.02744. (Presented at AAAI 2025). ↩
[9] Xiong, G., Xie, E., Williams, C., Kim, M., Shariatmadari, A. H., Guo, S., ... & Zhang, A. (2025). Toward reliable scientific hypothesis generation: Evaluating truthfulness and hallucination in large language models. arXiv preprint arXiv:2505.14599. ↩
[10] Yamin, K., Gupta, S., Ghosal, G. R., Lipton, Z. C., & Wilder, B. (2024). Failure modes of LLMs for causal reasoning on narratives. arXiv preprint arXiv:2410.23884. ↩
[11] Zhu, Y., Du, S., Li, B., Luo, Y., & Tang, N. (2024). Are large language models good statisticians? Advances in Neural Information Processing Systems, 37, 62697-62731. ↩
