Augmented Data Science: Data cleaning agent

tl;dr (never ai;dr)
  • Data cleaning is an execution-heavy task, but only after the data scientist defines the modeling objective.
  • We developed and tested a data-cleaning agent skill that profiles the data, separates method-invariant cleaning from method-dependent cleaning, and labels each step as diagnose, act, or ask.
  • On a large, rich dataset of Amazon purchases and purchaser demographics, the agent produced a modeling-ready table, whereas a baseline prompt applied only generic data cleaning.
  • The skill made decisions explicit. This helped detect leakage in our use case: the skill excluded same-period purchase-derived features such as price, spend, and frequency because those features leak the target.
  • Our analysis here supported the Augmented Data Science framework we introduced earlier: the LLM can execute cleaning, but the data scientist must set the target, method, and acceptable assumptions.

Note: This joint post combines the Academic’s Take and the Director’s Cut.

This is the fourth article in our Augmented Data Science series. Earlier posts focused on intent-heavy tasks: hypothesis search and method selection. Data cleaning sits on the other side of the matrix. It is execution-heavy, but not context-free. A dataset is not generically "clean"; it is clean for a specific modeling task.

In practice, data cleaning involves critical decisions. We focus on making these decisions more explicit. Tasks such as joining tables, parsing dates, and stripping whitespace are largely method-invariant. But decisions about missingness, categorical encoding, scaling, outlier treatment, and class imbalance depend on the model and target.

Cleaning without defining the modeling task, therefore, creates a tidy file that may still be wrong for analysis.


The data cleaning agent

The data cleaning agent (get the skill file for your preferred LLM) asks for the modeling goal, preferred method, package, and source files. It then profiles the data and asks more detailed questions. The cleaning plan separates:

  • Method-invariant steps: type coercion, format standardization, deduplication, column removal, category cleanup, domain checks, encoding repair, joins, unit consistency, etc.
  • Method-dependent steps: missingness, scaling, categorical encoding, outlier treatment, class imbalance, etc.

Each step is labeled:

  • Diagnose: run a check and report the result
  • Act: apply a deterministic fix
  • Ask: ask for the data scientist's judgment
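
One way to picture such a plan is as a list of steps, each tagged with its scope and its label. The sketch below is illustrative only: the class names, the `needs_human_input` helper, and the three example steps are hypothetical and are not taken from the skill file.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    METHOD_INVARIANT = "method-invariant"
    METHOD_DEPENDENT = "method-dependent"


class Label(Enum):
    DIAGNOSE = "diagnose"  # run a check and report the result
    ACT = "act"            # apply a deterministic fix
    ASK = "ask"            # ask for the data scientist's judgment


@dataclass(frozen=True)
class CleaningStep:
    name: str
    scope: Scope
    label: Label


def needs_human_input(plan):
    """Return the steps that must wait for the data scientist's judgment."""
    return [step for step in plan if step.label is Label.ASK]


# Hypothetical plan entries, for illustration only.
plan = [
    CleaningStep("strip whitespace", Scope.METHOD_INVARIANT, Label.ACT),
    CleaningStep("check join cardinality", Scope.METHOD_INVARIANT, Label.DIAGNOSE),
    CleaningStep("handle missing demographics", Scope.METHOD_DEPENDENT, Label.ASK),
]

for step in needs_human_input(plan):
    print(f"ASK: {step.name} ({step.scope.value})")
```

Keeping the label on every step makes the handoff points easy to list before anything is executed.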

The skill keeps raw data read-only. The cleaned file, cleaning plan, and reusable cleaning script are saved as separate artifacts. This keeps both the data cleaning and the overall data science pipeline reproducible.


Testing the agent

We tested the skill on two files with real Amazon data (Berke et al., 2024) [1]: amazon-purchases-with-parent.csv with 1,850,717 purchase rows, and survey.csv with 5,027 respondents (you can see and download the data here). We used a preprocessing step to map the raw product codes to parent categories, 35 of which appeared in the final modeling table. We then pursued a predictive model to estimate the probability that a customer with given demographics purchases from a given parent category, using CatBoost as our preferred method.

We tested the skill using Claude Opus 4.8 and OpenAI GPT 5.5 under two conditions:

  • Baseline: no skill, only the generic instruction to join and clean the data files.
  • Agent Skill: the data-cleaning skill loaded, with the modeling goal and method specified.

Summary of the results

The baseline runs without the skill produced useful joined tables. Both models stripped whitespace, parsed dates and numeric fields, derived useful variables, validated the many-to-one join, preserved missing product metadata as NAs, and wrote a clean purchase-level file. The agent skill changed the unit of analysis based on the modeling goal. Instead of a purchase-level table, the agent created a customer-by-parent-category grid with 5,027 respondents across 35 categories. The two skill runs agreed on the core structure: the same customer-category grid, target definition, exclusion of purchase-derived features, and nearly identical category-level target rates.
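
The change in unit of analysis can be sketched in pandas on toy data. This is a minimal illustration, not the agent's actual script: the column names (`customer_id`, `parent_category`, `price`, `age_group`) are hypothetical, and the target is assumed to be a binary flag for at least one purchase in the category.

```python
import pandas as pd


def build_customer_category_grid(purchases, survey,
                                 id_col="customer_id",
                                 cat_col="parent_category"):
    """Build one row per (customer, category) with a binary purchase target."""
    categories = sorted(purchases[cat_col].dropna().unique())
    # Full grid: every surveyed customer crossed with every category.
    grid = pd.MultiIndex.from_product(
        [survey[id_col].unique(), categories], names=[id_col, cat_col]
    ).to_frame(index=False)
    # Keep only the keys from the purchase table; price, spend, and
    # frequency are deliberately left out because they leak the target.
    bought = (purchases[[id_col, cat_col]]
              .dropna()
              .drop_duplicates()
              .assign(purchased=1))
    grid = grid.merge(bought, on=[id_col, cat_col], how="left",
                      validate="one_to_one")
    grid["purchased"] = grid["purchased"].fillna(0).astype(int)
    # Attach demographics; validate guards against duplicated respondents.
    return grid.merge(survey, on=id_col, how="left", validate="many_to_one")


# Toy data, for illustration only.
purchases = pd.DataFrame({
    "customer_id": ["a", "a", "b", "a"],
    "parent_category": ["Books", "Books", "Toys", "Toys"],
    "price": [12.0, 8.5, 20.0, 5.0],
})
survey = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "age_group": ["25-34", "35-44", "18-24"],
})

grid = build_customer_category_grid(purchases, survey)
print(grid)
```

Customers with no purchases in a category still get a row with a target of 0, which a purchase-level table cannot represent.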

See the detailed results here: Outputs of the data cleaning agent skill in Opus 4.8 and GPT 5.5


Takeaways

  • Change in unit of analysis and method-specific choices. The agent screened for leakage and excluded purchase-derived fields such as price, spend, and frequency because the same-period purchase behavior leaks the target. Because the method was CatBoost, the agent rejected scaling and one-hot encoding, kept categorical predictors as strings, filled categorical missingness only where needed, and examined class imbalance by category. The skill separated deterministic cleaning from judgment. For example, missingness and multi-select survey fields were not silently imputed or transformed; they were marked as data scientist decisions.
  • Opus probed where GPT didn't. Opus 4.8 expanded multi-select survey fields into indicators and flagged a subtle validation issue: because each customer appears in 35 rows, a random train/test split leaks demographics across rows. It then recommended a customer-grouped split. GPT 5.5 did not raise that issue. GPT also reported extreme full-data diagnostics, including a maximum unit price of 999,999 and quantity of 4,000, which were not reproducible. The takeaway is straightforward: even structured agent output needs to be checked and validated.
  • Multi-select survey fields required a data scientist decision and the agent knew that. Opus split them into indicator columns; GPT kept them as categorical strings. Since the method was CatBoost, GPT's choice was technically compatible with the model. The question was whether it accurately represented the survey construct. If Q-life-changes contains sparse combinations such as Moved place of residence or Lost a job, splitting the field into indicators is usually cleaner than treating each full string as its own CatBoost category.
  • The agent provides directions, not decisions. The agent cleaned the data for a specified modeling task and method. In the process, both models used the data scientist's inputs at points where judgment was required. The skill guided the models to structure and make explicit the handoffs between execution and intent.
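
Two of the points above, the indicator expansion of multi-select fields and the customer-grouped split, can be sketched with pandas and scikit-learn. This is a rough illustration on toy data, not the output of either model: the comma separator, the column names other than Q-life-changes, and both helper functions are assumptions.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def expand_multi_select(df, col, sep=","):
    """Replace a delimited multi-select column with 0/1 indicator columns."""
    indicators = (df[col].fillna("")
                  .str.get_dummies(sep=sep)
                  .add_prefix(f"{col}__"))
    return pd.concat([df.drop(columns=col), indicators], axis=1)


def customer_grouped_split(grid, id_col="customer_id", test_size=0.25, seed=0):
    """Split rows so that each customer falls entirely in train or in test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=seed)
    train_idx, test_idx = next(splitter.split(grid, groups=grid[id_col]))
    return grid.iloc[train_idx], grid.iloc[test_idx]


# Toy customer-by-category grid, for illustration only.
grid = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "parent_category": ["Books", "Toys"] * 4,
    "Q-life-changes": (["Moved place of residence,Lost a job"] * 2
                       + [None] * 2
                       + ["Lost a job"] * 2
                       + ["Moved place of residence"] * 2),
    "purchased": [1, 0, 0, 1, 0, 0, 1, 1],
})

expanded = expand_multi_select(grid, "Q-life-changes")
train, test = customer_grouped_split(expanded)

# No customer contributes rows to both sides of the split.
print(set(train["customer_id"]) & set(test["customer_id"]))
```

Grouping by customer matters here because every customer's demographics repeat across all of their category rows, so a row-level random split would let the model see test customers during training.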

Consistent with the Augmented Data Science framework, LLMs remain assistants for data cleaning, not substitutes; the human data scientist remains the decision maker. The data scientist must define the target, method, unit of analysis, and leakage boundary. The LLM then executes the search for cleaning steps inside those boundaries.


References

[1] Berke, A., Mahari, R., Pentland, S., Larson, K., & Calacci, D. (2024). Insights from an experiment crowdsourcing data from thousands of US Amazon users: The importance of transparency, money, and data use. Proceedings of the ACM on Human-Computer Interaction, 8(CSCW2), Article 466.
