Augmented Data Science: Data cleaning agent
tl;dr (never ai;dr): Data cleaning is an execution-heavy task, but only once the data scientist has defined the modeling objective. We developed and tested a data-cleaning agent skill that profiles the data, separates method-invariant cleaning from method-dependent cleaning, and labels each step as diagnose, act, or ask. On a large, rich dataset of Amazon purchases and purchaser demographics, the agent produced a modeling-ready table, whereas a baseline prompt applied data cleaning generically. The skill also made its decisions explicit, which helped detect leakage in our use case: it excluded same-period purchase-derived features such as price, spend, and frequency because those features leak the target. Our analysis supports the Augmented Data Science framework we introduced earlier: the LLM can execute cleaning, but the data scientist must set the target, method...
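To make the diagnose / act / ask labeling concrete, here is a minimal sketch of how a labeled cleaning plan might be represented. It is not the skill's actual implementation: the names `Action`, `CleaningStep`, `plan_steps`, and `modeling_columns` are hypothetical, the example columns other than price, spend, and frequency are invented for illustration, and the choice of which label each step gets is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DIAGNOSE = "diagnose"  # report a finding, change nothing
    ACT = "act"            # apply without asking
    ASK = "ask"            # needs a decision from the data scientist


@dataclass(frozen=True)
class CleaningStep:
    description: str
    method_dependent: bool  # True if the right choice depends on target/method
    action: Action


def plan_steps(columns, target, leaky_columns):
    """Build a labeled cleaning plan (hypothetical helper, illustrative only)."""
    # Method-invariant steps: assumed safe regardless of the modeling choice.
    steps = [
        CleaningStep("profile column types and missingness", False, Action.DIAGNOSE),
        CleaningStep("drop exact duplicate rows", False, Action.ACT),
    ]
    # Method-dependent steps: only meaningful once the target is known.
    for col in columns:
        if col != target and col in leaky_columns:
            steps.append(
                CleaningStep(f"exclude '{col}' (leaks the target)", True, Action.ACT)
            )
    steps.append(
        CleaningStep("choose imputation and encoding for the method", True, Action.ASK)
    )
    return steps


def modeling_columns(columns, target, leaky_columns):
    """Return feature columns with the target and leaky features removed."""
    return [c for c in columns if c != target and c not in leaky_columns]


# Hypothetical column names; only price, spend, frequency come from the text.
columns = ["age", "income", "price", "spend", "frequency", "purchased"]
leaky = {"price", "spend", "frequency"}

for step in plan_steps(columns, "purchased", leaky):
    print(f"[{step.action.value}] {step.description}")

print(modeling_columns(columns, "purchased", leaky))  # ['age', 'income']
```

The point of the sketch is that the leakage exclusions cannot be generated until a target is supplied, which mirrors the division of labor described above: the agent executes, the data scientist defines the objective.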