Back

Using AI to Clean Messy Data

How language models repair dirty records without becoming a black box: schema inference, validation gates, and a human review loop.

Using AI to Clean Messy Data
Written by
BSH Technologies
Published on2025-09-06

AI for data cleaning works best when it proposes, not decides

AI for data cleaning is seductive because the alternative never ends: hand-written rules for every address typo, currency format, and stray Unicode character. A language model can read a column of free-text city names and normalise "Trichur", "Thrissur", and "TCR" to one canonical value in seconds. The trap is letting it silently rewrite production records. The pattern that actually ships treats the model as a proposer and keeps a deterministic gate between its output and your warehouse.

We build cleaning pipelines for clients whose data arrives from spreadsheets, scanned forms, and legacy ERP exports. The model never has the final word. It suggests; a schema validates; a human signs off on anything low-confidence. That single principle keeps the whole system debuggable.

Where rules still beat the model

Resist the urge to send everything to an LLM. A regular expression that strips whitespace runs in microseconds, costs nothing, and never hallucinates. Reserve the model for the genuinely ambiguous work that rules handle badly.

  • Deterministic transforms (trimming, case-folding, date parsing with a known format) stay as code.
  • Lookup-table joins for known mappings, such as country codes or GST state codes, stay as code.
  • The model handles fuzzy entity resolution, free-text categorisation, and inferring intent from inconsistent human input.

A useful rule of thumb: if you can write the rule in under ten minutes and it covers ninety percent of cases, write the rule. Send the remaining ten percent to the model. This keeps the expensive, fallible component focused on the small slice where it genuinely earns its place, and keeps your costs and your error surface small.

Constrain the output to a schema

Free-form model output is impossible to validate, so force structure instead. Ask the model to return strict JSON matching a schema, then parse it with something like Zod and reject anything malformed. When the model proposes a corrected record, it should also return a confidence score and a short reason.

The reason is gold during review. It tells a human why the model thought "Bangalore" and "Bengaluru" were the same place, which is far faster to verify than re-deriving the logic from scratch. A reason of "both are common names for the same city in Karnataka" gets a fast yes; a reason of "they look similar" gets a closer look. You are effectively asking the model to show its working, and the working is what makes the output auditable.

Confidence thresholds and the review queue

Not every correction deserves the same trust. We route by confidence and by how much damage a wrong answer would do:

  1. High confidence and the change is low-risk: apply automatically, log the before and after.
  2. Medium confidence: apply to a staging table and surface it in a daily review batch.
  3. Low confidence, or the field feeds billing or compliance: never auto-apply, always queue for a human.

The logging matters more than people expect. Every automated edit should be reversible. Store the original value, the proposed value, the model version, and the prompt hash. When someone asks six months later why a record changed, you have a complete trail rather than a shrug, and you can replay the exact decision against a newer model to see whether it would still agree.

Guard against confident nonsense

Models are fluent liars. A model asked to fix a phone number will happily invent a plausible-looking one, and it will sound just as certain as when it is right. Defend against this with assertions the model cannot talk its way past.

A cleaned phone number must still have the right digit count for its region. A corrected email must pass a real validator. A normalised amount must stay within a sane range of the original. If the proposal fails the assertion, discard it and flag the row rather than letting the model retry until it guesses something that passes, because that just rewards confabulation. The assertion is a hard floor that no amount of model fluency can erode, and it is the difference between a pipeline you can trust unattended and one you cannot.

Measuring whether it actually helped

Cleaning that nobody measures tends to drift into making data different rather than better. Hold out a labelled sample, a few hundred records a human has cleaned by hand, and score the pipeline against it. Track precision, meaning of the changes you made, how many were correct, and recall, meaning of the problems present, how many you caught.

Watch precision especially. A pipeline that fixes two hundred records and breaks five may be worse than one that fixes a hundred and fifty and breaks none, depending on what the data feeds downstream. Re-run that evaluation whenever you change the prompt or upgrade the model, because a new version can improve average quality while quietly regressing on a category that matters to you. That is a silent failure you will only catch with a fixed benchmark.

How BSH can help

At BSH Technologies, we build data-cleaning pipelines that pair language models with deterministic validation, full audit logging, and a review loop your team can trust. From our base in Thrissur we have untangled messy datasets for organisations across Kerala and beyond, turning inconsistent exports into warehouse-ready records without surrendering control to a black box. If your data is slowing down your reporting or your AI projects, we would be glad to take a look and map out a pragmatic path forward.

From the blog

View all posts
Designing Multi-Tenant SaaS That Scales
Software Dev

Designing Multi-Tenant SaaS That Scales

Choosing an isolation model, keeping tenant data separate, and dodging the noisy-neighbour and migration traps that bite SaaS later.

BSH Technologies
BSH Technologies · 2026-06-14
Hitting Green Core Web Vitals in Next.js
Software Dev

Hitting Green Core Web Vitals in Next.js

A practical guide to LCP, INP and CLS in Next.js — image handling, font loading, the App Router boundary, and costly third-party scripts.

BSH Technologies
BSH Technologies · 2026-06-10