The data work.

Getting your data ready for AI

An AI assistant is only as good as the documents it can read. Before a Sovereign AI build is worth doing, your data usually needs work: finding it, working out what matters, getting it out of the systems it's stuck in, and into a shape something can actually use.

This is the part that decides whether everything built on top of it holds up. It's also the part that gets skipped. Most AI projects that disappoint don't fail on the model. They fail because the data underneath was never in a fit state, and nobody wanted to do the unglamorous work of fixing it first.

Why the data is the hard part

The model is the easy bit. You pick an open-weights model, run it on Amazon Bedrock in the London region, and you're done in an afternoon. The hard part is everything underneath: the documents, the records, and the years of files spread across systems that were never designed to talk to each other.

This is the part I've spent my career on. Getting messy data out of legacy systems and into shape so something useful can sit on top of it is the same job whether the thing on top is a website, a reporting tool, or a Sovereign AI assistant. The platform changes. The plumbing doesn't.

When an AI build underdelivers, the data is almost always why. It answered from half your documents because the other half couldn't be read. It missed the policy that mattered because it was trapped in a scanned PDF. It gave a confident wrong answer because the source it needed wasn't there to find.

What "not ready" looks like

Most organisations are somewhere on this list, and usually on more than one line of it:

Documents scattered across SharePoint, shared drives, email, and a handful of SaaS tools, with no single place that's authoritative.
Years of scanned PDFs that are really just images, so nothing can search inside them.
The same policy or template in four versions, with no clear way to tell which one is current.
Key knowledge living in people's heads or buried in inboxes, never written down anywhere a system could reach.
Records in formats a person can read but a machine can't: spreadsheets used as databases, tables pasted into Word, data locked inside proprietary exports.

None of this is unusual, and none of it is a failing. It's what normal looks like after years of real work. But an AI assistant can only answer from what it can actually read, so this is the gap that has to close first.

What the data work involves

It varies with the state of things, but it usually runs through four steps.

1. Finding it and working out what matters

The first job is a map: what documents and records you have, where they live, who owns them, and which of them the assistant actually needs. Most archives carry a lot of noise. Part of the work is deciding what's in scope and, just as importantly, what to leave out.
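As a rough illustration of a first pass at that map, the sketch below walks a folder tree and records one row per file. It assumes the documents are reachable as an ordinary file system, which won't be true for everything. FileRecord, build_inventory and count_by_extension are names made up for this example. Ownership and relevance still have to come from people, not a script.

```python
import tempfile
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class FileRecord:
    """One row of the inventory (hypothetical shape for this sketch)."""
    path: str
    extension: str
    size_bytes: int
    modified: datetime


def build_inventory(root: Path) -> list[FileRecord]:
    """Walk a folder tree and record one row per file."""
    records = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        stat = path.stat()
        records.append(FileRecord(
            path=str(path.relative_to(root)),
            # Lower-case the extension so .PDF and .pdf count together.
            extension=path.suffix.lower() or "(none)",
            size_bytes=stat.st_size,
            modified=datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
        ))
    return records


def count_by_extension(records: list[FileRecord]) -> Counter:
    """Summarise the inventory: how many files of each type there are."""
    return Counter(record.extension for record in records)


if __name__ == "__main__":
    # Demo against a throwaway folder rather than a real archive.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "policies").mkdir()
        (root / "policies" / "leave-policy.PDF").write_bytes(b"%PDF-")
        (root / "notes.txt").write_text("hello")
        for extension, count in count_by_extension(build_inventory(root)).most_common():
            print(extension, count)
```

A summary like this is only the starting point. It shows where the volume is, which is usually enough to start the conversation about what's in scope.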

2. Getting it out

Then the extraction: pulling the content out of the systems it's stuck in. That can mean exporting from legacy databases, reading text out of years of scanned PDFs, or getting documents out of a tool that was never meant to give them up. This is the step that quietly eats the most time.
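To give a flavour of the database side of this, here is a minimal sketch that dumps one table to plain records. It uses SQLite from Python's standard library purely as a stand-in, since a real legacy system will have its own driver and its own quirks. The case_notes table and the export_rows and to_json_lines helpers are invented for the example.

```python
import json
import sqlite3


def export_rows(conn: sqlite3.Connection, table: str) -> list[dict]:
    """Return every row of a table as a list of plain dictionaries."""
    # Table names can't be passed as query parameters, so check the
    # name against the schema before putting it into the SQL.
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table!r}")
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


def to_json_lines(rows: list[dict]) -> str:
    """Serialise rows as one JSON object per line, a format most tools can read."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


if __name__ == "__main__":
    # In-memory database with a hypothetical table, standing in for a legacy system.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE case_notes (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO case_notes (body) VALUES (?)", ("First note",))
    print(to_json_lines(export_rows(conn, "case_notes")))
```

Scanned PDFs are a different problem: they need OCR, which is outside what this sketch covers.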

3. Cleaning and structuring

Raw extracted data is messy: duplicates, dead versions, broken formatting, records that contradict each other. The work here is getting it consistent and trustworthy, settling which version is current, and structuring it so a machine can make sense of it rather than just a person.
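Two of those chores can be sketched in a few lines of Python: dropping copies that differ only in spacing or capitalisation, and picking one version per title. The Doc shape and the function names are hypothetical. Treating the most recently modified copy as the current one is an assumption that often fails in practice, so the real answer usually comes from whoever owns the document.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    """A cleaned-up document record (hypothetical shape for this sketch)."""
    title: str
    text: str
    modified: date


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and case differences."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def drop_exact_duplicates(docs: list[Doc]) -> list[Doc]:
    """Keep the first copy of each distinct text, in the original order."""
    seen = set()
    unique = []
    for doc in docs:
        key = fingerprint(doc.text)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def latest_per_title(docs: list[Doc]) -> dict[str, Doc]:
    """Pick the most recently modified document for each title.

    Assumption: newest means current. A person should confirm that.
    """
    latest = {}
    for doc in docs:
        current = latest.get(doc.title)
        if current is None or doc.modified > current.modified:
            latest[doc.title] = doc
    return latest


if __name__ == "__main__":
    docs = [
        Doc("Leave policy", "Staff get 25 days.", date(2021, 1, 5)),
        Doc("Leave policy", "Staff  get 25 DAYS.", date(2022, 3, 1)),
        Doc("Leave policy", "Staff get 28 days.", date(2023, 6, 9)),
    ]
    for title, doc in latest_per_title(drop_exact_duplicates(docs)).items():
        print(title, doc.modified.isoformat())
```

This only catches copies that are identical apart from spacing and case. Near-duplicates and records that contradict each other need comparison and judgement, not just a hash.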

4. Building the retrieval layer

Finally, the cleaned content goes into a layer the assistant reads from: the retrieval step that lets it find the right passage and answer with a citation back to the source. This is what turns a pile of documents into something the build can stand on.
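The idea of retrieval with a citation can be shown with a toy example. The sketch below splits documents into passages, scores each one by how many words it shares with the question, and returns the best match along with the file it came from. Plain word overlap is a stand-in here for whatever search or embedding index a real build would use, and Passage, split_into_passages and retrieve are names invented for the illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source: str  # document the passage came from, used as the citation
    text: str


def tokenize(text: str) -> set[str]:
    """Lower-case the text and return its distinct words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_into_passages(source: str, document: str) -> list[Passage]:
    """Split a document on blank lines, keeping the source with each piece."""
    return [Passage(source, part.strip())
            for part in document.split("\n\n") if part.strip()]


def retrieve(question: str, passages: list[Passage], top_k: int = 1) -> list[Passage]:
    """Return up to top_k passages sharing the most words with the question."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(p.text)), p) for p in passages]
    # Drop passages with no words in common, so a miss returns nothing
    # rather than an arbitrary passage.
    scored = [(score, p) for score, p in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:top_k]]


if __name__ == "__main__":
    passages = (
        split_into_passages(
            "leave-policy.txt",
            "Staff receive 25 days of annual leave per year.\n\n"
            "Unused leave may be carried over.")
        + split_into_passages(
            "expenses.txt", "Claims must be submitted within 30 days.")
    )
    for hit in retrieve("How many days of annual leave do staff get?", passages):
        print(f"{hit.text} [source: {hit.source}]")
```

The source travels with every passage from the moment it's split. That's what lets the assistant point back to the document an answer came from.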

What you get at the end

A data foundation the build can actually sit on: your documents extracted, cleaned, structured, and searchable, in one place you control, inside your own tenancy.

You also get a clear, honest map of what you have and what's still missing. If there are gaps that can't be closed yet, you'll know about them before a single tool is built, not after.

When the data work is most of the job

Sometimes this phase is the bulk of the engagement and the build is quick on top of it. Sometimes the Discovery finds your data is already in decent shape, and this phase is light or skipped entirely.

I won't pretend to know which it is before I've looked. The honest scope comes out of the Discovery, once I can see what state your data is actually in. What I won't do is start building tools on a foundation that can't hold them, then bill you to find out it didn't work.

Time and price

This is the phase that varies most, because it depends entirely on the state of your data. A clean, well-organised archive is quick. Twenty years of scanned correspondence across five systems is not.

For that reason I scope it after the Discovery, when I know what I'm dealing with. Where the scope is clear, it's a fixed price. Where it isn't yet, we do it in stages, so you're never committing to a number neither of us can stand behind.

"We have worked with many developers and can confidently say that Pete is by far the best, an exceptional coder and consultant with an impressive skill set."

Kathryn Maxwell, IT Project Manager, Royal Meteorological Society

Where this fits

The data work usually follows a Sovereign AI Discovery. That's where we work out whether it's needed at all, and what it would involve. If you haven't read it yet, that's the place to start.

Or email me at peter@peterbrady.co.uk with a short note about your organisation, roughly how many people are in it, and the sector you're in. I'll come back with a time for a thirty-minute scoping call.