AI fails on data more often than it fails on model choice, prompt engineering, or even engineering skill. The pattern repeats across every client engagement: teams jump to AI model selection without auditing whether their data is accessible, clean, and complete. Six months later, the project stalls and the retrospective surfaces the same issue every time. This post is the data strategy that prevents the stall.

Four-step foundation

Inventory what exists → access via proper pipelines → measure and fix quality → only then deploy AI. Teams that skip steps 1-3 end up rebuilding them during step 4, at roughly 3x the cost.

Step 1: Inventory what you have

Most companies don't have a clear picture of their data landscape. CRM data sits in Salesforce, product data in Postgres, customer support tickets in Zendesk, the internal wiki in Notion, financial data in NetSuite, and operational dashboards in Tableau. The data exists, but no one has a unified view.

Practical deliverable: a registry (a wiki page, a spreadsheet, or a dedicated data catalog product) listing every significant data source, its owner, its update frequency, its access method, and its current state of documentation. This sounds boring. It is also the first thing an AI project needs when it tries to start, and it either exists or it doesn't.
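As a rough illustration, the registry can start as a few structured records exported to a spreadsheet. The sketch below is a minimal example, not a recommended schema: the `DataSource` class, the example entries, and the owner names are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass, asdict


@dataclass
class DataSource:
    """One row of the registry: the five attributes listed above."""
    name: str
    owner: str
    update_frequency: str
    access_method: str
    documentation_state: str  # assumed values: "none", "partial", "complete"


# Hypothetical entries for illustration only.
REGISTRY = [
    DataSource("Salesforce CRM", "sales-ops", "real-time", "REST API", "partial"),
    DataSource("Finance month-end export", "finance", "monthly", "manual CSV export", "none"),
]


def undocumented(registry):
    """Names of sources with no documentation at all."""
    return [s.name for s in registry if s.documentation_state == "none"]


def to_csv(registry):
    """Render the registry as CSV text so it can live in a spreadsheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(asdict(registry[0]).keys()))
    writer.writeheader()
    for source in registry:
        writer.writerow(asdict(source))
    return buf.getvalue()
```

Even a list this small makes gaps visible: `undocumented(REGISTRY)` picks out the sources nobody has written down yet.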

Step 2: Build proper access

Data accessible via API is usable. Data accessible only via a CSV export done manually by someone in finance isn't. A surprising number of data sources inside companies fall into the second category: technically the data exists, but in practice nothing can be engineered on top of it.

Investment: build pipelines. Fivetran, Airbyte, custom pipelines, whatever fits — the goal is automated, governed extraction from source systems into a place AI systems can read. Don't skip the governance layer — access controls, audit logs, data classification. This prevents the PII incidents that happen when AI suddenly has access to everything.
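To make the governance layer concrete, here is a minimal sketch of classification-based access with an audit trail. It is an illustration under stated assumptions, not how Fivetran, Airbyte, or any specific product works: the field classifications, consumer names, and the `read_record` helper are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical data classification: field name -> sensitivity label.
FIELD_CLASSIFICATION = {
    "customer_id": "internal",
    "plan": "internal",
    "email": "pii",
    "phone": "pii",
}

# Hypothetical access controls: consumer -> labels it is cleared to read.
CONSUMER_CLEARANCE = {
    "support-assistant": {"internal"},
    "billing-service": {"internal", "pii"},
}

# In a real system this would be durable storage, not an in-memory list.
AUDIT_LOG = []


def read_record(consumer, record):
    """Return only the fields the consumer is cleared for, and log the access."""
    allowed = CONSUMER_CLEARANCE.get(consumer, set())
    visible, withheld = {}, []
    for field, value in record.items():
        # Unclassified fields default to "restricted", so nothing leaks by omission.
        label = FIELD_CLASSIFICATION.get(field, "restricted")
        if label in allowed:
            visible[field] = value
        else:
            withheld.append(field)
    AUDIT_LOG.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "consumer": consumer,
        "returned": sorted(visible),
        "withheld": sorted(withheld),
    })
    return visible
```

The default-deny choice for unclassified fields is the important part: a new column added upstream stays invisible to AI consumers until someone classifies it.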

Step 3: Measure and improve quality

Data quality has specific dimensions: completeness (are required fields populated), freshness (how stale is it), consistency (does it match across systems), accuracy (does it reflect reality). Measure these before using data for AI. The discovery is usually sobering: 'our customer records are 30% incomplete in ways that matter for our product.'
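Three of these dimensions can be measured with very little code. The sketch below is one simple way to do it, assuming records are plain dictionaries; the function names and field names are illustrative. Accuracy is left out on purpose, because it needs an external source of truth to compare against.

```python
from datetime import datetime, timedelta


def completeness(records, required_fields):
    """Share of records in which every required field is populated."""
    if not records:
        return 0.0

    def populated(record):
        return all(record.get(f) not in (None, "") for f in required_fields)

    return sum(populated(r) for r in records) / len(records)


def freshness(records, timestamp_field, max_age: timedelta, now: datetime):
    """Share of records updated within max_age of now."""
    if not records:
        return 0.0
    fresh = sum(1 for r in records if now - r[timestamp_field] <= max_age)
    return fresh / len(records)


def consistency(system_a, system_b, field):
    """Share of keys present in both systems whose value for field agrees.

    system_a and system_b map a shared key (e.g. a customer id) to a record.
    """
    shared = system_a.keys() & system_b.keys()
    if not shared:
        return 0.0
    matches = sum(system_a[k].get(field) == system_b[k].get(field) for k in shared)
    return matches / len(shared)
```

Each function returns a ratio between 0 and 1, which makes the results easy to track over time or to compare against a threshold.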

Quality improvement is ongoing. New tooling (Great Expectations, Monte Carlo, Soda) helps measure and alert on quality metrics. Governance (data stewards, schemas, review processes for schema changes) prevents re-introducing the problems you fixed.

The quality bar for AI is different from analytics. A BI dashboard with 5% missing values still gives useful trends. An LLM asked to draft a contract from a customer record with missing fields fills the gaps with confident, wrong details. Calibrate the bar to the use case.
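One way to express "calibrate the bar to the use case" is to apply different gates: an aggregate threshold for analytics, and a per-record check for generation. This is a sketch under assumptions; the 0.95 threshold mirrors the 5% example above, and the required contract fields are hypothetical.

```python
# Hypothetical fields a contract draft cannot do without.
REQUIRED_FOR_CONTRACT = ("legal_name", "billing_address", "contract_term_months")


def dataset_ok_for_dashboard(completeness_ratio, minimum=0.95):
    """Aggregate gate: trends survive a small share of missing values."""
    return completeness_ratio >= minimum


def missing_fields(record, required=REQUIRED_FOR_CONTRACT):
    """Required fields that are absent or empty in this specific record."""
    return [f for f in required if record.get(f) in (None, "")]


def contract_draft_gate(record):
    """Per-record gate: block generation rather than let the model guess."""
    gaps = missing_fields(record)
    if gaps:
        return {"status": "blocked", "missing": gaps}
    return {"status": "ready", "missing": []}
```

The design choice is that the generation path refuses on any gap in the individual record, no matter how good the dataset looks in aggregate.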

Step 4: Now deploy AI

Only after steps 1-3 are reasonably solid does large-scale AI deployment make sense. Smaller pilots can run in parallel with data work, but production AI on unreliable data foundations is rework waiting to happen.

Counter-argument we hear: 'but AI can help clean the data.' True, within limits. LLMs can extract and normalize unstructured data into structured records, and this is a genuine value driver. But AI cleaning garbage without the governance layer underneath just produces cleaner garbage — you need the inventory, access, and governance first.

AI-specific data considerations

Documents, not just structured data

Traditional data strategy focuses on transactional and dimensional data. AI, particularly RAG, needs document corpora — policies, manuals, help articles, contracts, reports. This data is often scattered across file systems, SharePoint, wikis, and email. The document inventory is separate from the database inventory and equally important.

Vector indexes as a new data layer

Embeddings and vector indexes need to be maintained alongside source data. When the source changes, the embeddings must be regenerated, and when a source document is deleted, its vectors must be removed. This is operationally similar to search index maintenance but with its own tooling (see vector DB post).
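A common way to keep an index in step with its sources is to store a content hash next to each embedding and compare on every sync. The sketch below shows only that planning step, assuming documents are keyed by an id; the actual embedding and vector store calls are deliberately omitted, and `plan_index_sync` is a hypothetical helper.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_sync(source_docs, indexed_hashes):
    """Decide which documents to (re-)embed and which vectors to delete.

    source_docs:    {doc_id: current text}
    indexed_hashes: {doc_id: hash recorded when the doc was last embedded}
    """
    to_embed = [
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return sorted(to_embed), sorted(to_delete)
```

Unchanged documents are skipped, so a sync only pays embedding cost for content that actually moved.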

Conversation and interaction data

Every AI deployment generates new data: user queries, AI responses, feedback signals, tool invocations. This becomes your eval set, your improvement signal, your regression test corpus. Capture it from day one.
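Capturing this data does not need heavy infrastructure on day one. A minimal sketch, assuming an append-only JSON Lines log is acceptable to start with; the record fields and function names are illustrative, and real deployments would also need retention and PII handling.

```python
import io
import json
import uuid
from datetime import datetime, timezone


def log_interaction(sink, query, response, tool_calls=None, feedback=None):
    """Append one interaction as a JSON line to a writable text sink."""
    record = {
        "id": str(uuid.uuid4()),
        "at": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "response": response,
        "tool_calls": tool_calls or [],
        "feedback": feedback,  # e.g. a thumbs up/down signal, if any
    }
    sink.write(json.dumps(record) + "\n")
    return record["id"]


def load_interactions(lines):
    """Parse logged lines back into records, e.g. to build an eval set."""
    return [json.loads(line) for line in lines if line.strip()]


# Usage: StringIO stands in for a file opened in append mode.
demo_sink = io.StringIO()
log_interaction(demo_sink, "What is our refund policy?", "Refunds are ...")
demo_records = load_interactions(demo_sink.getvalue().splitlines())
```

Because each line is a self-contained record, the same file can later be filtered by feedback to seed regression tests.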

Avoid this

The data-lake-to-nowhere pattern: building a beautiful data lake that has none of the data AI workloads actually need, because it was scoped to BI use cases. Start with the AI use case and work back to the data required.

Outsourcing the data work to a consultancy while expecting in-house AI engineers to 'just use the clean data.' Data and AI can't be cleanly split in this way. AI engineers need to own data access; data engineers need to understand AI requirements. Cross-training or joint ownership is required.

Data strategy for AI: what to fix before you buy models

