The Hidden Cost of Dirty Data

Andy Boettcher
Jan 23

Somewhere in your business, right now, someone is fixing a spreadsheet.

They're not supposed to be fixing said spreadsheet; they're supposed to be doing something more productive.

But the data is wrong. Again. So they're spending their Tuesday morning hunting down why the numbers don't match.

This is what dirty data looks like. Not a catastrophic breach or a system failure ... just friction. Constant, grinding, expensive friction.

And it's crushing.

None of this shows up as a line item. There's no budget category for "time lost to bad data." But it accumulates. It compounds. And it costs far more than most businesses realize.

Related: Join us on Feb. 5, when I'll discuss this further and explore real AI use cases with DealHub's Eyal Orgil

Putting a number on it

Gartner tried to quantify this back in 2020. They surveyed large enterprises, companies sophisticated enough to already be shopping for data quality software, and asked what they believed poor data quality was costing them.

Their answer: $12.9 million per year, on average.

That figure gets cited constantly, and for the Fortune 500 it's probably about right.

But for most businesses? It's meaningless. A 50-person company doesn't lose $12.9 million to bad data. Most don't even have $12.9 million in revenue.

What they do have is the same underlying problem, scaled to their size. The wasted hours, missed opportunities, decisions made on bad information.

The customers who churned because something fell through the cracks.

Quick math: if your business runs 5% less efficiently because of data problems, that's real money. On a $10M operation, that's $500K of drag. On $50M, it's $2.5M. Every year.

The question we wanted to answer: what does dirty data actually cost across the entire economy? Not just the enterprises that Gartner surveys, but every business in every state. The manufacturers. The healthcare providers. The professional services firms. The retailers.

So we did the work.

What we did about it

We started with that Gartner figure. $12.9 million.

It comes from their 2020 Magic Quadrant for Data Quality Solutions. As part of that research, Gartner surveyed 154 reference customers across 16 data quality vendors. These aren't random businesses. They're large enterprises sophisticated enough to already be buying data quality software. Companies that have done the work to understand the problem.

Gartner asked them to estimate what poor data quality costs their organization. The average answer: $12.9 million per year.

That's our anchor. The question was: how do we scale it?

The U.S. Census Bureau publishes something called County Business Patterns. It's a complete count of every business establishment in America, broken down by size, location, and industry. The latest data covers 8.36 million establishments employing nearly 140 million people. It’s a chunky bit of representative data that gave us the numbers we needed.

Here's what that looks like by size:

The smallest businesses, those with 1-4 employees, account for 4.6 million establishments.
Then, you've got 1.5 million businesses with 5-9 employees.
Another million in the 10-19 range.
It tapers from there: 759,000 businesses with 20-49 employees, 249,000 with 50-99, and so on up the chain.
At the top, just 9,470 establishments employ 1,000 or more people.

Those 9,470 largest businesses? They average 2,626 employees each. That's Gartner's research base, and it's where the $12.9 million figure comes from. The 4.6 million smallest businesses, by contrast, average just two employees. They're not losing $12.9 million to dirty data.

The math doesn't work, so we built a scaling model.

Our model

The logic is simple: employee count is a reasonable proxy for data complexity. More employees means more data touchpoints: more people entering information, more systems generating records, and more integration points where data can break.

If a 2,626-person company loses $12.9 million to dirty data, what does a 2-person company lose? Proportionally less.

We divided the average employee count for each size tier by 2,626, then multiplied by $12.9 million.

The formula: (Average employees ÷ 2,626) × $12,900,000

That gives us a cost per business for each tier.
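As a minimal sketch, the formula above can be written in a few lines of Python. The function and constant names are ours, chosen for illustration; the only inputs taken from this article are the $12.9 million anchor, the 2,626-employee average for the largest tier, and the two-employee average for the smallest.

```python
# Illustrative sketch of the linear scaling model described above.
GARTNER_COST = 12_900_000   # average annual cost reported by Gartner's respondents
ANCHOR_EMPLOYEES = 2_626    # average headcount of establishments with 1,000+ employees


def cost_per_business(avg_employees: float) -> float:
    """Scale the Gartner anchor linearly by a tier's average headcount."""
    return (avg_employees / ANCHOR_EMPLOYEES) * GARTNER_COST


# Average headcounts quoted in the article for the largest and smallest tiers.
for label, avg in [("1,000+ employees", 2_626), ("1-4 employees", 2)]:
    print(f"{label}: ${cost_per_business(avg):,.0f} per business per year")
```

Under this model, a two-person business carries a little under $10,000 a year in dirty data costs, against $12.9 million for the average 2,626-person establishment.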

Add it all up across 140 million workers in 8.36 million businesses and the number is staggering: $617 billion. That’s the annual cost of dirty data across the American economy.

To put that in perspective: it’s roughly 2% of U.S. GDP ($31 trillion) and more than the entire federal education budget. It’s enough to fund NASA twenty times over.

But the national figure only tells part of the story. Where is that cost concentrated? Which industries are bleeding the most? Which places?

The industry breakdown

Not all data is created equal.

A restaurant chain and a software company might have the same number of employees, but their data environments are wildly different. The software company lives in data. Every customer interaction, every feature usage, every bug report generates records that need to be accurate, connected, and accessible. The restaurant chain has simpler data needs.

We needed a way to capture this. The answer came from Flexera’s State of Tech Spend Report, which surveys CIOs across industries about what percentage of revenue they allocate to IT. Industries that spend more on technology, we reasoned, have more complex data environments and more to lose when that data goes wrong.

Software and tech hosting companies spend 24.7% and 15.9% of revenue on IT respectively. Financial services: 10%. Healthcare: 5%. Retail: 6.2%. Manufacturing: around 5%.

We converted these into multipliers using the weighted average IT spend (8.2%) as the baseline. Industries above that baseline get a multiplier greater than 1, those below get less than 1.

The results are striking. The Information sector loses $12,161 per employee to dirty data annually, nearly 2.5 times the baseline, while for Finance and Insurance it’s $5,991 per employee. These are the sectors where data quality matters most, and where bad data extracts the highest toll.

In absolute terms, the largest drains come from sheer employment volume: Accommodation and Food Services ($71.8 billion total), Administrative and Support Services ($66.1 billion), Healthcare and Social Assistance ($66.1 billion), and Retail Trade ($59.6 billion).

But per head, an employee in the Information sector costs their company more than three times what a retail worker does in data quality losses (a 2.48x multiplier versus 0.76x).

Breakdown by state

Geography matters because industry mix matters. States with higher concentrations of data-intensive industries will have higher costs per employee. States dominated by agriculture, retail, and hospitality will trend lower.

Two stories emerge from the state data: where dirty data costs the most per worker, and where the total bill is highest.

For cost per employee, the District of Columbia leads at $4,859. California follows at $4,658, then Wisconsin ($4,597), Washington ($4,587), and Louisiana ($4,573).

At the bottom: Nebraska ($4,117), Minnesota ($4,125), Hawaii ($4,131), Utah ($4,151), and Arkansas ($4,178). These states have workforce compositions that lean toward lower-intensity data environments.

The range is tighter than you might expect: the highest state sits just 18% above the lowest. State economies are diverse enough that the extremes wash out.

For total cost, it’s a population and economy story. California leads with $76.4 billion in annual dirty data costs. Texas: $53.4 billion. Florida: $45.2 billion. New York: $39.6 billion. The four largest state economies account for over a third of the national total.
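The "over a third" claim is easy to check from the figures quoted above. This is a quick sanity check only; the variable names are ours, and the $617 billion national total is the one reported earlier in this article.

```python
# Sanity check: combined share of the four largest states (figures in $ billions).
NATIONAL_TOTAL_BN = 617.0

top_states_bn = {
    "California": 76.4,
    "Texas": 53.4,
    "Florida": 45.2,
    "New York": 39.6,
}

combined_bn = sum(top_states_bn.values())
share = combined_bn / NATIONAL_TOTAL_BN

print(f"Combined: ${combined_bn:.1f}B, or {share:.1%} of the national total")
```

The four states together come to about $214.6 billion, just under 35% of the national figure.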

Breakdown by county: where it gets interesting

States are too blunt an instrument; after all, California contains both Silicon Valley and the Central Valley. New York contains both Manhattan and rural dairy farms.

The real variation happens at the county level, and it’s dramatic.

State-level spread runs 18%. County-level spread runs 114%.

The highest-cost counties pay more than double per employee compared to the lowest. Same country, same regulatory environment, radically different data economics.

Tech corridor counties dominate. San Mateo County (home to much of Silicon Valley’s venture capital and tech headquarters) comes in at $6,085 per employee. Surprisingly, though, the most expensive county per employee isn’t in California at all. That distinction belongs to Daniels County, Montana, at $6,621 per employee.

What drives these numbers? Workforce composition. Daniels County's workforce is 30.2% Information sector (more than any other) and 7.4% Finance, both industries heavily reliant on data.

At the other end: rural counties dominated by agriculture, resource extraction, and basic services. Aleutians East Borough, Alaska: $3,092 per employee. Storey County, Nevada: $3,212. Van Buren County, Tennessee: $3,243. These are places where the economy doesn’t run on data in the same way.
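The 18% and 114% figures can be reproduced from the per-employee extremes quoted in this article. The sketch below assumes each percentage measures how far the highest value sits above the lowest; the helper name `spread_pct` is ours.

```python
def spread_pct(highest: float, lowest: float) -> float:
    """Percentage by which the highest value exceeds the lowest."""
    return (highest / lowest - 1) * 100


# Per-employee extremes quoted in the article.
state_spread = spread_pct(4_859, 4_117)    # District of Columbia vs. Nebraska
county_spread = spread_pct(6_621, 3_092)   # Daniels County, MT vs. Aleutians East Borough, AK

print(f"State spread:  {state_spread:.0f}%")
print(f"County spread: {county_spread:.0f}%")
```

Read this way, the highest-cost county pays a little more than twice what the lowest-cost county does per employee, while the highest-cost state pays less than a fifth more than the lowest.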

For major metro areas, the numbers are significant in absolute terms. New York County (Manhattan) and Los Angeles County lose $12.4 billion and $18.6 billion respectively, more than the GDP of some countries.

These aren’t theoretical losses, by the way. They’re happening right now, in every business, in every industry. The question isn’t whether your organization is affected ... but how much.

Our conclusion: why it matters more now

For years, dirty data was a nuisance. Expensive and frustrating, yes, but survivable. Businesses built workarounds and accepted a certain level of chaos as the cost of doing business.

Then came AI.

The promise of AI is that it processes information at a scale and speed humans can't match. It spots patterns across millions of records. It automates complex workflows. It generates insights that would take analysts weeks to produce.

That promise is real. But it comes with a catch.

AI doesn't think. It amplifies. There's no AI without IA, information architecture. Feed it clean, well-structured, connected data and you get genuine insight. Feed it the same fragmented, duplicated, inconsistent data that's been causing problems for years and you get confident-sounding nonsense.

Nonsense with the scale and speed of AI, with no human in the loop to catch the errors.

It's why 42% of companies scrapped most of their AI initiatives in 2025, up from 17% the year before.

The pattern is consistent: organizations skip the unsexy infrastructure work and pay for it later.

We see this in hiring data too. Our analysis of more than 180,000 job postings found companies are hiring 46% more AI specialists than data infrastructure professionals, a gap of roughly 35,000 roles nationally.

The question for every business isn't "how do we use AI?"

It should be "is our data ready for AI to amplify?"

Related: the solution starts with our data architecture consulting

Methodology

Baseline Cost Figure

The $12.9 million annual cost of poor data quality comes from Gartner’s Magic Quadrant for Data Quality Solutions (July 27, 2020, authors Melody Chien and Ankush Jain). Gartner surveyed 154 reference customers across 16 data quality vendors and asked them to estimate what poor data quality costs their organization.

These were large enterprises sophisticated enough to already be purchasing data quality software, companies that had done the work to understand and quantify the problem.

Per-Employee Calculation

U.S. Census Bureau County Business Patterns data (2023 release) shows businesses with 1,000+ employees average 2,626 employees per establishment. This aligns with Gartner’s survey population.

Dividing $12.9 million by 2,626 employees yields a baseline cost of $4,912 per employee per year. This per-employee figure was applied across all 139.8 million employees in the Census dataset.

Industry Multipliers

Different industries have different data intensities. We used Flexera’s 2020 State of Tech Spend Report, which surveys CIOs on IT spending as a percentage of revenue, to create industry-specific multipliers.

The weighted average IT spend across all industries is 8.2%. Industries spending more than this average have higher data complexity and greater exposure to data quality costs; industries spending less have lower exposure.

Multipliers were calculated by dividing each industry’s IT spend percentage by the 8.2% weighted average.

For example: Software companies spend 24.7% of revenue on IT, yielding a multiplier of 3.01x. We averaged Software (3.01x) and Technology Hosting (1.94x) to produce a combined Information sector multiplier of 2.48x. Financial Services at 10% IT spend yields a 1.22x multiplier. Healthcare at 5% yields 0.61x. Retail at 6.2% yields 0.76x.

For industries not covered by Flexera’s survey (Construction, Wholesale Trade, Educational Services, Arts and Entertainment, Real Estate, Utilities, Mining, Agriculture, and Administrative Support), we applied a 1.00x multiplier, equivalent to the weighted average IT spend.
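The multiplier step is simple enough to sketch in Python. The IT spend percentages and the 8.2% baseline below are the ones quoted in this article; the dictionary and function names are ours, and the 1.00x fallback mirrors the rule for industries outside the survey.

```python
# Illustrative sketch: turn IT spend (% of revenue) into industry multipliers.
BASELINE_IT_SPEND = 8.2   # weighted average IT spend across industries, in percent

# IT spend percentages quoted in the article (Flexera, 2020).
it_spend = {
    "Software": 24.7,
    "Technology Hosting": 15.9,
    "Financial Services": 10.0,
    "Healthcare": 5.0,
    "Retail": 6.2,
}

multipliers = {name: pct / BASELINE_IT_SPEND for name, pct in it_spend.items()}

# The Information sector is the simple average of Software and Technology Hosting.
multipliers["Information"] = (
    multipliers["Software"] + multipliers["Technology Hosting"]
) / 2


def multiplier_for(industry: str) -> float:
    """Industries without survey coverage fall back to 1.00x."""
    return multipliers.get(industry, 1.0)


for name in ["Software", "Technology Hosting", "Information",
             "Financial Services", "Healthcare", "Retail", "Construction"]:
    print(f"{name}: {multiplier_for(name):.2f}x")
```

Rounded to two decimals, this reproduces the multipliers listed above: 3.01x, 1.94x, 2.48x, 1.22x, 0.61x, and 0.76x, with Construction falling back to 1.00x.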

Geographic Calculations

State and county totals were calculated by applying the per-employee cost ($4,912) and industry multipliers to employment data from County Business Patterns.

For each geographic unit, we calculated: (Employees in Industry A × $4,912 × Industry A Multiplier) + (Employees in Industry B × $4,912 × Industry B Multiplier) + ..., summed over all industries present in that geography.

Cost per employee figures for states and counties reflect their industry mix. A county with high Information sector concentration will show a higher cost per employee than one dominated by hospitality, even though both use the same underlying methodology.
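Here is a minimal sketch of that per-geography calculation. The $4,912 base cost and the three multipliers come from this methodology; the employment counts are hypothetical, made up purely to show the mechanics, and are not Census data.

```python
# Illustrative sketch of the per-geography calculation.
BASE_COST_PER_EMPLOYEE = 4_912   # $12.9M / 2,626 employees, rounded as in the article

# Multipliers from the methodology: Information 2.48x, Retail 0.76x, uncovered 1.00x.
MULTIPLIERS = {"Information": 2.48, "Retail Trade": 0.76, "Construction": 1.00}


def geography_cost(employment: dict[str, int]) -> tuple[float, float]:
    """Return (total cost, cost per employee) for one state or county."""
    total = sum(
        employees * BASE_COST_PER_EMPLOYEE * MULTIPLIERS.get(industry, 1.0)
        for industry, employees in employment.items()
    )
    return total, total / sum(employment.values())


# HYPOTHETICAL employment counts for an imaginary county, not Census data.
example_county = {"Information": 1_000, "Retail Trade": 3_000, "Construction": 1_000}

total, per_employee = geography_cost(example_county)
print(f"Total: ${total:,.0f}  Per employee: ${per_employee:,.0f}")
```

Shifting workers from Retail Trade into Information in the example raises the per-employee figure, which is the industry-mix effect that separates high-cost counties from low-cost ones.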

Data Sources

U.S. Census Bureau, County Business Patterns (2023): Employment and establishment counts by industry (2-digit NAICS), state, and county. Dataset covers 8.36 million establishments and 139.8 million employees. census.gov/data/datasets/2023/econ/cbp/2023-cbp.html
Gartner, Magic Quadrant for Data Quality Solutions (July 2020): Survey of 154 enterprise customers on estimated cost of poor data quality. gartner.com/en/data-analytics/topics/data-quality
Flexera, 2020 State of Tech Spend Report: IT spending as percentage of revenue by industry, based on CIO surveys. flexera.com/blog/perspectives/it-spending-by-industry
U.S. Bureau of Economic Analysis, Gross Domestic Product (Q3 2025): National GDP figure of $31.1 trillion used to calculate dirty data costs as a percentage of economic output. fred.stlouisfed.org/series/GDP

Limitations

The Gartner baseline comes from large enterprises already investing in data quality solutions, organizations that have quantified the problem. Smaller businesses may experience different cost profiles.

The industry multipliers assume IT spending intensity correlates with data quality cost exposure; this is a reasonable but unverified assumption.

Industries without Flexera coverage are assigned the weighted average multiplier, which may understate or overstate their actual exposure. All figures represent estimates intended to illustrate the scale of the problem, not precise measurements of actual costs.