If you run a warehouse, you've probably heard of ABC cycle counting. It's the convention. Count your A-class (high-value, high-velocity) items every month, your B-class every quarter, your C-class once or twice a year. It's been the default in WMS software for forty years.

We don't ship it.

We have an ABC report. You can run an ABC analysis if you want. But the default counting suggestion in Nautilus has never been ABC, and we'd like to make the case for why.

What ABC is actually optimizing for

ABC was designed for an era when counting was expensive. Each count required a worker to walk to a location, manually inspect quantity, and write it down. Time was the constraint. So you allocated your scarce time to the SKUs that mattered most. High-value items get counted often. Low-value items get counted rarely.

But here's what ABC does not consider: whether the count is likely to find a discrepancy. It's a static priority based on inventory value, not a dynamic priority based on probability of error.

Picture two A-class SKUs. One was counted three days ago, has had no receiving or pick activity since, and sits in a quiet aisle far from the dock. The other was counted three months ago, has had ninety pick events, eight receipts, and three relocations in the meantime, and sits two bays from a forklift turnaround zone. ABC says count them both at the same frequency. Common sense says count the second one first and don't bother with the first one yet.
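A minimal sketch of why the two SKUs look identical to a static schedule. The interval values are an assumed reading of "every month, every quarter, once or twice a year", and the `Sku` fields and function name are illustrative rather than anything from Nautilus.

```python
from dataclasses import dataclass

# Assumed intervals in days, read off the monthly / quarterly / twice-yearly convention.
ABC_INTERVAL_DAYS = {"A": 30, "B": 90, "C": 180}


@dataclass
class Sku:
    name: str
    abc_class: str
    picks_since_count: int
    receipts_since_count: int
    relocations_since_count: int


def abc_count_interval_days(sku: Sku) -> int:
    """Return the count interval ABC assigns to this SKU.

    Only the class is consulted; the activity fields are never read.
    """
    return ABC_INTERVAL_DAYS[sku.abc_class]


quiet = Sku("quiet-aisle", "A", picks_since_count=0, receipts_since_count=0, relocations_since_count=0)
busy = Sku("near-forklift", "A", picks_since_count=90, receipts_since_count=8, relocations_since_count=3)

print(abc_count_interval_days(quiet), abc_count_interval_days(busy))
```

Both calls return the same interval, because nothing about handling since the last count enters the calculation.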

When you count something where the system is already right, you've spent labor and found nothing. That's a wasted count. The whole point of cycle counting is to find errors before they compound into stockouts and mispicks. A count that doesn't find an error is not a successful count. It's a failed test that produced no information.

The shift

What you actually want is not a value-weighted priority but a probability-weighted priority. Of all the SKUs and locations in your warehouse, which is most likely to be wrong right now? Count those. Don't count the ones the system is probably right about.

This is the same shift that happened in software testing in the 2000s. You used to write test cases by feature. Now you mostly write them by risk: untested code, code that just changed, code with a history of bugs. The labor went to where the labor was likely to matter.

What goes into our model

Our counting recommender is a gradient-boosted classifier trained to predict the probability that a given SKU-at-a-location will have an inventory discrepancy if counted right now. The features:

Time since last successful count. Linear effect mostly. We found a small nonlinearity at the 90-day mark which we attribute to people simply forgetting that locations exist if they haven't been touched in three months.

Number of pick events since last count. Strong positive effect. More handling means more chances for an error.

Number of receiving events since last count. Weaker than picks but still significant, especially when the receipt was a split-quantity entry.

Number of relocations since last count. Strong positive effect. Moving an item is where you lose count.

Variance of pick-event spacing. Bursty picks correlate with higher error rates. If you pick a SKU thirty times in two hours, the chance that one of those picks was logged against the wrong location is higher than if you picked it once a day for thirty days.

Operator tenure at the location. Locations primarily handled by operators with under six months of tenure show 1.8x the discrepancy rate of those handled by veterans. We don't surface this in the UI, but the model uses it.

Time-of-day distribution. Items mostly picked during shift handoff hours have higher error rates. The model can't fix shift handoff problems but it can prioritize counts that catch them.

Last discrepancy magnitude. If a SKU was wrong by 12 units last time, the next count of it is much more likely to find another discrepancy than a SKU that has never had one. We initially thought this was noise. It isn't.

Adjacent-location density. SKUs in dense storage where adjacent bins hold similar-looking items have higher error rates.

Light-level estimation, which we get from the front camera during scans. Dark corners produce more errors. We were surprised by this one and almost dismissed it before the effect held up over more data.
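Several of the features above are simple aggregates over a location's event stream. The following is a minimal sketch of how a few of them could be derived, assuming a hypothetical `Event` record with a `kind` label and a timestamp; the field names, labels, and units are illustrative and not the Nautilus schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import pvariance


@dataclass
class Event:
    kind: str      # assumed labels: "count", "pick", "receipt", "relocation"
    at: datetime


def location_features(events: list[Event], now: datetime) -> dict[str, float]:
    """Summarize activity at one SKU-location since its most recent count."""
    events = sorted(events, key=lambda e: e.at)
    counts = [e.at for e in events if e.kind == "count"]
    last_count = counts[-1] if counts else None
    since = [e for e in events if last_count is None or e.at > last_count]

    pick_times = [e.at for e in since if e.kind == "pick"]
    # Gaps between consecutive picks, in hours.
    gaps_h = [(b - a).total_seconds() / 3600 for a, b in zip(pick_times, pick_times[1:])]

    return {
        "days_since_count": (now - last_count).total_seconds() / 86400 if last_count else float("inf"),
        "picks": len(pick_times),
        "receipts": sum(e.kind == "receipt" for e in since),
        "relocations": sum(e.kind == "relocation" for e in since),
        # High when picks arrive in bursts separated by long quiet stretches.
        "pick_gap_variance_h2": pvariance(gaps_h) if len(gaps_h) >= 2 else 0.0,
    }


base = datetime(2026, 1, 1, 8, 0)
now = base + timedelta(days=10)
# Six picks, one per day.
steady = [Event("count", base)] + [Event("pick", base + timedelta(days=d)) for d in range(1, 7)]
# Six picks, in two bursts of three picks five minutes apart.
bursty = [Event("count", base)] + [
    Event("pick", base + timedelta(days=d, minutes=5 * i)) for d in (1, 6) for i in range(3)
]
steady_f = location_features(steady, now)
bursty_f = location_features(bursty, now)
```

Both example locations have the same number of picks since the last count; only the spacing variance separates them, which is the signal the burstiness feature is meant to carry.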

Training data

The model was trained on roughly 200,000 labeled count events from beta customers across 14 warehouses, each labeled with the absolute discrepancy found at the time of count. We held out four warehouses entirely for evaluation. We trained per-customer fine-tunes for the five largest customers and a global model that we use for everyone else; the fine-tunes outperform the global model by about 7% AUC on average.
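Holding out whole warehouses, rather than random rows, keeps every event from an evaluation site out of training. A minimal sketch of that split and of AUC itself, assuming rows are dicts with hypothetical `warehouse` and `abs_discrepancy` keys, and assuming the binary target is "any nonzero discrepancy" (the threshold is not stated above):

```python
import random


def has_discrepancy(row: dict) -> bool:
    """Assumed binarization: any nonzero absolute discrepancy is a positive."""
    return row["abs_discrepancy"] > 0


def split_by_warehouse(rows: list[dict], holdout_count: int, seed: int = 0):
    """Hold out whole warehouses so no evaluation site is seen in training."""
    warehouses = sorted({r["warehouse"] for r in rows})
    held_out = set(random.Random(seed).sample(warehouses, holdout_count))
    train = [r for r in rows if r["warehouse"] not in held_out]
    test = [r for r in rows if r["warehouse"] in held_out]
    return train, test, held_out


def auc(labels: list[bool], scores: list[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of rows, which is
    fine for a sketch but not for 200,000 events.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 14 warehouses with 5 count events each.
rows = [{"warehouse": f"wh-{i:02d}", "abs_discrepancy": i % 3} for i in range(14) for _ in range(5)]
train, test, held_out = split_by_warehouse(rows, holdout_count=4)
```

In practice a library routine would replace the hand-rolled `auc`; it is written out here only to show what the metric measures.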

We tuned for AUC rather than top-k accuracy because there is no fixed k. The user chooses how many counts they want to do today, and we surface the highest-probability candidates up to that number. The economic question is "how much of our daily counting labor budget should we spend on the next location?" and AUC, which scores the ranking across every possible cutoff, matches that decision better than accuracy at any single k.
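The surfacing step itself is small. A minimal sketch, assuming the model's output is available as a mapping from a location id to a predicted discrepancy probability; the function name and the ids are hypothetical.

```python
import heapq


def todays_counts(scores: dict[str, float], budget: int) -> list[str]:
    """Return up to `budget` location ids, highest predicted probability first."""
    if budget <= 0:
        return []
    return heapq.nlargest(budget, scores, key=scores.get)


scores = {"A-01": 0.12, "B-07": 0.81, "C-03": 0.45, "D-11": 0.67}
print(todays_counts(scores, budget=2))
```

Because the budget changes from day to day, the ranking has to be good at every cutoff, not just at one.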

Results from the field

Across our beta cohort, switching from ABC to probability-weighted counting cut the average daily count workload by 38% while increasing discrepancies found per count by 2.7x. In absolute terms: customers were doing a bit under two-thirds of the counting work and finding roughly 1.7 times as many errors overall (0.62 × 2.7 ≈ 1.67).

The customers who saw the biggest gains were not the largest. They were the messiest: warehouses with high SKU diversity, frequent relocations, and operators with under six months of tenure. Those are exactly the conditions where uniform ABC underspends on the risky locations.

Two customers saw smaller gains, both in the 10-15% range. We dug in. One had a very well-run operation already with very low discrepancy rates across the board, which limited how much we could help. The other was using us in a way we hadn't expected. They were counting only one section of their warehouse with Nautilus while doing the rest manually, which broke our model's ability to learn the customer-specific patterns.

What we got wrong in v1

The first version of the recommender didn't account for time-of-day of the last count. We discovered (about four months in) that counts performed at 4 PM had a 12% higher rediscovery rate than counts performed at 9 AM, controlling for everything else. End-of-shift counters are tired and less rigorous. Our model now down-weights counts performed in the last hour of a shift, treating them as partial information.
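One way to treat a late-shift count as partial information is to blend the time-since-count feature back toward the previous count. This is a sketch under stated assumptions, not a description of the production model: the 0.5 weight, the blending rule, and both function names are hypothetical.

```python
from datetime import datetime, timedelta

END_OF_SHIFT_WINDOW = timedelta(hours=1)
END_OF_SHIFT_WEIGHT = 0.5  # hypothetical value; the real weight is not stated


def count_trust(counted_at: datetime, shift_end: datetime) -> float:
    """Return how much a past count is trusted, as a weight in (0, 1]."""
    time_left = shift_end - counted_at
    if timedelta(0) <= time_left <= END_OF_SHIFT_WINDOW:
        return END_OF_SHIFT_WEIGHT
    return 1.0


def effective_days_since_count(days_since_count: float, days_since_prior_count: float, trust: float) -> float:
    """Blend toward the older count when the latest one is only partly trusted."""
    return trust * days_since_count + (1 - trust) * days_since_prior_count


late = count_trust(datetime(2026, 4, 1, 16, 30), shift_end=datetime(2026, 4, 1, 17, 0))
early = count_trust(datetime(2026, 4, 1, 9, 0), shift_end=datetime(2026, 4, 1, 17, 0))
print(effective_days_since_count(10, 40, late), effective_days_since_count(10, 40, early))
```

With a fully trusted count the feature is unchanged; with a half-trusted one, the location looks "older" to the model and returns to the queue sooner.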

We also originally surfaced the model's confidence as a percentage next to each recommendation. "87% probability of discrepancy if counted now." This was a mistake. Operators interpreted the number as a forecast. They expected to find a discrepancy on 87% of those counts. The actual hit rate was closer to 31%. The 87% was a score for ranking locations against each other, not a calibrated hit rate; people heard it as a frequency. We now surface a five-step priority (Critical / High / Medium / Low / Skip) instead and have not had this confusion since.
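A minimal sketch of turning scores into the five labels without ever showing a number. It assumes tiers are assigned by where a location ranks among today's candidates; the percentile cutoffs are hypothetical, as is the function name.

```python
TIERS = ["Critical", "High", "Medium", "Low", "Skip"]
# Hypothetical cutoffs on rank percentile (0.0 = lowest score, 1.0 = highest).
CUTOFFS = [0.95, 0.80, 0.50, 0.20]


def priority_tiers(scores: dict[str, float]) -> dict[str, str]:
    """Map each location to a tier based on its rank, not its raw score."""
    ordered = sorted(scores, key=scores.get)  # lowest score first
    n = len(ordered)
    tiers = {}
    for rank, loc in enumerate(ordered):
        pct = rank / (n - 1) if n > 1 else 1.0
        # First tier whose cutoff this percentile reaches; otherwise "Skip".
        tiers[loc] = next((t for t, c in zip(TIERS, CUTOFFS) if pct >= c), "Skip")
    return tiers


tiers = priority_tiers({"a": 0.1, "b": 0.3, "c": 0.5, "d": 0.7, "e": 0.9})
print(tiers)
```

Ranking-based tiers only promise an ordering, which is all the score can support unless it has been checked against observed hit rates.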

What it doesn't do

Probability-weighted counting doesn't help if you have a fundamental process problem. If your operators are routinely scanning the wrong bin label because the labels are unreadable, no model will save you. You'll just be counting and recounting the same bad data. Fix the labels first.

It also doesn't help if your error distribution is dominated by a single root cause, like one operator who's consistently miscounting. The model will correctly identify their territory as high-risk, but the right intervention is conversation, not more counting.

The bigger point

Most warehouse "best practices" predate cheap compute. ABC cycle counting was a good answer to "how do I allocate scarce inspector time," when the model in your head was the best information available. With cheap compute and an event stream, the model in software can do better. Not because it's smarter, but because it has access to data the model in your head doesn't: every relocation timestamp, the variance of every pick interval, the operator tenure for every scan.

We think this same logic will eventually retire a half-dozen warehouse conventions. ABC, blanket safety stock multipliers, fixed-interval recounts, location-based shrink reporting. The convention exists because the computation was expensive. The computation is no longer expensive.

Published: Apr 12, 2026

Category: Engineering

Read time: 11 min