We turned on the anomaly detection layer for every Nautilus customer on April 1st. Here are five real incidents it surfaced, anonymized but otherwise unmodified.

The feature is what it sounds like: a model that watches the stream of scans, counts, voice events, and movements in your warehouse and flags patterns that look unusual.

We were curious what it would actually catch. The team had spent months on the model, and we had a long list of theoretical anomalies we'd designed for, but theory is theory. Here are five real things it surfaced in the first month. None of them were on our list of design cases.

One. The unproductive scanning

First customer, large auto parts distributor. Day three of the feature being live, we got a note from their warehouse manager: an operator on the early-morning shift had scanned the same SKU 47 times in 8 minutes. The anomaly system flagged it as "repeat-scan well above normal distribution." On its own, that's not necessarily a problem; sometimes operators scan things repeatedly during a difficult count.

The manager investigated. The operator had been scanning the same item over and over to inflate a "scans per hour" metric that their previous manager had set as a productivity KPI. The current manager hadn't even known the metric existed; it only came up in the conversation that followed.

We didn't design the system to detect gaming of productivity metrics, but in retrospect we should have expected it. Anywhere there's a number a person is measured by, someone is figuring out how to make the number go up without doing more work. The model didn't flag "this person is gaming a metric." It flagged "this scan pattern is statistically improbable under normal operations." The interpretation was the manager's.
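To make "statistically improbable" a little more concrete, here is a minimal Python sketch of a repeat-scan check. It is an illustration only, not the production model: the function names, the 8-minute window, and the `typical_max` baseline are assumptions chosen to mirror the incident above.

```python
from collections import deque
from datetime import datetime, timedelta


def max_repeat_scans(scan_times, window=timedelta(minutes=8)):
    """Largest number of scans of one SKU that fall inside any single window."""
    recent = deque()
    best = 0
    for t in sorted(scan_times):
        recent.append(t)
        # Drop scans that are older than the window relative to the newest scan.
        while t - recent[0] > window:
            recent.popleft()
        best = max(best, len(recent))
    return best


def is_repeat_scan_anomaly(scan_times, typical_max, window=timedelta(minutes=8)):
    """Flag when the densest window exceeds a baseline (typical_max is hypothetical)."""
    return max_repeat_scans(scan_times, window) > typical_max
```

Fed 47 scans spaced ten seconds apart, `max_repeat_scans` returns 47, far above any plausible baseline for a single SKU. Like the real alert, the check says nothing about why the scans happened.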

Two. The late-night adjustment

Second customer, mid-sized 3PL. At 2:47 in the morning, an inventory adjustment of -120 units of a high-value SKU was entered by a user account with administrator permissions. The model flagged it on two dimensions: time of day was outside the user's typical activity window, and the magnitude was 4 standard deviations above their average adjustment size.

It turned out to be legitimate. The user was an operations manager who was in the building doing a late-night audit before a quarterly review. The adjustment reflected a known but unreported shrinkage that they were finally writing down in the books.

But: it could have been theft. And the fact that it took a model to surface this for review, rather than a human noticing it the next morning during a 30-second glance at the activity log, is the point. The 30-second glance doesn't happen reliably. The model flag does.

The customer has since used the same pattern to set up explicit review requirements for after-hours adjustments above a threshold. We didn't build that workflow; they built it themselves on top of our alert. Good outcome.
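The two dimensions in that flag can be sketched roughly as follows. This is a simplified stand-in, assuming a per-user history of adjustment sizes and active hours. The function name, the `z_limit` default, and the "hour never seen before" test are hypothetical and are not a description of the real model.

```python
from statistics import mean, stdev


def adjustment_flags(history_sizes, history_hours, size, hour, z_limit=3.0):
    """Return the reasons an adjustment looks unusual for this user.

    history_sizes: absolute unit counts of the user's past adjustments (2 or more)
    history_hours: hours of day (0-23) at which the user has been active before
    """
    flags = []
    mu = mean(history_sizes)
    sigma = stdev(history_sizes)
    # Standard score of this adjustment's magnitude against the user's own history.
    z = (abs(size) - mu) / sigma if sigma > 0 else 0.0
    if z > z_limit:
        flags.append("magnitude")
    # Crude activity-window test: has this user ever been active at this hour?
    if hour not in set(history_hours):
        flags.append("time_of_day")
    return flags


sizes = [4, 6, 5, 8, 7, 5, 6, 7]       # made-up history of small adjustments
hours = [8, 9, 10, 11, 13, 14, 15, 16]  # made-up daytime activity
print(adjustment_flags(sizes, hours, size=-120, hour=2))
```

A rule like the customer's after-hours review requirement is simpler still: it needs only a fixed hour range and a fixed unit threshold, with no per-user history at all.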

Three. The vendor mislabel

Third customer, electronics distributor. They were receiving a shipment from a Chinese supplier whose labels had been mostly fine for two years. The anomaly system flagged that the SKUs being received didn't match the SKUs on the corresponding purchase order. Not by a wide margin: three SKUs out of 80 were off by one in their internal coding.

On investigation, the supplier had updated their internal SKU scheme three weeks earlier, shifting three of the codes. The labels were correctly printed against the new scheme, but the receiving operator was reading the codes as if they still followed the old one. About 12 units of a $200 line item had already been receipted to the wrong SKU.

The receiving operator wasn't doing anything wrong. They had been receiving from this vendor for years and the scheme had been stable. The anomaly detector noticed the mismatch and flagged it for review before more inventory got misallocated.

We talked to the supplier afterward, with the customer's permission. The supplier had emailed about the SKU scheme change but the email had gone to a procurement inbox that no one was actively monitoring. This is depressingly common. The model can't prevent a missed email, but it can catch the consequences before they spread.
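The underlying comparison is simple to picture. Here is a minimal sketch, assuming the purchase order and the receipt are each available as a list of SKU strings. The SKU values and the function name are made up for illustration.

```python
def unmatched_receipts(po_skus, received_skus):
    """Return (unmatched SKUs, share of distinct received SKUs not on the PO)."""
    expected = set(po_skus)
    received = set(received_skus)
    unmatched = sorted(received - expected)
    share = len(unmatched) / len(received) if received else 0.0
    return unmatched, share


po = ["EL-1041", "EL-1042", "EL-2207", "EL-3310"]
receipt = ["EL-1041", "EL-1043", "EL-2207", "EL-3310"]
print(unmatched_receipts(po, receipt))  # (['EL-1043'], 0.25)
```

A small share of near-miss codes from a long-stable vendor, as in this incident, is a different signal from a shipment where most lines fail to match, which is why the share is worth returning alongside the list.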

Four. The avoided coworker

Fourth customer, apparel. The model flagged that one specific operator had picked exclusively from aisles 1 through 8 over a six-week stretch, despite being trained on the entire warehouse and being included in pick assignments across the floor. Picker assignment is supposed to come from the optimizer, which spreads work across operators based on workload and skill, but this one operator's actual completed picks didn't match what the optimizer was assigning to them.

The customer investigated. The operator was avoiding aisles 9-16 because a coworker who worked in that section had been making sexual comments. The picker had been declining assignments in that area, swapping them for assignments in aisles 1-8 with other operators who didn't ask questions about why.

This is not a use case we built for. The model flagged it as "pick distribution far from optimizer assignment for this operator," which is a statistical statement, not a social one. The HR conversation that followed was the customer's, and they handled it well, but we are uncomfortable that this is what our tool detected. We have left the alert active and added a help center entry recommending that managers who see this pattern start a confidential conversation, not a disciplinary one.

Five. The double-counting bug

Fifth customer, food and beverage distributor. The model flagged a single bin showing a 99% pick rate over a 3-day window. A 99% pick rate would mean nearly every item that entered the bin was immediately picked out and shipped, which is plausible only for high-velocity cross-dock items.

On investigation, the bin in question held a slow-moving SKU. The 99% rate was a bug, not a behavior. Our integration with their Shopify storefront was double-counting pick events under a specific condition: orders that involved a split shipment because some items were on backorder. The duplicate pick events were artificially inflating the velocity numbers.

We had not built the anomaly detector to catch our own integration bugs. It caught this one anyway, because "this looks weird" is a generic enough framing that anything weird gets surfaced, including things that are weird because of how we wrote our software. The fix took two hours. The duplicate events had been in the data for about a week, affecting maybe four customers. We rolled the fix out across all customers, sent the affected ones a note, and audited their historical reporting to back out the inflated numbers.

This was, in some ways, the most useful catch of the five. The other four were customer-side issues. This one was ours. The anomaly detector pointed it out before our QA process did.
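As a toy illustration of how duplicated events inflate a pick rate (not the actual integration code or the fix), assume each pick event carries a unique `pick_id` and that a duplicate repeats it. The field names and quantities below are hypothetical.

```python
def dedupe_picks(pick_events):
    """Keep the first event seen for each pick_id and drop later repeats."""
    seen = set()
    unique = []
    for event in pick_events:
        if event["pick_id"] not in seen:
            seen.add(event["pick_id"])
            unique.append(event)
    return unique


def pick_rate(units_received, pick_events):
    """Units picked divided by units that entered the bin."""
    if units_received == 0:
        return 0.0
    return sum(e["qty"] for e in pick_events) / units_received


events = [
    {"pick_id": "p1", "qty": 40},
    {"pick_id": "p1", "qty": 40},  # duplicate of p1
    {"pick_id": "p2", "qty": 30},
    {"pick_id": "p2", "qty": 30},  # duplicate of p2
]
raw_rate = pick_rate(200, events)                  # counts every event
clean_rate = pick_rate(200, dedupe_picks(events))  # counts each pick once
print(raw_rate, clean_rate)
```

In this example the duplicated events double the apparent rate, which is the kind of jump that stands out on a slow-moving bin.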

What it doesn't catch

In the spirit of being honest:

It doesn't catch slow drift. If your inventory accuracy is degrading by 0.3% per week, the model sees normal-looking individual transactions and won't flag any of them. The cycle-counting recommender will eventually correct the drift, but it can take longer than you'd like.

It doesn't catch policy violations that are statistically common. If half your operators are routinely skipping a step in the receiving process, the model considers that the norm and won't surface it. You need process audits for that, not pattern detection.

It doesn't reason about intent. It can flag that an adjustment happened at an unusual time, but it can't tell whether the adjustment was someone working late or someone covering up shrinkage. The interpretation is always the human's.
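The slow-drift case is easy to see with arithmetic. The sketch below assumes the 0.3% is a loss of 0.3 percentage points per week, starting from 99% accuracy. Both the starting figure and the 13-week horizon are made up for illustration.

```python
def accuracy_by_week(start_pct, weekly_drop_pts, weeks):
    """Inventory accuracy (in percent) at the end of each week, week 0 included."""
    return [start_pct - weekly_drop_pts * w for w in range(weeks + 1)]


# Assumed figures: 99% starting accuracy, 0.3 points lost per week, one quarter.
series = accuracy_by_week(99.0, 0.3, 13)

# No single week moves by more than about 0.3 points, so a detector that
# looks at individual steps sees nothing unusual.
largest_weekly_step = max(a - b for a, b in zip(series, series[1:]))

# The cumulative loss over the quarter, however, is close to 4 points.
total_drop = series[0] - series[-1]

print(f"largest weekly step: {largest_weekly_step:.1f} points")
print(f"total drop after 13 weeks: {total_drop:.1f} points")
```

Each step looks normal in isolation; only the running total is alarming, and the running total is exactly what per-transaction pattern detection does not look at.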

Tuning

By default, anomaly detection runs at a threshold that surfaces roughly 8 to 12 alerts per warehouse per week. We picked that range based on beta feedback: enough alerts that they are taken seriously, few enough that they don't get ignored as noise. You can tune this in Settings if you want a different volume.
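One way to think about a volume-based threshold is as a cutoff on a ranked list of anomaly scores. Here is a minimal sketch, assuming every event in a past week has a numeric score where higher means more unusual. The function names and the scoring scale are hypothetical, and this is not a description of how the Settings control works internally.

```python
def threshold_for_volume(scores, target_alerts):
    """Pick a score cutoff so that about `target_alerts` past events would alert."""
    if target_alerts <= 0 or not scores:
        return float("inf")  # nothing should alert
    ranked = sorted(scores, reverse=True)
    k = min(target_alerts, len(ranked))
    return ranked[k - 1]


def alerts(scores, threshold):
    """Events whose score meets or exceeds the cutoff."""
    return [s for s in scores if s >= threshold]


last_week = list(range(1000))  # stand-in scores for one warehouse-week
cutoff = threshold_for_volume(last_week, target_alerts=10)
print(cutoff, len(alerts(last_week, cutoff)))  # 990 10
```

Tied scores at the cutoff can push the count slightly above the target, which is one reason to state the volume as a range and not as an exact number.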

The model also accepts feedback. Every alert has a "useful / not useful" thumbs button. Aggregate feedback is fed back into our training pipeline weekly, so the system gets better at surfacing alerts your team finds actionable and quieter on alerts you don't.

Closing

We are wary of the framing "AI watches your warehouse and catches problems." It's not wrong but it can sound sinister, and we are determined not to ship a surveillance product. The alerts are aimed at operations leads, not at individual operators. The defaults are conservative. The system never disciplines anyone; it surfaces patterns, and the human decides what, if anything, to do.

The five stories above are unusual catches. Most weeks, most warehouses will get a handful of alerts about things that turn out to be benign. The point isn't that the model is always right. The point is that without it, the unusual stuff drifts past everyone's attention until it compounds.

Published: May 02, 2026

Category: Product

Read time: 9 min