Engineering · Mar 15, 2026 · 8 min read

How We Built Sub-200ms Barcode Recognition

A deep dive into the AI pipeline that powers Nautilus scanning — from camera frame to decoded SKU in under 200 milliseconds.

When we set out to build Nautilus, we knew scanning had to be fast. Not "fast for a web app" — fast enough that it feels instant. Our target was 200 milliseconds from camera frame capture to decoded SKU displayed on screen.

The pipeline

The scanning pipeline has four stages: frame capture, barcode localization, decode, and database lookup. Each stage had to be optimized independently, and then the whole pipeline had to run end to end without blocking the main thread.

Frame capture runs the device camera at 30fps. We don't process every frame: an adaptive algorithm scores each frame by edge contrast and selects the sharpest one from each 3-frame window. This alone eliminated 40% of decode failures from motion blur.
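To make the frame selection idea concrete, here is a minimal sketch in Python. It is not our production code: the scoring function (mean squared difference between neighboring pixels) is just one simple way to measure edge contrast, and the names edge_contrast_score and select_sharpest are hypothetical. Frames are assumed to be grayscale images stored as rows of 0-255 values, with all rows the same length.

```python
from typing import List, Sequence

Frame = List[List[int]]  # grayscale image as rows of 0-255 pixel values


def edge_contrast_score(frame: Frame) -> float:
    """Mean squared difference between horizontally and vertically adjacent pixels.

    Sharp frames have strong local transitions and score high;
    motion-blurred frames smear those transitions and score low.
    """
    total = 0
    count = 0
    for y, row in enumerate(frame):
        for x, pixel in enumerate(row):
            if x + 1 < len(row):  # right-hand neighbor
                total += (row[x + 1] - pixel) ** 2
                count += 1
            if y + 1 < len(frame):  # neighbor below
                total += (frame[y + 1][x] - pixel) ** 2
                count += 1
    return total / count if count else 0.0


def select_sharpest(frames: Sequence[Frame], window: int = 3) -> List[int]:
    """Return the index of the highest-scoring frame in each consecutive window."""
    picks = []
    for start in range(0, len(frames), window):
        group = range(start, min(start + window, len(frames)))
        picks.append(max(group, key=lambda i: edge_contrast_score(frames[i])))
    return picks
```

With a window of 3 at 30fps, a selector like this hands roughly 10 frames per second to the rest of the pipeline, and only the sharpest candidate from each window.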

Neural barcode localization

Traditional barcode scanners look for specific patterns across the entire image. We trained a lightweight CNN (1.2M parameters) to predict bounding boxes around barcode regions in under 15ms. This lets us crop the image before decode, so the decoder only processes the barcode region, which is dramatically faster than scanning the full-resolution frame.
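The crop step itself is simple once a bounding box is available. The sketch below assumes the box arrives as (left, top, right, bottom) pixel coordinates with exclusive right and bottom edges. The small safety margin around the box is an illustrative assumption, not a documented detail of our pipeline, and crop_to_box is a hypothetical name.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop_to_box(frame: Frame, box: Box, margin: int = 4) -> Frame:
    """Crop a frame to a predicted bounding box, padded by a margin.

    The margin (an assumption in this sketch) leaves room for slightly
    tight predictions; the result is clamped to the frame edges.
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    left, top, right, bottom = box
    left = max(0, left - margin)
    top = max(0, top - margin)
    right = min(width, right + margin)
    bottom = min(height, bottom + margin)
    return [row[left:right] for row in frame[top:bottom]]
```

The payoff is in pixel count: a barcode that fills a 400x150 region of a 1920x1080 frame leaves the decoder with about 3% of the original pixels to examine.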

The model handles partial barcodes, damaged labels, and unusual angles that traditional pattern matching would fail on. It was trained on 2.3 million real-world barcode images captured in warehouse conditions.

Decode and lookup

Once localized, the barcode region is processed by our decode engine, which supports Code 128, Code 39, EAN-13, UPC-A, QR, and Data Matrix formats simultaneously. There is no need to specify the format up front: the engine identifies and decodes it in a single pass.

Database lookup happens against a local cache of the warehouse's product catalog, synced in the background. Cache hits (98.7% of lookups) complete in under 2ms. Cache misses fall back to the API with typical response times of 40-80ms.
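The cache-first lookup can be sketched as follows. This is an illustration under assumptions rather than our actual client: CatalogLookup, sync, and the fetch_remote callable are hypothetical names, the catalog is modeled as a plain dictionary keyed by SKU, and storing API results back into the cache is a design choice made for this sketch.

```python
from typing import Callable, Dict, Optional


class CatalogLookup:
    """Local-cache-first product lookup with a remote fallback."""

    def __init__(self, fetch_remote: Callable[[str], Optional[dict]]):
        self._cache: Dict[str, dict] = {}
        self._fetch_remote = fetch_remote  # stands in for the API call
        self.hits = 0
        self.misses = 0

    def sync(self, products: Dict[str, dict]) -> None:
        """Merge a batch of catalog entries, as a background sync would."""
        self._cache.update(products)

    def lookup(self, sku: str) -> Optional[dict]:
        product = self._cache.get(sku)
        if product is not None:
            self.hits += 1
            return product
        # Cache miss: fall back to the slower remote lookup.
        self.misses += 1
        product = self._fetch_remote(sku)
        if product is not None:
            self._cache[sku] = product  # assumption: keep the result for next time
        return product
```

Tracking hits and misses separately is what makes a figure like the 98.7% hit rate measurable in the first place.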

End to end: frame selection (10ms) + localization (15ms) + decode (8ms) + lookup (2ms) + rendering (12ms) = 47ms typical. Our 200ms target gives us 153ms of headroom for difficult conditions.
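The budget arithmetic is easy to check in a few lines. The stage timings below are the typical figures quoted above. The cache-miss case assumes the API round trip (40-80ms) is added on top of the 2ms local cache check, which is a simplification made for this sketch.

```python
# Typical per-stage latency in milliseconds, as quoted in this post.
STAGE_MS = {
    "frame selection": 10,
    "localization": 15,
    "decode": 8,
    "lookup": 2,
    "rendering": 12,
}
TARGET_MS = 200
CACHE_MISS_API_MS = (40, 80)  # typical API response range on a cache miss


def end_to_end_ms(extra_ms: int = 0) -> int:
    """Sum of all stage timings plus any extra latency (e.g. an API call)."""
    return sum(STAGE_MS.values()) + extra_ms


typical_ms = end_to_end_ms()
headroom_ms = TARGET_MS - typical_ms
miss_range_ms = tuple(end_to_end_ms(api) for api in CACHE_MISS_API_MS)

print(f"typical: {typical_ms} ms, headroom: {headroom_ms} ms")
print(f"cache miss: {miss_range_ms[0]}-{miss_range_ms[1]} ms")
# prints:
# typical: 47 ms, headroom: 153 ms
# cache miss: 87-127 ms
```

Under that assumption, even a cache miss at the slow end of the typical range stays well inside the 200ms target.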