Photo Atlas

How I catch duplicate uploads without slowing the upload

·7 min read·by Thomas Hart

Duplicate detection usually slows uploads because it runs in the request path. Atlas splits it in two: a streaming SHA-256 for exact matches during upload, then perceptual hashing in a background worker afterward. Here's the false-positive math and the recovery path.

Atlas computes a SHA-256 hash while the file streams in, which catches byte-identical uploads for free, then runs perceptual hashing in a background worker after the bytes land. The upload itself never waits on duplicate detection. Near-duplicate matches are advisory, and every one of them can be undone.

Why I don't check for duplicates during the upload

The obvious version of dedupe runs the moment a file arrives. Take the bytes, hash them, query the database, decide whether to keep the file, then return a response. That holds up fine until someone uploads a 40MB photo on hotel wifi. Now the request sits there hashing and querying while the user watches a spinner.

So I split the work into two passes. One is cheap enough to run inline. The other is expensive enough that it has to run later.

The cheap pass is the exact hash. As bytes stream through the upload handler on their way to storage, I feed them into a running SHA-256 digest. No second read of the file. No extra buffer. The hash falls out of the stream I was already moving. By the time the last byte is written, I have a 64-character hex digest, and one indexed lookup tells me whether those exact bytes already exist for this account. If they do, I point the new record at the existing object and skip the write. That path adds about 3 to 8 milliseconds on a 2MB image. Nobody feels it.
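Here is a minimal sketch of that inline pass, assuming Node streams. The names `storeAndHash` and `sink` are illustrative and not Atlas's actual handler; the storage destination and the indexed lookup that follows are left to the caller.

```typescript
import { createHash } from "node:crypto";
import { Readable, Transform, Writable } from "node:stream";
import { pipeline } from "node:stream/promises";

// Hypothetical helper: pipes an upload to its destination and returns the
// SHA-256 hex digest of the bytes that passed through. The file is read once.
async function storeAndHash(source: Readable, sink: Writable): Promise<string> {
  const hash = createHash("sha256");

  // Pass-through stage: update the running digest, then forward the chunk unchanged.
  const tap = new Transform({
    transform(chunk, _encoding, callback) {
      hash.update(chunk);
      callback(null, chunk);
    },
  });

  // pipeline() resolves once the sink has finished and rejects if any stage fails.
  await pipeline(source, tap, sink);
  return hash.digest("hex"); // 64 hex characters
}
```

The digest is only final once the last chunk has been forwarded, so the duplicate lookup has to wait for `pipeline` to resolve.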

What's the difference between exact and perceptual hashing?

Exact hashing answers one question: are these the same bytes? SHA-256 is perfect for it. Change a single pixel, re-save the JPEG at a different quality, or strip the EXIF block, and the digest changes completely. That makes it useless for catching the same photo saved as a slightly different file, which is most of what real users upload twice.

Perceptual hashing answers a different question: do these two images look the same? It throws away almost all the detail on purpose. The fingerprint is small, and two images that look alike produce fingerprints that sit close to each other. The version I run is dHash, the difference hash. It shrinks the image to 9 by 8 pixels in grayscale, then records whether each pixel is brighter than the one to its right. Nine columns give eight comparisons per row, and eight rows give 64 bits.

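import sharp from "sharp";

// dHash: shrink to 9x8 grayscale, then compare each pixel with its right
// neighbor. 8 comparisons per row x 8 rows = 64 bits.
export async function dHash(input: Buffer): Promise<bigint> {
  // resolveWithObject makes toBuffer() resolve to { data, info };
  // without it the promise resolves to the Buffer itself and `data` is undefined.
  const { data } = await sharp(input)
    .grayscale()
    .resize(9, 8, { fit: "fill" })
    .raw()
    .toBuffer({ resolveWithObject: true });

  let hash = 0n;
  for (let row = 0; row < 8; row++) {
    for (let col = 0; col < 8; col++) {
      // Assumes one byte per pixel (single-channel raw output), 9 pixels per row.
      const left = data[row * 9 + col];
      const right = data[row * 9 + col + 1];
      // Bit is 1 when the pixel is brighter than the one to its right.
      hash = (hash << 1n) | (left > right ? 1n : 0n);
    }
  }
  return hash;
}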

Two images count as near-duplicates when their hashes are close. Closeness is the Hamming distance, the number of bits that differ between the two 64-bit values. Identical-looking images land at a distance of 0. A re-compressed copy usually lands at 1 to 4. Genuinely different photos sit out past 20.
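The distance itself is a few lines. This is a sketch, assuming the hashes are held as `bigint` values; `hammingDistance` and `isNearDuplicate` are illustrative names, and the threshold is passed in by the caller.

```typescript
// Keeps only the low 64 bits, matching the width of the dHash.
const MASK_64 = (1n << 64n) - 1n;

// Number of bit positions where two 64-bit hashes differ.
function hammingDistance(a: bigint, b: bigint): number {
  let diff = (a ^ b) & MASK_64;
  let count = 0;
  while (diff !== 0n) {
    diff &= diff - 1n; // clears the lowest set bit
    count++;
  }
  return count;
}

// Hypothetical helper: true when the two hashes are within `threshold` bits.
function isNearDuplicate(a: bigint, b: bigint, threshold: number): boolean {
  return hammingDistance(a, b) <= threshold;
}
```

The loop runs once per differing bit, so close hashes, which are the interesting case, are also the cheapest to compare.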

How I compute the perceptual hash without blocking the upload

dHash itself is not slow. Decoding a large image and resizing it is. On a 12MP JPEG the sharp resize runs about 15 to 40 milliseconds, and that's after the decode. Run that inline on every upload and you've put image decoding in the request path for no reason, because the user does not need the duplicate answer before their file is safely stored.

So the perceptual pass happens after the upload returns. When the bytes are written, I enqueue a small job with the object id. A background worker picks it up, computes the dHash, and compares it against the other hashes in that account using a bounded Hamming search. If it finds something within the threshold, it flags the new object as a possible duplicate and records which object it matched. The whole thing happens in the second or two after upload. The user already has their success response.
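The worker's comparison step could look something like this. It is a sketch under assumptions: the `StoredHash` shape, the function names, and the linear scan are mine, and "bounded" is read here as both limited to one account and giving up on a pair once it exceeds the threshold.

```typescript
interface StoredHash {
  objectId: string;
  dHash: bigint;
}

interface Match {
  objectId: string;
  distance: number;
}

// Counts differing bits, returning null as soon as the count exceeds `limit`.
function boundedDistance(a: bigint, b: bigint, limit: number): number | null {
  let diff = (a ^ b) & ((1n << 64n) - 1n);
  let count = 0;
  while (diff !== 0n) {
    diff &= diff - 1n; // clears the lowest set bit
    count++;
    if (count > limit) return null;
  }
  return count;
}

// Scans one account's hashes and returns the closest match within the
// threshold, or null when nothing is close enough.
function findNearDuplicate(
  candidate: StoredHash,
  accountHashes: StoredHash[],
  threshold: number,
): Match | null {
  let best: Match | null = null;
  for (const other of accountHashes) {
    if (other.objectId === candidate.objectId) continue; // never match itself
    const distance = boundedDistance(candidate.dHash, other.dHash, threshold);
    if (distance === null) continue;
    if (best === null || distance < best.distance) {
      best = { objectId: other.objectId, distance };
    }
    if (distance === 0) break; // cannot do better than an identical hash
  }
  return best;
}
```

A job handler would load the account's hashes, call `findNearDuplicate`, and write the flag and pointer when the result is not null.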

This also means a slow or failing dedupe worker can never break an upload. The worst case is that a near-duplicate goes unflagged until the worker catches up. Nothing is lost in the meantime, only a badge that shows up late, and I can live with that.

What's the false-positive rate?

A false positive is two images that look different to a person but land inside the distance threshold. The rate depends entirely on where you set that threshold.

I tested four thresholds against a set of about 12,000 real images from my own storage, where I already knew the true duplicate pairs. Here's what I measured:

| Hamming threshold | Caught real duplicates | False positives |
| --- | --- | --- |
| ≤ 2 | 71% | ~0.01% |
| ≤ 6 | 94% | ~0.3% |
| ≤ 10 | 98% | ~2.1% |
| ≤ 14 | 99% | ~7% |

I settled on ≤ 6. It catches the large majority of true duplicates, and at 0.3% the false positives are rare enough to handle by hand on the few occasions they happen. The jump to ≤ 10 buys four more points of recall and costs seven times the false positives. Not worth it for a storage product where a wrong merge is annoying.
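If you want to run the same kind of sweep on your own files, the bookkeeping is small. This is a sketch, not the script behind the numbers above. It assumes you have labeled image pairs with their Hamming distances, and it assumes the false-positive rate means the share of non-duplicate pairs that land inside the threshold, which is one reasonable definition among several. All names are illustrative.

```typescript
interface LabeledPair {
  distance: number; // Hamming distance between the two dHashes
  isDuplicate: boolean; // ground truth from manual labeling
}

interface ThresholdResult {
  threshold: number;
  recall: number; // share of true duplicate pairs caught
  falsePositiveRate: number; // share of non-duplicate pairs wrongly flagged
}

function evaluateThreshold(pairs: LabeledPair[], threshold: number): ThresholdResult {
  let duplicates = 0;
  let caught = 0;
  let distinct = 0;
  let wronglyFlagged = 0;

  for (const pair of pairs) {
    const flagged = pair.distance <= threshold;
    if (pair.isDuplicate) {
      duplicates++;
      if (flagged) caught++;
    } else {
      distinct++;
      if (flagged) wronglyFlagged++;
    }
  }

  return {
    threshold,
    recall: duplicates === 0 ? 0 : caught / duplicates,
    falsePositiveRate: distinct === 0 ? 0 : wronglyFlagged / distinct,
  };
}
```

Running it for each candidate threshold, for example `[2, 6, 10, 14].map((t) => evaluateThreshold(pairs, t))`, gives one row per threshold to compare.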

Here is the comparison that drove the whole design:

| | Exact (SHA-256) | Perceptual (dHash) |
| --- | --- | --- |
| Catches | byte-identical files | resized, recompressed, lightly edited |
| When it runs | during the upload stream | background worker, after upload |
| Added latency | ~3 to 8 ms | 0 ms on the request path |
| False positives | none | ~0.3% at distance ≤ 6 |
| Reversible | not applicable | yes, always advisory |

How do you recover from a dedupe mistake?

The rule that makes the false-positive rate survivable is simple. Perceptual dedupe never deletes anything and never silently merges. A match sets a flag and a pointer. The file stays exactly where it is.

Dedupe should never delete a user's bytes. The moment it does, one false positive turns into a support ticket about lost data, and there is no file left to hand back.

When the worker flags a near-duplicate, the user sees a small "possible duplicate" badge with a side-by-side of the two images and one button to dismiss it. Dismissing clears the flag and writes the pair to an ignore list, so those two images never get flagged against each other again. Nothing was ever removed, so there is nothing to restore. The recovery is a single click that says these two are different, leave them alone.
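The ignore list only needs to be order-independent, so that dismissing A against B also covers B against A. A minimal in-memory sketch, with hypothetical names; a real version would live in a table keyed on the sorted pair of ids.

```typescript
// Order-independent key: (a, b) and (b, a) map to the same entry.
// JSON encoding avoids collisions when ids contain separator characters.
function pairKey(idA: string, idB: string): string {
  return JSON.stringify(idA < idB ? [idA, idB] : [idB, idA]);
}

class IgnoreList {
  private readonly pairs = new Set<string>();

  // Called when the user dismisses the "possible duplicate" badge.
  dismiss(idA: string, idB: string): void {
    this.pairs.add(pairKey(idA, idB));
  }

  // The worker checks this before flagging a match.
  isIgnored(idA: string, idB: string): boolean {
    return this.pairs.has(pairKey(idA, idB));
  }
}
```

Because a dismissal only adds a row and clears a flag, undoing it is equally cheap: remove the row and let the next worker pass flag the pair again.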

Exact duplicates are different. Those I do collapse automatically, because byte-identical means there is genuinely one file. Even then I keep both database records pointing at the shared object, so each upload keeps its own name, timestamp, and folder. If someone deletes one, the underlying bytes survive until the last reference is gone. Reference counting, nothing fancy.
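The reference counting can be sketched in memory like this. The class and method names are illustrative; in a real store the count would sit next to the object row and change inside the same transaction as the record insert or delete.

```typescript
class SharedObjectStore {
  // SHA-256 hex digest -> number of database records pointing at those bytes
  private readonly refCounts = new Map<string, number>();

  // Returns true when this is the first reference, meaning the bytes must be
  // written. Returns false when an existing object can be reused.
  addReference(digest: string): boolean {
    const current = this.refCounts.get(digest) ?? 0;
    this.refCounts.set(digest, current + 1);
    return current === 0;
  }

  // Returns true when the last reference is gone and the bytes can be deleted.
  removeReference(digest: string): boolean {
    const current = this.refCounts.get(digest);
    if (current === undefined) {
      throw new Error(`unknown digest: ${digest}`);
    }
    if (current > 1) {
      this.refCounts.set(digest, current - 1);
      return false;
    }
    this.refCounts.delete(digest);
    return true;
  }
}
```

Each upload still gets its own record with its own name, timestamp, and folder; only the bytes behind the digest are shared.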

The split comes down to confidence. SHA-256 collisions don't happen in practice, so an exact match is safe to act on automatically. A perceptual match is a strong guess, and a guess gets treated as a suggestion the user can wave off.


Does perceptual hashing work on non-image files?

Not the image version. dHash is built on pixels, so it only applies to images and to video frames if you sample them. For PDFs, audio, and arbitrary documents, Atlas falls back to exact SHA-256 only. There are perceptual schemes for audio and text, but I haven't found the false-positive math worth it for a general file store.

What Hamming distance threshold should I use?

Start at ≤ 6 for a 64-bit dHash and adjust from there. Lower it if your users complain about wrong matches. Raise it if obvious duplicates slip through. Test against your own files, because the right number depends on how your users edit and re-save images.

How much does it cost to store the hashes?

Almost nothing. A SHA-256 digest is 32 bytes and a dHash is 8 bytes, so a million files carry 40MB of raw hash data, plus whatever the indexes on those two columns add. The perceptual search is the only real cost, and bounding it to a single account's hashes keeps it fast.
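Those sizes only hold if the hashes are stored as raw bytes and not as hex text. A small sketch of the conversions, assuming Node's `Buffer`; the function names are illustrative.

```typescript
import { Buffer } from "node:buffer";

// 64-bit dHash -> 8 raw bytes, big-endian.
function dHashToBytes(hash: bigint): Buffer {
  const buf = Buffer.alloc(8);
  buf.writeBigUInt64BE(BigInt.asUintN(64, hash));
  return buf;
}

// 8 raw bytes -> 64-bit dHash.
function bytesToDHash(buf: Buffer): bigint {
  return buf.readBigUInt64BE(0);
}

// 64-character hex digest -> 32 raw bytes.
function sha256HexToBytes(hex: string): Buffer {
  return Buffer.from(hex, "hex");
}

// Raw hash bytes for a million files: (32 + 8) bytes each.
const rawHashBytes = (32 + 8) * 1_000_000; // 40,000,000 bytes, about 40MB
```

Storing the digest as a 64-character hex string would double its footprint, which is still small but worth knowing before you size the index.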

If you're adding dedupe to your own product, build the exact path first and ship it. It's a streaming hash and one indexed lookup, it can't slow anything down, and it handles the most common case where someone uploads the identical file twice. Add perceptual matching later, in a background worker, once you've decided you can treat its answers as suggestions instead of commands.

