A user goes for an 8,000-step walk. Their Apple Watch records it. Their iPhone, sitting in a pocket, records it too. The third-party running app they installed last month records it a third time. Read those records back naively, sum them, and your app proudly reports 24,000 steps.
This is the single most common — and most silently damaging — data-quality bug in health applications. It doesn’t throw an error. It doesn’t crash. It just quietly inflates every metric, corrupts every score built on top, and erodes user trust the moment someone notices their step count is physically impossible. Deduplication is the unglamorous engineering work that stands between raw health data and a number you can actually show a user.
This guide covers exactly how duplication happens, what Apple HealthKit and Google Health Connect do (and don’t) handle for you, and a deduplication strategy that holds up in production.
Why duplication happens in the first place
Health data duplication isn’t a glitch — it’s the predictable result of an architecture where many independent sources observe the same underlying reality. There are three distinct failure modes, and a robust solution has to handle all three.
1. Same metric, multiple devices. The user has more than one device measuring the same thing. An Apple Watch and an iPhone both count steps. An Oura Ring and an Apple Watch both track sleep. Each writes its own records, and they overlap in time.
2. Same data, multiple paths. A single source’s data arrives through more than one channel. A Fitbit might sync to your app directly and into Health Connect, so the same record appears twice with different provenance.
3. Partial interval overlap. The hardest case. Two sources don’t record identical time ranges — they overlap partially. The watch logs a workout from 6:00–6:45; the phone logs steps from 6:30–7:00. Naive summing double-counts the 6:30–6:45 window while correctly counting the rest. You can’t just dedup by exact timestamp.
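To make the partial-overlap case concrete, here is a minimal sketch using the 6:00–6:45 and 6:30–7:00 intervals from the example above. The date and the helper names (`minutes`, `overlap_minutes`) are illustrative assumptions, not part of any platform API.

```python
from datetime import datetime

def minutes(start: datetime, end: datetime) -> float:
    """Length of an interval in minutes (negative if end precedes start)."""
    return (end - start).total_seconds() / 60

def overlap_minutes(a, b) -> float:
    """Minutes during which both (start, end) intervals are active; 0 if disjoint."""
    latest_start = max(a[0], b[0])
    earliest_end = min(a[1], b[1])
    return max(0.0, minutes(latest_start, earliest_end))

# Arbitrary date; only the clock times come from the example in the text.
watch = (datetime(2024, 1, 1, 6, 0), datetime(2024, 1, 1, 6, 45))
phone = (datetime(2024, 1, 1, 6, 30), datetime(2024, 1, 1, 7, 0))

naive_total = minutes(*watch) + minutes(*phone)  # 75.0: counts 6:30-6:45 twice
contested = overlap_minutes(watch, phone)        # 15.0: the double-counted window
covered = naive_total - contested                # 60.0: actual time covered
```

Matching on exact timestamps would treat these two records as unrelated, which is why the contested window has to be computed from the interval bounds.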
What the platforms handle — and what they leave to you
Both major platforms provide some deduplication, but only under specific conditions. The most common production bug is assuming that protection is broader than it is.
Apple HealthKit
HealthKit stores samples from every source — Apple Watch, iPhone, and any app on either device — and each sample carries an HKSourceRevision describing who wrote it [1][2]. The critical distinction is how you read:
- Statistics queries deduplicate. If you read cumulative quantity types (like steps) with HKStatisticsQuery or HKStatisticsCollectionQuery, HealthKit merges overlapping samples for you and returns a value matching the system Health app [1][3]. When records overlap or one is fully contained in another, Apple’s internal logic excludes the redundant data so it isn’t double-counted [3].
- Raw sample queries do not. Read the same data with HKSampleQuery and you get every overlapping sample, untouched. Sum them yourself and you double-count.
- Sleep and category data are not auto-deduplicated. Sleep is stored as overlapping category samples, and HealthKit won’t reconcile them for you; developers routinely have to merge overlapping sleep intervals by hand to get a non-overlapping timeline [4].
The practical rule: for cumulative quantities, prefer statistics queries; for everything else, assume you own deduplication.
Google Health Connect
Health Connect takes a more explicit, priority-driven approach [5][6]:
- aggregate() deduplicates by priority. For cumulative types like StepsRecord, the aggregate API “accounts for any duplicate data and keeps only the data from the app with the highest priority” [5]. The ranking comes from a user-controlled priority list: when two apps disagree, the one higher on the list wins.
- readRecords() does not deduplicate. Raw reads return records from every source, each tagged with a dataOrigin identifying the writing app [6]. Reconciliation is yours.
- Metadata helps you filter. Records carry a recording method (actively recorded, automatically recorded, or manually entered), which you can use to drop noise. The priority-based dedup specifically governs actively recorded data.
The rule mirrors Apple’s: use aggregate() for cumulative metrics; the instant you call readRecords(), you’re deduplicating yourself.
The gap neither platform closes
Here’s the part teams discover too late: the OS stores only deduplicate within themselves. HealthKit knows nothing about Health Connect. Neither knows anything about data you pull directly from the Oura, WHOOP, or Garmin cloud APIs. The moment your product spans iOS and Android — or ingests data from a direct wearable integration alongside the phone’s health store — you are on your own for cross-source deduplication. No platform API helps you here.
A deduplication strategy that holds up
Whether you’re reconciling raw reads on one platform or merging across several, the same five-step framework applies.
1. Prefer aggregation APIs over raw reads
If all you need is a daily total for a cumulative metric, use HKStatisticsCollectionQuery or Health Connect’s aggregate() and let the platform do the merge. The most reliable deduplication code is the code you don’t have to write. Only drop to raw reads when you genuinely need per-sample detail.
2. Define a source-priority list per metric
When you do reconcile yourself, don’t treat all sources as equal — encode a trust hierarchy, and make it metric-specific [6]:
| Metric | Prefer | Over |
|---|---|---|
| Sleep stages | Dedicated sleep wearable (Oura, WHOOP) | Phone-estimated sleep |
| Steps / distance | Wrist wearable | Phone pedometer |
| Heart rate / HRV | Optical wearable | Manual entry |
| Workouts | The device that recorded the session | Generic step source |
The principle: a purpose-built sensor beats a general one, and an automatic measurement beats a manual guess. For each metric and time window, pick one source by priority rather than blending.
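One way to encode such a hierarchy is a plain lookup table keyed by metric. The sketch below is an assumption-laden illustration: the metric keys, source labels, and the `pick_source` helper are all hypothetical names, and the alphabetical fallback for unranked sources is just one deterministic choice.

```python
# Hypothetical labels; in practice these would map to bundle IDs, package
# names, or vendor identifiers from each record's provenance.
SOURCE_PRIORITY = {
    "sleep_stages": ["sleep_wearable", "phone"],
    "steps": ["wrist_wearable", "phone"],
    "heart_rate": ["optical_wearable", "manual"],
}

def pick_source(metric, available):
    """Return the highest-priority source present for this metric, or None."""
    for source in SOURCE_PRIORITY.get(metric, []):
        if source in available:
            return source
    # No ranked source is present: fall back deterministically (alphabetical)
    # so the same inputs always produce the same choice.
    return min(available) if available else None

chosen = pick_source("steps", {"phone", "wrist_wearable"})  # "wrist_wearable"
```

Keeping the table per metric means the same device can win for one metric and lose for another, which matches the table above.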
3. Reconcile overlapping intervals
For interval data — sleep sessions, workouts — exact-timestamp matching isn’t enough, because sources overlap partially. Build a single non-overlapping timeline: sort records by start time, and where two intervals overlap, keep the higher-priority source for the contested window and trim or drop the other. This is the step naive deduplication always misses, and it’s why summing “distinct” records still over-counts.
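A minimal sketch of this reconciliation follows. It takes a slightly different route from the sort-by-start sweep described above: sources are processed in priority order, and each lower-priority record keeps only the parts of its interval that nothing higher has already claimed. The `Interval` type, the minutes-since-midnight encoding, and the function names are assumptions made for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    start: int   # minutes since midnight, to keep the example short
    end: int
    source: str

def subtract(piece, claimed):
    """Return the parts of `piece` not covered by any interval in `claimed`."""
    remaining = [(piece.start, piece.end)]
    for c in claimed:
        next_remaining = []
        for s, e in remaining:
            if c.end <= s or c.start >= e:   # no overlap: keep as-is
                next_remaining.append((s, e))
                continue
            if s < c.start:                  # uncovered part before the claim
                next_remaining.append((s, c.start))
            if c.end < e:                    # uncovered part after the claim
                next_remaining.append((c.end, e))
        remaining = next_remaining
    return [Interval(s, e, piece.source) for s, e in remaining]

def reconcile(records, priority):
    """Build a non-overlapping timeline; earlier entries in `priority` win."""
    rank = {src: i for i, src in enumerate(priority)}
    ordered = sorted(records, key=lambda r: (rank.get(r.source, len(rank)), r.start))
    timeline = []
    for rec in ordered:
        timeline.extend(subtract(rec, timeline))
    return sorted(timeline, key=lambda r: r.start)

watch = Interval(360, 405, "watch")   # 6:00-6:45
phone = Interval(390, 420, "phone")   # 6:30-7:00
timeline = reconcile([phone, watch], ["watch", "phone"])
# timeline: watch keeps 6:00-6:45, phone is trimmed to 6:45-7:00
```

The trimmed phone interval still carries its original source label, so any value attached to it would need to be prorated or re-queried for the shorter window; the sketch only settles who owns which span of time.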
4. Filter on metadata before you trust a record
Use provenance to drop noise before deduplicating. Health Connect’s recording method lets you separate actively recorded data from manual entries; HealthKit’s HKSourceRevision tells you who wrote a sample, and the HKMetadataKeyWasUserEntered metadata key tells you whether a human typed it. A manually entered “8 hours of sleep” shouldn’t override a sensor-measured 6h12m.
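As a rough illustration, the filter below drops manual entries for any metric that also has a measured record. The `Record` type and its `user_entered` flag are hypothetical stand-ins for whatever provenance field your normalized schema carries, and a production version would likely compare within a time window rather than across the whole batch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    metric: str
    value: float
    source: str
    user_entered: bool  # hypothetical flag derived from platform metadata

def drop_manual_when_measured(records):
    """Keep manual entries only for metrics with no sensor-measured record."""
    measured = {r.metric for r in records if not r.user_entered}
    return [r for r in records if not (r.user_entered and r.metric in measured)]

records = [
    Record("sleep_minutes", 480, "manual_entry", True),   # typed "8 hours"
    Record("sleep_minutes", 372, "ring", False),          # measured 6h12m
    Record("weight_kg", 70.0, "manual_entry", True),      # no sensor alternative
]
filtered = drop_manual_when_measured(records)
```

Here the hand-typed sleep entry is dropped in favour of the measured one, while the manual weight survives because nothing measured competes with it.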
5. Handle the cross-source case explicitly
Because no OS store deduplicates across platforms or cloud APIs, you need your own reconciliation layer whenever data arrives from more than one origin. Normalize every source into a common schema first (different vendors use different units, field names, and sleep-stage definitions), then apply priority and interval reconciliation across the unified set. Deduplicating before normalizing is a common and painful mistake — you can’t compare records you haven’t aligned.
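The normalize-first ordering can be sketched as follows. Both vendor payload shapes are invented for illustration (the field names `distance_km`, `startTs`, `meters`, and so on are assumptions, not any real vendor’s schema); the point is only that records become comparable once units and timestamps are aligned.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Sample:
    metric: str
    start: datetime   # timezone-aware, UTC
    end: datetime
    value: float
    unit: str
    source: str

def from_vendor_a(payload):
    """Hypothetical vendor A: ISO 8601 timestamps with offset, distance in km."""
    return Sample(
        metric="distance",
        start=datetime.fromisoformat(payload["start"]).astimezone(timezone.utc),
        end=datetime.fromisoformat(payload["end"]).astimezone(timezone.utc),
        value=payload["distance_km"] * 1000,
        unit="m",
        source="vendor_a",
    )

def from_vendor_b(payload):
    """Hypothetical vendor B: epoch seconds, distance already in meters."""
    return Sample(
        metric="distance",
        start=datetime.fromtimestamp(payload["startTs"], tz=timezone.utc),
        end=datetime.fromtimestamp(payload["endTs"], tz=timezone.utc),
        value=float(payload["meters"]),
        unit="m",
        source="vendor_b",
    )

a = from_vendor_a({"start": "2024-01-01T06:00:00+00:00",
                   "end": "2024-01-01T06:45:00+00:00",
                   "distance_km": 5.0})
b = from_vendor_b({"startTs": 1704088800, "endTs": 1704091500, "meters": 5000})

# Only after normalization do the two records look like the same walk.
same_walk = (a.start, a.end, a.value, a.unit) == (b.start, b.end, b.value, b.unit)
```

Before normalization, a 5.0 and a 5000 with differently encoded timestamps would never match, so priority and interval reconciliation only make sense on the unified set.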
The edge cases that will bite you
Even with the framework above, a few scenarios cause production incidents:
- Resurrected backfill. Devices sometimes delete and re-write historical samples — old watchOS data can disappear and reappear years later with new identifiers [7], breaking dedup logic that assumes records are immutable. Key on content and time range, not just sample IDs.
- Timezone drift. A record’s local day depends on where the user was. Aggregate in the wrong timezone and a single night’s sleep splits across two days — creating apparent duplicates and gaps simultaneously.
- Manual entries that outrank sensors. Without metadata filtering, a user’s hand-typed estimate can override a precise measurement. Always weight provenance.
- Same brand, two channels. A Fitbit syncing both directly and via Health Connect needs source-level dedup, not just device-level: same data, two dataOrigin values.
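For the resurrected-backfill and two-channel cases, one option is to fingerprint records by content and time range instead of by ID or origin. The sketch below assumes timezone-aware datetimes and dictionary-shaped records with hypothetical field names; the rounding precision is an arbitrary choice, and two genuinely different sources reporting identical values over identical ranges would also collapse, which may or may not be what you want.

```python
import hashlib
from datetime import datetime, timedelta, timezone

def content_key(metric, start, end, value, precision=2):
    """Fingerprint a record by what it says, not by its ID or origin.

    Assumes `start` and `end` are timezone-aware datetimes.
    """
    start_utc = start.astimezone(timezone.utc).isoformat(timespec="seconds")
    end_utc = end.astimezone(timezone.utc).isoformat(timespec="seconds")
    raw = f"{metric}|{start_utc}|{end_utc}|{float(value):.{precision}f}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def dedupe_by_content(records):
    """Keep the first record seen for each content fingerprint."""
    seen, unique = set(), []
    for rec in records:
        key = content_key(rec["metric"], rec["start"], rec["end"], rec["value"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

utc = timezone.utc
cet = timezone(timedelta(hours=1))
records = [
    {"id": "a1", "origin": "direct", "metric": "steps", "value": 8000,
     "start": datetime(2024, 1, 1, 6, 0, tzinfo=utc),
     "end": datetime(2024, 1, 1, 6, 45, tzinfo=utc)},
    # Same walk via a second channel: new ID, new origin, local-time offsets.
    {"id": "b7", "origin": "health_connect", "metric": "steps", "value": 8000.0,
     "start": datetime(2024, 1, 1, 7, 0, tzinfo=cet),
     "end": datetime(2024, 1, 1, 7, 45, tzinfo=cet)},
]
unique = dedupe_by_content(records)
```

Converting to UTC before hashing also sidesteps one form of timezone drift, since the same instant expressed in two offsets produces the same key.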
What this means for builders
Deduplication looks like a small detail and behaves like a foundational one. Every score, trend, and insight your product computes inherits the errors of the data underneath it — and inflated, double-counted inputs are worse than missing ones, because they look plausible. Getting this right is a prerequisite for everything above it in the stack, not an optimization you bolt on later.
The honest assessment of the work: per-platform aggregation APIs cover the easy cumulative cases, but the real burden — raw-read reconciliation, partial-interval merging, metadata filtering, and especially cross-platform and cloud-API deduplication — is ongoing engineering that grows with every source you add. It’s also entirely undifferentiated; no user has ever chosen a product because its deduplication was elegant.
That’s exactly why it’s a strong candidate to push below your product line. A normalization layer that ingests across HealthKit, Health Connect, and direct wearable APIs, reconciles overlapping sources into a single deduplicated record per metric, and exposes one clean value is the boundary that lets your team build on health data instead of constantly repairing it. (It’s the layer we work on at Sahha.) However you draw that line, the goal is the same: turn three contradictory step counts into one number you’d be willing to show a user.
References
1. Apple Developer Forums. (2024). Step Data Duplication Issue with Apple Watch. https://developer.apple.com/forums/thread/759709
2. Apple Developer Documentation. HKSourceRevision. https://developer.apple.com/documentation/healthkit/hksourcerevision
3. Apple Developer Forums. Differences in Step Counts Between HealthKit and the Health App. https://developer.apple.com/forums/thread/769812
4. Apple Developer Forums. How can I get non-overlapping sleep samples from HealthKit in Swift? https://developer.apple.com/forums/thread/730258
5. Android Developers. Read aggregated data (Health Connect). https://developer.android.com/health-and-fitness/guides/health-connect/develop/aggregate-data
6. Android Developers. Read raw data (Health Connect). https://developer.android.com/health-and-fitness/health-connect/read-data
7. Apple Developer Forums. Old HealthKit samples from watchOS getting deleted and recreated years later. https://developer.apple.com/forums/thread/799882