How we handle activity data

Sahha Research

How we handle activity events

Sahha handles multiple challenges, from collecting critical health data logs across multiple sources to delivering accurate biomarkers and scores, which makes integrating our offerings a hassle-free experience for businesses.

Physical activity is one of the most important factors in reducing mental health issues such as anxiety, depression, and negative mood, and in improving self-esteem and cognitive function (1).

Data Collection from Various Sources

Leveraging the power of device variety

Devices measure physical activity differently because they vary in:

Sensors - Mobile phones and smart wearables typically use accelerometers, often combined with gyroscopes, to track steps. They can also use GPS and the user’s height to measure speed, distance, etc. Smartwatches have additional sensors for measuring blood oxygen levels and heart rate.
Device Placement - Even the same device with the same sensors can provide different readings depending on its placement. In the case of smartwatches, a snug fit and proper positioning can improve measurement accuracy (2).
Device Comfort - Comfort directly affects user experience and satisfaction.

Users also commonly carry or wear more than one device at a time during a physical activity.

Physical Activity Data Variety

We primarily capture steps and exercise data to understand physical activity. Determining an accurate step count is more challenging than extracting features from exercise data because of:

Randomness of movement - People move around without much thought, making it difficult for devices to determine if a user is moving or not. In contrast, exercises are actively tracked on health-specific apps.
Active Exercise Tracking - Exercises are actively tracked using sophisticated methods to determine exercise features accurately.

Data Ingestion and Cleanup

Real-time Ingestion

Physical activity data from multiple devices is ingested into activity, exercise, and energy source tables in a uniform format. These tables store hundreds of millions of rows from tens of thousands of users. Developers are expected to send activity data following our schema, which promotes scalability, maintainability, data uniformity, and standardization, and simplifies sending and processing events.
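
As a rough illustration of what a uniform event shape can look like, the sketch below defines a hypothetical event record and a basic validation step. The class name, field names, and checks are assumptions made for this example (they only mirror the parameters mentioned in this article: profile ID, timings, type, and value) and are not Sahha's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ActivityEvent:
    """Hypothetical uniform event shape; field names are illustrative only."""
    profile_id: str    # opaque profile identifier, no personal data
    event_type: str    # e.g. "steps" or "exercise"
    value: float       # e.g. step count for the interval
    start: datetime    # timezone-aware start of the interval
    end: datetime      # timezone-aware end of the interval
    source: str        # device or app that produced the log


def validate(event: ActivityEvent) -> list[str]:
    """Return a list of problems; an empty list means the event is well formed."""
    problems = []
    if not event.profile_id:
        problems.append("missing profile_id")
    if event.start.tzinfo is None or event.end.tzinfo is None:
        problems.append("timestamps must be timezone-aware")
    elif event.end < event.start:
        problems.append("end precedes start")
    if event.value < 0:
        problems.append("value must be non-negative")
    return problems
```

Rejecting malformed events at the point of ingestion keeps the later cleanup and aggregation steps simpler, since they can rely on every row having the same fields and sane values.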

Data Quality via Cleanup

Pre-filtering - Data points from source tables are filtered to a preset time window so that enough data is available for creating features, biomarkers, and scores while pipeline performance is maintained.
Trivial De-duplication - Events with the same parameters (profile ID, timings, type, and value) are de-duplicated to prevent recounting the same data point during aggregations.
Complex De-duplication -
- Walking cadence - Rows that do not meet the minimum walking cadence criteria are discarded.
- Device selection - The device generating the most steps in a given day is chosen, and step logs from other devices are filtered out. This approach favors consistency over accuracy and typically selects wearables.
- Time zone parity - If a user changes time zones mid-activity, the timings are transformed to the time zone where the activity started.
Inference Tagging - Each profile is tagged with an inference ID that tracks a unique inference daily for a given batch duration.
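
The cleanup steps above can be sketched in a few small functions. This is a minimal illustration under stated assumptions, not the production pipeline: rows are plain dictionaries, the key names (`profile_id`, `start`, `end`, `type`, `value`, `day`, `device`) are hypothetical, `start` and `end` are epoch seconds in the cadence check, and the cadence threshold `MIN_CADENCE_SPM` is a made-up value.

```python
from collections import defaultdict

# Assumed threshold: steps per minute below which a row is not treated as walking.
MIN_CADENCE_SPM = 30.0


def dedupe(rows):
    """Trivial de-duplication: drop rows whose key fields are all identical."""
    seen, unique = set(), []
    for r in rows:
        key = (r["profile_id"], r["start"], r["end"], r["type"], r["value"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def meets_cadence(row):
    """Keep step rows whose steps-per-minute reaches the minimum cadence."""
    minutes = (row["end"] - row["start"]) / 60.0  # start/end are epoch seconds
    return minutes > 0 and row["value"] / minutes >= MIN_CADENCE_SPM


def select_device(rows):
    """Per profile and day, keep only rows from the device with the most steps."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["profile_id"], r["day"], r["device"])] += r["value"]
    best = {}
    for (pid, day, device), steps in totals.items():
        if (pid, day) not in best or steps > best[(pid, day)][1]:
            best[(pid, day)] = (device, steps)
    return [r for r in rows if best[(r["profile_id"], r["day"])][0] == r["device"]]


def align_to_start_zone(start, end):
    """Time zone parity: express the end time in the zone where the activity began.

    Expects timezone-aware datetime objects.
    """
    return start, end.astimezone(start.tzinfo)


def clean(rows):
    """Apply de-duplication, the cadence filter, and device selection in order."""
    rows = dedupe(rows)
    rows = [r for r in rows if meets_cadence(r)]
    return select_device(rows)
```

Running de-duplication before device selection matters in this sketch: a device that sent the same log twice would otherwise have an inflated daily total and could be selected for the wrong reason.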

Advanced Data Aggregation for Complex Features

Data Processing through Aggregation

Aggregation converts raw data into actionable features. We leverage aggregation to accurately combine data from different sources, time scales, and units.

Types of Aggregation
- Temporal Aggregation - Grouping data by time intervals. This helps identify trends that change over time and behavioral changes in the user. It can operate at any time scale, from minutes (e.g. heart rate) to years (e.g. changes due to consistent physical activity).
- Spatial Aggregation - Grouping data by geographical regions over a set period. This helps in understanding trends influenced by geographical parameters, such as physical activity in different locations. It can range from individual locations (e.g. physical activity in different parks, offices, footpaths, etc.) to countries (e.g. energy burned through exercise in different countries).
A simplified example to distinguish temporal aggregation and spatial aggregation:
Say a young person lives in a plains region for 10 years and then moves to a mountain range for the next 10 years. Temporal aggregation would summarize their walking activity over the 20 years, while spatial aggregation would highlight differences in activity between the plains and the mountains. On average, a young person living on the plains will have higher physical activity than an older person living in the mountains, because the terrain makes movement more difficult.
Many internal and external factors affect a person’s physical activity, but their influence can still be captured by aggregated data. In the example above, one dimension also influences the other, though with a lower impact: the temporal aggregates would generally show a declining number of walking steps over the years, due to both age and the change in elevation and region. While the underlying factors may not be evident or easy to identify, temporal aggregation still captures a lot of information.
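
The plains-and-mountains example can be made concrete with a small sketch. The records and their step values below are invented purely for illustration, and `mean_by` is a hypothetical helper; the point is only that the same rows give a trend over time when grouped by year and a regional comparison when grouped by region.

```python
from collections import defaultdict

# Illustrative records: (year, region, average daily steps). Values are made up.
records = [
    (2005, "plain", 9000),
    (2006, "plain", 8800),
    (2015, "mountain", 6500),
    (2016, "mountain", 6300),
]


def mean_by(records, key_index):
    """Average the step values grouped by the chosen column."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec[2])
    return {k: sum(v) / len(v) for k, v in groups.items()}


temporal = mean_by(records, 0)  # grouped by year: trend over time
spatial = mean_by(records, 1)   # grouped by region: plain vs mountain
```

Here `temporal` still shows the decline after the move, even though it never sees the region column, which is the sense in which one dimension influences the other.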

We will not focus on Spatial Aggregation from here on, for two reasons:
- Temporal aggregation alone is enough for understanding most trends in physical activity.
- Spatial data such as the user’s residence, office location, etc. can be a security concern and should be avoided unless necessary. This is also covered in the Data Privacy and Compliance section.

Multi-layered Aggregation

Aggregation is handled through multiple layers to evolve raw data into complex features. The following aggregation types cover low-level to high-level features -

Hourly Aggregation - Activity events are distributed for each hour, which can be further aggregated.
Daily Aggregation - Daily features created from activity events include -
- Total Steps
- Active Duration
- Active Calories Burned - Amount of energy burned from physical activity. This is in excess of the energy the body burns to keep its basic functions running (known as Basal Metabolic Rate or BMR (3)).
- Low Intensity Activity Duration - Duration of physical activity whose MET < 3. (Metabolic Equivalent, or MET, is the energy cost of a physical activity. A MET of 1 is equivalent to the BMR, as the user is not performing any physical activity. MET increases with activity intensity.)
- Medium Intensity Activity Duration - Duration of physical activity whose MET is from 3 to 6.
- High Intensity Activity Duration - Duration of physical activity whose MET > 6.
- Stand Hours - Number of hours in which the user is active for more than a minute.
- Active Hours - Total number of hours in which the user is exercising, or walking more than a minimum threshold number of steps.
- Activity Sedentary Duration - Duration for which the user is neither sleeping nor engaged in any activity.
- Floors Climbed
Weekly Aggregation - Weekly aggregates are useful for machine learning models that evaluate health scores for users, because they cover a larger time scale. They are built on top of both hourly and daily aggregates, although some are built exclusively from raw data. They include -
- Average Daily Steps
- Total Active Duration
- Total Active Hours
- Activity Goals - Total number of days when the user has walked more than a pre-defined number of steps
- Average Daily Active Calories Burned
- Average Daily High Intensity Activity Duration
- Average Daily Stand Hours
- Total Sedentary Hours - Total number of hours when the user has not walked more than a threshold number of steps
- Sedentary Periods - Number of continuous periods of time when the user did not have an active hour
- Activity Deviation - Deviation of activity for each hour across the week
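
The layering described above can be sketched as follows. This is an assumption-laden illustration rather than the actual pipeline: it assumes an event's value is spread across clock hours in proportion to the time it overlaps each hour, and all function names are hypothetical. The intensity buckets use the MET thresholds listed above.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_by_hour(start, end, value):
    """Distribute an event's value across clock hours in proportion to overlap."""
    total = (end - start).total_seconds()
    hourly = defaultdict(float)
    if total <= 0:
        return hourly
    cursor = start
    while cursor < end:
        hour_start = cursor.replace(minute=0, second=0, microsecond=0)
        segment_end = min(end, hour_start + timedelta(hours=1))
        share = (segment_end - cursor).total_seconds() / total
        hourly[hour_start] += value * share
        cursor = segment_end
    return hourly


def intensity(met):
    """Bucket a MET value: below 3 is low, 3 to 6 is medium, above 6 is high."""
    if met < 3:
        return "low"
    if met <= 6:
        return "medium"
    return "high"


def daily_steps(hourly):
    """Roll hourly values up into per-day totals."""
    days = defaultdict(float)
    for hour_start, steps in hourly.items():
        days[hour_start.date()] += steps
    return days


def average_daily_steps(days):
    """Weekly-style feature: mean of the daily totals supplied."""
    return sum(days.values()) / len(days) if days else 0.0


# A 600-step walk from 09:30 to 10:30 is split evenly across two clock hours.
event_hours = split_by_hour(datetime(2024, 1, 1, 9, 30), datetime(2024, 1, 1, 10, 30), 600)
```

Building daily and weekly features from the hourly layer, rather than recomputing from raw events each time, means each raw event only needs to be split once.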

Activity Biomarkers and Scores

The aggregated data can then be used to generate biomarkers and scores.

Biomarkers - Indirect indicators of physical activity, such as daily steps or active duration. Biomarkers help users track their physical activity trends, which is especially useful for more health-minded people. Any aggregation can be chosen as a biomarker if it is viable enough. A biomarker can be a single aggregation or a mixture of multiple aggregations.
Scores - Built on aggregated data using scoring functions. Scores provide a direct reference for user performance in various aspects. The factors that make up a score generally each draw on multiple aggregations to calculate a factor score.

Developers using Sahha can choose specific offerings to suit their requirements. For example, if your use case depends only on the user’s activity duration, you can pick the duration biomarker, the duration score, or both. This reduces data footprint, data overhead, and architecture costs.

Data Privacy and Compliance

Data privacy is critical for handling personal data safely and confidentially, and privacy and compliance are vital for maintaining trust and reputation with the community.

No Personal Data - The data warehouse does not contain personal or identifiable data. It stores only the user’s profile ID, which maps to data in a separate database hosted elsewhere.
Data Anonymization - Data points are tagged with a UUID, preventing identification of the actual person.
End-to-End Encryption - The ELT pipeline is encrypted at all stages. Clients must comply with our encryption standards.
Strict Authorization - Only specific people and services can access the data. Modifications are strictly controlled and are made only when requested by the user, for example to change their input data or remove their profile.

Conclusion

Activity events undergo comprehensive data processing to ensure clean and standardized data. The cleaned data is then temporally aggregated to produce biomarkers and scores while adhering to strict data privacy practices.

References

(1) Sharma A. et al. (2006). Exercise for Mental Health. Prim Care Companion J Clin Psychiatry, 8(2), 106. doi: 10.4088/pcc.v08n0208a. PMCID: PMC1470658. PMID: 16862239.
(2) Martín-Escudero P. et al. (2023). Are Activity Wrist-Worn Devices Accurate for Determining Heart Rate during Intense Exercise? Bioengineering, 10(2), 254. Retrieved from MDPI.
(3) Britannica. https://www.britannica.com/science/human-nutrition/BMR-and-REE-energy-balance
