Sahha Research
How we handle activity events
Sahha tackles several challenges, from collecting critical health data logs from multiple sources to delivering accurate biomarkers and scores. This makes integration a hassle-free experience for businesses using our offerings.
Physical activity is one of the most important factors in reducing mental health issues such as anxiety, depression, and negative mood, and in improving self-esteem and cognitive function (1).
Data Collection from Various Sources
Leveraging the power of device variety
Devices measure physical activity differently because they vary in:
Sensors - Mobile phones and smart wearables typically use accelerometers, often together with gyroscopes, to track steps. They can also use GPS and the user’s height to measure speed, distance, etc. Smartwatches have additional sensors for measuring blood oxygen levels and heart rate.
Device Placement - Even the same device with the same sensors can provide different readings depending on its placement. In the case of smartwatches, a snug fit and proper positioning can improve measurement accuracy (2).
Device Comfort - It directly affects user experience and satisfaction.
In addition, users often carry or wear multiple devices at the same time during physical activity.
Physical Activity Data Variety
We primarily capture steps and exercise data to understand physical activity. Determining an accurate step count is more challenging than extracting features from exercise data because of:
Randomness of movement - People move around without much thought, making it difficult for devices to determine whether a user is moving or not.
Active Exercise Tracking - In contrast, exercises are actively tracked on health-specific apps, which use sophisticated methods to determine exercise features accurately.
Data Ingestion and Cleanup
Real-time Ingestion
Physical activity data from multiple devices is ingested into activity, exercise, and energy source tables in a uniform format. These tables store hundreds of millions of rows from tens of thousands of users. Developers are expected to send activity data following our schema, which keeps the data standardized and the pipeline scalable and maintainable. This simplifies both sending and processing events.
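To make the idea of a uniform event format concrete, here is a minimal sketch of what a single activity event could look like. It is not Sahha's actual schema: the class name ActivityEvent and every field name are assumptions chosen for illustration, based only on the parameters mentioned in this article (profile ID, timings, type, value, and source device).

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ActivityEvent:
    """Hypothetical uniform event record; all field names are illustrative."""
    profile_id: str
    event_type: str   # e.g. "steps" (assumed label)
    value: float
    start: datetime   # expected to be timezone-aware
    end: datetime     # expected to be timezone-aware
    source: str       # device or app that produced the log

    def __post_init__(self):
        # Reject records that would break downstream aggregation.
        if self.start.tzinfo is None or self.end.tzinfo is None:
            raise ValueError("timestamps must be timezone-aware")
        if self.end < self.start:
            raise ValueError("end must not precede start")
        if self.value < 0:
            raise ValueError("value must be non-negative")


event = ActivityEvent(
    profile_id="profile-123",
    event_type="steps",
    value=420,
    start=datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 1, 9, 5, tzinfo=timezone.utc),
    source="watch",
)
```

Validating each record on arrival is one way to keep every source table in the same shape, so later cleanup and aggregation steps do not need device-specific handling.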
Data Quality via Cleanup
Pre-filtering - Data points from source tables are filtered on a preset window size to ensure enough data is used for creating features, biomarkers, and scores while maintaining pipeline performance.
Trivial De-duplication - Events with the same parameters (profile ID, timings, type, and value) are de-duplicated to prevent recounting the same data point during aggregations.
Complex De-duplication -
Walking cadence - Rows that do not meet the minimum walking cadence criteria are discarded.
Device selection - The device generating the most steps in a given day is chosen, and step logs from other devices are filtered out. This approach favors consistency over accuracy and typically selects wearables.
Time zone parity - If a user changes time zones mid-activity, the timings are transformed to the time zone where the activity started.
Inference Tagging - Each profile is tagged with an inference ID that tracks a unique inference daily for a given batch duration.
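The cleanup steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production pipeline: the row layout, the function names, and the cadence threshold MIN_CADENCE_SPM are all hypothetical, and time zone handling and inference tagging are left out.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical row layout: (profile_id, source, start, end, event_type, value)
MIN_CADENCE_SPM = 30.0  # assumed minimum walking cadence, in steps per minute


def dedupe(rows):
    """Trivial de-duplication: keep the first row for each
    (profile ID, timings, type, value) combination."""
    seen = set()
    unique = []
    for row in rows:
        profile_id, _source, start, end, event_type, value = row
        key = (profile_id, start, end, event_type, value)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def meets_cadence(row):
    """Return True if the row's steps per minute reach the assumed minimum."""
    _profile, _source, start, end, _type, value = row
    minutes = (end - start).total_seconds() / 60
    return minutes > 0 and value / minutes >= MIN_CADENCE_SPM


def select_device(rows):
    """Per profile and day, keep only rows from the source with the most steps."""
    totals = defaultdict(float)
    for profile_id, source, start, _end, _type, value in rows:
        totals[(profile_id, start.date(), source)] += value
    best = {}
    for (profile_id, day, source), total in totals.items():
        key = (profile_id, day)
        if key not in best or total > best[key][1]:
            best[key] = (source, total)
    return [r for r in rows if best[(r[0], r[2].date())][0] == r[1]]


def at(hour, minute):
    return datetime(2024, 1, 1, hour, minute)


rows = [
    ("p1", "phone", at(9, 0), at(9, 10), "steps", 800),
    ("p1", "phone", at(9, 0), at(9, 10), "steps", 800),  # exact duplicate
    ("p1", "watch", at(9, 0), at(9, 10), "steps", 1000),
    ("p1", "watch", at(10, 0), at(11, 0), "steps", 60),  # 1 step per minute
]
clean = select_device([r for r in dedupe(rows) if meets_cadence(r)])
```

In this example the duplicate phone row and the low-cadence watch row are dropped first, and the watch is then selected for the day because it reported more steps than the phone.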
Advanced Data Aggregation for Complex Features
Data Processing through Aggregation
Aggregation converts raw data into actionable features. We leverage aggregation to accurately combine data from different sources, time scales, and units.
Types of Aggregation
Temporal Aggregation - Grouping data by time intervals. It is useful for identifying trends that change over time and for understanding behavioral changes in the user. It can operate on any time scale, from minutes (e.g. heart rate) to years (e.g. changes due to consistent physical activity).
Spatial Aggregation - Grouping data by geographical regions over a set period. This helps in understanding trends influenced by geographical parameters, such as physical activity in different locations. It can range from individual locations (e.g. physical activity in different parks, offices, footpaths, etc.) to countries (e.g. energy burned through exercise in different countries).
A simplified example to distinguish temporal aggregation and spatial aggregation:
Let's say a young person lives in a plain region for 10 years and then moves to a mountain range for the next 10 years. Temporal aggregation would summarize their walking activity over 20 years, while spatial aggregation would highlight differences in activity between the plain and mountain regions. On average, a young person living in a plain area will have higher physical activity than an older person living in the mountains, due to the difficulty of movement in the terrain.
Although many internal and external factors affect a person’s physical activity, it is important to understand that even these influences can be captured by aggregated data. In the example above, one dimension also influences the other, though with a lower impact: the data might show a declining number of walking steps over the years due to both age and the change in elevation or region. While the actual factors may not be evident or easy to figure out, temporal aggregation still captures a lot of information.
We will not focus on Spatial Aggregation from here on, for two reasons:
Temporal aggregation alone is enough for understanding most trends in physical activity.
Spatial data such as the user’s residence, office location, etc. can be a security concern and should be avoided unless necessary. This is also covered in the Data Privacy and Compliance section.
Multi-layered Aggregation
Aggregation is handled through multiple layers to evolve raw data into complex features. The following aggregation types cover low-level to high-level features -
Hourly Aggregation - Activity events are distributed for each hour, which can be further aggregated.
Daily Aggregation - Daily features created from activity events include -
Total Steps
Active Duration
Active Calories Burned - Amount of energy burned from physical activity. This energy is in addition to the energy the body burns to keep bodily functions active (known as Basal Metabolic Rate or BMR (3)).
Low Intensity Activity Duration - Duration of physical activity whose MET < 3. The Metabolic Equivalent of Task (MET) is defined as the energy cost of a physical activity. A MET of 1 corresponds to the energy burned at rest, when the user is not performing any physical activity. MET increases with the activity intensity.
Medium Intensity Activity Duration - Duration of physical activity whose MET is from 3 to 6.
High Intensity Activity Duration - Duration of physical activity whose MET > 6.
Stand Hours - Number of hours when the user is active for more than a minute
Active Hours - Total number of hours when the user is exercising or walking for more than a minimum threshold number of steps
Activity Sedentary Duration - Covers the duration when the user is neither sleeping nor engaged in any activity
Floors Climbed
Weekly Aggregation - Weekly aggregates cover a larger time scale, which makes them useful for Machine Learning models that evaluate users' health scores. They are built on top of both hourly and daily aggregates, although some are built exclusively from raw data. They include -
Average Daily Steps
Total Active Duration
Total Active Hours
Activity Goals - Total number of days when the user has walked more than a pre-defined number of steps
Average Daily Active Calories Burned
Average Daily High Intensity Activity Duration
Average Daily Stand Hours
Total Sedentary Hours - Total number of hours when the user has not walked more than a threshold number of steps
Sedentary Periods - Number of continuous periods of time when the user did not have an active hour
Activity Deviation - Deviation of activity for each hour across the week
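The layering described above can be sketched as follows. This is a simplified illustration, not the actual pipeline: the function names, the proportional split of an event across the hours it spans, and the step goal of 8000 are all assumptions. The intensity helper uses the MET bands listed under Daily Aggregation.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hourly_steps(events):
    """Distribute each (start, end, steps) event across the clock hours it
    spans, in proportion to the time spent in each hour (an assumption)."""
    buckets = defaultdict(float)
    for start, end, steps in events:
        total = (end - start).total_seconds()
        if total <= 0:
            continue
        cursor = start
        while cursor < end:
            hour = cursor.replace(minute=0, second=0, microsecond=0)
            nxt = min(hour + timedelta(hours=1), end)
            buckets[hour] += steps * (nxt - cursor).total_seconds() / total
            cursor = nxt
    return dict(buckets)


def daily_steps(hourly):
    """Roll hourly buckets up into total steps per calendar day."""
    days = defaultdict(float)
    for hour, steps in hourly.items():
        days[hour.date()] += steps
    return dict(days)


def weekly_features(daily, goal=8000):
    """Build two weekly features from daily totals; goal is an assumed target."""
    values = list(daily.values())
    if not values:
        return {"average_daily_steps": 0.0, "activity_goal_days": 0}
    return {
        "average_daily_steps": sum(values) / len(values),
        "activity_goal_days": sum(1 for v in values if v > goal),
    }


def intensity(met):
    """Map a MET value to the low / medium / high bands used above."""
    if met < 3:
        return "low"
    if met <= 6:
        return "medium"
    return "high"


events = [
    (datetime(2024, 1, 1, 9, 30), datetime(2024, 1, 1, 10, 30), 6000),
    (datetime(2024, 1, 2, 18, 0), datetime(2024, 1, 2, 18, 30), 9000),
]
hourly = hourly_steps(events)
daily = daily_steps(hourly)
weekly = weekly_features(daily)
```

Here the first event is split evenly between the 09:00 and 10:00 buckets, the daily totals are 6000 and 9000 steps, and the weekly features report an average of 7500 daily steps with one day above the assumed goal.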
Activity Biomarkers and Scores
The aggregated data can then be used to generate biomarkers and scores.
Biomarkers - Indirect indicators of physical activity, such as daily steps or active duration. Biomarkers help users track their physical activity trends, which is especially useful for health-minded people. Any aggregation can be chosen as a biomarker if it is viable enough. A biomarker can be a single aggregation or a combination of multiple aggregations.
Scores - Built on aggregated data using scoring functions. Scores provide direct references to user performance in various aspects. The factors behind a score generally draw on multiple aggregations to calculate a factor score.
Developers using Sahha can choose specific offerings as per their requirements. For example, if your use case is tied specifically to the user’s activity duration, you can pick the duration biomarker, the duration score, or both. This helps reduce data footprint, data overhead, and architecture costs.
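As a rough illustration of how a scoring function might combine several aggregations, the sketch below scores each factor against a target and takes a weighted average. The function names, targets, weights, and the clamped linear formula are all assumptions made for this example; they are not Sahha's actual scoring functions.

```python
def factor_score(value, target):
    """Score one aggregation as value / target, clamped to the range [0, 1]."""
    if target <= 0:
        raise ValueError("target must be positive")
    return max(0.0, min(1.0, value / target))


def activity_score(aggregates, targets, weights):
    """Weighted average of factor scores over the aggregations named in weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(
        weights[name] * factor_score(aggregates[name], targets[name])
        for name in weights
    )
    return weighted / total_weight


# Hypothetical weekly aggregates, targets, and weights.
aggregates = {"average_daily_steps": 7500, "average_daily_stand_hours": 12}
targets = {"average_daily_steps": 10000, "average_daily_stand_hours": 12}
weights = {"average_daily_steps": 0.5, "average_daily_stand_hours": 0.5}

score = activity_score(aggregates, targets, weights)
```

With these example numbers the steps factor scores 0.75 and the stand hours factor scores 1.0, giving an overall score of 0.875. Restricting the weights to the aggregations a use case needs mirrors the idea of picking only specific offerings.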
Data Privacy and Compliance
Data privacy is critical for handling personal data safely and confidentially, and together with compliance it is vital for maintaining trust and reputation with the community.
No Personal Data - The data warehouse does not contain personal or identifiable data. It only stores the user’s profile ID which maps to data in a separate database hosted elsewhere.
Data Anonymization - Data points are tagged with a UUID, preventing identification of the actual person.
End-to-End Encryption - The ELT pipeline is encrypted at all stages. Clients must comply with our encryption standards.
Strict Authorization - Only specific people and services can access the data. Modifications are strictly controlled and are made only when requested by the user, such as changing their input data or removing their profile.
Conclusion
Activity events undergo comprehensive data processing to ensure clean and standardized data. The cleaned data is then temporally aggregated to produce biomarkers and scores while adhering to strict data privacy practices.
References
1. Sharma A. et al. (2006). Exercise for Mental Health. Prim Care Companion J Clin Psychiatry, 8(2), 106. doi: 10.4088/pcc.v08n0208a. PMCID: PMC1470658. PMID: 16862239.
2. Martín-Escudero P. et al. (2023). Are Activity Wrist-Worn Devices Accurate for Determining Heart Rate during Intense Exercise? Bioengineering, 10(2), 254.
3. https://www.britannica.com/science/human-nutrition/BMR-and-REE-energy-balance