Class recommendation with archetypes

Ming Xuan Samuel Tan

Executive Summary

  • We investigated the feasibility of building a similarity-based class recommendation system by examining whether users who engage in multiple sports prefer sports that are similar to their preferred exercise types (i.e. the archetypes primary_exercise and secondary_exercises).

  • 78 unique sports were identified in sessions logged by 499 users during the test period. Sports were compared along five dimensions (Structure, Location, Objective, Intensity, and Skill Required) to generate an exercise-type similarity matrix.

  • Among the 249 users with at least 10 logged sessions, 34.9% recorded at least one sport from the 10 sports most similar to their primary_exercise, and 82.7% recorded at least one sport from either the top 5 sports most similar to their primary_exercise or the top 5 sports most similar to their secondary_exercise. Both observed proportions are statistically significant (p-value: <0.0005).

  • Users tend to engage in sports similar to their preferred sports, as captured by primary_exercise and secondary_exercises. Sport similarity measures can be combined with exercise preference archetypes to design personalized sport/exercise recommendation systems.


Introduction

Sports preferences are shaped by a complex interplay of personal, social, and environmental factors. Previous research highlights that an individual's history of sports participation plays a crucial role in determining their current and future sports interests (source). Moreover, the specific characteristics of the sports one has participated in, such as intensity, skill requirements, structure, and setting, also influence their preferences. These factors are further shaped by cultural norms and environmental accessibility (source).

Recent advancements in wearable technology and smartphone-based fitness tracking have generated an unprecedented volume of granular data on individual exercise habits. Millions of users now regularly log not just their physical activity levels, but also the exact sports or fitness activities they engage in. This offers a unique opportunity to empirically examine patterns in sports selection.

Given that past participation and the characteristics of previous sports strongly influence future preferences, we hypothesize that analyzing users’ logged sports activities can reveal underlying patterns, specifically that users tend to engage in sports with similar attributes. This opens the possibility of using such data to recommend new sports or exercise classes aligned with an individual’s preferences, thereby enhancing engagement and diversification in physical activity.

In this report, we construct a sports taxonomy based on the similarity between sports across five key characteristics: Structure, Location, Objective, Intensity, and Skill Required. Using this taxonomy, we identify sports similar to a user’s most preferred sports as captured by the archetypes *primary_exercise* and *secondary_exercises*. The resulting list of similar sports was compared against the user’s exercise records to test the hypothesis that individuals tend to engage in sports sharing similar characteristics.


Methods

Dataset

Exercise sessions logged between October 2024 and January 2025 were compiled. A total of 12,852 sessions from 499 unique users were logged during the assessment period, covering 78 unique sports/exercise types. Seven of these were excluded as non-specific ("other", "preparation and recovery", "stretching", "cooldown", "walking", "workout", and "flexibility"), giving a final set of 71 unique sports. To ensure meaningful activity patterns, we filtered for users who reported at least 10 exercise sessions during this period, resulting in a subset of 11,805 exercise sessions from 249 users. The mean [SD] number of sessions logged per user in this subset was 47.4 [128.6].
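The filtering steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in the study: the (user_id, sport) tuple layout, the helper names active_users and sports_by_user, and the toy records are all assumptions. Only the excluded labels and the 10-session threshold come from the text. The sketch also assumes that all logged sessions count towards the threshold and that non-specific sports are dropped afterwards; the report does not state the order.

```python
from collections import Counter, defaultdict

# Non-specific labels excluded in the text.
NON_SPECIFIC = {
    "other", "preparation and recovery", "stretching", "cooldown",
    "walking", "workout", "flexibility",
}
MIN_SESSIONS = 10  # threshold stated in the text


def active_users(sessions, min_sessions=MIN_SESSIONS):
    """Return the set of users with at least `min_sessions` logged sessions."""
    counts = Counter(user for user, _sport in sessions)
    return {user for user, n in counts.items() if n >= min_sessions}


def sports_by_user(sessions, users):
    """Map each retained user to the set of specific sports they logged."""
    logged = defaultdict(set)
    for user, sport in sessions:
        if user in users and sport not in NON_SPECIFIC:
            logged[user].add(sport)
    return dict(logged)


# Toy records (hypothetical): u1 logs 10 sessions, u2 logs only 2.
toy_sessions = (
    [("u1", "running")] * 6
    + [("u1", "cycling")] * 3
    + [("u1", "stretching")]
    + [("u2", "yoga")] * 2
)
kept = active_users(toy_sessions)
logged_sports = sports_by_user(toy_sessions, kept)
print(sorted(kept))
print({user: sorted(sports) for user, sports in logged_sports.items()})
```

In the toy data, u2 falls below the threshold and is dropped, while u1's "stretching" session counts towards the threshold but does not appear in the set of sports.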

Sport Similarity Scoring

Sports were compared across the five dimensions listed below, and the similarity between each pair of sports in each dimension was scored on a scale of 0 to 1:

  1. Structure – Similarity in the format and rules.

  2. Location – Whether they are typically performed indoors/outdoors, in specific venues, etc.

  3. Objective – Competitive vs. recreational, individual vs. team-based.

  4. Intensity – Physical exertion level required.

  5. Skill Required – Technical skill level or learning curve.

Scores across all dimensions were averaged to yield an overall similarity score. We then performed hierarchical clustering with Ward’s linkage on the sports to generate a dendrogram to aid visualization.
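As an illustration of the averaging step, the sketch below combines per-dimension scores for pairs of sports into an overall similarity. The per-dimension values, the sport names, and the overall_similarity helper are hypothetical; the report does not list the actual scores.

```python
from statistics import mean

DIMENSIONS = ("structure", "location", "objective", "intensity", "skill_required")

# Hypothetical per-dimension similarity scores (0 to 1) for unordered sport pairs.
dimension_scores = {
    frozenset({"running", "trail running"}): {
        "structure": 1.0, "location": 0.5, "objective": 1.0,
        "intensity": 0.75, "skill_required": 0.75,
    },
    frozenset({"running", "yoga"}): {
        "structure": 0.25, "location": 0.0, "objective": 0.5,
        "intensity": 0.25, "skill_required": 0.5,
    },
}


def overall_similarity(scores):
    """Average the five dimension scores into a single similarity in [0, 1]."""
    return mean(scores[d] for d in DIMENSIONS)


similarity = {pair: overall_similarity(s) for pair, s in dimension_scores.items()}
for pair, value in similarity.items():
    print(sorted(pair), round(value, 2))
```

With these toy values, running and trail running average to 0.8 and running and yoga to 0.3. A distance for clustering could then be taken as 1 minus the similarity; that conversion is an assumption, since the report does not state how the input to Ward’s linkage was derived.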

Hypothesis Testing

This similarity score was used to identify the sports most similar to each user’s preferred sports, as captured by the archetypes primary_exercise and secondary_exercise.

Two methods for identifying similar sports were tested:

  • Method 1: using the 10 sports most similar to a user’s primary_exercise.

  • Method 2: using the 5 sports most similar to primary_exercise and 5 sports most similar to secondary_exercise.

For both methods, we calculated the proportion of users whose list of recorded sports overlapped with the list of similar sports in at least one sport.
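The two methods and the overlap proportion can be sketched as follows. This is a minimal sketch under stated assumptions: the nested-dictionary similarity layout, the function names, the alphabetical tie-breaking, and the toy values are all hypothetical. The sketch excludes a sport from its own list of similar sports but does not otherwise filter the candidate lists, a detail the report does not specify.

```python
def top_similar(sport, similarity, k):
    """Return the k sports most similar to `sport`, excluding the sport itself."""
    others = [s for s in similarity[sport] if s != sport]
    # Sort by descending similarity; ties are broken alphabetically (assumption).
    others.sort(key=lambda s: (-similarity[sport][s], s))
    return others[:k]


def candidates_method1(primary, similarity, k=10):
    """Method 1: the k sports most similar to the primary exercise."""
    return set(top_similar(primary, similarity, k))


def candidates_method2(primary, secondary, similarity, k=5):
    """Method 2: the k most similar to the primary plus the k most similar to the secondary."""
    return set(top_similar(primary, similarity, k)) | set(top_similar(secondary, similarity, k))


def overlap_proportion(logged, candidates):
    """Share of users whose logged sports intersect their candidate list."""
    hits = sum(1 for user in logged if logged[user] & candidates[user])
    return hits / len(logged)


# Toy similarity matrix and users (hypothetical values).
similarity = {
    "running": {"cycling": 0.8, "yoga": 0.3, "tennis": 0.5},
    "cycling": {"running": 0.8, "yoga": 0.2, "tennis": 0.4},
    "yoga": {"running": 0.3, "cycling": 0.2, "tennis": 0.1},
    "tennis": {"running": 0.5, "cycling": 0.4, "yoga": 0.1},
}
logged = {"u1": {"running", "cycling"}, "u2": {"yoga"}}
primary = {"u1": "running", "u2": "yoga"}

candidates = {u: candidates_method1(primary[u], similarity, k=1) for u in logged}
print(overlap_proportion(logged, candidates))  # 0.5
```

In the toy data, u1 logs cycling (the sport most similar to running) while u2 logs nothing from yoga's list, so one of two users overlaps.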

Permutation testing was performed to assess whether the observed overlap between recorded sports and similar sports was statistically significant. A total of 5,000 iterations were performed. In each iteration, instead of using the similarity scores, sports were randomly selected without replacement for each user to simulate the null hypothesis (H₀: users engage in sports independently of sport similarity). The p-value was calculated as the proportion of iterations in which the overlap was greater than or equal to the observed overlap.
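A minimal sketch of this permutation test is shown below. The function name permutation_p_value, its parameters, the fixed seed, and the toy data are assumptions for illustration. The sketch draws each user's random list from the full pool of sports with a fixed list size; the report does not state whether the pool or the list size varied per user.

```python
import random


def permutation_p_value(logged, all_sports, list_size, observed, n_iter=5000, seed=0):
    """Proportion of iterations whose overlap proportion is >= `observed`.

    In each iteration every user receives `list_size` sports drawn at random
    without replacement from `all_sports`, in place of a similarity-based list.
    """
    rng = random.Random(seed)
    pool = sorted(all_sports)  # fixed order so results are reproducible
    users = sorted(logged)
    at_least_observed = 0
    for _ in range(n_iter):
        hits = 0
        for user in users:
            random_list = set(rng.sample(pool, list_size))
            if logged[user] & random_list:
                hits += 1
        if hits / len(users) >= observed:
            at_least_observed += 1
    return at_least_observed / n_iter


# Toy data (hypothetical): 20 sports, 3 users.
pool = {f"sport_{i}" for i in range(20)}
logged = {
    "u1": {"sport_0", "sport_1"},
    "u2": {"sport_2"},
    "u3": {"sport_3", "sport_4"},
}
p_value = permutation_p_value(logged, pool, list_size=3, observed=1.0, n_iter=1000)
print(p_value)
```

A small p-value means that randomly chosen lists rarely match users' logged sports as often as the similarity-based lists do.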


Results

Sport Taxonomy

Using the similarity scores described above, we constructed a taxonomy of the sports observed in the logs. This taxonomy can be represented as a dendrogram (Figure 1).

Figure 1: Dendrogram of sports encountered in data logs.

Permutation Test Results

We observed that 87 users (34.9%) logged at least one sport from the list of similar sports identified using method 1. In the permutation test, no iteration achieved a higher proportion than the observed proportion (p-value: <0.0005) (Figure 2a).

A total of 206 users (82.7%) logged at least one sport from the list of similar sports identified using method 2. In the permutation test, no iteration achieved a higher proportion than the observed proportion (p-value: <0.0005) (Figure 2b).

Under each method, the observed proportion of users who engaged in at least one of the identified sports is statistically significant.
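The reported percentages and the p-value bound can be checked with simple arithmetic. The check assumes that the denominator is the 249 users retained after filtering and that a test with no exceeding iterations is reported as a bound rather than as zero.

```python
N_USERS = 249        # users with at least 10 sessions (Dataset section)
N_ITERATIONS = 5000  # permutation iterations (Methods section)

method1 = 87 / N_USERS
method2 = 206 / N_USERS
print(f"{method1:.1%}, {method2:.1%}")  # 34.9%, 82.7%

# With none of the 5,000 iterations reaching the observed value, the p-value
# lies below the smallest non-zero value the test can resolve.
resolution = 1 / N_ITERATIONS
print(resolution)  # 0.0002
```

A resolution of 0.0002 is consistent with the reported bound of <0.0005.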

Figure 2: Density plot of the proportions obtained with randomly selected sports (blue solid line) and the observed proportion when similarity scores are used (red dotted line). The observed proportion far exceeds the proportion expected under the null hypothesis in both (a) method 1 and (b) method 2.


Discussion

In this study, we examined the hypothesis that users tend to engage in sports sharing similar characteristics. We constructed a taxonomy of the sports encountered in our logs and identified sports similar to each user’s preferred sports, as captured by the archetypes primary_exercise and secondary_exercise.

We tested two methods of identifying similar sports. We found that 87 users (34.9%) logged at least one sport from the list of similar sports identified using method 1, and 206 users (82.7%) logged at least one sport from the list identified using method 2. Both outcomes are statistically significant (p-value: <0.0005). This supports our hypothesis that users tend to engage in sports sharing similar characteristics, and suggests that a user’s preferred sports, as captured by the archetypes primary_exercise and secondary_exercise, can be used to predict other sports that the user might engage in or be interested in.

These findings indicate that a recommendation system based on sport similarity is viable. Such a system could suggest new class offerings to users based on their engagement history, potentially improving engagement, user retention, and satisfaction.

Future work could combine sport similarity with customer usage patterns to further refine recommendations.


Conclusion

This study supports our hypothesis that the sports a user engages in are predictable using sport similarity. The findings of this study can be combined with Sahha archetypes to build compelling recommendation systems that offer new sports and exercise classes to users.

©Sahha 2025