Effect of Combinations of Sensor Positions on Wearable-sensor-based Human Activity Recognition


Introduction
In recent years, several studies have been conducted in the field of human activity recognition (HAR), which has been widely used in human-computer interaction, (1) work performance management, (2) and healthcare. (3) Data acquisition methods in HAR can be divided into two types: external-sensor-based HAR and wearable-sensor-based HAR (WHAR). (4) In external-sensor-based HAR, system designers arrange cameras in the locations where users perform activities (5) or sensors in the environment, (6) for example, on furniture and floors. However, this approach has three inherent drawbacks: 1) external sensors are usually large and power-consuming, which may incur high costs for installation and maintenance; 2) external sensors are not suitable for long-term, continuous recording of human activities, because activities cannot be recognized once the users leave the place where the sensors are installed; and 3) devices such as cameras and microphones can infringe on the user's privacy. In WHAR, sensors are attached to or carried by the user, which allows for the continuous recording of human activity. In addition, with the popularity of smartphones and smartwatches equipped with inertial measurement units in daily life, WHAR system designers can use user-owned devices in their systems, which reduces the cost of system deployment. WHAR systems also have little impact on the feeling of privacy violation because users retain control of the sensors and the applications that use their data: a user can remove a sensor or turn off an application if they do not want their activities recorded. Owing to these advantages, WHAR systems have been favored by many researchers in recent years.
In recent studies in the field of WHAR, accelerometers have been shown to be effective in capturing behavioral characteristics. (7) Therefore, accelerometers have been used in many WHAR systems. However, the positioning of an accelerometer on the user's body remains an open problem. As mentioned in Ref. 8, significant differences exist in the amplitudes of the acceleration signals at different positions on the body, even for the same activity. Wearable-sensor-based application designers often rely on experience or subjective judgment to decide sensor placement for a particular set of activities. However, this approach may fail if ineffective positions are selected, in which case ineffective motion or posture signatures might be recorded, resulting in poor system performance. The difficulty in sensor position selection is that the best placement is not necessarily where the movement is most apparent, as discussed in Ref. 9. In a study on gait detection in limb injuries, the results showed that the head, rather than the legs, provided the best classification features for gait, (10,11) which demonstrates the difficulty for a system designer of finding the best sensor location based on subjective judgment. For activity recognition, the selection of the number of sensors and their positions remains unresolved and requires further research.
In addition, almost all studies have considered only conventional machine learning (ML)-based sensor placement strategies. The classification accuracy at each position is highly dependent on the features employed by the researchers, and manually determined features do not generalize the classification performance of each position, which may distort one's judgment of the importance of a sensor position. Deep learning (DL) can extract deep features from a sensor, which can reduce classification inaccuracies caused by insufficient information in manually designed features and better reflect the differences in information between the positions themselves. Therefore, a performance comparison between DL and conventional ML at the same positions is worth exploring.
In this study, we evaluated the classification performance of different combinations of sensor positions by conducting experiments using daily life activity data. We applied and compared three types of classification model: a conventional ML-based model with classification feature engineering and two DL-based models with feature learning. The processing performance and processing time were compared. The results are expected to contribute to the determination of appropriate positions and combinations of sensors and to the selection of a classification model for the complex activities of daily life.

Overview of experiment
We performed an offline experiment aimed at providing wearable-sensor-based application designers with useful information for choosing an appropriate classification method and identifying both desirable and undesirable sensor positions for complex daily activity recognition. In Sect. 2.2, a dataset consisting of 23 complex activities of daily life (CADL) collected from 14 young adults who wore seven accelerometers is described. Three classification models were used: a conventional ML-based model (Sect. 2.3.1) and two DL-based models (Sect. 2.3.2). These three models were compared in terms of their tendency toward effective sensor-position combinations, classification performance, and processing time per window.
The classification performances for all the sensor combinations were obtained, in which 127 combinations of sensor positioning were tested. For the performance measure, we used the F1-score, which is the harmonic mean between recall and precision.
To implement the conventional ML-based method, we utilized the Weka 3.10 machine learning toolkit. In contrast, scikit-learn 0.24.2 and PyTorch 1.10.1 were used to implement the DL-based methods. The evaluation was run on an 11th generation Intel Core i9-11900K CPU with an NVIDIA GeForce RTX3080Ti GPU.

Dataset
A dataset collected in the authors' laboratory was used. The dataset consists of three-axis acceleration data for 23 daily life activities from seven positions on the bodies of 14 volunteers (five females and nine males between the ages of 22 and 25 years, all right-handed). Figure 1 shows (a) the sensor placement and (b)-(x) snapshots of the activities. Six of the seven sensor nodes (ATR Promotions Inc., TSND151) were attached symmetrically to the upper arms, wrists, and thighs, whereas one node (TSND121) was placed on the chest. All the sensor nodes were securely attached to the body with a band. The sensor nodes on the upper arms and wrists were worn so that they were on the outside of the body. Each sensor node has a real-time clock (RTC) synchronized with the clock on a data-collection personal computer. The major difference between the TSND121 and TSND151 is the six-axis motion (accelerometer and gyro) sensing unit, that is, the InvenSense MPU-6050 and MPU-9250 for the TSND121 and TSND151, respectively. In this study, we used only the accelerometers, setting the measurement range to ±19.62 m/s², and believe that the effect of this difference is minimized by the placement of the same sensor, the TSND151, in symmetrical positions on the body. Note that the main reason for using the TSND series is that the data collection experiments, including time synchronization between sensor nodes, can be managed on a single personal computer using dedicated data recording software. This allowed rapid data collection and subsequent analysis.
The activities included not only simple activities such as walking and running but also complex upper-limb activities such as making coffee and vacuum cleaning, which are frequently performed in daily life. The subjects performed each activity for approximately 12 min in the way they usually do. Note that this does not indicate a continuous 12-min session but the total time of several separate sessions. The acceleration signal sampling rate was 50 Hz. Data were collected for approximately 64 h (12 min × 23 activities × 14 persons). Notably, the dataset was balanced across both activity classes and individuals, with 10268.3 s [standard deviation (SD): 333.9 s] per activity and 16871.1 s (SD: 332.9 s) per individual. Therefore, the training of a classifier is less likely to be biased toward specific activities or individuals by an imbalance in the amount of data.

Conventional ML-based method
Conventional ML comprises two parts: feature extraction and a classification model. Features that could characterize the motions of various activities were calculated from the raw acceleration signals. In addition to the three axes of an accelerometer, i.e., x, y, and z, we used the magnitude m of the three axes as a fourth axis. A total of 39 features were defined over the four axes of the acceleration signal in the time and frequency domains of each sensor, as summarized in Table 1. These features are frequently used for on-body HAR and device localization. (9,(12)(13)(14)(15) A window size of 256 (N = 256) was selected. The importance of the features at each position was then evaluated using ReliefF, (16) which evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same and different classes. We confirmed that, for each position, adding more features did not significantly improve classification performance once the number of features exceeded 19. Thus, 19 features were used for each position to examine the combinations of positions, as listed in Table 2. The table also shows the effective features for each position. Among them, F3, F10, F15, F19, F30, and F38 were selected at every position and are thus effective features regardless of position.
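As an illustration of the windowed feature extraction described above, the following sketch computes a handful of representative time- and frequency-domain statistics over the four axes x, y, z, and m. The paper's actual 39 features are those defined in Table 1; the helper name and the particular statistics below are illustrative, not the paper's exact set.

```python
import numpy as np

def extract_features(window):
    """Compute a few representative time/frequency features from one
    window of shape (N, 3) holding x, y, z acceleration samples.
    The magnitude m is added as a fourth axis, as in the paper."""
    m = np.linalg.norm(window, axis=1, keepdims=True)
    axes = np.hstack([window, m])                # shape (N, 4): x, y, z, m
    feats = []
    for a in axes.T:                             # per-axis features
        feats += [a.mean(), a.std(), a.min(), a.max()]   # time domain
        spectrum = np.abs(np.fft.rfft(a - a.mean()))
        feats.append(spectrum.argmax())          # dominant frequency bin
    # inter-axis Pearson correlations (cf. F37-F39)
    for s, t in [(0, 1), (0, 2), (1, 2)]:
        feats.append(np.corrcoef(axes[:, s], axes[:, t])[0, 1])
    return np.asarray(feats)

N = 256                                          # window size used in the paper
window = np.random.randn(N, 3)                   # stand-in for 50 Hz acceleration
print(extract_features(window).shape)            # (23,) features for this sketch
```

A ReliefF-style selection step would then rank such features per position and keep the top 19.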
For each sensor combination, an activity was characterized by 19 × K features when K sensors were used. We used RandomForest (RF) as the classification algorithm for the conventional method because RF has been shown to exhibit good classification performance in WHAR tasks. (17,18)
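A minimal sketch of the per-combination RF classifier using scikit-learn's `RandomForestClassifier` (the paper's conventional method used Weka's RandomForest; the data shapes follow the 19 × K feature layout, but the feature values and labels here are random placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
K = 3                                    # number of sensors in the combination
n_windows, n_classes = 400, 23
# 19 selected features per sensor position, concatenated to 19*K per window
X = rng.normal(size=(n_windows, 19 * K))
y = rng.integers(0, n_classes, size=n_windows)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]).shape)          # (5,) predicted activity labels
```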

Deep-learning-based methods: CNN-LSTM
Several WHAR studies based on DL models have emerged in recent years. The most frequently used network layers are convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) networks. Recently, hybrid models have also been used; CNNs and LSTMs have been combined to outperform CNNs alone. (19)(20)(21)(22)
[Note to Table 1: a ∈ {x, y, z, m}. f_{a,i} indicates the value of the ith smallest frequency component for axis a. The four features in each feature category, e.g., F1, F2, F3, and F4, correspond to x, y, z, and m in this order. The term cov_{st} indicates the covariance between the signals from axes s and t. F37, F38, and F39 are Pearson's correlation coefficients between two axes s and t, i.e., cov_{st}/(v_s v_t), for axes x and y, x and z, and y and z, respectively.]
In a study using multiple sensors, (22) multiple convolutional subnetworks were used to collect features from each sensor, which were then integrated in a depth concatenation layer. Finally, classification was performed in the output layer after collecting the temporal features through an LSTM layer. A similar CNN subnetwork design was also applied in Ref. 23 and was shown to be effective in separately extracting the information provided by multiple sensors. Because the DEBONAIR model proposed in Ref. 22 achieved a high accuracy of 83% on the CADL dataset, the CNN-LSTM model in this study was built with reference to the architecture of DEBONAIR, as shown in Fig. 2. For each sensor, a convolutional subnet containing three convolutional and three pooling layers was used to extract information, which was integrated in the depth concatenation layer and subjected to a convolution operation. Then, the temporal features were extracted from the data using a two-layer LSTM, and finally, classification was performed using a softmax function on a fully connected layer.
A preliminary experiment showed that, among the hyperparameters, the optimizer and learning rate affected the classification performance the most. These hyperparameters were therefore tuned for each sensor combination. The hyperparameters considered are listed in Table 3.
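The per-combination tuning loop can be sketched as an exhaustive search over candidate values. The grid and the `validate` stub below are placeholders; the actual candidate values are those in Table 3, and `validate` stands in for training the model and scoring it on held-out data:

```python
import itertools

# Illustrative grid; the actual candidate values are those in Table 3.
optimizers = ["Adam", "SGD", "RMSprop"]
learning_rates = [1e-2, 1e-3, 1e-4]

def validate(optimizer, lr):
    """Placeholder for training the model with the given hyperparameters
    and returning a validation F1-score."""
    return 0.8 - abs(lr - 1e-3) - (0.05 if optimizer == "SGD" else 0.0)

# Pick the (optimizer, learning rate) pair with the best validation score
best = max(itertools.product(optimizers, learning_rates),
           key=lambda cfg: validate(*cfg))
print(best)
```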

Deep-learning-based methods: CNN-transformer-based method
The transformer model based on multihead attention has proven highly advantageous in recent years for sequence analysis tasks. Shavit and Klein used the transformer encoder for WHAR tasks (24) for the first time, and their results demonstrated its effectiveness. Figure 3 illustrates the architecture. First, the data from each sensor were integrated along the time dimension, and then token embedding and position embedding operations were performed on the data. Subsequently, the self-attention value was calculated for each vector using the transformer encoder, and the class token embedded during token embedding was used to classify the data with a softmax function on the fully connected layer. The hyperparameters are listed in Table 4 and are based on the specific values or calculation methods used in the model of Shavit and Klein. (24)
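A PyTorch sketch of this design, with a learnable class token and positional embeddings feeding a standard transformer encoder; all sizes are illustrative assumptions, not the values of Ref. 24:

```python
import torch
import torch.nn as nn

class CNNTransformer(nn.Module):
    """Sketch of the CNN-transformer classifier described above: sensor
    channels are stacked (no per-sensor subnets), embedded into tokens,
    a learnable class token and positional embeddings are added, and the
    transformer encoder's class-token output is classified."""
    def __init__(self, n_sensors, n_classes=23, n_axes=3, d_model=64, n_tokens=64):
        super().__init__()
        # CNN front end: all sensors enter as input channels
        self.embed = nn.Conv1d(n_sensors * n_axes, d_model, 4, stride=4)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, n_tokens + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, x):                        # x: (B, n_sensors*n_axes, N)
        tokens = self.embed(x).transpose(1, 2)   # (B, N/4, d_model)
        cls = self.cls.expand(x.size(0), -1, -1)
        h = torch.cat([cls, tokens], dim=1) + self.pos
        h = self.encoder(h)
        return self.fc(h[:, 0])                  # classify on the class token

model = CNNTransformer(n_sensors=2)
x = torch.randn(4, 6, 256)                       # two sensors stacked, N = 256
print(model(x).shape)                            # torch.Size([4, 23])
```

Because the sensors enter only as input channels of the embedding convolution, the cost of everything after the embedding is independent of the number of sensors, matching the near-constant processing time reported later.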

Evaluation method
Classification performance was evaluated by cross-validation (CV) of the training and test data. We chose leave-one-person-out (LOPO) CV as the primary CV method, which was performed by testing a dataset from a particular person with a classifier that was trained without the data from that person. Training and testing were repeated with different combinations of participants. Because the trained classifier did not contain data from the test participants, LOPO-CV was regarded as a fairer and more practical test method.
In addition, n-fold CV was applied in two ways: against a dataset containing the data of all participants (n-fold CV_all) and by averaging the results of n-fold CV against datasets consisting of each participant's data (n-fold CV_each). The n-fold CV utilizes (n−1)/n of a dataset for training a classifier and 1/n for testing it. The n-fold CV_all represents the average classification performance because the classifier knows the participants from (n−1)/n of their data. In contrast, n-fold CV_each gives an optimistic performance because each classifier is trained and tested exclusively on data from a single participant. We set n to 10. Section 3.3 utilizes all three evaluation methods; otherwise, LOPO-CV is used to understand the lower bound of the performance.
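The LOPO-CV protocol corresponds to scikit-learn's `LeaveOneGroupOut` with the participant ID as the group label; a minimal sketch on placeholder data (the feature values and labels are random, only the splitting scheme is the point):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_windows, n_participants = 280, 14
X = rng.normal(size=(n_windows, 19))       # placeholder feature windows
y = rng.integers(0, 23, size=n_windows)    # placeholder activity labels
groups = np.repeat(np.arange(n_participants), n_windows // n_participants)

# LOPO-CV: each fold holds out all windows of one participant
logo = LeaveOneGroupOut()
clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, groups=groups, cv=logo)
print(len(scores))                         # 14 folds, one per participant
```

The two n-fold variants follow the same pattern with `KFold` applied to the whole dataset (CV_all) or to each participant's subset separately (CV_each).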
The classification performance is evaluated using a macro-average F1-score, i.e., the unweighted mean of the per-class F1-scores, where each class's F1-score is the harmonic mean of its recall and precision. Hereinafter, we simply refer to a macro-average F1-score as an F1-score.

Classification performance comparison in the three methods
The differences among the three classification algorithms were analyzed. Figure 4 shows the maximum classification performance (F1-score) for different numbers of sensors in the three classification models; the sensor combination is also presented. The CNN-LSTM model obtained the highest score for single-sensor usage. When the number of sensors was greater than one, the RF outperformed the two DL models. Comparing the three algorithms for the 127 sensor combinations, we found that the RF model achieved the highest F1-score for 119 sensor combinations. The CNN-LSTM model achieved the highest F1-score for the remaining eight combinations, and in seven of these eight sensor combinations, one sensor was used. Although the CNN-LSTM model outperformed the RF model when only one sensor was used, the degradation in the classification performance of the CNN-LSTM model was most pronounced when the number of sensors exceeded four. The number of trainable parameters of the model increased by 1400 per sensor because a CNN subnetwork structure was used to extract features from each sensor's data. In contrast, we did not use a subnetwork structure in the CNN-transformer model but instead integrated the different sensors into one dimension; this increased the number of parameters by only 192 per sensor. Such a difference in the number of trainable parameters made the CNN-LSTM model susceptible to overfitting when noisy data were passed in, which led to a decrease in the F1-score. We believe this applies especially to LOPO-CV, where the providers of the training and test data are different. In contrast, in the RF model, we performed feature selection for each position, i.e., dimensionality reduction, such that only valid features were used in training for each position. This prevented significant degradation in classification performance even when useless sensors were added.
The CNN-transformer model did not achieve high F1-scores. In this model, we used the hyperparameters provided in Ref. 24, which might have led to such results. Another reason could have been the lack of data. The multihead self-attention mechanism in the transformer encoder can focus on the information at any one position in the data; however, this also requires an extensive dataset for support. As mentioned in Ref. 25, where the transformer structure was applied for image processing, transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data. In WHAR, collecting large amounts of data with labels is challenging. Although the sample size of our dataset exceeded that of most large publicly available datasets, (26)(27)(28) the results showed that conventional ML still had an advantage. The collection of large amounts of high-quality ADL data is a major future challenge.
To determine whether the classification performance for a combination of sensors varied with the classification model, we examined the strength of the relationship using the F1-scores of all 127 combinations under the three classification models. Table 5 shows the Pearson's correlation coefficients between different pairs of the three models, where values closer to 1.0 indicate a stronger relationship between the pairs of classification models. The table shows strong correlations among the three pairs, indicating that the trends in the effectiveness of the sensor combinations were stable against changes in the classification models. Therefore, the averages of the F1-scores of the three classification models per sensor-position combination are presented in the following sections, unless otherwise noted.

Figure 5 shows a heatmap representing the overall trend of the F1-scores per combination of sensors, grouped by the number of sensors and sorted in ascending order. In the figure, the check marks in the sensor position columns indicate the use of the sensor position. In the activity columns, darker cells indicate higher F1-scores. In the rightmost columns, the macro averages are presented as bar charts and numbers. The figure suggests that the sensors' positions and combinations affect not only the average classification performance but also the classification per activity. Note that Table A1 in the Appendix shows the concrete values. A larger number of sensors was not necessarily better, as F1-scores lower than the largest value in a group with fewer sensors were obtained with more sensors. For example, the highest value was obtained from one sensor (No. 7) worn on the right wrist, whereas only five of the 21 combinations yielded higher values when two sensors were used: Nos. 24, 25, 26, 27, and 28.
This is not surprising because activities that differ mainly in upper-body movement, such as brushing teeth (b) or washing dishes (c), cannot be distinguished using sensors attached to the left and right thighs. Furthermore, in several cases, the use of fewer sensors was better than the use of all seven sensors. These were Nos. 62, 63, 95, 96, 97, 98, 116, 117, 118, and 119, among which the sensor combination of the left wrist, right upper arm, right wrist, and right thigh (No. 98) was the best. In all of these cases, except for No. 117, no sensor was worn on the chest. The posture and movement of the chest showed little difference between activities that differed in hand use, as can be seen from the fact that the value obtained from the chest was the lowest in the case of one-sensor use. This suggests that the information obtained from the chest-mounted sensor was noisy when discriminating between activities that differed in hand or arm movements. In fact, comparing the intensity of the heatmap for each activity in combination Nos. 98 and 127, we found that the cells for tasks such as washing dishes (c), washing hands (e), and eating food while sitting (j) appeared darker in No. 98 than in No. 127. Table A1 in the Appendix concretely shows this fact by indicating a higher F1-score for No. 98.

Effect of sensors' positions on classification
The right wrist made the highest contribution among the seven positions: for each sensor-count group, the top-ranking combination included the right wrist. This may be because all of the subjects were right-handed, although the participants were not instructed to hold objects such as toothbrushes with their dominant hand during data collection. Because watches are often worn on the side opposite the dominant hand, the usefulness of the left wrist must be verified when considering a smartwatch as a practical implementation of the sensor. In the case of single-sensor use, the left wrist ranked second in usefulness, behind the right wrist (No. 6). When two sensors were used, apart from the pairing with the right wrist (No. 26), the left wrist appeared first with the right upper arm (No. 23), followed by the right thigh (No. 21) and the left thigh (No. 20). A smartphone can also be "worn" in a holder attached to the upper arm, as is often done during exercise, and a sensor can be attached to the thigh by keeping a smartphone in the front pocket of the pants, although the degrees of freedom of movement are then greater than those under the current data collection conditions. Therefore, we believe that these three pairs represent the classification performance for activity recognition under practical conditions; however, they were 0.066, 0.071, and 0.072 lower than the best pair (No. 28).

Individual user differences
The relationship between the number of sensors and classification performance under different user data distributions is discussed next. Here, we focus on the RF model because it proved to be the most effective, as discussed in Sect. 3.1. The three evaluation methods presented in Sect. 2.4 were used. Figure 6 shows the relationship between classification performance and the number of sensors under the three evaluation methods; each bar indicates the average F1-score. With respect to the maximum value, the three evaluation methods appeared to saturate at four sensors. The trend in the mean values was LOPO-CV < 10-fold CV_all < 10-fold CV_each. As expected, LOPO-CV and 10-fold CV_each indicate the lower and upper bounds of the classification performance, respectively. The F1-score of LOPO-CV was much lower than those of 10-fold CV_all and 10-fold CV_each because the ways in which CADL are performed can vary considerably among individuals; thus, misclassifications can occur. The results of 10-fold CV_all showed that an average F1-score of more than 0.82 could be achieved using more than three sensors if even a small amount of the user's own activity data was included in the training data. Furthermore, an average F1-score of more than 0.95 was achieved for the 23 CADL using only two sensors if the training data were obtained exclusively from the particular user (10-fold CV_each). To improve the classification performance in LOPO-CV, that is, when testing on data from an unknown user, data should be collected from more participants to increase the heterogeneity of the training data, which would increase the possibility of including people whose data are comparable to those of the unknown user. In other words, this creates a situation similar to including the user's own data.

Table 6 summarizes the processing speeds in milliseconds per window, in which the DL-based models were evaluated with and without the GPU (i.e., using only the CPU).
This table presents the following three facts. First, the RF model required a much longer time than the two DL-based models. This is because the processing time of the RF model includes the feature calculations, which require approximately 2.70 ms/window. Second, the processing times of the RF and CNN-LSTM models increased linearly with the number of sensors, whereas that of the CNN-transformer model was nearly constant. In the RF model, because the feature calculation time with K sensors was almost K times longer, even though the optimal feature subset varied by position, the time required for feature calculation had a greater impact on the overall processing time than the classification time (0.015 ms/window per sensor). In the CNN-LSTM model, as shown in Fig. 2, the number of sensors, K, affects even the concatenation layers, which we consider increased the processing time linearly, although not significantly. By contrast, because K appears only at the input of the convolutional layers, as shown in Fig. 3, the computational cost of subsequent processing is independent of the number of sensors; therefore, the processing time of the CNN-transformer model was almost constant. Third, GPUs were more than 10 times faster than CPUs for the CNN-LSTM model and 50 times faster for the CNN-transformer model, as expected. In Sect. 3.1, the RF model exhibited the best classification performance; however, its processing speed was the lowest among the three models. Thus, overall, the CNN-LSTM model is the best classification model in terms of both classification performance and processing speed if a large amount of labeled data can be obtained.

Conclusion
In this study, we examined the effect of different combinations of seven body-worn accelerometer positions on the classification of 23 CADL. One conventional ML model (RF) and two DL models (CNN-LSTM and CNN-transformer) were used to understand the differences between the classification models. A total of 127 combinations were tested using the three classification models. The findings are as follows:
• The trends in the effectiveness of the sensor-position combinations were strongly correlated across the three classification models.
• A larger number of sensors did not necessarily yield better classification performance.
• The sensors placed on the right side of the subjects exhibited better classification performance than those on the left side and center because of the effect of the dominant hand (all participants were right-handed).
• The combination of four sensors placed on the left and right wrists, right upper arm, and right thigh was the best.
• Assuming that the sensors could be integrated with smartwatches and smartphones, practical combinations in which a smartwatch was worn on the nondominant wrist (left) and a smartphone was kept in the left or right trouser pocket ranked 85th and 87th performance-wise among the 127 combinations, lower than the best combination by 0.147 and 0.149, respectively.
• A comparison of the three evaluation methods showed the lower, average, and upper bounds of classification performance. Training a classifier with even a small amount of data from the test participants significantly improved classification performance.
• The RF model required processing time for feature calculation, which caused a significantly longer processing time per window than those of the DL-based models. Thus, the CNN-LSTM model would be a better choice than RF if a large amount of data is available for training the model.

The findings of this study enable application designers who use activity information to choose a combination of sensor positions based on their requirements for the wearability of the sensors and the classification performance of the activities of interest. In the future, we plan to apply active learning, (29) a machine-learning method that engages the user in the labeling process, to adapt the decision boundary of a classifier to the data distribution of a particular user. Furthermore, we will investigate a method to determine the best combination for a new set of activities without evaluating all the combinations.