Fuzzy Spatiotemporal Representation Model for Human Trajectory Classification

Keywords: spatiotemporal


Introduction
The widespread usage of the mobile Internet has resulted in a heavy reliance on mobile devices to access online services. Concurrently, the massive data collected by these devices capture individuals' behavioral features, particularly their spatiotemporal patterns, which have attracted significant attention from researchers. Using spatiotemporal information to identify and classify different behavioral groups is a consistent and significant research direction. (1) The collection of movement behaviors can be achieved comprehensively by utilizing location information from cell phones. The movement trajectory data encompass specific time and location information and are, essentially, spatiotemporal data. (4) A crucial application scenario of spatiotemporal data involves classifying individuals on the basis of spatiotemporal information. Extracting discriminative features that strongly correlate with specific application scenarios is very useful for machine learning. For example, Du et al. (5) utilized user check-in data from social media during morning rush hours, working hours, evening rush hours, and nonworking leisure hours as features, thereby acquiring user movement behaviors from datasets. By employing the k-means clustering method and the k-nearest neighbor algorithm, citizens were successfully classified on the basis of the above features, facilitating the identification of additional personal details such as workplace, residence, and occupation. Furthermore, researchers have investigated the regularity of individuals' activity trajectories. Song and coworkers (6,7) highlighted that people's activities can be predicted, with a predictability rate of up to 93%. De Montjoye et al. (8) proved the uniqueness of individuals' activities. Additionally, it can be concluded that the activity trajectories of individuals are related to their social connections, as observed in Ref.
9: the closer the relationship between individuals and their social connections, the greater the similarity observed in their activity trajectories.
Despite the significant benefits of these methods in classifying or predicting human behaviors, they present several challenges for our problem. The selection of trajectories from the vast spatiotemporal data collected by sensors in a city encounters limitations. Owing to limitations in collection technologies and the distribution of collection points, the spatiotemporal data are frequently characterized by sparsity, positional offset, and high feature dimension. Moreover, the sparsity of the data hampers the accurate reflection of activity trajectories, whereas the high feature dimension introduces computational complexity and can potentially impair the performance of conventional classification models, namely, the curse of dimensionality. (10) Consequently, conventional classification models struggle to attain satisfactory results in this context.
To address the challenges associated with sparse and diverse trajectory data, we propose a method of extracting crowd habit features on the basis of fuzzy spatiotemporal data. The proposed method aims to improve classification performance by integrating it with classification models. The key steps and contributions are as follows.
• Regarding the high spatiotemporal dimensionality of the data, we propose the Time-Geo Hash (TGH) model. The TGH model effectively handles time information by processing it in fragments and encodes spatial location information in a coarse-grained manner. Additionally, the TGH model maps adjacent acquisition time points to the same time slice, thereby reducing the number of time dimensions. Furthermore, applying the hash algorithm to the location information of the collected data significantly reduces the number of spatial dimensions.

Ensemble learning
Ensemble learning, which originated from the concepts of strong and weak learnability, (10) has emerged as a fundamental technique in machine learning. It has proven instrumental in improving the generalization ability and prediction accuracy of classifiers. (11) Ensemble learning can be viewed in both narrow and broad senses. In the narrow sense, multiple subsets are randomly selected from the training set, and the same classification algorithm is applied to each subset to enhance the generalization ability of each classifier. In the broad sense, the same problem is tackled using multiple learners. The ensemble learning process mainly involves three steps: generating training subsets, training base classifiers, and integrating the results obtained from these classifiers. Bagging (12) and Boosting are the most representative and commonly used ensemble learning methods. (13)

Similarity calculation of spatiotemporal information
According to the spatiotemporal information used in calculating similarity, research on the similarity of spatiotemporal information can be divided into three categories.
(1) Spatial similarity focuses solely on spatial information and does not consider temporal aspects. Research studies on this primarily explore the geometric shapes of spatiotemporal trajectories using distance metrics such as Euclidean distance, (14,15) longest common subsequence (LCSS), (16) edit distance on real sequence (EDR), (17) and graph structure similarity (18) to measure the similarity between trajectories. (2) Time similarity concentrates on analyzing the similarity based on time series data. For example, the fast search method for dynamic time warping (DTW) (19) is used to calculate the distance between two time series. (3) Spatiotemporal similarity considers both spatial and temporal information, such as by implementing k-most-similar-trajectory (K-MST) queries using data structures similar to R-trees, (20) utilizing low-resolution trajectories to compose sets of crowd classification rules (FCRs), (21) or employing the top-bottom clustering algorithm for small crowd classification based on offline crowd trajectories. (22) The representation form of spatiotemporal information in the similarity calculation process can be divided into two categories: multidimensional vector and string forms. The multidimensional vector form requires more computational resources but provides a higher accuracy. (23) Therefore, in this study, the multidimensional vector form is utilized.
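As an illustration of the time-similarity category, the classic DTW recurrence can be sketched in a few lines. This is a from-scratch sketch of the basic dynamic programming formulation, not the fast search variant cited above, and the function name is ours:

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    INF = float("inf")
    # dp[i][j] = DTW distance between the prefixes a[:i] and b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping moves
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]
```

Identical series have distance 0, and a one-step mismatch contributes exactly its absolute difference.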

Problem Formulation
Here, we introduce several basic concepts and provide a formal definition of the user identification problem.

Overall framework
As shown in Fig. 1, the overall framework is constructed according to the following steps: (1) data preprocessing, (2) spatiotemporal information coding and user behavioral modeling, (3) Bagging algorithm, and (4) outputting the final prediction result.
The data on the far left in the figure represent the quintuple obtained after data cleaning. Through subsequent feature extraction and preprocessing, a dataset is constructed to train the classification model, that is, the training set. On the basis of the MAC address list of persons of interest provided by the public security department, classification labels can be determined in the training set, or in other words, whether each sample belongs to a specific group of people.
Note that the original location information is organized in the order of collection point and time. The data quintuple after data cleaning takes the MAC address as the unit: features are extracted from the multiple activity records corresponding to a single MAC address and aggregated into one sample. The granularity is no longer a single record but the MAC address of a mobile device.
As depicted in the middle of Fig. 1, our approach aims to address dimensionality reduction and compensate for data imperfections. To achieve this, we propose two feature extraction algorithms, namely, TGH and UTPS, which offer distinct perspectives for generating training sets. These algorithms help overcome limitations in the data and improve its overall quality.
Subsequently, the selected classification algorithms are trained on the two feature sets and combined into a robust classifier. To obtain the final prediction result, we employ a voting mechanism within ensemble learning. By synthesizing the outputs of the two feature extraction algorithms, we achieve an enhanced predictive capability that leverages the strengths of both approaches.

Data sources
The spatiotemporal data in our study were mainly collected via public safety sensors in a city. The data model can be represented by the following quintuple: (MAC address, sensor number, acquisition time, sensor latitude, sensor longitude).
The quintuple can describe moving tracks of a single MAC in various time and space domains.

Data preprocessing
Since the collected data are affected by duplication and incompleteness, it is necessary to preprocess abnormal data. Moreover, other problems must also be handled during the research, such as uniformly converting the particular time formats in the data reported by some sensors, removing the hyphens (-) in the data, and converting decimal MAC addresses into hexadecimal ones. After that, other smart devices, such as smart air conditioners, smart sockets, and smart home products, should be identified and discarded.

TGH algorithm
In this section, we encode a given path X in time and space to reduce its dimensionality. First, we divide the continuous time information with the TGH algorithm and map adjacent acquisition time points to the same time slice (ts). Then, we use Geohash to map the collected data's location information and encode each geolocation region into a hash value lh (location hash). The pseudocode is described in Algorithm 1.

Time slice:
The TGH algorithm divides each hour into four slices, with a quarter of an hour as the unit; thereby, there are 96 slices in a day. They are numbered from 1 to 96, and each is called a time slice. As shown in Fig. 1, the time point on the right side of each time slice is excluded (a right-open interval). For example, the time period corresponding to time slice number 1 is [00:00, 00:15), and so on.
Through experimental comparison, 15 min was selected as the time slice granularity: it produces a moderate number of time dimensions with good activity discrimination.
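The slice mapping is simple enough to sketch directly; `time_slice` below is a hypothetical helper name, assuming slices are numbered 1 to 96 with right-open intervals as described:

```python
from datetime import datetime

def time_slice(ts: datetime) -> int:
    """Map a timestamp to its 15-minute time slice number, 1..96.

    Intervals are right-open, so 00:15:00 already falls in slice 2.
    """
    return ts.hour * 4 + ts.minute // 15 + 1
```

For example, `time_slice(datetime(2017, 8, 1, 0, 7))` gives slice 1, while any time from 23:45 onward gives slice 96.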

Geohash location encoding:
Geohash encodes the latitude and longitude into an alphanumeric string. The coded string represents a rectangular region of Earth. As shown in Table 1, longer strings yield greater precision. Considering the division of urban functional blocks and the dimension quantity of the entire city, latitude and longitude coordinates are converted into 5-character Geohash codes. This conversion enables a sufficiently precise representation of geographic locations. As seen in the table, each 5-character Geohash code corresponds to a rectangular region roughly 2.4 km in length.
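Standard Geohash encoding interleaves longitude and latitude bits (longitude first) and emits one base-32 character per 5 bits; a from-scratch sketch (function name ours) might look like this:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash base-32 alphabet

def geohash_encode(lat, lon, precision=5):
    """Encode a latitude/longitude pair into a Geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True  # Geohash alternates bits, starting with longitude
    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):
        idx = 0
        for b in bits[i:i + 5]:
            idx = idx * 2 + b
        chars.append(BASE32[idx])
    return "".join(chars)
```

With `precision=5`, every point inside the same roughly 5 km x 5 km cell collapses to one code, which is what makes the spatial dimensionality manageable.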

Space-time encoding:
A single MAC address yields path information that, after encoding with the TGH model, contains a total of 486 features, as below.

(label, mac, time-features, geo-features)
For a specific MAC address, the label is in the first column, indicating whether the owner of the mobile device associated with the MAC address is a person of interest. A label value of 1 denotes "yes," whereas 0 denotes "no." The second column denotes the unique MAC address. The third column consists of a series of values representing the number of times the MAC address is collected in each time slice. This sequence has 96 dimensions, as established previously. Following that, a collection of 5-character Geohash codes represents the frequency of data collection for the MAC address in each corresponding geographical area. As determined earlier, there are a total of 388 codes.
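Assembling the TGH count vector for one MAC address can be sketched as below; `tgh_features` is a hypothetical helper name, and we assume each record has already been mapped to a (time slice, Geohash cell) pair:

```python
from collections import Counter

def tgh_features(records, cells):
    """Build the TGH count vector for one MAC address.

    records : list of (time_slice, geohash_cell) pairs, one per collection event
    cells   : the fixed, ordered list of 5-character Geohash cells for the city
    """
    slice_counts = Counter(s for s, _ in records)
    cell_counts = Counter(c for _, c in records)
    time_features = [slice_counts.get(s, 0) for s in range(1, 97)]  # 96 dims
    geo_features = [cell_counts.get(c, 0) for c in cells]           # one per cell
    return time_features + geo_features
```

With the 388 city cells of the text, the returned vector has 96 + 388 = 484 dimensions; prepending the label and MAC columns gives the 486 features above.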

UTPS algorithm
The UTPS model characterizes each MAC address by its activity in different work and rest time periods of the working day and by the similarity of the daily activities of the mobile device holder associated with that MAC address.
As shown in Algorithm 2, UTPS is mainly used to describe the degree and similarity of user activities in different time periods and regions. It can compensate for deficiencies such as missing data collection. In addition, the similarity of daily behavior is incorporated into the statistics to extract features from a new perspective.

Proportion of spatiotemporal information:
Inspired by previous research, (5-8,24) the UTPS algorithm defines the time period division table below in accordance with the work and rest habits of office workers on work days.
Given the inherent disparities in transit behaviors between work days and holidays, the UTPS algorithm is purposefully designed to analyze only data collected on work days. This approach incorporates the expertise and experience of the public security department, which serves to comprehensively account for the distinct patterns and unpredictable activities characteristic of nonworking days.

Similarity information:
The UTPS algorithm uses the TGH algorithm to convert all quintuples of data collected daily for each MAC address into one sample of the form (t_1, t_2, ..., t_{s_dim}, g_1, g_2, ..., g_{g_dim}), where s_dim indicates the dimensions of the time features and g_dim the dimensions of the geo-features. In a specified time period, the number of days on which the activity record of each MAC address is collected corresponds to the number of multidimensional vectors.
Then, we set the dimension of the multidimensional vectors as v_dim = s_dim + g_dim and denote the multidimensional vectors of all collected days by v_1, v_2, ..., v_n. The similarity of the space-time vectors of multiple days should be calculated to determine whether the user behavior is regular. The process can be divided into three steps.
Step 1: Combine the v_i into a multidimensional vector mix with the same dimensions.
If at least one of the v_i has a value of 1 in the first dimension, the first dimension of mix is 1; otherwise, it is 0. All subsequent dimensions are processed in turn in the same way, obtaining a multidimensional vector mix of the same dimension. The mix can be viewed as a synthesis of the v_i.
Step 2: Calculate similarity between v i and mix.
The Jaccard index is introduced to calculate the similarity of two multidimensional vectors as

J(i, j) = q / (q + r + s),  (1)

where i and j are two multidimensional vectors whose dimension values are 0 or 1, q represents the number of dimensions in which i and j are both 1, r represents the number of dimensions in which i is 1 and j is 0, and s represents the number of dimensions in which i is 0 and j is 1.
Equation (1) is used to compute the similarity between v_i and mix. Since mix is the dimension-wise OR of all the v_i, the denominator is simply the count of dimensions in mix with a value of 1, whereas the numerator counts how many of those dimensions are also 1 in v_i, that is,

sim_i = J(v_i, mix).  (2)

Step 3: Determine the overall similarity of all multidimensional vectors by computing the average of all the similarities obtained in Step 2:

S = (1/n) Σ_{i=1}^{n} sim_i.  (3)
When the number of days is 1, indicating that there is only a single multidimensional vector, the mix is identical to that vector, resulting in an overall similarity score of 1. Similarly, if all multidimensional vectors are identical, the mix will also be identical to these vectors, yielding an overall similarity score of 1.
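The three steps above can be sketched as follows; the function name is ours, and the daily vectors are assumed to be already binarized to 0/1:

```python
def overall_similarity(vectors):
    """Steps 1-3: OR-combine daily 0/1 vectors into mix, Jaccard each day
    against mix, and average the per-day similarities."""
    if not vectors:
        return 0.0
    dim = len(vectors[0])
    # Step 1: mix[k] = 1 if any day has a 1 in dimension k
    mix = [1 if any(v[k] for v in vectors) else 0 for k in range(dim)]
    ones = sum(mix)  # since mix dominates every v_i, q + r + s = |mix|
    sims = []
    for v in vectors:
        # Step 2: Jaccard of v against mix; q counts shared 1-dimensions
        q = sum(1 for k in range(dim) if v[k] == 1 and mix[k] == 1)
        sims.append(q / ones if ones else 1.0)
    # Step 3: overall similarity is the average over all days
    return sum(sims) / len(sims)
```

A single day, or identical days, yields 1.0, matching the boundary cases described above.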

User habit coding:
The UTPS algorithm generates a dataset in which each sample row corresponds to an individual MAC address. The dataset encompasses a total of 15 distinct features, as below.

(label, mac, proportion of p i records, proportion of p i regions, similarity of daily activities)
The proportion of p_i records, with i = 1, 2, 3, and 4, is defined in Table 2. It represents people's activities in the different time periods. The proportion of p_i regions, with i = 1, 2, 3, ..., 10, represents how active the person is in the top-10 most active areas.

Model improvements
To enhance the classification model's performance, we leverage the Bootstrap aggregating algorithm within the realm of ensemble learning.This technique combines multiple weak classifiers, each trained by a specific algorithm, to form a robust classifier that delivers the ultimate prediction.
We focus on enhancing the classification model in four key areas: (1) reducing feature dimensionality, (2) selecting appropriate algorithms, (3) employing Bagging ensemble techniques, and (4) enhancing comprehensive decision-making.As we have already explored feature dimensionality reduction, the subsequent sections will delve into the latter three aspects in detail.

Performance indicator of classification model:
Accuracy (ACC) is calculated as

ACC = (TP + TN) / (TP + TN + FP + FN),  (4)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
In cases of imbalanced data, the accuracy metric may not adequately capture the overall performance of the classification model. Therefore, we introduce other performance indicators, namely, Precision, Recall, and F1:

Precision = TP / (TP + FP),  (5)
Recall = TP / (TP + FN),  (6)
F1 = 2 × Precision × Recall / (Precision + Recall).  (7)

Within these metrics, Precision signifies the fraction of samples accurately predicted as the target category out of all the samples predicted as such, whereas Recall denotes the portion of correctly predicted samples among those belonging to the target category. These two metrics offer distinct viewpoints on the classification model's performance. A higher Precision implies that the classification model seldom misclassifies nontarget samples as the target category, whereas a higher Recall suggests that the classification model rarely misclassifies the target category as nontarget. The F1 score serves as a combined metric that considers both Precision and Recall.
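These four indicators follow directly from the confusion-matrix counts; a small helper (name ours) makes the definitions concrete:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, Precision, Recall, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return acc, precision, recall, f1
```

Note how a heavily imbalanced split can give a high accuracy alongside a low recall, which is exactly why the extra indicators are needed here.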

TGH classification model algorithm selection:
Various machine learning classification algorithms are used to train models on the TGH training set.On the basis of the performance metrics mentioned earlier, we have opted for a random forest classifier to classify the feature datasets generated by the TGH algorithm (refer to the experimental section for a comparison of performance indicators of various classification algorithms).
Random forest is a classifier composed of decision trees. Each decision tree classifies a new sample, and the most frequent category among the individual results is taken as the final classification of the sample. This process is called "voting".
Random forest performs well on many datasets because multiple decision trees vote. It can process data with many features and does not require explicit feature selection. Its disadvantages are that building a forest takes up more memory, and overfitting can occur on some datasets with diverse categories of features.
Random forest performs the best among the many classification algorithms tested for classifying the feature datasets extracted by the TGH algorithm. This is probably related to its suitability for multidimensional feature data and its use of multiple decision trees for comprehensive voting.

UTPS classification model algorithm selection:
Classification models are trained by different machine learning classification algorithms on the UTPS training set.On the basis of the above performance indicators, Naive Bayes is selected to classify feature datasets extracted by the UTPS algorithm (refer to the experimental section for a comparison of performance indicators of various classification algorithms).
On the basis of Bayes' theorem, the Naive Bayes algorithm learns a joint probability model for classification prediction via prior and conditional probabilities. Owing to the conditional independence assumption (that is, once the category is determined, all features used for classification are conditionally independent), the amount of calculation is significantly reduced. However, such an assumption may not hold in real life, so the classification accuracy of Naive Bayes can decline.
Naive Bayes trains quickly and generates a classification model easily, but its classification accuracy is low when the classification features are correlated. Nevertheless, Naive Bayes performs the best among the many classification algorithms tested for classifying the feature datasets extracted by the UTPS algorithm. This could be attributed to the limited correlation among the features in the UTPS dataset.

Bagging ensemble
As a fundamental ensemble learning technique, Bagging is a commonly employed method to enhance the performance of classification models when dealing with imbalanced data.The Bagging algorithm has the following steps.

Bagging ensemble of TGH classification model:
As previously discussed, we have chosen to employ random forest for classifying the feature dataset generated by the TGH algorithm. Random forest operates by constructing multiple decision trees, which aligns with the Bagging concept. Consequently, in this study, we do not employ Bagging for ensembling the TGH classification model further.

Bagging ensemble of UTPS classification model:
The stability of a classifier plays a crucial role in affecting the effectiveness of the Bagging algorithm. Classifier instability implies that perturbations in the dataset can lead to significant fluctuations in classification outcomes. When the base classifier within an ensemble is unstable, Bagging can substantially improve performance. (13) Conversely, the impact is limited if the base classifier is already stable. Fortunately, the Naive Bayes algorithm has been demonstrated to exhibit stability. (25) Therefore, it becomes necessary to induce instability in Naive Bayes to enable Bagging with this base classifier. In this study, we leverage a Bagging Naive Bayes classification approach from the existing literature; (25) it creates diverse training subsets to introduce instability into Naive Bayes and build an ensemble base classifier.
When forming training subsets for each base classifier, the process involves a random selection between two categories, followed by a random sampling of 30% of the samples within the chosen category.Simultaneously, 70% of the samples are drawn randomly from the other category to create a distinct training subset.Consequently, these training subsets exhibit substantial diversity and differ in distribution from the original dataset.
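The subset construction can be sketched as below; the function name and the minimum-of-one guard are our assumptions:

```python
import random

def utps_training_subset(class_a, class_b, rng=random):
    """Draw one training subset as described: pick one class at random,
    sample 30% of it, and sample 70% of the other class."""
    if rng.random() < 0.5:
        minority, majority = class_a, class_b
    else:
        minority, majority = class_b, class_a
    part_small = rng.sample(minority, max(1, int(0.3 * len(minority))))
    part_large = rng.sample(majority, max(1, int(0.7 * len(majority))))
    subset = part_small + part_large
    rng.shuffle(subset)  # avoid any class-ordered layout in the subset
    return subset
```

Because the 30%/70% split is drawn against a randomly chosen class each time, consecutive subsets differ substantially in both content and class distribution, which is what destabilizes the Naive Bayes base classifiers.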

Comprehensive decision-making
We combine the two classification models to further improve accuracy and determine the final prediction result via a voting mechanism. The objective is to use this combination to categorize the mobile device users associated with the same MAC address. In accordance with the definition above, this constitutes ensemble learning in the broad sense.
In this study, a label of 1 denotes a person of interest, whereas a label of 0 signifies the general public. When the two classification models yield concordant labels, the final prediction adopts their consensus. Conversely, if the models disagree, the final outcome defaults to 0, indicating the general public. The reason is that both classification models tend to incorrectly predict some of the general public as persons of interest, while such people account for only a small proportion of the population in real life.
Prediction accuracy is significantly enhanced when TGH and UTPS are combined in accordance with the above rule. For more details, see the experimental section.
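The voting rule above reduces to a one-line function (name ours):

```python
def final_label(tgh_pred: int, utps_pred: int) -> int:
    """Comprehensive decision rule: agreement yields the consensus label;
    disagreement defaults to 0 (general public), since false positives on
    the general public dominate in practice."""
    return tgh_pred if tgh_pred == utps_pred else 0
```

Only a sample flagged by both models is predicted to be a person of interest, which trades a little recall for a marked gain in precision.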

Data processing
Concerning the MACs collected in a city, the general public accounts for the most significant proportion, and their behaviors differ widely. Naturally, it is unrealistic to perform statistical analysis on all data of the general public. In addition, the number of individuals in the general public is markedly different from that of persons of interest, so the former is undersampled. In this study, a random undersampling technique was employed to select 10,653 MAC addresses from a pool of 8 million MAC addresses belonging to regular residents. The MAC information of a total of 803 suspects was obtained from relevant departments to constitute the experimental dataset.
In terms of research on space-time trajectories, Gong (26) pointed out that people's activities are mainly concentrated in cycles of seven days. Consequently, at the beginning of our exploration, we conducted classification research on one week of data from August 2017.
In this study, we explored the MAC addresses with seven full days of data in that week, including 7,588 members of the general public and 593 persons of interest, as shown in Table 3.
As shown in Fig. 2, the activities of different types of person show different characteristics in different time periods, especially in the middle of the night, when the activity of persons of interest is much higher than that of the general public.
The trajectory data of different categories show the characteristics of long tail distribution, as shown in Fig. 3.

Experimental environment
The platform is a Dell server running a 64-bit system (16-core CPU at 2.6 GHz, four GTX 3090 GPUs, and 32 GB of main memory). The algorithms and models described herein were implemented in Python 3.7.

Feature dimensionality reduction
Table 4 shows the original dataset and dataset dimensions generated by two algorithms.
The TGH dataset has 399 dimensions, including 96 time dimensions and 303 space dimensions. Specifically, the former is formed by dividing the 24 h of a day into time slices of 15 min each; the granularity of 15 min was determined through experiments.
Table 5 shows the experimental results obtained under granularities of 1 h, 30 min, 15 min, and 5 min. The data originate from one week of records of the MAC addresses of 593 members of the general public and 593 persons of interest selected by the One-Sided Selection undersampling algorithm. A random forest with default parameters is adopted to obtain performance indicators through 10-fold cross-validation.
It is most suitable to select 15 min as the granularity of time slices.

Classification algorithm selection
In subsequent tests, an oversampling method was found to cause overfitting. Therefore, it was abandoned, and only the MAC address samples of the general public were undersampled. The final training set consisted of the MAC addresses of 593 members of the general public and 593 persons of interest.

TGH classification model algorithm selection
As shown in Table 6, random forest used to implement the TGH classification model has the best performance.

UTPS classification model algorithm selection
Ultimately, we opt for the Naive Bayes algorithm as the classification method for the UTPS classification model because it achieves the optimal performance, as shown in Table 7. This choice supersedes more commonly employed classification algorithms, which could be attributed to the limited correlation observed among the features in the UTPS dataset. Owing to the limited number of samples of persons of interest, the classification model's accuracy currently falls below 80%. However, the lift value signifies that the probability of finding an actual person of interest among the people predicted to be persons of interest has increased several times. This is conducive to better analyzing and classifying different groups of people.
To recap, we introduce the TGH and UTPS algorithms for feature extraction.Coupled with the ongoing refinement of our classification model, we aspire to provide guidance for future research endeavors in this domain.The findings from this phase have already found practical applications within real-world systems.

Conclusions
Using the vast spatiotemporal data from city sensors, we explored the classification and prediction of persons of interest using traditional machine learning methods, which we hope provides ideas for similar research. Owing to uncertainties in machine learning, the methods should be further improved and expanded, including the combination of classification algorithms and the processing of unbalanced datasets. In this study, the limited number of samples of persons of interest may hinder the improvement of the classification model performance. Therefore, we shall continue investigations in this field and constantly improve the classification models after acquiring more samples of persons of interest.

About the Authors
Lifeng Chen works in the Information and Technology Center (Supercomputing Center) of Hangzhou City University and is mainly in charge of the construction, operation, and maintenance of major digital applications of the school, such as the school's public data collaboration platform, the design and development of campus mobile and PC applications, and the online teaching platform, with rich experience in project construction and management. He is currently serving as the digital commissioner of Hangzhou City, connecting the school with the digital reform task of Hangzhou City. (chenlf@hzcu.edu.cn)

Canghong Jin is an associate professor of computer science at Hangzhou City University. His research focuses on mining and modeling large social and information networks, spatiotemporal series mining, and big data platforms. The problems he investigates are motivated by large-scale transit records, the web, and online media. (jinch@hzcu.edu.cn)

(1) Train multiple base classifiers. Use the bootstrap sampling method to draw samples from the original dataset and obtain a new dataset; train on the new dataset to obtain a base classifier. Repeat this many times to obtain multiple base classifiers. (2) Integrate the multiple base classifiers. Use the base classifiers to classify and predict the same samples, and adopt a voting mechanism that takes the most frequently predicted category as the final category.
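The two steps can be sketched generically as below; the callable-based interface (`fit` returns a trained model that is itself callable) is our assumption:

```python
import random
from collections import Counter

def bagging_predict(train, fit, sample, n_estimators=11, rng=random):
    """Minimal Bagging: bootstrap-sample the training set, fit one base
    classifier per bootstrap sample, then majority-vote on a new example."""
    models = []
    for _ in range(n_estimators):
        # step (1): bootstrap sample, i.e., draw with replacement
        boot = [rng.choice(train) for _ in train]
        models.append(fit(boot))
    # step (2): each model votes; the most frequent category wins
    votes = Counter(m(sample) for m in models)
    return votes.most_common(1)[0][0]
```

An odd `n_estimators` avoids ties in the binary-label case considered in this study.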

Fig. 2 .
Fig. 2. (Color online) Activities of different types of person at different time periods.

Fig. 3 .
Fig. 3. (Color online) Regional activity distributions of different types of person.

Fig. 4 .
Fig. 4. (Color online) Selection of number of base classifiers for Bagging ensemble of UTPS classification model.
• The User Transit Pattern and Similarity (UTPS) model is developed to extract user habits. It involves the calculation of spatiotemporal information collected through the media access control (MAC) address, which can be regarded as the identity of the mobile device, during various work and rest periods, as well as the evaluation of the similarity of daily activities for each MAC. The model depicts the intensity and regularity of daily activities in different regions and time periods for each MAC. Additionally, the Bagging algorithm of ensemble learning is introduced to improve the UTPS model.
• Eventually, the TGH and UTPS models are synergistically combined for comprehensive decision-making. The experimental results show that the combined model considerably improves the classification accuracy compared with a single model. Lift value calculation results indicate that the proposed model can better classify and predict people with diverse behaviors.

Definition 1 (Moving Point): The moving point is represented by O = (p, m, t), where p represents the location information, including the latitude, longitude, and name of the monitoring point, m refers to the MAC information, and t indicates the time information.

Definition 2 (Path): Given a set of moving points <O> and a specific MAC address a, a path associated with a can be expressed as P = {O_1, O_2, ..., O_n}, where ∀O_i, O_i(m) = a, and for i < j, O_i(t) < O_j(t).

Definition 3 (Person of Interest): MAC addresses can be classified into two types: Person of Interest and General Public. The former consists of MAC addresses provided by the public safety department of a city, representing individuals who are of specific security or investigative interest. These addresses are designated on the basis of criteria such as suspected criminal activity, surveillance targets, and involvement in ongoing investigations. The focus is on monitoring the movements and activities of these individuals for public safety and security purposes. The latter category includes MAC addresses associated with the general population.

Table 1
Correspondence between Geohash coding length and error.
for each MAC address with trajectory P in <P> do
    divide trajectory P into time segments p1 to p4 by time, denoted by P_i
    for each O_j in P do
        sum(record), sum(record, P_i) ← #O_j  // count the number of records and the total number of records in the top-10 most frequent areas; other data are counted similarly
        if O_j(p) ∈ <KAP> then sum(kap), sum(kap, P_i) ← #O_j
    calculate the proportion of each feature in each time period: hf, <HF> ← hf
    calculate user behavior similarity, see Sect. 4.4.2
return <HF>

Table 4
The resulting lift value of 2.3 demonstrates that, with the TGH classification model in use, the likelihood of selecting a person of interest from those predicted is 2.3 times higher than in the original dataset.This lift value further increases to 3.4 when combining the TGH and UTPS classification models.

Table 9
Performance metrics for the dataset in the classification model.

Table 8
Prediction results of classification model on dataset.