Real-time Hand Movement Trajectory Tracking with Deep Learning

In this study, we employed deep learning to develop a real-time hand trajectory tracking system. Our approach integrates the MobileNetv2 single-shot multibox detector (SSD), known for its accuracy, with the continuously adaptive mean-shift (CAMShift) algorithm. This combination ensures robust hand detection across diverse scenarios. Through testing on webcam images and using feature extraction methods such as contour discernment and skin hue differentiation, we report an 88.17% increase in detection accuracy over traditional models. Moreover, with a latency of merely 0.0343 s, our system is suited to immersive gaming and assistive devices for individuals with disabilities.


Introduction
(7)(8) Dynamic gestures, in particular, offer a potent medium for the general public to convey information effectively.
Historically, hardware devices were the primary tools for detecting hand positions in dynamic gestures. However, with the advent of computer vision (image processing), machine learning, and deep learning, there are now methods to detect human hands in images. Deep learning has made significant strides in image recognition, especially object detection.
In this research, our focus is on deep learning for hand detection. Once the hands are detected, relying on a deep learning network for recognition in each frame might not be efficient, given the potential performance constraints on devices with limited hardware capabilities. A more streamlined approach is to predict hand features within the hand region of interest (ROI) and track the hand's movement trajectory.
We present a dynamic gesture-tracking system that employs deep learning to detect human hands in images captured via a network camera. Upon successful detection, the hand ROI undergoes binarization for enhanced clarity. Tracking algorithms then predict the hand's position in subsequent frames, enabling continuous tracking of hand movement trajectories. The tracked data are then visualized on a computer, providing users with an intuitive view of hand movements.

Literature Review
Research in hand detection predominantly falls into two categories: pure palm detection and sign language recognition. Various methodologies, ranging from image processing to machine learning and deep learning, have been employed in these studies. The classical image processing approach encompasses techniques such as color space transformation, skin color detection, edge detection, and hand localization within images (9). Machine learning techniques, especially the support vector machine trained on hand-containing images, have reported accuracies exceeding 90% (10,11). Deep learning, given its versatility, has found applications in numerous domains. Specifically, the single-shot multibox detector (SSD) (12) integrated with VGG16 (13), a renowned convolutional neural network, has been pivotal in recognizing sign language, particularly within the deaf community. The optimized MobileNetv2 SSD stands out because of its efficiency and accuracy in object detection tasks, especially in real-time scenarios (14). Google's MediaPipe module, a fusion of SSD and a neural feature pyramidal network, excels in real-time palm detection and has shown impressive results even on embedded devices (17).
In hardware, devices such as Kinect have been instrumental in dynamic hand tracking when paired with neural networks or tracking algorithms. For instance, Kinect discerns the movement trajectory of the palm, and a backpropagation neural network (BPNN) is subsequently employed to decipher the pattern traced by the movement (18). Multicamera setups have been explored for sustained target tracking (19). After successful hardware-based hand coordinate detection, Kalman filtering has emerged as a preferred choice for tracking (20,21). The integration of MediaPipe with tracking algorithms has further enhanced hand-tracking capabilities. Techniques that extract histograms of oriented gradients from images and then deploy the continuously adaptive mean-shift (CAMShift) algorithm have been proposed as viable hand-tracking solutions (22,23).

System Architecture and Methodology
Our hand-tracking system, equipped with a webcam, processes images in real time through a series of interconnected modules. The system is structured around three primary modules: the hand detection module, which identifies the hand's presence; the CAMShift module, which is responsible for adaptive tracking; and the track retention module, which ensures consistent tracking across frames (Fig. 1).
Hand detection proceeds in three steps: image resizing, mean RGB value subtraction (24), and SSD detection. Initially, RGB images, sourced from devices such as webcams, are resized to a standardized dimension of 300 × 300 pixels. This dimension, chosen on the basis of empirical studies, is tailored to the SSD's performance. A cornerstone of this methodology is the subtraction of mean RGB values, a normalization technique that evens out the image's average intensity and strengthens object detection. This process accentuates the object's distinct features, ensuring steady detection even under varying luminosity within the image. The impact of this step is illustrated by the pronounced contrast between the images before and after mean RGB value subtraction, as shown in Fig. 2. The process of mean RGB subtraction is elaborated in Eqs. (1) and (2), with the entire workflow illustrated in Fig. 3.
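The resizing and mean-subtraction steps can be illustrated with a minimal NumPy sketch. It assumes that the mean is computed per channel over the image itself; many SSD implementations instead subtract fixed dataset-wide channel means, and the text does not settle which variant is used. The nearest-neighbour `resize_nearest` helper and all function names are hypothetical stand-ins, not the authors' code.

```python
import numpy as np


def resize_nearest(image, size=(300, 300)):
    """Nearest-neighbour resize of an H x W x 3 array to size=(height, width)."""
    h, w = image.shape[:2]
    out_h, out_w = size
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    return image[rows[:, None], cols[None, :]]


def subtract_channel_means(image):
    """Subtract each channel's mean so that every channel averages to zero.

    Assumption: the mean is taken over the image itself, per channel.
    """
    image = image.astype(np.float64)
    means = image.mean(axis=(0, 1))  # one mean per R, G, B channel
    return image - means


def preprocess(image):
    """Resize to 300 x 300 and apply per-channel mean subtraction."""
    return subtract_channel_means(resize_nearest(image))
```

A production pipeline would normally use an interpolating resize from an image library; nearest-neighbour sampling is used here only to keep the sketch dependency-free beyond NumPy.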

Following the preprocessing steps, the SSD commences its detection routine. The SSD uses a VGG16-based network structure in which the fully connected layers are removed and supplementary convolutional layers are added. A distinctive feature of the SSD is its reliance on multiscale feature maps, also termed a pyramidal feature hierarchy, which allows targets of different sizes to be detected on different feature maps. Figure 4 shows the SSD architecture: the base layers come from VGG16, and the subsequent convolutional layers serve as feature maps for detecting objects across a range of sizes. Small objects are predicted in the earlier layers and larger objects in the later layers. To further refine detection precision, the SSD places prior boxes of diverse sizes and aspect ratios on each feature map.
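As an illustration of how prior boxes tile the feature maps, the sketch below generates boxes using the linear scale rule from the original SSD formulation (scales from 0.2 to 0.9 of the image size). The feature-map sizes and aspect ratios passed in are placeholders chosen for the example, not the configuration used in this work.

```python
import math


def layer_scales(num_layers, s_min=0.2, s_max=0.9):
    """Linearly spaced box scales, one per feature map (fraction of image size)."""
    if num_layers == 1:
        return [s_min]
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]


def prior_boxes(feature_map_sizes, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) prior boxes in normalized [0, 1] image coordinates.

    Earlier (larger) feature maps receive smaller scales, so they cover small
    objects; later (smaller) feature maps receive larger scales.
    """
    scales = layer_scales(len(feature_map_sizes))
    boxes = []
    for size, scale in zip(feature_map_sizes, scales):
        for row in range(size):
            for col in range(size):
                cx = (col + 0.5) / size  # box center sits at the cell center
                cy = (row + 0.5) / size
                for ratio in aspect_ratios:
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)
                    boxes.append((cx, cy, w, h))
    return boxes
```

For example, `prior_boxes([4, 2, 1])` yields three boxes per cell over 16 + 4 + 1 cells.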
The SSD's output contains the coordinates of the detected entities. To ensure that the output mirrors real-world scenarios, redundant bounding boxes are pruned using a fast nonmaximum suppression technique. Subsequently, the position of each detected entity is marked on the image, facilitating intuitive visualization.
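The pruning step can be sketched with the standard greedy form of nonmaximum suppression. This is the textbook variant, shown only to make the idea concrete; the accelerated variant referred to above is not specified, and the 0.5 overlap threshold is an assumption for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first.

    A box is kept only if it does not overlap an already kept box by more
    than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```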

CAMShift
Upon successful detection, the CAMShift module is invoked to track the hand. Figure 5 shows the architecture of the CAMShift module. To start tracking, the bounding box's coordinates and dimensions are taken from the hand detection result of the preceding frame. Because the RGB space is susceptible to luminosity fluctuations, which can significantly impede tracking, the image is transformed from RGB to the more stable HSV space. Subsequently, a histogram of the hue (h) component is constructed, representing the probability distribution of hue values across the image's pixels. Figure 6 illustrates the histogram after initialization. In the next step, a back-projection technique maps the value of each pixel in the image to obtain a color probability distribution map, which is represented as a grayscale image.
Using the mean-shift algorithm, the color probability distribution map within the bounding box is selected. This is followed by the computation of the zeroth- and first-order moments of the map, where I(x, y) is the pixel value at position (x, y) and the sums run over the pixels within the bounding box:

\[
M_{00} = \sum_{x}\sum_{y} I(x, y), \qquad
M_{10} = \sum_{x}\sum_{y} x\, I(x, y), \qquad
M_{01} = \sum_{x}\sum_{y} y\, I(x, y)
\]

These moments yield the centroid (x_c, y_c) = (M_{10}/M_{00}, M_{01}/M_{00}) and the dimensions of the new bounding box. The bounding box is adaptively resized so that it quickly adjusts to changes in the target region, thereby ensuring effective tracking. As the process advances to the next frame, the outcome obtained after step A024, including the bounding box's coordinates, serves as the basis for recalibrating the bounding box's dimensions and position. This leads to the execution of step A021, starting the next tracking phase.
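The back-projection and window-recentering steps can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: hue is assumed to be an 8-bit channel in the range [0, 180) as in OpenCV, the bin count of 16 is arbitrary, and the adaptive window resizing that distinguishes CAMShift from plain mean shift is omitted for brevity.

```python
import numpy as np

HUE_RANGE = 180  # assumption: OpenCV-style 8-bit hue values in [0, 180)


def hue_histogram(hue, window, bins=16):
    """Hue histogram of the pixels inside window=(x, y, w, h), scaled to max 1."""
    x, y, w, h = window
    patch = hue[y:y + h, x:x + w]
    hist, _ = np.histogram(patch, bins=bins, range=(0, HUE_RANGE))
    return hist / max(hist.max(), 1)


def back_project(hue, hist):
    """Replace every pixel's hue with its histogram value (probability map)."""
    bins = len(hist)
    index = np.clip(hue.astype(np.int64) * bins // HUE_RANGE, 0, bins - 1)
    return hist[index]


def mean_shift(prob, window, max_iter=20):
    """Shift the window toward the centroid of prob until it stops moving."""
    x, y, w, h = window
    rows, cols = prob.shape
    for _ in range(max_iter):
        roi = prob[y:y + h, x:x + w]
        m00 = roi.sum()  # zeroth-order moment
        if m00 == 0:
            break  # no target evidence inside the window
        ys, xs = np.mgrid[0:roi.shape[0], 0:roi.shape[1]]
        xc = (xs * roi).sum() / m00  # centroid x = M10 / M00 (window coords)
        yc = (ys * roi).sum() / m00  # centroid y = M01 / M00 (window coords)
        new_x = x + int(round(xc - (roi.shape[1] - 1) / 2))
        new_y = y + int(round(yc - (roi.shape[0] - 1) / 2))
        new_x = min(max(new_x, 0), cols - w)  # keep the window inside the image
        new_y = min(max(new_y, 0), rows - h)
        if (new_x, new_y) == (x, y):
            break  # converged
        x, y = new_x, new_y
    return x, y, w, h
```

In use, the histogram would be built once from the detector's bounding box, and `back_project` followed by `mean_shift` would be run on each subsequent frame, starting from the previous frame's window.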

Track retention
The track retention module preserves the tracking data associated with hand movements. As CAMShift runs, the bounding box coordinates are used to locate the center of the hand along the newly traced trajectory. These center point coordinates are then stored in an array, so that the hand's trajectory is recorded accurately over time. Figure 7 shows the architecture of the track retention module.
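A minimal sketch of this bookkeeping is given below. The class name, the (x, y, w, h) box layout, and the bounded history length are assumptions made for the example; the text above specifies only that center points are stored in an array.

```python
from collections import deque


class TrajectoryBuffer:
    """Keeps the most recent bounding-box centers of the tracked hand."""

    def __init__(self, max_points=64):
        # Assumption: a bounded history; the oldest points are dropped first.
        self.points = deque(maxlen=max_points)

    def add_box(self, box):
        """Store and return the center of an (x, y, w, h) bounding box."""
        x, y, w, h = box
        center = (x + w // 2, y + h // 2)
        self.points.append(center)
        return center

    def segments(self):
        """Consecutive point pairs, ready to be drawn as a trajectory line."""
        pts = list(self.points)
        return list(zip(pts, pts[1:]))
```

Each pair returned by `segments()` corresponds to one line segment of the trajectory drawn over the video frame.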

Environment
Our experiments were executed on a Microsoft® Windows® 10 Home Edition computer, chosen for its stability and compatibility with our tools. The system was powered by an Intel i5-12500 CPU with 32 GB of RAM, ensuring swift data processing and multitasking. An Nvidia GeForce RTX 3070 graphics card handled the intensive graphical tasks inherent in our research.

Image databases
Our experimental framework used two databases, each addressing specific facets of hand movement analysis. The first, the EgoHands database (25), contains 4800 images, each with a resolution of 1280 × 720 pixels. These images, captured via Google Glass, predominantly show first-person interactions, often featuring up to four hands in diverse scenarios, such as playing card games in an office or solving puzzles together in a courtyard. Sample images from the EgoHands database are shown in Fig. 8. In contrast, our second database was custom-built from hand-waving videos. Postprocessing these videos yielded 5065 images, each standardized to a resolution of 1920 × 1080 pixels. This custom database covers four distinct scenes: a bedroom, a living room, a dining room, and an academic classroom. Figure 9 shows samples from this database, illustrating the diversity in hand poses and backgrounds.

Experimental results
To evaluate the performance of our model, we used average precision (AP) as our primary metric, given its widespread acceptance for detection tasks. Following established protocols, AP was computed using precision and recall. Figure 10 shows our detection outcomes alongside the original input, whereas Fig. 11 shows the binarized ROI, which makes hand trajectories easier to discern.
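AP can be computed from score-ranked detections as the area under the precision-recall curve, where precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below uses all-point interpolation, which is one common convention; the exact variant used in this work is not stated, so that choice is an assumption. Detections flagged as true positives are presumed to have already been matched to ground-truth boxes.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve with all-point interpolation.

    scores: confidence of each detection.
    is_true_positive: whether each detection matched a ground-truth box.
    num_ground_truth: total number of ground-truth boxes (TP + FN).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Interpolate: each precision becomes the maximum precision at any
    # equal or higher recall, making the curve non-increasing.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```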
We partitioned all databases for our experiments, allocating 80% for training and the remaining 20% for validation. Our experimental setup was consistent, employing a batch size of 16, a momentum of 0.9, a learning rate of 0.0001, and an intersection-over-union threshold of 0.5. Table 1 lists our experimental outcomes across the different databases. The performance difference between the EgoHands and our custom databases was marginal. Intriguingly, combining the two databases resulted in a slight dip in performance. These observations underscore the SSD model's capability in hand detection, albeit with variations contingent on the dataset.
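For reproducibility, the 80/20 partition and the stated hyperparameters can be captured in a few lines. The sketch assumes a seeded random shuffle before splitting; whether the original split was random or sequential is not stated, and the names `split_dataset` and `TRAINING_CONFIG` are hypothetical.

```python
import random

# Hyperparameters as stated in the text.
TRAINING_CONFIG = {
    "batch_size": 16,
    "momentum": 0.9,
    "learning_rate": 1e-4,
    "iou_threshold": 0.5,
}


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle items reproducibly and split them into (train, validation)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: random, seeded split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Applied to the 5065 images of the custom database, this yields 4052 training and 1013 validation images.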
Figure 12 shows the detection outcomes achieved with the SSD model, marked by a green rectangle. In contrast, the hand-tracking outcomes derived from the CAMShift algorithm are marked by a purple rectangle in Fig. 13. A comparison between the two figures reveals a subtle size discrepancy between the bounding boxes. Figure 13 also includes a purple trajectory line, offering an intuitive visualization of hand motion.
To assess our system's efficiency, we measured the execution time per frame for MobileNetv2 SSD with and without the CAMShift algorithm. While hand detection was required for the first frame, the subsequent frames showed a marked reduction in execution time when CAMShift was used. This observation is consistent with the CAMShift algorithm's lightweight nature, which ensures minimal hardware overhead (Table 2). Such efficiency gains bolster the real-time performance of our system, making it suitable for real-world applications.
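The detect-once, track-afterwards control flow and the per-frame timing can be sketched as follows. The `detect` and `track` callables are hypothetical placeholders for the SSD detector and the CAMShift tracker, and falling back to detection when the tracker returns `None` is an assumption added for the example rather than a behavior described above.

```python
import time


def run_pipeline(frames, detect, track):
    """Run detection on the first frame and tracking on later frames.

    Returns a list of (stage, box, seconds) tuples, one per frame. If the
    tracker reports a lost target (returns None), the next frame falls back
    to detection (assumption).
    """
    box = None
    results = []
    for frame in frames:
        start = time.perf_counter()
        if box is None:
            box, stage = detect(frame), "detect"
        else:
            box, stage = track(frame, box), "track"
        elapsed = time.perf_counter() - start
        results.append((stage, box, elapsed))
    return results
```

Averaging the recorded times separately for the "detect" and "track" stages gives the kind of per-frame comparison reported in Table 2.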

Discussion
In this study, we developed a system for real-time hand movement tracking based on the SSD model. Our empirical evaluations underscore the value of tracking algorithms, with CAMShift significantly curtailing hardware overhead. Notably, our system registers an 88.17% increase in detection efficacy compared with traditional methods, complemented by a latency of 0.0343 s. These metrics indicate the system's robustness and suitability for real-time hand movement tracking across many practical use cases.
Looking ahead, integrating more advanced deep learning models with tracking algorithms promises to improve accuracy and system reliability. Furthermore, compact neural networks tailored for embedded systems present a prospect warranting exploration. Striking a balance between detection fidelity and computational speed remains paramount, particularly when catering to resource-constrained devices. Our work lays a foundation for further advances in hand movement tracking. We foresee many human-computer interaction (HCI) and gesture-centric control applications arising from the combination of deep learning with tracking methodologies. As research progresses, we remain optimistic about finding avenues that make user experiences more intuitive and immersive.

Conclusions
Real-world applications underscore the importance of understanding the intricacies, challenges, and limitations of hand movement tracking. Central to this is the need to balance detection accuracy with computational speed. Anchored by the SSD model, our system delivers strong performance in real-time tracking. However, there remains room to improve precision. While CAMShift conserves hardware resources, its tracking fidelity might waver in rapidly changing environments.
Integrating state-of-the-art deep learning architectures is essential to realizing our system's full potential. Detectors such as You Only Look Once (YOLO) and Faster R-CNN, renowned for their object detection performance, promise to ease accuracy concerns. However, they might increase computational overhead. This trade-off highlights the appeal of edge computing and the strategic deployment of hardware accelerators.
The versatility of our system, evident in domains such as HCI and gesture-centric controls, hints at a broader range of applications. A case in point is its potential adaptation for telemedicine, especially remote patient monitoring for conditions such as Parkinson's disease.
As technology evolves, it is important to remain aware that emergent technologies may either bolster our assertions or pose challenges. The advent of new paradigms such as quantum computing and neuromorphic chips reinforces the need for adaptability.
In conclusion, this work underscores the need for a holistic strategy that improves system accuracy, optimizes computational throughput, and broadens the range of applications. The rapid evolution of this domain calls for sustained research aimed at building a truly adaptive hand movement tracking system.

Table 2
Execution times of MobileNetv2 SSD with and without CAMShift.