RGB-D Depth-sensor-based Hand Gesture Recognition Using Deep Learning of Depth Images with Shadow Effect Removal for Smart Gesture Communication

The proposed gesture recognition approach can benefit people requiring interactive action recognition and further promote smart gesture communication.


Introduction
Well-known RGB-D sensor devices, such as Microsoft Kinect,(10) Leap Motion Controller (LMC),(11) and ASUS Xtion,(12) contain two different modalities of sensors for image capture: RGB color and depth image sensors. Generally, the RGB sensor cannot acquire standard images under low illumination. In contrast, the depth sensor has strong light tolerance, and its performance is therefore satisfactory even in a dark environment.(15)(16)(17)(18)(19)(20)(21)(22)(23)(24) Among depth-sensor-based works, such as depth-based scene map construction for robot operations(13)(14)(15)(16) and depth-based human body gesture image acquisition for human body analysis,(17)(18)(19) little attention has been paid to the properties of depth-grayscale images for recognition because of the relatively long sensor capture distance. However, in studies in which a depth sensor is used to develop a hand gesture recognition system,(20)(21)(22)(23)(24) the system is extremely sensitive to the characteristics of the depth-grayscale hand gesture images captured by the sensor. This is because a very short distance between the active hand and the depth sensor is required to obtain complete hand gesture information, and such short-distance hand gesture image acquisition by the depth sensor unavoidably causes a serious shadow effect.(22)(23)(24) An improved regression network for depth hand pose estimation was proposed by Xu et al.(20) Lai and Yanushkevich(21) developed a dynamic hand gesture recognition system in which two deep learning schemes, based on a convolutional neural network (CNN) and a recurrent neural network, are adopted to analyze both the skeleton information and the spatial information from depth images. Otberdout et al.(22) proposed a hand pose estimation approach based on a deep-learning depth map for hand gesture recognition.
A CNN and a generative adversarial network were used for depth-sensor-based hand gesture recognition and hand pose estimation, respectively.(23,24) However, the undesired shadow effect caused by the depth sensors deployed in RGB-D devices has scarcely been taken into consideration in current research on depth-sensor-based hand gesture recognition. To overcome the shadow effect resulting from the RGB-D depth sensor and further evaluate the effectiveness of shadow-free depth sensor data on the recognition performance of a deep neural network (DNN), we have developed an RGB-D depth-sensor-based interactive hand gesture recognition system using a typical visual geometry group (VGG)-type CNN with improved data derived from serial binary image extraction for shadow effect removal. The proposed depth-sensor-based hand gesture recognition with improved depth gesture images, from which the shadow regions have been removed, can ensure more reliable action recognition results. Such hand gesture recognition with satisfactory performance can then be used in real-life applications that require gesture recognition for interaction, communication, or operations (e.g., gesture interaction between disabled and able-bodied people, communication of actions between miners underground in low light, and hand gesture control in a smart factory/car/home). In this work, the Kinect sensor device illustrated in Fig.
1 is employed to capture depth-grayscale hand gesture actions. As mentioned, the Kinect device, which is a compound image sensor, has separate RGB and depth sensors. These two image sensors, based on a CMOS image sensor and time-of-flight sensing, can be used to capture color RGB and depth-grayscale images simultaneously. Another feature of Kinect image acquisition is that either the RGB sensor or the depth sensor can be activated and operated alone. In this work on RGB-D depth-sensor-based hand gesture recognition, only the Kinect depth sensor channel is operated. As can be seen in Fig. 1, the depth sensor set in Kinect essentially comprises one IR projector and one IR camera, resulting in an adverse shadow effect, as described in detail in Sect. 2.
The main contributions of this study are summarized as follows:

Undesired Shadow Effect of Typical Depth Sensors in Gesture Recognition
A typical depth sensor uses both an IR projector and an IR camera to capture the depth image of a target. As mentioned before, popular RGB-D sensor devices generally have RGB and depth sensors, and the depth sensor in the RGB-D sensor device is composed of one IR projector and one IR camera. During image capture or recording, the IR projector and the IR camera are used simultaneously to capture the depth image of the target. Figure 2 depicts the projection and reception of the IR light beams of the Kinect depth sensor. As can be seen in Fig. 2, the captured depth image representing the target object contains a shadow region of black pixels. This is the so-called shadow effect of the depth sensor, and the shadow region that appears in the captured depth image consists mainly of black pixels. According to the analysis of Danciu et al.,(25) there are two causes of the shadow region: (1) unexpected object occlusion and (2) optical refraction and reflection. The rationale behind the first cause is that an unexpected object can act as an obstacle to the target object located behind it. Regarding the second cause, reflection or refraction occurs when the IR light emitted by the IR projector reaches the target object; depending on the material and the degree of light absorbance of the target object, different reflection or refraction phenomena will occur. Owing to the design limitations of such depth sensor devices, black shadow regions will inevitably exist in the captured depth images of a depth-sensor-image-based application system. The effect of the black shadow region on the captured image significantly depends on the relative distance between the depth sensor device and the desired target object. When the distance is relatively large (as in human body or skeleton detection for human gesture recognition applications), the effect of the black shadow region is small. However, when a relatively small distance is required for fine image capture (e.g., the recognition of hand gesture communication actions carried out in this study), the effect of the unwanted black region on the overall captured depth image greatly increases. Figure 3 illustrates the unwanted black shadow segment located around a real hand segment with depth-grayscale information. It can be clearly seen in Fig. 3 that the black hand shadow region has a large effect on the intended depth-grayscale hand region, which is a serious problem for hand gesture recognition with high recognition accuracy.
For depth-sensor-based applications, the above-mentioned shadow effect adversely affects the system performance and is undesired. In this work, the Kinect depth sensor is used to recognize hand gesture communication actions, and a large black shadow region in the captured hand gesture depth image will clearly reduce the recognition accuracy of the established system. To alleviate the shadow effect, a serial binary image extraction approach is proposed to obtain improved hand gesture depth data for VGG-type CNN deep learning and recognition, as described in detail in Sect. 3.

Deep Learning of Depth Images with Shadow Effect Removal for RGB-D Depth-sensor-based Interactive Hand Gesture Recognition
Figure 4 shows the overall framework of the constructed hand gesture communication action recognition system using the Kinect sensing device with a depth sensor set comprising one IR projector and one IR camera. Using the developed recognition system, 10 hand gesture communication actions that are commonly used for interactions can be categorized. When performing recognition on a dynamic hand gesture action, continuous-time hand gesture images (frames) are obtained from the Kinect depth sensor. As shown in Fig. 4, for the acquired depth sensor images, the region of interest (ROI) is first extracted to derive the significant image segment of active hand gesture regions, referred to as the ROI-depth image. As mentioned in the previous section, such depth-sensor-derived images are not ideal at this stage and have a relatively large shadow region. The proposed serial binary image extraction scheme is applied to the ROI-depth image to eliminate most of the shadow region and alleviate the shadow effect. After removing the shadow from the ROI-depth image, the improved depth sensing data, referred to as ROI-depth + binary-values or ROI-depth + binary-values + binary-values, are further used for the training and recognition test of the VGG-type CNN deep learning model, as described in Sects. 3.1 and 3.2.

Shadow effect removal of depth images using serial binary image extraction
Each pixel in the ROI-depth image has a grayscale value between 0 and 255, which indicates the closeness of the corresponding point to the depth sensor. Assuming that the size of the ROI-depth image is n (width) by m (height), there will be n·m depth degree values determined directly from the Kinect software development kit (Kinect SDK). Each of these n·m values represents the depth-grayscale value of the corresponding pixel (x, y), denoted as Depth degree value(x, y). Equation (1) shows the calculation of Depth degree value(x, y). Note that the value of Threshold is generally set as 4000, corresponding to the maximum sensing distance of the Kinect depth sensor of 4000 mm (i.e., 4 m). When the distance of the object is greater than 4 m, the value of all pixels in this object will be 255 ('white'). Conversely, if the object is much closer to the depth sensor (i.e., an extremely small value of Distance), then Depth degree value(x, y) will be close to 0 ('black'). The depth-grayscale degree varies linearly within the range of Threshold.
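As a concrete illustration, the distance-to-grayscale mapping of Eq. (1) can be sketched in Python as follows; the function name and the integer truncation are illustrative assumptions, since the paper itself obtains these values directly from the Kinect SDK.

```python
def depth_degree_value(distance_mm, threshold=4000):
    """Map a raw depth sensor distance reading (in mm) to a 0-255
    depth-grayscale value per Eq. (1): linear within Threshold,
    white (255) beyond the maximum sensing distance."""
    if distance_mm > threshold:
        return 255  # beyond the 4 m sensing range -> white pixel
    # Depth-grayscale degree varies linearly within the range of Threshold.
    return int(distance_mm * 255 / threshold)
```

For example, a point at 2000 mm maps to a grayscale value of 127, while any point beyond 4000 mm saturates at 255 (white).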

Depth degree value(x, y) = (Distance(x, y) / Threshold) × 255, if Distance(x, y) ≤ Threshold;
Depth degree value(x, y) = 255, if Distance(x, y) > Threshold. (1)

By using the estimated value of each Depth degree value(x, y) of the ROI-depth image, the proposed serial binary image extraction method removes the undesired shadow region in the ROI-depth image. The proposed process of serial binary image extraction for shadow effect removal is composed of two stages of binary value estimation, called phase-1 and phase-2. Phase-1 binary image extraction is performed using the following equation:

Binary value(x, y) = Hand gray level, if Depth degree value(x, y) = Hand gray level;
Binary value(x, y) = 0, if Hand gray level < Depth degree value(x, y) ≤ 255. (2)

In the phase-1 extraction, the data of the ROI-depth image, with each Depth degree value(x, y) determined by Eq. (1), are further converted to binary-valued information. As can be seen in Eq. (2), the grayscale-degree value of each pixel in the ROI-depth image is transformed to an estimated binary value. Hand gray level in Eq. (2) represents the depth-grayscale value of the hand region of the ROI-depth image, and its value is computed using the algorithm in Fig. 5, expressed in pseudo-code. In Eq. (2), the value of 255 represents a white pixel, i.e., no degree of grayscale appears in this pixel. The proposed algorithm for determining Hand gray level is conceptually simple and computationally fast. In the algorithm, the input is the ROI-depth image, and the main purpose of this algorithm is to find significant regions, including the hand region, each of which has numerous pixels with the same depth-grayscale value. The depth-grayscale value of the estimated hand region is finally returned. As can be seen in Fig.
5, the values of Gray-0, Gray-1, …, Gray-255 are set to zero in the initialization step. Gray-0, Gray-1, …, Gray-255 denote the vote boxes that accumulate votes of the pixel grayscale values of the input ROI-depth image. When serial binary image extraction is performed on an ROI-depth image of size n by m, a total of n·m pixels vote among Gray-0, Gray-1, …, Gray-255. Note that the pixel grayscale values are in the range of 0 to 255. If an image pixel has a grayscale value of i, the vote box Gray-i is increased by one. Also note that in Eq. (2), if the value of Depth degree value(x, y) lies between Hand gray level and 255, the pixel is directly set to 0. The operations in Eq. (2) transform the ROI-depth image into a binary-valued image. In the transformed image, the hand region retains the same depth-grayscale value, and the pixels in the other regions of the image are black. The black shadow region is therefore no longer distinguishable and is discarded (see Fig. 6).
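The vote-box algorithm of Fig. 5 and the phase-1 extraction of Eq. (2) can be sketched in Python as follows; representing images as nested lists and excluding the pure-black (shadow) and pure-white (out-of-range) vote boxes when picking the winner are assumptions made here for illustration.

```python
def hand_gray_level(roi_depth):
    """Fig. 5 (sketch): every pixel of the ROI-depth image votes for its
    grayscale value in vote boxes Gray-0 ... Gray-255; the most frequent
    value is returned as the depth-grayscale level of the hand region."""
    votes = [0] * 256  # Gray-0 ... Gray-255, initialized to zero
    for row in roi_depth:
        for g in row:
            votes[g] += 1  # a pixel with grayscale value i votes for Gray-i
    # Assumption: ignore black (0, shadow) and white (255, out of range)
    # pixels so that a significant region such as the hand wins the vote.
    return max(range(1, 255), key=lambda g: votes[g])

def phase1_extraction(roi_depth, level):
    """Eq. (2): keep hand-region pixels at their depth-grayscale level and
    set every other pixel to black, discarding the shadow region."""
    return [[level if g == level else 0 for g in row] for row in roi_depth]
```

On a small example the hand level wins the vote and phase-1 blackens everything else, so the shadow pixels become indistinguishable from the background, as described above.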
Equation (3) describes phase-2 binary image extraction. The main operation of phase-2 binary image extraction is to further convert the output binary image derived from phase-1 binary image extraction into different categorizations of binary images:

Binary value′(x, y) = Gray degree, if Binary value(x, y) = Hand gray level;
Binary value′(x, y) = 0, otherwise. (3)

As can be seen in Fig. 5, by setting Gray degree = 255 in Eq. (3), all pixels in the hand region become white, and the transformed image is a binary-valued black-and-white image. Note that Gray degree in Eq. (3) is variable and adjustable. To achieve the optimal recognition performance of the recognition system with the depth sensor, a grayscale value between Hand gray level and 255 (white) is appropriately chosen by trial and error and assigned to Gray degree. Images with various settings of Gray degree are illustrated in Fig. 7.
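A corresponding Python sketch of the phase-2 step of Eq. (3) is given below; the function name and the list-of-lists image representation are illustrative assumptions. The step only re-maps hand-region pixels to the adjustable Gray degree, with 255 producing a black-and-white image.

```python
def phase2_extraction(phase1_img, level, gray_degree=255):
    """Eq. (3): convert the phase-1 binary image into a second binary image
    whose hand-region pixels take the adjustable value Gray_degree
    (Gray_degree = 255 yields a black-and-white image)."""
    return [[gray_degree if g == level else 0 for g in row]
            for row in phase1_img]
```

Varying gray_degree between Hand gray level and 255 reproduces the different hand-region grayscale settings illustrated in Fig. 7.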

Deep learning and recognition of interactive hand gestures of depth images without shadow effect
We employ the VGG-16 CNN(26) to construct a system for hand gesture communication action recognition using the depth sensor. As mentioned, the improved data of the depth-sensor-derived image without the shadow effect, ROI-depth + binary-values or ROI-depth + binary-values + binary-values, are used to perform training and recognition in the VGG-16 CNN. As shown in Fig. 8, the deep learning model of the VGG-16 CNN is composed of 16 processing layers: 13 layers for the deep learning and extraction of the characteristics of the input image data and three layers for the classification of the extracted image features. Note that these 13 layers for image feature estimation mainly perform a series of convolution and max-pooling computations. The final three layers of the VGG-16 CNN, which act as a general artificial neural network, comprise two fully connected (FC) layers and one classification layer.
For typical VGG-16 CNN deep learning calculations, the size of the input image is fixed at 224 by 224. As can be seen in Fig. 8, in this work, the final layer of the VGG-16 CNN is designed to have 10 calculation nodes, each of which denotes the categorization score of the corresponding type of hand gesture communication action (as mentioned above, 10 different classes of common hand gesture actions for interactions in daily life are considered in this study).
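The layer layout described above can be traced with a short sketch. The configuration list below encodes the standard 13 convolutional layers (3×3, stride 1, padding 1, i.e., spatial-size-preserving) and five 2×2 max-pooling layers of VGG-16; the helper function and its name are illustrative, not part of the paper.

```python
# VGG-16 feature-extractor configuration: integers are the output channels of
# size-preserving 3x3 convolutions; "M" marks a 2x2 max-pooling layer.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

def vgg16_feature_shape(input_size=224):
    """Trace the (channels, height, width) shape of the feature map produced
    by the 13 convolutional and 5 max-pooling layers of VGG-16."""
    channels, size = 3, input_size
    for layer in VGG16_CFG:
        if layer == "M":
            size //= 2        # each max-pooling halves the spatial size
        else:
            channels = layer  # a 3x3 conv changes only the channel count
    return channels, size, size
```

With a 224 × 224 input this yields a 512 × 7 × 7 feature map, i.e., 25088 features entering the two FC layers before the final 10-node classification layer.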

Experiments and Results
Experiments on depth-sensor-based interactive hand gesture recognition are carried out in a laboratory environment. The test user is requested to make 10 specific hand gestures that are widely used for interaction. These 10 different actions are labeled Action-1, Action-2, …, and Action-10. Table 1 shows each of these 10 hand gesture actions. For brevity and readability, only the first frame (image) of each of these 10 continuous-time gesture actions is shown as the representative in Table 1. When performing VGG-16 CNN recognition with improved data of depth-sensor-derived images, all images contained in each of these 10 continuous-time actions are still considered. A hand gesture action database established in the laboratory with the Kinect sensor device contains 3000 images, half of which are used for training the VGG-16 CNN and the other half for testing the constructed VGG-16 CNN model. For the experimental database of 3000 images, the test user is requested to make 50 actions for each of the 10 different categorizations of hand gesture actions. Each action contains 60 frames (2 s), where the frame rate of the Kinect depth sensor is 30 (i.e., 30 depth sensing frames are captured by Kinect per second).
As shown in Table 1, for each of these 10 different types of hand gesture actions, the original depth image obtained from Kinect, the ROI-depth image, the improved data of ROI-depth + binary-values, and the improved data of ROI-depth + binary-values + binary-values are provided. It can be observed that, by using the proposed serial binary image extraction approach, the undesired black shadow region existing in ROI-depth is greatly reduced in both ROI-depth + binary-values and ROI-depth + binary-values + binary-values. In the phase-2 binary image extraction, Gray degree in Eq. (3) is set to 255, which corresponds to all white pixels distributed in the hand region. As mentioned, the depth-grayscale degree of the hand region is adjustable by appropriately setting Gray degree, and the optimal recognition accuracy can be acquired by trial and error.

Conclusions
Hand gesture depth images obtained from the RGB-D depth sensor include an undesired shadow effect that adversely affects hand gesture recognition. In this study, a serial binary image extraction approach comprising two consecutive phases of well-designed binary-value image processing schemes was employed to effectively remove the black shadow regions in hand depth images. Depth-sensor-based hand gesture interaction action recognition was developed by VGG-type CNN deep learning with the improved, shadow-removed depth data. Experimental results clearly show that, in terms of the recognition accuracy of 10 common hand gesture interaction actions by VGG-type CNN deep learning, the improved hand depth data derived from the two phases of binary image extraction are superior to the original depth images acquired from the depth sensor.

(1) Removal of the undesired shadow effect of hand gesture depth images from the RGB-D depth sensor set comprising one IR projector and one IR camera by the proposed serial binary image extraction.
(2) Demonstration of the effectiveness of improved, shadow-free hand gesture depth images on the recognition performance of CNN deep learning.
(3) Development of RGB-D depth-sensor-based hand gesture recognition by a VGG-type CNN incorporated with serial binary image extraction for smart gesture communication.

Fig. 1 .
Fig. 1. (Color online) Popular Kinect RGB-D image sensor device with set of depth sensors comprising IR projector and IR camera.

Fig. 2 .
Fig. 2. Illustrations of the undesired shadow effect of the depth sensor set (greater effect at a shorter distance).

Fig. 3 .
Fig. 3. (Color online) RGB and depth images with the undesired shadow region obtained with the Kinect RGB-D device (interactive hand gesture "what?").

Fig. 4 .
Fig. 4. (Color online) Framework of depth-sensor-based interactive hand gesture recognition with deep learning of depth images and shadow effect removal.

Fig. 5 .
Fig. 5. Algorithm to determine Hand gray level (grayscale value of hand regions) in shadow effect removal of depth images by serial binary image extraction.
Fig. 6. Shadow effect removal of depth images by serial binary image extraction (from left to right: original depth image with shadow regions, and improved depth images after phase-1 and phase-2 binary image extraction).

Fig. 8 .
Fig. 8. Typical VGG-16 CNN used in depth-sensor-based hand gesture recognition for deep learning of depth images with shadow effect removal.

Fig. 7 .
Fig. 7. Depth-grayscale values of 80, 140, 170, and 255 (from left to right) set in the hand region in the phase-2 binary image extraction approach.
Figures 9 and 10 depict the ROI-depth + binary-values and ROI-depth + binary-values + binary-values frame sequences of a specific hand gesture categorization, Action-3 (denoting "Goodbye"), for the training and recognition of the VGG-type CNN, respectively. Table 2 gives the recognition accuracy of VGG-type CNN hand gesture communication action recognition using depth sensing data with and without shadow effect removal. Three recognition performance results are compared, namely, the training, validation, and test accuracies in the phases of model training, model validation, and model testing, respectively, with images of ROI-depth, ROI-depth + binary-values, and ROI-depth + binary-values + binary-values. It can be clearly seen that the proposed serial binary image extraction approach has a positive effect on the recognition performance. Without the removal of black shadow regions, the average accuracy of VGG-type CNN recognition with ROI-depth is only 72.95%. In contrast, the improved data ROI-depth + binary-values derived from phase-1 binary image extraction and the improved data ROI-depth + binary-values + binary-values obtained from phase-2 binary image extraction give better VGG-type CNN recognition performances of 78.54 and 78.01%, respectively.

Table 1 Fig. 9 .
Fig. 9. Continuous-time data streams of interaction hand gestures (Action-3, "Goodbye") with improved data of ROI-depth + binary-values for deep learning and recognition.

Fig. 10 .
Fig. 10. Continuous-time data streams of interaction hand gestures (Action-3, "Goodbye") with improved data of ROI-depth + binary-values + binary-values for deep learning and recognition.
In addition, the recognition accuracy and loss value curves of VGG-16 CNN recognition with ROI-depth, ROI-depth + binary-values, and ROI-depth + binary-values + binary-values in the training phase of the deep learning models are also observed, as shown in Figs. 11-13, respectively. It can be observed by comparing the figures that the improved depth data of ROI-depth + binary-values + binary-values have the most satisfactory performance, i.e., both the highest recognition accuracy and the lowest loss function value after a small number of iterations of model learning, followed by the improved depth data of ROI-depth + binary-values. The ROI-depth images, which still have the undesired black shadow regions, have the lowest performance.

Fig. 13 .
Fig. 13. (Color online) Accuracy and loss value curves of VGG-type CNN hand gesture recognition using ROI-depth + binary-values + binary-values over 60 iterations of model training in deep learning.

Fig. 12 .
Fig. 12. (Color online) Accuracy and loss value curves of VGG-type CNN hand gesture recognition using ROI-depth + binary-values over 60 iterations of model training in deep learning.

Fig. 11 .
Fig. 11. (Color online) Accuracy and loss value curves of VGG-type CNN hand gesture recognition using ROI-depth over 60 iterations of model training in deep learning.

Table 2
Recognition accuracy comparisons of VGG-type CNN hand gesture recognition using depth images with and without shadow effect removal.