Applying Depthwise Separated Neural Network with Color Space Adjustment to Auto-colorization of Thermal Infrared Images



Introduction
The IR camera uses infrared radiation to form images. The infrared spectrum lies between the microwave and visible light ranges and is invisible to the human eye. In general, infrared cameras are sensitive to wavelengths from 0.75 to 15 μm, a range commonly divided into four bands: near-infrared (NIR), short-wave infrared (SWIR), medium-wave infrared (MWIR), and long-wave infrared (LWIR), as shown in Fig. 1. Images taken at different wavelengths in a variety of scenes are shown in Fig. 2. Most night-vision cameras use active NIR illumination to take pictures in poorly lit environments. However, these cameras have difficulty taking photos under foggy or smoky conditions, in heavy rainfall, or under direct exposure to the sun. Thermal infrared (TIR) cameras that use LWIR are ideal for solving imaging problems in all-weather and complex environments. The SWIR camera can capture reflected light under peak sun illumination and can be used for daytime starlight imaging. The MWIR camera can detect gas leaks invisible to the human eye, and its imagery provides precise details. LWIR is widely used in infrared technology. Figure 2 shows that the LWIR camera is helpful for taking images through fog and smoke or under high-humidity conditions. LWIR cameras are very affordable today and also offer better performance for most applications.
TIR images, unfortunately, are mainly presented in grayscale and thus have limited applications. Accordingly, colorized infrared images, in which the original color is restored, can improve their practicality in scientific, industrial, commercial, military, and medical fields. Three methods are used to colorize a grayscale image: interaction-based, exemplar-based, and learning-based. The interaction-based method is well suited for specifying regional colors on a grayscale image, but it requires heavy manual work; Yatziv and Sapiro proposed a chrominance blending method to reduce the required human interaction. (2) Bugeau et al. and Zhang et al. developed different exemplar-based methods to colorize gray images. (3,4) Their proposed methods need a set of reference images to guide the colorization. Finding a reference image highly related to the target image, with matching similar objects, is essential for transferring color information. Although this approach reduces human effort, its performance depends on the complexity of the target image and is often unsatisfactory. Both methods need some manual intervention and are only partially automatic.
The learning-based method uses many training samples to learn color features and applies them to a grayscale image for colorization. Manual scribbles and selected reference images are unnecessary in this method, but many diverse training images are required. Although the training cost is high, this method can automatically colorize grayscale target images after learning. Larsson et al. applied the Visual Geometry Group (VGG) 16 deep convolutional network to predict hue and chroma distributions combined with lightness to generate a color image. (5) Zhang et al. used a deep CNN to train over one million images and increased the diversity of colors by class rebalancing. (6) After the Generative Adversarial Network (GAN) was developed by Goodfellow et al., (7) auto-coloring approaches based on the GAN have demonstrated impressive performance. (8)(9)(10)(11)(12)(13) In recent years, auto-coloring technologies have been applied to infrared image colorization. In some works, neural network models were trained directly using grayscale infrared images and paired visible images as training sets. Berg et al. proposed using an autoencoder structure to generate an RGB image directly from TIR input. (14) Tao et al. proposed using an encoder-decoder architecture with skip connections to convert infrared images to RGB images. (15) Kuang et al. applied the conditional GAN to TIR image colorization and obtained better results. (16) In other studies, grayscale IR images were used as the input domain and applied to a GAN-based training network to generate RGB images. (17)(18)(19) These GAN-based approaches are better options than the traditional deep neural network. However, these methods have a common weakness: the grayscale image is directly applied to generate multichannel output. Abnormal colors and artifact effects are produced in the colorized target images because some detailed information is lost during the training stage. Although Du et al.
proposed the reference component matching module, which chooses proper color components as an auxiliary input reference to improve irregular colorization, their training model was more complicated. (11) To improve TIR image colorization, we propose the Depthwise Separated Colorization Generative Adversarial Network (DSCGAN). The main innovation is the introduction of two separate training stages and the application of the International Commission on Illumination LAB color space (CIELAB) conversion. In the first stage, the preprocessing light channel convolutional autoencoder (PLCAE) is proposed to generate the predicted L channel from the TIR image, restoring and compensating for the loss of some luminance details caused by the LWIR camera. Then, this predicted L channel is used as input to the proposed Colorization Generative Adversarial Network (CGAN) to create the AB channels. Finally, the L, A, and B channels are combined and converted to an RGB color image. The proposed approach simplifies training: each stage predicts only one or two channels, which speeds up training model convergence. The CGAN is based on the cycle GAN, which alleviates the issue of limited paired TIR and visible images; unpaired images can be used as input to train the CGAN and generate more stable and suitable AB channels.
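The two-stage flow described above can be sketched as follows. The function names are illustrative stand-ins for the trained PLCAE and CGAN networks (identity and neutral-color stubs here), not the paper's actual implementation:

```python
# Hypothetical stand-ins for the two trained networks; names are illustrative.
def plcae_predict_l(tir_gray):
    # Stage 1: the PLCAE maps a TIR grayscale image to a predicted L channel.
    return tir_gray  # identity stub in place of the trained autoencoder

def cgan_predict_ab(l_channel):
    # Stage 2: the CGAN generator G maps the L channel to the AB channels.
    return [[(0.0, 0.0) for _ in row] for row in l_channel]  # neutral stub

def colorize(tir_gray):
    l = plcae_predict_l(tir_gray)
    ab = cgan_predict_ab(l)
    # Stack L, A, and B per pixel; a CIELAB-to-RGB conversion would follow.
    return [[(l[i][j],) + ab[i][j] for j in range(len(l[0]))]
            for i in range(len(l))]
```

With real networks in place of the stubs, each stage can be trained and debugged independently, which is the simplification the two-stage split is meant to provide.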

Cameras
The TIR camera used in this study is shown in Fig. 3. It is an LWIR micro-thermal camera module. The specifications of the FLIR Lepton 3.5 camera are shown in Table 1. The resolution of the LWIR images is 160 × 120, and this TIR camera captures eight frames per second (FPS). The visible camera used for the deep learning model is shown in Fig. 4. The resolution of the RGB images is 1280 × 720, and the C270 HD WEBCAM captures 30 FPS.

Datasets and data collection methods
The KAIST-MS dataset (20) was used in this study to compare the infrared image auto-coloring performance with the experimental results of other studies. The KAIST-MS dataset is widely used to evaluate the auto-colorization of IR images. The dataset contains two types of IR images, near- and long-wavelength, with aligned pairs of visible and thermal images captured from day/night traffic scenes. In addition, we also collected LWIR images day and night at the same location to find a way to reconstruct daytime visible images from a dark environment. The graphical user interface (GUI) used to collect the visible and LWIR images in this study is shown in Fig. 5. The upper left and bottom left show the real-time visible and LWIR images, respectively, with the refresh rate set to 8 FPS, and the upper right and bottom right show the capture results after the user pushes the capture button. Since the TIR and visible cameras have different resolutions, the visible images are resized, cropped, and shifted using image processing to align with the TIR images in the GUI calculation. In this study, the luminance channel of the CIELAB color space was extracted from the LWIR and visible images to provide the PLCAE training samples and answers.

PLCAE methods
In deep learning, the neural network always extracts features to conduct the auto-colorization of TIR images. To obtain more features in the designed neural network, we propose an image preprocessing approach that converts RGB into the CIELAB color space, which covers the widest range of colors perceivable by humans. The RGB color space can be transformed to the XYZ color space using Eq. (1), and the XYZ color space can be converted to the LAB color space using Eqs. (2) and (3). (21) In Eq. (2), the range of the L* channel is between 0 and 100, and those of the a* and b* channels are between −128 and 127. In Eq. (3), the function domain is divided into two ranges to avoid a gradient of infinity at t = 0. Here, L* is luminance, a* is the chrominance that represents the red axis (positive) to the green axis (negative), b* is the chrominance that represents the yellow axis (positive) to the blue axis (negative), and Xn, Yn, and Zn represent the tristimulus values of the reference white point.
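As a concrete illustration of this conversion, the following sketch implements the standard RGB → XYZ → LAB chain for a single pixel. It assumes linear sRGB primaries and the D65 reference white; the paper does not state which transformation matrix or white point it uses, so these constants are assumptions:

```python
import math

# Assumed D65 reference white tristimulus values (Xn, Yn, Zn).
XN, YN, ZN = 0.95047, 1.0, 1.08883
DELTA = 6.0 / 29.0

def f(t):
    # Eq. (3) style piecewise function: avoids an infinite gradient at t = 0.
    if t > DELTA ** 3:
        return t ** (1.0 / 3.0)
    return t / (3.0 * DELTA ** 2) + 4.0 / 29.0

def rgb_to_lab(r, g, b):
    # Linear RGB -> XYZ (sRGB matrix assumed; no gamma handling here).
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    # XYZ -> L*a*b*: L* in [0, 100], a*/b* roughly in [-128, 127].
    L = 116.0 * f(y / YN) - 16.0
    a = 500.0 * (f(x / XN) - f(y / YN))
    b_ = 200.0 * (f(y / YN) - f(z / ZN))
    return L, a, b_
```

For a white pixel (r = g = b = 1), the result is L* ≈ 100 with a* and b* near zero, matching the expected behavior of the luminance/chrominance split.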
The PLCAE structure and neural network model used in this study are shown in Figs. 6 and 7, respectively. The convolutional autoencoder (CAE) reconstructs a target image from the input image. (22) The CAE extracts the light features of the infrared image through the encoder and reassembles this information in the decoder. To transform dark-environment images into visible daytime colorized images, the L channel of the nighttime IR image needs to be reconstructed into the L channel of the daytime visible image of the same scene. Likewise, for other IR image conversions, the PLCAE converts the raw IR L channel into the L channel of the color visible image at the same position. The encoder and decoder mappings are defined in Eqs. (4) and (5), and the encoder and decoder processes are calculated using Eq. (6). The PLCAE uses the mean squared error (MSE) as the loss function for backpropagation; the MSE loss function is calculated using Eq. (7). In the PLCAE, the grayscale TIR image is copied three times to convert RGB into the L channel of CIELAB. The PLCAE input and output images are resized to 256 × 256 × 1. Moreover, the predicted L channel is used as input for the CGAN to generate the AB channels of CIELAB to achieve the auto-colorization of TIR images. Here, X is the input vector, F is the feature vector, X' is the target vector, ψ represents the conversion of the input vector to feature vectors by the encoder of the CAE, and φ represents the calculation of the feature vectors to the target images by the decoder of the CAE.
Here, σ and σ' represent the encoder and decoder activation functions, W and W' represent the encoder and decoder weights, and b and b' represent the encoder and decoder weighting biases, respectively.
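A minimal numeric sketch of these mappings, using a single fully connected layer with sigmoid activations in place of the paper's convolutional layers (the weights and dimensions below are illustrative, not the trained PLCAE):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(x, W, b):
    # F = sigma(W x + b): the encoder mapping psi applied to input vector X.
    return [sigmoid(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def decode(f, Wp, bp):
    # X' = sigma'(W' F + b'): the decoder mapping phi applied to feature F.
    return [sigmoid(sum(w * fj for w, fj in zip(row, f)) + bi)
            for row, bi in zip(Wp, bp)]

def mse(x, x_hat):
    # Mean squared reconstruction error used as the backpropagation loss.
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
```

For a 4-dimensional input compressed to a 2-dimensional feature vector, W is 2 × 4 and W' is 4 × 2; training adjusts W, W', b, and b' to minimize the MSE between X and X'.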

CGAN methods
Auto-colorization is a well-known deep learning application, and an effective method is to use a GAN with the L channel to predict the AB channels. (23) The CGAN is a type of cycle GAN. In the CGAN, the L channel represents the X domain and the AB channels represent the Y domain. (23) There are two generator mappings, G: X → Y and F: Y → X, and two discriminators, D_X and D_Y, where D_X distinguishes real X domain images from fake ones predicted by the F generator from Y domain data, and D_Y distinguishes real Y domain images from fake ones predicted by the G generator from X domain data. The G and D_Y parts of the CGAN structure are shown in Fig. 8. The CGAN generator and discriminator model structures are shown in Fig. 9.
In this study, we denote the X domain samples as {x_i} and the Y domain samples as {y_i}. The goal is for generator G to minimize the adversarial objective against discriminator D_Y and for generator F to minimize the adversarial objective against discriminator D_X.
The full objective combines the two adversarial losses with the cycle-consistency loss, L(G, F, D_X, D_Y) = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ L_cyc(G, F). Here, λ controls the weight of the two objectives.
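The cycle-consistency term can be illustrated numerically. The toy maps G and F below stand in for the trained generators; they are exact inverses of each other here, so the loss is zero by construction:

```python
def l1(a, b):
    # Mean absolute difference between two equally sized vectors.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy invertible maps standing in for the generators G: X -> Y and F: Y -> X.
G = lambda xs: [2.0 * v for v in xs]
F = lambda ys: [0.5 * v for v in ys]

def cycle_consistency_loss(xs, ys):
    # L_cyc(G, F) = E||F(G(x)) - x||_1 + E||G(F(y)) - y||_1
    return l1(F(G(xs)), xs) + l1(G(F(ys)), ys)
```

In training, this term is scaled by λ and added to the two adversarial losses, which encourages G and F to act as approximate inverses on real data even without paired samples.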
In the CGAN, the 256 × 256 × 1 L channel visible images are used with the G generator to predict the 256 × 256 × 2 AB visible images. Finally, we combine the predicted L channel of the PLCAE and the AB channels of the CGAN generator output to produce the LAB color space of the targeted image. The LAB color space can be transformed to the XYZ color space using Eqs. (12) and (13), and the XYZ color space can be transformed to the RGB color space using Eq. (14).
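A sketch of this inverse chain for a single pixel, again assuming the D65 white point and the sRGB matrix (the paper does not specify these constants):

```python
# Assumed D65 reference white tristimulus values.
XN, YN, ZN = 0.95047, 1.0, 1.08883
DELTA = 6.0 / 29.0

def f_inv(t):
    # Inverse of the piecewise function used in the forward LAB conversion.
    if t > DELTA:
        return t ** 3
    return 3.0 * DELTA ** 2 * (t - 4.0 / 29.0)

def lab_to_rgb(L, a, b):
    # LAB -> XYZ: recover the tristimulus values from L*, a*, b*.
    fy = (L + 16.0) / 116.0
    x = XN * f_inv(fy + a / 500.0)
    y = YN * f_inv(fy)
    z = ZN * f_inv(fy - b / 200.0)
    # XYZ -> linear RGB (sRGB inverse matrix assumed here).
    r = 3.2406 * x - 1.5372 * y - 0.4986 * z
    g = -0.9689 * x + 1.8758 * y + 0.0415 * z
    b_ = 0.0557 * x - 0.2040 * y + 1.0570 * z
    return r, g, b_
```

As a sanity check, pure white in LAB, (L*, a*, b*) = (100, 0, 0), maps back to r = g = b ≈ 1.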

Experimental Results and Discussion
In this study, we propose deep learning neural networks in two stages to convert TIR images into RGB visible photos. This section shows the results regarding the loss, accuracy, and peak signal-to-noise ratio (PSNR), which are used to evaluate the training performance.

PLCAE loss and accuracy graph
The PLCAE loss graph in Fig. 10 shows that the PLCAE model quickly converges from a high point to a small value, and the training process is very smooth over 2000 epochs. Our training samples were collected from random frames of video streaming, which gives the datasets diverse scenes in each sampling. Therefore, the MSE value of the training loss function sometimes increases at the beginning of the training process. However, the varied datasets improve the generalization performance of the trained network. The loss value converges to 0.9206 at the end. The PLCAE accuracy graph in Fig. 11 shows that the accuracy converges to 0.9773. In other words, the PLCAE is a very appropriate model for converting TIR luminance images into RGB luminance images before auto-coloring.

CGAN loss and accuracy graph
The CGAN generator G loss graph in Fig. 12 shows an unstable loss curve in the training process of generator G. This often happens in GANs because the discriminator and generator compete with each other. The generator loss curve becomes more stable after about 10000 epochs, and the loss value finally converges to 0.7900.
In this study, we used the PSNR to evaluate the auto-colorization result objectively. We denote the pixel value of the ground truth image as Y_i and the pixel value of the predicted image as Ŷ_i. The PSNR calculation formula is shown in Eqs. (15) and (16). The numerator in the PSNR formula simplifies to one because the training images have been normalized. The CGAN generator G PSNR graph is shown in Fig. 13. The G PSNR value reaches 26.03 dB.
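Under this normalization, the PSNR computation reduces to the following sketch, with a peak value of 1 because pixel values lie in [0, 1]:

```python
import math

def psnr(y_true, y_pred, peak=1.0):
    # PSNR = 10 log10(peak^2 / MSE); peak = 1 for normalized images.
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

For example, a uniform pixel error of 0.1 gives an MSE of 0.01 and thus a PSNR of 20 dB, so the reported 26.03 dB corresponds to a noticeably smaller average reconstruction error.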

Comparison of auto-colorization results of proposed method and other methods
The experimental results obtained using the FLIR dataset (24) are shown in Figs. 14 and 15, including the PLCAE model, CGAN, and LAB-to-RGB transform. In this study, the training sample contents included most of the scenery on the road. Figure 16 shows that the deep learning model cannot always predict the color correctly; for example, red is sometimes predicted as gray. Nevertheless, it should be noted that using the PLCAE model for the prediction yields superior results in reconstructing visible image details and textures, thereby achieving a higher PSNR than other methods, as shown in Table 2. The reconstructed visible L channel images predicted using the PLCAE model are shown in Fig. 17. We compared the auto-coloring performance of infrared images with those of other methods using the KAIST-MS dataset. The PSNR results of the different methods are shown in Table 2, which shows that the best PSNR is obtained by the proposed method. The prediction results obtained with the different auto-coloring approaches are shown in Fig. 18, which indicates that, compared with the other approaches, the image obtained by the proposed method is closest to the target visible images. The self-collected images for testing the performance of the proposed method are shown in Fig. 19. In the daytime images, the infrared auto-coloring result achieves colorization similar to that observed on the KAIST-MS dataset. In contrast, the nighttime colorized result performs worse because the PLCAE module cannot fully reconstruct the L channel details of the related visible image. Figure 20 shows the performance of the proposed method for pictures taken under different bad weather conditions in the AAU RainSnow Traffic Surveillance Dataset. (25) While the proposed method performs well under rainy and foggy conditions, its effectiveness is limited under snowy conditions.

Discussion
In this study, nighttime auto-colorization did not perform as well as daytime auto-colorization. The reason can be seen in Figs. 19 and 20. Figure 19 shows that scene structures differ considerably between nighttime and daytime. Figure 20 shows that low-temperature environments during snowfall can also affect the ability to reconstruct TIR details accurately. There are two solutions to this problem. The first is to collect diverse TIR images of daytime scenes and other pictures under various weather conditions at the same location as training sets. The second is to train separate PLCAE modules for different weather conditions or times of day. However, this would increase the complexity of network training.

Conclusions
In this work, TIR images were auto-colorized by two-stage deep learning. In the first stage, the proposed PLCAE converted TIR luminance images into RGB luminance images to enhance luminance details before auto-coloring. The PLCAE accuracy reached 0.9773. Then, in the second stage, the proposed CGAN used the predicted L channel to produce the AB channels. The TIR colorization result achieved a PSNR exceeding 26 dB. The experimental results showed that our proposed approach outperforms the other methods. Moreover, TIR auto-coloring technology can be applied to many different fields. TIR image colorization can also help increase object detection accuracy for safe driving assistance in poorly lit environments. Overall, the method of colorizing TIR images proposed in this paper is useful and has many potential applications.
Pei-Syuan Lu received her M.S. degree in electrical engineering from National Changhua University of Education, Taiwan, in 2022. She is currently working at Delta Electronics Inc. Taiwan. Her research interests are in machine learning, image processing, and power electronics. (angellups1999@gmail.com)