Image-to-image Translation via Contour-consistency Networks

In this paper, we propose a novel framework for image-to-image translation in which contour-consistency networks are used to solve the problem of inconsistency between the contours of generated and original images. The objective of this study is to address the lack of an adequate training set. At the generator end, the input image is sampled by an encoder to obtain the encoder feature map; the attention feature map is then obtained using the attention module. Using the attention feature map, the decoder can ascertain where more conversions are required. The mechanism at the discriminator end is similar to that at the generator end: the input image is sampled by an encoder to obtain the encoder feature map, which is then converted into the attention feature map. Finally, the image is classified by the classifier as real or fake. Experimental results demonstrate the effectiveness of the proposed method.


Introduction
In unmanned robot and self-driving car applications, (1)(2)(3)(4) labeling data is one of the most time-consuming tasks. For instance, when daytime and nighttime data are required, data must be collected during different periods and then staff must be assigned to label the data. Marking data is labor-intensive. Usually, the data available for training a deep learning model are inadequate; as such, convergence of the training process is poor. To address this problem, we propose a method for improving the similarity of transformed images; an already labeled database is converted into different domain textures to obtain differing domain style information for data augmentation.
With the progress of deep learning, several related areas of study, such as computer vision and speech recognition, have advanced rapidly. In computer vision, image-to-image translation is vital. Goodfellow et al. (5) proposed a method called the generative adversarial network (GAN) for generating real-world images. Since then, derived tasks such as image-to-image translation, image restoration, image processing, and image coloring have been popular research topics.
Image-to-image translation can be performed by two methods: paired and unpaired training. The paired training method is a supervised learning method and requires paired images. For example, the method proposed by Isola et al. (6) requires the preparation of a database of paired images. However, the preparation of paired image data is difficult for some tasks, such as converting the season of landscape photographs.
The unpaired training method is an unsupervised learning method because it does not require the preparation of paired images. GAN loss alone is known to be insufficient and often leads to inadequate training and generates inferior images. Therefore, in 2017, Zhu et al. (7) proposed CycleGAN, which is a cycle consistency approach, to determine whether the images reconstructed from the target domain back to the source domain are the same. The cycle consistency approach helps ensure that specific features are unchanged after the transformation.
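The cycle consistency idea can be illustrated with a minimal sketch. The code below is not the CycleGAN implementation: the two "generators" are hypothetical toy functions on a flat list of pixel intensities, and the mean absolute (L1) difference is assumed as the reconstruction distance. It only shows that the loss is small when the backward mapping undoes the forward mapping and large otherwise.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat images."""
    assert len(a) == len(b)
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_loss(x, forward, backward):
    """L1 distance between x and its reconstruction backward(forward(x))."""
    return l1_distance(x, backward(forward(x)))


# Hypothetical toy mappings (assumptions for illustration only).
def G(img):
    """Source -> target: brighten every pixel."""
    return [p + 0.2 for p in img]


def F_good(img):
    """Target -> source: exact inverse of G."""
    return [p - 0.2 for p in img]


def F_bad(img):
    """Target -> source: ignores its input and returns flat gray."""
    return [0.5 for _ in img]


x = [0.1, 0.4, 0.7]
loss_good = cycle_loss(x, G, F_good)  # close to zero: the cycle is consistent
loss_bad = cycle_loss(x, G, F_bad)    # clearly positive: content was lost
```

In an actual system, `G` and `F_good` would be neural networks trained jointly, and this term would be minimized alongside the adversarial loss.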
Using the cycle consistency approach, we analyzed the dataset of the video game Grand Theft Auto (GTA) version 5 and the Cityscapes Dataset and converted the computer-generated images into real-world images. Our objective was to preserve the original road structure and convert it into real-world texture information. As shown in Fig. 1, we discovered two crucial features that are related to street structure and can improve the stability of the generated images.
(1) Edge information: the similarity of the edges can be calculated to determine whether the edges of the images are the same before and after the conversion. (2) Depth information: the size and proximity of the objects are often neglected when performing object recognition for self-driving vehicles.

Proposed Contour Consistency Network
For image-to-image translation, we propose a contour consistency network as a generic network that ensures the consistency of the image profile after conversion (Fig. 2). First, the original domain image is passed through the generator to obtain a photograph of the target domain. Then, the original and generated images are passed through the edge detection and depth estimation networks to obtain an edge map and a depth map, respectively. The loss is then calculated for the target domain photograph; the contour loss calculated using the contour consistency network is added to the original loss function to obtain superior results, with the function proposed by Chen et al. (8) being used for training.

Generator
The generator and discriminator used in this study are based on the unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation (U-GAT-IT) (9) approach. Specifically, we adopt the attention module (which helps the model ascertain where more conversions are required) and the adaptive layer-instance normalization (Ada-LIN) method (which enables the model to maintain a favorable transformation style). In the Ada-LIN method, layer normalization helps retain more style features whereas instance normalization helps retain more content features, thus enabling the model to learn the balance between the two adaptively. The generation quality of this model can be further improved by combining it with the contour consistency network proposed in this paper.
Generator: First, the input image is sampled by the encoder to obtain the encoder feature map; then, the attention feature map is obtained using the attention module. Using the attention feature map, the decoder can ascertain where more conversions are required.
Discriminator: The structure of the discriminator is similar to that of the generator. The input image is sampled by the encoder to produce the encoder feature map, which is then converted into the attention feature map. Finally, it is classified as real or fake by the classifier.
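The blending performed by Ada-LIN can be sketched as follows. This is a simplified illustration rather than the U-GAT-IT code: it assumes a feature map stored as a list of channels, each a flat list of activations, and a per-channel mixing weight `rho` clipped to [0, 1]. The names `ada_lin`, `gamma`, `beta`, and `rho` are chosen here for illustration. With `rho = 1` the output is pure instance normalization; with `rho = 0` it is pure layer normalization.

```python
import math


def _mean_var(values):
    """Population mean and variance of a flat list."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)


def _normalize(values, mean, var, eps):
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def ada_lin(feature, gamma, beta, rho, eps=1e-5):
    """Blend instance- and layer-normalized activations per channel.

    feature: list of C channels, each a flat list of H*W activations.
    gamma, beta: per-channel scale and shift.
    rho: per-channel mixing weight, clipped to [0, 1].
    """
    # Layer statistics are shared by all channels and positions.
    layer_mean, layer_var = _mean_var([v for ch in feature for v in ch])
    out = []
    for c, channel in enumerate(feature):
        # Instance statistics are computed separately for each channel.
        inst_mean, inst_var = _mean_var(channel)
        a_in = _normalize(channel, inst_mean, inst_var, eps)
        a_ln = _normalize(channel, layer_mean, layer_var, eps)
        r = min(max(rho[c], 0.0), 1.0)
        out.append([gamma[c] * (r * a + (1.0 - r) * b) + beta[c]
                    for a, b in zip(a_in, a_ln)])
    return out


feature = [[1.0, 3.0], [10.0, 30.0]]
pure_in = ada_lin(feature, gamma=[1.0, 1.0], beta=[0.0, 0.0], rho=[1.0, 1.0])
pure_ln = ada_lin(feature, gamma=[1.0, 1.0], beta=[0.0, 0.0], rho=[0.0, 0.0])
```

Under instance normalization every channel is centered on its own mean, so differences in scale between channels disappear; under layer normalization the relative scale of the channels is kept. A learnable `rho` lets the network choose a mixture of the two per channel.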

Contour consistency network
To solve the problem of the contour of an image being dramatically altered after the conversion, a contour consistency network architecture is proposed for improving the conversion of computer-generated street views into real street views. As illustrated in Fig. 3, the original image and the transformed image are each passed through the edge detection and depth estimation branches, and the similarity of the resulting maps is calculated.
Edge detection branch: This branch uses edge detection to solve the image contour consistency problem. Edge detection can be used to calculate the edges of objects and backgrounds in the image, and by analyzing the edges between the foreground and background, we can ascertain whether the texture after conversion is acceptable. We used the holistically nested edge detection (HED) (10) method, which is a deep-learning-based edge detection algorithm built on the VGG16 architecture. This algorithm yields better results than traditional edge detection algorithms for generator-generated images because such images are usually blurry and traditional algorithms are less effective in determining edges in them. Therefore, in this branch, HED is applied to the original image and the transformed image to obtain two edge maps, and the L1 loss is employed to calculate their similarity.
Depth estimation branch: This branch uses a depth map, generated through depth estimation, to improve the consistency of image contours. The depth map, representing the depth information, is applied to our contour consistency network. Even if objects overlap in the depth map, the depths of the objects can be calculated to determine the contours. The Monodepth2 (11) algorithm is used to compute the depth maps of the original image and the transformed image, and the L1 loss is used to compute their similarity.
The losses from the two branches are added to obtain the contour loss, which can be expressed as follows:

$$L_{contour} = \lVert \mathrm{Edge}(x) - \mathrm{Edge}(G(x)) \rVert_1 + \lVert \mathrm{Depth}(x) - \mathrm{Depth}(G(x)) \rVert_1$$

where x is the original image, G(x) is the image converted by the generator, Edge() is the edge map generated through edge detection, and Depth() is the depth map generated through depth estimation; the two L1 terms are added to obtain $L_{contour}$.
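The contour loss described above, an L1 term on edge maps plus an L1 term on depth maps, can be sketched in a few lines. The two map functions below are deliberately crude stand-ins and are assumptions of this sketch: `toy_edge` uses forward differences instead of HED, and `toy_depth` simply treats brightness as inverse depth instead of running Monodepth2. Only the structure of the loss is meant to match the text.

```python
def l1(a, b):
    """Mean absolute difference between two 2-D maps of equal size."""
    diffs = [abs(p - q) for row_a, row_b in zip(a, b)
             for p, q in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def toy_edge(img):
    """Stand-in for HED: sum of horizontal and vertical forward differences."""
    h, w = len(img), len(img[0])
    return [[abs(img[i][min(j + 1, w - 1)] - img[i][j]) +
             abs(img[min(i + 1, h - 1)][j] - img[i][j])
             for j in range(w)] for i in range(h)]


def toy_depth(img):
    """Stand-in for Monodepth2: pretend brighter pixels are closer."""
    return [[1.0 - p for p in row] for row in img]


def contour_loss(x, gx, edge_fn=toy_edge, depth_fn=toy_depth):
    """Edge L1 term plus depth L1 term between x and its translation gx."""
    edge_term = l1(edge_fn(x), edge_fn(gx))
    depth_term = l1(depth_fn(x), depth_fn(gx))
    return edge_term + depth_term


# A 2x4 grayscale image with one vertical boundary.
x = [[0.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]
# A "translation" that moved the boundary one pixel to the left.
shifted = [[0.0, 1.0, 1.0, 1.0],
           [0.0, 1.0, 1.0, 1.0]]

loss_same = contour_loss(x, x)
loss_shifted = contour_loss(x, shifted)
```

Because `edge_fn` and `depth_fn` are passed as arguments, pretrained networks could be substituted for the stand-ins without changing the loss itself, which mirrors the modular design of the two branches.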

Loss function
The original U-GAT-IT has four loss terms: adversarial loss, cycle loss, identity loss, and CAM loss. To these we add the contour loss. Their definitions are given briefly as follows.
Adversarial loss: $L_{lsgan}$, the least-squares GAN (LSGAN) objective, is used to match the distribution of the converted images to that of the target domain.
Cycle loss: $L_{cycle}$ is used to ensure that the images converted into the target domain can be converted back into the source domain.
Identity loss: $L_{identity}$ is used to ensure that the color distribution of the image is similar before and after the conversion.
CAM loss: $L_{cam}$ is used to ascertain which areas require improvement before and after the conversion or where the biggest difference between the two domains is.
Contour loss: $L_{contour}$ is used to calculate the contour difference between the images before and after conversion.
These five terms are weighted and summed to determine the final loss:

$$L = \lambda_1 L_{lsgan} + \lambda_2 L_{cycle} + \lambda_3 L_{identity} + \lambda_4 L_{cam} + \lambda_5 L_{contour}$$

For the first four weights, we employed the values presented in the original U-GAT-IT paper, (9) $\lambda_1 = 1$, $\lambda_2 = 10$, $\lambda_3 = 10$, and $\lambda_4 = 1000$, and we set the contour loss weight to $\lambda_5 = 1$ in our experiments.
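As a small worked example of how the weights combine the individual terms, the sketch below computes the weighted sum with the weights listed above. The dictionary keys and the individual loss values are hypothetical placeholders chosen for illustration; they are not measurements from the experiments.

```python
# Loss weights as listed in the text (lambda_1 ... lambda_5).
LAMBDAS = {
    "lsgan": 1.0,
    "cycle": 10.0,
    "identity": 10.0,
    "cam": 1000.0,
    "contour": 1.0,
}


def total_loss(terms, weights=LAMBDAS):
    """Weighted sum of the individual loss terms."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Placeholder values for one training step (illustrative only).
example_terms = {
    "lsgan": 0.5,
    "cycle": 0.1,
    "identity": 0.05,
    "cam": 0.001,
    "contour": 0.2,
}
total = total_loss(example_terms)  # 0.5 + 1.0 + 0.5 + 1.0 + 0.2
```

The example makes the role of the large CAM weight visible: a CAM term of 0.001 contributes as much to the total as a cycle term of 0.1.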

Dataset
In this section, we introduce the two datasets used for experiments and demonstrations. We performed style conversions between the GTA and Cityscapes datasets; these two datasets are often used in studies on image-to-image translation.
GTA dataset: Richter et al. (12) extracted images from the open-world game Grand Theft Auto to create a dataset that comprises virtual city street scenes. This dataset contains 24966 images, each of which has a pixel-level semantic annotation; the dataset is therefore popular for semantic segmentation. In the present study, 2500 images were used for training and 100 for testing.
Cityscapes dataset: Cordts et al. (13) collected street view data from various cities and created a dataset comprising 25000 images, including 5000 finely labeled images and 20000 roughly labeled images. Similar to the GTA dataset, this dataset is often used in semantic segmentation tasks. In the present study, 2993 images were used as training data and 100 images were used as test data.
These two datasets were used because both show street scenes and include complexities found in real scenes; thus, these datasets can be used to evaluate the effectiveness of the proposed model (Fig. 4).

System performance evaluation
The model used in this experiment is based on the extension of the U-GAT-IT and was implemented using PyTorch with the standard initial values. The contour consistency network proposed herein was used for the training. VGG16 was the base network of the HED detection model and was pretrained using the Berkeley Segmentation Dataset 500. (14) For depth estimation, the Monodepth2 model was pretrained using the KITTI dataset, and the image resolution was set to 300000 iterations for each experiment.
To demonstrate the feasibility of our method, we employed the inception score (IS), (15) Fréchet inception distance (FID), (16) and learned perceptual image patch similarity (LPIPS) (17) as evaluation indexes because these can be used to evaluate image diversity and quality. The IS primarily assesses two indicators, image quality and image diversity; the higher the IS, the better. The FID is also used to assess image quality and diversity; the lower the FID, the better. The LPIPS is the difference between the deep features that a pretrained network extracts from two images; the smaller the value, the greater the similarity.
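To make the "higher is better" reading of the IS concrete, the sketch below computes the score from class-probability vectors using the standard definition, the exponential of the mean KL divergence between each per-image distribution p(y|x) and the marginal p(y). In practice these probabilities come from a pretrained Inception classifier; here they are hand-written toy inputs, which is an assumption of this example.

```python
import math


def inception_score(probs):
    """Inception score from per-image class-probability vectors p(y|x)."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_sum = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)); zero-probability classes contribute nothing.
        kl_sum += sum(pj * math.log(pj / marginal[j])
                      for j, pj in enumerate(p) if pj > 0)
    return math.exp(kl_sum / n)


# Confident predictions spread over both classes: sharp and diverse.
is_diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
# Uninformative predictions: the classifier cannot tell what it sees.
is_uniform = inception_score([[0.5, 0.5], [0.5, 0.5]])
```

With two classes the score ranges from 1 (uninformative or collapsed outputs) to 2 (confident predictions covering both classes evenly), so a higher value reflects both quality and diversity.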
The results presented in Tables 1 and 2 demonstrate the effectiveness of the proposed method. We analyzed different network architectures and performed ablation experiments with the original U-GAT-IT, U-GAT-IT with edge loss, U-GAT-IT with depth loss, and our proposed approach with contour loss. The GTA to Cityscapes and Cityscapes to GTA conversions were analyzed. Table 3 shows the results of the conversion from GTA to Cityscapes. The conversion yielded different results in two directions. However, the addition of depth loss caused a considerable increase in the FID and a notable decrease in quality. For the Cityscapes to GTA conversion (Table 4), considerable improvement was observed when using only a single

Conclusion
In this study, we proposed a network for solving the problem of inconsistency between the contours of a generated image and the original image, resulting in generated images that are more similar to the original images. Edge detection and depth estimation are performed in a modular manner; thus, applying the proposed approach to other image-to-image translation tasks is straightforward. The contour consistency network is used only in the training phase and hence does not increase the computational burden during actual testing. The contributions of this study can be summarized as follows: (1) we proposed a contour consistency network that minimizes the alteration of the transformed image profile; (2) we proposed two feature types that can be used to calculate image contours: edge features and depth features; and (3) the proposed method can generate images of different domains while retaining the original structure, thus solving the problem of time-consuming data collection and annotation. In future studies, we will determine suitable network models for different domain conversions. In addition, we will integrate different consistency networks into our current system.