Food Calorie Estimation System Based on Semantic Segmentation Network



Introduction
In recent years, with the aging of the population and changes in lifestyles, diabetes has gradually become a common and frequently occurring disease in many modern societies. In particular, type 2 diabetes is the most common type of diabetes in the 21st century. (1) Diabetes is a chronic metabolic disease characterized by elevated blood sugar levels. Typical symptoms include polyuria, polydipsia, polyphagia, and weight loss. (2) After elevated blood sugar is discovered and diabetes is diagnosed, most patients tend to underestimate its dangers because the symptoms are not obvious, and they do not pay sufficient attention to treatment. With poor health management in the early stage, diabetes can cause various serious complications later.
The application of artificial intelligence (AI) in the medical field has changed the mode of medical service and the concept of health management to a certain extent. This development helps strengthen disease prevention, enhance patient compliance, and make the management system more intelligent, thereby improving the management efficiency of chronic diabetes. Moreover, it also promotes the concept of a healthy life and fundamentally reduces the medical cost of the whole society.
In this paper, we propose an effective food calorie estimation approach based on deep learning image semantic segmentation. Three main tasks are performed in this approach: food recognition, volume calculation, and calorie conversion. (3) The deep learning neural network performs the food recognition and volume calculation well. According to a standard food nutrition table, we can quickly establish a conversion formula from the food category and volume to calories. In addition, an easy-to-use smartphone application (APP) prototype is designed and implemented for diabetic patients. The trained deep learning model has been deployed in the APP to perform the calorie estimation. Moreover, the APP also provides some useful functions to diabetic patients, such as sports management and blood sugar monitoring.
The rest of this paper is organized as follows. Related works are briefly described in Sect. 2. In Sect. 3, we describe the proposed approaches in detail, including the backbone network and three different network models. In Sect. 4, we show the experimental processes and results, including the image acquisition, preprocessing, settings, and performance. The APP prototype is demonstrated in Sect. 5. Finally, the conclusions are given in Sect. 6.

Related Works
In recent years, AI has also penetrated diabetes-related fields, and many advances have been made in disease prediction, diagnosis, blood glucose monitoring, and complication screening. AI technology plays an important role in diabetes management. (4,5) Studies have used AI to mine data for sequential patterns, determine the order of use between drugs, and accurately predict the next drug that a doctor may prescribe for a patient. In addition, an artificial pancreas with reinforcement learning can better control the blood sugar of diabetic patients and reduce the risk of hypoglycemia. (6) Norouzi et al. (7) proposed a mobile APP for managing food nutrition for diabetic patients. It can provide a food plan according to the health status of the user. However, the user must still input all health data by hand. An image-based automatic food energy estimation technique using a generative adversarial network (GAN) has been proposed by Fang et al. (8) However, this approach has a high response delay in practical applications.
In the current field of AI in the management of diabetic patients, the diet monitoring of diabetic patients is important and necessary. Studies have shown that the manual reporting of food intake is inaccurate and usually impractical, (9) so an automated solution for diet monitoring needs to be sought. A diet monitoring system can be designed and implemented by image analysis. It requires users to use smartphones to take pictures of food and send the pictures to the server. The server analyzes the images, estimates the nutritional characteristics of the food, reports them to the user, and sends them to a health professional. (10) It can be seen that a food image analysis system needs to solve image segmentation, food recognition and classification, food volume estimation, and calorie conversion. These tasks must be studied together to build a complete and accurate food image analysis system. In summary, the application of AI technology in the field of diabetes is not yet mature, and there is still much room for performance improvement. On the basis of these previous studies, we propose and implement a novel food recognition and calorie conversion method in this work.

Backbone network
Considering that the client system to be constructed in the next step of this work is based on the WeChat applet, we use the MobileNet network structure as the backbone network for image feature extraction. (11) MobileNet is a lightweight convolutional neural network model that can be deployed on mobile terminals. It is a compromise between accuracy and response time.
The core technology of MobileNet is the depthwise separable convolution. Its overall effect is similar to that of a standard convolution; however, in terms of computation, the depthwise separable convolution considerably reduces the model's parameters, lowers memory consumption, and speeds up training. (11) Unlike an ordinary convolution, which considers channels and regions simultaneously, a depthwise separable convolution first examines each region per channel and then merges the channels with a pointwise convolution, thus separating regions and channels. MobileNet was the first network to use depthwise separable convolutions to significantly reduce computation, making it suitable for building lightweight networks for mobile deployment. When building the model, we use the DepthwiseConv2D layer in Keras to implement the depthwise separable convolution.
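The parameter saving described above can be illustrated with a short sketch (the function names are for illustration only, not taken from the paper's code): a k × k standard convolution needs k·k·C_in·C_out weights, whereas the depthwise + pointwise pair needs only k·k·C_in + C_in·C_out.

```python
# Parameter counts for a standard vs. a depthwise separable convolution
# (illustrative sketch; bias terms are omitted).

def standard_conv_params(k, c_in, c_out):
    """k x k standard convolution: every filter spans all input channels."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise (k x k per channel) followed by pointwise (1 x 1) convolution."""
    return k * k * c_in + c_in * c_out

# Example: a 3 x 3 convolution mapping 128 channels to 256 channels.
std = standard_conv_params(3, 128, 256)        # 294912 parameters
sep = depthwise_separable_params(3, 128, 256)  # 33920 parameters
print(std, sep, round(std / sep, 1))           # roughly an 8.7x reduction
```

For typical layer widths, the separable form uses roughly an order of magnitude fewer parameters, which is what makes MobileNet practical on mobile terminals.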

Semantic segmentation model based on SegNet + MobileNet
SegNet is a classic encoder-decoder structure in which the encoder draws on the convolutional layers of VGG-16 and, like FCN, removes the fully connected layers of the CNN. The decoder uses max-pooling indices to record the positions of the corresponding up-sampling output values, which improves the recognition of boundary features and reduces the amount of calculation. (12) The main structure of the semantic segmentation model based on SegNet + MobileNet constructed in this paper is shown in Fig. 1. It is mainly composed of depthwise separable convolutional blocks. First, in the encoder, the input image undergoes multiple separable convolutions in the backbone model to extract a feature layer; the decoder then uses the UpSampling2D function to up-sample it three times. Finally, a layer whose number of channels equals the number of categories, n_classes, is obtained, which is the result of the semantic segmentation. n_classes is the number of channels because each channel represents the probability that a pixel belongs to one category. Some points to note: the input image is first zero-padded, i.e., the edges of the input matrix are filled with zero values, so that the filter can cover the edges of the input image. The significant advantage of zero padding is that it allows us to control the size of the feature map. BatchNorm keeps the input of each layer of the neural network in the same distribution during training so as to avoid the vanishing-gradient problem. The activation function is ReLU.
The network parameters are shown in Table 1. We resize the input image to 416 × 416 × 3; after a series of depthwise separable convolutions, it becomes a feature map of 26 × 26 × 512. It becomes 28 × 28 × 512 after a zero-padding layer, 208 × 208 × 128 after three up-samplings, 210 × 210 × 128 after another zero-padding layer, and 208 × 208 × 22 after two convolutions and a BatchNorm layer. After the reshape and softmax layers, it becomes 43264 × 22. Twenty-two is the number of predicted object types (20 types of food, the background, and the coin).
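The sizes quoted in Table 1 can be sanity-checked with simple arithmetic. The sketch below assumes (from the quoted shapes, not stated explicitly in the paper) that the encoder reduces the resolution by a factor of 16 and the decoder applies three 2× up-samplings.

```python
# Sanity check of the tensor sizes quoted in Table 1 (a sketch; padding
# layers and channel counts are simplified).

input_hw = 416
downsample_factor = 16                       # implied by 416 -> 26
encoder_hw = input_hw // downsample_factor   # 26
decoder_hw = encoder_hw * 2 ** 3             # three UpSampling2D steps -> 208
n_classes = 22                               # 20 foods + background + coin
pixels = decoder_hw * decoder_hw             # rows of the reshaped output
print(encoder_hw, decoder_hw, pixels)        # 26 208 43264
```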
To define the loss function, we need the predicted and true values. The prediction result comes first: in the conv2d_4 (Conv2D) layer, the output dimensions are height_i, width_i, and n_classes, where height_i and width_i represent the height and width of the input image. The softmax function is then used to calculate the probability of each category as the prediction result. For the ground truth, we resize the label image to an array of the same size as the prediction, assign each pixel to its category in turn, and store it in the array. Finally, the cross entropy between the predicted and true results is calculated as the value of the loss function.
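The per-pixel cross entropy described above can be sketched as follows (an illustrative NumPy version, not the paper's implementation): the prediction is a softmax distribution per pixel, the ground truth a one-hot vector per pixel, and the loss is the mean negative log-probability of the true class.

```python
import numpy as np

# Per-pixel categorical cross-entropy between a softmax prediction and a
# one-hot ground-truth mask (illustrative sketch).

def pixel_cross_entropy(pred, truth, eps=1e-12):
    """pred, truth: arrays of shape (n_pixels, n_classes); pred rows sum to 1."""
    return -np.mean(np.sum(truth * np.log(pred + eps), axis=1))

# Two pixels, three classes: one confident correct, one less certain.
pred = np.array([[0.8, 0.1, 0.1],
                 [0.3, 0.6, 0.1]])
truth = np.array([[1, 0, 0],
                  [0, 1, 0]])
loss = pixel_cross_entropy(pred, truth)
print(round(loss, 4))
```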

Semantic segmentation model based on UNet + MobileNet
UNet was first used in the field of medical image segmentation. It can be applied to data sets with a small amount of data and still achieve good segmentation results, so it is very popular. UNet also provides a set of data augmentation methods to maximize the use of data. The main structure of UNet is similar to the convolutional encoder-decoder structure; the encoder and decoder form a U-shaped network, as shown in Fig. 3. The encoder part of UNet is basically the same as an ordinary CNN structure, using convolution and pooling to extract information between pixels. The second half, the decoder, is basically symmetrical to the first half. Convolution is also used, but an up-sampling operation is introduced to output segmentation images of equal size. Most importantly, the feature map of each encoder layer is copied, appropriately cropped, and transmitted to the corresponding decoder output; the two are concatenated to obtain more accurate context information, and this feature fusion improves the accuracy of segmentation. (13) In the model constructed in this paper, the input is compressed by depthwise separable convolutions, and the layers obtained after each compression are denoted as f1, f2, f3, and f4. f4 is up-sampled once and then concatenated with f3, up-sampled again and concatenated with f2, and up-sampled once more and concatenated with f1. Finally, the number of channels equals the number of categories, and the semantically segmented result of the image is output through a convolution operation.
The network parameters are shown in Table 2. Similarly, the input image is resized to 416 × 416 × 3. After a series of depthwise separable convolutions and one up-sampling, it becomes a 52 × 52 × 512 feature map, which is concatenated with f3 to obtain a 52 × 52 × 768 feature map. After zero-padding, BatchNorm, and up-sampling, a 104 × 104 × 256 feature map is obtained, which is concatenated with f2 to obtain a 104 × 104 × 384 feature map. After up-sampling, a 208 × 208 × 128 feature map is obtained and concatenated with f1 to obtain a 208 × 208 × 192 feature map. After a zero-padding layer, it becomes 210 × 210 × 192. Next, after two convolutions and a BatchNorm layer, it becomes 208 × 208 × 22. Finally, after the reshape and softmax layers, it becomes 43264 × 22. The loss function is the same as that of the model based on SegNet + MobileNet.
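The channel arithmetic of the skip connections in Table 2 can be checked with a one-step NumPy sketch: concatenating a 512-channel decoder map with the 256-channel encoder map f3 along the channel axis yields the quoted 768 channels.

```python
import numpy as np

# One skip-connection step of the UNet + MobileNet decoder (shape sketch;
# the actual layers carry learned features, zeros are used here only to
# demonstrate the concatenation).

f3 = np.zeros((52, 52, 256))            # encoder feature map
decoder = np.zeros((52, 52, 512))       # decoder map after up-sampling
merged = np.concatenate([decoder, f3], axis=-1)
print(merged.shape)                     # (52, 52, 768)
```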

Semantic segmentation model based on PspNet + MobileNet
In image semantic segmentation, erroneous segmentation results strongly correlate with the loss of global information and contextual semantic relationships caused by convolution operations with limited receptive fields. Therefore, a deep network that appropriately attends to scene features in the global scope can significantly improve the accuracy of semantic segmentation. PspNet belongs to this type of network model. It is based on ResNet and FCN, uses multiscale feature fusion and pyramid pooling, and finally performs pixel-level segmentation prediction through convolution.
The pyramid pooling module extracts features at four different scales. (14) The red block is a single bitmap generated by global average pooling, representing the coarsest pyramid level. The other three pyramid levels divide the feature map into 2 × 2, 3 × 3, and 6 × 6 subregions. The finer the division, the more refined the scene features that can be mined. After pooling at each level, features of different depths are obtained. Then, through a 1 × 1 convolution, the feature dimensions are reduced, and the results are directly up-sampled to the same size as the shallow features. Next, the fused deep global features (context information) are combined with the shallow detailed features to obtain the final feature map. Finally, the semantic segmentation prediction map is generated through one convolutional layer.
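The multi-scale pooling step can be sketched in NumPy: a feature map whose side length is divisible by the bin counts (18 is divisible by 1, 2, 3, and 6, matching the sizes used later) is averaged into 1 × 1, 2 × 2, 3 × 3, and 6 × 6 grids. This is an illustrative single-channel sketch, not the paper's implementation.

```python
import numpy as np

# Adaptive average pooling used by the pyramid module: the feature map is
# averaged into bins x bins blocks (single channel; 1x1 convolution and
# up-sampling steps omitted).

def adaptive_avg_pool(fmap, bins):
    h, w = fmap.shape
    bh, bw = h // bins, w // bins
    return fmap.reshape(bins, bh, bins, bw).mean(axis=(1, 3))

fmap = np.arange(18 * 18, dtype=float).reshape(18, 18)
for bins in (1, 2, 3, 6):
    pooled = adaptive_avg_pool(fmap, bins)
    print(bins, pooled.shape)
```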
As shown in Fig. 5, the encoder part of the semantic segmentation network based on PspNet + MobileNet constructed in this paper still uses the MobileNet structure. The feature map is compressed five times through depthwise separable convolutions, and the result is denoted as f5. In the decoder, f5 is passed through four average pooling layers of different sizes, and the pooled results are resized by linear interpolation. Finally, the four resized feature maps are concatenated with f5, and a convolution operation outputs an image whose number of channels is the number of categories, which is the result of the semantic segmentation.
The network information is shown in Table 3. We resize the input image to 576 × 576 × 3 and, through a series of depthwise separable convolutions, compress it five times to obtain the feature map f5, whose size is 18 × 18 × 1024. After f5 undergoes the four average pooling operations, as well as 1 × 1 convolution, BatchNorm, ReLU activation, and linear-interpolation resizing, four feature maps with a size of 18 × 18 × 512 are obtained and concatenated with f5. Then, after a 1 × 1 convolution and a 3 × 3 × n_classes convolution, a resize operation yields a feature map of 144 × 144 × 22, and the image passes through the reshape and softmax layers to become 20736 × 22. The loss function differs slightly from those of the previous two models: when calculating the prediction result, we first change the number of channels of the concatenated output feature map to the number of categories and use softmax to calculate the category probabilities after resizing.

Method of food calorie conversion
When constructing the data set, we also record the food weight of the i-th training sample and obtain its caloric value, denoted by K_i, according to the Chinese Food Nutrition Table. (15) In addition, the area ratio of the food to the coin in each sample image, denoted by S_i, can be obtained from the label image. Therefore, a correspondence between the area information of the food image and the caloric value of the food can be established, denoted by K_S. For n samples of the same class, K_S of this type of food is calculated as

K_S = (1/n) Σ_{i=1}^{n} (K_i / S_i).

Essentially, K_S represents the caloric value contained in a unit of food area ratio, which conforms to the law of nutrition. The area ratio of the real food to the coin is denoted by S_r. According to the predicted food category and pixel areas output by the model, the real caloric value of the food in the image, denoted by K_r, can be calculated as

K_r = K_S × S_r.
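The conversion above reduces to two short formulas, sketched here in Python. The function names and all sample values are illustrative, not from the paper's data.

```python
# Calorie conversion sketch: K_S averages calories per unit area ratio over
# n training samples of one dish; a new image's calories are K_r = K_S * S_r.

def calories_per_ratio(K, S):
    """K: caloric values of n samples; S: their food-to-coin area ratios."""
    return sum(k / s for k, s in zip(K, S)) / len(K)

def estimate_calories(K_S, S_r):
    """Convert a predicted food-to-coin area ratio S_r to calories."""
    return K_S * S_r

K = [120.0, 150.0, 135.0]   # kcal of three samples of the same dish (made up)
S = [4.0, 5.0, 4.5]         # food-to-coin area ratios from the label images
K_S = calories_per_ratio(K, S)
print(round(K_S, 2), round(estimate_calories(K_S, 4.8), 2))
```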

Image acquisition
Since the distance and angle between the camera and the food cannot be guaranteed in the actual capturing process, it is necessary to use a standard reference object to construct a correspondence with the food object and compensate for the information lost by the shooting method. In this study, we seek the correspondence between the food area in the image and the calories so that calories can be converted on the basis of area information. Since the area of the standard reference object is fixed, we use image processing to calculate the ratio between the food area and the reference-object area and thereby obtain the food area. In actual application scenarios, the food object and camera position can be fixed to avoid shooting-technique problems.
In this study, we choose a common one-yuan (CNY) coin with a uniform specification as the standard reference object. The coin is placed near the food but separated from the food area, and an image of the coin and food is taken together. After capturing, the dish sample is weighed with an electronic scale and its weight is recorded. The equipment needed in the data collection includes a 300 ml disposable lunch box for holding dish samples, standard reference objects (coins), black shading plates for shading, and an electronic scale for weighing the dish samples, as shown in Fig. 7.
When shooting images, we use different models of mobile terminals, and there are no strict constraints: images can be taken under different lighting conditions, angles, and coin placements. However, it is necessary to ensure that the food and coin are both exposed to the lens simultaneously. Occlusion by the disposable lunch box can be ignored if the covered area does not exceed 10%; if the occluded area of the coin or food exceeds 10% of its own area, the image is considered invalid data and is not included in the data set. We chose 20 familiar dishes for data collection.
An example of the image data is shown in Fig. 8. The dishes are familiar ones, including vegetarian and meat dishes, single foods, and combined dishes. The food images visually present the following features: shape features include granular, block, filament, flake, and combinations; color features are mainly white, yellow, black, brown, green, red, and other color systems.
To calculate the actual caloric value of food, it is necessary to know the actual weight of each dish. Therefore, after each food image is taken according to the specifications, we use an electronic scale to weigh the dish and record its net weight (minus the weight of the standard disposable lunch box). For images generated by data augmentation, since augmentation does not change the area information of the food and coin in the image, the recorded weight equals that of the original image.

Image preprocessing
Data augmentation is used to obtain a sufficient amount of data for training. There are four image augmentation techniques, namely, brightness adjustment, horizontal flip, vertical flip, and rotation. Figure 9 shows the effects of these techniques.
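The four operations can be sketched in a few lines of NumPy (an illustrative version; the rotation here is a simple 90-degree rotation, and the function names are not from the paper's code).

```python
import numpy as np

# The four augmentation operations, sketched on an H x W x 3 uint8 image.

def adjust_brightness(img, factor):
    return np.clip(img.astype(float) * factor, 0, 255).astype(np.uint8)

def horizontal_flip(img):
    return img[:, ::-1]

def vertical_flip(img):
    return img[::-1, :]

def rotate_90(img):
    return np.rot90(img)

img = np.random.randint(0, 256, (416, 416, 3), dtype=np.uint8)
augmented = [adjust_brightness(img, 1.2), horizontal_flip(img),
             vertical_flip(img), rotate_90(img)]
print([a.shape for a in augmented])
```

Because none of these operations changes the relative areas of the food and coin, the weight recorded for an original image carries over to its augmented copies, as noted above.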
After rigorous screening and data augmentation, 10000 images were obtained. The Labelme software is used to annotate all images and generate the corresponding JSON annotation files, from which the labeled images are generated through batch conversion. After classification, the data are stored on the deep learning server. Figure 10 shows the original and labeled images.

Experimental setting
The data set is randomly divided into training, validation, and test sets at a ratio of 3:1:1. Table 4 shows the distribution of the data set. During the experiment, the training set is used to train the food segmentation network model; continuous training lets the model learn more features and approach the real results, reducing the generalization error. To prevent over-fitting, the validation set is used during the training process. The test set, which does not participate in training, is used to evaluate the final generalization ability of the model.
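A 3:1:1 split of the 10000 images can be sketched as follows (an illustrative version; the seed and helper name are arbitrary choices, not from the paper).

```python
import random

# Random 3:1:1 split into training, validation, and test sets.

def split_dataset(samples, seed=42):
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 3 // 5          # 3 parts of 5
    n_val = n // 5                # 1 part of 5; remainder is the test set
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(list(range(10000)))
print(len(train), len(val), len(test))   # 6000 2000 2000
```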
In the food calorie calculation task, the area ratio of the food to the coin in the image must be calculated. The actual area ratio is denoted by Y_t and the predicted area ratio by Y_r. In this task, the error and accuracy are calculated from the deviation between Y_t and Y_r.
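The exact error and accuracy formulas are left implicit here. One plausible reading (an assumption, not the authors' stated definition) is the mean relative error between Y_t and Y_r, with accuracy defined as one minus that error:

```python
# Assumed metric: mean relative error of the predicted area ratio, with
# accuracy = 1 - error. All sample values below are made up.

def relative_error(y_true, y_pred):
    return abs(y_true - y_pred) / y_true

def ratio_accuracy(pairs):
    """pairs: list of (Y_t, Y_r) tuples for the test images."""
    errs = [relative_error(t, p) for t, p in pairs]
    return 1 - sum(errs) / len(errs)

pairs = [(4.0, 3.8), (5.0, 4.6), (4.5, 4.5)]   # (Y_t, Y_r) examples
print(round(ratio_accuracy(pairs), 4))
```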

Performance
Through a comparison of the training processes of the three models, we found that the SegNet + MobileNet and UNet + MobileNet models are similar in convergence and converge far faster than the PspNet + MobileNet model. This means that these two models are more efficient in finding an accurate solution.
Table 5 shows the accuracies of the three models in food recognition. The UNet + MobileNet model has the highest accuracy in the test set, reaching 0.9795. The gap between the SegNet + MobileNet and UNet + MobileNet models is small (only 0.0013). The PspNet + MobileNet model has the lowest accuracy, 0.9726. However, the differences among the three models in the test set are not significant, and all accuracies exceed 0.97. Figures 14-16 show the semantic segmentation results obtained using the three models; in these figures, the segmented coin and dish areas are marked with different colors. For each image, contour extraction and area calculation are used to obtain the coin and food areas, from which the food-to-coin area ratio is obtained. The actual size of the coin is known, so that of the dish can be estimated. The predicted food-to-coin area ratios are shown in Table 6.
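Once the segmentation mask is available, the area ratio reduces to pixel counting. The sketch below uses NumPy on a toy label mask (the class ids and mask contents are illustrative; the paper's pipeline additionally applies contour extraction).

```python
import numpy as np

# Food-to-coin area ratio from a predicted label mask, by counting the
# pixels assigned to each class.

FOOD, COIN = 1, 2   # illustrative class ids (0 = background)

def area_ratio(mask):
    food_px = np.count_nonzero(mask == FOOD)
    coin_px = np.count_nonzero(mask == COIN)
    return food_px / coin_px

mask = np.zeros((10, 10), dtype=int)
mask[2:8, 2:8] = FOOD     # 36 food pixels
mask[0:2, 8:10] = COIN    # 4 coin pixels
print(area_ratio(mask))   # 9.0
```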
In calculating the area ratio of the food to the reference object in the image, the SegNet + MobileNet model achieves the highest accuracy, much higher than those of the other two models. Combined with the model performance in semantic segmentation, we therefore chose the SegNet + MobileNet model for food recognition and area extraction. Although the UNet + MobileNet model performs best in segmentation, the SegNet + MobileNet model is very close to it, and the label maps segmented by the SegNet + MobileNet model show no glitches or loss of feature information. Most importantly, in the area ratio calculation, its accuracy rate is the highest in Table 7.

Implementation of APP
For the health management of diabetic patients, an APP for food calorie estimation is designed and implemented; at this stage, only Android smartphones are supported. According to the experimental results, the trained SegNet + MobileNet model is selected and deployed in the APP. Figure 17 shows how the food calorie estimation is performed, and the test results of 24 dishes are shown in Fig. 18. Please refer to Fig. 6.

Conclusions
In this paper, we proposed a novel approach based on food image recognition and calorie conversion. We elaborated the establishment of a food image data set and the specific methods of recognizing food images and calculating food calories from them. We constructed a deep learning model to predict the classification and segmentation of food images, and then calculated the food area via the image segmentation. In addition, we determined the correspondence between the unit area of the food image and the caloric value using various dishes. The experimental results showed that the semantic segmentation accuracy of SegNet + MobileNet is 0.9782 and that the accuracy of calculating the area ratio of the food to the reference object in the food calorie recognition task reaches 0.8495.
On this basis, we also designed and implemented an APP for diabetic patients. This APP provides a function to calculate food calories via a smartphone. In the future, more types of food information will be collected in more scenarios, and we will develop a method of calculating food volume without a reference object.
The training curves of the three models are shown in Figs. 11-13. The figures on the left are the accuracy curves of the training and validation sets during the training process, and the figures on the right are the loss curves. Because an early stopping mechanism is set, the PspNet model training stops at 16 steps, which does not affect the validity of the training results.
Yu-Ze Wang is an undergraduate student majoring in medical information engineering at the University of Shanghai for Science and Technology. He studies and engages in smart medical software development and project management. Recently, his team organized and developed a children's health management system, which won first prize in the Chinese college students' computer design competition. His research interests include medical image processing, software engineering, and their applications in medical services and hospital information sectors. (wang_yz0613@163.com)

Rui-Yang Peng is a graduate student majoring in electronic information (biomedical engineering) at the University of Shanghai for Science and Technology. Her research direction is intelligent medical care and big clinical data. At the Institute of Medical Informatics, she is currently engaged in big data analysis, data mining, and software development. Her research interests include intelligent medical care, big data technology, machine learning, and other related fields. During her studies, she has participated in and won many competitions, including the national second prize in the China Postgraduate Mathematical Modeling Competition and the school's second prize scholarship. (peng_ruiyang@163.com)

Xin-Yue Li is an undergraduate student majoring in medical information engineering at the University of Shanghai for Science and Technology. She studies and participates in intelligent medical software development and project management and has won individual and team awards. Her research interests are mainly in medical information engineering, software engineering, and artificial intelligence. (2593045307@qq.com)

Ying-Rui Lv is an undergraduate student majoring in business administration at the University of Shanghai for Science and Technology. She studies and participates in intelligent medical project management. Her research interests are mainly in corporate governance, programming, and artificial intelligence. (771963328@qq.com)

Table 4
Distribution of data set.

Table 5
Food recognition accuracies of the three models.

Table 6
Predicted area ratios (Y_r) of the labeled food to the coin.