Automatic Identification of Tomato Pests Using Parallel Deep Learning Models

Owing to the increasingly serious greenhouse effect and rising global temperatures, pest reproduction and metabolism will accelerate, leading to significant reductions in crop yields. To date, many studies have applied deep learning models to pest identification tasks. However, many pest types have similar shapes, so in this study, we propose a parallel deep learning model with an attention mechanism module to improve the classification of tomato pest species. From a public dataset, we selected eight common tomato pests, namely, Bemisia tabaci, Helicoverpa armigera, Myzus persicae, Spodoptera exigua, Spodoptera litura, Thrips palmi, Tetranychus urticae, and Zeugodacus cucurbitae, with a total of 412 original images. The original images were expanded to 1655 images through horizontal flipping and angle rotation. The proposed ECA-Xception-MobileNet (EXM-Net) extracted image features on the basis of Xception and MobileNetV2, added an Efficient Channel Attention (ECA) module before the global average pooling layer, and then used the convolution operation to fuse the two model outputs to enhance model performance. The accuracy, precision, recall, F1-score, and PR-AUC score after data augmentation were 98.72, 98.44, 98.86, 99.41, and 99.76%, respectively. After experiments and testing on different datasets, it was confirmed that EXM-Net


Introduction
Insect pests are a major factor in reducing the economic value of crops. Therefore, strengthening pest monitoring and prevention capabilities is critical for food security and protecting agricultural economies. However, the key to achieving this goal is to identify pests rapidly and accurately so that effective recommendations can be made about where the infestation occurs and what subsequent measures to take. Traditional pest identification relies on experienced practitioners to identify pests on the basis of their external characteristics, which is cumbersome and time-consuming. With the rapid development of artificial intelligence and computer vision technology, researchers have proposed many automatic pest identification methods to alleviate these problems. (1) Ayan et al. (2) proposed an ensemble convolutional neural network model called GAEnsemble in 2020. The GAEnsemble model integrated the best three models, namely, Inception-V3, Xception, and MobileNet, from seven different pretrained models. The results showed that the proposed ensemble model achieved accuracies of 98.81, 95.16, and 67.13% in classifying the D0, SMALL, and IP102 datasets, respectively. Khanramaki et al. (3) proposed an ensemble classifier to identify citrus pests in 2021. First, the original RGB image is converted into various color spaces to form different feature subsets, and then an integrated classifier composed of AlexNet, VGG16, ResNet50, and Inception-ResNet-v2 is used for classification. The results showed that the proposed model achieved an accuracy of 99.04% for the multiclassification of citrus plant pest images.
Wang et al. (4) proposed a combination of ConvNeXt and Swin Transformer that leverages the different advantages of the two models to achieve complementary effects. The extracted feature maps were input into a multilayer residual block (MIX-Block) for fusion, which eliminates the duplication between the two models and learns more complex features. The results showed that the proposed architecture achieved accuracies of 76.1, 93.1, and 98.5% in classifying the IP102, insect, and D0 datasets, respectively. Abade et al. (5) proposed a soybean crop pest dataset called NemaDataset in 2022 and a convolutional neural network architecture called NemaNet. The architecture uses DenseNet121 and InceptionV3 to extract features. The feature maps extracted by the two models are combined and finally input to the fully connected layer for classification. The results showed that the transfer learning architecture achieved an accuracy of 98.82% for pest classification. These studies indicate that deep learning models can effectively identify pest species and help farmers make rapid decisions.
Tomato (Solanum lycopersicum) is one of the most important fruit and vegetable crops in the world because of its high nutritional value and because it can be processed into various products. According to the Food and Agriculture Organization (FAO) of the United Nations, pests cause significant damage to the world's total crop production every year. Such economic losses can be reduced if pests and diseases affecting tomato crops are identified early, and appropriate tillage and pest control are provided to meet crop growth needs. (6)(11)(12)(13)(14)(15)(16)(17)(18) In this study, we took eight common tomato pests as research objects to assist farmers in identifying tomato pests.
Thus far, parallel models have been used in many fields, such as disease detection, (19,20) plant leaf disease identification, (21)(22)(23)(24) and pest identification. (25,26) The aforementioned studies have achieved good results in various fields. However, these parallel models only use the original deep models and are not combined with other improvement methods, such as the attention mechanism and knowledge distillation, to achieve breakthroughs in model performance. In addition to proposing a parallel model, XM-Net, we combined it with the attention mechanism module to form EXM-Net, which performs better on tomato pest identification. The contributions of this study are as follows:
1. We combined the Xception and MobileNetV2 models to propose a parallel XM-Net model.
2. We proposed a parallel EXM-Net model to optimize the parallel model structure by adding an attention mechanism module.
3. The EXM-Net model exhibits the advantages of both the parallel model and the attention mechanism module.
Figure 1 shows the research steps of this study. First, the dataset is split into training, validation, and test sets at a ratio of 60:20:20. The images in the training and validation sets are augmented through rotations and flips to expand the number of images. The performance characteristics of VGG16, ResNet50, EfficientNetB0, Xception, MobileNetV2, and InceptionV3 are compared to select the top two models to form the proposed parallel XM-Net model. Finally, the proposed EXM-Net model adds an attention module to the XM-Net model to enhance model performance.
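The 60:20:20 split can be illustrated with a short sketch. It assumes that the split is performed class by class (stratified), which is an assumption made here and is not stated explicitly in the text; the function name `stratified_split`, the seed, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists, split class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_val = round(len(indices) * ratios[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical toy labels: 50 images for each of two classes.
labels = ["BA"] * 50 + ["HA"] * 50
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting before augmentation, as in the steps described for Fig. 1, keeps augmented copies of a test image out of the training set.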

Tomato pest dataset
The dataset used in this study was selected from Ref. 6. Eight common tomato pests, namely, Bemisia tabaci (BA), Helicoverpa armigera (HA), Myzus persicae (MP), Spodoptera exigua (SE), Spodoptera litura (SL), Thrips palmi (TP), Tetranychus urticae (TU), and Zeugodacus cucurbitae (ZC), were selected with a total of 412 original images. Table 1 shows the number of original images for training, validation, and test groups in each class. We applied horizontal flipping and several rotation angles to expand the number of images, increasing the original 412 images to 1655 images. Table 2 shows the number of images for training, validation, and test groups in each class after augmentation. Figure 2 shows examples of tomato pests, and Fig. 3 shows examples of data augmentation.
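A minimal sketch of flip and rotation augmentation is given below. The rotation angles are not specified in the text, so rotations by multiples of 90 degrees are an assumption chosen for illustration; images are represented as nested lists of pixel rows, and the number of variants per image is not meant to reproduce the 1655-image total.

```python
def horizontal_flip(image):
    """Mirror an image (list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the original plus a flipped copy and three rotated copies."""
    variants = [image, horizontal_flip(image)]
    rotated = image
    for _ in range(3):  # 90, 180, and 270 degrees (assumed angles)
        rotated = rotate_90_clockwise(rotated)
        variants.append(rotated)
    return variants


# Hypothetical 2x2 "image" used only to show the transformations.
toy = [[1, 2],
       [3, 4]]
for variant in augment(toy):
    print(variant)
```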

Attention networks
The attention mechanism selects important features on the basis of different weight scores to improve model performance. Initially, the attention mechanism was applied in natural language processing, and it has now been extended to image, speech, and other data processing tasks. There are three common attention mechanisms in the field of deep learning, namely, squeeze-and-excitation network (SENet), efficient channel attention network (ECANet), and convolutional block attention module (CBAM).

SENet
SENet was proposed by the autonomous driving company Momenta in 2017 and won the classification task of the annual ILSVRC competition. This model improves the model

ECANet
An ECANet utilizes an ECA module to improve model performance. First, the ECA module performs global average pooling on the input feature map, and then uses a 1D convolution kernel to exchange information between neighboring channels. The size of the convolution kernel is chosen adaptively according to the channel dimension, which keeps the module efficient while maintaining performance. Figure 5 shows the architecture of ECANets. (28)
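The ECA steps can be sketched in plain Python as follows. This is a simplified illustration rather than the authors' implementation: feature maps are nested lists indexed as [channel][row][col], the 1D kernel weights are passed in instead of being learned, and the kernel-size mapping with gamma = 2 and b = 1 follows the defaults reported in the ECA-Net paper (28), which are assumed here.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive odd kernel size derived from the channel dimension."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def eca(feature_map, kernel):
    """Apply ECA to a feature map given as [channel][row][col].

    `kernel` is the 1D convolution kernel (odd length) shared by all channels.
    """
    # 1. Global average pooling: one descriptor per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # 2. 1D convolution across neighbouring channels, zero padded.
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    mixed = [sum(w * padded[c + j] for j, w in enumerate(kernel))
             for c in range(len(pooled))]
    # 3. Sigmoid gate, then rescale every channel of the input.
    gates = [sigmoid(m) for m in mixed]
    return [[[g * v for v in row] for row in ch]
            for g, ch in zip(gates, feature_map)]


# Kernel sizes for 2048-channel and 1280-channel feature maps.
print(eca_kernel_size(2048), eca_kernel_size(1280))
```

Because the only parameters are the few weights of the 1D kernel, the module adds very little cost to the backbone it is attached to.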

CBAM
The CBAM combines channel and spatial attention mechanisms and has been used in many common convolutional neural networks. It has been confirmed to effectively improve the performance of convolutional neural networks in tasks such as image classification and object detection. As shown in Fig. 6, the CBAM is mainly divided into a channel attention module and a spatial attention module. The details of the two modules are explained below. (29)
1. Channel attention module: First, the feature map is input to the global maximum and average pooling layers, and each pooled result is then processed by a shared multilayer perceptron. The two results are added and passed through the sigmoid activation function to generate the weight of each channel, and finally, the weight of each channel is multiplied with the input feature map.
2. Spatial attention module: The maximum and average values of each feature point in the feature map output by the channel attention module are extracted and concatenated. Then, a convolution layer with one output channel reduces the dimension, and the result is passed through the sigmoid activation function to generate a spatial attention feature map. Finally, this map is multiplied by the input feature map to obtain the final feature map.
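The two CBAM modules can be sketched as follows, under simplifying assumptions: feature maps are nested lists indexed as [channel][row][col], the shared perceptron weights `w1` and `w2` are supplied by the caller rather than learned, and for the spatial module only the pooling and concatenation step is shown (the convolution and sigmoid that follow are omitted). All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vector, w1, w2):
    """Two-layer perceptron with ReLU; w1 is [hidden][in], w2 is [out][hidden]."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vector))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feature_map, w1, w2):
    """CBAM channel attention on a feature map given as [channel][row][col]."""
    flat = [[v for row in ch for v in row] for ch in feature_map]
    avg_pooled = [sum(ch) / len(ch) for ch in flat]
    max_pooled = [max(ch) for ch in flat]
    # The same perceptron weights process both pooled descriptors.
    avg_out = shared_mlp(avg_pooled, w1, w2)
    max_out = shared_mlp(max_pooled, w1, w2)
    weights = [sigmoid(a + m) for a, m in zip(avg_out, max_out)]
    return [[[w * v for v in row] for row in ch]
            for w, ch in zip(weights, feature_map)]


def spatial_descriptor(feature_map):
    """Per-position max and mean across channels, stacked as two maps."""
    rows, cols = len(feature_map[0]), len(feature_map[0][0])
    max_map = [[max(ch[r][c] for ch in feature_map) for c in range(cols)]
               for r in range(rows)]
    avg_map = [[sum(ch[r][c] for ch in feature_map) / len(feature_map)
                for c in range(cols)] for r in range(rows)]
    return [max_map, avg_map]
```

In the full module, the two-map descriptor would be passed through a convolution with one output channel and a sigmoid, and the resulting map would multiply the channel-refined features position by position.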

VGGNet
VGGNet is a deep convolutional neural network jointly developed by the Oxford University Computer Vision Research Group and DeepMind. The network not only has good performance, but also has strong scalability and excellent generalization. (30) The network structure is simple and easy to implement. VGG16 has 13 convolutional layers, three fully connected layers, and one output layer, and VGG19 has three additional convolutional layers. The convolutional layers use the ReLU activation function to avoid the vanishing gradient problem, and a max pooling layer follows each group of convolutional layers. VGGNet replaces larger convolution kernels by stacking multiple smaller convolution kernels to maintain the size of the receptive field. This also reduces the number of parameters while increasing the number of nonlinear mappings. (30) Figure 7 shows the architecture of VGG16.
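The trade-off between stacked small kernels and a single large kernel can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the channel width of 64 is only illustrative.

```python
def conv_weights(kernel, channels):
    """Weights in one kernel x kernel convolution with equal in/out channels."""
    return kernel * kernel * channels * channels


def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)


channels = 64  # illustrative channel width
for layers, big_kernel in [(2, 5), (3, 7)]:
    # The stack of 3x3 layers covers the same area as one larger kernel.
    assert stacked_receptive_field(3, layers) == big_kernel
    small = layers * conv_weights(3, channels)
    large = conv_weights(big_kernel, channels)
    print(f"{layers} x (3x3): {small} weights; "
          f"one {big_kernel}x{big_kernel}: {large} weights")
```

Each extra 3x3 layer also adds one more nonlinearity, which is the second benefit mentioned in the text.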

ResNet
The success of ResNet lies in the use of the residual network to solve the problem of performance degradation as the network depth increases. (31) Residual modules are mostly composed of two or three convolution modules, identity mapping, and skip connections. By using skip connections, information can flow directly from shallow layers to deep layers to alleviate the vanishing gradient problem. The ResNet model architecture is based on VGG19 and is modified by adding residual modules. The model has been pretrained on the ImageNet large-scale image dataset, so it is often used for tasks such as image classification and object detection in deep learning. (31) Figure 8 shows the architecture of ResNet50.

EfficientNet
Google proposed a new compound network scaling method in 2019. For high-resolution images, the deeper the network, the larger the receptive field, and the wider the network, the more detailed the features. EfficientNetB0 draws on MnasNet for a multi-objective neural architecture search and uses the same MBConv as MobileNetV2 as the backbone while optimizing the model performance and the number of floating point operations (FLOPs). (32) Figure 9 shows the architecture of EfficientNetB0.

Xception
On the basis of the InceptionV3 model, Xception uses depthwise separable convolution to replace the original Inception module and to separate the spatial and channel dimensions, which reduces the network complexity, improves the model performance, and reduces the model parameters. As with InceptionV3, Xception consists of Entry, Middle, and Exit modules. Similarly to the ResNet model, the residual modules enable Xception to converge quickly, which reduces the training time. (33) The architecture of Xception is shown in Fig. 10.
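The parameter saving from separating the spatial and channel dimensions can be illustrated with a weight count. The sketch ignores biases, and the layer sizes (a 3x3 kernel mapping 128 channels to 256 channels) are illustrative values, not taken from the Xception architecture tables.

```python
def standard_conv_weights(kernel, c_in, c_out):
    """Every output channel has its own kernel x kernel x c_in filter."""
    return kernel * kernel * c_in * c_out


def separable_conv_weights(kernel, c_in, c_out):
    depthwise = kernel * kernel * c_in   # one spatial filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution mixing channels
    return depthwise + pointwise


k, c_in, c_out = 3, 128, 256  # illustrative layer sizes
std = standard_conv_weights(k, c_in, c_out)
sep = separable_conv_weights(k, c_in, c_out)
print(std, sep, round(std / sep, 1))
```

For these sizes, the separable layer needs roughly one-ninth of the weights of the standard layer, which is where the reduction in model parameters comes from.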

MobileNetV2
On the basis of MobileNetV1, MobileNetV2 adds a residual network to reduce model calculation costs, accelerate model convergence, and maintain feature extraction capabilities. MobileNetV2 uses a linear bottleneck in the residual network to ensure that the model has higher feature transfer and learning capabilities. MobileNetV2 is composed of three convolution layers, seven bottleneck residual blocks, and one average pooling layer. (34) Figure 11 shows the architecture of MobileNetV2.

InceptionV3
The main improvement of InceptionV3 is the use of smaller and asymmetrical convolutional layers to increase the width of the model architecture to generate high-dimensional feature maps. Compared with GoogLeNet, InceptionV3 uses only one auxiliary classifier at the end of the model to achieve a function similar to dropout regularization, helping the model to be more efficient and stable. InceptionV3 consists of the Inception module, convolution layers, and max pooling layer. (35) Figure 12 shows the architecture of InceptionV3.

Hyperparameter optimization
In this study, we used a hyperparameter optimization method called the Tree-structured Parzen Estimator (TPE) (36) to search for the optimal parameter combination. This method builds on Bayesian optimization by constructing two Gaussian mixture models that model the probabilities of good and bad results, and it uses them to evaluate the quality of a candidate combination. The process is repeated until the optimal hyperparameter combination is found or a set number of iterations is reached. Several related studies have confirmed that this algorithm can find better hyperparameter combinations in fewer evaluations, making it computationally efficient. Owing to the limited memory capacity of the hardware, we only considered batch sizes from 4 to 8, epochs from 30 to 50, and learning rates from 0.01 to 0.00001.
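The role of the two density models can be written compactly. The following is a sketch of the standard TPE formulation, not a description of the authors' specific implementation; the symbols are notation introduced here for illustration: x is a hyperparameter combination, y is the loss to be minimized, y* is the threshold separating good from bad results, and γ is the fraction of results counted as good.

```latex
\[
p(x \mid y) =
\begin{cases}
  \ell(x), & y < y^{*} \quad \text{(good results)} \\
  g(x),    & y \ge y^{*} \quad \text{(bad results)}
\end{cases}
\qquad
\gamma = p(y < y^{*})
\]
\[
\mathrm{EI}_{y^{*}}(x) \;\propto\;
\left( \gamma + \frac{g(x)}{\ell(x)}\,(1 - \gamma) \right)^{-1}
\]
```

Under this formulation, each iteration favors the candidate with the largest ratio ℓ(x)/g(x), that is, a combination that is likely under the good-result model and unlikely under the bad-result model.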

Gradient-weighted class activation mapping (Grad-CAM)
In this study, we used gradient-weighted class activation mapping (Grad-CAM) to visualize the feature map of the convolutional neural network. Grad-CAM backpropagates the class score to the final feature map of the model to obtain gradient information of the same size as the feature map, averages the gradients of each channel to obtain the channel weights, and sums the feature maps weighted by these values to obtain a heat map of the area of concern.
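The heat map computation can be sketched as follows. In practice, the gradients come from backpropagation in the deep learning framework; here they are supplied directly as nested lists indexed as [channel][row][col], and the final ReLU follows the original Grad-CAM formulation. The toy values are hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heat map from [channel][row][col] activations and gradients.

    `gradients` holds d(class score)/d(activation) for the target class.
    """
    rows, cols = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial mean of that channel's gradients.
    weights = [sum(map(sum, g)) / (rows * cols) for g in gradients]
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    return [[max(0.0, sum(w * a[r][c] for w, a in zip(weights, activations)))
             for c in range(cols)] for r in range(rows)]


# Hypothetical 2-channel, 2x2 feature maps and gradients.
acts = [[[1.0, 2.0], [3.0, 4.0]],
        [[4.0, 3.0], [2.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],
         [[-2.0, -2.0], [-2.0, -2.0]]]
heatmap = grad_cam(acts, grads)
print(heatmap)
```

The resulting map has the spatial size of the final feature map and is usually upsampled to the input image size before being overlaid on the image.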

Computer equipment
The equipment used in the experiment is an Intel® Core™ i7-10700 2.90 GHz CPU with 32 GB of memory and an NVIDIA GeForce RTX 2070 GPU. The whole experiment process is performed using Python 3.8 (Python Software Foundation, Fredericksburg, Virginia, USA) with Keras 2.6 and TensorFlow GPU 2.6.0.

Model performance characteristics
The model performance characteristics before data augmentation are shown in Table 3. Except for InceptionV3, the accuracies of the models are above 0.70. The top two accuracies are 0.8205 and 0.7948, which originate from Xception and MobileNetV2, respectively. The model performance characteristics after data augmentation are shown in Table 4. Compared with the results in Table 3, all the performance characteristics are improved significantly after data augmentation. As in Table 3, except for InceptionV3, the accuracies of the models are above 0.90 in Table 4. The top two accuracies are 0.9615 and 0.9615, which originate from Xception and MobileNetV2, respectively. The above two models were selected to form the proposed parallel XM-Net model (see Sec. 2.3.7).
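For reference, the accuracy, precision, recall, and F1-score reported in the tables can be computed from a confusion matrix as sketched below. Macro averaging over classes is an assumption made here, since the averaging scheme is not stated in the text; PR-AUC is omitted because it requires predicted probabilities. The confusion matrix is a hypothetical example, not data from this study.

```python
def classification_metrics(confusion):
    """Accuracy and macro-averaged precision, recall, and F1-score.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(map(sum, confusion))
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = confusion[i][i]
        predicted = sum(confusion[r][i] for r in range(n))  # column sum
        actual = sum(confusion[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical 3-class confusion matrix, not taken from the paper's tables.
toy = [[9, 1, 0],
       [0, 10, 0],
       [0, 1, 9]]
acc, prec, rec, f1 = classification_metrics(toy)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
```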

Proposed parallel XM-Net
The performance characteristics of VGG16, ResNet50, EfficientNetB0, Xception, MobileNetV2, and InceptionV3 were compared to select the top two models (Xception and MobileNetV2) to form the proposed parallel XM-Net model. Unlike an ensemble model, in which the base models are trained individually, the parallel model concatenates the feature maps output by the base models to increase the execution speed. Figure 13 shows the design of the proposed parallel XM-Net model.

Adding attention mechanisms
Three common attention mechanisms, namely, SE, ECA, and CBAM, were combined with the proposed parallel XM-Net model. The attention mechanisms were added before the average pooling layer for both Xception and MobileNetV2 to form SE-XM-Net, ECA-XM-Net, and CBAM-XM-Net. Among these three models, the model with the highest performance became the final proposed EXM-Net model.
Results of the proposed parallel XM-Net model and the three common attention mechanisms, namely, SE, ECA, and CBAM, added to the proposed parallel XM-Net model are shown in Table 5. The accuracies from the proposed parallel XM-Net, SE+XM-Net, CBAM+XM-Net, and ECA+XM-Net models are 0.9743, 0.8974, 0.9743, and 0.9872, respectively. Adding the SE module degrades the performance of the proposed parallel XM-Net model, whereas there is no significant difference in the effect of the model with or without adding the CBAM module.
Adding the ECA attention mechanism to the proposed parallel XM-Net model achieves the highest performance, and finally, the combination becomes the proposed EXM-Net model in this study. The performance plots after adding the attention mechanism for the proposed EXM-Net model are shown in Fig. 14; Fig. 15 shows the architecture of the proposed parallel EXM-Net model.

Ablation experiments
In this section, we compare the model performance obtained by adding Dropout and the ECA attention mechanism, using the images after augmentation. Adding Dropout increased the accuracy of Xception from 0.9487 to 0.9615. The accuracy of the parallel model Xception+MobileNetV2 reached 0.9872 when Dropout and ECA were used (Experiment 4 in Table 6). As shown in Table 6, the accuracy, loss, confusion matrix, and PR-AUC from Experiments 3 and 4 are clearly better than those from Experiments 1 and 2. Adding Dropout at the end of Xception prevents overfitting, thereby improving the model performance. After the two models are paralleled, the model learns more detailed features because each branch complements the features missed by the other, and the accuracy is 0.0385 higher than that of a single model. It can be seen from Table 6 that the parallel model with the ECA attention module achieves the highest accuracy of 0.9872.

Results of hyperparameters
The optimal combination for the dataset before data augmentation after five cross-validations was batch size = 4, epochs = 50, and learning rate = 1.3777832919010947e−05, which resulted in higher accuracy, recall, F1-score, precision, and PR-AUC of 0.8461, 0.8290, 0.8362, 0.8571, and 0.8964, respectively. For the augmented dataset, the optimal combination after five cross-validations was batch size = 8, epochs = 50, and learning rate = 1.5151670794642941e−06, but there were only slight improvements in accuracy and F1-score. Table 7 shows the results of the models with hyperparameters.

Results of feature visualization
It can be seen from the visualization results in Fig. 16 that the focus of EXM-Net is more comprehensive than that of a single model. The two-branch model provides rich pest features instead of redundant background features, which improves the ability of EXM-Net to identify tomato pests while reducing the attention paid to complex backgrounds.

Comparison with SOTA
Table 8 presents the performance characteristics from related studies. Sun et al. (38) improved the SE attention mechanism according to the architecture of the fire module in SqueezeNet by changing the convolution kernel size to adjust the extracted features, forming S1 and S2 modules with different sizes. The two modules are joined to SqueezeNet to form SSNet. After experiments on module placement and quantity, the results showed that SSNet achieved a good accuracy of 98.06%. However, the authors did not consider the impact of other attention mechanisms on SqueezeNet. Chen et al. (39) proposed the feature positioning module EFLM and the adaptive filtering fusion module AFFM to improve pest identification capabilities. The results showed that the accuracy of the network with ResNet50 as the backbone reached 100%. Although the network performance is impressive, the architecture requires high computing power, which is not conducive to easy application by farmers. Huang et al. (6) constructed an architecture, ResNet50+DA, that combines a deep learning model and a machine learning classifier, and used Bayesian optimization methods to find the optimal hyperparameter combination. The results showed that the architecture achieved an accuracy of 0.9712. However, the authors only used a self-created tomato pest dataset and did not use other datasets to confirm that the proposed architecture can be widely used in different fields.

Conclusion
In this study, we proposed two parallel models, XM-Net and EXM-Net, to identify common tomato pests. XM-Net parallels two models, Xception and MobileNetV2, to extract different features. In addition, EXM-Net adds an ECA attention mechanism module to XM-Net to focus on more delicate features and further improve the model performance. The results showed that the accuracy, precision, recall, F1-score, and PR-AUC score after data augmentation from the proposed EXM-Net model reached 0.9872, 0.9844, 0.9886, 0.9941, and 0.9967, respectively.
In this study, we only selected eight common types of tomato pest; other pests still need to be collected and identified in a future study for more comprehensive pest control. In addition, the proposed XM-Net and EXM-Net models are only designed to achieve high accuracy, and the model parameters and model size were not considered in the experiment. In the future, a lightweight model suitable for mobile devices should be developed to construct a real-time and efficient pest identification system.

Compliance with Ethical Standards
Conflict of Interest: All the authors declare that they have no conflict of interest.
Ethical approval: This article does not contain any studies with human participants performed by any of the authors.
accuracy by selecting important features. Because the SE module is easy to implement, it is often applied in various research projects. As illustrated in Fig. 4, the SE module is mainly divided into three parts: squeeze, excitation, and reweight. The details of these parts are explained below. (27)

Table 2
Number of images in each class after augmentation.

Table 1
Number of original images in each class.

Table 3
Model performance characteristics before data augmentation.

Table 5
Model performance characteristics after adding attention mechanism.

Table 6
Results of ablation experiments.

Table 8
Comparison with SOTA.