Published in advance: March 25, 2024
Flexible Temporal Correlation Learning for Human, Animal, and Interactor Detection in Videos
Yanjun Feng and Jun Liu
(Received April 8, 2023; Accepted December 15, 2023)
Keywords: object detection, video understanding, attention, temporal learning
Video object detection is a key technology for detecting and tracking humans and animals in
behavior-understanding tasks. Detecting the small-scale interactors involved in human
activities is particularly challenging, and exploiting temporal context is important for continuous
understanding. Temporal object detection has received significant attention, but most
commonly used detection methods fail to fully leverage the abundant temporal information in
videos. In this paper, we propose a novel approach to detecting humans and animals in videos, called
attentional temporal You Only Look Once (ATYOLO), which exploits the attention mechanism
and convolutional long short-term memory. We use the proposed attentional module to integrate
a pyramidal feature hierarchy temporally and design a unique structure that includes a low-level
temporal unit and a high-level unit for multiscale feature maps. We also develop a
temporal analysis group with a temporal attention mechanism tailored for background and scale
suppression. This attentional group integrates attention-aware features over time. Extensive
comparisons evaluate the detection capability of the proposed approach and confirm its
superiority. ATYOLO achieves fast speed and competitive overall performance on video
detection benchmarks, including ImageNet Video (VID) and the
Stanford Drone Dataset (SDD).
Corresponding author: Jun Liu