pp. 4549-4566
S&M3817 Research Paper of Special Issue
https://doi.org/10.18494/SAM5238
Published: October 29, 2024

Retrieval-augmented-generation-enhanced Dense Video Caption for Human Indoor Activities: Disambiguating Caption Using Spatial Information Beyond Field of View Constraints

Bin Chen, Yugo Nakamura, Shogo Fukushima, and Yutaka Arakawa

(Received July 16, 2024; Accepted September 9, 2024)

Keywords: depth camera, 3D reconstruction, 3D detection, human activity recognition, dense video caption, retrieval-augmented generation, large language model
Dense video captioning aims to extract every target event and its corresponding time period, and has attracted significant attention owing to its potential applications in smart homes, human care, security monitoring, and other areas. However, current methods do not sufficiently reduce ambiguity in the generated captions and have a limited field of view, making it difficult to integrate the relationship between people and their surrounding environment into the captions. These limitations restrict their applicability in smart home and indoor security systems, which require clear distinctions between normal and abnormal human actions, because similar actions can have different interpretations depending on the surrounding environment. For instance, in indoor security systems, ambiguous captions might fail to distinguish between harmless activities such as a person walking and potentially concerning behaviors such as unauthorized access attempts. In this article, we propose a retrieval-augmented generation (RAG)-based system that enhances existing methods and makes them suitable for recording human activity indoors. Our key ideas are as follows. First, we collect information about the house environment using a 3D reconstruction and 3D detection process to build a knowledge base. Second, we design a RAG procedure to extract the relevant environmental context. Third, we develop a spatial information enhancement query based on human detection results from RGB and depth image pairs. We utilize the summarization and reasoning capabilities of a large language model to fuse all information, thereby obtaining spatially enhanced dense captions of human indoor activities.
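The three key ideas above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SceneObject` record, the distance-based retrieval rule, the `k` and `max_dist` parameters, the prompt wording, and the example coordinates are all assumptions made for illustration, and the person's position is assumed to be already expressed in the same house coordinate frame as the knowledge base.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres, house frame (assumed)


@dataclass
class SceneObject:
    """Hypothetical knowledge-base entry produced by 3D detection."""
    label: str
    room: str
    center: Point


def retrieve_context(kb: List[SceneObject], person_xyz: Point,
                     k: int = 3, max_dist: float = 2.0):
    """Return up to k objects within max_dist of the person, nearest first."""
    scored = [(math.dist(obj.center, person_xyz), obj) for obj in kb]
    nearby = [pair for pair in scored if pair[0] <= max_dist]
    nearby.sort(key=lambda pair: pair[0])
    return nearby[:k]


def build_prompt(raw_caption: str, retrieved) -> str:
    """Fuse the base caption and the retrieved context into an LLM prompt."""
    lines = [f"- {obj.label} in the {obj.room}, {dist:.1f} m from the person"
             for dist, obj in retrieved]
    context = "\n".join(lines) if lines else "- no known objects nearby"
    return ("Rewrite the caption so that it states where the activity happens.\n"
            f"Original caption: {raw_caption}\n"
            f"Nearby objects:\n{context}\n")


# Toy knowledge base; a real one would come from the 3D reconstruction step.
kb = [
    SceneObject("front door", "entrance", (0.0, 0.0, 1.0)),
    SceneObject("sofa", "living room", (4.0, 2.0, 0.4)),
    SceneObject("refrigerator", "kitchen", (7.5, 1.0, 0.9)),
]
person = (0.5, 0.3, 0.9)  # assumed output of RGB-D human detection
retrieved = retrieve_context(kb, person)
prompt = build_prompt("a person is walking", retrieved)
print(prompt)
```

Because the knowledge base covers the whole house, the retrieved objects need not be visible in the current frame, which is how such a design could add context beyond the camera's field of view. The resulting prompt would then be passed to a large language model of choice.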
We evaluate our method by comparing it with PDVC (end-to-end Dense Video Captioning with Parallel Decoding), GVL (Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos), and SG-PDVC (Scene Graph-enhanced PDVC) on a custom video dataset collected from two houses with eight different camera positions, using Recall, Precision, Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit ORdering (METEOR), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and Consensus-based Image Description Evaluation (CIDEr) as metrics. Our method outperforms the three compared methods in BLEU-3, METEOR, CIDEr, and Precision, but shows a small decline in ROUGE-L and Recall compared with GVL. These results demonstrate that our method effectively incorporates spatial information and reduces ambiguity in indoor human activity captioning applications.
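The abstract does not state how Precision and Recall are computed. A convention common in dense video captioning is to match predicted and ground-truth event segments by temporal intersection over union (tIoU); the sketch below assumes that convention, a single threshold of 0.5, and made-up segments in seconds.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def tiou(a: Segment, b: Segment) -> float:
    """Temporal intersection over union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(pred: List[Segment], gt: List[Segment],
                     threshold: float = 0.5) -> Tuple[float, float]:
    """Precision: share of predictions that overlap some ground-truth event.
    Recall: share of ground-truth events covered by some prediction."""
    hit_pred = sum(any(tiou(p, g) >= threshold for g in gt) for p in pred)
    hit_gt = sum(any(tiou(p, g) >= threshold for p in pred) for g in gt)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gt / len(gt) if gt else 0.0
    return precision, recall


# Illustrative segments only; these are not data from the paper.
gt_events = [(0.0, 10.0), (20.0, 30.0)]
pred_events = [(1.0, 9.0), (12.0, 18.0), (21.0, 33.0)]
precision, recall = precision_recall(pred_events, gt_events)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In practice such scores are often averaged over several tIoU thresholds, while BLEU, METEOR, ROUGE-L, and CIDEr are computed on the caption text of matched segments.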
Corresponding author: Bin Chen

This work is licensed under a Creative Commons Attribution 4.0 International License.

Cite this article: Bin Chen, Yugo Nakamura, Shogo Fukushima, and Yutaka Arakawa, Retrieval-augmented-generation-enhanced Dense Video Caption for Human Indoor Activities: Disambiguating Caption Using Spatial Information Beyond Field of View Constraints, Sens. Mater., Vol. 36, No. 10, 2024, p. 4549-4566.