Sensors and Materials

Print: ISSN 0914-4935
Online: ISSN 2435-0869
Sensors and Materials is an international peer-reviewed open access journal that provides a forum for researchers working in multidisciplinary fields of sensing technology.

Sensors and Materials is covered by Science Citation Index Expanded (Clarivate Analytics), Scopus (Elsevier), and other databases.

Publisher
MYU K.K.
Sensors and Materials
1-23-3-303 Sendagi,
Bunkyo-ku, Tokyo 113-0022, Japan
Tel: 81-3-3827-8549
Fax: 81-3-3827-8547

Sensors and Materials, Volume 36, Number 10(3) (2024)
Copyright(C) MYU K.K.

pp. 4549-4566
S&M3817 Research Paper of Special Issue
https://doi.org/10.18494/SAM5238
Published: October 29, 2024

Retrieval-augmented-generation-enhanced Dense Video Caption for Human Indoor Activities: Disambiguating Caption Using Spatial Information Beyond Field of View Constraints

Bin Chen, Yugo Nakamura, Shogo Fukushima, and Yutaka Arakawa

(Received July 16, 2024; Accepted September 9, 2024)

Keywords: depth camera, 3D reconstruction, 3D detection, human activity recognition, dense video caption, retrieval-augmented generation, large language model

Dense video captioning aims to extract every target event and its corresponding time period, and has attracted significant attention owing to its potential applications in smart homes, human care, security monitoring, and more. However, current methods do not sufficiently reduce ambiguity in the generated captions and have a limited field of view, making it difficult to integrate the relationship between people and their surrounding environment into the captions. These limitations restrict their applicability in smart home and indoor security systems, which require clear distinctions between normal and abnormal human actions because similar actions can have different interpretations depending on the surrounding environment. For instance, in indoor security systems, ambiguous captions might fail to distinguish between harmless activities such as a person walking and potentially concerning behaviors such as unauthorized access attempts. In this article, we propose a retrieval-augmented generation (RAG)-based system to enhance existing methods, making them suitable for recording human activity indoors. Our key ideas are as follows. First, we collect information about the house environment using a 3D reconstruction and 3D detection process to build a knowledge base. Second, we design a RAG procedure to extract the relevant environmental context. Third, we develop a spatial information enhancement query based on human detection results from RGB and depth image pairs. We utilize the summarization and reasoning capabilities of a large language model to fuse all information, thereby obtaining spatially enhanced dense captions of human indoor activities.
We evaluate our method by comparing it with PDVC (end-to-end Dense Video Captioning with Parallel Decoding), GVL (Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos), and SG-PDVC (Scene Graph-enhanced PDVC) on a custom video dataset collected from two houses with eight different camera positions, using Recall, Precision, Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit ORdering (METEOR), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and Consensus-based Image Description Evaluation (CIDEr) as metrics. Our method outperforms the three compared methods in BLEU-3, METEOR, CIDEr, and Precision metrics, but shows a small decline in ROUGE-L and Recall metrics compared with GVL. These results demonstrate that our method effectively incorporates spatial information and reduces ambiguity for indoor human activity caption applications.
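
The three key ideas in the abstract (an environment knowledge base, retrieval of relevant context, and a spatially enhanced query) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the retrieval mechanism, so the sketch assumes a simple nearest-object lookup by Euclidean distance, and every name in it (SceneObject, retrieve_context, build_query, the radius and top_k parameters, the example coordinates) is hypothetical. The call to a large language model is omitted; the sketch only builds the text that would be sent to one.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """One knowledge-base entry, e.g. produced by 3D detection (hypothetical schema)."""
    label: str
    room: str
    position: tuple  # (x, y, z) in metres, in an assumed house coordinate frame


def retrieve_context(knowledge_base, person_position, radius=1.5, top_k=3):
    """Return up to top_k (distance, object) pairs within radius of the person.

    Assumption: relevance is approximated by Euclidean distance; the paper's
    actual RAG procedure may differ.
    """
    scored = []
    for obj in knowledge_base:
        distance = math.dist(obj.position, person_position)
        if distance <= radius:
            scored.append((distance, obj))
    # Sort on distance only, so ties never compare SceneObject instances.
    scored.sort(key=lambda pair: pair[0])
    return scored[:top_k]


def build_query(raw_caption, retrieved):
    """Combine a baseline caption with retrieved context into a text query."""
    lines = [f"Caption from the video model: {raw_caption}"]
    if retrieved:
        lines.append("Objects near the person (possibly outside the camera view):")
        for distance, obj in retrieved:
            lines.append(f"- {obj.label} in the {obj.room}, {distance:.2f} m away")
    else:
        lines.append("No known objects are near the person.")
    lines.append("Rewrite the caption so that it states where the activity happens.")
    return "\n".join(lines)


# Example data: a tiny hand-written knowledge base (illustrative values only).
knowledge_base = [
    SceneObject("front door", "entrance", (0.0, 0.0, 0.0)),
    SceneObject("sofa", "living room", (4.0, 1.0, 0.0)),
    SceneObject("shoe rack", "entrance", (0.5, 1.0, 0.0)),
]
# Assumed 3D position of the person, e.g. estimated from an RGB and depth image pair.
person_position = (0.2, 0.3, 0.0)

query = build_query("a person is walking",
                    retrieve_context(knowledge_base, person_position))
print(query)
```

In this sketch, the ambiguous caption "a person is walking" is paired with the fact that the person is close to the front door, which is the kind of environmental context a language model could use to distinguish ordinary movement from, for example, an access attempt.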

Corresponding author: Bin Chen

This work is licensed under a Creative Commons Attribution 4.0 International License.

Cite this article
Bin Chen, Yugo Nakamura, Shogo Fukushima, and Yutaka Arakawa, Retrieval-augmented-generation-enhanced Dense Video Caption for Human Indoor Activities: Disambiguating Caption Using Spatial Information Beyond Field of View Constraints, Sens. Mater., Vol. 36, No. 10, 2024, p. 4549-4566.