Print: ISSN 0914-4935
Online: ISSN 2435-0869
Sensors and Materials is an international peer-reviewed open access journal that provides a forum for researchers working in multidisciplinary fields of sensing technology.
Sensors and Materials is covered by Science Citation Index Expanded (Clarivate Analytics), Scopus (Elsevier), and other databases.

Publisher
 MYU K.K.
 Sensors and Materials
 1-23-3-303 Sendagi,
 Bunkyo-ku, Tokyo 113-0022, Japan
 Tel: 81-3-3827-8549
 Fax: 81-3-3827-8547

Sensors and Materials, Volume 36, Number 10(3) (2024)
Copyright(C) MYU K.K.
pp. 4549-4566
S&M3817 Research Paper of Special Issue
https://doi.org/10.18494/SAM5238
Published: October 29, 2024

Retrieval-augmented-generation-enhanced Dense Video Caption for Human Indoor Activities: Disambiguating Caption Using Spatial Information Beyond Field of View Constraints

Bin Chen, Yugo Nakamura, Shogo Fukushima, and Yutaka Arakawa

(Received July 16, 2024; Accepted September 9, 2024)

Keywords: depth camera, 3D reconstruction, 3D detection, human activity recognition, dense video caption, retrieval-augmented generation, large language model

Dense video captioning aims to extract every goal event and its corresponding period, and has attracted significant attention owing to its potential applications in smart homes, human care, security monitoring, and more. However, current methods do not sufficiently reduce ambiguity in the generated captions and have a limited field of view, making it difficult to integrate the relationship between people and their surrounding environment into the captions. These limitations restrict their applicability in smart home or indoor security systems, which require clear distinctions between normal and abnormal human actions, as similar actions can have different interpretations depending on the surrounding environment. For instance, in indoor security systems, ambiguous captions might fail to distinguish between harmless activities such as a person walking and potentially concerning behaviors such as unauthorized access attempts. In this article, we propose a retrieval-augmented generation (RAG)-based system to enhance existing methods, making them suitable for recording human activity indoors. Our key ideas are as follows. First, we collect information about the house environment using a 3D reconstruction and 3D detection process to build a knowledge base. Second, we design a RAG procedure to extract the relevant environmental context. Third, we develop a spatial information enhancement query based on human detection results from RGB and depth image pairs. We then use the summarization and reasoning capabilities of a large language model to fuse all information, thereby obtaining spatially enhanced dense captions of human indoor activities.
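The three ideas above can be pictured with a small sketch. It is only an illustration and not the authors' implementation: the `SceneObject` record, the `retrieve_nearby` and `build_prompt` helpers, the 1.5 m radius, and the prompt wording are all hypothetical, and a plain Euclidean distance filter stands in for whatever retrieval procedure the paper actually uses.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class SceneObject:
    """One knowledge-base entry, assumed to come from 3D detection."""
    label: str
    center: tuple[float, float, float]  # metres, in a shared room frame


def retrieve_nearby(knowledge_base, person_xyz, radius=1.5, top_k=3):
    """Return up to top_k (distance, object) pairs within radius, nearest first."""
    scored = [(dist(obj.center, person_xyz), obj) for obj in knowledge_base]
    nearby = [pair for pair in scored if pair[0] <= radius]
    nearby.sort(key=lambda pair: pair[0])
    return nearby[:top_k]


def build_prompt(raw_caption, person_xyz, knowledge_base):
    """Combine an ambiguous caption with retrieved spatial context for an LLM."""
    hits = retrieve_nearby(knowledge_base, person_xyz)
    if hits:
        context = "; ".join(f"{obj.label} ({d:.1f} m away)" for d, obj in hits)
    else:
        context = "no known objects nearby"
    return (
        f"Caption: {raw_caption}\n"
        f"Nearby objects: {context}\n"
        "Rewrite the caption so that it states where the activity takes place."
    )


# Hypothetical knowledge base for one room.
kb = [
    SceneObject("front door", (0.0, 0.0, 0.0)),
    SceneObject("sofa", (3.0, 0.0, 0.0)),
    SceneObject("window", (0.5, 0.5, 1.0)),
]
prompt = build_prompt("a person is walking", (0.2, 0.0, 0.0), kb)
print(prompt)
```

In this sketch the person's position is assumed to be already expressed in the same coordinate frame as the knowledge base; how the RGB and depth detections are mapped into that frame is outside its scope.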
We evaluate our method by comparing it with PDVC (end-to-end Dense Video Captioning with Parallel Decoding), GVL (Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos), and SG-PDVC (Scene Graph-enhanced PDVC) on a custom video dataset collected from two houses with eight different camera positions, using Recall, Precision, Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit ORdering (METEOR), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and Consensus-based Image Description Evaluation (CIDEr) as metrics. Our method outperforms the three compared methods in BLEU-3, METEOR, CIDEr, and Precision metrics, but shows a small decline in ROUGE-L and Recall metrics compared with GVL. These results demonstrate that our method effectively incorporates spatial information and reduces ambiguity for indoor human activity caption applications.
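The abstract does not say how Recall and Precision are computed. A common convention in dense video captioning is to match predicted and ground-truth event periods by temporal intersection over union (tIoU); the sketch below assumes that convention, and the function names, the 0.5 threshold, and the example segments are illustrative only.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) periods in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predicted, ground_truth, threshold=0.5):
    """Precision: share of predictions that overlap some ground-truth period.
    Recall: share of ground-truth periods covered by some prediction."""
    matched_pred = sum(
        any(temporal_iou(p, g) >= threshold for g in ground_truth)
        for p in predicted
    )
    matched_gt = sum(
        any(temporal_iou(p, g) >= threshold for p in predicted)
        for g in ground_truth
    )
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gt / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Made-up event periods, not data from the paper.
predicted = [(0.0, 10.0), (20.0, 30.0), (50.0, 60.0)]
ground_truth = [(1.0, 9.0), (22.0, 35.0)]
precision, recall = precision_recall(predicted, ground_truth)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Under this convention, a method that emits fewer but better-localized events gains Precision while risking Recall, which is the kind of trade-off the reported comparison with GVL describes.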

Corresponding author: Bin Chen


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Cite this article
Bin Chen, Yugo Nakamura, Shogo Fukushima, and Yutaka Arakawa, Retrieval-augmented-generation-enhanced Dense Video Caption for Human Indoor Activities: Disambiguating Caption Using Spatial Information Beyond Field of View Constraints, Sens. Mater., Vol. 36, No. 10, 2024, p. 4549-4566.


