Privacy-preserving multimodal reasoning for internet of things: A retrieval-augmented large language and vision assistant framework

Author Identifier (ORCID)

Wei Ni: https://orcid.org/0000-0002-4933-594X

Abstract

The Internet of Things (IoT) generates massive multimodal data that can power sophisticated analytics with Multimodal Large Language Models (MLLMs), enabling new paradigms in which users interact with IoT systems through natural-language prompts and query IoT applications for complex insights. However, these models pose significant privacy risks when sensitive data, such as personal identifiers or proprietary information, is shared at scale for model training or finetuning. In this paper, we propose a novel privacy-enhanced Visual Question Answering (VQA) framework tailored to IoT scenarios. Unlike prior VQA pipelines that omit explicit privacy controls or rely on coarse box masking, our framework applies semantic segmentation and labeling to isolate sensitive content, which is removed or obfuscated via diffusion inpainting before model processing. We then leverage parameter-efficient Low-Rank Adaptation (LoRA) to finetune a language–vision assistant and incorporate Ghost Clipping to reduce gradient-based data leakage. For questions that require context beyond the processed IoT images, we introduce Retrieval-Augmented Generation (RAG) to draw on external knowledge sources and deliver insights. Experimental results on a customized IoT-oriented subset of the OK-VQA dataset, constructed via CLIP similarity scores against an IoT phrase bank and a light COCO super-category filter, show that our framework preserves privacy while achieving competitive accuracy, even outperforming standard non-private baselines. On OK-VQA, which contains 14,055 images and QA pairs, our framework attains 76.1% accuracy without privacy constraints and 49.3%–75.2% for privacy budgets ranging from 0.5 to 5, surpassing a non-private LLaVA-7B baseline (73.7%). By protecting privacy at both the data preprocessing and finetuning stages, this work offers a scalable solution that meets the stringent privacy demands of real-world IoT environments.

Document Type

Journal Article

Date of Publication

1-1-2026

Publication Title

IEEE Internet of Things Magazine

Publisher

IEEE

School

School of Engineering

Comments

Ni, T., Yuan, X., Li, S., & Ni, W. (2026). Privacy-preserving multimodal reasoning for Internet of Things: A retrieval-augmented large language and vision assistant framework. IEEE Internet of Things Magazine. Advance online publication. https://doi.org/10.1109/MIOT.2026.3650753

Copyright

subscription content


Link to publisher version (DOI)

10.1109/MIOT.2026.3650753