Privacy-preserving multimodal reasoning for internet of things: A retrieval-augmented large language and vision assistant framework

Author Identifier (ORCID)

Wei Ni: https://orcid.org/0000-0002-4933-594X

Abstract

The Internet of Things (IoT) generates massive multimodal data that can power sophisticated analytics with Multimodal Large Language Models (MLLMs), enabling new paradigms in which users interact with IoT systems through natural-language prompts and query IoT applications for complex insights. However, these models pose significant privacy risks when sensitive data, such as personal identifiers or proprietary information, is shared at scale for model training or finetuning. In this paper, we propose a novel privacy-enhanced Visual Question Answering (VQA) framework tailored to IoT scenarios. Unlike prior VQA pipelines that omit explicit privacy controls or rely on coarse box masking, our framework applies semantic segmentation and labeling to isolate sensitive content, which is removed or obfuscated via diffusion inpainting before model processing. We then leverage parameter-efficient Low-Rank Adaptation (LoRA) to finetune a language–vision assistant and incorporate Ghost Clipping to reduce gradient-based data leakage. For questions that require context beyond the processed IoT images, we introduce Retrieval-Augmented Generation (RAG) to draw on external knowledge sources and deliver insights. Experimental results on a customized IoT-oriented subset of the OK-VQA dataset, constructed via CLIP similarity scores against an IoT phrase bank and a light COCO super-category filter, show that our framework preserves privacy while achieving competitive accuracy, even outperforming standard non-private baselines. On OK-VQA, which contains 14,055 images and QA pairs, our framework attains 76.1% accuracy without privacy constraints and 49.3%–75.2% for privacy budgets ranging from 0.5 to 5, surpassing a non-private LLaVA-7B baseline (73.7%). By protecting privacy at both the data preprocessing and finetuning stages, this work offers a scalable solution that meets the stringent privacy demands of real-world IoT environments.

Document Type

Journal Article

Date of Publication

1-1-2026

Publication Title

IEEE Internet of Things Magazine

Publisher

IEEE

School

School of Engineering

Comments

Ni, T., Yuan, X., Li, S., & Ni, W. (2026). Privacy-preserving multimodal reasoning for Internet of Things: A retrieval-augmented large language and vision assistant framework. IEEE Internet of Things Magazine. Advance online publication. https://doi.org/10.1109/MIOT.2026.3650753

Copyright

subscription content


Link to publisher version (DOI)

10.1109/MIOT.2026.3650753