Recent Research in Computer Science and Engineering

Mürsel Ozan İncetaş; Mert Yağcıoğlu; İrem Şenyer Yapıcı; Nazan Kemaloğlu Alagöz; Rukiye Uzun Arslan; Mürsel Ozan İncetaş; Murat Meriçelli; İrem Cakcak; Burhan Duman; Utku Köse; Emine Betül Sürücü; Furkan Atlan; Mahmut Kılıçaslan

doi:10.58830/ozgur.pub1230

Visual Language Models in Multimodal AI: Architectural Foundations and Industry Applications
Chapter from the book: İncetaş, M. O. (ed.) 2026. Recent Research in Computer Science and Engineering.

İrem Cakcak

Isparta University of Applied Sciences

Burhan Duman

Isparta University of Applied Sciences

Synopsis

Recent advances in artificial intelligence have enabled Vision-Language Models (VLMs), which integrate textual and visual data into a shared representation space, to be used effectively across many disciplines. Moving beyond traditional single-modal systems, these models support contextual interpretation, multimodal reasoning, and task-oriented generation. This study examines the fundamental architectural components of Large Language Models (LLMs) and VLM-based approaches, while its application-oriented part focuses predominantly on the sectoral integration of vision-language models. First, it analyzes the Transformer architecture, the foundation of both model families, through mechanisms such as self-attention, tokenization, and positional encoding, in order to explain the operating principles of LLMs. The focus then shifts entirely to multimodal frameworks, detailing the application areas of VLMs and their action-augmented Vision-Language-Action (VLA) counterparts. Current approaches are evaluated in domains including end-to-end planning in autonomous driving, spatial grounding in robotics, structured information extraction in healthcare, remote sensing and domain-specific diagnostics in agriculture, and human-centric visual analysis. The findings indicate that VLM-based systems are evolving from passive information processors into decision-support components that provide high-level semantic guidance. The study also discusses bottlenecks encountered in real-world deployment, such as computational cost, real-time constraints, and hallucination risk, and concludes that future work is likely to pivot towards parameter-efficient, domain-specific hybrid architectures.
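As a rough illustration of the three Transformer mechanisms named in the synopsis (tokenization, positional encoding, and self-attention), the following is a minimal sketch in plain Python. It is not taken from the chapter: the whitespace tokenizer, the toy embedding table, and the use of unprojected inputs as queries, keys, and values are simplifying assumptions made here for brevity, and real LLMs use learned subword tokenizers, learned embeddings, and learned projection matrices.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sine on even dimensions, cosine on odd ones."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    Assumption: learned projection matrices are omitted to keep the sketch short.
    Returns the attended vectors and the attention weights per query position.
    """
    d = len(x[0])
    outputs, all_weights = [], []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        all_weights.append(w)
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return outputs, all_weights

# Toy tokenization (assumption: whitespace split stands in for a subword tokenizer).
tokens = "a cat sits".split()
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# Hypothetical deterministic embedding table, used only for illustration.
d_model = 4
embed = {idx: [math.sin(idx + 1 + c) for c in range(d_model)] for idx in vocab.values()}

# Add position information to each token embedding, then apply self-attention.
pe = positional_encoding(len(tokens), d_model)
x = [[e + p for e, p in zip(embed[vocab[t]], pe[i])] for i, t in enumerate(tokens)]
out, weights = self_attention(x)
```

Each row of `weights` is a probability distribution over the input positions, which is the sense in which every token "attends" to every other token in the sequence.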

Keywords:

Computer Science, Computer Engineering, Artificial Intelligence, Data Science, Image Processing

How to cite this book

Cakcak, İ. & Duman, B. (2026). Visual Language Models in Multimodal AI: Architectural Foundations and Industry Applications. In: İncetaş, M. O. (ed.), Recent Research in Computer Science and Engineering. Özgür Publications. DOI: https://doi.org/10.58830/ozgur.pub1230.c4967

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Published

March 18, 2026

DOI

https://doi.org/10.58830/ozgur.pub1230.c4967