AI Glossary

Multimodal Model

An AI model that can process and generate multiple types of data (text, images, audio, video) within a single architecture.

How They Work

Multimodal models encode different data types into a shared embedding space using modality-specific encoders (vision encoder for images, audio encoder for speech). A unified transformer then reasons across all modalities.
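The encoder-to-shared-space idea above can be sketched in a few lines. This is a toy illustration, not any real model's architecture: the "encoders" here are random projections standing in for a trained text tokenizer-embedder and a trained vision encoder, and the dimension D is an arbitrary choice. The key point is only that both modalities end up as vectors of the same width, so one sequence model can attend over all of them.

```python
import numpy as np

D = 8  # shared embedding dimension (hypothetical, tiny for illustration)

def text_encoder(token_ids):
    # Toy stand-in for a text embedder: deterministic random vector per token id.
    return np.stack([np.random.default_rng(t).standard_normal(D) for t in token_ids])

def vision_encoder(image):
    # Toy stand-in for a vision encoder: split the image into 4 patches,
    # then linearly project each patch into the shared D-dim space.
    patches = image.reshape(4, -1)
    proj = np.random.default_rng(42).standard_normal((patches.shape[1], D))
    return patches @ proj

text_emb = text_encoder([101, 2009, 102])        # shape (3, D)
img_emb = vision_encoder(np.ones((4, 4)))        # shape (4, D)

# Both modalities now live in the same space, so they can be concatenated
# into one token sequence for a single transformer to reason over.
sequence = np.concatenate([text_emb, img_emb])   # shape (7, D)
print(sequence.shape)
```

In a real system the projections are learned, the image is split into many more patches, and the sequence is interleaved with positional and modality markers, but the structural pattern (per-modality encoder, shared space, unified sequence) is the same.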

Examples

GPT-4V / GPT-4o: text + images.
Gemini: text + images + audio + video.
Claude: text + images + PDFs.
LLaVA: open-source vision-language model.

Applications

Visual question answering, document understanding (charts, tables, diagrams), video analysis, multimodal search, and accessibility tools that describe images for visually impaired users.


Last updated: March 5, 2026