Multimodal Model
An AI model that can process and generate multiple types of data (text, images, audio, video) within a single unified architecture.
How They Work
Multimodal models encode different data types into a shared embedding space using modality-specific encoders (e.g., a vision encoder for images, an audio encoder for speech). A unified transformer then attends jointly across all modalities in that shared space, so an image patch can influence how a word is interpreted and vice versa.
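A minimal sketch of that idea in pure Python, with toy stand-ins for the real networks: a text "encoder" is an embedding lookup, a vision "encoder" is a linear projection of flattened patches, and both land in the same 4-dimensional shared space before a single softmax-attention pass mixes them. All names, sizes, and weights here are illustrative assumptions, not any particular model's architecture.

```python
import math
import random

random.seed(0)
D = 4  # dimensionality of the shared embedding space (toy value)

def encode_text(token_ids, table):
    # Stand-in text encoder: embedding-table lookup into the shared space.
    return [table[t] for t in token_ids]

def encode_image(patches, weights):
    # Stand-in vision encoder: linear projection of a length-3 patch
    # vector into the same D-dimensional shared space.
    return [[sum(p[i] * weights[i][j] for i in range(3)) for j in range(D)]
            for p in patches]

def attention(tokens):
    # One softmax-attention pass over the combined sequence: every token,
    # text or image, attends to every other token.
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(D)
                  for k in tokens]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum((wi / z) * k[j] for wi, k in zip(w, tokens))
                    for j in range(D)])
    return out

# Toy random parameters for the two encoders.
table = {t: [random.uniform(-1, 1) for _ in range(D)] for t in range(10)}
proj = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(3)]

text_emb = encode_text([1, 4, 7], table)               # 3 text tokens
img_emb = encode_image([[0.2, 0.5, 0.1]] * 2, proj)    # 2 image patches
fused = attention(img_emb + text_emb)                  # one mixed sequence
print(len(fused), len(fused[0]))  # 5 4
```

The key point the sketch shows is that after encoding, the transformer sees one undifferentiated token sequence; modality only matters at the encoder boundary.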
Examples
GPT-4V/o: text + images.
Gemini: text + images + audio + video.
Claude: text + images + PDFs.
LLaVA: open-source vision-language model.
Applications
Visual question answering, document understanding (charts, tables, diagrams), video analysis, multimodal search, and accessibility tools that describe images for visually impaired users.
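Visual question answering, for instance, typically pairs an image with a text prompt in a single request. A sketch of what such a request payload can look like, modeled loosely on the "content parts" message shape used by OpenAI-style chat APIs; the field names and model name are assumptions for illustration, and a real call would need a client library and credentials.

```python
import json

def build_vqa_request(question, image_url, model="gpt-4o"):
    # Hypothetical payload builder: one user message carrying both a
    # text part (the question) and an image part (by URL).
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

req = build_vqa_request("What trend does this chart show?",
                        "https://example.com/chart.png")
print(json.dumps(req, indent=2))
```

The same text-plus-image message structure underlies document understanding and accessibility use cases; only the prompt changes.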