Visual Question Answering
A GenAI agent that answers natural-language questions about images using a multimodal architecture.
The Problem
Users need to ask questions about images (for accessibility, education, or customer service), but many AI systems handle only text. Bridging vision and language requires specialized architectures.
The Solution
A multimodal agent that combines BLIP-2 for image understanding with LLMs for language reasoning. It includes retrieval-augmented generation (RAG) components for contextual enrichment and prompt chains for multi-step reasoning.
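The flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `VQAAgent`, `describe_image`, `retrieve`, `generate`, and `make_keyword_retriever` are hypothetical names, and the models are passed in as plain callables so the sketch runs without any model weights. In a real setup, `describe_image` would presumably wrap a BLIP-2 model and `generate` would wrap an LLM client.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces (assumptions, not the project's real API):
Captioner = Callable[[bytes], str]           # image bytes -> text description (e.g. a BLIP-2 wrapper)
Retriever = Callable[[str, int], List[str]]  # (query, k) -> up to k context passages
LLM = Callable[[str], str]                   # prompt -> completion


def make_keyword_retriever(corpus: List[str]) -> Retriever:
    """Toy stand-in for a RAG retriever: ranks passages by word overlap with the query."""
    def retrieve(query: str, k: int) -> List[str]:
        words = set(query.lower().split())
        scored = [(len(words & set(doc.lower().split())), doc) for doc in corpus]
        # Keep only passages that share at least one word, best match first.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [doc for _, doc in ranked[:k]]
    return retrieve


@dataclass
class VQAAgent:
    describe_image: Captioner
    retrieve: Retriever
    generate: LLM
    top_k: int = 2

    def build_prompt(self, caption: str, passages: List[str], question: str) -> str:
        """Combine the image description, retrieved context, and question into one prompt."""
        context = "\n".join(f"- {p}" for p in passages) or "- (none)"
        return (
            f"Image description: {caption}\n"
            f"Background context:\n{context}\n"
            f"Question: {question}\n"
            "Answer concisely:"
        )

    def answer(self, image: bytes, question: str) -> str:
        # Step 1: vision model turns the image into text.
        caption = self.describe_image(image)
        # Step 2: retrieval adds background context (the RAG component).
        passages = self.retrieve(f"{caption} {question}", self.top_k)
        # Step 3: first link of the prompt chain drafts an answer.
        draft = self.generate(self.build_prompt(caption, passages, question))
        # Step 4: second link asks the LLM to check the draft against the description.
        review_prompt = (
            f"Image description: {caption}\n"
            f"Question: {question}\n"
            f"Draft answer: {draft}\n"
            "Rewrite the draft so it only states what the description supports:"
        )
        return self.generate(review_prompt)
```

To try it, construct `VQAAgent` with any three callables that match the signatures above, then call `agent.answer(image_bytes, "What colour is the bus?")`. Keeping the models behind plain callables is a design choice for this sketch: it lets the captioner, retriever, or LLM be swapped or stubbed in tests without touching the chain logic.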
System Architecture
Results