# Llama 3.2 Vision 11B / 90B: Cross-Attention Adapter + Multi-Image FT

> Source: https://sukruyusufkaya.com/en/learn/fine-tuning-cookbook/ftc-llama-3.2-vision-cross-attention
> Updated: 2026-05-14T14:42:53.991Z
> Category: Fine-Tuning Cookbook (Model-by-Model)
> Module: Part VI — Vision-Language Multimodal FT

**TLDR:** Llama 3.2 Vision uses Meta's cross-attention adapter approach (unlike LLaVA's MLP projector): a ViT-H/14 vision encoder feeds the LLM through **interleaved cross-attention layers**. Supports multi-image fine-tuning with an interleaved image+text format. 11B QLoRA is marginal on an RTX 4090 (~22 GB VRAM); 90B is cloud-only. A format sketch and a QLoRA loading sketch follow below.
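
A minimal sketch of the interleaved image+text chat format, using the Hugging Face transformers processor for Llama 3.2 Vision. The file names and question text are hypothetical placeholders, not from the article; the key point is that each `{"type": "image"}` entry marks where an image slots into the text stream.

```python
# Sketch: multi-image interleaved prompt for Llama 3.2 Vision.
from PIL import Image
from transformers import AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_id)

# Two images interleaved with text in a single user turn; each
# {"type": "image"} placeholder is matched positionally against
# the `images` list passed to the processor below.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this chart."},
            {"type": "image"},
            {"type": "text", "text": "Now compare it with this second chart."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

images = [Image.open("chart_a.png"), Image.open("chart_b.png")]  # hypothetical files
inputs = processor(images=images, text=prompt, return_tensors="pt")
```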


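And a hedged sketch of what "11B QLoRA marginal on an RTX 4090" looks like in practice: 4-bit NF4 loading via bitsandbytes plus LoRA adapters via peft. The rank, alpha, and target modules here are illustrative choices under common QLoRA defaults, not the article's exact settings.

```python
# Sketch: 4-bit QLoRA setup for the 11B model on a ~24 GB card.
import torch
from transformers import MllamaForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Illustrative LoRA config: targeting the attention projection names also
# reaches the inserted cross-attention layers, which use the same names
# (q_proj/k_proj/v_proj/o_proj) in the transformers implementation.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```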