Skip to content
Technical GlossaryComputer Vision

Image-Text Contrastive Learning

An approach that learns multimodal representations by bringing related image-text pairs together and pushing unrelated pairs apart in a shared space.

Image-text contrastive learning is one of the most effective representation learning strategies in modern vision-language systems. It allows the model to connect images and natural language descriptions within a shared semantic space. Zero-shot classification, semantic visual search, and multimodal retrieval systems are built on this foundation. It is a successful example of learning strong general representations from large-scale weakly labeled data.