Intro
In recent years, Vision-Language Models (VLMs) have achieved remarkable success in connecting the domains of vision and language. Models such as CLIP and ALIGN, trained on vast datasets of image-text pairs, have enabled highly efficient cross-modal retrieval systems. However, these breakthroughs have primarily been limited to English and a few other widely spoken languages. Persian is among the underserved languages in this regard, with limited resources and tools available for advanced tasks such as image-text retrieval.
Figure: PTIR demo.
Overview of the Paper
While existing multilingual VLMs attempt to bridge the gap for non-English languages, their performance in Persian remains suboptimal due to the lack of high-quality datasets and models tailored for the language. Specifically, there are no comprehensive Persian text-image retrieval systems that can effectively handle detailed queries or retrieve relevant images in diverse real-world scenarios.
Furthermore, widely adopted systems like CLIP are challenging to fine-tune or adapt to such specialized domains due to their reliance on large-scale datasets and resources unavailable in Persian.
This paper proposes a pioneering approach to Persian Text-Image Retrieval (PTIR), marking a significant advancement in the field. Our contributions include:
A. Dataset
Our work introduces a groundbreaking dataset of 1.2 million Persian image-caption pairs, setting a new standard for Persian text-image retrieval with detailed, high-quality captions. Data collection involved aggregating diverse sources, generating captions using advanced Vision-Language Models, and refining them for cultural and linguistic accuracy. Unlike existing datasets, our captions are rich in detail, describing key visual elements like object counts, shapes, colors, environmental context, and unique attributes such as age groups and animal breeds, ensuring comprehensive and precise image descriptions.
B. Model Development
PTIR’s captioning model uses DINOv2-base as the vision encoder and GPT2-fa as a Persian-specific text decoder. A two-phase training strategy first fine-tunes only the text decoder and then fine-tunes the entire model, yielding a 30-40% improvement in captioning performance over existing Persian models.
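The paper does not include training code; the following is a minimal sketch of how such an encoder-decoder captioner could be assembled with Hugging Face Transformers, showing the two-phase freezing schedule. The checkpoint IDs `facebook/dinov2-base` and `HooshvareLab/gpt2-fa` are our assumptions for the DINOv2-base and GPT2-fa components named above.

```python
# Minimal sketch of a DINOv2 + GPT2-fa captioning model with two-phase fine-tuning.
# Checkpoint IDs are assumptions; the paper only names the model families.
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

ENCODER_ID = "facebook/dinov2-base"   # assumed DINOv2-base checkpoint
DECODER_ID = "HooshvareLab/gpt2-fa"   # assumed Persian GPT-2 checkpoint

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(ENCODER_ID, DECODER_ID)
image_processor = AutoImageProcessor.from_pretrained(ENCODER_ID)
tokenizer = AutoTokenizer.from_pretrained(DECODER_ID)

# GPT-2 has no pad token by default; reuse EOS for padding during training.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Phase 1: fine-tune only the Persian text decoder (vision encoder frozen).
for param in model.encoder.parameters():
    param.requires_grad = False
# ... run phase-1 training here ...

# Phase 2: unfreeze everything and fine-tune the full encoder-decoder model.
for param in model.parameters():
    param.requires_grad = True
# ... run phase-2 training here ...
```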
C. Retrieval Framework
Our retrieval pipeline integrates an image captioning model with a scalable vector database to enable efficient and accurate image-text retrieval. The process involves three key steps (a minimal sketch follows the list):
- Embedding Generation: Captions generated by the captioning model are transformed into dense embeddings using a sentence embedding model to capture semantic meaning.
- Vector Database: After evaluating several options, Milvus was selected for its scalability, speed, and integration ease, supporting large-scale similarity searches with dense embeddings.
- Query Processing: Text queries are embedded using the same model as captions, enabling consistent top-k retrieval of similar images via Milvus.
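Below is a minimal end-to-end sketch of these three steps using sentence-transformers and the Milvus client. The embedding checkpoint, collection name, example Persian query, and the `load_generated_captions` helper are illustrative assumptions; the paper specifies the components (a sentence-embedding model and Milvus) rather than these particular identifiers.

```python
# Sketch of the retrieval pipeline: caption embeddings stored in Milvus,
# text queries embedded with the same model for top-k similarity search.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model
client = MilvusClient("ptir_demo.db")  # local Milvus Lite file; a server URI also works

client.create_collection(
    collection_name="ptir_captions",
    dimension=embedder.get_sentence_embedding_dimension(),
)

# 1) Embedding Generation: turn generated Persian captions into dense vectors.
captions = load_generated_captions()  # hypothetical helper -> [(int image_id, str caption), ...]
vectors = embedder.encode([text for _, text in captions], normalize_embeddings=True)

# 2) Vector Database: store one record per image, keyed by its integer id.
client.insert(
    collection_name="ptir_captions",
    data=[
        {"id": image_id, "vector": vec.tolist()}
        for (image_id, _), vec in zip(captions, vectors)
    ],
)

# 3) Query Processing: embed the query with the same model and retrieve the top-k images.
query = "یک گربه سفید روی مبل"  # example query: "a white cat on a sofa"
query_vec = embedder.encode([query], normalize_embeddings=True)[0].tolist()
hits = client.search(collection_name="ptir_captions", data=[query_vec], limit=10)
```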
This modular framework is highly adaptable for domain-specific applications, such as medical imaging or cultural heritage, where specialized datasets can improve performance beyond general-purpose models like CLIP.
D. Retrieval Evaluation
We evaluate PTIR using Hit@K and find that it outperforms Persian baselines and CLIP-based models, achieving 22% Hit@1 and 80% Hit@200 and demonstrating strong retrieval performance.
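For reference, Hit@K is the fraction of queries whose ground-truth image appears among the top-K retrieved results. A minimal sketch is shown below; the dictionary-based result format is an assumption for illustration.

```python
# Hit@K: fraction of queries whose ground-truth image is in the top-K results.
from typing import Dict, Hashable, List

def hit_at_k(
    retrieved: Dict[Hashable, List[Hashable]],   # query -> ranked list of image ids
    ground_truth: Dict[Hashable, Hashable],      # query -> relevant image id
    k: int,
) -> float:
    hits = sum(
        1
        for query, relevant in ground_truth.items()
        if relevant in retrieved.get(query, [])[:k]
    )
    return hits / len(ground_truth)

# Example usage over assumed result structures:
# print(hit_at_k(results, labels, k=1), hit_at_k(results, labels, k=200))
```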
E. Computational Efficiency and Scalability
PTIR is optimized for efficiency, with a retrieval latency of 3 ms and fast embedding storage. It is more scalable and resource-efficient than English-centric models such as CLIP and SigLIP, making it practical for real-world Persian applications.
F. Conclusion and Future Work
Our work contributes a novel Persian Text-Image Retrieval framework that advances the state of the art for this underrepresented language. We demonstrate the system’s effectiveness in real-world scenarios and envision future improvements, including broader dataset expansions and optimizations for real-time applications.