• Built a fine-grained image retrieval / metric learning system for a car accessories dataset (413 classes), learning
512-D embeddings for visual similarity search and ranking.
• Uses ConvNeXtV2 (tiny/base) backbones with multi-scale feature extraction and a learned embedding head to
output compact retrieval vectors.
• Applies Pairwise Cross-Attention (PWCA) for distractor-aware features and GeM pooling to strengthen salient
regions before embedding projection.
• Trains embeddings with Proxy Anchor loss using learnable class proxies and cosine similarity.
• Uses a 3-stage curriculum (224 → 320 → 384 px) with differential LRs, gradient accumulation, and increasing
augmentation strength to stabilize fine-tuning.
• Evaluates retrieval quality with Recall@1/5/10 and mAP, computed from cosine similarity matrices of the embeddings.
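
A minimal sketch of the Recall@K part of the evaluation, written in pure Python for illustration rather than in the project's actual framework. The function names (`l2_normalize`, `cosine_similarity_matrix`, `recall_at_k`) and the toy 2-D embeddings are hypothetical, and excluding each query from its own ranking is an assumption about the protocol; mAP is not covered here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity: dot products of L2-normalized vectors."""
    unit = [l2_normalize(v) for v in embeddings]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def recall_at_k(embeddings, labels, ks=(1, 5, 10)):
    """Fraction of queries with at least one same-class item in the top k.

    Assumption: every item is used as a query against all other items,
    and the query itself is excluded from its ranking.
    """
    sim = cosine_similarity_matrix(embeddings)
    n = len(labels)
    hits = {k: 0 for k in ks}
    for i in range(n):
        # Rank all other items by descending similarity to query i.
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sim[i][j],
            reverse=True,
        )
        for k in ks:
            if any(labels[j] == labels[i] for j in ranked[:k]):
                hits[k] += 1
    return {k: hits[k] / n for k in ks}


# Toy data (hypothetical): five 2-D embeddings from two classes.
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
    [0.6, 0.8],  # class "a" item that lies closer to the class "b" cluster
]
toy_labels = ["a", "a", "b", "b", "a"]

toy_recall = recall_at_k(toy_embeddings, toy_labels, ks=(1, 2, 3))
print(toy_recall)
```

In the toy data the last item is a hard case: its two nearest neighbours belong to the other class, so it counts as a miss at K=1 and K=2 and as a hit only at K=3.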