COCO Caption Classification

Results dashboard for caption matching with CLIP

Zero-shot and few-shot performance on a curated COCO subset with 12-caption candidate sets. The model selects the single correct caption against 11 distractors.

Best zero-shot accuracy: 90.17% (RN50) Best few-shot accuracy: 90.00% (ViT-B/32, 8-shot) Eval set: 11,640 samples Trainable params: 65,536 to 262,144
Best Zero-shot F1 90.16% RN50 backbone
Few-shot F1 90.00% ViT-B/32 with residual adapters

Release artifacts

  • models/best_rn50_8shot.pth
  • models/best_vit_b32_8shot.pth
  • docs/plot/*.json Plotly curves

Dataset processing

Curated COCO subset for caption matching with strict train and evaluation separation.

Model configuration

Reference settings used for training and evaluation discussions.

Dataset

  • Subset dir: coco_subset_images
  • Images: coco_subset_images/images
  • Metadata: coco_multimodal_subset.json
  • Distractors: 11 (N=12 total captions)

Training

  • Epochs: 25
  • Batch size: 32
  • Learning rate: 3e-4
  • Weight decay: 0.01
  • Label smoothing: 0.2
  • Dropout: 0.6
  • Early stop: patience 2, min delta 0.002
  • k-shots: 8 (current demo)

Models

  • Backbones: ViT-B/32, RN50
  • Adapters: residual dual (image + text)
  • Trainable params: 65,536 to 262,144
  • Checkpoints: models/best_*.pth

Tracking

  • Device: cuda
  • WandB project: coco_caption_classification
  • WandB entity: thailearncoding-ho-chi-minh-university-of-technology
  • Eval set: 11,640 samples

Exploratory data analysis (EDA)

Category distribution and sample pairs from the curated subset.

Data distribution

Data Distribution Bar Chart

Samples with captions

Sample Image Grid

Model architecture

Frozen CLIP backbones with residual adapters for few-shot alignment.

Architecture of CLIP model Residual Dual for few-shot using CLIP

Training phase (loss curves)

Training curves per model (loss, train accuracy, eval accuracy, and F1 scores).

Evaluation

Zero-shot and 8-shot results across backbones.

Zero-shot accuracy 90.17%

RN50 backbone on 11,640 samples.

Few-shot accuracy 90.00%

ViT-B/32 with 8-shot adapters.

Inference speed 12.26 ms

Avg per sample (ViT-B/32, 8-shot).

Trainable params 65k to 262k

Lightweight residual adapters only.

Model comparison

Model Shots Mode Accuracy F1 Macro F1 Micro F1 Weighted Time (s) Time / Sample (ms)
RN50 0 Zero-shot 0.901718 0.901566 0.901718 0.901697 153.556601 13.192148
RN50 8 Few-shot 0.891924 0.891877 0.891924 0.891938 152.817952 13.128690
ViT-B/32 0 Zero-shot 0.899656 0.899521 0.899656 0.899657 143.099318 12.293756
ViT-B/32 8 Few-shot 0.900000 0.900034 0.900000 0.900013 142.681351 12.257848

Evaluation curves by model

Each plot groups metrics for a single backbone across k-shot values.

Attention maps

GradCAM and attention rollouts show where the model focuses.

GradCAM interpretations

Hard cases

Misclassified or ambiguous samples for error analysis.

Hard case 1 Hard case 2 Hard case 3 Hard case 4