We evaluate BabyLLaVA-V2 alongside several popular closed-source and open-source models on DevCV Toolbox. We also include human performance, collected from several college students, as an upper bound; a procedure for collecting human infants' responses is currently under IRB review. The main experimental results are shown below.
Performance comparison of different models on DevCV Toolbox. Different background colors denote different model families. We report accuracy (%) for all tasks; the higher, the better.
DevCV Toolbox can Differentiate Model Capability
Crucially, DevCV Toolbox exhibits a clear performance hierarchy:
Human adults provide a strong upper bound (93.0%);
large proprietary models perform best among AI systems;
compact open-source models occupy a middle range;
and random guessing establishes a non-trivial lower bound.
This confirms that DevCV Toolbox is challenging yet solvable, and that it can meaningfully differentiate models by cognitive capability rather than scale alone.
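As a side note, the random-guessing lower bound for a multiple-choice benchmark follows from simple arithmetic: if a task has k answer options, chance accuracy is 1/k, and the benchmark-level baseline is the macro-average over tasks. A minimal sketch (the task names and option counts below are hypothetical placeholders, not the actual DevCV Toolbox configuration):

```python
# Sketch: expected random-guessing accuracy for a multiple-choice benchmark.
# Task names and option counts are hypothetical, for illustration only.
option_counts = {
    "task_a": 2,  # e.g., a binary yes/no task
    "task_b": 4,  # e.g., a four-way multiple-choice task
    "task_c": 4,
}

def expected_random_accuracy(counts: dict) -> float:
    """Macro-average of per-task chance accuracy (1/k for k options)."""
    return sum(1.0 / k for k in counts.values()) / len(counts)

print(f"{expected_random_accuracy(option_counts):.1%}")  # → 33.3%
```

With uneven option counts across tasks, this macro-average differs from 1/k of any single task, which is why the reported baseline is a non-round number.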
Strong In-Domain Performance of BabyLLaVA-V2
Across all tasks, BabyLLaVA-V2 achieves 55.2% average accuracy, substantially outperforming random guessing (31.8%) and matching or exceeding several open-source models of comparable scale. While proprietary models (GPT-5, Gemini-2.5-Pro) remain strongest overall, BabyLLaVA-V2 demonstrates competitive performance despite being trained on orders of magnitude less data.
Intriguing Findings
We also draw several intriguing "byproduct" findings from the main results table, which improve our understanding of the proprietary GPT and Gemini models.
GPT models struggle to count. Object Counting requires a model to count the objects in an image (from 1 to 12), yet GPT-4o can hardly count beyond five (see the figure below).
GPT-4o and our model's counting performance by different object numbers.
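The per-count breakdown plotted above can be produced by binning predictions by the ground-truth object count; a minimal sketch, where the example predictions are made up for illustration rather than real model outputs:

```python
from collections import defaultdict

def accuracy_by_count(examples):
    """Group (true_count, predicted_count) pairs by the true count and
    return per-count accuracy, as in a counting-performance breakdown."""
    hits, totals = defaultdict(int), defaultdict(int)
    for true_count, pred in examples:
        totals[true_count] += 1
        hits[true_count] += int(pred == true_count)
    return {n: hits[n] / totals[n] for n in sorted(totals)}

# Made-up (ground truth, prediction) pairs for illustration only.
examples = [(1, 1), (1, 1), (5, 5), (5, 4), (8, 6), (8, 8)]
print(accuracy_by_count(examples))  # → {1: 1.0, 5: 0.5, 8: 0.5}
```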
BabyLLaVA-V2 can match or outperform GPT-4o on some cognitive tasks. On Spatial Details and Who Has More, BabyLLaVA-V2 is on par with the four latest GPT and Gemini models. Moreover, it even outperforms GPT-4o on the math tasks of Object Counting and Who Has More. The figure above also shows that BabyLLaVA-V2 counts more accurately than GPT-4o when six or more objects are present.
GPT vs. Gemini. Overall, the proprietary models yield similar results on DevCV Toolbox. However, zooming into individual tasks, GPT-5 is significantly better than the rest on Spatial Details, while the Gemini models outperform the GPT models on Object Counting.