BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

Shengao Wang*†1, Wenqi Wang*1, Zecheng Wang*1, Max Whitton*1,
Michael Wakeham1, Arjun Chandra1, Joey Huang1, Pengyue Zhu1,
Helen Chen‡1, David Li‡1, Jeffrey Li‡1, Shawn Li‡1, Andrew Zagula‡1, Amy Zhao‡1, Andrew Zhu‡1,
Sayaka Nakamura2, Yuki Yamamoto2, Jerry Jun Yokono2,
Aaron Mueller1, Bryan A. Plummer1, Kate Saenko1, Venkatesh Saligrama1, Boqing Gong1
1Boston University, 2Sony Group Corporation
*Equal contribution. †Project lead. ‡Equal contribution; work done as interns at Boston University.

Abstract

Young children's developmental trajectories set a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, the DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox® into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with young children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.

Overview

Given a longitudinal, infant-centric audiovisual sample of young children's sensory experiences, can we learn a foundation model (FM) that is as versatile and capable as young children's perception? As a further challenge, can we leverage principles of developmental psychology to create a benchmark as an initial step toward artificial developmental intelligence (ADI), in both what it is and how to achieve it, within the constraints of young children's limited sensory intake?

While our previous work, BabyVLM-V1, established a basic framework for this question, it lacks crucial elements. In response, this work extends BabyVLM-V1 into a more comprehensive, extensive, and developmentally plausible framework by improving three key aspects:

  • Richer Pretraining Data: We expand the pretraining set to leverage the full extent of SAYCam's recordings, covering a more substantial portion of the total visual intake time of a three-year-old since birth. We also construct several different formats of training data to support diverse downstream tasks.
  • Baseline Models: Leveraging the new training dataset and introducing an additional finetuning stage, we develop BabyLLaVA-V2, a compact VLM trained purely from scratch that accepts multi-image input, holds multi-turn conversations, and follows natural-language instructions.
  • Psychology-Grounded Benchmark: We replace intuitive task design with a benchmark adapted from developmental psychology evaluations, enabling age-aligned assessment of early perceptual and cognitive abilities.
Overview of the BabyVLM framework

BabyVLM-V2: an extensive, versatile, and developmentally plausible framework for research in vision foundation models. Its (a) pretraining set is diverse in format (video, image-utterance, and multiple turns), enabling (b) a flexible model. Its (c) benchmark aligns developmentally with the pretraining set's age span by grounding it in the newly released NIH Baby Toolbox®.

Dataset

The dataset of BabyVLM-V2 is built from the longitudinal SAYCam corpus: egocentric audiovisual recordings from infants collected weekly from roughly 6 to 32 months, totaling 478 hours. We maximize coverage while keeping curation minimal: we transcribe caregiver speech and derive a mixed-format pretraining set that mirrors infant experience in multiple ways—video–utterance pairs, image–utterance pairs, and interleaved multi-turn sequences. Concretely, we segment speech-aligned clips to form about 181k video–utterance pairs, sample frames to obtain about 768k image–utterance pairs, and construct about 63k interleaved sequences to support multi-turn, multi-image style interaction. Beyond pretraining, BabyVLM-V2 additionally releases an instruction fine-tuning dataset (~150k examples) so models can follow natural-language prompts with a unified language interface during evaluation.
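The pairing of transcribed caregiver speech with time-aligned footage can be sketched as follows. This is a minimal illustration, not the released pipeline: the `Utterance` fields, the `build_pairs` helper, and the 1 Hz frame rate are assumptions for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    start: float  # seconds into the recording
    end: float
    text: str     # transcribed caregiver speech

def build_pairs(utterances, frame_rate=1.0):
    """Turn time-aligned utterances into training pairs.

    Each utterance yields one video-utterance pair (the clip spanning
    its timestamps) and several image-utterance pairs (frames sampled
    at `frame_rate` Hz within that clip).
    """
    video_pairs, image_pairs = [], []
    for u in utterances:
        video_pairs.append(((u.start, u.end), u.text))
        t = u.start
        while t < u.end:
            image_pairs.append((t, u.text))
            t += 1.0 / frame_rate
    return video_pairs, image_pairs

utts = [Utterance(0.0, 2.5, "look at the ball"),
        Utterance(3.0, 4.0, "where did it go?")]
videos, images = build_pairs(utts)
```

Interleaved multi-turn sequences would then be formed by concatenating consecutive pairs from the same session, again under minimal curation.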

Dataset pipeline visualization

Examples of the BabyVLM-V2 training data.

BabyLLaVA-V2

BabyLLaVA-V2 is a compact, fully generative vision–language model trained from scratch using the dataset mentioned above. It uses a unified language interface to handle diverse input structures—single images, multiple images, videos, and multi-turn interactions—matching the mixed-format experience in BabyVLM-V2's training data.

Architecture: Similar to LLaVA, BabyLLaVA-V2 consists of (1) a ViT-L/16 vision encoder (~300M parameters), (2) a LLaMA-1.1B language backbone (~1.1B parameters), and (3) a lightweight MLP connector that projects visual features into the language embedding space.
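The connector's role can be sketched with a NumPy stand-in. The dimensions and the two-layer ReLU MLP below are illustrative assumptions, not the model's actual sizes or activation; the point is only the shape transformation from vision-encoder features to language-embedding-space tokens.

```python
import numpy as np

# Illustrative dimensions only (not the real model's sizes).
D_VIS, D_LM, N_PATCH = 1024, 2048, 16

rng = np.random.default_rng(0)
# Hypothetical two-layer MLP connector: vision dim -> language dim.
W1 = rng.standard_normal((D_VIS, D_LM)) * 0.02
W2 = rng.standard_normal((D_LM, D_LM)) * 0.02

def connector(vis_feats):
    """Project ViT patch features into the LM's embedding space."""
    h = np.maximum(vis_feats @ W1, 0.0)  # ReLU here for simplicity
    return h @ W2

patch_feats = rng.standard_normal((N_PATCH, D_VIS))  # ViT patch output
vis_tokens = connector(patch_feats)                  # LM-space tokens
```

The resulting visual tokens are concatenated with text token embeddings before being fed to the language backbone, which is what lets one interface handle single images, multiple images, and video frames alike.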

Training recipe: BabyLLaVA-V2 is trained in two stages. Its pretraining stage follows the same LLaVA-style recipe as BabyLLaVA-V1. On top of that, BabyLLaVA-V2 adds an instruction fine-tuning stage, making it explicitly designed for unified language generation (rather than logit post-processing) and enabling it to respond to natural-language prompts for evaluation across diverse tasks.

DevCV Toolbox

Evaluating vision–language models trained under developmental constraints requires benchmarks that are both in-domain and developmentally meaningful. Rather than designing tasks heuristically, we introduce DevCV Toolbox, a benchmark grounded in established developmental psychology, providing a principled reference for what early visual abilities are measured and how they are assessed.

NIH Baby Toolbox® as a Developmental Reference

To anchor our benchmark in established practice, we adopt the NIH Baby Toolbox®—a standardized assessment framework developed by the National Institutes of Health to measure early development of infants. The Toolbox defines age-aligned measures across multiple domains and is widely used in developmental and clinical research, making it a natural reference for modeling early visual abilities.

Evaluation tasks visualization

The key abilities assessed by the NIH Baby Toolbox®.

From Human Assessment to Computer Vision Tasks

While the NIH Baby Toolbox® provides a principled developmental reference, its original tests are designed for human infants: they involve small sample sizes, interactive protocols, and often cartoon-based stimuli. To make these measures suitable for vision–language models, we systematically adapt several vision-related measures into scalable, in-domain computer vision tasks. Each adapted sub-task preserves the core cognitive demand of the original assessment while replacing human interaction with a standardized visual input–output format, enabling evaluation at scale on naturalistic egocentric data.

Below is a demonstration of one of the tasks (Visual Delayed Response) from the official NIH Baby Toolbox® iPad application, and its adaptation into DevCV Toolbox:

NIH Baby Toolbox® (Human Assessment)

  • Infant age: 22-42 months
  • Key ability assessed: Executive function, Memory, Attention
  • Description: A creature appears in the middle of the screen and then hides in the left or right box. The curtain closes and music plays; the child must then select the box where the creature hid
  • Metric: Accuracy, reaction time

DevCV Toolbox (Model Evaluation)

  • Input:
    • Video: A short video clip in which an object moves in one direction
    • A text prompt: "Which direction does the [object] go?"
  • Desired output: One of 8 possible directions (top, bottom, left, right, top left, top right, bottom left, bottom right)
  • Metric: Accuracy

We preserve the core cognitive demand while replacing interactive human protocols with standardized, scalable model I/O.
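The standardized I/O makes scoring mechanical: map the model's free-text answer to one of the eight direction labels and count matches. The parsing rule below is our own illustration (longest-match-first, unparseable answers counted as wrong), not a prescribed DevCV Toolbox implementation.

```python
DIRECTIONS = ["top", "bottom", "left", "right",
              "top left", "top right", "bottom left", "bottom right"]

def parse_direction(answer):
    """Map a free-text answer to one of the 8 direction labels.

    Longest match first, so 'top left' is not truncated to 'top'.
    """
    text = answer.lower()
    for d in sorted(DIRECTIONS, key=len, reverse=True):
        if d in text:
            return d
    return None  # unparseable answers count as wrong

def accuracy(predictions, labels):
    """Fraction of predictions whose parsed direction matches the label."""
    correct = sum(parse_direction(p) == g for p, g in zip(predictions, labels))
    return correct / len(labels)

preds = ["It moves to the top left.", "left", "The object goes right"]
gold = ["top left", "left", "bottom"]
```

The same pattern (label parsing followed by exact-match accuracy) extends to the other nine sub-tasks, each with its own answer vocabulary.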

In total, we select ten sub-tasks from the NIH Baby Toolbox® and adapt them to form our benchmark, DevCV Toolbox. Illustrations of each task are shown below.

Evaluation tasks visualization

Left: DevCV Toolbox tasks. Right: The original NIH Baby Toolbox® tasks.

Experiment

We evaluate BabyLLaVA-V2 alongside several popular closed-source and open-source models on DevCV Toolbox. We also include human performance, collected from several college students, as an upper bound; the procedure for collecting human infants' responses is currently under IRB review. The main experimental results are shown below.

Performance comparison of different models on DevCV Toolbox. Different background colors denote different model families. We report accuracy (%) for all tasks; the higher, the better.

DevCV Toolbox can Differentiate Model Capability

Crucially, DevCV Toolbox exhibits a clear performance hierarchy: human adults provide a strong upper bound (93.0%); large proprietary models perform best among AI systems; compact open-source models occupy a middle range; and random guessing establishes a non-trivial lower bound. This confirms that DevCV Toolbox is challenging yet solvable, and capable of meaningfully differentiating models by cognitive capability rather than scale alone.

Strong In-Domain Performance of BabyLLaVA-V2

Across all tasks, BabyLLaVA-V2 achieves 55.2% average accuracy, substantially outperforming random guessing (31.8%) and matching or exceeding several open-source models of comparable scale. While proprietary models (GPT-5, Gemini-2.5-Pro) remain strongest overall, BabyLLaVA-V2 demonstrates competitive performance despite being trained on orders of magnitude less data.

Intriguing Findings

We also draw some intriguing "byproduct" findings from the main results table, which improve our understanding of the proprietary GPT and Gemini models.

GPT models struggle to count. Object Counting requires a model to count objects in an image (between 1 and 12), and GPT-4o can hardly count beyond 5 (see the figure below).

GPT-4o and our model's counting performance by different object numbers.

BabyLLaVA-V2 can match or outperform GPT-4o on some cognitive tasks. On Spatial Details and Who Has More, BabyLLaVA-V2 is on par with the four latest GPT and Gemini models. Moreover, it even outperforms GPT-4o on the math tasks of Object Counting and Who Has More. The figure above also shows that BabyLLaVA-V2 counts better than GPT-4o given six or more objects.

GPT vs. Gemini. In general, the proprietary models achieve similar results on DevCV Toolbox. However, zooming into individual tasks, GPT-5 is significantly better than the rest on Spatial Details, while the Gemini models are better at Object Counting than the GPT models.

BibTeX


@misc{wang2025babyvlmv2developmentallygroundedpretraining,
  title={BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models},
  author={Shengao Wang and Wenqi Wang and Zecheng Wang and Max Whitton and Michael Wakeham and Arjun Chandra and Joey Huang and Pengyue Zhu and Helen Chen and David Li and Jeffrey Li and Shawn Li and Andrew Zagula and Amy Zhao and Andrew Zhu and Sayaka Nakamura and Yuki Yamamoto and Jerry Jun Yokono and Aaron Mueller and Bryan A. Plummer and Kate Saenko and Venkatesh Saligrama and Boqing Gong},
  year={2025},
  eprint={2512.10932},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.10932},
}