BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

Boston University

Abstract

Human infants rapidly develop visual reasoning skills from minimal input, suggesting that developmentally inspired pretraining could significantly enhance the efficiency of vision-language models (VLMs). Although recent efforts have leveraged infant-inspired datasets like SAYCam, existing evaluation benchmarks remain misaligned—they are either too simplistic, narrowly scoped, or tailored for large-scale pretrained models. Additionally, training exclusively on infant data overlooks the broader, diverse input from which infants naturally learn. To address these limitations, we propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets. We demonstrate that VLMs trained with our synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or on general-purpose data of comparable size to SAYCam. BabyVLM thus provides a robust, developmentally aligned evaluation tool and illustrates how compact models trained on carefully curated data can generalize effectively, opening pathways toward data-efficient vision-language learning paradigms.

Overview

We propose a novel framework, BabyVLM, for data-efficient pretraining of vision-language models (VLMs). To this end, we introduce methods for creating minimal yet naturalistic data—akin to the input human infants receive—as well as comprehensive in-domain evaluation benchmarks. By carefully curating the training data, we show that our method yields more robust, baby-like representations compared to training on general-purpose corpora, and can further serve as a template for resource-efficient model training in other specialized domains.

To bridge this evaluation gap and realize our goal of data-efficient, developmentally aligned VLM pretraining, we offer three main contributions:

  • In-Domain Evaluation Tasks. We design three novel evaluation tasks derived from the SAYCam dataset. These tasks are tailored to reflect the cognitive and perceptual abilities typical of early human development, enabling comprehensive and meaningful evaluation of compact models trained on developmentally plausible data.
  • Synthetic Data Augmentation. We introduce a data distillation approach to address the inherent limitations of existing small-scale datasets. By synthesizing simplified, child-directed versions of existing datasets like CC3M using GPT-4o, we create training data that more closely mirrors the linguistic and visual complexity encountered by infants.
  • BabyLLaVA: Generative Model Trained from Scratch. We present BabyLLaVA, the first generative VLM trained entirely on developmentally plausible data. BabyLLaVA demonstrates that compact generative models, when trained on intentionally constrained and naturalistic data, can produce robust, baby-oriented responses to egocentric, baby-viewpoint input.
Overview of BabyVLM framework

We introduce BabyVLM, a developmentally inspired framework derived from SAYCam, consisting of the original SAYCam dataset, a transferred training dataset, a generative baseline VLM, and four evaluation benchmarks.

Dataset

Filtered SAYCam Dataset

The SAYCam dataset provides egocentric audiovisual recordings of infants aged 6–32 months. We utilize this dataset as our primary source of developmentally plausible data, filtering and processing it to create a clean training corpus that captures the natural learning environment of infants.

Transferred Synthetic Training Dataset

To address the inherent limitations of existing small-scale datasets, we introduce a data augmentation approach. By synthesizing simplified, child-directed versions of existing datasets like CC3M using GPT-4o, we create training data that more closely mirrors the linguistic and visual complexity encountered by infants. This transferred dataset helps bridge the gap between the limited SAYCam data and the broader, diverse input from which infants naturally learn.
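
For concreteness, the sketch below shows how a single caption could be screened and rewritten with GPT-4o through the OpenAI Python client. The prompt wording, the REJECT convention, and the decoding settings are illustrative assumptions, not the exact prompts used to build our dataset.

# Illustrative sketch of Step 1 (caption transfer); prompt and settings are assumptions.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You will be given an image caption from a web-scale dataset. "
    "First decide whether it describes something a young child could plausibly "
    "see in daily life. If not, answer REJECT. Otherwise, rewrite it as a "
    "short, simple, child-directed utterance."
)

def transfer_caption(caption: str):
    """Return a child-directed rewrite of `caption`, or None if it is rejected."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": caption},
        ],
        temperature=0.2,
    )
    text = response.choices[0].message.content.strip()
    return None if text.startswith("REJECT") else text

# Example: a cluttered web caption is rejected, an everyday scene is simplified.
print(transfer_caption("A chef plating a deconstructed dessert at an industry gala."))
print(transfer_caption("A brown dog running across the grass with a ball."))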

Dataset pipeline visualization

Pipeline for generating the transferred dataset. Step 1: We prompt GPT-4o to check whether an input caption describes something a child would see in daily life, and to transfer the original image captions into simpler, child-directed utterances. Step 2: We use CLIP similarity as a distance metric between images, then conduct Hungarian matching to select a small subset of the transferred dataset that is visually aligned with SAYCam images.
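
A minimal sketch of Step 2 is given below: CLIP image embeddings define a cosine-distance cost matrix, and Hungarian matching (SciPy's linear_sum_assignment) picks one transferred image per SAYCam frame. The checkpoint name and the exact cost definition are assumptions for illustration, and batching of large image sets is omitted.

# Illustrative sketch of Step 2 (subset selection via Hungarian matching).
import torch
from PIL import Image
from scipy.optimize import linear_sum_assignment
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_embed(paths):
    """L2-normalized CLIP image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

def select_aligned_subset(saycam_paths, transferred_paths):
    """Pick one transferred image per SAYCam frame via Hungarian matching.

    Cost = 1 - cosine similarity, so the assignment maximizes the total CLIP
    similarity between the selected subset and the SAYCam frames.
    """
    A = clip_embed(saycam_paths)        # shape (n, d)
    B = clip_embed(transferred_paths)   # shape (m, d), with m >= n
    cost = 1.0 - A @ B.T                # shape (n, m)
    rows, cols = linear_sum_assignment(cost)
    return [transferred_paths[j] for j in cols]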

Below are a few examples from our dataset:

Dataset examples

Examples of the original SAYCam dataset and the transferred dataset.

BabyLLaVA

Inspired by recent methods, we present BabyLLaVA, the first generative VLM trained entirely on developmentally plausible data. BabyLLaVA follows the architecture and training strategy of LLaVA, consisting of a language backbone, a vision backbone, and a two-layer MLP connector. The model demonstrates that compact generative models, when trained on intentionally constrained and naturalistic data, can produce robust, baby-oriented responses to egocentric, baby-viewpoint input.

For the language backbone, we train a small GPT-2 model with 7M parameters from scratch using the language portion of our training corpus. The vision backbone is a ResNeXt-50 model with 23M parameters, trained from scratch with DINOv2 on all SAYCam video clips. The connector is a simple two-layer MLP, identical in structure to the connector in LLaVA-v1.5.
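
The following PyTorch sketch illustrates the connector design; the feature dimensions are placeholders rather than BabyLLaVA's actual hidden sizes.

# Minimal PyTorch sketch of the LLaVA-style two-layer MLP connector.
# Dimensions are placeholders, not BabyLLaVA's real configuration.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Projects vision-backbone patch features into the language model's
    token embedding space, as in LLaVA-v1.5."""

    def __init__(self, vision_dim: int = 2048, lm_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # returns visual "tokens" of shape (batch, num_patches, lm_dim)
        return self.proj(patch_features)

# The projected visual tokens are prepended to the text token embeddings and
# the combined sequence is fed to the GPT-2 language backbone.
connector = VisionLanguageConnector()
visual_tokens = connector(torch.randn(1, 49, 2048))  # e.g. a 7x7 ResNeXt feature map
print(visual_tokens.shape)  # torch.Size([1, 49, 256])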

Evaluation Tasks

Evaluation tasks visualization

Illustrations of in-domain evaluation benchmarks in the BabyVLM framework. Labeled-S: The category label must be matched to the target referent among 4 candidates. Visual Two-Word Test: The positive phrase must be matched to the image. Positive and negative phrases are generated by GPT-4o. Baby Winoground: The positive and negative phrases must be matched with their corresponding images. Negative images are generated by Stable Diffusion, with prompts enhanced by GPT-4o. SAYCam Caption: The generated image caption must match the ground truth image caption. All image-caption pairs come from a de-duplicated subset of the SAYCam test split.

Labeled-S

The Labeled-S benchmark is built directly from SAYCam data and evaluates object classification within the infant's visual environment: given a category label, the model must pick out the target referent among four candidate images.
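
The sketch below illustrates this 4-way forced-choice protocol, assuming a contrastive (CLIP/CVCL-style) model that exposes separate image and text encoders; embed_image and embed_text are hypothetical stand-ins, and scoring a generative model would require a different procedure.

# Illustrative 4-way forced-choice evaluation for Labeled-S.
import torch

def labeled_s_accuracy(trials, embed_image, embed_text):
    """Each trial: (category_label, [img_target, img_foil1, img_foil2, img_foil3]).
    embed_image / embed_text return 1-D feature tensors. A trial is correct
    if the target image has the highest similarity to the category label."""
    correct = 0
    for label, candidates in trials:
        text = torch.nn.functional.normalize(embed_text(label), dim=-1)
        imgs = torch.stack([
            torch.nn.functional.normalize(embed_image(img), dim=-1)
            for img in candidates
        ])
        sims = imgs @ text                      # cosine similarity per candidate
        correct += int(torch.argmax(sims).item() == 0)  # target is at index 0
    return correct / len(trials)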

Visual Two-Word Test

We construct VTWT by sub-sampling the SAYCam test split and using GPT-4o to generate candidate two-word phrases through structured prompts. These prompts incorporate few-shot examples and Chain-of-Thought guidance to improve phrase quality. Each sample is manually reviewed by expert annotators to ensure quality and accuracy.

Baby Winoground

We construct Baby Winoground using the test samples from VTWT, modifying the original images such that the modified image is associated exclusively with the negative phrase while preserving most of the original content. This task evaluates the model's ability to understand and reason about visual compositions.
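
One natural way to score this task, following the convention of the original Winoground, is to report text, image, and group scores as sketched below; score(image, text) is a placeholder for whatever image-text matching score the evaluated model provides, and the exact metric reported in our benchmark may differ.

# Winoground-style scoring sketch for Baby Winoground.
def baby_winoground_scores(examples, score):
    """Each example: (img_pos, img_neg, phrase_pos, phrase_neg), where the
    positive image pairs with the positive phrase and likewise for negatives."""
    text_ok = image_ok = group_ok = 0
    for i0, i1, t0, t1 in examples:
        # Text score: each image prefers its own phrase.
        t_correct = score(i0, t0) > score(i0, t1) and score(i1, t1) > score(i1, t0)
        # Image score: each phrase prefers its own image.
        i_correct = score(i0, t0) > score(i1, t0) and score(i1, t1) > score(i0, t1)
        text_ok += t_correct
        image_ok += i_correct
        group_ok += (t_correct and i_correct)
    n = len(examples)
    return {"text": text_ok / n, "image": image_ok / n, "group": group_ok / n}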

SAYCam Caption

The SAYCam Caption task evaluates the model's ability to generate appropriate captions for images from the infant's perspective. This task assesses the model's understanding of visual scenes in a developmentally appropriate context.
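
As an illustration only, the sketch below compares a generated caption to the ground-truth caption with an embedding-based text similarity; the sentence-transformers model and the generate_caption callable are assumptions for demonstration, not the benchmark's fixed scoring protocol.

# Illustrative caption-matching sketch; the similarity measure is an assumption.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def caption_similarity(image, ground_truth_caption, generate_caption):
    """Score how closely the model's caption matches the ground truth."""
    predicted = generate_caption(image)  # the evaluated VLM's captioning call
    emb = encoder.encode([predicted, ground_truth_caption], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()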

Experiment

We begin by evaluating multiple models, including baby models trained purely on SAYCam (BabyLLaVA, CVCL) and larger upper-bound models that are either used directly out of the box (LLaVA-v1.5-7B, CLIP-large) or further fine-tuned on our SAYCam data (LLaVA-v1.5-7B-ft). These models are assessed on four in-domain benchmarks: Labeled-S, Visual Two-Word Test (VTWT), Baby Winoground, and SAYCam Caption. The table below summarizes these results.

Notably, CVCL—a contrastive model—consistently outperforms the generative BabyLLaVA model across most tasks. This observation aligns with existing literature, suggesting that contrastive models may be better suited to discriminative tasks, possibly due to their direct objective of learning joint visual-textual alignment. However, generative models like BabyLLaVA demonstrate reasonable performance on simpler compositional tasks such as VTWT, indicating substantial potential for improvement on more sophisticated compositional tasks like Baby Winoground. In particular, Baby Winoground reveals a stark asymmetry: baby models perform above chance when reasoning from in-distribution (positive) context, but below chance from out-of-distribution (negative) context, highlighting a systematic failure under distribution shift. Moreover, generative captioning, measured by SAYCam Caption scores, remains challenging for all models, emphasizing the additional complexity inherent in generating full linguistic descriptions from minimal data.

Experiment results visualization

A primary aim of our approach is to ensure that baby models align with the cognitive and linguistic limitations of early-stage learners. To empirically validate this property, we explicitly assess baby models on tasks that exceed typical infant-level developmental capacities, such as advanced visual reasoning (Winoground) and general-purpose tasks (VQA and BLiMP). As shown in the table below, baby models (e.g., BabyLLaVA, CVCL) perform significantly below upper-bound models, affirming their constrained generalization capabilities. This limitation ensures developmental authenticity, preventing baby models from inadvertently solving complex tasks beyond their intended cognitive stage.

Experiment results visualization

Interestingly, we find that the performance gap between BabyLLaVA and the larger LLaVA-v1.5-7B model is significantly greater on these complex, out-of-domain tasks compared to simpler, in-domain tasks such as VTWT. This indicates that observed differences in performance cannot be attributed solely to differences in model capacity (i.e., parameter count), but also arise from the complexity and alignment of tasks and datasets with the developmental stage being modeled. Thus, baby models’ constraints are multidimensional, encompassing not only architectural limitations but also deliberate choices in task design and dataset construction.

BibTeX

@misc{wang2025babyvlmdataefficientpretrainingvlms,
      title={BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning}, 
      author={Shengao Wang and Arjun Chandra and Aoming Liu and Venkatesh Saligrama and Boqing Gong},
      year={2025},
      eprint={2504.09426},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.09426}, 
}