Filtered SAYCam Dataset
The SAYCam dataset provides egocentric audio-visual recordings captured from the perspective of infants aged 6–32 months. We use this dataset as our primary source of developmentally plausible data, filtering and processing it into a clean training corpus that captures the natural learning environment of infants.
Transferred Synthetic Training Dataset
To address the inherent limitations of existing small-scale datasets, we introduce a data augmentation approach. Using GPT-4o, we rewrite the captions of existing datasets such as CC3M into simplified, child-directed utterances, creating training data that more closely mirrors the linguistic and visual complexity encountered by infants. This transferred dataset helps bridge the gap between the limited SAYCam data and the broader, more diverse input from which infants naturally learn.
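
As a rough sketch of how this caption transfer can be implemented, the snippet below calls GPT-4o through the OpenAI API. The prompt wording, the `transfer_caption` helper, and the "SKIP" filtering convention are illustrative assumptions, not the exact prompt used in our pipeline.

```python
# Illustrative sketch of the caption transfer step; the prompt text and the
# "SKIP" convention are assumptions, not our exact prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "You will be given an image caption. First, decide whether it describes "
    "something a young child would plausibly see in daily life. If it does not, "
    "reply with exactly 'SKIP'. Otherwise, rewrite the caption as a short, "
    "simple, child-directed utterance.\n\nCaption: {caption}"
)

def transfer_caption(caption: str) -> str | None:
    """Return a child-directed rewrite of `caption`, or None if it is filtered out."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(caption=caption)}],
        temperature=0.2,
    )
    text = response.choices[0].message.content.strip()
    return None if text == "SKIP" else text

# Usage sketch on a CC3M-style caption; the output shown is hypothetical:
# transfer_caption("a golden retriever catches a frisbee in the park")
# -> "Look, the doggy is catching a frisbee!"
```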

Pipeline for generating the transferred dataset. Step 1: We prompt GPT-4o to check whether an input caption describes something a child would see in daily life and, if so, to rewrite the original caption as a simpler, child-directed utterance. Step 2: We use the CLIP similarity score as a distance metric between images and perform Hungarian matching to select a small subset of the transferred dataset that is visually aligned with SAYCam images.
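
For Step 2, a minimal sketch of the matching procedure is shown below, assuming CLIP image embeddings have already been computed and L2-normalised for both image pools. The cosine-distance cost matrix, the matching direction (one transferred image per SAYCam frame), and the use of `scipy.optimize.linear_sum_assignment` for Hungarian matching are one plausible implementation, not necessarily identical to ours.

```python
# Sketch of Step 2: select transferred-dataset images that are visually aligned
# with SAYCam frames via Hungarian matching on CLIP similarity. The embedding
# arrays are assumed to be precomputed with a CLIP image encoder and row-normalised.
import numpy as np
from scipy.optimize import linear_sum_assignment

def select_visually_aligned_subset(
    transferred_emb: np.ndarray,  # (N, d) CLIP embeddings of transferred images
    saycam_emb: np.ndarray,       # (M, d) CLIP embeddings of SAYCam frames, M <= N
) -> np.ndarray:
    """Return indices of the transferred images assigned to SAYCam frames."""
    # Cosine similarity between every transferred image and every SAYCam frame.
    sim = transferred_emb @ saycam_emb.T   # (N, M)
    cost = 1.0 - sim                       # lower cost = more visually similar
    # Hungarian matching: each SAYCam frame is paired with one transferred image.
    row_idx, col_idx = linear_sum_assignment(cost)
    return row_idx                         # selected subset of the transferred dataset

# Usage sketch with random vectors standing in for real CLIP features:
# rng = np.random.default_rng(0)
# t = rng.normal(size=(1000, 512)); t /= np.linalg.norm(t, axis=1, keepdims=True)
# s = rng.normal(size=(200, 512));  s /= np.linalg.norm(s, axis=1, keepdims=True)
# keep = select_visually_aligned_subset(t, s)   # 200 indices into the transferred set
```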
Below are a few examples from our dataset:

Examples of the original SAYCam dataset and the transferred dataset.