In-2-4D: Inbetweening from Two Single-View
Images to 4D Generation

1Simon Fraser University, Canada 2Tel Aviv University, Israel


In-2-4D turns two burst-mode photos captured by your phone into free-viewpoint motion videos.



Abstract

We propose a new problem, In-2-4D, for generative 4D (i.e., 3D + motion) inbetweening from a minimalistic input setting: two single-view images capturing an object in two distinct motion states. Given two images representing the start and end states of an object in motion, our goal is to generate and reconstruct the motion in 4D.

We utilize a video interpolation model to predict the motion, but large frame-to-frame motions can lead to ambiguous interpretations. To overcome this, we employ a hierarchical approach to identify keyframes that are visually close to the input states and show significant motion, then generate smooth fragments between them. For each fragment, we construct the 3D representation of the keyframe using Gaussian Splatting. The temporal frames within the fragment guide the motion, enabling their transformation into dynamic Gaussians through a deformation field. To improve temporal consistency and refine 3D motion, we expand the self-attention of multi-view diffusion across timesteps and apply rigid transformation regularization. Finally, we merge the independently generated 3D motion segments by interpolating boundary deformation fields and optimizing them to align with the guiding video, ensuring smooth and flicker-free transitions.
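To make the deformation-field idea concrete, below is a minimal PyTorch sketch of a time-conditioned network that maps canonical Gaussian centers to per-Gaussian position, rotation, and scale offsets. The layer widths, the raw (un-encoded) time input, and the output parameterization are illustrative assumptions, not the exact released architecture.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Minimal sketch: time-conditioned MLP that deforms canonical 3D Gaussians."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 3),  # position, quaternion, scale offsets
        )

    def forward(self, xyz: torch.Tensor, t: float):
        # xyz: (N, 3) canonical Gaussian centers; t: normalized time in [0, 1]
        t_col = torch.full((xyz.shape[0], 1), t, dtype=xyz.dtype, device=xyz.device)
        out = self.mlp(torch.cat([xyz, t_col], dim=-1))
        return out.split([3, 4, 3], dim=-1)  # (d_xyz, d_rot, d_scale)

# Usage: displace the canonical Gaussians of a keyframe to an in-between time.
field = DeformationField()
xyz = torch.randn(1024, 3)                 # canonical centers from Gaussian Splatting
d_xyz, d_rot, d_scale = field(xyz, t=0.5)
deformed_xyz = xyz + d_xyz                 # rasterize these for the guided frame at t
```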



Video

Top-Down Divide

We adaptively generate intermediate keyframes between the input images using a hierarchical approach that detects rapid motion changes. This simplifies complex motion trajectories into quasi-static segments. A video interpolation module, guided by feature correspondence, selects the keyframes.
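The top-down split can be summarized as a simple recursion: keep inserting an interpolated midpoint keyframe until each segment's motion falls below a threshold. The sketch below assumes two placeholder callables, `motion_gap` (a correspondence-based motion score between two frames) and `vfi` (a video-interpolation call returning the middle frame); the names and threshold are illustrative, not the released code.

```python
def split_keyframes(start, end, motion_gap, vfi, tau=0.25, max_depth=4, depth=0):
    """Recursively insert keyframes until every segment is quasi-static.

    start, end : frames (e.g., HxWx3 arrays)
    motion_gap : callable scoring motion between two frames via feature matches
    vfi        : video-interpolation call returning the frame halfway between states
    tau        : motion threshold below which a segment needs no further split
    """
    if depth >= max_depth or motion_gap(start, end) < tau:
        return [start, end]
    mid = vfi(start, end)  # candidate keyframe between the two states
    left = split_keyframes(start, mid, motion_gap, vfi, tau, max_depth, depth + 1)
    right = split_keyframes(mid, end, motion_gap, vfi, tau, max_depth, depth + 1)
    return left[:-1] + right  # drop the duplicated midpoint at the junction
```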

Bottom-Up Merge

We use the keyframes to generate quasi-static motions with a video diffusion model. These are lifted to 4D using multi-view diffusion under piecewise-rigid assumptions, forming 4D fragments with the keyframes as canonical poses. Finally, we merge these independently generated fragments into the final 4D video.
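A rough picture of the merge step, assuming each fragment exposes a `deform(xyz, t)` call over global time: near a shared boundary keyframe, the two fragments' deformation fields are linearly blended, and the blended result is then optimized against the guiding video. The blend window and linear weights below are assumptions for illustration, not the exact procedure.

```python
def blended_deform(xyz, t, frag_a, frag_b, boundary_t, window=0.1):
    """Blend two fragments' deformation fields around their shared keyframe.

    frag_a.deform / frag_b.deform : callables mapping (xyz, t) -> deformed centers
    boundary_t : global time of the keyframe both fragments share
    window     : half-width of the transition interval in normalized time
    """
    if t <= boundary_t - window:
        return frag_a.deform(xyz, t)
    if t >= boundary_t + window:
        return frag_b.deform(xyz, t)
    # Inside the transition window: interpolate the two deformations,
    # then fine-tune the result against the guiding video to remove flicker.
    w = (t - (boundary_t - window)) / (2 * window)
    return (1 - w) * frag_a.deform(xyz, t) + w * frag_b.deform(xyz, t)
```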


4D Animation

Comparison to Baselines

We compare our approach with baseline methods built on the latest video interpolation and video-to-4D generation models. Our method produces more consistent motions with more accurate textures.


Generation Gallery

We show novel-view generations of complex motion categories using our In-2-4D method.



Application: 4D Motion Editing

In contrast to most existing 4D generation methods that depend on SDS, our approach improves controllability and motion diversity. While BLIP is used by default to extract motion prompts, users can input custom prompts to generate different 4D motions for the same initial and final states. As shown below, both jumping and walking motions of a dog are synthesized under identical start and end states. Despite the motion complexity, our bottom-up 3D optimization ensures artifact-free novel-view generation.
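The prompt control can be pictured as a tiny selection step: a BLIP-style caption of the two input states drives generation by default, and an explicit user prompt overrides it. The `blip_caption` name below is a hypothetical placeholder for the captioning stage, not a documented interface.

```python
def motion_prompt(start_img, end_img, blip_caption, user_prompt=None):
    """Return the text prompt that steers the in-between motion.

    blip_caption : captioning call over the two input states (default source)
    user_prompt  : optional override, e.g. "a dog jumping" vs "a dog walking"
    """
    return user_prompt if user_prompt else blip_caption(start_img, end_img)
```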






BibTeX

@article{nag2025in24d,
  author    = {Nag, Sauradip and Cohen-Or, Daniel and Zhang, Hao and Mahdavi-Amiri, Ali},
  title     = {In-2-4D: Inbetweening from Two Single View Images to 4D Generation},
  journal   = {arXiv preprint},
  year      = {2025},
}