We propose a new problem, In-2-4D, for generative 4D (i.e., 3D + motion) inbetweening from a minimalistic
input setting: two
single-view images capturing an object in two distinct motion states. Given these start and end
states, our goal is to generate and reconstruct the object's motion in 4D.
We utilize a video interpolation model to predict the motion, but large frame-to-frame motions can lead to
ambiguous interpretations. To overcome this, we employ a hierarchical approach to identify keyframes that
are visually close to the input states and show significant motion, then generate smooth fragments between
them. For each fragment, we construct the 3D representation of the keyframe using Gaussian Splatting. The
temporal frames within the fragment guide the motion, deforming these static Gaussians into dynamic
ones through a deformation field. To improve temporal consistency and refine the 3D motion, we extend the
self-attention of multi-view diffusion across timesteps and apply rigid transformation regularization.
Finally, we merge the independently generated 3D motion segments by interpolating boundary deformation
fields and optimizing them to align with the guiding video, ensuring smooth and flicker-free transitions.
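As an illustrative sketch of the per-fragment animation (the exact parameterization is an assumption and is not specified above), the deformation field can be read as a mapping $\mathcal{D}_\theta$ from a canonical Gaussian center $\mu_0$ and a time $t$ to offsets of the Gaussian parameters,
\[
(\Delta\mu,\ \Delta q,\ \Delta s) = \mathcal{D}_\theta(\mu_0, t), \qquad
\mu(t) = \mu_0 + \Delta\mu, \quad
q(t) = q_0 \otimes \Delta q, \quad
s(t) = s_0 + \Delta s,
\]
so the static Gaussians reconstructed at the keyframe are driven through time by the frames of the guiding video fragment.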