HiMoR: Monocular Deformable Gaussian Reconstruction with Hierarchical Motion Representation

Abstract

We present Hierarchical Motion Representation (HiMoR), a novel deformation representation for 3D Gaussian primitives that enables high-quality monocular dynamic 3D reconstruction.

The insight behind HiMoR is that motions in everyday scenes can be decomposed into coarser motions that serve as the foundation for finer details. Using a tree structure, HiMoR's nodes represent different levels of motion detail, with shallower nodes modeling coarse motion for temporal smoothness and deeper nodes capturing finer motion. Additionally, our model uses a few shared motion bases to represent motions of different sets of nodes, aligning with the assumption that motion tends to be smooth and simple. This motion representation design provides Gaussians with a more structured deformation, maximizing the use of temporal relationships to tackle the challenging task of monocular dynamic 3D reconstruction.

Because pixel-level metrics can fail to reflect the true quality of monocular dynamic 3D reconstruction, we also propose a more reliable perceptual metric as an alternative. Extensive experiments demonstrate our method's efficacy in achieving superior novel view synthesis from challenging monocular videos with complex motions.

Method overview

Left: The proposed hierarchical motion representation (HiMoR) is defined in the canonical frame together with the 3D Gaussian primitives. HiMoR uses a tree structure in which each node represents its motion relative to its parent node; the root node is stationary and fixed at the origin of the world coordinate frame.
Top right: Child nodes that belong to the same parent node share a set of SE(3) motion bases, and the motion of each child node is obtained by weighting these motion bases with the node's own coefficients. The motion of each leaf node relative to the world coordinate frame is computed iteratively by following the hierarchy of HiMoR.
Bottom right: The deformation of each Gaussian is derived by weighting the motions of its K-nearest neighbor (KNN) leaf nodes within the canonical frame.
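
The three steps above can be illustrated with a minimal NumPy sketch for a single time step. This is not the authors' implementation; it rests on several assumptions made only for illustration: each motion basis is stored as a 6-vector (axis-angle rotation followed by translation) and blended linearly in that parameter space, a node's world motion is its parent's world motion composed with its own relative motion, and Gaussian centers are blended with distance-based weights over the K nearest leaf nodes. All function names (`se3_from_params`, `node_relative_motion`, `compose_world_motions`, `deform_gaussian_means`) and the `sigma` parameter are hypothetical.

```python
import numpy as np

def se3_from_params(rotvec, trans):
    """Build a 4x4 rigid transform from an axis-angle vector and a translation."""
    theta = np.linalg.norm(rotvec)
    K = np.zeros((3, 3))
    if theta > 1e-12:
        k = rotvec / theta
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
    # Rodrigues' rotation formula.
    R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = trans
    return T

def node_relative_motion(bases, coeffs):
    """Weight shared bases (B, 6) with one node's coefficients (B,).

    Assumption: blending happens linearly in axis-angle/translation space.
    """
    params = np.asarray(coeffs) @ np.asarray(bases)
    return se3_from_params(params[:3], params[3:])

def compose_world_motions(parent, rel):
    """Walk the tree from the root and accumulate world motions.

    parent[i] is the index of node i's parent (-1 for the root); nodes must
    be ordered so that every parent appears before its children.
    Assumption: world(child) = world(parent) @ relative(child).
    """
    world = []
    for i, p in enumerate(parent):
        world.append(rel[i] if p < 0 else world[p] @ rel[i])
    return np.stack(world)

def deform_gaussian_means(means, leaf_pos, leaf_T, k=2, sigma=1.0):
    """Move Gaussian centers by blending the motions of their k nearest leaves.

    Only the centers are handled; rotating the Gaussians is omitted here.
    Assumption: weights are a normalized Gaussian kernel on canonical distance.
    """
    d2 = ((means[:, None, :] - leaf_pos[None, :, :]) ** 2).sum(-1)   # (N, L)
    idx = np.argsort(d2, axis=1)[:, :k]                              # (N, k)
    d2k = np.take_along_axis(d2, idx, axis=1)
    # Subtracting the row minimum keeps the weights numerically stable.
    w = np.exp(-(d2k - d2k.min(axis=1, keepdims=True)) / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)
    homo = np.concatenate([means, np.ones((len(means), 1))], axis=1)  # (N, 4)
    moved = np.einsum("nkij,nj->nki", leaf_T[idx], homo)[..., :3]     # (N, k, 3)
    return (w[..., None] * moved).sum(axis=1)

# Toy hierarchy: node 0 is the stationary root, node 1 is a coarse node,
# nodes 2 and 3 are leaves that share node 1's motion bases.
parent = [-1, 0, 1, 1]
coarse_bases = np.array([[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]])       # shared by root's children
fine_bases = np.array([[0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
                       [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])        # shared by node 1's children
rel = [
    np.eye(4),                                                  # root: stationary
    node_relative_motion(coarse_bases, [1.0]),
    node_relative_motion(fine_bases, [2.0, 0.0]),
    node_relative_motion(fine_bases, [0.0, 3.0]),
]
world = compose_world_motions(parent, rel)
leaf_pos = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])        # canonical leaf positions
means = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
deformed = deform_gaussian_means(means, leaf_pos, world[2:], k=2)
```

In this toy setup the coarse node contributes a shared translation to both leaves, while each leaf adds its own finer offset through its coefficients, which mirrors the coarse-to-fine decomposition described above.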

Comparisons

Comparisons of dynamic 3D reconstruction from monocular videos.

Columns shown in the video comparison: Input, Ground Truth, HyperNeRF, Marbles, SoM, Ours.

BibTeX

@inproceedings{liang2025himor,
    author    = {Liang, Yiming and Xu, Tianhan and Kikuchi, Yuta},
    title     = {{H}i{M}o{R}: Monocular Deformable Gaussian Reconstruction with Hierarchical Motion Representation},
    booktitle = {CVPR},
    year      = {2025},
}

CVPR 2025

Given a single video clip, HiMoR enables novel view synthesis with significant viewpoint changes.
