This is a Plain English Papers summary of a research paper called Video Normals from Diffusion: Detail & Consistency for Open-World Footage. If you like this kind of analysis, you should join AImodels.fyi or follow us on Twitter.

Introduction: Consistent Surface Normals from Videos - Why It Matters

Surface normals serve as fundamental descriptors of 3D scene geometry, providing crucial information about the orientation of surfaces. They enable a wide range of applications, including 3D reconstruction, relighting, video editing, and mixed reality. Despite significant progress in estimating surface normals from static images, ensuring temporal consistency in video-based normal estimation remains challenging due to variations in scene layout, illumination, camera motion, and scene dynamics.

Recent approaches to surface normal estimation from monocular images have embraced both discriminative and generative paradigms. While discriminative approaches are limited by training data scale and quality, generative methods like StableNormal leverage pre-trained diffusion priors to achieve state-of-the-art performance on open-world images. However, these methods are designed for static imagery and often produce temporal inconsistency or flickering when applied to videos.

NormalCrafter addresses this gap by generating temporally consistent normal sequences with rich, fine-grained details from unconstrained open-world videos of arbitrary lengths. Instead of incrementally adding temporal layers to image-based estimators, the model harnesses the inherent temporal priors of video diffusion models for a more robust approach.

Figure 1: NormalCrafter generates temporally consistent normal sequences with fine-grained details from open-world videos of arbitrary lengths. Compared to results from state-of-the-art image normal estimators such as Marigold-E2E-FT, its results exhibit both higher spatial fidelity and temporal consistency, as shown in the frame visualizations and temporal profiles (marked by the red lines and rectangles).

The key innovations of NormalCrafter include:

  1. Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues, encouraging the model to focus on the intrinsic semantics of the scene
  2. A two-stage training protocol that leverages both latent and pixel space learning to preserve spatial accuracy while maintaining long temporal context

Through extensive evaluation, NormalCrafter demonstrates superior performance in generating temporally consistent normal sequences with intricate details from diverse videos, significantly outperforming existing methodologies.

Related Work: Building on Previous Research

NormalCrafter builds upon three primary research areas: discriminative surface normal estimation, diffusion-based surface normal estimation, and video diffusion models.

Discriminative surface normal estimation has evolved from early approaches using hand-crafted features to deep learning methods. Early work used classification with discretized normals, while modern CNN-based approaches have drastically improved performance. Recent advancements include DSINE, which incorporates per-pixel ray directions and models relationships between neighboring normals, providing a strong baseline for surface normal estimation.

Diffusion-based surface normal estimation represents the cutting edge in this field. Methods like Marigold fine-tuned Stable Diffusion for dense prediction tasks conditioned on images. Concurrently, GeoWizard fine-tuned SD to output both depth and normal maps. To address the computational overhead of iterative denoising, some works replaced multi-step denoising with a single-step approach, sacrificing detailed geometry. Despite their strong priors, these methods overlook temporal context and often produce flickering artifacts in videos.

Video diffusion models have seen significant advances in generating temporally coherent frames. Building on Latent Diffusion Models, recent approaches have added temporal convolution and attention layers to enable high-quality video generation. Stable Video Diffusion (SVD) further refined this approach and serves as a model prior for diverse video-related tasks, including NormalCrafter.

Figure 2: Naively repurposing video diffusion models, e.g. SVD, for normal estimation (Ours w/o SFR) produces over-smoothed predictions due to insufficient high-level semantic cues in SVD features. By leveraging Semantic Feature Regularization (SFR) to align diffusion features with DINO, NormalCrafter yields sharper and more fine-grained normal predictions.

While prior work such as Learning Temporally Consistent Video Depth has made progress on consistent video depth estimation, NormalCrafter focuses specifically on normal estimation, which requires preserving the high-frequency, semantics-driven details inherent in surface normals.

Method: The NormalCrafter Framework

NormalCrafter is a video normal estimator derived from video diffusion models (VDMs). Given a video c ∈ ℝ^(F×W×H×3) with F frames, the objective is to generate a normal sequence n ∈ ℝ^(F×W×H×3) that is both spatially accurate and temporally consistent.

Figure 3: Overview of NormalCrafter. The video normal estimation task is modeled with a video diffusion model conditioned on input RGB frames. Semantic Feature Regularization (SFR) aligns the diffusion features with robust semantic representations from a DINO encoder, encouraging the model to concentrate on the intrinsic semantics of the scene for accurate and detailed normal estimation. The training protocol consists of two stages: 1) training the entire U-Net in the latent space with diffusion score matching and SFR; 2) fine-tuning only the spatial layers in pixel space with an angular loss and SFR.

Normal Estimator with VDMs

Modern VDMs typically operate in a compressed latent space using a Variational Autoencoder (VAE) for efficient encoding and decoding of video frames. Since normal maps share the same dimensions as RGB frames, NormalCrafter utilizes the same VAE for both normal maps and the corresponding video.

The VAE encoder and decoder are defined as:

z^x = ℰ(x), x̂ = 𝒟(z^x)

where ℰ and 𝒟 denote the encoder and decoder, x may represent either normal maps n or video frames c, and x̂ is the reconstructed counterpart of x. Since the VAE is pre-trained on RGB frames, which makes it suboptimal for normal maps, NormalCrafter fine-tunes the VAE decoder on normal data to improve reconstruction quality.
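
To make the shared-VAE formulation concrete, here is a minimal PyTorch-style sketch of encoding and decoding normal maps with a diffusers-style `AutoencoderKL`; the function names and the assumption that normals are already scaled to the VAE's expected [-1, 1] range are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def encode_normals(vae, normals):
    # normals: (F, 3, H, W) unit normal vectors in [-1, 1], the same value
    # range the RGB-pretrained VAE expects, so its encoder can be reused as-is.
    with torch.no_grad():
        latents = vae.encode(normals).latent_dist.sample()
    return latents * vae.config.scaling_factor  # standard latent scaling

def decode_normals(vae, latents):
    # The decoder is the part NormalCrafter fine-tunes on normal data,
    # so gradients are allowed to flow through it during that stage.
    decoded = vae.decode(latents / vae.config.scaling_factor).sample
    # Re-normalize per pixel so every output vector is a valid unit normal.
    return F.normalize(decoded, dim=1)
```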

In the diffusion framework, normal estimation is formulated as a transformation between a simple noise distribution and a target data distribution p(z^n|z^c) conditioned on the input video latents z^c. The forward diffusion process injects Gaussian noise with increasing variance into the latent normal sequence, while the reverse denoising process gradually transforms noise into the target normal maps.

Similar to approaches in VCR-Gaus, this framework allows the model to effectively capture the distribution of normal maps conditioned on input videos.
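
As a rough illustration of this conditional formulation, the sketch below shows one latent-space denoising training step. The channel-wise concatenation of video latents and the noise-prediction target are simplifying assumptions; the actual SVD-based objective (diffusion score matching with its own noise schedule) differs in detail.

```python
import torch
import torch.nn.functional as F

def denoising_step(unet, scheduler, z_n, z_c):
    # z_n: clean normal-map latents (F, C, h, w); z_c: video latents (condition).
    noise = torch.randn_like(z_n)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=z_n.device)
    noisy_z_n = scheduler.add_noise(z_n, noise, t)
    # Condition the U-Net by concatenating video latents along channels
    # (one common way to inject per-frame image conditions).
    model_input = torch.cat([noisy_z_n, z_c], dim=1)
    pred = unet(model_input, t).sample  # simplified call signature
    return F.mse_loss(pred, noise)      # denoising / score-matching loss
```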

Semantic Feature Regularization

When directly repurposing video diffusion models like SVD for normal estimation, the results often exhibit semantic ambiguity and over-smoothed predictions. As shown in Figure 2, SVD features struggle to preserve detailed geometric structures, such as the stone region in the background.

To address this issue, NormalCrafter introduces Semantic Feature Regularization (SFR), which aligns diffusion features with robust semantic representations from an external encoder (DINO). This alignment encourages the model to focus on intrinsic scene semantics, producing more accurate and detailed normal maps.

The SFR process works by:

  1. Deriving DINO features h_dino from input video frames
  2. Extracting intermediate features h_l from the l-th layer of the diffusion model
  3. Projecting these features into the DINO feature space using a learnable MLP
  4. Regularizing the projected features to align with DINO features by maximizing patch-wise cosine similarities

This approach introduces computational overhead only during training, with no extra costs during inference, making it highly efficient.
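
A minimal sketch of how such a regularizer could be implemented, assuming the U-Net features and the frozen DINO features have already been resampled to the same patch grid; the three-layer MLP mirrors the projector described in the experimental setup, but the layer widths and names are illustrative.

```python
import torch
import torch.nn.functional as F

class SFRHead(torch.nn.Module):
    """Projects diffusion U-Net features into the DINO feature space.
    Used only during training and discarded at inference, so it adds no runtime cost."""

    def __init__(self, unet_dim, dino_dim, hidden_dim=1024):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(unet_dim, hidden_dim), torch.nn.GELU(),
            torch.nn.Linear(hidden_dim, hidden_dim), torch.nn.GELU(),
            torch.nn.Linear(hidden_dim, dino_dim),
        )

    def forward(self, h_unet, h_dino):
        # h_unet: (N, P, unet_dim) patch features from the chosen U-Net layer
        # h_dino: (N, P, dino_dim) frozen DINO patch features for the same frames
        projected = self.mlp(h_unet)
        cos = F.cosine_similarity(projected, h_dino, dim=-1)  # (N, P)
        return 1.0 - cos.mean()  # loss that maximizes patch-wise cosine similarity
```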

Two-Stage Training Protocol

While training NormalCrafter in the latent space is feasible, it may not yield optimal results in terms of accuracy or efficiency. Fine-tuning the model end-to-end with single-step prediction directly in pixel space can achieve superior spatial fidelity, but this drastically increases memory requirements for video processing, especially for long sequences.

To balance the need for long temporal context modeling with high-precision spatial fidelity, NormalCrafter uses a two-stage training protocol:

  1. Stage 1: Training the full model in the latent space to effectively capture long-term temporal context. The sequence length is randomly sampled from [1,14], enabling the model to adapt to diverse video durations.

  2. Stage 2: Fine-tuning only the spatial layers in the pixel space to improve spatial accuracy while preserving the capacity for long sequence inference. During this stage, the sequence length is limited to [1,4] frames to ease GPU memory constraints.

This innovative approach allows NormalCrafter to benefit from end-to-end fine-tuning while maintaining its ability to process extensive sequences with consistent results.
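
One way to picture the protocol is the pseudocode below; `score_matching_loss`, `angular_loss`, `sfr_loss`, `freeze_temporal_layers`, and `predict_normals_pixel_space` are hypothetical helpers standing in for the components described above, while the frame-count ranges come directly from the text.

```python
import random

def training_step(model, vae, dino, batch, stage):
    if stage == 1:
        # Stage 1: full U-Net in latent space, long clips for temporal context.
        num_frames = random.randint(1, 14)
        loss = score_matching_loss(model, vae, batch, num_frames)
    else:
        # Stage 2: only spatial layers in pixel space, short clips to fit memory.
        num_frames = random.randint(1, 4)
        freeze_temporal_layers(model)
        pred = predict_normals_pixel_space(model, vae, batch, num_frames)
        loss = angular_loss(pred, batch["gt_normals"])
    # Semantic Feature Regularization is applied in both stages.
    return loss + sfr_loss(model, dino, batch)
```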

Experiment: Proving NormalCrafter's Effectiveness

Experimental Setup

NormalCrafter builds upon the SVD model, with a three-layer perceptron for feature projection in the SFR component. The model was trained in two stages: first training the U-Net for 20,000 iterations with a learning rate of 3×10^-5, followed by end-to-end fine-tuning for 10,000 iterations with a learning rate of 1×10^-5. The VAE decoder was fine-tuned separately for 20,000 iterations.
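
Summarized as a small configuration dictionary (values taken from the setup above; the key names are illustrative):

```python
TRAINING_CONFIG = {
    "stage1": {"scope": "full U-Net, latent space", "iterations": 20_000, "lr": 3e-5},
    "stage2": {"scope": "spatial layers, pixel space", "iterations": 10_000, "lr": 1e-5},
    "vae_decoder_finetune": {"iterations": 20_000},
    "sfr_projector": "three-layer perceptron",
}
```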

Training utilized five carefully selected datasets, including both single-frame and video data with high-resolution frames and ground-truth normal maps from synthetic environments:

  • Single-frame datasets: 49,494 images from Replica (indoor scenes) and 45,620 frames from 3D Ken Burns (outdoor scenes)
  • Video datasets: Hypersim (indoor scenes), MatrixCity (outdoor scenes), and Objaverse (object sequences)

Evaluations

NormalCrafter was evaluated on four benchmarks: NYUv2 and iBims-1 for single-image normal estimation, and ScanNet and Sintel for video sequences. Performance was measured using angular deviations between estimated normal maps and ground truth, with both mean and median angular errors reported alongside the proportion of pixels with errors below certain thresholds (11.25°, 22.5°, and 30°).
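
For reference, these metrics can be computed from per-pixel unit normals roughly as follows (a sketch; the masking of invalid pixels that the real benchmarks apply is omitted):

```python
import torch

def normal_metrics(pred, gt, thresholds=(11.25, 22.5, 30.0)):
    # pred, gt: (N, 3, H, W) unit normal vectors.
    cos = (pred * gt).sum(dim=1).clamp(-1.0, 1.0)
    err_deg = torch.rad2deg(torch.acos(cos)).flatten()
    metrics = {"mean": err_deg.mean().item(), "median": err_deg.median().item()}
    for t in thresholds:
        # Proportion of pixels whose angular error falls below the threshold.
        metrics[f"<{t} deg"] = (err_deg < t).float().mean().item() * 100.0
    return metrics
```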

NYUv2 [31]

| Method | mean ↓ | med ↓ | 11.25° ↑ | 22.5° ↑ | 30° ↑ | Rank ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| DSINE [2] | 16.4 | 8.4 | 59.6 | 77.7 | 83.5 | 4.8 |
| GeoWizard [15] | 18.6 | 12.0 | 46.4 | 76.1 | 83.0 | 7.0 |
| GenPercept [36] | 16.4 | 8.0 | 60.9 | 78.3 | 83.7 | 3.2 |
| StableNormal [37] | 17.7 | 10.3 | 54.2 | 78.1 | 84.1 | 4.6 |
| Lotus-D [16] | 16.2 | 8.4 | 59.8 | 78.0 | 83.9 | 3.4 |
| Marigold-E2E-FT [26] | 16.2 | 7.6 | 61.4 | 77.9 | 83.5 | 2.8 |
| Ours | 15.4 | 7.9 | 61.4 | 79.4 | 85.1 | 1.2 |

iBims-1 [22]

| Method | mean ↓ | med ↓ | 11.25° ↑ | 22.5° ↑ | 30° ↑ | Rank ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| DSINE [2] | 17.1 | 6.1 | 67.4 | 79.0 | 82.3 | 5.0 |
| GeoWizard [15] | 20.5 | 10.9 | 51.5 | 75.2 | 80.1 | 7.0 |
| GenPercept [36] | 16.3 | 6.3 | 69.5 | 81.1 | 84.1 | 2.4 |
| StableNormal [37] | 17.0 | 7.0 | 68.0 | 80.9 | 84.2 | 3.4 |
| Lotus-D [16] | 17.1 | 6.8 | 66.4 | 79.4 | 83.0 | 5.2 |
| Marigold-E2E-FT [26] | 15.8 | 5.5 | 69.9 | 80.6 | 83.9 | 1.8 |
| Ours | 16.1 | 5.9 | 68.9 | 80.4 | 83.7 | 3.0 |

ScanNet [9]

| Method | mean ↓ | med ↓ | 11.25° ↑ | 22.5° ↑ | 30° ↑ | Rank ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| DSINE [2] | 15.5 | 8.0 | 62.4 | 79.5 | 84.9 | 5.4 |
| GeoWizard [15] | 18.9 | 13.1 | 41.7 | 75.1 | 83.0 | 7.0 |
| GenPercept [36] | 14.5 | 7.2 | 66.0 | 81.8 | 86.7 | 3.4 |
| StableNormal [37] | 15.9 | 10.0 | 57.0 | 81.9 | 87.0 | 4.4 |
| Lotus-D [16] | 14.3 | 7.1 | 65.6 | 81.4 | 86.5 | 3.8 |
| Marigold-E2E-FT [26] | 14.1 | 6.3 | 67.6 | 81.7 | 86.4 | 2.6 |
| Ours | 13.3 | 6.8 | 67.4 | 82.9 | 87.9 | 1.4 |

Sintel [6]

| Method | mean ↓ | med ↓ | 11.25° ↑ | 22.5° ↑ | 30° ↑ | Rank ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| DSINE [2] | 34.9 | 28.1 | 21.5 | 41.5 | 52.7 | 4.6 |
| GeoWizard [15] | 37.6 | 32.0 | 11.7 | 32.8 | 46.8 | 6.4 |
| GenPercept [36] | 34.6 | 26.2 | 18.4 | 43.8 | 55.8 | 3.6 |
| StableNormal [37] | 38.8 | 32.7 | 17.9 | 36.1 | 46.6 | 6.6 |
| Lotus-D [16] | 32.3 | 25.5 | 22.4 | 44.9 | 57.0 | 2.0 |
| Marigold-E2E-FT [26] | 33.5 | 27.0 | 21.5 | 43.0 | 54.3 | 3.6 |
| Ours | 30.7 | 23.9 | 23.5 | 47.5 | 60.1 | 1.0 |

Table 1: Quantitative evaluations. The top section shows results on single-image benchmarks, while the bottom section shows results on video benchmarks. NormalCrafter outperforms existing approaches, particularly on video datasets.

The quantitative results in Table 1 demonstrate that NormalCrafter achieves state-of-the-art performance on all video datasets, surpassing existing approaches by a considerable margin. On the Sintel dataset, which features substantial camera motion and fast-moving objects, NormalCrafter outperforms the second-best method on every metric, reducing mean and median angular error by 1.6° each and raising the proportion of pixels below every error threshold.

Even on single-image benchmarks, NormalCrafter demonstrates competitive or superior performance, outperforming other methods on the NYUv2 dataset in terms of mean angular error and the proportions of pixels with errors below 22.5° and 30°.

Figure 4: Qualitative comparisons between StableNormal, Marigold-E2E-FT, and NormalCrafter on the DAVIS dataset and Sora-generated videos. The y-t slices (red boxes) highlight temporal consistency across frames.

The qualitative results in Figure 4 further illustrate NormalCrafter's superior performance. Compared to StableNormal and Marigold-E2E-FT, NormalCrafter produces normal maps that are not only more accurate spatially but also more consistent temporally, as evidenced by the smoother temporal profiles in the y-t slices.

Ablation Study

Several ablation studies were conducted to validate the contribution of each component in NormalCrafter.

ScanNet [9]

| Method | mean ↓ | med ↓ | 11.25° ↑ | 22.5° ↑ | 30° ↑ | Rank ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Ours w/o VAE-FT | 13.4 | 6.8 | 67.3 | 82.8 | 87.8 | 1.8 |
| Ours w/o Stage1 | 13.4 | 6.8 | 67.3 | 82.8 | 87.6 | 2.0 |
| Ours w/o Stage2 | 14.2 | 8.1 | 63.7 | 82.0 | 87.4 | 4.8 |
| Ours w/o SFR | 13.7 | 7.0 | 67.1 | 82.5 | 87.4 | 4.0 |
| Ours | 13.3 | 6.8 | 67.4 | 82.9 | 87.9 | 1.0 |

Sintel [6]

| Method | mean ↓ | med ↓ | 11.25° ↑ | 22.5° ↑ | 30° ↑ | Rank ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Ours w/o VAE-FT | 30.8 | 23.9 | 23.4 | 47.4 | 60.1 | 1.8 |
| Ours w/o Stage1 | 30.7 | 24.2 | 21.5 | 46.9 | 60.0 | 2.6 |
| Ours w/o Stage2 | 31.6 | 25.3 | 19.7 | 44.6 | 57.9 | 5.0 |
| Ours w/o SFR | 31.1 | 24.7 | 21.4 | 45.8 | 58.9 | 4.0 |
| Ours | 30.7 | 23.9 | 23.5 | 47.5 | 60.1 | 1.0 |

Table 2: Ablation study on the effectiveness of Semantic Feature Regularization (SFR), Two-Stage Training strategy, and fine-tuning VAE decoder.

Effectiveness of Semantic Feature Regularization (SFR): As shown in Table 2, NormalCrafter consistently outperforms the variant without SFR across all metrics on both ScanNet and Sintel datasets.

Figure 5: Ablation results with Semantic Feature Regularization (SFR). Red boxes highlight how SFR improves the detail and accuracy of normal predictions.

The qualitative comparison in Figure 5 further illustrates the benefits of SFR, demonstrating its capability to direct the diffusion model to concentrate on intrinsic semantics, thereby enabling more accurate and detailed normal predictions.

Effectiveness of two-stage training: The ablation results in Table 2 show that the model without Stage 2 (w/o Stage2) performs significantly worse in terms of spatial accuracy. On the other hand, although the model without Stage 1 (w/o Stage1) performs comparably with the full model in spatial accuracy, it falls short in temporal consistency.

Figure 6: Qualitative ablation results of the two-stage fine-tuning strategy. Without Stage 1, the model suffers from temporal inconsistency due to the limited number of frames used in training.

Influence of SFR location: The U-Net consists of four encoder blocks ("Down0-3"), one middle block ("Mid"), and four decoder blocks ("Up0-3"). The impact of SFR location was investigated by applying SFR at different layers.

ScanNet [9]

| Method | mean ↓ | med ↓ | 11.25° ↑ | 22.5° ↑ | 30° ↑ | Rank ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| w/o SFR | 13.7 | 7.0 | 67.1 | 82.5 | 87.4 | 7.0 |
| Down1 | 13.5 | 6.8 | 67.6 | 82.5 | 87.4 | 3.8 |
| Down2 | 13.6 | 6.8 | 67.2 | 82.4 | 87.3 | 6.4 |
| Down3 | 13.5 | 6.8 | 67.4 | 82.5 | 87.5 | 4.2 |
| Mid | 13.5 | 6.8 | 67.5 | 82.7 | 87.7 | 3.4 |
| Up0 | 13.4 | 6.8 | 67.5 | 82.8 | 87.8 | 2.4 |
| Up1 (Ours) | 13.3 | 6.8 | 67.4 | 82.9 | 87.9 | 2.0 |
| Up2 | 13.4 | 6.7 | 67.8 | 82.9 | 87.8 | 1.4 |

Sintel [6]

| Method | mean ↓ | med ↓ | 11.25° ↑ | 22.5° ↑ | 30° ↑ | Rank ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| w/o SFR | 31.1 | 24.7 | 21.4 | 45.8 | 58.9 | 6.2 |
| Down1 | 31.0 | 24.3 | 21.4 | 46.6 | 59.6 | 3.8 |
| Down2 | 31.2 | 24.9 | 21.2 | 45.7 | 58.7 | 7.8 |
| Down3 | 31.1 | 24.5 | 21.2 | 46.2 | 59.3 | 6.0 |
| Mid | 31.0 | 24.3 | 21.5 | 46.6 | 59.7 | 3.2 |
| Up0 | 30.7 | 23.8 | 21.5 | 47.5 | 60.5 | 1.4 |
| Up1 (Ours) | 30.7 | 23.9 | 23.5 | 47.5 | 60.1 | 1.4 |
| Up2 | 31.1 | 24.3 | 21.7 | 46.7 | 59.5 | 3.6 |

Table 3: Influence of SFR location. The table shows performance when applying SFR at different locations in the U-Net architecture, from "Down1" to "Up2".

As shown in Table 3, the performance improvement peaks at "Up1", indicating that the optimal location for SFR is in the middle of the network. This suggests that shallow layers primarily capture low-level information, while deeper layers have too few subsequent layers to effectively map semantics to normal maps.

Effectiveness of fine-tuning VAE: Fine-tuning the VAE decoder reduced the mean angular error from 5.75 to 4.07 and improved PSNR from 25.58 to 28.00. As shown in Table 2, this improvement in the decoder positively affects the training of the normal estimator.

Limitations

Despite achieving state-of-the-art performance in terms of spatial accuracy and temporal consistency in video normal estimation, NormalCrafter's large parameter size poses challenges for deployment on mobile devices. Future work could focus on optimizing the model's efficiency through techniques such as model pruning, quantization, and distillation.

Conclusion: Advancing Temporal Normal Estimation

NormalCrafter represents a significant advancement in video normal estimation, capable of generating temporally consistent normal sequences with fine-grained details for open-world videos of arbitrary lengths. The model achieves this through two key innovations:

  1. Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues to focus on intrinsic scene semantics and preserve detailed geometric structures
  2. A two-stage training protocol that balances the need for long temporal context with high spatial accuracy

Extensive evaluations demonstrate that NormalCrafter achieves state-of-the-art performance in open-world video normal estimation under zero-shot settings, significantly outperforming existing methodologies on both video and single-image benchmarks.

The research provides valuable insights for future investigations in this domain, particularly in balancing spatial accuracy with temporal consistency in video-based geometric understanding tasks.
