Rethinking Inductive Biases for Surface Normal Estimation

CVPR 2024 (Oral)

Gwangbin Bae Andrew J. Davison

Dyson Robotics Lab, Imperial College London

Paper arXiv Code

TL;DR

  • We discuss the inductive biases needed for surface normal estimation and propose to (1) utilize the per-pixel ray direction and (2) estimate the surface normals by learning the relative rotation between nearby pixels.
  • With the right inductive biases, models can be trained with far fewer images. Our model is trained on only 160K images, for 12 hours, on a single NVIDIA 4090 GPU. In comparison, Omnidata V2 (which is based on the DPT architecture) is trained on 12M images, for two weeks, on four NVIDIA V100 GPUs.

Demo

The input videos are from DAVIS. The predictions are made per-frame (we recommend watching in 4K).

Motivation

In recent years, the usefulness of surface normal estimation methods has been demonstrated in various areas of computer vision, including image generation, object grasping, multi-task learning, depth estimation, simultaneous localization and mapping, human body shape estimation, and CAD model alignment. However, despite the growing demand for accurate surface normal estimation models, there has been little discussion of the right inductive biases for the task.

What inductive biases do we need for surface normal estimation?

In this paper, we propose to use the per-pixel ray direction as an additional input to the network.

Ray direction provides an important cue for pixels near occluding boundaries, as the normal there should be perpendicular to the ray.

It also tells us the range of normals that can be visible from the camera, effectively halving the output space. We incorporate this bias by introducing a Ray ReLU activation.
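
A minimal sketch of this visibility constraint is given below (illustrative only; see the paper and the code release for the exact formulation). Here ray_relu is a hypothetical helper: wherever the predicted normal points along the viewing ray (i.e., faces away from the camera), it removes the along-ray component and re-normalizes, clamping the normal to be at most perpendicular to the ray.

    import torch
    import torch.nn.functional as F

    def ray_relu(normal, ray, eps=1e-6):
        """Clamp predicted normals to the half-space visible from the camera.

        normal: (B, 3, H, W) unit surface normals predicted by the network
                (convention: normals of visible surfaces point toward the camera)
        ray:    (B, 3, H, W) unit per-pixel ray directions (camera -> scene)

        Under this convention a visible surface satisfies dot(normal, ray) <= 0.
        Where this is violated, the component of the normal along the ray is
        removed and the result is re-normalized.
        """
        cos = (normal * ray).sum(dim=1, keepdim=True)   # dot(n, r) per pixel
        violating = (cos > 0).float()                   # 1 where the normal faces away
        projected = normal - cos * ray                  # remove the along-ray component
        out = violating * projected + (1.0 - violating) * normal
        return F.normalize(out, dim=1, eps=eps)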

We also propose to recast surface normal estimation as rotation estimation. At first, this may sound like we are over-complicating the task. Why should we estimate $\mathbf{R} \in SO(3)$, which has three degrees of freedom, instead of estimating $\mathbf{n} \in S^2$, which only has two degrees of freedom?

Let's start by parameterizing $\mathbf{R}$ using the axis-angle representation

$$\boldsymbol{\theta} = \theta \boldsymbol{e}$$

where the unit vector $\boldsymbol{e}$ represents the axis of rotation and $\theta$ is the angle of rotation.

For most pairs of pixels, $\theta$ would simply be $0$ or $\pm 90^\circ$. Moreover, the angle between the normals, unlike the normals themselves, is independent of the viewing direction, making it easier to learn.
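
To make the parameterization concrete, here is a minimal sketch (illustrative only; not the released training code) that applies an axis-angle rotation to a reference pixel's normal to obtain a neighboring pixel's normal, using Rodrigues' rotation formula:

    import math
    import torch

    def rotate_normal(n, axis, theta):
        """Rotate unit normals n by angle theta about unit axis e (Rodrigues' formula).

        n:     (..., 3) normal at a reference pixel
        axis:  (..., 3) unit rotation axis e
        theta: (..., 1) rotation angle in radians

        n' = n cos(theta) + (e x n) sin(theta) + e (e . n) (1 - cos(theta))
        """
        cos, sin = torch.cos(theta), torch.sin(theta)
        e_cross_n = torch.cross(axis, n, dim=-1)
        e_dot_n = (axis * n).sum(dim=-1, keepdim=True)
        return n * cos + e_cross_n * sin + axis * e_dot_n * (1.0 - cos)

    # Example: two walls meeting at a vertical edge. Rotating one wall's normal
    # by 90 degrees about the (vertical) edge direction gives the other wall's normal.
    n = torch.tensor([0.0, 0.0, -1.0])      # normal facing the camera
    axis = torch.tensor([0.0, 1.0, 0.0])    # vertical edge direction
    theta = torch.tensor([math.pi / 2])
    print(rotate_normal(n, axis, theta))    # ~[-1, 0, 0]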

Finding the axis of rotation is also straightforward. When two (locally) flat surfaces intersect at a line, the normals rotate around that intersection. As the image intensity generally changes sharply near such intersections, the task can be as simple as edge detection.
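
Concretely, if $\mathbf{n}_1$ and $\mathbf{n}_2$ are the normals of the two surfaces, the rotation axis is (up to sign)

$$\boldsymbol{e} = \frac{\mathbf{n}_1 \times \mathbf{n}_2}{\|\mathbf{n}_1 \times \mathbf{n}_2\|}$$

which is perpendicular to both normals and hence points along the 3D line where the two planes intersect; its projection is the edge we see in the image.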

Modeling the relative change in surface normals is not just useful for flat surfaces. In this example, the relative angle between the normals of the yellow pixels can be inferred from that of the red pixels by assuming circular symmetry.

Please refer to our paper for additional information on how to incorporate the aforementioned inductive biases.

Results

Here we provide a comparison between Omnidata V2 (left) and ours (right). The input images (shown at the top-left corner) are in-the-wild images from the OASIS dataset. Despite being trained on significantly fewer images, our model shows stronger generalization capability.

Acknowledgement

Research presented in this paper was supported by Dyson Technology Ltd. The authors would like to thank Shikun Liu, Eric Dexheimer, Callum Rhodes, Aalok Patwardhan, Riku Murai, Hidenobu Matsuki, and members of the Dyson Robotics Lab for insightful feedback and discussions.

BibTeX

@inproceedings{bae2024dsine,
    title={Rethinking Inductive Biases for Surface Normal Estimation},
    author={Gwangbin Bae and Andrew J. Davison},
    booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2024}
}