CVPR 2024 (Oral)

- We discuss the inductive biases needed for surface normal estimation and propose to (1) utilize the **per-pixel ray direction** and (2) estimate the surface normals by **learning the relative rotation between nearby pixels**.
- With the right inductive biases, models can be trained on far fewer images. Our model is trained on only **160K images, for 12 hours, on a single NVIDIA 4090 GPU**. In comparison, Omnidata V2 (which is based on the DPT architecture) is trained on **12M images, for 2 weeks, on four NVIDIA V100 GPUs**.

The input videos are from DAVIS. The predictions are made per-frame (we recommend watching in 4K).

In recent years, the usefulness of surface normal estimation methods has been demonstrated in various areas of computer vision, including image generation, object grasping, multi-task learning, depth estimation, simultaneous localization and mapping, human body shape estimation, and CAD model alignment. However, despite the growing demand for accurate surface normal estimation models, there has been little discussion on the right inductive biases needed for the task.

In this paper, we propose to use the **per-pixel ray direction** as an additional input to the network.
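As a rough illustration of what a per-pixel ray direction is, the sketch below computes a unit ray for every pixel of a pinhole camera. The function name `pixel_ray_directions`, the intrinsics values, and the axis convention (x right, y down, z forward) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def pixel_ray_directions(height, width, fx, fy, cx, cy):
    """Unit ray direction for every pixel of an ideal pinhole camera.

    Assumed convention (hypothetical, for illustration only):
    x points right, y points down, z points forward into the scene.
    """
    # u indexes columns, v indexes rows; both arrays have shape (height, width).
    u, v = np.meshgrid(np.arange(width, dtype=np.float64),
                       np.arange(height, dtype=np.float64))
    # Back-project each pixel onto the z = 1 plane, then normalize.
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

# Hypothetical intrinsics for a 640x480 image.
rays = pixel_ray_directions(height=480, width=640,
                            fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

The resulting `(height, width, 3)` array could be concatenated with the image or with intermediate features; how the paper actually feeds it to the network is described there, not here.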

Ray direction provides an important cue for the pixels near **occluding boundaries** as the normal should be **perpendicular to the ray**.

It also gives us the range of normals that would be **visible**, effectively **halving the output space**. We incorporate such a bias by introducing a **Ray ReLU** activation.
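The visibility constraint can be sketched as follows. This is only one possible way to enforce it, under the assumption that rays point from the camera into the scene, so a visible normal $\mathbf{n}$ satisfies $\mathbf{n} \cdot \mathbf{r} \le 0$. The function name `ray_relu` and its exact form are illustrative and may differ from the Ray ReLU defined in the paper.

```python
import numpy as np

def ray_relu(normals, rays, eps=1e-8):
    """Project predicted normals onto the visible half-space.

    normals, rays: arrays of shape (..., 3); rays are assumed to be unit
    vectors pointing from the camera into the scene (an assumption of this
    sketch, not a statement about the paper's implementation).
    """
    # Signed component of each normal along its ray.
    dot = np.sum(normals * rays, axis=-1, keepdims=True)
    # Part of the normal that is orthogonal to the ray.
    ortho = normals - dot * rays
    # Keep the ray-aligned part only if it faces the camera (dot <= 0).
    clamped = ortho + np.minimum(dot, 0.0) * rays
    # Renormalize; eps guards against a zero vector.
    norm = np.linalg.norm(clamped, axis=-1, keepdims=True)
    return clamped / np.maximum(norm, eps)
```

A normal that already faces the camera passes through unchanged, while one that points away is pushed onto the boundary where it is perpendicular to the ray.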

We also propose to recast surface normal estimation as **rotation estimation**. At first, this may sound like we are over-complicating the task.
Why should we estimate $\mathbf{R} \in SO(3)$, which has **three** degrees of freedom, instead of estimating $\mathbf{n} \in S^2$, which only has **two** degrees of freedom?

Let's start by parameterizing $\mathbf{R}$ using the **axis-angle** representation

$$\boldsymbol{\theta} = \theta \boldsymbol{e}$$

where the unit vector $\boldsymbol{e}$ represents the **axis** of rotation and $\theta$ is the **angle** of rotation.
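For readers who want to see how an axis-angle vector maps to a rotation matrix, here is a minimal sketch using the standard Rodrigues formula $\mathbf{R} = \mathbf{I} + \sin\theta\,\mathbf{K} + (1 - \cos\theta)\,\mathbf{K}^2$, where $\mathbf{K}$ is the cross-product matrix of $\boldsymbol{e}$. The function name `axis_angle_to_matrix` is hypothetical.

```python
import numpy as np

def axis_angle_to_matrix(theta_vec):
    """Convert an axis-angle vector (theta * e) to a 3x3 rotation matrix."""
    theta_vec = np.asarray(theta_vec, dtype=np.float64)
    theta = np.linalg.norm(theta_vec)
    if theta < 1e-12:
        # Zero rotation: the axis is undefined, return the identity.
        return np.eye(3)
    ex, ey, ez = theta_vec / theta
    # Cross-product (skew-symmetric) matrix of the unit axis e.
    K = np.array([[0.0, -ez, ey],
                  [ez, 0.0, -ex],
                  [-ey, ex, 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

# A 90-degree rotation about the z axis applied to a normal along x.
R = axis_angle_to_matrix(np.array([0.0, 0.0, np.pi / 2]))
rotated = R @ np.array([1.0, 0.0, 0.0])
```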

For most pairs of pixels, $\theta$ would simply be $0$ or $\pm 90^\circ$. Plus, the angle between the normals, unlike the normals themselves, is **independent of the viewing direction**, making it easier to learn.
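The view-independence claim can be checked numerically. In the sketch below, two hypothetical normals (a floor and a wall) are expressed in the camera frame, and the camera is then tilted by 30 degrees about its x axis. The normals themselves change, but the angle between them does not.

```python
import numpy as np

def angle_between(a, b):
    """Angle in degrees between two unit vectors."""
    return np.degrees(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))

# Two unit normals in the camera frame (hypothetical values).
n_floor = np.array([0.0, -1.0, 0.0])
n_wall = np.array([0.0, 0.0, -1.0])

# Tilting the camera about its x axis applies the same rotation
# to every normal expressed in camera coordinates.
c, s = np.cos(np.radians(30.0)), np.sin(np.radians(30.0))
R_view = np.array([[1.0, 0.0, 0.0],
                   [0.0, c, -s],
                   [0.0, s, c]])

n_floor_new = R_view @ n_floor
n_wall_new = R_view @ n_wall

angle_before = angle_between(n_floor, n_wall)
angle_after = angle_between(n_floor_new, n_wall_new)
```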

Finding the **axis** of rotation is also straightforward. When two (locally) flat surfaces intersect at a line, the normals rotate around that intersection. As the image intensity generally changes sharply near such intersections, the task can be as simple as edge detection.
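To make the geometry concrete, the following sketch takes the normals of two intersecting planes, recovers the rotation axis as their normalized cross product (which is the direction of the intersection line), and verifies that rotating one normal about that axis by the angle between them yields the other. The normal values and the helper name `rotate_about_axis` are made up for the example.

```python
import numpy as np

def rotate_about_axis(v, axis, angle):
    """Rotate vector v about a unit axis by angle (radians), Rodrigues form."""
    return (v * np.cos(angle)
            + np.cross(axis, v) * np.sin(angle)
            + axis * np.dot(axis, v) * (1.0 - np.cos(angle)))

# Normals of two flat surfaces meeting at a line (hypothetical values).
n1 = np.array([0.0, 0.0, -1.0])
n2 = np.array([-1.0, 0.0, 0.0])

# The intersection line is perpendicular to both normals,
# so its direction is the normalized cross product.
axis = np.cross(n1, n2)
axis = axis / np.linalg.norm(axis)

# Angle between the two normals.
angle = np.arccos(np.clip(np.dot(n1, n2), -1.0, 1.0))

# Rotating n1 about the intersection line by that angle gives n2.
n1_rotated = rotate_about_axis(n1, axis, angle)
```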

Modeling the relative change in surface normals is not just useful for flat surfaces. In this example, the relative angle between the normals of the yellow pixels can be inferred from that of the red pixels by assuming circular symmetry.

Please refer to our paper for additional information on how to incorporate the aforementioned inductive biases.

Here we provide a comparison between Omnidata V2 (left) and ours (right). The input images (shown at the top-left corner) are in-the-wild images from the OASIS dataset. Despite being trained on significantly fewer images, our model shows stronger generalization capability.

Research presented in this paper was supported by Dyson Technology Ltd. The authors would like to thank Shikun Liu, Eric Dexheimer, Callum Rhodes, Aalok Patwardhan, Riku Murai, Hidenobu Matsuki, and members of the Dyson Robotics Lab for insightful feedback and discussions.

@inproceedings{bae2024dsine,
  title={Rethinking Inductive Biases for Surface Normal Estimation},
  author={Gwangbin Bae and Andrew J. Davison},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}