ContraNeRF: 3D-Aware Generative Model via Contrastive Learning
with Unsupervised Implicit Pose Embedding
arXiv 2023
- Mijeong Kim Seoul National University
- Hyunjoon Lee Kakao Brain
- Bohyung Han Seoul National University
Abstract
Although 3D-aware GANs based on neural radiance fields have achieved competitive performance, their applicability is still limited to objects or scenes that have ground truths or prediction models for clearly defined canonical camera poses. To extend the scope of applicable datasets, we propose a novel 3D-aware GAN optimization technique through contrastive learning with implicit pose embeddings. To this end, we first revise the discriminator design and remove its dependency on ground-truth camera poses. Then, to capture complex and challenging 3D scene structures more effectively, we make the discriminator estimate a high-dimensional implicit pose embedding from a given image and perform contrastive learning on the pose embedding. Because the proposed approach neither looks up nor estimates camera poses, it can be applied to datasets in which the canonical camera pose is ill-defined. Experimental results show that our algorithm outperforms existing methods by large margins on datasets with multiple object categories and inconsistent canonical camera poses.
Method
Step 1: Discriminator. The pose-conditioned discriminator in (a), EG3D, takes camera pose information as input, so the ground-truth pose must be given for each training image. Our discriminator uses no such extra information; instead, it learns pose information in a self-supervised way, through contrastive learning on implicit pose embeddings.
Step 2: Contrastive learning on the pose embedding space. The ‘positive’ and ‘negative’ images are rendered from the same direction as the ‘anchor’ image and from a different direction, respectively. Training pulls the pose embeddings of positive pairs closer together than those of negative pairs.
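The objective in Step 2 can be illustrated with a minimal sketch. The text above only states that positive pairs should end up closer than negative pairs; the InfoNCE-style form, the cosine similarity, the `temperature` value, and the function names below are assumptions made for illustration, not the authors' exact loss.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pose_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss on pose embeddings (assumed form, hypothetical name).

    anchor, positive: embeddings of images rendered from the same direction.
    negatives: embeddings of images rendered from different directions.
    """
    pos_logit = cosine_similarity(anchor, positive) / temperature
    neg_logits = [cosine_similarity(anchor, n) / temperature for n in negatives]
    logits = [pos_logit] + neg_logits

    # Numerically stable log-sum-exp over the positive and all negatives.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )

    # Equals -log(exp(pos) / sum(exp(all))): small when the positive dominates.
    return log_denominator - pos_logit


anchor = [1.0, 0.0]
same_direction = [0.9, 0.1]
other_directions = [[0.0, 1.0], [-1.0, 0.0]]
loss = pose_contrastive_loss(anchor, same_direction, other_directions)
```

The loss shrinks as the anchor's embedding aligns with the positive and moves away from the negatives, which is the ordering of distances described above.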
Comparison
• EG3D: It can only be evaluated on a few datasets, e.g., human or cat face datasets, where ground-truth or estimated camera poses are available.
• PRNeRF (ours): The precursor of ContraNeRF. It is an interim, naive method that removes the dependency on ground-truth poses by using a pose regression loss. However, it sometimes fails to reconstruct 3D structures properly, especially on complex domains such as LSUN-Bedroom.
• ContraNeRF (ours): A simple but effective solution based on contrastive learning. It lets our model learn the 3D structure of scenes whose canonical poses are ill-defined because of heterogeneous geometric configurations.
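For contrast with the contrastive objective, the pose regression loss mentioned for PRNeRF can be sketched as follows. The comparison above does not give its exact form; the mean-squared-error form, the two-angle pose parameterization, and the name `pose_regression_loss` are assumptions for illustration only.

```python
def pose_regression_loss(predicted_pose, sampled_pose):
    """Mean squared error between two explicit poses (assumed form).

    predicted_pose: pose the discriminator regresses from a generated image.
    sampled_pose: pose that was sampled to render that image.
    Both are assumed to be low-dimensional, e.g. (yaw, pitch).
    """
    if len(predicted_pose) != len(sampled_pose):
        raise ValueError("poses must have the same dimensionality")
    squared_errors = [(p - s) ** 2 for p, s in zip(predicted_pose, sampled_pose)]
    return sum(squared_errors) / len(squared_errors)


loss = pose_regression_loss([0.30, -0.10], [0.25, 0.00])
```

A loss of this kind ties the discriminator to an explicit, low-dimensional pose target, whereas the contrastive variant only constrains relative distances between high-dimensional implicit embeddings.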