Background
With the recent progress in generative AI, the community sees new opportunities for 3D content generation that exploit neural network architectures pre-trained on large-scale 2D/3D datasets. This article aims to outline the major components that hundreds of recently published papers in this field work on and contribute to.
Introduction
Input and output
X-to-3D, in a nutshell, aims to convert some input information into 3D content. The input is mainly text or an image, where the image can be a single view or multiple views of the same physical object. The 3D output can be a small object or an indoor/outdoor scene, each of which poses different challenges. We focus mainly on the object level, since it has great value to industry and is arguably the easiest to handle, although some approaches can be extended to scenes for potential AR/VR applications.
The input side is straightforward: text and images can be encoded as latent features, and images can also serve directly as supervision for renderings in terms of RGB and depth. Note that an input image may additionally be associated with a camera pose.
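To make the encoding step concrete, below is a minimal sketch of mapping a text prompt and a reference image to latent features that later stages can condition on. It assumes the Hugging Face transformers library and an OpenAI CLIP checkpoint purely for illustration; no particular X-to-3D method prescribes this exact encoder, file name, or prompt.

```python
# Minimal sketch: encode a text prompt and a reference image into latent
# features with a CLIP-style encoder. The checkpoint and file name below
# are illustrative assumptions, not tied to any specific method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a ceramic mug with a blue handle"                  # hypothetical prompt
image = Image.open("reference_view.png").convert("RGB")      # hypothetical image

with torch.no_grad():
    text_inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    text_latent = model.get_text_features(**text_inputs)     # shape (1, 512)

    image_inputs = processor(images=image, return_tensors="pt")
    image_latent = model.get_image_features(**image_inputs)  # shape (1, 512)

# These latents can condition a 3D generator; the raw image (RGB, plus depth
# if available) can additionally supervise rendered views, optionally paired
# with a known camera pose.
```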
3D representation
While the input is fairly standardized, 3D content can be represented in several different forms: