An Overview of Image-to-3D, From Neural Rendering and Surface Reconstruction to Universal Generation
Prologue
This is a document I wrote last year (2025). It started because a few game developer friends were super interested in 3D generation. At the time, they were already using text-to-image tools in their workflows to generate various character models, props, and even 3DGS-based scene assets. They really wanted to understand how these productivity-liberating tools worked under the hood, so I hastily scribbled down this document for them, (hopefully) explaining things in plain English. It was written in a rush, very casually, and pretty much stream-of-consciousness—writing whatever came to mind.
During a recent holiday gathering, this topic came up again, and everyone dug this document out to reminisce about (and critique) it a bit. That’s when I realized AIGC is developing way too fast; in less than a year, we’ve basically gone through another Great Leap Forward in tech, and 3D reconstruction has sprinted straight into the era of large models lol. “Top-tier conference papers are flooding in endlessly…” Still, I don’t really have the energy for a complete overhaul. This time, I just made a few tweaks and additions to the original draft. It’s not a formal paper anyway, so I’m dumping it directly on Zhihu for our future reference (definitely not just using it as a free image host). The text is full of wild hot takes, so take them with a grain of salt.
[2026 Update - Start]
Image-to-3D Capability Test — Hunyuan 3D 3.1 Results: https://3d.hunyuan.tencent.com/
Input a picture of a truck, and output the truck mesh model:

Output Model - White Model (Clay):

Output Model - Topology:

Output Model - Textures:

Input a picture of Louis XVI (jk):

Output a complete Louis XVI, a gift from fate:

Input a photo of a horse sculpture, the “Eight Steeds” (testing freeform surfaces); output the white model and topology, along with something weird mixed in:

Input a photo of a barn scene (failure case), the output model has severe missing parts. This might be related to the larger scale of the scene, the fact that the input image doesn’t fully cover it, and the presence of many interfering elements:

In my opinion, the results are quite stunning. For small-scale objects, it’s virtually flawless and ready to use out of the box. (Off-topic: How do we protect the copyright of these generated models? I heard the official platform might embed hidden watermarks in the models for tracing? But adding watermarks to standard format models feels a bit unrealistic.) It’s basically unusable for large scenes, though for those, you could arguably brute-force it with methods like 3DGS. Last year I tested version 2.0, and it was nowhere near as good as this 3.1 version.
[2026 Update - End]
[2026 Edit]====== Deleted Hunyuan 2.0 results =================
Image-to-3D generally has five stages:
Stage 1: The Antiquity Era (Traditional Multi-View Geometry) [2010-2020]: Using SfM (Structure from Motion) and MVS (Multi-View Stereo) to recover camera poses from images, then generating dense point clouds. This is very mature now, with excellent open-source tools like COLMAP and OpenMVG, plus a series of works in visual SLAM (if you count that as reconstruction).
Stage 2: The Deep Learning Era [2018-2021]: Utilizing massive amounts of 2D data for monocular depth estimation, as well as experimenting with 3D GAN generation using GAN models.
Stage 3: The Neural Rendering Era [2020-2023]: Using NeRF’s differentiable volume rendering to overfit a single scene and output a neural radiance field. This subsequently spawned more efficient explicit/hybrid representations like 3DGS.
Stage 4: The 2D Prior (Diffusion) Era [2022-2024]: The explosive popularity of 2D diffusion models made researchers realize the feasibility of distilling 2D knowledge to supervise 3D generation. Aside from 3D-to-2D rendering, we could do “2D-training-3D” (the prime example being DreamFusion, which introduced the SDS loss).
Stage 5: The Native 3D Large Model (LRM/DiT) Era [2023-Present]: With the validation of 3D VAEs, the refinement of the DiT architecture, the accumulation of massive 3D data, and the widespread application of highly efficient structures like Triplanes, direct “3D supervising 3D” has become the mainstream. Image-to-3D has truly entered the feed-forward generation large model era.
The rest of this article is divided into these sections: Prerequisites, Neural Rendering, Surface Reconstruction, and Generative Models. The prerequisites will cover basic 3D representations, how to extract meshes from implicit spaces, and some mainstream model paradigms and architectures. The Neural Rendering section details the mechanics of NeRF and 3DGS. Surface Reconstruction discusses the attempts made to obtain high-quality meshes. The Generative Models section primarily summarizes the representative works from Stages 4 and 5 mentioned above.
1. Prerequisites
1.1 Basic 3D Representations
a. Explicit Representations: Polygon Meshes, Point Clouds, Voxels

Explicit representations (like meshes) are very intuitive to humans, but extremely difficult for neural networks to generate and optimize (because the number of vertices and faces, as well as their connectivity, are constantly changing and discontinuous).
b. Implicit Representations
Neural networks prefer continuous, differentiable expressions—namely, using implicit functions to represent 3D surfaces. This means defining a function $f(x,y,z)$ in 3D space. When the value of this function equals a specific constant $c$ (usually $0$), the set of these points forms the surface of the object (Level set).
SDF (Signed Distance Field) is a special type of implicit representation. For any point $(x, y, z)$ in space, the SDF outputs a scalar value $D$: its magnitude is the shortest distance from the point to the object’s surface, and its sign indicates whether the point is inside ($D < 0$) or outside ($D > 0$) the object; $D = 0$ means the point lies exactly on the surface (the zero-level set).
Depending on task requirements, SDFs can also be adapted:
- TSDF (Truncated SDF): Often, we only care about the details near the surface of the object; space that is too far away is meaningless. TSDF sets a truncation threshold $t$ and clamps the distance to $[-t, t]$ (often normalized to $[-1, 1]$); anything further away is simply truncated to save computation (commonly used in real-time 3D reconstruction with depth cameras like the Kinect).
- Occupancy Fields: The most aggressively simplified version. It doesn’t care about the distance; it only cares about “Yes” or “No”. If a point is inside or on the surface ($SDF \le 0$), the value is $1$; if it is outside, the value is $0$. (This introduces the concept of a field, which will be covered in Chapter 2.)
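The three variants differ only in how they post-process the same distance value. A toy numpy sketch for a sphere (the function names and constants here are illustrative, not from any particular library):

```python
import numpy as np

def sphere_sdf(p, center=np.zeros(3), radius=1.0):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(p - center, axis=-1) - radius

def tsdf(d, t=0.3):
    """Truncated SDF: clamp to [-t, t], then normalize to [-1, 1]."""
    return np.clip(d, -t, t) / t

def occupancy(d):
    """Occupancy: 1 if inside or on the surface (SDF <= 0), else 0."""
    return (d <= 0).astype(np.float32)

p = np.array([[0.0, 0.0, 0.0],   # center of the sphere
              [1.0, 0.0, 0.0],   # exactly on the surface
              [2.0, 0.0, 0.0]])  # outside
d = sphere_sdf(p)                # -> [-1.0, 0.0, 1.0]
```

Note how truncation and occupancy are lossy in opposite ways: TSDF keeps fine gradients near the surface and throws away the far field, while occupancy throws away all distance information.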
c. Parametric Representations
For example, parametric curves and surfaces like Bezier curves, Catmull-Rom splines, B-spline curves, NURBS, etc. However, these aren’t highly relevant to our topic today, so we’ll skip them.
Of course, NeRF and 3DGS are technically 3D representations too, but they are far from “basic” :)
1.2 Marching Cubes
TL;DR: Marching Cubes is an Isosurface Extraction scheme designed to reconstruct explicit geometric boundaries from a discrete Occupancy grid or scalar field. Various modern neural methods ultimately output an occupancy field or an SDF; you can think of it as simply plugging in this component to generate an explicit mesh.
2D Scenario: Reduces to Marching Squares (MS)
Marching Squares divides 2D space into a regular square grid. For each grid cell, the algorithm evaluates the function values of its four vertices and binarizes them based on their sign (positive or negative, representing outside or inside). The four vertices yield $2^4 = 16$ state combinations (middle of the image below). By querying a pre-built lookup table containing these 16 cases, the algorithm determines which edges of the grid boundaries have a sign change. Subsequently, exact boundary intersection coordinates are calculated based on linear interpolation, and these intersections are connected to form closed 2D isolines (like the far left of the image below).

3D Scenario: Marching Cubes (MC)
Space is discretized into a 3D grid, with the basic unit being a cube. Each cube has 8 vertices, resulting in $2^8 = 256$ state combinations. By applying rotation and mirror symmetry, these 256 combinations are simplified into 15 base cases (right side of the image above). The algorithm independently iterates through all cubes in space, using a lookup table to determine how triangle faces connect inside the cube. The local independence of this algorithm drastically increases computational speed, making it highly suitable for parallel processing.
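The two core operations—per-edge sign tests and linear interpolation of the crossing point—can be sketched in a few lines of numpy. This toy version finds crossings on the horizontal edges of a grid sampling a circle’s SDF; it deliberately skips the lookup table that would connect crossings into line segments:

```python
import numpy as np

def circle_sdf(x, y, r=1.0):
    return np.hypot(x, y) - r

# Sample the field on a regular 2D grid.
n = 33
xs = np.linspace(-1.5, 1.5, n)
grid = circle_sdf(xs[None, :], xs[:, None])  # grid[j, i] = f(xs[i], xs[j])

def edge_crossing(p0, p1, f0, f1):
    """Linearly interpolate the zero crossing between two grid vertices."""
    t = f0 / (f0 - f1)
    return p0 + t * (p1 - p0)

crossings = []
for j in range(n):
    for i in range(n - 1):
        f0, f1 = grid[j, i], grid[j, i + 1]
        if (f0 < 0) != (f1 < 0):  # sign change -> isoline crosses this edge
            p0 = np.array([xs[i], xs[j]])
            p1 = np.array([xs[i + 1], xs[j]])
            crossings.append(edge_crossing(p0, p1, f0, f1))

crossings = np.array(crossings)
radii = np.hypot(crossings[:, 0], crossings[:, 1])  # all close to 1.0
```

Even on a coarse 33×33 grid, linear interpolation places the crossings within about a thousandth of the true circle—which is exactly why MC output looks smooth but cannot reproduce sharp creases.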
Flaws and Improvement Schemes
- Flaw 1: Topological ambiguity exists during the execution of both Marching Squares and Marching Cubes. When vertices on the diagonal of a grid cell have the same state, while adjacent vertices have different states, the system lacks enough information to uniquely determine the topological connectivity of the faces (i.e., whether they should connect or disconnect—see the bottom middle of the image above for a 2D example). Among the 15 base cases of Marching Cubes, 6 are ambiguous. This can result in holes or incorrect topologies in the final generated mesh.
- Improvement: The Marching Tetrahedra algorithm solves this by changing how space is discretized. It divides 3D space into tetrahedra instead of cubes. Since a tetrahedron only has 4 vertices, the isosurface intersections within it are much simpler. This completely eliminates state combination ambiguity from a geometric and mathematical standpoint, ensuring the generated mesh has strict watertightness and correct topology.
- Flaw 2: Traditional Marching Cubes determines vertex positions via linear interpolation along the grid edges. This mechanism inherently has a smoothing effect, meaning the algorithm cannot accurately restore sharp features (like right angles or creases) from the original geometric model, often producing stair-step artifacts at the edges.
- Improvement: Dual algorithms, represented by Dual Marching Cubes and Dual Contouring (DC), restructured the vertex generation logic. Instead of placing vertices on the grid edges, these algorithms use internal intersection data (usually combined with normal data, by minimizing a Quadratic Error Function, QEF) to calculate a feature vertex inside each boundary-containing grid cell. Then, the algorithm connects the feature vertices of adjacent cells to form polygonal faces. This dual mechanism can capture and reconstruct sharp geometric edges with extreme precision.
How do these basic geometric representations convert between each other?
- Polygon Mesh → Point Cloud: First, perform polygon triangulation (ear clipping), then sample points on the triangular faces (based on barycentric coordinates).
- Implicit Surface → Mesh: Use the Marching Cubes algorithm to extract the isosurface.
- Point Cloud → Implicit Surface:
- Local fitting methods: MLS, RBF.
- Global methods: Poisson Surface Reconstruction.
- Voxel methods: TSDF-based fusion (e.g., KinectFusion).
- Point Cloud → Mesh:
- Indirect methods: Point Cloud → Implicit Surface → Mesh.
- Direct methods: Triangulation (Delaunay: suitable for 2D or locally parameterized scenes, along with other triangulation methods).
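As an example of the first conversion (polygon mesh → point cloud), here is a numpy sketch of uniform surface sampling via barycentric coordinates; the “sqrt trick” keeps the distribution uniform over each triangle, and the mesh here is a made-up unit square:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_on_triangles(verts, faces, n):
    """Uniformly sample n points on the surface of a triangle mesh."""
    v0, v1, v2 = (verts[faces[:, k]] for k in range(3))
    # Pick faces with probability proportional to their area.
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    idx = rng.choice(len(faces), size=n, p=areas / areas.sum())
    # Uniform barycentric coordinates via the sqrt trick.
    u = np.sqrt(rng.random(n))
    v = rng.random(n)
    b0, b1, b2 = 1.0 - u, u * (1.0 - v), u * v  # sums to 1
    return b0[:, None] * v0[idx] + b1[:, None] * v1[idx] + b2[:, None] * v2[idx]

# A unit square in the z=0 plane, made of two triangles.
verts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], float)
faces = np.array([[0, 1, 2], [0, 2, 3]])
pts = sample_on_triangles(verts, faces, 1000)
```

Without the `sqrt`, samples would cluster near one vertex; without area-weighted face selection, small triangles would be oversampled.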
1.3 Mainstream Model Paradigms and Architectures
MAE, VAE, GAN, Transformer, ViT, Diffusion, DiT
[2026 Edit]====== Deleted this section =================
2. Neural Rendering
All of this has to start with NeRF (Neural Radiance Field), published at ECCV in 2020. Neural Radiance Field = Neural + Radiance + Field.
- What is a Field: If there is a clearly defined physical quantity or value at every coordinate point in space (or time), this is called a “field”. For example, the temperature in your room—every point in space has a temperature value, which makes it a “temperature field” (scalar field); when the wind blows, every point has a wind speed and direction (both magnitude and direction), making it a “vector field”. A piece of audio (1D time field), a photograph (2D color field), magnetic/wind fields (3D vector fields), 3D parameterized surfaces (NURBS, explicit surfaces), and 3D Signed Distance Fields (SDF, implicit surfaces) are all fields.

- What is a Neural Field: If the values of a “field” are calculated and represented by a neural network, it’s called a “neural field”. Specifically, if the SDF mentioned earlier is implemented via a neural network—you input the 3D spatial coordinates $(x, y, z)$ and the network outputs a number (the distance from that point to the object’s surface)—then the mapping $f$ in the middle is realized by the network. This is a neural field, specifically called a Neural SDF.
- Why use Neural Fields? Natural signals in the real world are continuous waveforms; traditional explicit representations (pixels, voxels) are sampled, discrete signals that lose detail when magnified. A neural field, however, uses a neural network as a continuous parameterized function, which can smoothly fit true natural signals at arbitrary resolution.
- What is a Neural Radiance Field? Simply put: NeRF is a special type of neural field that uses a neural network to represent a 3D scene. You input a position and a viewing angle, and it outputs radiance information such as color and density (the so-called “radiance”). This is a neural radiance field. Let’s break it down in detail below.
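To make the neural-field interface concrete, here is a toy, untrained MLP in numpy: query a 3D coordinate, get back a scalar. The weights are random, so the output is meaningless—the point is only the parameterization (the field’s values live in the network weights, and the mapping is continuous and differentiable in the input):

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyNeuralField:
    """Toy MLP f: (x, y, z) -> scalar. Untrained; illustrative only."""
    def __init__(self, hidden=64):
        self.w1 = rng.normal(0.0, 0.5, (3, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.5, (hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, p):
        h = np.tanh(p @ self.w1 + self.b1)   # smooth, differentiable activation
        return (h @ self.w2 + self.b2)[..., 0]

f = TinyNeuralField()
d = f(np.array([[0.1, 0.2, 0.3], [0.5, 0.5, 0.5]]))  # one scalar per query
```

A trained Neural SDF is this same shape of function, just with weights fitted so that the output approximates the signed distance.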
2.1. What exactly is NeRF doing?
Novel View Synthesis. What is Novel View Synthesis?
- Input: A set of 2D RGB images with known camera parameters (position, orientation).
- Output: A representation of a 3D scene. Based on this representation, you can freely move a virtual camera to render photos from entirely new angles that were never captured.
2.2. How does NeRF do it?
Three core components: Neural Volumetric Representation + Differentiable Volume Rendering + Rendering-Based Optimization.
- Neural Volumetric Representation: Abandoning traditional meshes or voxels, it uses a continuous neural network to implicitly represent the entire 3D space.
- Differentiable Volume Rendering: How do you turn 3D space into a 2D photo? By “shooting rays” to sample and calculate colors. The key here is “differentiable,” meaning the errors generated during the rendering process can be directly passed back to the neural network for correction via backpropagation.
- Rendering-Based Optimization: The training logic is extremely pure. The network “guesses” an image (the generated photo) -> compares it with the real input image to calculate the error -> updates the network weights.
Q: What does “volumetric” mean here?
A: As long as a continuous density function $\sigma(x,y,z)$ is defined in 3D space, this space can be considered a volume, also known as a density field or volumetric field.

Let’s dismantle these components one by one:
2.2.1. Neural Volumetric Representation
NeRF strictly defines a 3D scene as a 5D to 4D mapping function $F_\Theta$:
- 5D Input: Contains the 3D spatial coordinates $(x, y, z)$ and the 2D viewing direction (yaw and pitch, i.e., $\theta, \phi$).
- Network Body: A very classic MLP (Multilayer Perceptron) with 8 fully connected layers, each having 256 channels.
- 4D Output: Outputs the emitted color $(r, g, b)$ of that spatial point from the specific viewing angle, as well as the volume density $(\sigma)$ of the point itself. (We’ll explain what volume density is later).
PS: Why do we need to include the viewing direction $(\theta, \phi)$? Because objects in the real world often have specular highlights, reflections, or metallic textures. For the exact same spatial point $(x, y, z)$, the color $(r, g, b)$ it reflects into your eyes will look different depending on whether you look at it from the left or the right. Introducing the viewing angle as an input gives NeRF the ability to express realistic, complex lighting and shadows.

2.2.2. Differentiable Volume Rendering
- Calculating Rays: To render a pixel on a 2D photo, we need to shoot a ray from the center of the virtual camera, passing through the screen position of that pixel, out into 3D space. The final color of the pixel is the accumulation of the colors of all the points along this ray.
- Modeling Rendering as a Probability Problem: Traditional rendering (like ray tracing) needs to calculate exactly which hard surface (specific polygon) a ray “hits”. But mathematically, this is discontinuous and non-differentiable (you either hit it or you don’t). NeRF treats the entire space as a cloud of semi-transparent fog (Volumetric). When a ray travels through space, there are no absolute collisions, only the accumulation of density. This “soft” continuous model perfectly matches the gradient-based optimization methods of neural networks.
- Replacing Traditional Meshes: There is no actual voxel grid storing data in space; all scene properties (color, density) are entirely compressed and encoded within the weights of that 8-layer MLP neural network.
Detailed Derivation of Volume Rendering

1. Defining the Ray Equation
First, define the equation for a ray emitted from the camera:
$$\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$$
Where $\mathbf{o}$ is the ray origin (camera position) and $\mathbf{d}$ is the emission direction. $t$ is a scalar representing distance or time; every point along the path can be represented by $t$ (the larger $t$ is, the further the ray has traveled).
2. Spatial (Fog) Properties
When the ray travels to distance $t$, it encounters two properties predicted by the NeRF neural network:
- Color $\mathbf{c}(t)$: The RGB color at that position.
- Volume Density $\sigma(t)$: Represents how thick this fog is. The rigorous mathematical definition is a probability—when the ray travels through an infinitesimally small step $dt$ near $t$, the probability of hitting a particle is $\sigma(t)dt$.
3. Transmittance $T(t)$ and the Continuous Rendering Integral
Suppose the ray encounters red fog at distance $t$; will it bring the color red back to the camera? Not necessarily! If there’s an extremely thick black wall right in front (closer to the camera), the ray would have been blocked entirely! Therefore, we need to calculate the Transmittance $T(t)$, which is the survival probability that the ray flies from the origin to distance $t$ without hitting anything along the way. Based on the logic “probability of surviving to $t + dt$ = probability of surviving to $t$ $\times$ probability of not hitting anything in $dt$”, we get $T(t + dt) = T(t) \times (1 - \sigma(t)dt)$. Through Taylor expansion and solving the differential equation, we get the transmittance formula:
$$T(t) = \exp\left(-\int_{t_0}^t \sigma(s)ds\right)$$
(Note: the survival rate of a ray reaching $t$ is the negative exponential of the total density accumulated along the way. The thicker the fog along the path, the exponentially faster the survival rate drops.)
To calculate the final pixel color $C$ on the screen, the ray at $t$ must simultaneously satisfy two conditions: survive to reach $t$ (probability $T(t)$) AND exactly hit a particle at $t$ (probability $\sigma(t)dt$). Accumulating the contributions of all positions along the entire ray gives us the continuous rendering equation:
$$C = \int_{t_n}^{t_f} T(t)\sigma(t)\mathbf{c}(t) dt$$
4. Discretization (Code Implementation)
Computers cannot directly calculate continuous and infinite calculus, so we need to convert the continuous formula into a discrete addition. We chop the ray into $N$ small segments (the $i$-th segment having a length of $\delta_i$):
- Opacity $\alpha_i$: The probability of the ray being blocked when passing through the $i$-th segment, $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$.
- Discrete Transmittance $T_i$: The probability of the ray surviving to reach the $i$-th segment (the product of the probabilities of all previous segments not blocking it), $T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)$.
- Final Discrete Summation Formula:
$$\hat{C} \approx \sum_{i=1}^N T_i \alpha_i \mathbf{c}_i$$
(Final Color = $\sum$ Survival Rate $\times$ Current Segment Opacity $\times$ Current Segment Color)
5. Rendering Weights $w$ and Depth Extension
The $T_i \alpha_i$ in the formula is called the Rendering weights.
- Occlusion Principle: If there’s an opaque wall in front (where $\alpha$ is huge somewhere), the transmittance $T_i$ for all points behind the wall will instantly drop to zero, and the weights will correspondingly drop to zero. This means anything behind the wall won’t participate in the color calculation.
- Depth Extraction: By multiplying the distance $t_i$ of each point along the path by its weight $T_i \alpha_i$ and summing them up, we can calculate approximately how far away the ray “hits a wall”. This is the principle behind how NeRF extracts a 3D depth map (specifically, it can be divided into expected depth and median depth).
- Physical Quantity Extension: By the same logic, if we replace $\mathbf{c}_i$ with any other physical quantity $\mathbf{v}_i$ in 3D space (such as semantics, normal vectors, etc.), we can render out corresponding feature maps.
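The whole discretized pipeline above—opacity $\alpha_i$, transmittance $T_i$, color accumulation, and expected depth—fits in a few lines. A numpy sketch for a single ray (the scene setup is invented to demonstrate occlusion):

```python
import numpy as np

def volume_render(sigmas, colors, deltas, ts):
    """Discrete volume rendering along one ray.
    sigmas: (N,) densities; colors: (N, 3); deltas: (N,) segment lengths;
    ts: (N,) sample distances. Returns (rgb, expected_depth, weights)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                         # alpha_i
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # T_i
    weights = trans * alphas                                        # T_i * alpha_i
    rgb = (weights[:, None] * colors).sum(axis=0)
    depth = (weights * ts).sum()                                    # expected depth
    return rgb, depth, weights

# A ray through empty space that hits an opaque red "wall" at t = 2,
# with a green region hidden behind it.
ts = np.linspace(0.0, 4.0, 9)
deltas = np.full_like(ts, 0.5)
sigmas = np.zeros_like(ts)
sigmas[4] = 50.0         # dense wall at t = 2.0
sigmas[6:] = 50.0        # dense green region behind the wall
colors = np.zeros((9, 3))
colors[4] = [1, 0, 0]    # red wall
colors[6:] = [0, 1, 0]   # green, should be invisible

rgb, depth, w = volume_render(sigmas, colors, deltas, ts)
```

Running this, `rgb` comes out essentially pure red and `depth` essentially 2.0: the transmittance collapses right after the wall, so the green samples get near-zero rendering weight—exactly the occlusion principle described above.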
2.2.3. Training Method
Calculate the synthesized image via forward propagation, then calculate the loss between the synthesized image and the real image, backpropagate the gradients (because the rendering process is differentiable), and optimize the MLP parameters.

Subsequently, academia has proposed a massive number of improvement schemes for NeRF. There are optimizations based on positional encoding, compressing and speeding up rendering, targeting unbounded scenes and unknown camera poses, processing dynamic scenes, dealing with complex lighting and few-shot generalization, and so on and so forth.
2.3 How does 3DGS do it?
Surface Splatting technology has actually been around since 2001: To solve the problem of sparse holes when rendering point clouds, each point in the point cloud is turned into a Gaussian kernel (in 1D it’s a Gaussian basis function, in 2D it’s an ellipse, in 3D it’s an ellipsoid). When countless such Gaussian kernels are packed together, their blurry edges blend into each other (continuous signal reconstruction), and the original gaps between points are perfectly filled. When we look at it from the camera’s perspective, it looks as if these 3D Gaussian spheres are squashed and splatted onto the 2D screen grid. To calculate a single pixel, NeRF has to sample hundreds or thousands of times along a ray, computing endless integrals. But with splatting technology, each Gaussian sphere is independent, and the GPU can process millions of spheres in parallel in one breath, making rendering blazing fast.

If splatting tech existed back in 2001, why did it take until 2023 to blow up? Firstly, because early on there was a lack of differentiable schemes. It wasn’t until a series of works represented by “Differentiable Surface Splatting” appeared around 2019, proving that the surface splatting rendering process is mathematically differentiable. This meant we could use backpropagation to let a computer look at 2D photos and automatically adjust the positions and attributes of hundreds of thousands of primitives in space. Secondly, the groundwork laid by NeRF. 3DGS cleverly inherited the Spherical Harmonics (SH) and alpha blending logic validated by NeRF, and upgraded the primitives to anisotropic 3D Gaussian kernels (with position, color, scale, and opacity all having fully differentiable degrees of freedom). The authors of 3DGS executed extreme engineering: the idea is very simple, the results are stunning, and they hand-rolled highly efficient CUDA rasterization operators based on Tile partitioning, perfectly solving the sorting and blending problem for millions of Gaussian kernels.
2.3.1 3DGS Training Process:

- Initialization: First, use traditional photogrammetry algorithms (SfM, like COLMAP) to extract a bunch of sparse 3D coordinate points from the photos.
- Generating 3D Gaussians: Turn every single sparse 3D point into a semi-transparent “3D Gaussian ellipsoid”.
- Rasterization and Gradient Backpropagation (Differentiable Tile Rasterizer): “Squash” these 3D spheres onto a 2D screen, compare them with real photos, and calculate the error. Then, via gradient backpropagation, adjust the shape, color, and position of each sphere.
- Adaptive Density Control: This is a very hand-crafted mechanism. During training, if a region is under-reconstructed (highly complex details like hair, or coverage that is too sparse), it will split a large Gaussian into several smaller ones or clone small ones to fill the gap; Gaussians that become almost completely transparent are pruned away.
The parameters for each 3D Gaussian sphere: $p$: Center position coordinates; $\Sigma$: Covariance matrix (controls scale and rotation); $o$: Opacity; $SH$: Spherical Harmonics (records different colors seen from different angles, i.e., reflection/specular effects).
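In the original 3DGS formulation, the covariance $\Sigma$ is not optimized directly; it is factored as $\Sigma = R S S^\top R^\top$ from per-axis scales and a rotation quaternion, which keeps it a valid (positive semi-definite) ellipsoid at every gradient step. A numpy sketch of that factorization (the example values are illustrative):

```python
import numpy as np

def quat_to_rot(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance(scale, quat):
    """Sigma = R S S^T R^T: symmetric PSD by construction, so every
    gradient step on (scale, quat) still yields a valid ellipsoid."""
    R = quat_to_rot(quat)
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

# An axis-aligned, pancake-shaped Gaussian: long in x, flat in z.
Sigma = covariance(np.array([2.0, 1.0, 0.1]), np.array([1.0, 0.0, 0.0, 0.0]))
```

Optimizing a raw symmetric matrix instead would let gradient descent wander into non-PSD matrices that no longer describe a Gaussian, which is why this parameterization matters.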
2.3.2 The Connection Between 3DGS and NeRF:
- Traditional Surface Splatting: Its color blending method is a weighted average ($c = \frac{\sum c_i w_i}{\sum w_i}$). This cannot handle “occlusion relationships”.
- 3DGS Splatting: It introduces the concept of Opacity ($o$), changing the formula to front-to-back Alpha compositing.
$$c = \sum_{i \in N} c_i o_i w_i \prod_{j=1}^{i-1} (1 - o_j w_j)$$

The $\prod_{j=1}^{i-1} (1 - o_j w_j)$ in the 3DGS formula is completely mathematically equivalent to NeRF’s “Transmittance $T(t)$”. In mathematical essence, 3DGS is just calculating NeRF’s volume rendering integral. It’s just that NeRF “samples step-by-step along a ray in continuous space”, whereas 3DGS directly turns space into discrete spheres and uses the GPU to directly compute the overlapping coloring of these spheres. The goal is identical, but the calculation speed of 3DGS is orders of magnitude faster. In other words, 3DGS is using an explicit 3D ellipsoid data structure to run the implicit NeRF volume rendering formula. It has both the extreme photorealism of NeRF (the ability to learn a scene) and the blazing fast rendering speeds of traditional polygons.
2.3.3 3DGS Rendering Steps:
- Sort: To correctly calculate the NeRF-like “transmittance” mentioned above, we must know who is in front and who is behind. So the first step is to line up the millions of 3D spheres from front to back based on depth.
- Splat: Project (squash) these 3D ellipsoids onto the 2D screen to calculate their 2D shapes.
- Blend: Using the Alpha compositing formula above, layer the sorted spheres on top of each other to calculate the final pixel color.
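The sort-and-blend loop for a single pixel can be sketched directly from the compositing formula above (the per-pixel $o_i w_i$ terms are collapsed into a single `alphas` array here for illustration):

```python
import numpy as np

def composite(depths, colors, alphas):
    """Front-to-back alpha compositing of splats covering one pixel.
    alphas: the per-pixel o_i * w_i of each projected Gaussian."""
    order = np.argsort(depths)        # Sort: nearest splat first
    rgb = np.zeros(3)
    T = 1.0                           # transmittance accumulated so far
    for i in order:                   # Blend
        rgb += T * alphas[i] * colors[i]
        T *= 1.0 - alphas[i]
        if T < 1e-4:                  # early termination once nearly opaque
            break
    return rgb

# A nearly opaque red splat in front of a green one.
depths = np.array([2.0, 1.0])
colors = np.array([[0.0, 1.0, 0.0],   # green, farther away
                   [1.0, 0.0, 0.0]])  # red, closer
alphas = np.array([0.9, 0.99])
pixel = composite(depths, colors, alphas)  # -> [0.99, 0.009, 0.0]
```

The running `T` here is exactly NeRF’s transmittance in discrete form; the real renderer does this per 16×16 tile with a GPU radix sort rather than per pixel in Python.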
3DGS is such a complete and refined piece of work that subsequent papers feel more like minor patches and tweaks. However, this purely hand-crafted “clone, split, prune” mechanism feels very rigid and clunky; it’s not as mathematically elegant as NeRF. Some steps feel like they were off-the-cuff engineering brainwaves tested via trial-and-error.
3. Surface Reconstruction
Surface reconstruction here aims to extract explicit Mesh models—ready for direct use in traditional rendering pipelines—from continuous neural fields (like volume density fields, SDFs) or discrete primitives.
3.1. Isosurface Extraction Based on Volume Density Fields
In the original NeRF architecture, the neural network predicts the color and volume density ($\sigma$) of any point in space. Volume density reflects the attenuation rate (or local termination probability) of a ray being blocked by physical entities at a certain point in space. Since the interior of an object usually has extremely high opacity while the exterior is free space, the location of an abrupt change in density can be treated as the object’s surface.
- Extraction Mechanism: By manually setting a fixed density threshold $\tau$, the continuous 3D space can be implicitly divided into occupied states ($\sigma > \tau$) and unoccupied states ($\sigma < \tau$). This division forms the Isosurface. Subsequently, by applying classic algorithms like Marching Cubes (MC), intersection points can be located within the grid space to generate triangular faces.
- Limitations: This density-based extraction scheme lacks explicit constraints on the underlying surface geometry. Due to the ill-posed nature of multi-view image reconstruction and view-dependent effects, the network tends to generate high-density discrete regions in free space to reduce photometric errors. As a result, the extracted meshes usually contain a large number of floaters, holes, and high-frequency noise. Furthermore, the selection of the threshold $\tau$ is highly empirical and lacks theoretical guarantees for generalization.
3.2. Shifting from Density to SDF
Compared to the flaws of volume density fields, the zero-level set of an SDF can represent continuous surfaces naturally, accurately, and in accordance with strict mathematical definitions.

- VolSDF: Attempts to introduce the SDF framework into volume rendering. This algorithm establishes an analytical mapping between the SDF and volume density through a Cumulative Distribution Function (CDF). However, its empirical model has a certain bias in the density integral estimation near the surface, which affects the recovery of fine structures.
- NeuS: Addressing the bias issue in VolSDF, NeuS proposes an unbiased volume rendering formulation. Specifically, while VolSDF essentially uses the CDF of the Laplace distribution, NeuS uses the Logistic distribution to derive its unbiased weights. Using the weight function induced by the SDF, NeuS ensures that the rendering weights during ray tracing computation strictly reach their global peak at the SDF zero-level set. This mathematical property greatly improves the smoothness and geometric fidelity of the reconstructed surface.

- Neuralangelo: Subsequent works accelerated the SDF optimization process by introducing hybrid representations (e.g., tensor-decomposition-based architectures). Taking this further, NVIDIA proposed Neuralangelo, which combines multi-resolution hash grid encoding (the Instant-NGP architecture) with a numerical-gradient optimization strategy. It overcomes the surface roughness caused by hash tables, making it possible to reconstruct surfaces with extremely fine detail and high fidelity directly from neural fields.
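The key trick shared by this family is an analytic map from SDF values to volume density, so the SDF can be trained through the standard volume rendering loss. A numpy sketch of the VolSDF-style mapping through the Laplace CDF (the constants `alpha` and `beta` are illustrative choices, not the paper’s settings):

```python
import numpy as np

def laplace_cdf(s, beta):
    """CDF of a zero-mean Laplace distribution with scale beta."""
    return np.where(s <= 0, 0.5 * np.exp(s / beta),
                    1.0 - 0.5 * np.exp(-s / beta))

def sdf_to_density(d, alpha=100.0, beta=0.05):
    """VolSDF-style mapping: sigma = alpha * CDF_Laplace(-sdf).
    Deep inside (d << 0) density saturates at alpha; far outside it decays
    to zero; beta controls how sharp the transition at the surface is."""
    return alpha * laplace_cdf(-d, beta)

d = np.array([-0.5, 0.0, 0.5])  # inside, on the surface, outside
sigma = sdf_to_density(d)       # ~[100, 50, ~0]
```

Because the map is smooth and invertible in its useful range, gradients from rendered pixels flow back into the SDF itself, which is what lets the zero-level set converge onto the true surface.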
3.3. Differentiable Mesh Extraction and Rasterization
Although SDF-based implicit methods improved surface quality, extracting the Mesh is still a post-processing step performed after forward rendering is complete. Traditional Marching Cubes algorithms are non-differentiable because they involve discrete vertex state judgments and topological combination enumeration. To enable the joint optimization of 3D mesh vertices directly using supervision signals from the 2D image space, differentiable mesh extraction technologies were proposed. These are often combined with differentiable rasterizers (like NVIDIA’s nvdiffrast) to achieve a fully differentiable pipeline from mesh generation to image rendering:

- DMTet (Differentiable Marching Tetrahedra): This algorithm abandons the traditional cubic grid and adopts tetrahedral spatial partitioning to eliminate topological ambiguity. DMTet not only predicts the SDF values of spatial nodes but also predicts the geometric deformation offsets of the vertices. By differentiably calculating the coordinates of isosurface intersections, the network can directly receive reprojection error gradients from the rasterized image, thereby finely adjusting the vertex positions.
- FlexiCubes: NVIDIA further proposed the FlexiCubes algorithm. It returns to the cubic partitioning paradigm but introduces higher flexibility in mesh parameterization design (introducing additional vertex weight parameters). Compared to DMTet, FlexiCubes demonstrates superior dynamic topological adaptability and can more sharply and accurately fit sharp features and complex boundaries in industrial-grade models.

3.4. Surface Reconstruction Based on Gaussian Splatting
Unlike NeRF, which relies on continuous implicit fields for ray tracing, 3D Gaussian Splatting uses an explicit collection of 3D Gaussian ellipsoids to represent the scene, achieving real-time rendering speeds. However, because the original 3DGS lacks explicit geometric constraints, the Gaussian ellipsoids often exhibit a highly disordered and overlapping distribution in space. Directly extracting a mesh from them (e.g., by converting them into a density field and applying the MC algorithm) results in surfaces that are extremely rough, full of noise, and suffer from obvious thickness artifacts. To solve this pain point, recent research has begun introducing strict surface constraints into the explicit Gaussian representation.

- SuGaR (Surface-Aligned Gaussian Splatting): The SuGaR algorithm introduces a clever regularization mechanism that forces the 3D Gaussian ellipsoids to flatten out and tightly align with the true surface of the scene during optimization. Through this alignment operation, SuGaR can convert the discrete Gaussian point cloud into a smooth surface indicator field. It then utilizes classic point cloud algorithms like Poisson Surface Reconstruction, and by binding the Gaussian primitives to the mesh faces for joint optimization, efficiently extracts a Mesh model that boasts both high-quality topological structure and real-time rendering capabilities.
- 2D Gaussian Splatting (2DGS): Huang et al. pointed out that 3D volumetric ellipsoids have inherent multi-view inconsistencies when representing extremely thin or flat surfaces. Therefore, 2DGS discards 3D ellipsoids and directly models the scene as 2D oriented discs (2D Surfels). This “dimensionality reduction” ensures that each Gaussian primitive naturally possesses clear surface properties and normal directions. Combined with depth distortion regularization and photometric consistency constraints, 2DGS can obtain extremely accurate geometric appearances and normal fields, fundamentally eliminating the thickness and floater artifacts of 3DGS to achieve high-fidelity mesh generation.
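As a toy illustration of the "flatten and align" idea shared by these methods (not the papers' exact loss terms), one can penalize each Gaussian's smallest scale axis and read a surfel normal off that axis:

```python
import numpy as np

def flatness_loss(scales):
    """Toy stand-in for a SuGaR-style regularizer: penalize the smallest
    axis of each Gaussian (rows of `scales`) so ellipsoids are driven
    toward flat 'pancakes' hugging the surface."""
    return np.minimum.reduce(np.abs(scales), axis=1).mean()

def surfel_normal(rotation, scales):
    """For a flattened Gaussian, the normal is the principal axis with the
    smallest scale (for a 2DGS disc this axis has scale ~0 by design)."""
    return rotation[:, np.argmin(scales)]

scales = np.array([[0.2, 0.3, 0.001]])   # one nearly flat Gaussian
R = np.eye(3)                            # principal axes aligned with the world frame
n = surfel_normal(R, scales[0])          # -> the z axis, (0, 0, 1)
```

Once every primitive carries a clean normal like this, feeding the oriented point set to Poisson Surface Reconstruction (as SuGaR does) becomes straightforward.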
[2026 Edit] Deleted the quantitative comparison of mesh quality.
4. 3D Generative Models
Directly outputting 3D representations from images (aka, inverse graphics) is obviously a highly ill-posed problem, as single-view inputs severely lack geometric structure and depth information. Earlier, we explained how to perform neural rendering using multi-view images and extract Meshes from volume density fields or SDFs. But this is merely a “per-scene, per-model” optimization process; it is essentially overfitting, strictly a laboratory product, and the industry would never use these schemes. When faced with few-shot or even single-view inputs, how exactly do we optimize? Can locally missing views be recovered? How can we make the model possess cross-instance generalization capabilities, generating entire 3D models directly from a single image or a piece of text? This is the core proposition of 3D generation. From early test-time optimization to today’s feed-forward generation, 3D reconstruction has begun to step into the era of large models.
4.1 Vision Models
Due to the scarcity of high-quality native 3D data, 3D generative models in their initial stages relied heavily on semantic and geometric priors provided by 2D vision models. In Image-to-3D tasks, a powerful 2D feature extraction capability is the absolute foundation for 3D reconstruction, and it was the only data that could be relied upon in the early days.
- Semantic and Representation Priors: Self-supervised learning-based Vision Transformers like MAE (Masked Autoencoder), CLIP, and DINOv2 can provide dense feature maps that are highly sensitive to object structures and semantic edges, acting as robust visual backbones.
- Geometric and Depth Priors: CroCo implicitly learns stereo vision through a cross-view masked autoencoder, providing downstream tasks with an occlusion-robust feature system. The emergence of Depth Anything (v1/v2/v3) enabled high-precision monocular depth estimation, providing deterministic spatial constraints for subsequent “2D-to-3D lifting”.
4.2 Early Attempts at Generalization / Few-Shot
With the assistance of 2D features, researchers started trying to break the limitation of “one network only recognizing one scene,” attempting to give models the ability to extrapolate and generalize.

- DeepSDF learns a class-level continuous signed distance function representation, embedding multiple shape instances into a shared latent space. During the inference phase, given a new observation (like a point cloud or partial geometry), one only needs to optimize a low-dimensional latent code to reconstruct the corresponding 3D shape. Although this method possesses cross-instance expressive capabilities to some extent, it still relies on iteratively optimizing the latent code for new samples. Therefore, strictly speaking, it hasn’t escaped the test-time optimization paradigm.
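The auto-decoder inference loop can be sketched with a deliberately tiny linear stand-in for DeepSDF's MLP (everything here is a toy assumption except the overall "freeze the decoder, optimize only the latent code" recipe):

```python
import numpy as np

# Frozen toy "decoder": SDF = w_z . z + w_x . x, a stand-in for DeepSDF's MLP,
# whose weights are shared across all shapes and frozen at test time.
w_z = np.array([1.0, -2.0, 0.5])
w_x = np.array([0.3, -0.1, 0.7])

def decoder(z, x):
    return w_z @ z + w_x @ x

# Observations from an unseen shape: a few (point, sdf) samples.
z_true = np.array([0.8, -0.4, 1.2])
points = [np.array([0.1, 0.2, 0.3]), np.array([-0.5, 0.0, 0.4])]
obs = [decoder(z_true, p) for p in points]

# Test-time optimization: fit only the low-dimensional latent code z.
z = np.zeros(3)
for _ in range(300):
    grad = np.zeros(3)
    for p, s in zip(points, obs):
        grad += 2 * (decoder(z, p) - s) * w_z   # analytic gradient for the toy decoder
    z -= 0.05 * grad / len(points)
```

After a few hundred gradient steps the latent code reproduces the observed SDF samples; this per-sample inner loop is exactly why DeepSDF still counts as test-time optimization.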

- MetaSDF goes a step further by introducing a meta-learning framework on top of DeepSDF. By simulating the process of “fast adaptation to new tasks” during training, it enables the model to reconstruct new shapes using just a few gradient steps when faced with new instances. This method significantly speeds up adaptation but inherently still relies on parameter updates at test time, thus still having limitations in real-time performance and inference efficiency. With the rise of neural rendering, works like MetaNeRF also introduced the MAML algorithm to NeRF, leveraging powerful prior initialization weights to converge with just a few fine-tuning steps under extremely sparse views.
However, whether it’s MetaSDF or MetaNeRF, the core bottleneck remains: they haven’t completely jumped out of test-time optimization, and inference efficiency remains constrained. To break through this, the research paradigm began shifting towards direct forward mapping:
- PixelNeRF introduces a “spatial projection” mechanism: each sampled point in 3D space is projected onto the input image to gather pixel-aligned features (extracted by a CNN), thereby achieving conditional modeling during NeRF’s volume rendering process. This design elegantly bridges 2D visual understanding (CNN) with 3D scene representation (NeRF) structurally. More importantly, both its CNN encoder and MLP rendering network are trained across large-scale multi-view datasets, giving the model significant cross-scene generalization capabilities: at test time, given one or a few input images of a novel scene, the model can synthesize new views in a single forward pass, without the time-consuming per-scene optimization (which usually takes hours) required by the original NeRF.
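The pixel-aligned conditioning step can be sketched as follows: project the 3D sample point into the image and bilinearly sample the feature map. This is a simplified single-view, camera-space version; the intrinsics and the feature map are made up for illustration:

```python
import numpy as np

def pixel_aligned_feature(feat, point_cam, focal, cx, cy):
    """Project a 3D point (camera coordinates) onto the image plane with a
    pinhole model, then bilinearly sample the CNN feature map there --
    the conditioning signal PixelNeRF adds to each volume-rendering sample."""
    u = focal * point_cam[0] / point_cam[2] + cx
    v = focal * point_cam[1] / point_cam[2] + cy
    x0, y0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - x0, v - y0
    # Bilinear interpolation over the four neighbouring feature vectors.
    return ((1 - du) * (1 - dv) * feat[y0, x0]     + du * (1 - dv) * feat[y0, x0 + 1] +
            (1 - du) * dv       * feat[y0 + 1, x0] + du * dv       * feat[y0 + 1, x0 + 1])

feat = np.arange(16.0).reshape(4, 4, 1)   # toy 4x4 feature map with 1 channel
f = pixel_aligned_feature(feat, np.array([0.0, 0.0, 2.0]), focal=2.0, cx=1.5, cy=1.5)
```

The sampled feature is then concatenated with the point's position and fed to the NeRF MLP, which is what makes the rendering conditional on the input view.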
4.3 3D-Aware Generation Based on Adversarial Learning (3D-aware GANs)
Most of the above works were limited to specific categories or simple scenes. To achieve freer generation, how could we leave out GANs? 3D-aware GANs also had their moments of glory:
- Early on, HoloGAN demonstrated the unsupervised learning of 3D scene representations purely from a collection of natural images, without any 3D annotations or multi-view constraints.
- Later, pi-GAN integrated NeRF and periodic activation functions into the GAN framework, achieving high-precision 3D-aware image synthesis.
- Then, EG3D proposed an efficient geometry-aware Triplane representation, drastically reducing computational complexity. Generating directly in 3D voxel space leads to a cubic explosion in computation, while generating point clouds/meshes faces topological irregularity issues. Triplanes project 3D space onto three orthogonal 2D planes (XY, YZ, XZ), and the features of sampled points are obtained via orthogonal projection, bilinear interpolation, and concatenation. This ingenious dimensionality reduction perfectly matched modern CNN and Transformer architectures, directly inspiring later large-scale models like LRM.
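A minimal sketch of the triplane lookup itself (resolution, channel count, and the sum-aggregation are illustrative assumptions; EG3D's generator and decoder details are omitted):

```python
import numpy as np

def bilerp(plane, u, v):
    """Bilinearly sample a (H, W, C) feature plane at continuous coords (u, v)."""
    x0, y0 = int(u), int(v)
    du, dv = u - x0, v - y0
    return ((1 - du) * (1 - dv) * plane[y0, x0]     + du * (1 - dv) * plane[y0, x0 + 1] +
            (1 - du) * dv       * plane[y0 + 1, x0] + du * dv       * plane[y0 + 1, x0 + 1])

def triplane_feature(planes, p, res):
    """Orthogonally project point p onto the three axis-aligned planes,
    sample each, and aggregate (summation here; concatenation also works).
    p is assumed normalized to [0, 1]^3; res is the plane resolution."""
    xy, xz, yz = planes
    u = lambda c: c * (res - 1)
    return (bilerp(xy, u(p[0]), u(p[1])) +
            bilerp(xz, u(p[0]), u(p[2])) +
            bilerp(yz, u(p[1]), u(p[2])))

res, C = 4, 2
planes = [np.ones((res, res, C)) * k for k in (1.0, 2.0, 3.0)]  # constant toy planes
f = triplane_feature(planes, np.array([0.5, 0.5, 0.5]), res)    # -> 1 + 2 + 3 per channel
```

Memory grows as O(3·N²·C) instead of O(N³·C) for a voxel grid, which is the whole point of the dimensionality reduction.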

However, these GANs were mostly focused on specific domains like human faces. When facing complex open-world scenes, limited by model capacity and mode collapse issues, their diversity and generalization hit a ceiling.
4.3 Optimization Generation Based on 2D Priors
(Note: the numbering here repeats 4.3, kept as in the original draft)
As 2D Diffusion Models demonstrated astonishing open-world generation capabilities, people naturally carved out a shortcut: using 2D diffusion models as “craftsmen” to supervise the optimization process of 3D scenes (maximum likelihood estimation, inferring scene parameters based on observations). The era of “2D-training-3D” thus began:

- DreamFusion: Turned pre-trained 2D text-to-image diffusion models into 3D supervision signals, evaluating the plausibility of rendered images via Score Distillation Sampling (SDS) and backpropagating gradients to update NeRF.

- Subsequently, Magic3D adopted a coarse-to-fine two-stage strategy, improving resolution and generation quality.

- Next, ProlificDreamer replaced SDS with VSD, attempting to alleviate over-smoothing and low diversity from a probabilistic modeling perspective.

- DreamGaussian swapped the SDS optimization target from NeRF to 3DGS, significantly boosting speed.
- GET3D took a different path: instead of “relying on diffusion to optimize 3D”, it directly trained a GAN to generate explicit textured meshes, proving that explicit meshes can also be directly produced through generative learning.
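The SDS gradient at the heart of the DreamFusion-style methods above can be sketched as follows (the noise schedule and denoiser here are toy stand-ins; in the real pipeline this gradient is pushed through a differentiable renderer into the NeRF/3DGS parameters via the chain rule):

```python
import numpy as np

def sds_grad(render, t, noise, denoiser, w=1.0):
    """Score Distillation Sampling, schematically: add noise to the rendered
    image, ask a frozen 2D diffusion model to predict that noise, and use the
    prediction error  w(t) * (eps_hat - eps)  as the gradient on the render."""
    alpha = np.cos(t * np.pi / 2)   # toy noise schedule (illustrative assumption)
    sigma = np.sin(t * np.pi / 2)
    noisy = alpha * render + sigma * noise
    eps_hat = denoiser(noisy, t)
    return w * (eps_hat - noise)

# A 'perfect' frozen denoiser that recovers the injected noise exactly would
# yield a zero gradient: the render already looks plausible to the 2D prior.
render = np.full((4, 4), 0.5)
noise = np.ones((4, 4))
perfect = lambda x, t: (x - np.cos(t * np.pi / 2) * render) / np.sin(t * np.pi / 2)
g = sds_grad(render, t=0.5, noise=noise, denoiser=perfect)   # -> all zeros
```

The Janus problem discussed below follows directly from this setup: each call samples one viewpoint independently, so nothing forces the per-view gradients to agree.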

These 2D-guided optimization schemes made open-vocabulary 3D generation possible, but a fatal flaw was also exposed: they are still quite time-consuming (measured in hours) and highly prone to the Janus problem. The so-called Janus problem refers to the fact that because 2D diffusion models lack native 3D consistency awareness, independent viewpoint sampling causes the supervision signals to “fight each other,” resulting in the generation of deformed objects with multiple heads, extra limbs, and abnormal structures.
4.4 Multi-View Diffusion and Joint Denoising
Faced with the Janus problem, relying solely on single-view 2D guidance was a dead end. The solution became: fine-tune 2D diffusion models to force them to spit out a “family of multi-view images” with strict geometric correlations in a single denoising pass.

- Zero-1-to-3 was a critical turning point. It transformed a single image into novel view images at specified camera poses, proving that diffusion models can learn relative viewpoint control; but it was still inherently single-view-to-single-view generation, so consistency was still limited.

- SyncDreamer: Introduced a 3D cost volume to inject an attention mechanism, forcing virtual views to align in 3D space.

- Wonder3D: Used a cross-domain attention mechanism to synchronously generate multi-view RGB and normal maps, leveraging local geometric priors from normals to enhance consistency.
- BoostDream: Leaning more towards engineering, it injected existing 3D priors into the multi-view diffusion process to refine existing 3D drafts.
These methods significantly alleviated structural chaos, but fundamentally, they are still 2D generation disguised as 3D; they are still “2D-training-3D”. Constrained by viewpoint coverage and camera conditioning settings, they haven’t completely solved the issues of slow optimization pipelines and tedious steps, failing to meet the industry’s demand for instant, “seconds-level model generation”.
4.5 Feed-Forward Large Reconstruction Models (LRMs)
The methods mentioned above are actually all optimization-based. Every time the network encounters a new object, it has to start from scratch, slowly “brewing alchemy” by calculating errors, which takes anywhere from tens of minutes to a few hours. The optimization pipeline is simply too slow, and the multi-view diffusion steps are too tedious, making it impossible to meet the real-time demands of the industry. Thus, braving the challenges of sparse viewpoints and extreme generalization pressure, 3D reconstruction finally (and somewhat forcefully) ushered in its own “Large Model” era: Large Reconstruction Models (LRMs).

Standard LRMs abandon per-scene optimization. They use pre-trained models like DINO to extract image tokens, then through a Cross-Attention-based Image-to-Triplane Transformer network, they directly predict the object’s triplane features in a single forward pass. Finally, these features are decoded by a tiny MLP into a NeRF for volume rendering.
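A single-head numpy sketch of the Image-to-Triplane cross-attention step (token counts, dimensions, and weights are all illustrative; real LRMs stack many such layers and train end-to-end):

```python
import numpy as np

def cross_attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head cross-attention: learnable triplane tokens (queries)
    gather information from the DINO image tokens (keys/values)."""
    Q, K, V = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V

rng = np.random.default_rng(0)
d = 16
img_tokens = rng.normal(size=(196, d))          # e.g. 14x14 patch tokens from DINO
tri_tokens = rng.normal(size=(3 * 32 * 32, d))  # learnable triplane positional tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
out = cross_attention(tri_tokens, img_tokens, Wq, Wk, Wv)
# `out` is then reshaped into three 32x32 feature planes and decoded by a small
# MLP into density/color for volume rendering -- all in one forward pass.
```

The key contrast with everything before it: there is no per-object loss loop anywhere, just this forward computation.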
[2026 Update] Deleted the subsequent descriptions of other LRMs.
Subsequent works like Large Gaussian Reconstruction Model, Long LRM, and RayZer have continuously refined the architecture. There are even acceleration works like Turbo3D that have pushed the generation pipeline to extreme speeds, thoroughly validating the absolute efficiency advantage of feed-forward reconstruction models.
But with speed scaling up, is the quality good enough? When facing complex topologies and industrial-grade material requirements, the information loss caused by 2D Lifting once again becomes the ceiling.
4.6 Generative Models Based on Direct 3D Supervision
Whether it’s SDS supervision or feed-forward Triplane prediction, cross-modal feature mapping (2D Lifting) is essentially using dimensionality-reduced projections to indirectly constrain a high-dimensional space. When encountering intricately distributed geometry and highly coupled textures, 2D priors often fall short, making it difficult to directly generate 3D assets with rigorous Physically Based Rendering (PBR) materials and clean topology.
As everyone knows, the key to the problem lies in the accumulation of high-quality data and the scaling of network architectures. With the open-sourcing of million-scale, high-quality 3D assets like Objaverse, and the astonishing Scaling Laws demonstrated by DiT (Scalable Diffusion Models with Transformers) in image generation, people realized: as long as the data is properly tokenized, Transformers can digest and refine massive amounts of native 3D geometry and textures.
Combining 3D Representations with Transformers
To apply DiT in 3D space, we first need to solve the input format for 3D data (3D VAE):
- Shap-E (OpenAI) abandons the traditional idea of directly generating a single NeRF or Mesh. Instead, it first trains a 3D encoder to map massive 3D assets into an implicit neural representation (containing implicit parameters for both NeRF and SDF). Subsequently, on this highly compressed latent space, it directly applies a Transformer-based Diffusion model for denoising.
- 3DShape2VecSet further proposes representing 3D shapes as a Set of Latent Vectors. This set-based representation naturally fits the Permutation Invariance of the Transformer’s unordered attention mechanism. When representing neural fields, it not only boasts an extremely high compression rate but also provides an excellent continuous Token input format for subsequent generative Diffusion models.
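A toy sketch of why a latent vector set pairs so naturally with attention: pooling a point cloud into a small set of latents via cross-attention is invariant to the ordering of the input points (all shapes and weights here are made up; the real encoder is much deeper):

```python
import numpy as np

def encode_to_vecset(points, queries, Wk, Wv):
    """Toy vector-set encoder: a fixed set of learnable query vectors
    cross-attends over the point cloud, pooling it into a small latent set.
    Attention is a weighted average over points, so the result does not
    depend on the order in which the points arrive."""
    K, V = points @ Wk, points @ Wv
    scores = queries @ K.T
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V

rng = np.random.default_rng(1)
pts = rng.normal(size=(256, 3))                 # toy point cloud
queries = rng.normal(size=(8, 4))               # 8 latent vectors of dim 4
Wk, Wv = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
z = encode_to_vecset(pts, queries, Wk, Wv)
z_shuffled = encode_to_vecset(pts[rng.permutation(256)], queries, Wk, Wv)
# Permutation invariance: the same latent set either way.
```

That invariance is exactly the Permutation Invariance property the paragraph above refers to, and it is what makes the latent set a clean continuous token format for a downstream DiT.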
Direct Diffusion within Native 3D Space
In fact, early works such as AutoSDF, Diffusion-SDF, and SDFusion have already fully validated the powerful potential of directly introducing generation and diffusion networks within native SDF or voxel spaces. Since the native 3D diffusion route works, and armed with the tokenization foundations mentioned above, researchers began to aggressively introduce the more powerful DiT architecture into native 3D space:

- DiT-3D: Explored converting 3D voxels or point clouds into serialized 3D Tokens and applying 3D self-attention mechanisms over 3D windows. This DiT architecture, which performs denoising directly in 3D space, proves the powerful potential of Transformers in processing pure 3D geometric features.

- Direct3D pushes “native 3D supervision” to even higher fidelity. Facing the geometric collapse frequently caused by 2D Lifting, Direct3D builds a DiT model directly on the 3D latent space (such as high-resolution 3D Voxel Latents or Triplanes), utilizing the geometric distribution of real 3D data for end-to-end diffusion training. What it generates is no longer an optical illusion based on multiple viewpoints, but solid geometry with continuous surfaces that can be directly exported as a high-quality Mesh.
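Schematically, "native 3D diffusion" just means the standard denoising objective is applied to a 3D latent rather than an image. A toy sketch with a voxel-latent grid (schedule and shapes are illustrative; several of the recent models use flow matching variants instead):

```python
import numpy as np

def diffusion_training_example(latent, t, noise, alpha_bar):
    """One denoising training example in a 3D latent space: noise the latent
    according to the schedule, and the DiT would be trained to predict
    `noise` from (noisy, t). No 2D rendering appears anywhere in the loss."""
    a = np.sqrt(alpha_bar[t])
    s = np.sqrt(1 - alpha_bar[t])
    noisy = a * latent + s * noise
    target = noise                       # epsilon-prediction target
    return noisy, target

rng = np.random.default_rng(2)
latent = rng.normal(size=(8, 8, 8, 4))   # toy voxel-latent grid (x, y, z, channels)
alpha_bar = np.linspace(0.999, 0.01, 1000)
eps = rng.normal(size=latent.shape)
noisy, target = diffusion_training_example(latent, 500, eps, alpha_bar)
# Sanity check: the clean latent is recoverable from (noisy, eps) under this schedule.
x0 = (noisy - np.sqrt(1 - alpha_bar[500]) * target) / np.sqrt(alpha_bar[500])
```

Because supervision lives in the same space as the geometry, the generated samples are solid fields that decode straight into watertight meshes, rather than a multi-view "optical illusion".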
Compared to the 2D world, the 3D world still severely lacks high-quality, large-scale data. This is actually the biggest bottleneck.
Discrete Mesh Generation Based on the Autoregressive Paradigm
Besides applying diffusion models in continuous fields or voxel spaces, another highly promising native 3D generation route is the autoregressive paradigm. Inspired by Large Language Models, researchers began to completely discretize 3D geometry, transforming mesh generation into a Next-Token Prediction task—much like playing a text continuation game. This paradigm can solve the pain points of dense meshes and chaotic topologies caused by the Marching Cubes algorithm in traditional surface reconstruction:

- MeshAnything: Models 3D mesh generation as an autoregressive sequence generation problem. Not only can it be conditioned on point clouds, voxels, or single images, but more importantly, it focuses on generating meshes at the level of human artists. This means it no longer generates a chaotic sea of triangular faces, but rather beautiful geometric structures with clean topology that highly conform to low-poly modeling standards, which can be seamlessly integrated into downstream rendering and physics engines.
- DeepMesh: Makes further breakthroughs based on autoregressive mesh generation by introducing a brand-new tokenization compression algorithm, and bringing Reinforcement Learning (specifically, Direct Preference Optimization, DPO) into the realm of 3D mesh generation. By combining 3D geometric evaluation metrics with human visual preferences as feedback signals, DeepMesh significantly improves the accuracy of autoregressive models when handling complex geometric details and maintaining strict topology, allowing the generated meshes to achieve an excellent balance between visual expressiveness and rigorous physical structure.
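A toy sketch of the serialization idea these methods build on: quantize vertex coordinates into discrete bins and flatten faces into one long token sequence, so generation becomes next-token prediction (real systems add learned VQ-style compression and careful face ordering on top of this):

```python
import numpy as np

def tokenize_mesh(vertices, faces, n_bins=128):
    """Quantize each coordinate of each face's vertices into one of n_bins
    discrete tokens and flatten everything into a single sequence --
    the 'text' that an autoregressive mesh model learns to continue."""
    q = np.clip(((vertices + 1) / 2 * (n_bins - 1)).round(), 0, n_bins - 1).astype(int)
    return [tok for f in faces for vid in f for tok in q[vid]]

def detokenize(seq, n_bins=128):
    """Inverse mapping back to (quantized) triangle vertex positions in [-1, 1]."""
    coords = np.array(seq).reshape(-1, 3, 3)        # faces x 3 vertices x xyz
    return coords / (n_bins - 1) * 2 - 1

verts = np.array([[-1.0, -1.0, 0.0], [1.0, -1.0, 0.0], [0.0, 1.0, 0.0]])
faces = [(0, 1, 2)]
seq = tokenize_mesh(verts, faces)    # 1 face x 3 vertices x 3 coords = 9 tokens
tris = detokenize(seq)               # round-trip error is at most one bin width
```

Because every face contributes a fixed, meaningful chunk of tokens, a Transformer can emit exactly the compact, artist-like topology the bullets above describe, one face at a time.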
Industry-Grade Asset Generation
The latest 3D generation algorithms are no longer just “visually plausible”; they now feature Regular Topology, high-fidelity geometric details, and rigorous physical materials (PBR), truly beginning to integrate into industrial pipelines:

CraftsMan3D (Li et al., CVPR 2024) proposes a generation paradigm that mimics the sculpting logic of a “craftsman”. To solve the mesh topology chaos caused by traditional generative methods, this framework first uses a 3D-Native DiT to directly fit the data distribution in the 3D Latent Space, generating rough geometry with a regular mesh structure in seconds. Subsequently, it introduces an interactive geometric optimizer based on normals, allowing high-precision sculpting of local surface details while ensuring watertightness and using the Winding Number for visibility checks. This decoupled architecture not only drastically improves the success rate of 3D reconstruction but also introduces highly industry-valuable interactive editing capabilities to native 3D generation for the first time.

Hunyuan3D 2.0 (Tencent et al., 2025) proposes a scalable, decoupled paradigm aimed at generating industry-grade, high-fidelity 3D assets. To bridge the quality gap between generative 3D content and traditional hand-sculpted models, this framework first utilizes the Hunyuan3D-DiT (a flow-matching-based diffusion Transformer) with tens of billions of parameters to directly learn 3D geometric priors from massive data. This stage accurately aligns with visual or text conditions to generate high-quality Bare Meshes with smooth surfaces and clean topology, laying an excellent geometric foundation for subsequent surface reconstruction and mesh extraction. Then, the framework introduces the Hunyuan3D-Paint texture synthesis model. Through the deep integration of a multi-view diffusion architecture and geometric priors, it not only achieves high-resolution texture mapping but also supports the direct generation of PBR materials, accurately restoring real-world light-matter interactions (like metallic reflections and subsurface scattering). This decoupled “separation of form and color” architecture effectively breaks down the optimization difficulty of 3D asset generation, significantly boosting the generation stability of complex geometry while constructing a plug-and-play workflow that seamlessly integrates with existing industry-standard graphics pipelines.
From relying on 2D vision models for assistance to self-strengthening through native 3D data, 3D generation is developing incredibly fast. Of course, compared to 2D images, the biggest bottleneck in the 3D world right now remains the extreme scarcity of high-quality, large-scale native data. (Note: There are also excellent commercial products on the market like Meshy AI, but because they are closed-source, their specific technical details cannot be verified at the moment.)
[2026 Update] Vincent Sitzmann published a blog post suggesting that explicit 3D representations might not be needed in end-to-end models anymore, and traditional CV is dying out: https://www.vincentsitzmann.com/blog/bitter_lesson_of_cv/. Whatever, AI will equally erode every industry. Talk is cheap, show me the token. Let’s just use this as the ending lol.

