🧑‍💻 Work done during an internship at Huawei.
DreamCAD turns text, images, and point clouds into editable parametric CAD surfaces. It is trained on over 1.3M meshes without any CAD annotations.
CADCap-1M is our companion dataset of 1M+ GPT-5–generated captions, making large-scale text-to-CAD research possible for the first time.
A unified framework combining differentiable surface representation with large-scale multimodal CAD generation.
Multimodal CAD generation from point-level supervision alone, without any CAD annotations. BReps are represented as rational Bézier patches, differentiably tessellated into meshes.
1M+ GPT-5–generated captions across ABC, Automate, CADParser & Fusion360 — the largest CAD captioning dataset for text-to-CAD research.
SOTA across text, image & point modalities on ABC and Objaverse — up to 70% lower Chamfer Distance and >75% user preference over all baselines.
DreamCAD's high-fidelity outputs can serve as a strong geometric prior for CAD topology recovery.
Each CAD model is represented as a set of bicubic rational Bézier patches defined by learnable control points and weights. Adjacent patches share boundary control points to ensure C⁰ continuity — producing connected, watertight, and directly editable surfaces exportable as STEP files via OpenCascade.
Given a set of Bézier patches \(\{S_k\}_{k=1}^K\), each patch is evaluated on a uniform \(r \times r\) grid in the UV domain. A rational Bézier surface of degree \((n, m)\) is defined by control points \(\mathbf{C} = \{c_{ij}\}\) and non-negative weights \(\mathbf{W} = \{w_{ij}\}\) as: $$S(u,v) = \frac{\sum_{i,j} B_i^n(u)\, B_j^m(v)\, w_{ij}\, c_{ij}}{\sum_{i,j} B_i^n(u)\, B_j^m(v)\, w_{ij}}$$ where \(B_i^n(u) = \binom{n}{i} u^i (1-u)^{n-i}\) are Bernstein basis functions and \((u,v) \in [0,1]^2\). For the bicubic case \(n = m = 3\). Adjacent grid points define quadrilateral cells split into triangles to form a locally consistent mesh. Since \(S(u,v)\) is differentiable with respect to both \(\mathbf{C}\) and \(\mathbf{W}\), the entire tessellation supports end-to-end gradient-based optimization via Chamfer Distance loss.
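The evaluation above can be sketched in a few lines of NumPy (an illustrative sketch, not the paper's training code; the function name `eval_rational_bezier` and its grid-resolution argument `r` are our own naming):

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein basis B_i^n(t) = C(n,i) * t^i * (1-t)^(n-i)."""
    return comb(n, i) * t**i * (1.0 - t)**(n - i)

def eval_rational_bezier(C, W, r=8):
    """Evaluate a rational Bezier patch on a uniform r x r UV grid.

    C: (n+1, m+1, 3) control points; W: (n+1, m+1) non-negative weights.
    Returns an (r, r, 3) array of surface samples S(u, v).
    """
    n, m = C.shape[0] - 1, C.shape[1] - 1
    u = np.linspace(0.0, 1.0, r)
    v = np.linspace(0.0, 1.0, r)
    Bu = np.stack([bernstein(n, i, u) for i in range(n + 1)])  # (n+1, r)
    Bv = np.stack([bernstein(m, j, v) for j in range(m + 1)])  # (m+1, r)
    # Numerator: sum_ij B_i(u) B_j(v) w_ij c_ij; denominator: same without c_ij.
    num = np.einsum('iu,jv,ij,ijk->uvk', Bu, Bv, W, C)
    den = np.einsum('iu,jv,ij->uv', Bu, Bv, W)
    return num / den[..., None]
```

Because every operation here is differentiable in both `C` and `W`, swapping NumPy for an autodiff framework immediately yields the end-to-end gradients the paper relies on.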
DreamCAD adopts a multi-stage pipeline. Sparse voxel representations from input meshes are first encoded into structured latents, then decoded into C⁰-continuous Bézier patches.
The VAE encodes each 3D mesh into structured latents by voxelizing to 32³ resolution and augmenting each active voxel with DINOv2 embeddings from 150 RGB and normal renders, SDF values, and voxel centers. A sparse Transformer encoder produces structured latents \((v_i, z_i)\), which are decoded into bicubic rational Bézier patches. The surface is optimized end-to-end via Chamfer, G1, and Laplacian losses: $$\mathcal{L} = \lambda_{\text{cd}}\,\texttt{CD}(\mathcal{X}_g, \mathcal{X}_d) + \lambda_{g1}\,\texttt{G1}(\mathcal{S}_d) + \lambda_{\text{lp}}\,\texttt{Laplacian}(\mathcal{M}_d) + \lambda_{\text{kl}}\,D_{\text{KL}}$$
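The Chamfer term of the loss can be illustrated with a naive pairwise implementation (a minimal NumPy sketch; the training code would use a batched, accelerated version, and the exact reduction used in the paper, sum vs. mean, is an assumption here):

```python
import numpy as np

def chamfer_distance(X, Y):
    """Symmetric Chamfer Distance between point sets X (N, 3) and Y (M, 3).

    CD(X, Y) = mean_x min_y ||x - y||^2 + mean_y min_x ||y - x||^2.
    Naive O(N*M) pairwise version, for illustration only.
    """
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # (N, M)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```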
Directly predicting Bézier control points from latents leads to disconnected or overlapping patches. To enforce C⁰ continuity structurally, DreamCAD initializes patches from sparse voxels using a flood-fill algorithm that removes internal quads, leaving only surface-facing quads. Each surface quad is converted into a bicubic Bézier patch by sampling a \(4 \times 4\) control-point grid via bilinear interpolation. Adjacent patches share boundary control points by construction — guaranteeing seamless, gap-free surfaces before any decoder refinement.
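The bilinear sampling step can be sketched as follows (illustrative; the helper name `quad_to_control_grid` and the corner ordering are our assumptions, but it shows why adjacent quads yield identical boundary control points by construction):

```python
import numpy as np

def quad_to_control_grid(corners):
    """Initialize a 4x4 bicubic control grid from a surface quad.

    corners: (4, 3) array ordered [p00, p01, p10, p11], i.e. the quad's
    corners at (u, v) = (0,0), (0,1), (1,0), (1,1). Control points are
    bilinearly interpolated at parameters {0, 1/3, 2/3, 1}, so two quads
    sharing an edge produce the same boundary control points: C0 continuity
    holds before any decoder refinement.
    """
    p00, p01, p10, p11 = corners
    t = np.array([0.0, 1/3, 2/3, 1.0])
    u, v = np.meshgrid(t, t, indexing='ij')  # each (4, 4)
    grid = ((1 - u)[..., None] * (1 - v)[..., None] * p00
            + (1 - u)[..., None] * v[..., None] * p01
            + u[..., None] * (1 - v)[..., None] * p10
            + u[..., None] * v[..., None] * p11)
    return grid  # (4, 4, 3)
```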
DreamCAD supports text, image, and point cloud inputs through a two-stage flow-matching framework. The first stage generates a coarse voxel grid from the input condition via a lightweight voxel flow model. The second stage predicts fine-grained SLAT features per active voxel, which the pretrained parametric decoder transforms into the final Bézier surface. For text-to-CAD, a LoRA-finetuned Stable Diffusion 3.5 bridges text to the image-to-CAD model, completing the multimodal pipeline in ~30s.
CADCap-1M is built on top of four large-scale CAD repositories — ABC, Automate, CADParser, and Fusion360 — spanning over 1M parametric CAD models across mechanical, industrial, and everyday object categories. For each model, four orthographic views are rendered in Blender and passed to GPT-5 alongside structured metadata extracted directly from the CAD files — including model names, hole counts, and relative dimensions. This metadata-augmented prompting grounds the language model in geometric reality, reducing hallucinations and producing precise, structure-aware captions such as "M3×8 bolt … cylindrical shank … central hex socket. Height is 1.9× width." The result is the largest CAD captioning dataset to date, enabling large-scale text-to-CAD research for the first time.
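Metadata-augmented prompting amounts to folding the extracted CAD metadata into the captioning request; a hypothetical sketch (the field names, wording, and `build_caption_prompt` helper are ours, not the exact prompt used for CADCap-1M):

```python
def build_caption_prompt(metadata, n_views=4):
    """Assemble a metadata-augmented captioning prompt (illustrative only).

    metadata: dict with hypothetical keys such as 'name', 'hole_count',
    and 'rel_dims', mirroring the kinds of fields extracted from CAD files.
    """
    lines = [
        f"You are given {n_views} orthographic renders of a CAD model.",
        f"Model name: {metadata.get('name', 'unknown')}",
        f"Hole count: {metadata.get('hole_count', 'unknown')}",
        f"Relative dimensions (H:W:D): {metadata.get('rel_dims', 'unknown')}",
        "Write a precise, structure-aware caption grounded in this metadata.",
    ]
    return "\n".join(lines)
```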
Caption quality is assessed via GPT-5 evaluation on 5K samples and user studies on 1K samples. Evaluators are shown four rendered views, metadata, and the caption, then rate both geometric and semantic accuracy. Overall, 95.8% (user) and 98.31% (GPT-5) of captions are judged correct — validating the reliability of metadata-augmented prompting.
Explore DreamCAD's multimodal CAD reconstructions across text, image, and point cloud inputs — alongside CADCap-1M's high-quality annotations spanning industrial parts.
Comprehensive benchmarks across point-, image- and text-to-CAD tasks on ABC and Objaverse — demonstrating DreamCAD's significant improvements in geometric fidelity and visual alignment. F1 and IR are scaled by 10², while CD, JSD, and MMD are scaled by 10³.
| Model | F1 ↑ | NC ↑ | CD ↓ | HD ↓ | JSD ↓ | MMD ↓ | IR ↓ |
|---|---|---|---|---|---|---|---|
| DeepCAD | 19.31 | 0.49 | 51.10 | 0.37 | 783.94 | 29.63 | 11.01 |
| CAD-Recode | 75.99 | 0.79 | 3.73 | 0.13 | 271.89 | 2.94 | 15.39 |
| Cadrille | 78.86 | 0.80 | 2.98 | 0.12 | 236.10 | 2.51 | 5.84 |
| DreamCAD (Ours) | 92.12 | 0.94 | 0.93 | 0.06 | 96.13 | 0.84 | 0.00 |
Qualitative comparison of DreamCAD against baselines across point cloud, image, and text-conditioned CAD generation.
While DreamCAD's outputs are editable via control points and weights, they lack complete CAD topology for production-level readiness. However, DreamCAD's high-fidelity geometric reconstruction can serve as a strong prior for topology recovery, which forms the basis of our future research.
Each Bézier patch from DreamCAD's output is represented as 16 control points with corresponding weights, encoded via a Transformer encoder. We finetune Qwen3-4B on 50K samples to convert these patch-based representations into structured NURBS sequences with full semantic topology, following the formulation of NURBGen. The resulting output is a valid BRep exportable as a standard STEP file.
Given DreamCAD's patch output, Qwen3-4B predicts a structured NURBS representation with knot vectors, degrees, poles, and weights — producing a complete, semantically valid BRep topology ready for downstream CAD workflows.
```bibtex
@article{dreamcad,
  title   = {DreamCAD: Scaling Multi-modal CAD Generation using Differentiable Parametric Surfaces},
  author  = {Mohammad Sadil Khan and Muhammad Usama and Rolandos Alexandros Potamias and Didier Stricker and Muhammad Zeshan Afzal and Jiankang Deng and Ismail Elezi},
  journal = {arXiv},
  year    = {2026}
}
```
Questions about DreamCAD, collaboration opportunities, or just want to say hi? Fill out the form and we'll get back to you.