
Scaling Multi-modal CAD Generation
using Differentiable Parametric Surfaces

1 DFKI 2 RPTU 3 Imperial College London 4 Huawei London Research Center

🧑‍💻 Work done during an internship at Huawei.

arXiv Paper Code 🤗 CADCap-1M
DreamCAD teaser

DreamCAD turns text, images, and point clouds into editable parametric CAD surfaces. It's trained on over 1.3M meshes without any CAD annotations. CADCap-1M is our companion dataset of 1M+ GPT-5–generated captions, making large-scale text-to-CAD research finally possible.

1.3M+
Training Meshes
Mesh-based training across 10 public datasets
1M+
GPT-5 Captions
Largest CAD captioning dataset to date
>75%
User Preference
Across multimodal generation tasks
15s-30s
Inference Time
Text-, image-, and point-to-CAD on a single H100
STEP
Output Format
Editable surfaces via control points & weights on CAD software

What DreamCAD Brings

A unified framework combining differentiable surface representation with large-scale multimodal CAD generation.

Bézier Patch-Based CAD

Each CAD model is represented as a set of bicubic rational Bézier patches defined by learnable control points and weights. Adjacent patches share boundary control points to ensure C⁰ continuity — producing connected, watertight, and directly editable surfaces exportable as STEP files via OpenCascade.

Live Demo Bicubic Bézier Surface — 4×4 Control Points

Given a set of Bézier patches \(\{S_k\}_{k=1}^K\), each patch is evaluated on a uniform \(r \times r\) grid in the UV domain. A rational Bézier surface of degree \((n, m)\) is defined by control points \(\mathbf{C} = \{c_{ij}\}\) and non-negative weights \(\mathbf{W} = \{w_{ij}\}\) as: $$S(u,v) = \frac{\sum_{i,j} B_i^n(u)\, B_j^m(v)\, w_{ij}\, c_{ij}}{\sum_{i,j} B_i^n(u)\, B_j^m(v)\, w_{ij}}$$ where \(B_i^n(u) = \binom{n}{i} u^i (1-u)^{n-i}\) are Bernstein basis functions and \((u,v) \in [0,1]^2\). For the bicubic case \(n = m = 3\). Adjacent grid points define quadrilateral cells split into triangles to form a locally consistent mesh. Since \(S(u,v)\) is differentiable with respect to both \(\mathbf{C}\) and \(\mathbf{W}\), the entire tessellation supports end-to-end gradient-based optimization via Chamfer Distance loss.
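The evaluation above can be sketched directly in NumPy. This is an illustrative implementation of the rational Bézier formula, not the authors' code; the grid resolution `r` and the bicubic degree follow the text.

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein basis B_i^n(t) = C(n,i) t^i (1-t)^(n-i)."""
    return comb(n, i) * t**i * (1.0 - t)**(n - i)

def eval_rational_bezier(C, W, r=8, n=3, m=3):
    """Evaluate a rational Bézier patch on a uniform r x r UV grid.
    C: (n+1, m+1, 3) control points, W: (n+1, m+1) non-negative weights.
    Returns (r, r, 3) surface samples."""
    u = np.linspace(0.0, 1.0, r)
    v = np.linspace(0.0, 1.0, r)
    Bu = np.stack([bernstein(n, i, u) for i in range(n + 1)])  # (n+1, r)
    Bv = np.stack([bernstein(m, j, v) for j in range(m + 1)])  # (m+1, r)
    # Numerator sums B_i(u) B_j(v) w_ij c_ij; denominator drops c_ij.
    num = np.einsum('ia,jb,ij,ijk->abk', Bu, Bv, W, C)
    den = np.einsum('ia,jb,ij->ab', Bu, Bv, W)
    return num / den[..., None]
```

Because every operation here is smooth in `C` and `W`, swapping NumPy for an autodiff framework makes the tessellation differentiable end-to-end, as the text describes.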

Bézier Tessellation

How DreamCAD Works

DreamCAD adopts a multi-stage pipeline. Sparse voxel representations from input meshes are first encoded into structured latents, then decoded into C⁰-continuous Bézier patches.

The VAE encodes each 3D mesh into structured latents by voxelizing to 32³ resolution and augmenting each active voxel with DINOv2 embeddings from 150 RGB and normal renders, SDF values, and voxel centers. A sparse Transformer encoder produces structured latents \((v_i, z_i)\), which are decoded into bicubic rational Bézier patches. The surface is optimized end-to-end via Chamfer, G1, and Laplacian losses: $$\mathcal{L} = \lambda_{\text{cd}}\,\texttt{CD}(\mathcal{X}_g, \mathcal{X}_d) + \lambda_{g1}\,\texttt{G1}(\mathcal{S}_d) + \lambda_{\text{lp}}\,\texttt{Laplacian}(\mathcal{M}_d) + \lambda_{\text{kl}}\,D_{\text{KL}}$$
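The dominant term in the loss is the Chamfer Distance between sampled ground-truth and decoded point sets. A minimal symmetric Chamfer Distance in NumPy (a sketch for clarity, not the authors' implementation, which would run batched on GPU):

```python
import numpy as np

def chamfer_distance(X, Y):
    """Symmetric Chamfer distance between point sets X (N, 3) and Y (M, 3):
    mean squared nearest-neighbour distance in both directions."""
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # (N, M)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

In the full objective this term is weighted by \(\lambda_{\text{cd}}\) and combined with the G1, Laplacian, and KL terms shown above.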

VAE Architecture

Directly predicting Bézier control points from latents leads to disconnected or overlapping patches. To enforce C⁰ continuity structurally, DreamCAD initializes patches from sparse voxels using a flood-fill algorithm that removes internal quads, leaving only surface-facing quads. Each surface quad is converted into a bicubic Bézier patch by sampling a \(4 \times 4\) control-point grid via bilinear interpolation. Adjacent patches share boundary control points by construction — guaranteeing seamless, gap-free surfaces before any decoder refinement.
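The bilinear sampling step can be sketched as follows. This is an illustrative version assuming each quad is given by its four corner vertices; because adjacent quads share corner vertices and the parameter samples are identical, the resulting boundary control points coincide by construction.

```python
import numpy as np

def quad_to_control_grid(corners):
    """Bilinearly interpolate a 4x4 control-point grid from the four
    corners of a surface quad, given in the order c00, c10, c01, c11.
    corners: (4, 3) array. Returns (4, 4, 3) control points."""
    c00, c10, c01, c11 = corners
    t = np.linspace(0.0, 1.0, 4)
    u, v = np.meshgrid(t, t, indexing='ij')  # u along axis 0, v along axis 1
    grid = ((1 - u)[..., None] * (1 - v)[..., None] * c00
            + u[..., None] * (1 - v)[..., None] * c10
            + (1 - u)[..., None] * v[..., None] * c01
            + u[..., None] * v[..., None] * c11)
    return grid
```

Two quads sharing the edge (c10, c11) therefore produce identical control points along that edge, which is exactly the structural C⁰ guarantee the text describes.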

Flood Fill Parametric Surface Initialisation

DreamCAD supports text, image, and point cloud inputs through a two-stage flow-matching framework. The first stage generates a coarse voxel grid from the input condition via a lightweight voxel flow model. The second stage predicts fine-grained SLAT features per active voxel, which the pretrained parametric decoder transforms into the final Bézier surface. For text-to-CAD, a LoRA-finetuned Stable Diffusion 3.5 bridges text to the image-to-CAD model, completing the multimodal pipeline in ~30s.
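The flow-matching stages are described only at a high level; a generic flow-matching sampler looks like the sketch below, assuming explicit Euler integration and a placeholder `velocity_fn` standing in for the learned velocity network (neither detail is specified in the text).

```python
import numpy as np

def sample_flow(velocity_fn, x0, cond, steps=50):
    """Generic flow-matching sampler: integrate dx/dt = v(x, t, cond)
    from t=0 (noise) to t=1 (data) with explicit Euler steps."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

In DreamCAD's pipeline this pattern would run twice: once for the coarse voxel grid, once for the per-voxel SLAT features.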

Conditional Generation

CADCap-1M Annotation Pipeline

CADCap-1M is built on top of four large-scale CAD repositories — ABC, Automate, CADParser, and Fusion360 — spanning over 1M parametric CAD models across mechanical, industrial, and everyday object categories. For each model, four orthographic views are rendered in Blender and passed to GPT-5 alongside structured metadata extracted directly from the CAD files — including model names, hole counts, and relative dimensions. This metadata-augmented prompting grounds the language model in geometric reality, reducing hallucinations and producing precise, structure-aware captions such as "M3×8 bolt … cylindrical shank … central hex socket. Height is 1.9× width." The result is the largest CAD captioning dataset to date, enabling large-scale text-to-CAD research for the first time.
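Metadata-augmented prompting can be sketched as below. The prompt wording and metadata keys (`name`, `hole_count`, `relative_dims`) are hypothetical; the exact prompt used for CADCap-1M is not reproduced here.

```python
def build_caption_prompt(metadata, view_paths):
    """Assemble an illustrative metadata-augmented captioning prompt
    from structured CAD metadata and rendered orthographic views."""
    lines = [
        "Describe this CAD model precisely using the four orthographic views.",
        f"Model name: {metadata.get('name', 'unknown')}",
        f"Hole count: {metadata.get('hole_count', 'n/a')}",
        f"Relative dimensions: {metadata.get('relative_dims', 'n/a')}",
        f"Views: {', '.join(view_paths)}",
    ]
    return "\n".join(lines)
```

Grounding the language model in extracted quantities like hole counts is what keeps captions such as "central hex socket. Height is 1.9× width" geometrically faithful.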

Caption quality is assessed via GPT-5 evaluation on 5K samples and user studies on 1K samples. Evaluators are shown four rendered views, metadata, and the caption, then rate both geometric and semantic accuracy. Overall, 95.8% (user) and 98.31% (GPT-5) of captions are judged correct — validating the reliability of metadata-augmented prompting.

Quantitative Results

Comprehensive benchmarks across point-, image- and text-to-CAD tasks on ABC and Objaverse — demonstrating DreamCAD's significant improvements in geometric fidelity and visual alignment. F1 and IR are scaled by 10², while CD, JSD, and MMD are scaled by 10³.

| Model | F1 ↑ | NC ↑ | CD ↓ | HD ↓ | JSD ↓ | MMD ↓ | IR ↓ |
|---|---|---|---|---|---|---|---|
| DeepCAD | 19.31 | 0.49 | 51.10 | 0.37 | 783.94 | 29.63 | 11.01 |
| CAD-Recode | 75.99 | 0.79 | 3.73 | 0.13 | 271.89 | 2.94 | 15.39 |
| Cadrille | 78.86 | 0.80 | 2.98 | 0.12 | 236.10 | 2.51 | 5.84 |
| DreamCAD (Ours) | 92.12 | 0.94 | 0.93 | 0.06 | 96.13 | 0.84 | 0.00 |

Visual Comparison

Qualitative comparison of DreamCAD against baselines across point cloud, image, and text-conditioned CAD generation.

Point-to-CAD qualitative comparison

CAD Topology Recovery

While DreamCAD's outputs are editable via control points and weights, they lack the complete CAD topology required for production use. However, DreamCAD's high-fidelity geometric reconstruction can serve as a strong prior for topology recovery, which forms the basis of our future work.

50K
Training Samples
Qwen3-4B finetuned with LoRA for NURBS prediction
99.2%
Valid BReps
Topology recovery success rate on test set
0.17
Chamfer Distance (×10³)
High geometric fidelity after topology recovery
STEP
Output Format
Industry-standard, editable in any CAD software
Research
Future Work
Full CAD topology recovery using DreamCAD's reconstruction as geometric prior.

Each Bézier patch from DreamCAD's output is represented as 16 control points with corresponding weights, encoded via a Transformer encoder. We finetune Qwen3-4B on 50K samples to convert these patch-based representations into structured NURBS sequences with full semantic topology, following the formulation of NURBGen. The resulting output is a valid BRep exportable as a standard STEP file.

CAD Topology Recovery

Given DreamCAD's patch output, Qwen3-4B predicts a structured NURBS representation with knot vectors, degrees, poles, and weights — producing a complete, semantically valid BRep topology ready for downstream CAD workflows.

Topology Example

BibTeX

@article{dreamcad,
  title   = {DreamCAD: Scaling Multi-modal CAD Generation using Differentiable Parametric Surfaces},
  author  = {Mohammad Sadil Khan and Muhammad Usama and Rolandos Alexandros Potamias and Didier Stricker and Muhammad Zeshan Afzal and Jiankang Deng and Ismail Elezi},
  journal = {arXiv preprint},
  year    = {2026}
}

Related Work

NURBGen — AAAI 2026 MARVEL-40M+ — CVPR 2025 Text2CAD — NeurIPS 2024 🏆 CAD-SIGNet — CVPR 2024 🏆

Get in Touch

Questions about DreamCAD, collaboration opportunities, or just want to say hi? Fill out the form and we'll get back to you.