🧠 Learned Cardinality Estimation for DuckDB

A DuckDB extension that integrates machine learning to predict cardinalities more accurately than traditional heuristics, using TPC-H benchmark data and ONNX inference in C++.
🚀 Overview

Traditional cardinality estimation in query optimizers often struggles with complex data patterns. This project introduces a hybrid AI-classical system that:

• Trains ML models (XGBoost/LightGBM) on TPC-H query plans.
• Predicts cardinalities and associated uncertainty.
• Overrides DuckDB’s optimizer estimates at runtime using a native C++ extension.
• Supports temperature-controlled plan exploration via Thompson sampling.
📦 Features

• Native DuckDB C++ extension with OptimizerExtension API
• Trained ML models exported to ONNX for runtime inference
• Supports temperature-tuned sampling for creative plan discovery
• Integrated benchmarking pipeline with TPC-H queries
• Docker-based reproducible environment
🏗️ Repository Structure

See ARCHITECTURE.md for visual overview.

• extern/duckdb/ – DuckDB submodule.
• models/ – ONNX models for cardinality and uncertainty prediction.
• src/ – Core C++ logic: feature extraction, inference, optimizer override.
• include/ – C++ headers.
• training/ – Python-based training scripts and notebooks.
• benchmarks/ – TPC-H queries and benchmarking scripts.
• scripts/ – Automation scripts for data generation and extension activation.
• docker/ – Dev + CI Docker environments.
🧪 First-Time Setup

Clone With Submodules

    git clone --recurse-submodules https://github.com/yourusername/learned-cardinality-duckdb

Build the Extension

    mkdir build && cd build
    cmake .. -DCMAKE_BUILD_TYPE=Release
    make learned_cardinality

Test It in DuckDB

    LOAD 'build/learned_cardinality.duckdb_extension';
    SET enable_learned_cardinality = true;
    EXPLAIN SELECT * FROM lineitem JOIN orders ON ...;
🧠 Training the Model (Optional)

To retrain or customize the model:

    cd training/
    pip install -r requirements.txt
    python train.py  # exports to models/cardinality.onnx
📊 Benchmarking

Run all TPC-H queries and compare Q-Error:

    cd benchmarks/tpc-h/
    python run_benchmark.py
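For readers unfamiliar with the metric, Q-Error is conventionally the larger of the two ratios estimated/actual and actual/estimated, so a perfect estimate scores 1.0 and over- and underestimates are penalized symmetrically. The sketch below is illustrative only and is not taken from run_benchmark.py; the function names and the choice to floor both values at 1 row (to avoid division by zero) are assumptions.

```python
from typing import Iterable, Tuple


def q_error(estimated: float, actual: float, floor: float = 1.0) -> float:
    """Return max(est/actual, actual/est), with both values floored.

    The floor (1 row by default) is an assumed convention that keeps the
    ratio defined when either cardinality is zero.
    """
    est = max(estimated, floor)
    act = max(actual, floor)
    return max(est / act, act / est)


def mean_q_error(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average Q-Error over (estimated, actual) pairs."""
    errors = [q_error(est, act) for est, act in pairs]
    if not errors:
        raise ValueError("need at least one (estimated, actual) pair")
    return sum(errors) / len(errors)


# Hypothetical (estimated, actual) row counts for three plan operators.
example_pairs = [(100, 1000), (500, 250), (42, 42)]
example_mean = mean_q_error(example_pairs)
```

Here the three operators score 10.0, 2.0, and 1.0, so the underestimate by 10x dominates the average. Reporting the median or a high percentile alongside the mean is a common way to keep a few large misestimates from hiding typical behavior.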
⚙️ Configuration

You can tune optimization behavior with DuckDB settings:

    SET learned_cardinality_temperature = 0.6;
    SET learned_cardinality_debug = true;
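To give an intuition for what a temperature setting can control, here is a minimal sketch of temperature-scaled sampling in the spirit of Thompson sampling. It assumes the model predicts a mean and a standard deviation in log-cardinality space and that the temperature multiplies the standard deviation before a value is drawn; the extension's actual C++ logic may differ, and `sample_cardinality` is a hypothetical name.

```python
import math
import random


def sample_cardinality(log_mean: float, log_std: float,
                       temperature: float, rng: random.Random) -> float:
    """Draw one cardinality from a temperature-scaled log-normal.

    Assumption: the model outputs log_mean and log_std in natural-log
    space. A temperature of 0 returns the point estimate; larger values
    widen the spread and so let the optimizer explore other plans.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    sampled_log = rng.gauss(log_mean, temperature * log_std)
    # Clamp to at least one row so downstream cost formulas stay defined.
    return max(1.0, math.exp(sampled_log))


rng = random.Random(0)  # fixed seed so the draws are reproducible
point_estimate = sample_cardinality(math.log(1000), 0.5, 0.0, rng)
explored = [sample_cardinality(math.log(1000), 0.5, 0.6, rng)
            for _ in range(5)]
```

With a temperature of 0 the draw collapses to the predicted cardinality (about 1000 rows here), while 0.6 yields values scattered around it. Under this reading, lowering the temperature trades exploration for more stable plans.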
📈 Example Results

| Metric         | DuckDB Baseline | Learned Estimator |
|----------------|-----------------|-------------------|
| Avg Q-Error    | 9.8             | 2.1               |
| Plan Stability | 100%            | 89%               |
| Time Saved     | n/a             | ~22% on SF=1      |