Centralized Feature Store built on DVC for ML feature versioning, validation, and sharing. Includes MLflow integration for experiment tracking and Kubeflow Pipeline components for production ML workflows.

arec1b0/dvc-feature-store

DVC Feature Store

Tests | Python 3.10+ | License: MIT

Centralized feature store built on DVC for ML feature versioning, validation, and sharing.

Features

  • Version Control - Track feature changes with DVC
  • Schema Validation - Validate features against Pydantic schemas
  • Multi-Backend Storage - S3, GCS, Azure, or local storage
  • Feature Registry - Central catalog for feature discovery
  • CI/CD Integration - Automated validation in GitHub Actions
  • MLflow Integration - Automatic feature lineage tracking
  • Kubeflow Pipelines - Production ML pipeline components
  • CLI Interface - Easy-to-use command line tools
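
As a rough illustration of the Schema Validation bullet, the sketch below checks rows against a declared column-to-type mapping using only the standard library. The project itself validates against Pydantic schemas; the `SCHEMA` mapping, the column names, and `validate_rows` are hypothetical stand-ins and not part of this repository's API.

```python
from typing import Any

# Hypothetical schema: column name -> expected Python type.
# This is an assumption for illustration, not the project's schema format.
SCHEMA: dict[str, type] = {"customer_id": int, "age": int, "country": str}


def validate_rows(rows: list[dict[str, Any]], schema: dict[str, type]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: list[str] = []
    for i, row in enumerate(rows):
        for column, expected in schema.items():
            if column not in row:
                problems.append(f"row {i}: missing column '{column}'")
            elif not isinstance(row[column], expected):
                actual = type(row[column]).__name__
                problems.append(
                    f"row {i}: '{column}' should be {expected.__name__}, got {actual}"
                )
    return problems


rows = [
    {"customer_id": 1, "age": 34, "country": "DE"},
    {"customer_id": 2, "age": "unknown", "country": "FR"},
]
errors = validate_rows(rows, SCHEMA)
```

Collecting every problem instead of raising on the first one makes the result easier to report in a CI run, which is one plausible way a validation step could surface failures.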

Installation

# Basic installation
pip install -e .

# With all optional extras (includes MLflow support)
pip install -e ".[all]"

# With Kubeflow support
pip install -e ".[kubeflow]"

Quick Start

# Initialize feature store
feature-store init --remote-url s3://my-bucket/features

# Add a feature
feature-store feature add customer demographics data.parquet --schema schema.yaml

# List features
feature-store feature list

# Validate
feature-store feature validate customer/demographics

# Push to remote
feature-store push
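
The same sequence can be scripted, for example from a setup job. The sketch below only assembles the commands shown above and runs them with `subprocess`; `quickstart_commands` and `run_quickstart` are hypothetical helper names, and it assumes the `feature-store` CLI is installed and on `PATH`.

```python
import shutil
import subprocess


def quickstart_commands(remote_url: str, data_path: str, schema_path: str) -> list[list[str]]:
    """Build the Quick Start commands as argv lists, in the order shown above."""
    return [
        ["feature-store", "init", "--remote-url", remote_url],
        ["feature-store", "feature", "add", "customer", "demographics",
         data_path, "--schema", schema_path],
        ["feature-store", "feature", "list"],
        ["feature-store", "feature", "validate", "customer/demographics"],
        ["feature-store", "push"],
    ]


def run_quickstart(commands: list[list[str]]) -> bool:
    """Run each command in order; return False if the CLI is not installed."""
    if shutil.which("feature-store") is None:
        return False
    for argv in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed validation stops the sequence before the push.
        subprocess.run(argv, check=True)
    return True


commands = quickstart_commands("s3://my-bucket/features", "data.parquet", "schema.yaml")
```

Calling `run_quickstart(commands)` executes the steps. Passing argv lists rather than a shell string avoids quoting problems with paths and URLs.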

MLflow Integration

from pathlib import Path

from src.feature_store import FeatureStoreMLflow

fs = FeatureStoreMLflow(
    registry_path=Path("features/registry.yaml"),
    experiment_name="my-experiment",
)

with fs.start_run(run_name="training-v1"):
    df = fs.create_training_dataset(
        feature_names=["customer/demographics", "transaction/aggregates"],
        join_keys=["customer_id"],
    )
    
    model = train_model(df)  # train_model: your own training function (not defined here)
    
    fs.log_model_with_features(
        model=model,
        artifact_path="model",
        feature_names=["customer/demographics", "transaction/aggregates"],
        registered_model_name="my-model",
    )
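
This README does not show how `create_training_dataset` combines the listed features. One plausible reading is that each feature table is joined on `join_keys`; the sketch below illustrates that idea with plain dictionaries, assuming an inner join. The `inner_join` helper and the sample rows are hypothetical and are not the library's implementation.

```python
from typing import Any

Row = dict[str, Any]


def inner_join(left: list[Row], right: list[Row], keys: list[str]) -> list[Row]:
    """Join two lists of rows on `keys`, keeping only key values present in both."""
    # Index the right-hand rows by their key tuple for constant-time lookup.
    index: dict[tuple, list[Row]] = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    joined: list[Row] = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined


# Made-up sample rows standing in for the two features named above.
demographics = [{"customer_id": 1, "age": 34}, {"customer_id": 2, "age": 51}]
aggregates = [{"customer_id": 1, "total_spend": 120.5}]
training_rows = inner_join(demographics, aggregates, ["customer_id"])
```

Under the inner-join assumption, customer 2 is dropped because it has no transaction aggregates. A left join would keep that row with missing values instead, so it is worth checking which behavior the library actually uses.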

Kubeflow Pipelines

from src.feature_store.pipelines import create_training_pipeline
from kfp import compiler

compiler.Compiler().compile(
    pipeline_func=create_training_pipeline,
    package_path="training_pipeline.yaml",
)

Documentation

Document              Description
User Guide            General usage guide
MLflow Integration    MLflow setup and usage
Kubeflow Integration  Kubeflow pipeline guide
Contributing          Development setup
Architecture          Design decisions

Project Structure

dvc-feature-store/
├── src/feature_store/       # Core library
│   ├── models.py            # Pydantic models
│   ├── registry.py          # Feature registry
│   ├── validator.py         # Schema validation
│   ├── versioning.py        # DVC operations
│   ├── storage.py           # Remote storage
│   ├── mlflow_integration.py # MLflow integration
│   └── pipelines/           # Kubeflow components
├── features/                # Feature storage
├── examples/                # Usage examples
│   ├── ml_training/         # MLflow examples
│   └── kubeflow/            # Pipeline examples
├── tests/                   # Test suite
└── docs/                    # Documentation

License

MIT
