Version 2.0.0-rc.1 - MLOps pipeline (Local Development Ready) for predicting hourly bike traffic in Paris. Features automated data ingestion from Paris Open Data API, intelligent drift detection with sliding window training, champion/challenger model system with double evaluation, real-time monitoring via Prometheus + Grafana, and Discord alerting for critical events. Orchestrated with Airflow, tracked with MLflow (Cloud SQL backend), and deployed with FastAPI. All infrastructure runs locally via Docker Compose with 15 services. Production Kubernetes deployment under construction.
Watch a 4-minute demo of the complete MLOps pipeline on Vimeo
- Local (recommended for development)
- Copy the example file and edit values:
cp .env.example .env # then edit .env with your editor (do NOT commit) - Example file: .env.example
# Start full MLOps stack (MLflow, Airflow, FastAPI, Monitoring)
./scripts/start-all.sh --with-monitoring
# Access services
open http://localhost:5000 # MLflow tracking
open http://localhost:8081 # Airflow (admin / see .env)
open http://localhost:8000 # FastAPI API docs
open http://localhost:3000 # Grafana (admin / see .env)
open http://localhost:9090 # Prometheus

# DAG 1: Ingest data from Paris Open Data API
docker exec airflow-webserver airflow dags trigger fetch_comptage_daily
# DAG 2: Generate predictions
docker exec airflow-webserver airflow dags trigger daily_prediction
# DAG 3: Monitor & train (with force flag)
docker exec airflow-webserver airflow dags trigger monitor_and_fine_tune \
--conf '{"force_fine_tune": true, "test_mode": false}'

- ✅ Champion/Challenger System - Explicit model promotion with double evaluation
- ✅ Sliding Window Training - Learns from fresh data (660K baseline + 1.6K current)
- ✅ Drift Detection - Evidently-based monitoring with hybrid retraining strategy
- ✅ Real-time Monitoring - Prometheus metrics + 4 Grafana dashboards
- ✅ Discord Alerting - Critical events, training failures, champion promotions
- ✅ Automated Ingestion - Daily fetch from Paris Open Data API → BigQuery
- ✅ Prediction Pipeline - Daily ML predictions on last 7 days
- ✅ Audit Logs - All training runs, drift metrics, deployment decisions tracked
- ✅ MLflow Tracking - Cloud SQL PostgreSQL backend + GCS artifacts
- ✅ Custom Registry - `summary.json` for fast model loading
- ✅ Priority Loading - Champion models loaded first regardless of metrics
- ✅ 68% Code Coverage - 47 tests across 4 suites
- ✅ CI/CD Pipeline - GitHub Actions with Codecov integration
- ✅ Pre-commit Hooks - Ruff, MyPy, Bandit, YAML validation
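The custom registry and priority loading can be sketched in a few lines. This is a hypothetical illustration: the real loader lives in `backend/regmodel/app/model_registry_summary.py`, and the `summary.json` layout assumed here (a list of entries with `r2` and `is_champion` fields) as well as the `pick_model` helper are inventions for the example:

```python
import json
from pathlib import Path

def pick_model(entries: list[dict]) -> dict:
    """Return the champion entry if present, else the best entry by R²."""
    champions = [e for e in entries if e.get("is_champion")]
    if champions:
        # Priority loading: the champion wins regardless of metrics
        return champions[0]
    return max(entries, key=lambda e: e.get("r2", float("-inf")))

# Typical use (path is illustrative):
# entries = json.loads(Path("summary.json").read_text())
# model_meta = pick_model(entries)
```

The motivation for a flat JSON summary is fast startup: the API can resolve the current champion without querying a full model registry.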
┌──────────────────────────────────────────────────────────┐
│  LAYER 3: MLOps (Training & Monitoring)                  │
│  • DAG 3: Monitor & Train (weekly)                       │
│  • Sliding window training (660K + 1.6K samples)         │
│  • Double evaluation (test_baseline + test_current)      │
│  • Champion promotion + Discord alerts                   │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│  LAYER 2: DataOps (Ingestion & Predictions)              │
│  • DAG 1: Daily data ingestion → BigQuery                │
│  • DAG 2: Daily predictions (last 7 days)                │
│  • 3 BigQuery datasets (raw, predictions, audit)         │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│  LAYER 1: InfraOps (Services & Storage)                  │
│  • 15 Docker services (MLflow, Airflow, FastAPI)         │
│  • GCP: BigQuery, Cloud SQL, GCS                         │
│  • Monitoring: Prometheus, Grafana, airflow-exporter     │
└──────────────────────────────────────────────────────────┘
Core Stack:

- `mlflow` (port 5000) - Tracking server with Cloud SQL backend
- `regmodel-backend` (port 8000) - FastAPI with 5 endpoints
- `airflow-webserver` (port 8081) - DAG management UI
- `airflow-scheduler` - Task scheduling
- `airflow-worker` - Celery task execution
- `cloud-sql-proxy` - Secure Cloud SQL connection

Monitoring Stack (--profile monitoring):

- `prometheus` (port 9090) - Metrics collection
- `pushgateway` (port 9091) - For testing metrics collection
- `grafana` (port 3000) - 4 dashboards (overview, performance, drift, training)
- `airflow-exporter` (port 9101) - Custom MLOps metrics

Supporting Services:

- `postgres-airflow` - Airflow metadata DB
- `redis-airflow` - Celery broker
- `flower` (port 5555) - Celery monitoring
Repository layout (a high-quality schema is available for a zoomed view):
├── backend/regmodel/app/              # FastAPI backend
│   ├── fastapi_app.py                 # API endpoints (/train, /predict, /promote_champion)
│   ├── train.py                       # Training logic (sliding window)
│   ├── model_registry_summary.py      # Custom registry (summary.json)
│   └── middleware/                    # Prometheus metrics middleware
├── dags/                              # Airflow DAGs
│   ├── dag_daily_fetch_data.py        # Data ingestion (daily @ 02:00)
│   ├── dag_daily_prediction.py        # Predictions (daily @ 04:00)
│   ├── dag_monitor_and_train.py       # Monitor & train (weekly @ Sunday)
│   └── utils/discord_alerts.py        # Discord webhook integration
├── monitoring/                        # Monitoring configuration
│   ├── grafana/provisioning/          # 4 dashboards + alerting rules
│   ├── prometheus.yml                 # Scrape config (3 targets)
│   └── custom_exporters/              # airflow_exporter.py (MLOps metrics)
├── scripts/                           # Utility scripts
│   ├── start-all.sh                   # Start all services (with/without monitoring)
│   ├── restart-airflow.sh             # Restart Airflow services
│   └── reset-airflow-password.sh      # Reset Airflow password
├── data/                              # Training data (DVC tracked)
│   ├── train_baseline.csv             # 660K samples (69.7%)
│   └── test_baseline.csv              # 181K samples (30.3%)
├── docs/                              # Documentation (20+ files)
│   ├── MLOPS_ROADMAP.md               # Complete project roadmap
│   ├── training_strategy.md           # Sliding window + drift management
│   ├── ARCHITECTURE_DIAGRAM_GUIDE.md  # Excalidraw guide (3 layers)
│   ├── DEMO_SCRIPT.md                 # 2-minute video demo script
│   └── monitoring/                    # Monitoring docs (4 files)
├── tests/                             # Test suite (47 tests, 68% coverage)
│   ├── test_pipelines.py              # RF, NN pipeline validation
│   ├── test_preprocessing.py          # Transformer logic
│   ├── test_api_regmodel.py           # FastAPI endpoints
│   └── test_model_registry.py         # Registry logic
├── docker-compose.yaml                # 15 services (6GB memory for training)
└── .github/workflows/ci.yml           # CI/CD pipeline
- Docker & Docker Compose
- GCP credentials (service account JSON)
- Python 3.11+ (for local development)
Create a `.env` file at the project root (see docs/secrets.md for production setup):
# ========================================
# Environment & GCP
# ========================================
ENV=DEV # DEV or PROD
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_APPLICATION_CREDENTIALS=gcp.json
# ========================================
# BigQuery Configuration
# ========================================
BQ_PROJECT=your-project-id
BQ_RAW_DATASET=bike_traffic_raw # Raw data from API
BQ_PREDICT_DATASET=bike_traffic_predictions # Model predictions
BQ_LOCATION=europe-west1
# ========================================
# Google Cloud Storage
# ========================================
GCS_BUCKET=your-bucket-name # MLflow artifacts + model registry
# ========================================
# API Configuration
# ========================================
API_URL_DEV=http://regmodel-api:8000 # Internal Docker network
API_KEY_SECRET=dev-key-unsafe # Change for production!
# ========================================
# Model Performance Thresholds (v2.0.0)
# ========================================
R2_CRITICAL=0.45 # Below this β immediate retraining
R2_WARNING=0.55 # Below this + drift β proactive retraining
RMSE_THRESHOLD=90.0 # Above this β immediate retraining
# Note: If you change these thresholds, also update:
# - monitoring/grafana/provisioning/dashboards/overview.json (lines 177, 181)
# - monitoring/grafana/provisioning/dashboards/model_performance.json
# - monitoring/grafana/provisioning/dashboards/training_deployment.json
# ========================================
# Discord Alerting (Optional)
# ========================================
DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/...
# ========================================
# Grafana
# ========================================
GF_SECURITY_ADMIN_PASSWORD=your-strong-password
# ========================================
# MLflow Backend (Cloud SQL)
# ========================================
MLFLOW_TRACKING_URI=http://mlflow:5000
MLFLOW_DB_USER=mlflow_user
MLFLOW_DB_PASSWORD=your-db-password
MLFLOW_DB_NAME=mlflow
MLFLOW_INSTANCE_CONNECTION=project-id:region:instance-name
# ========================================
# Airflow Configuration
# ========================================
_AIRFLOW_WWW_USER_USERNAME=admin
_AIRFLOW_WWW_USER_PASSWORD=admin
AIRFLOW_UID=50000 # Match host user for volume permissions
AIRFLOW_GID=50000

Security Notes:

- ⚠️ `.env` contains secrets - NEVER commit to Git (already in `.gitignore`)
- 🔒 For production: Use GCP Secret Manager (see docs/secrets.md)
git clone https://github.com/arthurcornelio88/ds_traffic_cyclist1.git
cd ds_traffic_cyclist1
# Install dependencies (local dev)
uv init
uv venv
uv sync
source .venv/bin/activate

# Place service account JSON in project root
# File: mlflow-trainer.json (for training + model upload)

Required GCP Services:
- BigQuery (3 datasets: raw, predictions, audit)
- Cloud SQL PostgreSQL (MLflow metadata)
- GCS bucket: gs://df_traffic_cyclist1/
See docs/secrets.md for detailed setup.
# Option 1: Core services only (MLflow, Airflow, FastAPI)
./scripts/start-all.sh
# Option 2: With monitoring (Prometheus + Grafana)
./scripts/start-all.sh --with-monitoring
# Check logs
docker compose logs -f regmodel-backend
docker compose logs -f airflow-scheduler

# All tests with coverage
uv run pytest tests/ -v --cov
# Specific test suite
uv run pytest tests/test_api_regmodel.py -v
# Generate HTML coverage report
uv run pytest tests/ --cov --cov-report=html
open htmlcov/index.html

# Install hooks
uv run pre-commit install
# Run manually
uv run pre-commit run --all-files

# Quick test (1K samples)
python backend/regmodel/app/train.py \
--model-type rf \
--data-source baseline \
--model-test \
--env dev
# Full production training (660K samples)
python backend/regmodel/app/train.py \
--model-type rf \
--data-source baseline \
--env dev

Base URL: http://localhost:8000
| Endpoint | Method | Purpose |
|---|---|---|
| `/train` | POST | Train model with sliding window |
| `/predict` | POST | Generate predictions (returns champion metadata) |
| `/evaluate` | POST | Evaluate champion on test_baseline |
| `/drift` | POST | Detect data drift (Evidently) |
| `/promote_champion` | POST | Promote model to champion status |
| `/metrics` | GET | Prometheus metrics (scraped every 15s) |
| `/docs` | GET | Interactive API documentation |
Example: Train via API
curl -X POST "http://localhost:8000/train" \
-H "Content-Type: application/json" \
-d '{
"model_type": "rf",
"data_source": "baseline",
"test_mode": false,
"env": "dev"
}'

Access: http://localhost:3000 (admin / see .env)
- MLOps - Overview
  - Drift status (50% detected)
  - Champion R² (0.78)
  - API request rate & error rate
  - Services health
- MLOps - Model Performance
  - R² trends (champion vs challenger)
  - RMSE: 32.5
  - API latency percentiles (P50/P95/P99)
- MLOps - Drift Monitoring
  - Drift evolution over time
  - Drifted features count
  - R² vs drift correlation
- MLOps - Training & Deployment
  - Training success rate (100%)
  - Deployment decisions (deploy/skip/reject)
  - Model improvement delta
Key Metrics:

- `bike_model_r2_champion_current` - Champion R² on recent data
- `bike_drift_detected` - Binary drift flag (0/1)
- `bike_training_runs_total` - Training runs counter
- `bike_model_deployments_total` - Deployment decisions
- `fastapi_requests_total` - API request rate
- `fastapi_request_duration_seconds` - API latency
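For illustration, custom gauges like these can be emitted in the Prometheus text exposition format with plain Python. `render_metrics` below is a hypothetical sketch, not the code of the actual `airflow-exporter`:

```python
def render_metrics(values: dict[str, float]) -> str:
    """Render a dict of gauge values in Prometheus text exposition format."""
    lines = []
    for name, value in sorted(values.items()):
        lines.append(f"# TYPE {name} gauge")  # metadata line read by Prometheus
        lines.append(f"{name} {value}")       # sample line: metric name + value
    return "\n".join(lines) + "\n"

# Example:
# render_metrics({"bike_drift_detected": 1.0, "bike_model_r2_champion_current": 0.78})
```

In practice an exporter would serve this text on an HTTP endpoint (here, port 9101) for Prometheus to scrape every 15s.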
See docs/monitoring/03_metrics_reference.md for full catalog.
See docs/DEMO_SCRIPT.md for video presentation guide:
- Infrastructure startup (0:00-0:15) - Start all services
- Data pipeline (0:15-0:45) - DAG 1 & 2, Discord alerts, BigQuery
- MLOps pipeline (0:45-1:30) - DAG 3 training, champion promotion
- Grafana dashboards (1:30-2:00) - 4 dashboards overview
- MLOPS_ROADMAP.md - Complete project roadmap (5 phases)
- docs/training_strategy.md - Sliding window + drift management
- docs/architecture.md - MLflow & model registry
- docs/dags.md - Airflow DAG reference (3 DAGs)
- docs/INFRASTRUCTURE.md - Complete infrastructure documentation (Docker services, GCP, external APIs)
- docs/mlflow_cloudsql.md - MLflow Cloud SQL setup
- docs/bigquery_setup.md - BigQuery pipeline
- docs/secrets.md - GCS credentials & Secret Manager
- docs/dvc.md - Data versioning
- docs/monitoring/01_architecture.md - Monitoring overview
- docs/monitoring/02_alerting.md - Alert configuration
- docs/monitoring/03_metrics_reference.md - Metrics catalog
- docs/monitoring/04_dashboards_explained.md - Dashboard guide
- docs/ARCHITECTURE_DIAGRAM_GUIDE.md - Excalidraw guide (3 layers)
- docs/DEMO_SCRIPT.md - 2-minute video demo script
Status: Local development ready, production Kubernetes deployment under construction
Major Features:
- ✅ Production MLOps pipeline (Airflow, MLflow, FastAPI)
- ✅ Champion/Challenger system with double evaluation
- ✅ Sliding window training (660K + 1.6K samples)
- ✅ Real-time monitoring (Prometheus + Grafana)
- ✅ Discord alerting
- 🚧 Kubernetes deployment (under construction)
- 🚧 Production GCP deployment (under construction)
Features:
- Streamlit frontend for manual predictions
- Basic MLflow tracking (local only)
- Single model registry (`summary.json`)
- No automated orchestration
Note: V1 frontend (Streamlit) is deprecated in V2. Focus shifted to automated MLOps pipeline.
- Unified baseline: 905K records from `current_api_data.csv` (2024-09-01 → 2025-10-10)
- Temporal split: 660K train (69.7%) + 181K test (30.3%)
- DVC tracking: Data versioned with GCS remote storage
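The temporal split above (no shuffling, the most recent ~30% held out) can be sketched as follows; `temporal_split` and its default fraction are illustrative helpers, not the project's actual data pipeline:

```python
def temporal_split(rows: list, train_frac: float = 0.697) -> tuple[list, list]:
    """Split time-ordered rows into train/test without shuffling,
    so the test set is strictly more recent than the train set."""
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

# With rows sorted by timestamp, train ends where test begins,
# which avoids leaking future observations into training.
```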
- Hybrid strategy: Proactive (preventive) + Reactive (corrective) triggers
- Thresholds: R² < 0.65 (critical), drift ≥ 50% (proactive)
- Decision matrix: 5 priority levels (force, reactive, proactive, wait, all good)
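As a rough sketch, such a decision matrix might map current metrics to the five priority levels like this. Function name, signature, and exact rule ordering are assumptions for illustration, not the actual DAG 3 implementation:

```python
def retraining_decision(r2: float, drift_share: float, force: bool = False,
                        r2_critical: float = 0.65, drift_threshold: float = 0.5) -> str:
    """Map current metrics to one of 5 priority levels (illustrative)."""
    if force:
        return "force"       # manual override via DAG conf
    if r2 < r2_critical:
        return "reactive"    # corrective: performance already degraded
    if drift_share >= drift_threshold:
        return "proactive"   # preventive: drift detected, R² still OK
    if drift_share > 0:
        return "wait"        # some drift, below threshold: keep watching
    return "all good"
```

Rules are checked in priority order, so a forced run wins over a reactive one, and reactive wins over proactive.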
- Sliding window: Concatenate train_baseline (660K) + train_current (1.6K)
- Double evaluation: test_baseline (regression check) + test_current (improvement check)
- Deployment logic: REJECT (R² < 0.60) / SKIP (no improvement) / DEPLOY (R² gain > 0.02)
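The deployment logic can be illustrated with a small decision function using the thresholds from the bullet above; this is a sketch under those assumptions, not the project's actual code:

```python
def deployment_decision(challenger_r2: float, champion_r2: float,
                        floor: float = 0.60, min_gain: float = 0.02) -> str:
    """REJECT a regressed challenger, SKIP a marginal one, DEPLOY a clear win."""
    if challenger_r2 < floor:
        return "REJECT"  # hard floor: never ship a clearly degraded model
    if challenger_r2 - champion_r2 > min_gain:
        return "DEPLOY"  # challenger beats champion by a meaningful margin
    return "SKIP"        # no real improvement: keep the current champion
```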
./scripts/reset-airflow-password.sh
# Default: admin / admin

# Check memory usage
docker stats
# Restart with clean slate
docker compose down -v
./scripts/start-all.sh --with-monitoring

See docs/mlflow_cloudsql.md for Cloud SQL troubleshooting.
Check Discord alerts or Airflow logs:
docker compose logs -f airflow-scheduler

Built with ❤️ by:
This project is part of a DataScientest MLOps training program.
Last Updated: November 2025 Version: 2.0.0 Status: Local ready, Kubernetes deployment under construction