Commit 9a3029d

feat: add IntegratedML Custom Models documentation and improvements
- Add PRD.md documenting the IntegratedML Custom Models feature
- Add CLAUDE.md with development guidance for Claude Code users
- Add DNA similarity demo notebook
- Update demo notebooks for credit risk, fraud detection, and sales forecasting
- Improve .gitignore to properly track demo notebooks
- Add module-level logger for convenience
- Add backwards compatibility for data generator parameters
- Enhance realtime features for fraud detection

These changes provide comprehensive documentation for the IntegratedML Custom Models feature and improve compatibility across demos.
1 parent 4739f7a commit 9a3029d

11 files changed: +936 -49 lines changed


.gitignore

Lines changed: 10 additions & 0 deletions
@@ -235,3 +235,13 @@ notebooks/
 .ipynb_checkpoints/
 .venv/
 .uv/
+
+# Allow tracked demo notebooks (outputs must remain stripped)
+!demos/credit_risk/notebooks/
+!demos/fraud_detection/notebooks/
+!demos/sales_forecasting/notebooks/
+!demos/dna_similarity/notebooks/
+!demos/credit_risk/notebooks/01_Credit_Risk_Complete_Demo.ipynb
+!demos/fraud_detection/notebooks/01_Fraud_Detection_Complete_Demo.ipynb
+!demos/sales_forecasting/notebooks/01_Sales_Forecasting_Complete_Demo.ipynb
+!demos/dna_similarity/notebooks/01_DNA_Similarity_Complete_Demo.ipynb

CLAUDE.md

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This repository demonstrates **IntegratedML's Custom Models feature**, which allows Python ML models to be executed directly within InterSystems IRIS SQL commands. This is a groundbreaking capability that enables:

- Custom Python preprocessing and model training code within SQL `CREATE MODEL` statements
- Model validation using `VALIDATE MODEL` syntax
- Real-time predictions using `SELECT ... PREDICT()` without data movement
- Integration of any scikit-learn compatible model into database workflows

**Key Innovation**: Data scientists can now bring their custom Python models directly into SQL, eliminating the need for data export/import cycles and enabling real-time ML on live data.

See [PRD.md](PRD.md) for the complete product vision and feature documentation.

## Common Development Commands

### Environment Setup
```bash
# Install dependencies with uv (recommended) or pip
make install

# Start IRIS database (required for demos)
make start

# Complete setup (dependencies + IRIS)
make setup
```

### Development Workflow
```bash
# Run all tests
make test
# or directly with pytest
pytest demos/*/tests/ -v --tb=short

# Format code
make format
# or directly
black .

# Run linting
make lint
# or directly
flake8 . --max-line-length=88 --extend-ignore=E203,W503
mypy shared/ --ignore-missing-imports

# Open notebooks in VS Code
make notebooks
```

### Running Demos
```bash
# Individual demos
make demo-credit
make demo-fraud
make demo-sales
make demo-dna

# All demos
make demos

# Or run directly
python run_credit_risk_demo.py
python run_fraud_detection_demo.py
python run_sales_forecasting_demo.py
python run_dna_similarity_demo.py
```

### Database Management
```bash
# Check status
make status

# View logs
make logs

# Initialize database with sample data
make init-db

# Clean up (removes containers and volumes)
make clean
```

## High-Level Architecture

### Model Integration Pattern
All ML models follow a standardized integration pattern based on scikit-learn compatibility:

1. **Base Model Inheritance**: All models inherit from `IntegratedMLBaseModel` (shared/models/base.py:20), which provides:
   - IntegratedML parameter serialization/deserialization
   - Model state persistence
   - Input validation and preprocessing hooks
   - Consistent fit/predict interfaces

2. **Model Types**: Three specialized base classes extend the core pattern:
   - `ClassificationModel` (shared/models/classification.py) - Binary/multi-class classification
   - `RegressionModel` (shared/models/regression.py) - Continuous value prediction
   - `EnsembleModel` (shared/models/ensemble.py) - Multi-model voting/averaging

3. **Demo Model Structure**: Each demo implements a custom model:
   - `CustomCreditRiskClassifier` (demos/credit_risk/models/credit_risk_classifier.py:22) - Feature engineering for financial data
   - `EnsembleFraudDetector` (demos/fraud_detection/models/ensemble_fraud_detector.py) - Multiple sub-models with weighted voting
   - `HybridForecastingModel` (demos/sales_forecasting/models/hybrid_forecasting_model.py) - Prophet + LightGBM combination
   - `DNASimilarityAnalyzer` (demos/dna_similarity/models/dna_classifier.py) - Sequence analysis algorithms
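
The inheritance pattern described above might look roughly like the following. This is a minimal, self-contained sketch and not the repository's code: `SketchBaseModel` is a hypothetical stand-in for `IntegratedMLBaseModel`, and the method names `serialize_params`, `from_serialized_params`, and `_validate_input`, as well as the toy `MajorityClassModel`, are assumptions made for illustration.

```python
import json


class SketchBaseModel:
    """Hypothetical stand-in for IntegratedMLBaseModel (not the real class)."""

    def __init__(self, **params):
        # Keep constructor parameters so they can be serialized later.
        self.params = dict(params)
        self.is_fitted = False

    def serialize_params(self):
        # Parameter serialization: plain JSON text.
        return json.dumps(self.params, sort_keys=True)

    @classmethod
    def from_serialized_params(cls, text):
        # Parameter deserialization: rebuild an unfitted instance.
        return cls(**json.loads(text))

    def _validate_input(self, X):
        # Input validation hook: require a non-empty list of equal-length rows.
        if not X or any(len(row) != len(X[0]) for row in X):
            raise ValueError("X must be a non-empty list of equal-length rows")
        return X

    def fit(self, X, y):
        raise NotImplementedError

    def predict(self, X):
        raise NotImplementedError


class MajorityClassModel(SketchBaseModel):
    """Toy subclass: always predicts the most frequent training label."""

    def fit(self, X, y):
        X = self._validate_input(X)
        if len(X) != len(y):
            raise ValueError("X and y must have the same length")
        self.majority_ = max(set(y), key=list(y).count)
        self.is_fitted = True
        return self

    def predict(self, X):
        if not self.is_fitted:
            raise RuntimeError("call fit() before predict()")
        return [self.majority_ for _ in self._validate_input(X)]


model = MajorityClassModel(decision_threshold=0.7).fit(
    [[1, 2], [3, 4], [5, 6]], [0, 1, 1]
)
print(model.predict([[7, 8]]))

# Round-trip the constructor parameters through their serialized form.
clone = MajorityClassModel.from_serialized_params(model.serialize_params())
print(clone.params)
```

Keeping validation and parameter handling in the base class means each subclass only has to supply `fit()` and `predict()`, which is the division of labour the list above describes.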

### Key Architectural Patterns

1. **Feature Engineering Pipeline**: Models implement custom preprocessing in `_engineer_features()` methods, allowing domain-specific transformations before training/prediction.

2. **Model State Management**: Models use `_get_model_state()` and `_set_model_state()` for serialization, enabling persistence across database sessions.

3. **Ensemble Architecture**: The fraud detection demo shows how to combine multiple models (neural, rule-based, behavioral) with weighted voting and confidence thresholds.

4. **Database Integration**: Models are designed to work within IRIS database constraints:
   - Parameters passed via JSON from SQL
   - Models execute in-database for data locality
   - Results returned directly to SQL queries
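
To make the state-management pattern concrete, here is a small sketch of a get/set state round trip. The signatures of `_get_model_state()` and `_set_model_state()` in the repository are not shown in this commit, so the dict-based form below is an assumption, and `StatefulModel` is a hypothetical toy model.

```python
import pickle


class StatefulModel:
    """Hypothetical model showing a get/set state round trip."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.mean_ = None  # learned during fit()

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def predict(self, values):
        # Flag values that exceed the learned mean by more than the threshold.
        return [int(v - self.mean_ > self.threshold) for v in values]

    def _get_model_state(self):
        # Collect everything needed to rebuild the fitted model.
        return {"threshold": self.threshold, "mean_": self.mean_}

    def _set_model_state(self, state):
        self.threshold = state["threshold"]
        self.mean_ = state["mean_"]


trained = StatefulModel(threshold=1.0).fit([1.0, 2.0, 3.0])
blob = pickle.dumps(trained._get_model_state())  # bytes that could be stored

# Later, possibly in another session: rebuild the model from the stored bytes.
restored = StatefulModel()
restored._set_model_state(pickle.loads(blob))
print(restored.predict([2.5, 3.5]))
```

Serializing a plain dict of state, rather than the model object itself, keeps the stored bytes independent of where the class happens to be importable from.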

### Database Connection
The project uses InterSystems IRIS with IntegratedML. Connection details are configured via environment variables (see .env.example). The default setup uses:
- Host: localhost
- Port: 1972
- Namespace: USER
- Default credentials in docker-compose.yml
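
A configuration loader along these lines could read those settings. This is only a sketch: the variable names `IRIS_HOST`, `IRIS_PORT`, and `IRIS_NAMESPACE` are assumptions (the real names are whatever .env.example defines), and only the documented defaults come from the text above.

```python
import os


def load_iris_config(env=None):
    """Read connection settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "host": env.get("IRIS_HOST", "localhost"),        # variable name assumed
        "port": int(env.get("IRIS_PORT", "1972")),        # variable name assumed
        "namespace": env.get("IRIS_NAMESPACE", "USER"),   # variable name assumed
    }


# With no overrides, the documented defaults apply.
print(load_iris_config({}))

# An override supplied through the environment wins over the default.
print(load_iris_config({"IRIS_HOST": "iris.internal"})["host"])
```

Credentials are deliberately left out of the sketch, since the text only says they live in docker-compose.yml.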

### Testing Strategy
Tests are organized by demo with shared utilities:
- Unit tests for individual components
- Integration tests with IRIS database
- Performance benchmarks for latency requirements
- Test data generators for reproducible scenarios

Run specific test suites:
```bash
pytest demos/credit_risk/tests/
pytest demos/fraud_detection/tests/
pytest demos/sales_forecasting/tests/
```

PRD.md

Lines changed: 230 additions & 0 deletions
@@ -0,0 +1,230 @@
# IntegratedML Custom Models Feature - Product Requirements Document

## Executive Summary

IntegratedML now supports **custom Python model integration**, enabling data scientists and developers to bring their own machine learning models directly into InterSystems IRIS SQL workflows. This groundbreaking feature allows custom Python preprocessing, feature engineering, and model training code to be executed within SQL commands like `CREATE MODEL` and `SELECT ... PREDICT()`.

## Core Value Proposition

### Before IntegratedML Custom Models
```python
# Traditional approach - data movement required
data = fetch_from_database()
processed_data = custom_preprocessing(data)
model = train_custom_model(processed_data)
predictions = model.predict(new_data)
write_to_database(predictions)
```

### With IntegratedML Custom Models
```sql
-- Everything happens in-database!
CREATE MODEL FraudDetectionModel
PREDICTING (is_fraud)
FROM Transactions
USING "demos.fraud_detection.models.EnsembleFraudDetector";

-- Real-time predictions without data movement
SELECT transaction_id, amount,
       PREDICT(FraudDetectionModel) as fraud_risk
FROM LiveTransactions
WHERE amount > 1000;
```

## Key Features

### 1. Seamless SQL Integration
- Use familiar SQL syntax to train and deploy custom ML models
- No need to export data or manage separate ML infrastructure
- Models execute where the data lives

### 2. Python Flexibility
- Bring any scikit-learn compatible model
- Custom preprocessing and feature engineering
- Support for ensemble methods and complex architectures
- Integration with popular libraries (TensorFlow, LightGBM, Prophet)

### 3. Production-Ready
- Models persist in the database
- Automatic versioning and lifecycle management
- Built-in security and access controls
- Scalable in-database execution

## How It Works

### Step 1: Define Your Custom Model
```python
from sklearn.linear_model import LogisticRegression

from shared.models.base import IntegratedMLBaseModel


class CustomCreditRiskClassifier(IntegratedMLBaseModel):
    def fit(self, X, y):
        # Custom feature engineering
        X_engineered = self._engineer_features(X)
        # Train your model
        self.model = LogisticRegression()
        self.model.fit(X_engineered, y)
        return self

    def predict(self, X):
        # Apply the same feature engineering used during training
        X_engineered = self._engineer_features(X)
        return self.model.predict(X_engineered)
```

### Step 2: Train Using SQL
```sql
CREATE MODEL CreditRiskModel
PREDICTING (default_risk)
FROM CreditApplications
USING "demos.credit_risk.models.CustomCreditRiskClassifier"
WITH (enable_debt_ratio=true, decision_threshold=0.7);
```

### Step 3: Validate Model Performance
```sql
VALIDATE MODEL CreditRiskModel
FROM TestApplications;
```

### Step 4: Make Predictions
```sql
SELECT
    customer_id,
    credit_amount,
    PREDICT(CreditRiskModel) as risk_prediction,
    PREDICT(CreditRiskModel PROBABILITY) as risk_probability
FROM NewApplications;
```

## Demo Showcase

This repository demonstrates four real-world use cases:

### 1. Credit Risk Assessment
**Problem**: Banks need to assess credit risk while keeping sensitive financial data secure.

**Solution**: Custom feature engineering (debt-to-income ratios, stability scores) executed in-database.

**Key Features**:
- Domain-specific financial calculations
- Risk scoring with explanations
- Compliance-friendly in-database processing
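
As an illustration of the kind of ratio feature mentioned above, the sketch below derives a debt-to-income value from one application record. It is not the demo's implementation: the function name and the field names `annual_income`, `total_debt`, and `credit_amount` are hypothetical, and plain dicts stand in for whatever row format the real model receives.

```python
def engineer_features(application):
    """Derive ratio features from one application record (a plain dict)."""
    income = application["annual_income"]
    features = dict(application)  # leave the caller's record untouched
    # Debt-to-income ratio; None when income is zero or missing.
    features["debt_to_income"] = (
        application["total_debt"] / income if income else None
    )
    # Requested credit relative to income, with the same guard.
    features["credit_to_income"] = (
        application["credit_amount"] / income if income else None
    )
    return features


row = {"annual_income": 50000, "total_debt": 20000, "credit_amount": 10000}
engineered = engineer_features(row)
print(engineered["debt_to_income"], engineered["credit_to_income"])
```

Guarding the division matters in practice, because a single applicant with no recorded income would otherwise abort the whole training or prediction call.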

### 2. Real-time Fraud Detection
**Problem**: Payment processors need sub-100ms fraud detection without data movement latency.

**Solution**: Ensemble model combining neural networks, rules, and behavioral analysis.

**Key Features**:
- Multiple model voting strategies
- Real-time feature calculation
- Configurable confidence thresholds
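
Weighted voting with a configurable threshold can be sketched in a few lines. The function below is an assumption about how such a combination might work, not the logic of `EnsembleFraudDetector`; the sub-model names, scores, and weights are made-up illustration values.

```python
def weighted_vote(scores, weights, threshold=0.5):
    """Combine per-model fraud scores (0..1) into one decision.

    Returns (is_fraud, combined_score).
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must name the same models")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted average of the individual scores.
    combined = sum(scores[name] * weights[name] for name in scores) / total_weight
    return combined >= threshold, combined


scores = {"neural": 0.9, "rules": 0.4, "behavioral": 0.7}
weights = {"neural": 0.5, "rules": 0.2, "behavioral": 0.3}
is_fraud, score = weighted_vote(scores, weights, threshold=0.7)
print(is_fraud, round(score, 2))
```

Dividing by the total weight means the weights do not have to sum to one, so a sub-model can be down-weighted without rescaling the others.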

### 3. Sales Forecasting
**Problem**: Retailers need accurate forecasts combining time series and ML approaches.

**Solution**: Hybrid model integrating Prophet (trend modeling) with LightGBM (pattern learning).

**Key Features**:
- Third-party library integration
- Confidence interval generation
- Seasonal decomposition

### 4. DNA Sequence Similarity
**Problem**: Genomics researchers need specialized sequence analysis algorithms.

**Solution**: Custom similarity metrics (Levenshtein distance, k-mer analysis) in SQL.

**Key Features**:
- Bioinformatics algorithms
- Specialized distance calculations
- Optimized sequence processing
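
The two metrics named above are standard, and a compact reference version looks like this. It is a sketch for illustration rather than the code in `DNASimilarityAnalyzer`; the function names and the choice of Jaccard similarity over k-mer sets are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def kmer_jaccard(a, b, k=3):
    """Jaccard similarity of the sets of k-mers in two sequences."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 0.0


print(levenshtein("GATTACA", "GACTATA"))
print(kmer_jaccard("GATTACA", "GATTAGA"))
```

The edit-distance routine keeps only two rows of the dynamic-programming table, so memory grows with the length of one sequence rather than the product of both.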

## Technical Architecture

### Model Lifecycle
1. **Development**: Create a Python model inheriting from `IntegratedMLBaseModel`
2. **Registration**: Model path specified in the SQL `USING` clause
3. **Training**: SQL `CREATE MODEL` triggers the Python `fit()` method
4. **Validation**: SQL `VALIDATE MODEL` evaluates performance metrics
5. **Persistence**: Trained model stored in IRIS
6. **Inference**: SQL `PREDICT()` calls the Python `predict()` method

### Integration Points
- **Parameter Passing**: SQL `WITH` clause → Python `__init__` parameters
- **Data Transfer**: IRIS tables → Pandas DataFrames
- **Model Storage**: Python pickle → IRIS model repository
- **Error Handling**: Python exceptions → SQL error messages
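
The parameter-passing and error-handling points can be illustrated with a short sketch. How IRIS actually hands `WITH` clause values to Python and surfaces exceptions is not shown in this commit, so the JSON string input, the `build_model` helper, and the `ThresholdModel` class below are all hypothetical.

```python
import json


class ThresholdModel:
    """Hypothetical model whose options arrive as keyword arguments."""

    def __init__(self, decision_threshold=0.5, enable_debt_ratio=False):
        if not 0.0 <= decision_threshold <= 1.0:
            raise ValueError("decision_threshold must be between 0 and 1")
        self.decision_threshold = decision_threshold
        self.enable_debt_ratio = enable_debt_ratio


def build_model(model_class, params_json):
    """Turn a JSON parameter string into a model, or an error message.

    Returns (model, error); exactly one of the two is None.
    """
    try:
        params = json.loads(params_json) if params_json else {}
        return model_class(**params), None
    except (TypeError, ValueError) as exc:
        # json.JSONDecodeError is a ValueError; unknown keywords raise TypeError.
        return None, f"{type(exc).__name__}: {exc}"


model, error = build_model(
    ThresholdModel, '{"enable_debt_ratio": true, "decision_threshold": 0.7}'
)
print(model.decision_threshold, model.enable_debt_ratio, error)

# A bad parameter value comes back as a readable message, not a traceback.
_, bad_value_error = build_model(ThresholdModel, '{"decision_threshold": 7}')
print(bad_value_error)
```

Validating parameters in `__init__` means a mistyped option fails at model creation time, before any training data is read.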

## Benefits for Different Personas

### For Data Scientists
- Use familiar Python ML libraries
- No infrastructure management
- Focus on model development, not deployment

### For SQL Developers
- Access ML capabilities through SQL
- No Python knowledge required for predictions
- Consistent SQL-based workflows

### For IT/Operations
- Reduced infrastructure complexity
- Unified security model
- Simplified model governance

### For Business Users
- Faster time-to-insight
- Real-time predictions on live data
- Lower total cost of ownership

## Getting Started

### Quick Demo (5 minutes)
```bash
# Clone and setup
git clone <repo>
cd integratedml-flexible-model-integration
make setup

# Run a demo
make demo-credit
```

### Try Your Own Model (15 minutes)
1. Create a model class inheriting from `IntegratedMLBaseModel`
2. Implement `fit()` and `predict()` methods
3. Use SQL to train: `CREATE MODEL ... USING "your.model.path"`
4. Validate performance: `VALIDATE MODEL YourModel FROM TestData`
5. Make predictions: `SELECT PREDICT(YourModel) FROM YourTable`

## Success Metrics

- **Performance**: In-database execution eliminates data movement latency
- **Flexibility**: Support for any Python ML approach
- **Adoption**: Simple SQL interface for complex ML
- **Governance**: Centralized model management

## Roadmap

### Current Release
- Scikit-learn compatible models
- Basic parameter passing
- Model persistence
- SQL prediction functions

### Future Enhancements
- Model versioning and A/B testing
- Automated retraining triggers
- Model explainability APIs
- Distributed training support

## Call to Action

1. **Explore the Demos**: See how custom models solve real problems
2. **Build Your Own**: Create models for your specific use cases
3. **Share Feedback**: Help shape the future of IntegratedML
4. **Join the Community**: Share your models and learn from others

---

*IntegratedML Custom Models Feature - Where SQL Meets Machine Learning*
