| 1 | +# IntegratedML Custom Models Feature - Product Requirements Document |
| 2 | + |
| 3 | +## Executive Summary |
| 4 | + |
| 5 | +IntegratedML now supports **custom Python model integration**, enabling data scientists and developers to bring their own machine learning models directly into InterSystems IRIS SQL workflows. Custom Python preprocessing, feature engineering, and model training code runs inside SQL commands such as `CREATE MODEL` and `SELECT ... PREDICT()`. |
| 6 | + |
| 7 | +## Core Value Proposition |
| 8 | + |
| 9 | +### Before IntegratedML Custom Models |
| 10 | +```python |
| 11 | +# Traditional approach - data movement required |
| 12 | +data = fetch_from_database() |
| 13 | +processed_data = custom_preprocessing(data) |
| 14 | +model = train_custom_model(processed_data) |
| 15 | +predictions = model.predict(new_data) |
| 16 | +write_to_database(predictions) |
| 17 | +``` |
| 18 | + |
| 19 | +### With IntegratedML Custom Models |
| 20 | +```sql |
| 21 | +-- Everything happens in-database! |
| 22 | +CREATE MODEL FraudDetectionModel |
| 23 | +PREDICTING (is_fraud) |
| 24 | +FROM Transactions |
| 25 | +USING "demos.fraud_detection.models.EnsembleFraudDetector"; |
| 26 | + |
| 27 | +-- Real-time predictions without data movement |
| 28 | +SELECT transaction_id, amount, |
| 29 | +       PREDICT(FraudDetectionModel) as fraud_risk |
| 30 | +FROM LiveTransactions |
| 31 | +WHERE amount > 1000; |
| 32 | +``` |
| 33 | + |
| 34 | +## Key Features |
| 35 | + |
| 36 | +### 1. Seamless SQL Integration |
| 37 | +- Use familiar SQL syntax to train and deploy custom ML models |
| 38 | +- No need to export data or manage separate ML infrastructure |
| 39 | +- Models execute where the data lives |
| 40 | + |
| 41 | +### 2. Python Flexibility |
| 42 | +- Bring any scikit-learn compatible model |
| 43 | +- Custom preprocessing and feature engineering |
| 44 | +- Support for ensemble methods and complex architectures |
| 45 | +- Integration with popular libraries (TensorFlow, LightGBM, Prophet) |
| 46 | + |
| 47 | +### 3. Production-Ready |
| 48 | +- Models persist in the database |
| 49 | +- Lifecycle management through SQL (model versioning is planned; see Roadmap) |
| 50 | +- Built-in security and access controls |
| 51 | +- Scalable in-database execution |
| 52 | + |
| 53 | +## How It Works |
| 54 | + |
| 55 | +### Step 1: Define Your Custom Model |
| 56 | +```python |
| 57 | +from sklearn.linear_model import LogisticRegression |
| 58 | +from shared.models.base import IntegratedMLBaseModel |
| 59 | + |
| 60 | +class CustomCreditRiskClassifier(IntegratedMLBaseModel): |
| 61 | +    def fit(self, X, y): |
| 62 | +        # Custom feature engineering (_engineer_features is a helper this class is assumed to define) |
| 63 | +        X_engineered = self._engineer_features(X) |
| 64 | +        # Train the underlying scikit-learn model; fit() returns the fitted estimator |
| 65 | +        self.model = LogisticRegression().fit(X_engineered, y) |
| 66 | +        return self |
| 67 | + |
| 68 | +    def predict(self, X): |
| 69 | +        X_engineered = self._engineer_features(X) |
| 70 | +        return self.model.predict(X_engineered) |
| 71 | +``` |
| 72 | + |
| 73 | +### Step 2: Train Using SQL |
| 74 | +```sql |
| 75 | +CREATE MODEL CreditRiskModel |
| 76 | +PREDICTING (default_risk) |
| 77 | +FROM CreditApplications |
| 78 | +USING "demos.credit_risk.models.CustomCreditRiskClassifier" |
| 79 | +WITH (enable_debt_ratio=true, decision_threshold=0.7); |
| 80 | +``` |
| 81 | + |
| 82 | +### Step 3: Validate Model Performance |
| 83 | +```sql |
| 84 | +VALIDATE MODEL CreditRiskModel |
| 85 | +FROM TestApplications; |
| 86 | +``` |
| 87 | + |
| 88 | +### Step 4: Make Predictions |
| 89 | +```sql |
| 90 | +SELECT |
| 91 | +    customer_id, |
| 92 | +    credit_amount, |
| 93 | +    PREDICT(CreditRiskModel) as risk_prediction, |
| 94 | +    PREDICT(CreditRiskModel PROBABILITY) as risk_probability |
| 95 | +FROM NewApplications; |
| 96 | +``` |
| 97 | + |
| 98 | +## Demo Showcase |
| 99 | + |
| 100 | +This repository demonstrates four real-world use cases: |
| 101 | + |
| 102 | +### 1. Credit Risk Assessment |
| 103 | +**Problem**: Banks need to assess credit risk while keeping sensitive financial data secure. |
| 104 | + |
| 105 | +**Solution**: Custom feature engineering (debt-to-income ratios, stability scores) executed in-database. |
| 106 | + |
| 107 | +**Key Features**: |
| 108 | +- Domain-specific financial calculations |
| 109 | +- Risk scoring with explanations |
| 110 | +- Compliance-friendly in-database processing |
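
The domain-specific calculations listed above could look roughly like the following sketch. The field names (`monthly_debt`, `monthly_income`, `years_employed`, `years_at_address`) and the stability formula are assumptions made for illustration; they are not taken from the demo's schema or code.

```python
def debt_to_income(monthly_debt: float, monthly_income: float) -> float:
    """Return debt / income, or 1.0 (treated as worst case) when income is not positive."""
    if monthly_income <= 0:
        return 1.0
    return monthly_debt / monthly_income


def stability_score(years_employed: float, years_at_address: float) -> float:
    """Hypothetical score in [0, 1]: each tenure is capped at 10 years, then averaged."""
    capped = min(years_employed, 10.0) + min(years_at_address, 10.0)
    return capped / 20.0


def engineer_features(record: dict) -> dict:
    """Return a copy of the record with the two derived features added."""
    features = dict(record)
    features["debt_to_income"] = debt_to_income(
        record["monthly_debt"], record["monthly_income"]
    )
    features["stability_score"] = stability_score(
        record["years_employed"], record["years_at_address"]
    )
    return features
```

A helper of this shape is what a model's `_engineer_features` method might call for each row before training or prediction.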
| 111 | + |
| 112 | +### 2. Real-time Fraud Detection |
| 113 | +**Problem**: Payment processors need sub-100ms fraud detection without data movement latency. |
| 114 | + |
| 115 | +**Solution**: Ensemble model combining neural networks, rules, and behavioral analysis. |
| 116 | + |
| 117 | +**Key Features**: |
| 118 | +- Multiple model voting strategies |
| 119 | +- Real-time feature calculation |
| 120 | +- Configurable confidence thresholds |
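
As a rough illustration of voting strategies with a configurable threshold, the sketch below combines per-model outputs in two ways. The function names, the weights, and the default threshold are hypothetical; the demo's `EnsembleFraudDetector` may combine its components differently.

```python
from typing import Sequence


def majority_vote(flags: Sequence[bool]) -> bool:
    """Flag as fraud when strictly more than half of the component models do."""
    return sum(flags) * 2 > len(flags)


def weighted_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-model fraud scores, each assumed to lie in [0, 1]."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


def is_fraud(scores: Sequence[float], weights: Sequence[float], threshold: float = 0.5) -> bool:
    """Apply a configurable confidence threshold to the weighted score."""
    return weighted_score(scores, weights) >= threshold
```

Raising the threshold trades recall for fewer false positives, which is the kind of tuning a `WITH` parameter could expose.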
| 121 | + |
| 122 | +### 3. Sales Forecasting |
| 123 | +**Problem**: Retailers need accurate forecasts combining time series and ML approaches. |
| 124 | + |
| 125 | +**Solution**: Hybrid model integrating Prophet (trend modeling) with LightGBM (pattern learning). |
| 126 | + |
| 127 | +**Key Features**: |
| 128 | +- Third-party library integration |
| 129 | +- Confidence interval generation |
| 130 | +- Seasonal decomposition |
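
The hybrid idea (one component for the trend, another for what the trend leaves unexplained) can be sketched without either library. The example below deliberately swaps in a least-squares line for Prophet and per-position residual averages for LightGBM, so it shows only the structure of the approach, not the demo's model. All function names are hypothetical.

```python
def fit_linear_trend(y):
    """Ordinary least squares fit of y = intercept + slope * t for t = 0..n-1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y)) / denom
    return y_mean - slope * t_mean, slope


def fit_seasonal_residuals(y, intercept, slope, period):
    """Average residual (actual minus trend) for each position in the cycle."""
    buckets = [[] for _ in range(period)]
    for t, v in enumerate(y):
        buckets[t % period].append(v - (intercept + slope * t))
    return [sum(b) / len(b) for b in buckets]


def forecast(y, period, horizon):
    """Forecast `horizon` steps ahead as trend plus the learned seasonal residual."""
    if period < 1 or len(y) < max(2, period):
        raise ValueError("need at least two points and one full period of history")
    intercept, slope = fit_linear_trend(y)
    seasonal = fit_seasonal_residuals(y, intercept, slope, period)
    n = len(y)
    return [intercept + slope * t + seasonal[t % period] for t in range(n, n + horizon)]
```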
| 131 | + |
| 132 | +### 4. DNA Sequence Similarity |
| 133 | +**Problem**: Genomics researchers need specialized sequence analysis algorithms. |
| 134 | + |
| 135 | +**Solution**: Custom similarity metrics (Levenshtein distance, k-mer analysis) in SQL. |
| 136 | + |
| 137 | +**Key Features**: |
| 138 | +- Bioinformatics algorithms |
| 139 | +- Specialized distance calculations |
| 140 | +- Optimized sequence processing |
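
The two similarity metrics named above can be written in plain Python as follows. This is a minimal sketch: Levenshtein distance uses the standard two-row dynamic program, and the k-mer comparison uses Jaccard similarity over k-mer sets, which is one common choice and an assumption about what "k-mer analysis" means in the demo.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def kmers(seq: str, k: int) -> set:
    """Set of all length-k substrings of seq (empty when seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two sequences' k-mer sets, in [0, 1]."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 1.0
    return len(ka & kb) / len(ka | kb)
```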
| 141 | + |
| 142 | +## Technical Architecture |
| 143 | + |
| 144 | +### Model Lifecycle |
| 145 | +1. **Development**: Create Python model inheriting from `IntegratedMLBaseModel` |
| 146 | +2. **Registration**: Model path specified in SQL `USING` clause |
| 147 | +3. **Training**: SQL `CREATE MODEL` triggers the Python `fit()` method |
| 148 | +4. **Validation**: SQL `VALIDATE MODEL` evaluates performance metrics |
| 149 | +5. **Persistence**: Trained model stored in IRIS |
| 150 | +6. **Inference**: SQL `PREDICT()` calls the Python `predict()` method |
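
The lifecycle above can be walked through in plain Python to show what each SQL step is described as doing on the Python side. This sketch makes several assumptions: `MajorityClassModel` is a toy stand-in for a class inheriting from `IntegratedMLBaseModel`, `accuracy` stands in for whatever metrics `VALIDATE MODEL` reports, and pickling the fitted state dictionary is one possible persistence choice rather than the documented storage mechanism.

```python
import pickle
from collections import Counter


class MajorityClassModel:
    """Toy stand-in for a class that would inherit from IntegratedMLBaseModel."""

    def fit(self, X, y):
        # Remember the most frequent training label.
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Predict the remembered label for every input row.
        return [self.label_ for _ in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Training: the step CREATE MODEL is described as triggering.
model = MajorityClassModel().fit([[1], [2], [3]], [0, 1, 1])

# Validation: score the fitted model on held-out rows.
score = accuracy([1, 0], model.predict([[4], [5]]))

# Persistence: serialize the fitted state to bytes that a store could keep.
blob = pickle.dumps(model.__dict__)

# Inference: rebuild the model from the stored state, then predict.
restored = MajorityClassModel()
restored.__dict__.update(pickle.loads(blob))
predictions = restored.predict([[6], [7]])
```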
| 151 | + |
| 152 | +### Integration Points |
| 153 | +- **Parameter Passing**: SQL `WITH` clause → Python `__init__` parameters |
| 154 | +- **Data Transfer**: IRIS tables → Pandas DataFrames |
| 155 | +- **Model Storage**: Python pickle → IRIS model repository |
| 156 | +- **Error Handling**: Python exceptions → SQL error messages |
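
Two of these integration points, parameter passing and error handling, can be sketched as follows. The parser, the `ThresholdModel` class, and the error message format are all hypothetical; how the real integration layer parses the `WITH` clause or reports errors is not specified here.

```python
def _coerce(raw: str):
    """Convert an option value to bool, int, or float where possible, else a string."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw.strip("'\"")


def parse_with_options(options: str) -> dict:
    """Turn 'a=true, b=0.7' into {'a': True, 'b': 0.7} (hypothetical parser)."""
    params = {}
    for item in options.split(","):
        if not item.strip():
            continue
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item.strip()!r}")
        params[key.strip()] = _coerce(raw.strip())
    return params


class ThresholdModel:
    """Hypothetical model whose __init__ parameters mirror the WITH clause options."""

    def __init__(self, enable_debt_ratio=False, decision_threshold=0.5):
        if not 0.0 <= decision_threshold <= 1.0:
            raise ValueError("decision_threshold must be between 0 and 1")
        self.enable_debt_ratio = enable_debt_ratio
        self.decision_threshold = decision_threshold


def build_model(options: str):
    """Return (model, None) on success or (None, message) when configuration fails."""
    try:
        return ThresholdModel(**parse_with_options(options)), None
    except (TypeError, ValueError) as exc:
        # A Python exception becomes a message that could be surfaced as a SQL error.
        return None, f"Model configuration error: {exc}"
```

With this shape, `build_model("enable_debt_ratio=true, decision_threshold=0.7")` yields a configured model, while an unknown option name or an out-of-range threshold yields an error message instead of an unhandled exception.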
| 157 | + |
| 158 | +## Benefits for Different Personas |
| 159 | + |
| 160 | +### For Data Scientists |
| 161 | +- Use familiar Python ML libraries |
| 162 | +- No infrastructure management |
| 163 | +- Focus on model development, not deployment |
| 164 | + |
| 165 | +### For SQL Developers |
| 166 | +- Access ML capabilities through SQL |
| 167 | +- No Python knowledge required for predictions |
| 168 | +- Consistent SQL-based workflows |
| 169 | + |
| 170 | +### For IT/Operations |
| 171 | +- Reduced infrastructure complexity |
| 172 | +- Unified security model |
| 173 | +- Simplified model governance |
| 174 | + |
| 175 | +### For Business Users |
| 176 | +- Faster time-to-insight |
| 177 | +- Real-time predictions on live data |
| 178 | +- Lower total cost of ownership |
| 179 | + |
| 180 | +## Getting Started |
| 181 | + |
| 182 | +### Quick Demo (5 minutes) |
| 183 | +```bash |
| 184 | +# Clone and setup |
| 185 | +git clone <repo> |
| 186 | +cd integratedml-flexible-model-integration |
| 187 | +make setup |
| 188 | + |
| 189 | +# Run a demo |
| 190 | +make demo-credit |
| 191 | +``` |
| 192 | + |
| 193 | +### Try Your Own Model (15 minutes) |
| 194 | +1. Create a model class inheriting from `IntegratedMLBaseModel` |
| 195 | +2. Implement `fit()` and `predict()` methods |
| 196 | +3. Use SQL to train: `CREATE MODEL ... USING "your.model.path"` |
| 197 | +4. Validate performance: `VALIDATE MODEL YourModel FROM TestData` |
| 198 | +5. Make predictions: `SELECT PREDICT(YourModel) FROM YourTable` |
| 199 | + |
| 200 | +## Success Metrics |
| 201 | + |
| 202 | +- **Performance**: In-database execution eliminates data movement latency |
| 203 | +- **Flexibility**: Support for any Python ML approach |
| 204 | +- **Adoption**: Simple SQL interface for complex ML |
| 205 | +- **Governance**: Centralized model management |
| 206 | + |
| 207 | +## Roadmap |
| 208 | + |
| 209 | +### Current Release |
| 210 | +- Scikit-learn compatible models |
| 211 | +- Basic parameter passing |
| 212 | +- Model persistence |
| 213 | +- SQL prediction functions |
| 214 | + |
| 215 | +### Future Enhancements |
| 216 | +- Model versioning and A/B testing |
| 217 | +- Automated retraining triggers |
| 218 | +- Model explainability APIs |
| 219 | +- Distributed training support |
| 220 | + |
| 221 | +## Call to Action |
| 222 | + |
| 223 | +1. **Explore the Demos**: See how custom models solve real problems |
| 224 | +2. **Build Your Own**: Create models for your specific use cases |
| 225 | +3. **Share Feedback**: Help shape the future of IntegratedML |
| 226 | +4. **Join the Community**: Share your models and learn from others |
| 227 | + |
| 228 | +--- |
| 229 | + |
| 230 | +*IntegratedML Custom Models Feature - Where SQL Meets Machine Learning* |