# Add Supervisor Process Management Module

This introduces a **supervisor module** that runs ML frameworks under supervisord for automatic crash recovery and process management. It can be added to an existing Dockerfile by installing the package and changing the entrypoint.

## Integration

Install the package and prefix the framework command with `standard-supervisor`:

```bash
pip install model-hosting-container-standards
standard-supervisor vllm serve model --host 0.0.0.0 --port 8080
```

Or in a Dockerfile:
```dockerfile
COPY model_hosting_container_standards-0.1.2-py3-none-any.whl /tmp/
RUN pip install supervisor
RUN pip install /tmp/model_hosting_container_standards-0.1.2-py3-none-any.whl

# Use supervisor entrypoint for SageMaker
ENV ENGINE_AUTO_RECOVERY=true
ENV ENGINE_MAX_RECOVERY_ATTEMPTS=3
ENTRYPOINT ["standard-supervisor", "./sagemaker-entrypoint.sh"]
```

## Workflow

1. **Parse command and environment** → Read the ML framework command and the supervisor configuration
2. **Generate supervisord config** → Build the configuration file with `configparser`
3. **Start supervisord** → Launch the supervisor daemon with your framework as the managed process
4. **Monitor and restart** → Supervisor detects crashes and restarts the process automatically, up to a configurable limit
5. **Handle failures** → After the maximum number of retries, the container exits gracefully with an error code

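
Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the code in `generator.py`: the function name `build_supervisord_config`, its parameters, and the choice of `nodaemon = true` are assumptions made for the example. The `program:app` section name is inferred from the `SUPERVISOR_PROGRAM__APP_*` variables used elsewhere in this description.

```python
import configparser
import io
import shlex


def build_supervisord_config(command, max_start_retries=3, log_level="info"):
    """Return supervisord configuration text for one managed program (hypothetical helper)."""
    # Disable configparser's own interpolation; supervisord applies its own
    # %(...)s expansion when it reads the file.
    parser = configparser.ConfigParser(interpolation=None)
    parser["supervisord"] = {
        "nodaemon": "true",  # assumption: run in the foreground inside a container
        "loglevel": log_level,
    }
    parser["program:app"] = {
        "command": shlex.join(command),  # quote arguments for a single command line
        "autorestart": "true",
        "startretries": str(max_start_retries),
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


config_text = build_supervisord_config(
    ["vllm", "serve", "model", "--host", "0.0.0.0", "--port", "8080"]
)
print(config_text)
```

Writing the file through `configparser` rather than string formatting keeps section and key syntax consistent regardless of which settings are present.
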
## Key Components

**Core Modules:**
- `models.py` - Configuration data models with validation and environment variable parsing
- `generator.py` - supervisord configuration generation using `configparser`

**CLI Tools & Scripts:**
- `scripts/standard_supervisor.py` - Main CLI tool for running ML frameworks under supervisor (`standard-supervisor`)
- `scripts/generate_supervisor_config.py` - Standalone configuration generator CLI

**Documentation & Tests:**
- `README.md` - Setup guide with examples
- `tests/integration/test_supervisor_cli_integration.py` - Integration tests that run real supervisor processes and verify restart and retry behavior
- `tests/supervisor/` - Unit tests for all components

## Usage Examples

### Simple CLI Usage
```bash
# Direct command execution with supervisor
standard-supervisor vllm serve model --host 0.0.0.0 --port 8080

# With custom configuration
PROCESS_MAX_START_RETRIES=5 SUPERVISOR_PROGRAM__APP_STARTSECS=30 \
standard-supervisor python -m tensorrt_llm.hlapi.llm_api
```

### Dockerfile Integration
```dockerfile
FROM vllm/vllm-openai:latest

# Install with supervisor support
RUN pip install model-hosting-container-standards

# Configure your ML framework with supervisor settings
ENV PROCESS_MAX_START_RETRIES=3
ENV SUPERVISOR_PROGRAM__APP_STARTSECS=30
ENV SUPERVISOR_PROGRAM__APP_STOPWAITSECS=60
ENV LOG_LEVEL=info

# Use supervisor for process management
ENTRYPOINT ["python", "-m", "model_hosting_container_standards.supervisor.scripts.standard_supervisor"]
CMD ["vllm", "serve", "model", "--host", "0.0.0.0", "--port", "8080"]
```

## Configuration Options

**Basic Configuration:**
- Command line arguments become the supervised process command
- `PROCESS_MAX_START_RETRIES=3` - Maximum startup attempts before giving up (0-100)
- `LOG_LEVEL=info` - Logging level (debug, info, warn, error, critical)

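
The range and value checks above could be enforced along these lines. This is a sketch only, not the validation in `models.py`: the function name `read_basic_settings`, the returned dictionary shape, the fallback values `3` and `info`, and the case-insensitive handling of `LOG_LEVEL` are all assumptions.

```python
import os

# Log levels listed in this description.
VALID_LOG_LEVELS = {"debug", "info", "warn", "error", "critical"}


def read_basic_settings(env=None):
    """Validate PROCESS_MAX_START_RETRIES and LOG_LEVEL (hypothetical helper)."""
    env = os.environ if env is None else env

    raw_retries = env.get("PROCESS_MAX_START_RETRIES", "3")  # assumed default
    try:
        retries = int(raw_retries)
    except ValueError:
        raise ValueError(
            f"PROCESS_MAX_START_RETRIES must be an integer, got {raw_retries!r}"
        ) from None
    if not 0 <= retries <= 100:
        raise ValueError(
            f"PROCESS_MAX_START_RETRIES must be between 0 and 100, got {retries}"
        )

    log_level = env.get("LOG_LEVEL", "info").lower()  # assumed default
    if log_level not in VALID_LOG_LEVELS:
        raise ValueError(
            f"LOG_LEVEL must be one of {sorted(VALID_LOG_LEVELS)}, got {log_level!r}"
        )

    return {"max_start_retries": retries, "log_level": log_level}
```

Failing fast on a bad value at startup surfaces a configuration mistake before supervisord is launched, rather than as an unexplained restart loop.
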
**Advanced Supervisor Settings:**
- `SUPERVISOR_PROGRAM__APP_STARTSECS=30` - Seconds the process must stay running to be considered "started"
- `SUPERVISOR_PROGRAM__APP_STOPWAITSECS=60` - Seconds to wait for graceful shutdown before the process is killed
- `SUPERVISOR_PROGRAM__APP_AUTORESTART=true` - Enable automatic restart on failure
- `SUPERVISOR_PROGRAM__APP_STARTRETRIES=3` - Startup retry attempts
- `SUPERVISOR_CONFIG_PATH=/tmp/supervisord.conf` - Custom config file location

**Custom Sections:**
- `SUPERVISOR_SUPERVISORD_LOGLEVEL=debug` - Supervisord daemon log level
- `SUPERVISOR_EVENTLISTENER__MEMMON_COMMAND=memmon -a 200MB` - Add custom event listeners

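
The variable names above suggest a naming pattern: the text after `SUPERVISOR_` names a config section, a double underscore stands for the `:` in section names such as `program:app`, and the final token is the key. The sketch below implements that reading. The mapping rule is inferred from the examples rather than taken from the module, and `parse_supervisor_overrides` is a hypothetical name.

```python
def parse_supervisor_overrides(env):
    """Group SUPERVISOR_* variables into {section: {key: value}} (hypothetical helper)."""
    prefix = "SUPERVISOR_"
    reserved = {"SUPERVISOR_CONFIG_PATH"}  # a file location, not a config section
    sections = {}
    for name, value in env.items():
        if not name.startswith(prefix) or name in reserved:
            continue
        # Assumption: the last "_"-separated token is the key, the rest is the section.
        section_part, _, key = name[len(prefix):].rpartition("_")
        if not section_part or not key:
            continue
        # Assumption: "__" stands for ":", e.g. PROGRAM__APP -> program:app.
        section = section_part.replace("__", ":").lower()
        sections.setdefault(section, {})[key.lower()] = value
    return sections


example_env = {
    "SUPERVISOR_PROGRAM__APP_STARTSECS": "30",
    "SUPERVISOR_SUPERVISORD_LOGLEVEL": "debug",
    "SUPERVISOR_EVENTLISTENER__MEMMON_COMMAND": "memmon -a 200MB",
    "SUPERVISOR_CONFIG_PATH": "/tmp/supervisord.conf",
    "LOG_LEVEL": "info",
}
overrides = parse_supervisor_overrides(example_env)
print(overrides)
```

One limitation of this simple split: a supervisord key that itself contains an underscore (for example `stdout_logfile`) would be divided at the wrong place, so a real implementation needs a more careful rule.
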
## Testing & Validation

**Test Suite:**
- **Integration Tests** - Run actual supervisor processes to verify continuous restart and retry limit behavior
- **Unit Tests** - Cover all components (`tests/supervisor/`)

**Test Coverage:**
- **Continuous restart behavior** - Verifies supervisor actually restarts failed processes
- **Startup retry limits** - Confirms supervisor respects retry limits and gives up appropriately
- **Signal handling** - Tests graceful shutdown with SIGTERM
- **ML framework integration** - Tests with realistic ML framework startup patterns
- **Configuration generation** - Validates all supervisor configuration options
- **Error handling** - Tests invalid configurations and edge cases

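
The retry-limit behavior can be illustrated without supervisord. The sketch below is a plain-Python stand-in, not the module's test code and not how supervisord is implemented: `run_with_start_retries` is a hypothetical helper, and it assumes the limit means "one initial attempt plus N retries".

```python
import subprocess
import sys


def run_with_start_retries(command, max_start_retries):
    """Run command, restarting after each failure, then give up (hypothetical stand-in)."""
    attempts = 0
    while True:
        attempts += 1
        exit_code = subprocess.run(command).returncode
        # Stop on success, or once the initial attempt plus all retries have failed.
        if exit_code == 0 or attempts > max_start_retries:
            return attempts, exit_code


# A stand-in "framework" that always crashes on startup with exit code 1.
always_failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
attempts, exit_code = run_with_start_retries(always_failing, max_start_retries=2)
print(f"gave up after {attempts} attempts with exit code {exit_code}")
```

A test built this way checks two things at once: the process is restarted after a failure, and the failure is propagated as a non-zero exit code once the limit is reached.
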
**Manual Testing:**
- Tested with a vLLM Dockerfile build
- Killed the supervised process via `docker exec` to confirm restart behavior
- Validated in production-like container environments