Commit 1d9f7fd

Merge branch 'main' of github.com:aws/model-hosting-container-standards into toggle-sticky-routing
2 parents 9c9791a + 4211239 commit 1d9f7fd

17 files changed: +3097 -16 lines changed

PR_DESCRIPTION.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# Add Supervisor Process Management Module

This introduces a **supervisor module** that wraps ML frameworks with supervisord for automatic crash recovery and robust process management. It can be integrated into any Dockerfile easily.

## Integration

Install and use with these commands:

```bash
pip install model-hosting-container-standards
standard-supervisor vllm serve model --host 0.0.0.0 --port 8080
```

Or in a Dockerfile:

```dockerfile
COPY model_hosting_container_standards-0.1.2-py3-none-any.whl /tmp/
RUN pip install supervisor
RUN pip install /tmp/model_hosting_container_standards-0.1.2-py3-none-any.whl

# Use supervisor entrypoint for SageMaker
ENV ENGINE_AUTO_RECOVERY=true
ENV ENGINE_MAX_RECOVERY_ATTEMPTS=3
ENTRYPOINT ["standard-supervisor", "./sagemaker-entrypoint.sh"]
```

## Workflow

1. **Parse command and environment** → Read ML framework command and supervisor configuration
2. **Generate supervisord config** → Create robust configuration with configparser
3. **Start supervisord** → Launch supervisor daemon with your framework as managed process
4. **Monitor and restart** → Supervisor detects crashes and restarts automatically with configurable limits
5. **Handle failures** → After max retries, container exits gracefully with proper error codes
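
Steps 1 and 2 can be sketched in a few lines. This is a minimal illustration, not the module's actual `generator.py`: the function name `build_supervisord_config`, the `program:app` section name (inferred from the `SUPERVISOR_PROGRAM__APP_*` variables), and the defaults shown are assumptions. It only builds the config text with `configparser`; launching supervisord is left out.

```python
import configparser
import io
import shlex
from typing import List


def build_supervisord_config(
    command: List[str], max_start_retries: int = 3, log_level: str = "info"
) -> str:
    """Return supervisord config text for one managed program (hypothetical helper)."""
    # Disable configparser's own %-interpolation so values are written verbatim.
    config = configparser.ConfigParser(interpolation=None)

    # Run supervisord in the foreground, as is usual for a container entrypoint.
    config["supervisord"] = {"nodaemon": "true", "loglevel": log_level}

    # The framework command becomes the supervised program.
    config["program:app"] = {
        "command": shlex.join(command),
        "autorestart": "true",
        "startretries": str(max_start_retries),
        "stdout_logfile": "/dev/stdout",
        "stdout_logfile_maxbytes": "0",
    }

    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_supervisord_config(["vllm", "serve", "model", "--port", "8080"]))
```

Writing the returned text to a file and running `supervisord -c <path>` would then correspond to step 3.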
33+
34+
### **Key Components**
35+
36+
**Core Modules:**
37+
- `models.py` - Configuration data models with comprehensive validation and environment variable parsing
38+
- `generator.py` - Robust supervisord configuration generation using configparser
39+
40+
**CLI Tools & Scripts:**
41+
- `scripts/standard_supervisor.py` - Main CLI tool for running ML frameworks under supervisor (`standard-supervisor`)
42+
- `scripts/generate_supervisor_config.py` - Standalone configuration generator CLI
43+
44+
**Documentation & Tests:**
45+
- `README.md` - Comprehensive setup guide with examples
46+
- `tests/integration/test_supervisor_cli_integration.py` - **Real behavior integration tests** that verify actual restart and retry behavior
47+
- `tests/supervisor/` - Comprehensive unit tests for all components
48+
49+
## Usage Examples

### Simple CLI Usage
```bash
# Direct command execution with supervisor
standard-supervisor vllm serve model --host 0.0.0.0 --port 8080

# With custom configuration
PROCESS_MAX_START_RETRIES=5 SUPERVISOR_PROGRAM__APP_STARTSECS=30 \
standard-supervisor python -m tensorrt_llm.hlapi.llm_api
```

### Dockerfile Integration
```dockerfile
FROM vllm/vllm-openai:latest

# Install with supervisor support
RUN pip install model-hosting-container-standards

# Configure your ML framework with supervisor settings
ENV PROCESS_MAX_START_RETRIES=3
ENV SUPERVISOR_PROGRAM__APP_STARTSECS=30
ENV SUPERVISOR_PROGRAM__APP_STOPWAITSECS=60
ENV LOG_LEVEL=info

# Use supervisor for process management
ENTRYPOINT ["python", "-m", "model_hosting_container_standards.supervisor.scripts.standard_supervisor"]
CMD ["vllm", "serve", "model", "--host", "0.0.0.0", "--port", "8080"]
```

## Configuration Options

**Basic Configuration:**
- Command line arguments become the supervised process command
- `PROCESS_MAX_START_RETRIES=3` - Maximum startup attempts before giving up (0-100)
- `LOG_LEVEL=info` - Logging level (debug, info, warn, error, critical)

**Advanced Supervisor Settings:**
- `SUPERVISOR_PROGRAM__APP_STARTSECS=30` - Time process must run to be considered "started"
- `SUPERVISOR_PROGRAM__APP_STOPWAITSECS=60` - Time to wait for graceful shutdown
- `SUPERVISOR_PROGRAM__APP_AUTORESTART=true` - Enable automatic restart on failure
- `SUPERVISOR_PROGRAM__APP_STARTRETRIES=3` - Startup retry attempts
- `SUPERVISOR_CONFIG_PATH=/tmp/supervisord.conf` - Custom config file location

**Custom Sections:**
- `SUPERVISOR_SUPERVISORD_LOGLEVEL=debug` - Supervisord daemon log level
- `SUPERVISOR_EVENTLISTENER__MEMMON_COMMAND=memmon -a 200MB` - Add custom event listeners
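
The variable names above suggest a mapping from `SUPERVISOR_<SECTION>_<KEY>` to supervisord sections, with a double underscore standing in for the colon in names such as `program:app`. The sketch below shows one way such a mapping could work. The function names and the exact splitting rule are assumptions inferred from the listed examples, and the real `models.py` may behave differently.

```python
from typing import Dict, Mapping

PREFIX = "SUPERVISOR_"
# Variables that share the prefix but are not section overrides (assumed list).
RESERVED = {"SUPERVISOR_CONFIG_PATH"}


def parse_max_start_retries(env: Mapping[str, str], default: int = 3) -> int:
    """Read PROCESS_MAX_START_RETRIES and enforce the documented 0-100 range."""
    raw = env.get("PROCESS_MAX_START_RETRIES")
    if raw is None:
        return default
    value = int(raw)  # raises ValueError for non-numeric input
    if not 0 <= value <= 100:
        raise ValueError(f"PROCESS_MAX_START_RETRIES must be 0-100, got {value}")
    return value


def parse_supervisor_overrides(env: Mapping[str, str]) -> Dict[str, Dict[str, str]]:
    """Group SUPERVISOR_* variables into {section: {key: value}}."""
    sections: Dict[str, Dict[str, str]] = {}
    for name, value in env.items():
        if not name.startswith(PREFIX) or name in RESERVED:
            continue
        # "PROGRAM__APP_STARTSECS" -> "PROGRAM:APP_STARTSECS"
        rest = name[len(PREFIX):].replace("__", ":")
        # Assumed rule: the first single underscore separates section from key.
        section, separator, key = rest.partition("_")
        if not separator or not section or not key:
            continue
        sections.setdefault(section.lower(), {})[key.lower()] = value
    return sections


if __name__ == "__main__":
    example = {
        "SUPERVISOR_PROGRAM__APP_STARTSECS": "30",
        "SUPERVISOR_SUPERVISORD_LOGLEVEL": "debug",
        "SUPERVISOR_CONFIG_PATH": "/tmp/supervisord.conf",
    }
    print(parse_supervisor_overrides(example))
```

Under this rule a section name cannot itself contain a single underscore, which is one reason the real parser might use a different scheme.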

## Testing & Validation

**Comprehensive Test Suite:**
- **Integration Tests** - Actual supervisor processes that verify continuous restart and retry limit behavior

**Test Coverage:**
- **Continuous restart behavior** - Verifies supervisor actually restarts failed processes
- **Startup retry limits** - Confirms supervisor respects retry limits and gives up appropriately
- **Signal handling** - Tests graceful shutdown with SIGTERM
- **ML framework integration** - Tests with realistic ML framework startup patterns
- **Configuration generation** - Validates all supervisor configuration options
- **Error handling** - Tests invalid configurations and edge cases

**Manual Testing:**
- Tested with vLLM dockerfile build
- Verified with `docker exec` process killing to confirm restart behavior
- Validated in production-like container environments
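
The retry-limit behavior listed under Test Coverage can be illustrated without supervisord. The loop below is a toy stand-in under stated assumptions: `run_with_retry_limit` and `failing_process` are hypothetical helpers, exit code 0 is treated as success, and the budget is one initial attempt plus `max_retries` retries, which may not match supervisord's exact `startretries` accounting.

```python
import subprocess
import sys
from typing import Callable


def run_with_retry_limit(start: Callable[[], int], max_retries: int) -> int:
    """Call start() until it returns 0 or the retry budget is spent.

    Returns the last exit code so the caller can exit with it.
    """
    attempts = 0
    while True:
        exit_code = start()
        attempts += 1
        if exit_code == 0:
            return 0
        if attempts > max_retries:
            # Budget exhausted: give up and surface the failure.
            return exit_code


def failing_process() -> int:
    """Hypothetical stand-in for a framework that crashes on startup."""
    result = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])
    return result.returncode


if __name__ == "__main__":
    print(run_with_retry_limit(failing_process, max_retries=2))
```

Passing the start function in as a callable keeps the loop testable with a fake that records how many times it was invoked.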

python/MANIFEST.in

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# Include supervisor scripts
recursive-include model_hosting_container_standards/supervisor/scripts *

# Include documentation
include README.md
include LICENSE

# Include configuration files
include pyproject.toml

# Exclude development files
exclude .gitignore
exclude .pre-commit-config.yaml
recursive-exclude * __pycache__
recursive-exclude * *.py[co]
recursive-exclude * .DS_Store

python/model_hosting_container_standards/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -5,4 +5,4 @@
 - FastAPI: from .common.fastapi import EnvVars, ENV_CONFIG
 """
 
-__version__ = "0.1.4"
+__version__ = "0.1.7"

python/model_hosting_container_standards/logging_config.py

Lines changed: 3 additions & 0 deletions
@@ -25,6 +25,9 @@ def get_logger(name: str = "model_hosting_container_standards") -> logging.Logge
 logger.addHandler(handler)
 logger.setLevel(getattr(logging, level.upper()))
 
+# Prevent propagation to avoid duplicate logs
+logger.propagate = False
+
 return logger
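
The added `logger.propagate = False` matters when an ancestor logger (often the root logger) also has a handler: a record is then emitted once by the logger's own handler and again by the ancestor's. The sketch below reproduces that effect with a hypothetical `ListHandler` and `count_emissions` helper. It illustrates standard `logging` behavior and is not code from this commit.

```python
import logging
from typing import List


class ListHandler(logging.Handler):
    """Handler that appends each formatted message to a shared list."""

    def __init__(self, sink: List[str]) -> None:
        super().__init__()
        self.sink = sink

    def emit(self, record: logging.LogRecord) -> None:
        self.sink.append(record.getMessage())


def count_emissions(propagate: bool) -> int:
    """Log one message and count how many handlers received it."""
    sink: List[str] = []

    # Simulate an application that configured a handler on the root logger.
    root = logging.getLogger()
    root_handler = ListHandler(sink)
    root.addHandler(root_handler)

    # Simulate a library logger that also attaches its own handler.
    logger = logging.getLogger(f"demo.propagate.{propagate}")
    logger.setLevel(logging.INFO)
    own_handler = ListHandler(sink)
    logger.addHandler(own_handler)
    logger.propagate = propagate

    try:
        logger.info("hello")
    finally:
        # Clean up so repeated calls do not accumulate handlers.
        root.removeHandler(root_handler)
        logger.removeHandler(own_handler)
    return len(sink)


if __name__ == "__main__":
    print("propagate=True :", count_emissions(True))
    print("propagate=False:", count_emissions(False))
```

The trade-off is that with propagation disabled, handlers configured by the application on the root logger no longer see this logger's records.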
