-
Notifications
You must be signed in to change notification settings - Fork 356
feat(grpc): add JWT-based authentication for inter-component RPC (#4417) #4488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…RPC credentials Signed-off-by: sabarixr <[email protected]>
…interceptors Signed-off-by: sabarixr <[email protected]>
…via auth registry Signed-off-by: sabarixr <[email protected]>
… defaults and validation Signed-off-by: sabarixr <[email protected]>
…iss=scheduler, aud=manager) Signed-off-by: sabarixr <[email protected]>
Signed-off-by: sabarixr <[email protected]>
9c59d1e to
b90c990
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements JWT-based authentication for inter-component gRPC communication between Manager, Scheduler, and other Dragonfly components using HS256 signing. The implementation adds shared authentication utilities, server-side validation interceptors, client-side credential attachment, and extends configuration to support JWT settings. However, there are several critical issues that need to be addressed before merging, including token expiration without refresh, mandatory health check authentication breaking probes, and breaking changes to existing deployments.
Key Changes:
- New JWT authentication package with HS256 signing/validation utilities
- gRPC interceptors for server-side token validation on all requests
- Client-side per-RPC credentials to attach Bearer tokens
- Extended scheduler config with JWT auth settings and validation
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/rpc/auth/jwt.go | Core JWT signing and validation logic using HS256, includes Claims structure and global key registry |
| pkg/rpc/auth/interceptors.go | gRPC server interceptors (unary and stream) to enforce JWT validation from metadata |
| pkg/rpc/auth/credentials.go | Client-side PerRPCCredentials implementation to attach Bearer tokens to outgoing calls |
| scheduler/config/config.go | Adds AuthConfig and JWTConfig structures with defaults and validation rules |
| scheduler/scheduler.go | Attaches JWT credentials to manager client during initialization |
| scheduler/rpcserver/rpcserver.go | Sets server JWT key for scheduler component |
| pkg/rpc/scheduler/server/server.go | Adds JWT interceptors to scheduler gRPC server chain |
| pkg/rpc/manager/server/server.go | Adds JWT interceptors to manager gRPC server chain |
| manager/rpcserver/rpcserver.go | Sets server JWT key for manager component |
| go.mod | Adds golang-jwt/jwt/v5 dependency |
| go.sum | Dependency checksums for jwt/v5 library |
| func validateFromMetadata(ctx context.Context, key string, expectedAudience string) error { | ||
| md, ok := metadata.FromIncomingContext(ctx) | ||
| if !ok { | ||
| return status.Error(codes.Unauthenticated, "missing metadata") | ||
| } | ||
| vals := md.Get("authorization") | ||
| if len(vals) == 0 { | ||
| return status.Error(codes.Unauthenticated, "missing authorization") | ||
| } | ||
| parts := strings.Fields(vals[0]) | ||
| if len(parts) != 2 || !strings.EqualFold(parts[0], "Bearer") { | ||
| return status.Error(codes.Unauthenticated, "invalid authorization header") | ||
| } | ||
| token := parts[1] | ||
| if _, err := ValidateHS256(key, token, expectedAudience); err != nil { | ||
| return status.Error(codes.Unauthenticated, err.Error()) | ||
| } | ||
| return nil | ||
| } |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the interceptor validates an empty JWT key (returned by GetServerKey), the ValidateHS256 function will return "jwt: empty validation key" error. This will cause all requests to fail with an authentication error, but the error message won't clearly indicate the root cause (missing server configuration).
Consider checking if the key is empty before attempting validation and providing a clearer error message:
func validateFromMetadata(ctx context.Context, key string, expectedAudience string) error {
if key == "" {
// JWT auth is disabled when no key is configured
return nil
}
md, ok := metadata.FromIncomingContext(ctx)
// ... rest of validation
}This would make JWT truly optional and provide better debugging when misconfigured.
| Timeout: 14 * 24 * time.Hour, | ||
| MaxRefresh: 7 * 24 * time.Hour, |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default JWT timeout of 14 days and max refresh of 7 days seem inconsistent with the inter-component authentication use case. For service-to-service authentication, tokens are typically long-lived or refreshed automatically. The 10-minute hardcoded TTL in scheduler.go:104 conflicts with these defaults.
Consider:
- Using different defaults more appropriate for inter-component auth (e.g., 24 hours for Timeout)
- Adding comments explaining the intended use case:
// Timeout is duration that a jwt token is valid.
// For inter-component authentication, use a longer duration (e.g., 24h)
// For user authentication, use a shorter duration (e.g., 2h)
Timeout time.Duration `yaml:"timeout" mapstructure:"timeout"`- Making the scheduler use the configured Timeout instead of hardcoded 10*time.Minute
| Timeout: 14 * 24 * time.Hour, | |
| MaxRefresh: 7 * 24 * time.Hour, | |
| Timeout: 24 * time.Hour, // For inter-component authentication, use a longer duration (e.g., 24h). For user authentication, use a shorter duration (e.g., 2h). | |
| MaxRefresh: 12 * time.Hour, |
| Expires time.Time `json:"exp"` | ||
| } | ||
|
|
||
| // internal registry for per-component server keys (temporary until config wiring is complete). |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment mentions that this is "temporary until config wiring is complete", suggesting this is not the final design. However, merging temporary solutions for security-critical features can create technical debt and security risks if the "proper" solution is delayed.
Consider either:
- Implementing the proper config wiring now before merging
- Creating a tracked issue for replacing this with the intended design
- If this IS the intended design, remove the misleading comment
If this global registry is temporary, also ensure there are no memory leaks (keys never removed) and document the transition plan.
| // internal registry for per-component server keys (temporary until config wiring is complete). | |
| // TODO: Temporary global registry for per-component server keys. | |
| // This is a workaround until proper config wiring is implemented. | |
| // See https://github.com/dragonflyoss/Dragonfly2/issues/XXXX for the transition plan. | |
| // WARNING: Do not use in production. This is not the intended design. |
| if token, err := auth.SignHS256(key, claims); err == nil { | ||
| managerDialOptions = append(managerDialOptions, grpc.WithPerRPCCredentials(auth.NewPerRPCCreds(token))) | ||
| } |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Silently ignoring JWT signing errors could allow the scheduler to start without authentication when it should fail. If SignHS256 returns an error (e.g., due to an empty key that bypassed validation), the manager client will be created without authentication credentials, creating a security gap.
Consider logging the error and potentially failing the server initialization:
if token, err := auth.SignHS256(key, claims); err != nil {
logger.Errorf("failed to sign JWT for manager client: %v", err)
return nil, err
} else {
managerDialOptions = append(managerDialOptions, grpc.WithPerRPCCredentials(auth.NewPerRPCCreds(token)))
}| managerDialOptions := []grpc.DialOption{grpc.WithStatsHandler(otelgrpc.NewClientHandler())} | ||
| // Attach JWT per-RPC creds for inter-component calls if a key is provided. | ||
| if key := cfg.Auth.JWT.Key; key != "" { | ||
| claims := auth.DurationClaims(types.SchedulerName, types.ManagerName, 10*time.Minute) |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JWT token is created once during server initialization with a 10-minute TTL, but never refreshed. After 10 minutes, all scheduler->manager RPC calls will fail with authentication errors until the scheduler is restarted.
Consider implementing a token refresh mechanism. Options include:
- Use a longer TTL that matches the scheduler's expected uptime (e.g., use
cfg.Auth.JWT.Timeoutinstead of hardcoded10*time.Minute) - Implement automatic token refresh in
PerRPCCreds.GetRequestMetadata()when the token is near expiration - Use a background goroutine to periodically refresh the token
Example for option 1:
claims := auth.DurationClaims(types.SchedulerName, types.ManagerName, cfg.Auth.JWT.Timeout)| claims := auth.DurationClaims(types.SchedulerName, types.ManagerName, 10*time.Minute) | |
| claims := auth.DurationClaims(types.SchedulerName, types.ManagerName, cfg.Auth.JWT.Timeout) |
| // For now keep false to avoid breaking existing insecure transports; can be tightened later. | ||
| func (c *PerRPCCreds) RequireTransportSecurity() bool { return false } |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RequireTransportSecurity() returns false, which means JWT tokens can be sent over unencrypted connections. This creates a security vulnerability where tokens could be intercepted in plain text.
The comment mentions "avoid breaking existing insecure transports", but authentication credentials should always require transport security. Consider:
- Return
trueto enforce TLS:
func (c *PerRPCCreds) RequireTransportSecurity() bool { return true }- If backward compatibility is essential, add a configuration option and log a security warning when using insecure transport:
type PerRPCCreds struct {
token string
allowInsecure bool
}
func (c *PerRPCCreds) RequireTransportSecurity() bool {
if c.allowInsecure {
logger.Warn("JWT tokens are being sent over insecure transport - this is a security risk")
}
return !c.allowInsecure
}| // For now keep false to avoid breaking existing insecure transports; can be tightened later. | |
| func (c *PerRPCCreds) RequireTransportSecurity() bool { return false } | |
| // Enforce transport security to prevent JWT tokens from being sent over insecure connections. | |
| func (c *PerRPCCreds) RequireTransportSecurity() bool { return true } |
| // Realm name to display to the user, default value is Dragonfly. | ||
| Realm string `yaml:"realm" mapstructure:"realm"` | ||
|
|
||
| // Key is secret key used for signing. Please change the key in production |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JWT signing key is stored in plain text in the configuration file and potentially in environment variables or config management systems. For production environments, sensitive secrets like JWT keys should be protected.
Consider adding documentation or support for:
- Loading keys from external secret managers (HashiCorp Vault, AWS Secrets Manager, etc.)
- File-based key loading with restricted permissions
- Environment variable with a clear security warning in comments
Example documentation in the struct comment:
// Key is secret key used for signing.
// SECURITY: This key must be kept secret and should be loaded from a secure secret store in production.
// Use a strong random key (e.g., 32+ bytes) and rotate it periodically.
// Example: openssl rand -base64 32
Key string `yaml:"key" mapstructure:"key"`| // Key is secret key used for signing. Please change the key in production | |
| // Key is the secret key used for signing JWT tokens. | |
| // SECURITY: This key must be kept secret and should be loaded from a secure secret store in production (e.g., HashiCorp Vault, AWS Secrets Manager, or a file with restricted permissions). | |
| // Do NOT store this key in plain text in configuration files or environment variables in production. | |
| // Use a strong random key (e.g., 32+ bytes) and rotate it periodically. | |
| // Example: openssl rand -base64 32 |
| // internal registry for per-component server keys (temporary until config wiring is complete). | ||
| var serverKeys = map[string]string{} | ||
|
|
||
| // SetServerKey sets the shared signing key for a component's server (e.g., "manager", "scheduler"). | ||
| func SetServerKey(component, key string) { | ||
| serverKeys[component] = key | ||
| } | ||
|
|
||
| // GetServerKey retrieves the key for a component server. | ||
| func GetServerKey(component string) string { | ||
| return serverKeys[component] | ||
| } |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a global map to store JWT keys is not thread-safe and could lead to race conditions if multiple components try to set keys concurrently. Additionally, this approach doesn't scale well for multi-instance deployments or testing scenarios where components need isolation.
Consider one of these alternatives:
- Pass the key directly to the server constructor and store it in the server struct
- If a global registry is needed, protect it with a
sync.RWMutex:
var (
serverKeysMu sync.RWMutex
serverKeys = map[string]string{}
)
func SetServerKey(component, key string) {
serverKeysMu.Lock()
defer serverKeysMu.Unlock()
serverKeys[component] = key
}
func GetServerKey(component string) string {
serverKeysMu.RLock()
defer serverKeysMu.RUnlock()
return serverKeys[component]
}| // UnaryServerJWTInterceptor returns a unary server interceptor that validates JWT in metadata. | ||
| func UnaryServerJWTInterceptor(key string, expectedAudience string) grpc.UnaryServerInterceptor { | ||
| return func(ctx context.Context, req any, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (any, error) { | ||
| if err := validateFromMetadata(ctx, key, expectedAudience); err != nil { | ||
| return nil, err | ||
| } | ||
| return handler(ctx, req) | ||
| } | ||
| } | ||
|
|
||
| // StreamServerJWTInterceptor returns a stream server interceptor that validates JWT in metadata. | ||
| func StreamServerJWTInterceptor(key string, expectedAudience string) grpc.StreamServerInterceptor { | ||
| return func(srv any, ss grpc.ServerStream, info *grpc.StreamServerInfo, handler grpc.StreamHandler) error { | ||
| if err := validateFromMetadata(ss.Context(), key, expectedAudience); err != nil { | ||
| return err | ||
| } | ||
| return handler(srv, ss) | ||
| } | ||
| } |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interceptors will enforce JWT authentication on ALL gRPC endpoints, including health checks and reflection services. This will break health check probes (e.g., Kubernetes liveness/readiness probes) and gRPC reflection tools that typically don't provide authentication.
Consider allowing certain methods to bypass authentication:
func validateFromMetadata(ctx context.Context, key string, expectedAudience string, info interface{}) error {
// Skip auth for health checks and reflection
if method := extractMethodName(info); isPublicMethod(method) {
return nil
}
// ... existing validation logic
}
func isPublicMethod(method string) bool {
publicMethods := []string{
"/grpc.health.v1.Health/",
"/grpc.reflection.v1alpha.ServerReflection/",
"/grpc.reflection.v1.ServerReflection/",
}
for _, prefix := range publicMethods {
if strings.HasPrefix(method, prefix) {
return true
}
}
return false
}Alternatively, use per-service interceptors instead of global ones to exclude health/reflection services.
| // Auth validation | ||
| if cfg.Auth.JWT.Realm == "" { | ||
| return errors.New("jwt requires parameter realm") | ||
| } | ||
| if cfg.Auth.JWT.Key == "" { | ||
| return errors.New("jwt requires parameter key") | ||
| } | ||
| if cfg.Auth.JWT.Timeout == 0 { | ||
| return errors.New("jwt requires parameter timeout") | ||
| } | ||
| if cfg.Auth.JWT.MaxRefresh == 0 { | ||
| return errors.New("jwt requires parameter maxRefresh") | ||
| } |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new scheduler JWT configuration validation is not covered by tests. The config_test.go file has comprehensive validation tests for other parameters but doesn't include test cases for the new Auth.JWT validation rules.
Add test cases to TestConfig_Validate similar to existing patterns:
{
name: "jwt requires parameter key",
config: New(),
mock: func(cfg *Config) {
cfg.Manager = mockManagerConfig
cfg.Database.Redis = mockRedisConfig
cfg.Job = mockJobConfig
cfg.Auth.JWT.Key = ""
},
expect: func(t *testing.T, err error) {
assert := assert.New(t)
assert.EqualError(err, "jwt requires parameter key")
},
},
// Add similar tests for realm, timeout, maxRefresh
gaius-qi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix lint.
Description
This PR implements token-based authentication for all inter-component gRPC requests using JWT (HS256). It enhances security of internal communication between Manager, Scheduler, and other Dragonfly components by enforcing Bearer token validation via gRPC metadata.
Changes:
pkg/rpc/auth/) for HS256 signing and validationauth.jwt(realm, key, timeout, maxRefresh) with defaults and validationauth.jwtconfigurationAuthentication flow:
Authorization: Bearer <token>from gRPC metadataRelated Issue
Fixes #4417
Motivation and Context
Previously, inter-component gRPC calls had no authentication, creating a security gap. This change ensures all internal RPC communication is authenticated using industry-standard JWT tokens, preventing unauthorized access and improving overall system security posture.
The implementation uses HS256 initially for simplicity and performance. Future work may extend to RS256 for asymmetric key scenarios.
Screenshots (if appropriate)
N/A - Backend authentication change
Types of changes
Checklist
Notes: