Skip to content

Conversation

@sabarixr
Copy link

@sabarixr sabarixr commented Dec 6, 2025

Description

This PR implements token-based authentication for all inter-component gRPC requests using JWT (HS256). It enhances security of internal communication between Manager, Scheduler, and other Dragonfly components by enforcing Bearer token validation via gRPC metadata.

Changes:

  • Added shared JWT utilities (pkg/rpc/auth/) for HS256 signing and validation
  • Implemented gRPC unary/stream server interceptors to enforce JWT authentication
  • Added client-side per-RPC credentials to attach JWT tokens to outgoing calls
  • Extended Scheduler config with auth.jwt (realm, key, timeout, maxRefresh) with defaults and validation
  • Manager reuses existing auth.jwt configuration
  • Scheduler->Manager client now signs and attaches JWT with issuer=scheduler, audience=manager

Authentication flow:

  1. Server interceptors extract Authorization: Bearer <token> from gRPC metadata
  2. JWT validated using HS256 with configured key and expected audience
  3. Client attaches signed JWT via PerRPCCredentials on dial

Related Issue

Fixes #4417

Motivation and Context

Previously, inter-component gRPC calls had no authentication, creating a security gap. This change ensures all internal RPC communication is authenticated using industry-standard JWT tokens, preventing unauthorized access and improving overall system security posture.

The implementation uses HS256 initially for simplicity and performance. Future work may extend to RS256 for asymmetric key scenarios.

Screenshots (if appropriate)

N/A - Backend authentication change

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation Update (if none of the other choices apply)

Checklist

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.

Notes:

  • I will update the document in the next commit after knowing the changes are what needed.
  • All components build successfully with no lint/vet errors
  • Config validation ensures JWT key is non-empty at startup

Copilot AI review requested due to automatic review settings December 6, 2025 16:21
@sabarixr sabarixr force-pushed the feature/grpc-jwt-auth branch from 9c59d1e to b90c990 Compare December 6, 2025 16:24
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements JWT-based authentication for inter-component gRPC communication between Manager, Scheduler, and other Dragonfly components using HS256 signing. The implementation adds shared authentication utilities, server-side validation interceptors, client-side credential attachment, and extends configuration to support JWT settings. However, there are several critical issues that need to be addressed before merging, including token expiration without refresh, mandatory health check authentication breaking probes, and breaking changes to existing deployments.

Key Changes:

  • New JWT authentication package with HS256 signing/validation utilities
  • gRPC interceptors for server-side token validation on all requests
  • Client-side per-RPC credentials to attach Bearer tokens
  • Extended scheduler config with JWT auth settings and validation

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
pkg/rpc/auth/jwt.go Core JWT signing and validation logic using HS256, includes Claims structure and global key registry
pkg/rpc/auth/interceptors.go gRPC server interceptors (unary and stream) to enforce JWT validation from metadata
pkg/rpc/auth/credentials.go Client-side PerRPCCredentials implementation to attach Bearer tokens to outgoing calls
scheduler/config/config.go Adds AuthConfig and JWTConfig structures with defaults and validation rules
scheduler/scheduler.go Attaches JWT credentials to manager client during initialization
scheduler/rpcserver/rpcserver.go Sets server JWT key for scheduler component
pkg/rpc/scheduler/server/server.go Adds JWT interceptors to scheduler gRPC server chain
pkg/rpc/manager/server/server.go Adds JWT interceptors to manager gRPC server chain
manager/rpcserver/rpcserver.go Sets server JWT key for manager component
go.mod Adds golang-jwt/jwt/v5 dependency
go.sum Dependency checksums for jwt/v5 library

Comment on lines +33 to +51
func validateFromMetadata(ctx context.Context, key string, expectedAudience string) error {
md, ok := metadata.FromIncomingContext(ctx)
if !ok {
return status.Error(codes.Unauthenticated, "missing metadata")
}
vals := md.Get("authorization")
if len(vals) == 0 {
return status.Error(codes.Unauthenticated, "missing authorization")
}
parts := strings.Fields(vals[0])
if len(parts) != 2 || !strings.EqualFold(parts[0], "Bearer") {
return status.Error(codes.Unauthenticated, "invalid authorization header")
}
token := parts[1]
if _, err := ValidateHS256(key, token, expectedAudience); err != nil {
return status.Error(codes.Unauthenticated, err.Error())
}
return nil
}
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the interceptor validates an empty JWT key (returned by GetServerKey), the ValidateHS256 function will return "jwt: empty validation key" error. This will cause all requests to fail with an authentication error, but the error message won't clearly indicate the root cause (missing server configuration).

Consider checking if the key is empty before attempting validation and providing a clearer error message:

func validateFromMetadata(ctx context.Context, key string, expectedAudience string) error {
    if key == "" {
        // JWT auth is disabled when no key is configured
        return nil
    }
    md, ok := metadata.FromIncomingContext(ctx)
    // ... rest of validation
}

This would make JWT truly optional and provide better debugging when misconfigured.

Copilot uses AI. Check for mistakes.
Comment on lines +412 to +413
Timeout: 14 * 24 * time.Hour,
MaxRefresh: 7 * 24 * time.Hour,
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default JWT timeout of 14 days and max refresh of 7 days seem inconsistent with the inter-component authentication use case. For service-to-service authentication, tokens are typically long-lived or refreshed automatically. The 10-minute hardcoded TTL in scheduler.go:104 conflicts with these defaults.

Consider:

  1. Using different defaults more appropriate for inter-component auth (e.g., 24 hours for Timeout)
  2. Adding comments explaining the intended use case:
// Timeout is duration that a jwt token is valid.
// For inter-component authentication, use a longer duration (e.g., 24h)
// For user authentication, use a shorter duration (e.g., 2h)
Timeout time.Duration `yaml:"timeout" mapstructure:"timeout"`
  1. Making the scheduler use the configured Timeout instead of hardcoded 10*time.Minute
Suggested change
Timeout: 14 * 24 * time.Hour,
MaxRefresh: 7 * 24 * time.Hour,
Timeout: 24 * time.Hour, // For inter-component authentication, use a longer duration (e.g., 24h). For user authentication, use a shorter duration (e.g., 2h).
MaxRefresh: 12 * time.Hour,

Copilot uses AI. Check for mistakes.
Expires time.Time `json:"exp"`
}

// internal registry for per-component server keys (temporary until config wiring is complete).
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment mentions that this is "temporary until config wiring is complete", suggesting this is not the final design. However, merging temporary solutions for security-critical features can create technical debt and security risks if the "proper" solution is delayed.

Consider either:

  1. Implementing the proper config wiring now before merging
  2. Creating a tracked issue for replacing this with the intended design
  3. If this IS the intended design, remove the misleading comment

If this global registry is temporary, also ensure there are no memory leaks (keys never removed) and document the transition plan.

Suggested change
// internal registry for per-component server keys (temporary until config wiring is complete).
// TODO: Temporary global registry for per-component server keys.
// This is a workaround until proper config wiring is implemented.
// See https://github.com/dragonflyoss/Dragonfly2/issues/XXXX for the transition plan.
// WARNING: Do not use in production. This is not the intended design.

Copilot uses AI. Check for mistakes.
Comment on lines +105 to +107
if token, err := auth.SignHS256(key, claims); err == nil {
managerDialOptions = append(managerDialOptions, grpc.WithPerRPCCredentials(auth.NewPerRPCCreds(token)))
}
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silently ignoring JWT signing errors could allow the scheduler to start without authentication when it should fail. If SignHS256 returns an error (e.g., due to an empty key that bypassed validation), the manager client will be created without authentication credentials, creating a security gap.

Consider logging the error and potentially failing the server initialization:

if token, err := auth.SignHS256(key, claims); err != nil {
    logger.Errorf("failed to sign JWT for manager client: %v", err)
    return nil, err
} else {
    managerDialOptions = append(managerDialOptions, grpc.WithPerRPCCredentials(auth.NewPerRPCCreds(token)))
}

Copilot uses AI. Check for mistakes.
managerDialOptions := []grpc.DialOption{grpc.WithStatsHandler(otelgrpc.NewClientHandler())}
// Attach JWT per-RPC creds for inter-component calls if a key is provided.
if key := cfg.Auth.JWT.Key; key != "" {
claims := auth.DurationClaims(types.SchedulerName, types.ManagerName, 10*time.Minute)
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JWT token is created once during server initialization with a 10-minute TTL, but never refreshed. After 10 minutes, all scheduler->manager RPC calls will fail with authentication errors until the scheduler is restarted.

Consider implementing a token refresh mechanism. Options include:

  1. Use a longer TTL that matches the scheduler's expected uptime (e.g., use cfg.Auth.JWT.Timeout instead of hardcoded 10*time.Minute)
  2. Implement automatic token refresh in PerRPCCreds.GetRequestMetadata() when the token is near expiration
  3. Use a background goroutine to periodically refresh the token

Example for option 1:

claims := auth.DurationClaims(types.SchedulerName, types.ManagerName, cfg.Auth.JWT.Timeout)
Suggested change
claims := auth.DurationClaims(types.SchedulerName, types.ManagerName, 10*time.Minute)
claims := auth.DurationClaims(types.SchedulerName, types.ManagerName, cfg.Auth.JWT.Timeout)

Copilot uses AI. Check for mistakes.
Comment on lines +26 to +27
// For now keep false to avoid breaking existing insecure transports; can be tightened later.
func (c *PerRPCCreds) RequireTransportSecurity() bool { return false }
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RequireTransportSecurity() returns false, which means JWT tokens can be sent over unencrypted connections. This creates a security vulnerability where tokens could be intercepted in plain text.

The comment mentions "avoid breaking existing insecure transports", but authentication credentials should always require transport security. Consider:

  1. Return true to enforce TLS:
func (c *PerRPCCreds) RequireTransportSecurity() bool { return true }
  1. If backward compatibility is essential, add a configuration option and log a security warning when using insecure transport:
type PerRPCCreds struct {
    token string
    allowInsecure bool
}

func (c *PerRPCCreds) RequireTransportSecurity() bool { 
    if c.allowInsecure {
        logger.Warn("JWT tokens are being sent over insecure transport - this is a security risk")
    }
    return !c.allowInsecure 
}
Suggested change
// For now keep false to avoid breaking existing insecure transports; can be tightened later.
func (c *PerRPCCreds) RequireTransportSecurity() bool { return false }
// Enforce transport security to prevent JWT tokens from being sent over insecure connections.
func (c *PerRPCCreds) RequireTransportSecurity() bool { return true }

Copilot uses AI. Check for mistakes.
// Realm name to display to the user, default value is Dragonfly.
Realm string `yaml:"realm" mapstructure:"realm"`

// Key is secret key used for signing. Please change the key in production
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JWT signing key is stored in plain text in the configuration file and potentially in environment variables or config management systems. For production environments, sensitive secrets like JWT keys should be protected.

Consider adding documentation or support for:

  1. Loading keys from external secret managers (HashiCorp Vault, AWS Secrets Manager, etc.)
  2. File-based key loading with restricted permissions
  3. Environment variable with a clear security warning in comments

Example documentation in the struct comment:

// Key is secret key used for signing. 
// SECURITY: This key must be kept secret and should be loaded from a secure secret store in production.
// Use a strong random key (e.g., 32+ bytes) and rotate it periodically.
// Example: openssl rand -base64 32
Key string `yaml:"key" mapstructure:"key"`
Suggested change
// Key is secret key used for signing. Please change the key in production
// Key is the secret key used for signing JWT tokens.
// SECURITY: This key must be kept secret and should be loaded from a secure secret store in production (e.g., HashiCorp Vault, AWS Secrets Manager, or a file with restricted permissions).
// Do NOT store this key in plain text in configuration files or environment variables in production.
// Use a strong random key (e.g., 32+ bytes) and rotate it periodically.
// Example: openssl rand -base64 32

Copilot uses AI. Check for mistakes.
Comment on lines +18 to +29
// internal registry for per-component server keys (temporary until config wiring is complete).
var serverKeys = map[string]string{}

// SetServerKey sets the shared signing key for a component's server (e.g., "manager", "scheduler").
func SetServerKey(component, key string) {
serverKeys[component] = key
}

// GetServerKey retrieves the key for a component server.
func GetServerKey(component string) string {
return serverKeys[component]
}
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a global map to store JWT keys is not thread-safe and could lead to race conditions if multiple components try to set keys concurrently. Additionally, this approach doesn't scale well for multi-instance deployments or testing scenarios where components need isolation.

Consider one of these alternatives:

  1. Pass the key directly to the server constructor and store it in the server struct
  2. If a global registry is needed, protect it with a sync.RWMutex:
var (
    serverKeysMu sync.RWMutex
    serverKeys = map[string]string{}
)

func SetServerKey(component, key string) {
    serverKeysMu.Lock()
    defer serverKeysMu.Unlock()
    serverKeys[component] = key
}

func GetServerKey(component string) string {
    serverKeysMu.RLock()
    defer serverKeysMu.RUnlock()
    return serverKeys[component]
}

Copilot uses AI. Check for mistakes.
Comment on lines +13 to +31
// UnaryServerJWTInterceptor returns a unary server interceptor that validates JWT in metadata.
func UnaryServerJWTInterceptor(key string, expectedAudience string) grpc.UnaryServerInterceptor {
return func(ctx context.Context, req any, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (any, error) {
if err := validateFromMetadata(ctx, key, expectedAudience); err != nil {
return nil, err
}
return handler(ctx, req)
}
}

// StreamServerJWTInterceptor returns a stream server interceptor that validates JWT in metadata.
func StreamServerJWTInterceptor(key string, expectedAudience string) grpc.StreamServerInterceptor {
return func(srv any, ss grpc.ServerStream, info *grpc.StreamServerInfo, handler grpc.StreamHandler) error {
if err := validateFromMetadata(ss.Context(), key, expectedAudience); err != nil {
return err
}
return handler(srv, ss)
}
}
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interceptors will enforce JWT authentication on ALL gRPC endpoints, including health checks and reflection services. This will break health check probes (e.g., Kubernetes liveness/readiness probes) and gRPC reflection tools that typically don't provide authentication.

Consider allowing certain methods to bypass authentication:

func validateFromMetadata(ctx context.Context, key string, expectedAudience string, info interface{}) error {
    // Skip auth for health checks and reflection
    if method := extractMethodName(info); isPublicMethod(method) {
        return nil
    }
    // ... existing validation logic
}

func isPublicMethod(method string) bool {
    publicMethods := []string{
        "/grpc.health.v1.Health/",
        "/grpc.reflection.v1alpha.ServerReflection/",
        "/grpc.reflection.v1.ServerReflection/",
    }
    for _, prefix := range publicMethods {
        if strings.HasPrefix(method, prefix) {
            return true
        }
    }
    return false
}

Alternatively, use per-service interceptors instead of global ones to exclude health/reflection services.

Copilot uses AI. Check for mistakes.
Comment on lines +575 to +587
// Auth validation
if cfg.Auth.JWT.Realm == "" {
return errors.New("jwt requires parameter realm")
}
if cfg.Auth.JWT.Key == "" {
return errors.New("jwt requires parameter key")
}
if cfg.Auth.JWT.Timeout == 0 {
return errors.New("jwt requires parameter timeout")
}
if cfg.Auth.JWT.MaxRefresh == 0 {
return errors.New("jwt requires parameter maxRefresh")
}
Copy link

Copilot AI Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new scheduler JWT configuration validation is not covered by tests. The config_test.go file has comprehensive validation tests for other parameters but doesn't include test cases for the new Auth.JWT validation rules.

Add test cases to TestConfig_Validate similar to existing patterns:

{
    name:   "jwt requires parameter key",
    config: New(),
    mock: func(cfg *Config) {
        cfg.Manager = mockManagerConfig
        cfg.Database.Redis = mockRedisConfig
        cfg.Job = mockJobConfig
        cfg.Auth.JWT.Key = ""
    },
    expect: func(t *testing.T, err error) {
        assert := assert.New(t)
        assert.EqualError(err, "jwt requires parameter key")
    },
},
// Add similar tests for realm, timeout, maxRefresh

Copilot uses AI. Check for mistakes.
Copy link
Member

@gaius-qi gaius-qi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix lint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Token-Based Authentication for gRPC Requests

4 participants