Fix: CloudWatch/Watchtower logs dropped in Task SDK due to handler lifetime bugs#66633
Open
korex-f wants to merge 2 commits intoapache:mainfrom
Open
Fix: CloudWatch/Watchtower logs dropped in Task SDK due to handler lifetime bugs#66633korex-f wants to merge 2 commits intoapache:mainfrom
korex-f wants to merge 2 commits intoapache:mainfrom
Conversation
…etime bugs Three related bugs all produce WatchtowerWarning: "Received message after logging system shutdown", causing CloudWatch log streams to be truncated or never created. Bug 1 (task-sdk/src/airflow/sdk/log.py): In configure_logging(), remote.processors was accessed before the inner configure_logging() call ran dictConfig(). dictConfig() calls _clearExistingHandlers() -> logging.shutdown() on all existing handlers, killing the just-created watchtower handler (shutting_down=True) before any task log is emitted. Fix: move getattr(remote, "processors") to after dictConfig runs, injecting via a second structlog.configure() call. Bug 2 (providers/amazon/.../cloudwatch_task_handler.py): CloudWatchRemoteLogIO.handler was a cached_property — once killed by logging.shutdown() the dead instance was never replaced. The processors cached_property also captured _handler in a closure, pinning the dead instance. Fix: convert handler to a regular property that recreates when shutting_down=True; access self.handler dynamically in the processors closure. Bug 3 (task-sdk/src/airflow/sdk/execution_time/supervisor.py): _configure_logging() returned a plain tuple with no cleanup path. The only code that ever closed the remote handler was logging.shutdown() at process exit, which fired after supervise_task() returned while final task messages were still in flight. Fix: convert _configure_logging() to a contextmanager that explicitly flushes and closes the handler after process.wait() drains all task log messages. Fixes: apache#66475
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Three related bugs all produce
WatchtowerWarning: "Received message after logging system shutdown", causing CloudWatch log streams to be truncated or never created.Bug 1 (
task-sdk/src/airflow/sdk/log.py):In
configure_logging(),remote.processorswas accessed before the innerconfigure_logging()call randictConfig().dictConfig()calls_clearExistingHandlers()→logging.shutdown()on all existing handlers, killing the just-created watchtower handler (shutting_down=True) before any task log is emitted. Fix: movegetattr(remote, "processors")to afterdictConfigruns, injecting via a secondstructlog.configure()call.Bug 2 (
providers/amazon/.../cloudwatch_task_handler.py):CloudWatchRemoteLogIO.handlerwas acached_property— once killed bylogging.shutdown()the dead instance was never replaced. Theprocessorscached_propertyalso captured_handlerin a closure, pinning the dead instance. Fix: converthandlerto a regular property that recreates whenshutting_down=True; accessself.handlerdynamically in theprocessorsclosure.Bug 3 (
task-sdk/src/airflow/sdk/execution_time/supervisor.py):_configure_logging()returned a plain tuple with no cleanup path. The only code that ever closed the remote handler waslogging.shutdown()at process exit, which fired aftersupervise_task()returned while final task messages were still in flight. Fix: convert_configure_logging()to acontextmanagerthat explicitly flushes and closes the handler afterprocess.wait()drains all task log messages.closes: #66475