Handle timeouts more gracefully by allowing the application to shutdown#895
Handle timeouts more gracefully by allowing the application to shutdown#895Nyholm wants to merge 9 commits intobrefphp:masterfrom
Conversation
Isn't every connection timeout an IO operation that PHP is waiting?
This seems to me like a strong enough limitation that may indicate it's not worth it to support it? It does bring a nice idea to the table though: what if we offer a Debugging Layer where you can write a basic function invocation and run the code that's getting timeout from it so that you might get a better idea where it's stopping? |
|
Hey, I wonder if something similar could be used in the PHP-FPM worker (the process handling the request)? That could be done as a Symfony/Laravel package? I'm not familiar enough with signals like that to know if that could work. Also another solution is to find a way for logs to be streamed correctly. Or else, find a way to force shorter timeouts on all connections in all frameworks, but I have no idea if that's even remotely possible. |
Yeah.. Each curl request, file load or
Im not sure.. I strongly believe in a future were all PHP applications are using a runtime. See #889.
Probably, yes.
Yes, it is called runtime/bref =)
I dont think that is possible. Or.. there is a difference between application logs and FPM logs. For the FPM logs => I have no idea.
I like the idea, But it does not solve the problem where 1/10000 request to my live application times out and I want to find out what happens. If the "debug layer" is not production ready, the I would not be happy adding it to my production environment. |
|
I reorganized things. I've updated I think this is a good way forward. I would like to get another review before I add tests and docs. |
|
You may also control this behavior with |
|
A PHP-FPM user would have to write something like: if (isset($_SERVER['LAMBDA_TASK_ROOT'])) {
Bref\Timeout\Timeout::enable();
}Preferably as soon as possible, ie |
|
Thanks, following our discussions I've opened Nyholm#3 to remove the option of configuring the duration timeout via an environment variable. |
* Always base the timeout duration on the configured Lambda timeout * Make sure that we can never enable the Bref timeout in the FPM bootstrap process * Remove unnecessary private method `timeoutAfter()` * Allow using falsy values for `BREF_FEATURE_TIMEOUT`, e.g. `0` or `''`, to allow disabling the feature * Cleanup global state * Improve tests following @Nyholm's feedback
|
The support PR is merged. This PR is now ready for a final rounds of review. |
Nyholm
left a comment
There was a problem hiding this comment.
Im 👍
We don't mention that the logs from php-fpm also will be flushed, but that is a technical detail I think it is okey to skip.
| error_log('Original stack trace'); | ||
| error_log(Timeout::$stackTrace); | ||
| Timeout::$stackTrace = null; | ||
| exit(1); |
There was a problem hiding this comment.
Should we force a 500 status code before exit?
That would only be taken into account in the FPM layer, so that wouldn't really impact the php-XX layer.
There was a problem hiding this comment.
How?
And wouldn't that effect my AWS::ApiGateway::GatewayResponse that triggers on timeouts or 500?
There was a problem hiding this comment.
http_response_code(500);And wouldn't that effect my AWS::ApiGateway::GatewayResponse that triggers on timeouts or 500?
🤔 not sure about that, maybe. But in any case, the same would happen if your app returns a pretty 500 error page.
Does that make sense? What I mean is that, when using FPM, there would be no difference between:
- Symfony catches any exception and displays an error page with a 500 status code
- Bref timeout kicks in and forces a 500 status code
In both cases, API Gateway would behave the same way.
There was a problem hiding this comment.
Oh, no that is not true.
Symfony catches any exception and displays an error page with a 500 status code
This will show you a nice error page. API Gateway does nothing
Bref timeout kicks in and forces a 500 status code
No content(HTML) is provided. I would like the custom error page in API Gateway to be used.
Im not sure if http_response_code(500) would make any difference.
Also note. We are not sure it is a HTTP context. It could be any other event.. So Im not sure about sending a HTTP status code.
| error_log('Lambda timed out'); | ||
| error_log((new LambdaTimeout)->getTraceAsString()); | ||
| error_log('Original stack trace'); | ||
| error_log(Timeout::$stackTrace); |
There was a problem hiding this comment.
For the record, I'm testing and notice that error_log adds noise in FPM:
NOTICE: PHP message: Lambda timed out
NOTICE: PHP message: #0 /var/task/demo/http.php(16): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}
NOTICE: PHP message: Original stack trace
NOTICE: PHP message: #0 /var/task/demo/http.php(9): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}
I'm changing that to:
$stderr = fopen('php://stderr', 'w');
fwrite($stderr, 'Lambda timed out' . PHP_EOL);
fwrite($stderr, (new LambdaTimeout)->getTraceAsString() . PHP_EOL);
fwrite($stderr, 'Original stack trace' . PHP_EOL);
fwrite($stderr, Timeout::$stackTrace . PHP_EOL);
fclose($stderr);Which gives a cleaner log:
Lambda timed out
#0 /var/task/demo/http.php(16): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}
Original stack trace
#0 /var/task/demo/http.php(9): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}
No need to update the PR
There was a problem hiding this comment.
Oh, I didnt know.
Thank you for testing.
|
OK following discussion through Slack, we were happy with the PR. However I tested and I'm getting weird results in FPM too:
I've been testing locally with PHP-FPM to pinpoint this, but my guess currently is that pcntl signals may not work well with FPM and its worker pool. Maybe we first set a signal handler on the process, and it's not correctly cleaned up by FPM on the next request cycle? I tried cleaning up manually the handlers, no changes. I tried putting the cleanup in a How to reproduce locally:
E.g. in my case: if (isset($_GET['sleep'])) {
echo 'Sleeping';
try {
sleep(6);
} catch (Throwable $e) {
echo 'CAUGHT!';
}
echo 'Sleeping';
sleep(2);
}
|
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug. With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection). If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which: - flushes the logs (logs end up in CloudWatch, which is great) - restarts a clean FPM worker, without doing a full FPM restart (which may take longer) Follow up of #770, #772, #895 May address some of #862 Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug. With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection). If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which: - flushes the logs (logs end up in CloudWatch, which is great) - restarts a clean FPM worker, without doing a full FPM restart (which may take longer) Follow up of #770, #772, #895 May address some of #862 Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
|
Since this wasn't working with FPM I went with a different route: #1133 This PR might still be worth following for the Function layer, so I'll leave it open. |
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug. With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection). If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which: - flushes the logs (logs end up in CloudWatch, which is great) - restarts a clean FPM worker, without doing a full FPM restart (which may take longer) Follow up of #770, #772, #895 May address some of #862 Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug. With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection). If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which: - flushes the logs (logs end up in CloudWatch, which is great) - restarts a clean FPM worker, without doing a full FPM restart (which may take longer) Follow up of #770, #772, #895 May address some of #862 Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
This PR builds on the work started by @t-geindre in #55.
If we are 1 second away from Lambda shutting down, then thrown an exception to let the application shutdown and (more importantly) flush the logs.
There are two things Im not super happy with:
This does not work for PHP-FPM since it the application run in a child processThe question is: do we want to continue down this path?
It would solve issues like #873 and #862.
TODO