Skip to content

Commit 80e2e27

Browse files
committed
Restart FPM completely in case of timeouts
Follow-up of #1133 This is because sending SIGUSR2 to FPM (the previously implemented solution) did not really stop with 100% certainty the PHP script that timed out. Indeed, it merely interrupted the currently blocked call (e.g. a sleep, a DB call, etc.), flushed the logs and carried on. My guess is that this could have caused the PHP script to continue to run in some cases, possibly running into yet another timeout on a next line (e.g. another DB call). This PR fixes the timeout test that wasn't really working (🤦) and restarts FPM completely in case of timeout. That is confirmed to completely stop the execution of the timed out script + flush the logs to stderr.
1 parent 7c124bd commit 80e2e27

3 files changed

Lines changed: 29 additions & 12 deletions

File tree

src/Event/Http/FpmHandler.php

Lines changed: 23 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -93,7 +93,11 @@ public function start(): void
9393
public function stop(): void
9494
{
9595
if ($this->fpm && $this->fpm->isRunning()) {
96-
$this->fpm->stop(2);
96+
// Give it less than a second to stop (500ms should be plenty enough time)
97+
// this is for the case where the script timed out: we reserve 1 second before the end
98+
// of the Lambda timeout, so we must kill everything and restart FPM in 1 second.
99+
// Note: Symfony will first try sending SIGTERM (15) and then SIGKILL (9)
100+
$this->fpm->stop(0.5);
97101
if ($this->isReady()) {
98102
throw new Exception('PHP-FPM cannot be stopped');
99103
}
@@ -122,12 +126,23 @@ public function handleRequest(HttpRequestEvent $event, Context $context): HttpRe
122126

123127
$response = $this->client->readResponse($socketId, $timeoutDelayInMs);
124128
} catch (TimedoutException $e) {
125-
// Send a SIGUSR2 signal to PHP-FPM
126-
// That causes FPM to reload all workers, which allows us to cleanly stop the FPM worker that is stuck in a timeout/waiting state.
127-
// A (intentional) side-effect is that it causes all worker logs buffered by FPM to be written to stderr.
128-
// Without that, all logs written by the PHP script are never written to stderr (and thus never logged to CloudWatch).
129-
// This takes a bit of time (a few ms), but it's still faster than rebooting FPM entirely.
130-
posix_kill($this->fpm->getPid(), SIGUSR2);
129+
echo "The PHP script timed out. Bref will now restart PHP-FPM to start from a clean slate and flush the PHP logs.\nTimeouts can happen for example when trying to connect to a remote API or database, if this happens continuously check for those.\nIf you are using a RDS database, read this: https://bref.sh/docs/environment/database.html#accessing-the-internet\n";
130+
131+
/**
132+
* Restart FPM so that the blocked script is 100% terminated and that its logs are flushed to stderr.
133+
*
134+
* - "why restart FPM?": if we don't, the previous request continues to execute on the next request
135+
* - "why not send a SIGUSR2 signal to FPM?": that was a promising approach because SIGUSR2
136+
* causes FPM to cleanly stop the FPM worker that is stuck in a timeout/waiting state.
137+
* It also causes all worker logs buffered by FPM to be written to stderr (great!).
138+
* This takes a bit of time (a few ms), but it's faster than rebooting FPM entirely.
139+
* However, the downside is that it doesn't "kill" the previous request execution:
140+
* it merely stops the execution of the line of code that is waiting (e.g. "sleep()",
141+
* "file_get_contents()", ...) and continues to the next line. That's super weird!
142+
* So SIGUSR2 isn't a great solution in the end.
143+
*/
144+
$this->stop();
145+
$this->start();
131146

132147
// Throw an exception so that:
133148
// - this is reported as a Lambda execution error ("error rate" metrics are accurate)
@@ -136,7 +151,7 @@ public function handleRequest(HttpRequestEvent $event, Context $context): HttpRe
136151
throw new Timeout($timeoutDelayInMs);
137152
} catch (Throwable $e) {
138153
printf(
139-
"Error communicating with PHP-FPM to read the HTTP response. A root cause of this can be that the Lambda (or PHP) timed out, for example when trying to connect to a remote API or database, if this happens continuously check for those! Bref will restart PHP-FPM now. Original exception message: %s %s\n",
154+
"Error communicating with PHP-FPM to read the HTTP response. Bref will restart PHP-FPM now. Original exception message: %s %s\n",
140155
get_class($e),
141156
$e->getMessage()
142157
);

tests/Handler/PhpFpm/php-fpm.conf

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,3 @@ catch_workers_output = yes
1818
; Limit the number of core dump logs to 1 to avoid filling up the /tmp disk
1919
; See https://github.com/brefphp/bref/issues/275
2020
rlimit_core = 1
21-
22-
; Very short timeout to be able to test timeouts without having a very long test suite
23-
request_terminate_timeout = 1

tests/Handler/PhpFpm/timeout.php

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,4 +2,9 @@
22

33
error_log('This is a log message');
44

5-
sleep((int) ($_GET['timeout'] ?? 10));
5+
$timeout = (int) ($_GET['timeout'] ?? 10);
6+
$result = sleep($timeout);
7+
8+
if ($result && $timeout > 0) {
9+
throw new Exception('The execution continued after sleep was interrupted!');
10+
}

0 commit comments

Comments
 (0)