Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
In
send_notification_batch, we start withself.connect()to self recover from potential connection related errors:PyAPNs2/apns2/client.py
Lines 187 to 189 in 5e4a938
However, it is not enough, because the self recovery is done through:
APNsClient.connect()->APNsClient._connection.connect()-> https://github.com/python-hyper/hyper/blob/b77e758f472f00b098481e3aa8651b0808524d84/hyper/http20/connection.py#L331-L347, which is basically:In the cases where
self._sockis in a weird state, this call sequence does not help with resetting the connection states, asself._sockdoes still hold a value despite no longer being able to actually doing meaningful work. As a result, after callingAPNsClient.connect(), we still end up using a broken socket. And that scenario is not recoverable.What we have observed in production was that the
send_notification_batch()call just retries 3 times then raisesConnectionFailed💀 .Approach
This PR is an attempt to fix that, by forcing a full connection reset from the second retry onwards. We are currently running with a hack in production, by basically doing:
And this did work for us. The unrecoverable
ConnectionFailednever happened to us after this was deployed.Some thoughts
I still left the first try to be a "soft" connection reboot so that some other scenario might recover/reset faster, and also that keeps the first retry the same as the current code. But for sure we can change that to be a full reconnection too if you like 😃