feat: Update default timeout tiers on non-storage resource clients #664
Previously only storage clients (dataset, key_value_store, request_queue) had per-endpoint timeout tiers. All other resource clients defaulted to `'long'` (360 s) for every method, even simple metadata CRUD operations. This assigns appropriate tiers based on operation complexity:

- short (5 s): simple get/update/delete on actors, tasks, schedules, webhooks, users, env vars, versions, and storage collection `get_or_create`
- medium (30 s): list operations, create operations, triggering runs/builds (`start`, `build`, `metamorph`, `resurrect`, `reboot`), schedule `get_log`, webhook `test`
- no_timeout: `actor.call()` and `task.call()`, which wait indefinitely for run completion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
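The tiering above can be summed up as a simple mapping. The sketch below is illustrative only (the names are hypothetical, not the actual apify-client source):

```python
from typing import Optional

# Hypothetical mapping of the tiers described above; illustrative only,
# not the actual apify-client internals.
TIMEOUT_TIERS: dict[str, Optional[int]] = {
    "short": 5,          # simple get/update/delete
    "medium": 30,        # list/create/trigger operations
    "long": 360,         # data-heavy and log operations
    "no_timeout": None,  # e.g. actor.call(), which waits for run completion
}

def resolve_timeout(tier: str) -> Optional[int]:
    """Return the default timeout in seconds for a tier, or None for no timeout."""
    return TIMEOUT_TIERS[tier]
```

Representing `no_timeout` as `None` matches the common HTTP-client convention of disabling the timeout when none is given.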
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master     #664      +/-   ##
==========================================
+ Coverage   94.83%   94.90%   +0.06%
==========================================
  Files          45       45
  Lines        4340     4340
==========================================
+ Hits         4116     4119      +3
+ Misses        224      221      -3
```

Flags with carried forward coverage won't be shown.
Seems like a legit change, but I don't think I can provide any useful feedback beyond that of the other reviewers.
tobice left a comment:
I skimmed the PR description — the tiering approach makes sense to me. So if you applied it consistently across the endpoints, I see no problem in that (and don't feel like reviewing them one by one — unless you really want me to 😁 ).
If you want the timeouts to be really accurate, here is an idea: You could try pulling actual request duration from Grafana for each endpoint and set the timeout based on that.
Btw, just looking at it, the 99th percentile is usually ~250 ms. So even the shortest timeout is pretty comfortable.
Goal
Update default timeout tiers on individual resource client methods.
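A stdlib-only sketch of what per-method defaults mean in practice. All names here are hypothetical stand-ins, not the real `ActorClient` implementation:

```python
from typing import Optional

# Hypothetical tier table; not the actual apify-client source.
_DEFAULT_TIMEOUTS: dict[str, Optional[int]] = {
    "short": 5, "medium": 30, "long": 360, "no_timeout": None,
}

class ActorClientSketch:
    """Hypothetical stand-in for a resource client with tiered defaults."""

    def _call_api(self, method: str, tier: str,
                  timeout_secs: Optional[int] = None) -> dict:
        # An explicit caller-supplied timeout always overrides the tier default.
        effective = timeout_secs if timeout_secs is not None else _DEFAULT_TIMEOUTS[tier]
        return {"method": method, "timeout_secs": effective}

    def get(self, timeout_secs: Optional[int] = None) -> dict:
        # Simple metadata read: 'short' tier instead of the old blanket 'long'.
        return self._call_api("GET", "short", timeout_secs)

    def start(self, timeout_secs: Optional[int] = None) -> dict:
        # Triggering a run: 'medium' tier.
        return self._call_api("POST", "medium", timeout_secs)

    def call(self, timeout_secs: Optional[int] = None) -> dict:
        # Waits for run completion: no timeout at all.
        return self._call_api("POST", "no_timeout", timeout_secs)
```

The point of the sketch is the precedence rule: each method carries its own tier, and an explicit `timeout_secs` argument still wins over any default.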
Tier philosophy
- short (5 s): simple get/update/delete operations
- medium (30 s): list, create, and trigger operations
- long (360 s): data-heavy and log operations
- no_timeout: operations that wait for run completion (`wait_for_finish`, `call`, streaming logs)

Current state
Only storage clients (Dataset, KVS, RQ) were tiered in PR #653. Everything else defaults to `'long'`, meaning even a simple `actor.get()` has a 360 s timeout.

Changes description
ActorClient
- `get()`, `update()`, `delete()`, `start()`, `call()`, `build()`, `default_build()`, `last_run()`, `validate_input()`

ActorCollectionClient
- `list()`, `create()`

ActorVersionClient
- `get()`, `update()`, `delete()`

ActorVersionCollectionClient
- `list()`, `create()`

ActorEnvVarClient
- `get()`, `update()`, `delete()`

ActorEnvVarCollectionClient
- `list()`, `create()`

BuildClient
- `get()`, `delete()`, `abort()`, `get_open_api_definition()`, `wait_for_finish()`

BuildCollectionClient
- `list()`

RunClient
- `get()`, `update()`, `delete()`, `abort()`, `wait_for_finish()`, `metamorph()`, `resurrect()`, `reboot()`, `get_streamed_log()`, `charge()`, `get_status_message_watcher()`

RunCollectionClient
- `list()`

TaskClient
- `get()`, `update()`, `delete()`, `start()`, `call()`, `get_input()`, `update_input()`, `last_run()`

TaskCollectionClient
- `list()`, `create()`

ScheduleClient
- `get()`, `update()`, `delete()`, `get_log()`

ScheduleCollectionClient
- `list()`, `create()`

WebhookClient
- `get()`, `update()`, `delete()`, `test()`

WebhookCollectionClient
- `list()`, `create()`

WebhookDispatchClient
- `get()`

WebhookDispatchCollectionClient
- `list()`

StoreCollectionClient
- `list()`

LogClient
- `get()`, `get_as_bytes()`, `stream()`

UserClient
- `get()`, `monthly_usage()`, `limits()`, `update_limits()`

Storage Collection Clients (currently all `long`)
- `DatasetCollectionClient.list()`, `DatasetCollectionClient.get_or_create()`
- `KeyValueStoreCollectionClient.list()`, `KeyValueStoreCollectionClient.get_or_create()`
- `RequestQueueCollectionClient.list()`, `RequestQueueCollectionClient.get_or_create()`

Summary of Changes
- `long` → `short` (simple CRUD on non-storage resources)
- `long` → `medium` (list operations, creates, trigger operations)
- `long` → `no_timeout` (`actor.call()`, `task.call()`)
- kept `long` (log operations, streaming, data-heavy ops)

Test plan