Skip to content

feat: Update default timeout tiers on non-storage resource clients#664

Merged
vdusek merged 2 commits intomasterfrom
feat/update-default-timeout-tiers
Mar 11, 2026
Merged

feat: Update default timeout tiers on non-storage resource clients#664
vdusek merged 2 commits intomasterfrom
feat/update-default-timeout-tiers

Conversation

@vdusek
Copy link
Contributor

@vdusek vdusek commented Mar 10, 2026

Goal

Update default timeout tiers on individual resource client methods.

Tier philosophy

Tier Value Use for
short (5s) Simple metadata CRUD get/update/delete single resources, quick checks
medium (30s) Moderate operations list with pagination, create, batch ops, triggering runs/builds
long (360s) Heavy operations downloading/streaming large data, large payloads
no_timeout Indefinite Polling/waiting (wait_for_finish, call, streaming logs)

Current state

Only storage clients (Dataset, KVS, RQ) were tiered in PR #653. Everything else defaults to 'long' — meaning even a simple actor.get() has a 360s timeout.

Changes description

ActorClient

Method Current Recommended Rationale
get() long short Simple metadata fetch
update() long short Simple metadata update
delete() long short Simple delete
start() long medium Triggers a run, returns run info (some server processing)
call() long no_timeout Starts run AND waits for finish — can take minutes or even much more
build() long medium Triggers a build, returns quickly
default_build() long short Simple metadata fetch
last_run() long short Simple metadata fetch
validate_input() long short Quick validation call

ActorCollectionClient

Method Current Recommended Rationale
list() long medium Paginated list, could be large
create() long medium Creates resource, some processing

ActorVersionClient

Method Current Recommended Rationale
get() long short Simple metadata
update() long short Simple metadata
delete() long short Simple delete

ActorVersionCollectionClient

Method Current Recommended Rationale
list() long short Typically small lists
create() long short Simple create

ActorEnvVarClient

Method Current Recommended Rationale
get() long short Simple metadata
update() long short Simple metadata
delete() long short Simple delete

ActorEnvVarCollectionClient

Method Current Recommended Rationale
list() long short Small lists
create() long short Simple create

BuildClient

Method Current Recommended Rationale
get() long short Simple metadata
delete() long short Simple delete
abort() long short Sends abort signal, returns immediately
get_open_api_definition() long medium Could be a larger JSON payload
wait_for_finish() no_timeout no_timeout Already correct

BuildCollectionClient

Method Current Recommended Rationale
list() long medium Paginated list

RunClient

Method Current Recommended Rationale
get() long short Simple metadata
update() long short Simple metadata
delete() long short Simple delete
abort() long short Sends abort signal
wait_for_finish() no_timeout no_timeout Already correct
metamorph() long medium Triggers transformation
resurrect() long medium Restarts a run
reboot() long medium Reboots a run
get_streamed_log() long long Streaming, keep long
charge() long short Quick API call
get_status_message_watcher() long long Long-lived watcher

RunCollectionClient

Method Current Recommended Rationale
list() long medium Paginated list

TaskClient

Method Current Recommended Rationale
get() long short Simple metadata
update() long short Simple metadata
delete() long short Simple delete
start() long medium Triggers a run
call() long no_timeout Starts and waits for completion
get_input() long short Simple metadata
update_input() long short Simple metadata update
last_run() long short Simple metadata

TaskCollectionClient

Method Current Recommended Rationale
list() long medium Paginated list
create() long medium Creates resource

ScheduleClient

Method Current Recommended Rationale
get() long short Simple metadata
update() long short Simple metadata
delete() long short Simple delete
get_log() long medium Returns list of schedule invocations

ScheduleCollectionClient

Method Current Recommended Rationale
list() long medium Paginated list
create() long short Simple create

WebhookClient

Method Current Recommended Rationale
get() long short Simple metadata
update() long short Simple metadata
delete() long short Simple delete
test() long medium Triggers test dispatch

WebhookCollectionClient

Method Current Recommended Rationale
list() long medium Paginated list
create() long short Simple create

WebhookDispatchClient

Method Current Recommended Rationale
get() long short Simple metadata

WebhookDispatchCollectionClient

Method Current Recommended Rationale
list() long medium Paginated list

StoreCollectionClient

Method Current Recommended Rationale
list() long medium Paginated list

LogClient

Method Current Recommended Rationale
get() long long This is not metadata call. Logs can be large — keep as is
get_as_bytes() long long Same
stream() long long Streaming — long is reasonable

UserClient

Method Current Recommended Rationale
get() long short Simple metadata
monthly_usage() long short Simple metadata
limits() long short Simple metadata
update_limits() long short Simple metadata update

Storage Collection Clients (currently all long)

Client.Method Current Recommended Rationale
DatasetCollectionClient.list() long medium Paginated list
DatasetCollectionClient.get_or_create() long short Quick idempotent op
KeyValueStoreCollectionClient.list() long medium Paginated list
KeyValueStoreCollectionClient.get_or_create() long short Quick idempotent op
RequestQueueCollectionClient.list() long medium Paginated list
RequestQueueCollectionClient.get_or_create() long short Quick idempotent op

Summary of Changes

  • ~60 methods longshort (simple CRUD on non-storage resources)
  • ~20 methods longmedium (list operations, creates, trigger operations)
  • 2 methods longno_timeout (actor.call(), task.call())
  • ~10 methods stay long (log operations, streaming, data-heavy ops)
  • Storage clients (dataset, kvs, request_queue) already well-tiered — only their collection clients need updates

Test plan

  • CI passes

Previously only storage clients (dataset, key_value_store, request_queue)
had per-endpoint timeout tiers. All other resource clients defaulted to
'long' (360s) for every method, even simple metadata CRUD operations.

This assigns appropriate tiers based on operation complexity:
- short (5s): simple get/update/delete on actors, tasks, schedules,
  webhooks, users, env vars, versions, and storage collection get_or_create
- medium (30s): list operations, create operations, triggering runs/builds
  (start, build, metamorph, resurrect, reboot), schedule get_log, webhook test
- no_timeout: actor.call() and task.call() which wait indefinitely for
  run completion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vdusek vdusek added adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. labels Mar 10, 2026
@vdusek vdusek self-assigned this Mar 10, 2026
@codecov
Copy link

codecov bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.90%. Comparing base (723ec6e) to head (c7ec97e).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #664      +/-   ##
==========================================
+ Coverage   94.83%   94.90%   +0.06%     
==========================================
  Files          45       45              
  Lines        4340     4340              
==========================================
+ Hits         4116     4119       +3     
+ Misses        224      221       -3     
Flag Coverage Δ
integration 94.90% <100.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@vdusek vdusek requested review from Pijukatel and janbuchar March 10, 2026 09:46
@vdusek vdusek changed the title feat: update default timeout tiers on non-storage resource clients fix: update default timeout tiers on non-storage resource clients Mar 10, 2026
@vdusek vdusek changed the title fix: update default timeout tiers on non-storage resource clients refactor!: update default timeout tiers on non-storage resource clients Mar 10, 2026
Copy link
Contributor

@Pijukatel Pijukatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone who is well aware of API implementation-specifics should review this. @tobice can you please do it or delegate it to someone?

@vdusek vdusek requested a review from tobice March 10, 2026 12:02
@janbuchar janbuchar removed their request for review March 10, 2026 12:14
@janbuchar
Copy link
Contributor

Seems like a legit change, but I don't think I can provide any useful feedback beyond that of the other reviewers

@vdusek vdusek requested a review from Pijukatel March 10, 2026 12:32
@github-actions github-actions bot added this to the 136th sprint - Tooling team milestone Mar 10, 2026
Copy link
Contributor

@tobice tobice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I skimmed the PR description — the tiering approach makes sense to me. So if you applied it consistently across the endpoints, I see no problem in that (and don't feel like reviewing them one by one — unless you really want me to 😁 ).

If you want the timeouts to be really accurate, here is an idea: You could try pulling actual request duration from Grafana for each endpoint and set the timeout based on that.

Btw just looking at it and the 99th percentile is usually ~250 ms. So even the shortest timeout is pretty comfortable.

@vdusek vdusek changed the title refactor!: update default timeout tiers on non-storage resource clients feat: Update default timeout tiers on non-storage resource clients Mar 11, 2026
@vdusek vdusek merged commit 0b35bbe into master Mar 11, 2026
26 checks passed
@vdusek vdusek deleted the feat/update-default-timeout-tiers branch March 11, 2026 09:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants