feat(throttle transform): Multi-threshold rate limiting with dropped output port#24702
Open
szibis wants to merge 13 commits into vectordotdev:master from
Conversation
cadda68 to 7fde598
…output port

Add multi-dimensional rate limiting to the throttle transform with independent thresholds for event count, estimated JSON byte size, and custom VRL token expressions. Events are dropped when any configured threshold is exceeded.

New capabilities:

- `threshold.events` — maximum events per window (backward compatible with `threshold: N`)
- `threshold.json_bytes` — estimated JSON byte size via EstimatedJsonEncodedSizeOf
- `threshold.tokens` — VRL expression for custom cost (e.g. `strlen(string!(.message))`)
- `reroute_dropped` — routes throttled events to a named `.dropped` output port
- Per-key per-threshold observability metrics (opt-in via `emit_detailed_metrics`)

The legacy `threshold: <number>` syntax remains fully backward compatible.

Closes vectordotdev#11854
7fde598 to 10eeee3
hey @szibis thank you for the PR! i've made an editorial review card for a docs team member to take a look: https://datadoghq.atlassian.net/browse/DOCS-13474
urseberry (Contributor) approved these changes on Feb 20, 2026 and left a comment:
Left non-blocking suggestions to replace "e.g." with "for example" per the Datadog public documentation guidelines.
Co-authored-by: Ursula Chen <58821586+urseberry@users.noreply.github.com>
fb8400a to 8f67480
Add overflow guards for VRL token cost (i64 → u32) and json_bytes (usize → u32) conversions. Values exceeding u32::MAX are now clamped instead of silently truncating.
When check_thresholds short-circuits (e.g., events limiter denies), subsequent limiters never consume tokens from the governor. Update utilization tracking to only count consumption for limiters that were actually checked, preventing drift between reported utilization and actual governor bucket state. Also clarify that tokens_threshold intentionally uses json_bytes as its budget, with a comment explaining the coupling.
- Warn when event cost exceeds governor burst capacity (check_key_n)
- Defer VRL evaluate_tokens until after the events limiter passes, avoiding expensive event.clone() on already-rejected events
- Sample gauge emissions every 100 events instead of per-event (gauges overwrite, so less frequent emission is equivalent)
- Bound utilization HashMap to 10K keys to prevent unbounded memory growth from high-cardinality key fields
- Reduce String allocations: avoid HashMap key clone when entry exists; allocate key_str once in process() for all metric emissions
- Inline threshold checking into process() to enable early exits
Summary
Add multi-dimensional rate limiting to the `throttle` transform with independent thresholds for event count, estimated JSON byte size, and custom VRL token expressions. Events are dropped when any configured threshold is exceeded.

Target release: 0.54.0
- `threshold.events` — maximum events per window (backward compatible with `threshold: N`)
- `threshold.json_bytes` — estimated JSON byte size via `EstimatedJsonEncodedSizeOf` (zero serialization overhead)
- `threshold.tokens` — VRL expression evaluated per event for custom cost (e.g. `strlen(string!(.message))`)
- `reroute_dropped` — routes throttled events to a named `.dropped` output port for dead-letter routing
- `internal_metrics.emit_detailed_metrics` with bounded-cardinality defaults

Motivation
The current `throttle` transform only rate limits by event count. This falls short in real-world scenarios.

Architecture
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'lineColor': '#000000', 'primaryTextColor': '#000000'}}}%%
flowchart LR
  subgraph INPUT
    E[Event]
  end
  subgraph THROTTLE["Throttle Transform"]
    direction TB
    EX{Exclude?}
    EL[Events Limiter<br/>GCRA]
    BL[Bytes Limiter<br/>GCRA]
    TL[Tokens Limiter<br/>VRL + GCRA]
    CHK{Any exceeded?}
  end
  subgraph OUTPUT
    P[Primary Output]
    D[Dropped Output<br/>reroute_dropped]
  end
  E --> EX
  EX -->|excluded| P
  EX -->|check| EL
  EL --> BL
  BL --> TL
  TL --> CHK
  CHK -->|all pass| P
  CHK -->|any exceeded| D
  style EX fill:#ffffff,stroke:#000000,stroke-width:2px
  style EL fill:#e8e8e8,stroke:#000000,stroke-width:2px
  style BL fill:#e8e8e8,stroke:#000000,stroke-width:2px
  style TL fill:#cccccc,stroke:#000000,stroke-width:3px
  style CHK fill:#ffffff,stroke:#000000,stroke-width:2px
  style P fill:#ffffff,stroke:#000000,stroke-width:2px
  style D fill:#dddddd,stroke:#000000,stroke-width:2px
```

Full backward compatibility
The `threshold` field uses `#[serde(untagged)]` enum deserialization to accept both the old integer syntax and the new object syntax. This means:
- `threshold: 100` deserializes identically to `threshold: { events: 100 }`
- `component_discarded_events_total` is still emitted; `throttle_threshold_discarded_total` is added (bounded, 1 series for events-only configs)
- `emit_events_discarded_per_key` keeps its existing behavior
- `json_bytes`, `tokens`, `reroute_dropped`, and `emit_detailed_metrics` are new and optional

Zero migration needed. Every existing `throttle` config continues to work without changes. The only observable difference is one new bounded metric (`throttle_threshold_discarded_total{threshold_type="events"}`), which is limited to 1 series for existing configs.

All new features are purely additive and default to disabled.
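As a sketch, the two accepted syntaxes side by side (YAML shape inferred from the field names in this PR; the final docs may differ):

```yaml
transforms:
  throttle_legacy:           # old integer syntax, unchanged
    type: throttle
    inputs: ["in"]
    window_secs: 1
    threshold: 100

  throttle_object:           # new object syntax, deserializes identically
    type: throttle
    inputs: ["in"]
    window_secs: 1
    threshold:
      events: 100
```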
Combined event rate + byte throughput limiting
A single `threshold` block can enforce both an event rate cap and a byte throughput cap simultaneously. Each type runs its own independent GCRA limiter. An event is dropped the moment any limiter is exceeded. This covers two distinct failure modes in a single transform:
Neither threshold alone is sufficient. A service could bypass a byte-only limit by sending millions of tiny events, or bypass an event-only limit by sending a few massive events. With both enforced, both attack vectors are covered.
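A sketch of such a combined config, using the field names described in this PR (exact YAML shape may differ):

```yaml
transforms:
  throttle_combined:
    type: throttle
    inputs: ["in"]
    window_secs: 1
    threshold:
      events: 1000           # caps floods of tiny events
      json_bytes: 3000000    # caps a few massive events (estimated JSON bytes)
```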
A third dimension — a custom VRL token cost — can be added in the same definition:
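A hedged sketch of all three thresholds together (whether `tokens` takes a plain VRL string or a nested object is an assumption here; the values are illustrative):

```yaml
transforms:
  throttle_all:
    type: throttle
    inputs: ["in"]
    key_field: "{{ service }}"
    window_secs: 1
    threshold:
      events: 1000
      json_bytes: 3000000
      tokens: "strlen(string!(.message))"   # per-event VRL cost expression
```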
All three in one definition, one transform, one `key_field`, with three independent limiters checked per event. Performance overhead for events+bytes combined is +71% vs the events-only baseline, still processing ~2.80M events/sec.

Configuration examples
Old syntax (still works)
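For example (sketch; this is the existing throttle syntax, unchanged):

```yaml
transforms:
  throttle:
    type: throttle
    inputs: ["in"]
    window_secs: 1
    threshold: 100    # integer syntax, fully preserved
```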
Multi-threshold with per-tenant keys
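A sketch of a per-tenant multi-threshold config (field names from this PR; `tenant_id` is a hypothetical event field):

```yaml
transforms:
  throttle_per_tenant:
    type: throttle
    inputs: ["in"]
    key_field: "{{ tenant_id }}"   # independent limiters per tenant
    window_secs: 1
    threshold:
      events: 500
      json_bytes: 1048576          # ~1 MiB per tenant per window
```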
Dropped output port (dead-letter routing)
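A sketch of wiring the dropped port to a sink; the `<transform_id>.dropped` input naming follows the convention this PR describes (same pattern as `remap`), and the sink details are illustrative:

```yaml
transforms:
  throttle:
    type: throttle
    inputs: ["in"]
    window_secs: 1
    threshold:
      events: 100
    reroute_dropped: true

sinks:
  dead_letter:
    type: file
    inputs: ["throttle.dropped"]   # consumes the named dropped port
    path: "/var/lib/vector/throttled.log"
    encoding:
      codec: json
```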
Key design decisions
SyncTransform (not TaskTransform)
The original throttle uses `TaskTransform` (async `Stream`). This PR rewrites it to `SyncTransform` because:

- named outputs via `TransformOutputsBuf` (required for `reroute_dropped`)
- same pattern as `remap` with its dropped port
- the `DynClone` requirement is solved via a `ThrottleSyncTransform` wrapper with lazy state initialization

Separate GCRA limiter per threshold type
Each threshold type gets its own independent governor `RateLimiter`. An event is dropped when any limiter is exceeded. Governor's `check_key_n()` consumes N tokens atomically, so byte-cost and token-cost events interact correctly with the GCRA algorithm.

EstimatedJsonEncodedSizeOf reuse
For `json_bytes`, we reuse Vector's existing `EstimatedJsonEncodedSizeOf` trait (already implemented for `Event`, `LogEvent`, and all `Value` types, with quickcheck tests). Zero allocation, zero serialization — just arithmetic over the in-memory value tree.

Deferred key string allocation
`key_str` is only materialized (`to_owned()`) when a metric actually needs to be emitted. On the happy path (events pass through, no metrics enabled), no `String` allocation occurs per event. This produced a measurable 5-7% throughput improvement.

Metrics: three-tier cardinality control
Tier 1: Always emitted (bounded cardinality — max 4 series total)
These are emitted regardless of configuration. Safe for any deployment.
| Metric | Tags |
| --- | --- |
| `component_discarded_events_total` | `component_id`, `intentional=true` |
| `throttle_threshold_discarded_total` | `threshold_type` (`events`, `json_bytes`, or `tokens`) |

Tier 2: Legacy opt-in (`emit_events_discarded_per_key: true`)

Backward compatible with existing behavior. Cardinality = O(unique keys).
| Metric | Tags |
| --- | --- |
| `events_discarded_total` | `key` |

Tier 3: New detailed metrics (`emit_detailed_metrics: true`)

Full per-key per-threshold observability. Cardinality = O(keys × threshold_types).
| Metric | Tags |
| --- | --- |
| `throttle_events_discarded_total` | `key`, `threshold_type` |
| `throttle_events_processed_total` | `key` |
| `throttle_bytes_processed_total` | `key` |
| `throttle_tokens_processed_total` | `key` |
| `throttle_utilization_ratio` | `key`, `threshold_type` |

Metrics impact by configuration
| Configuration | Metrics emitted |
| --- | --- |
| Both flags `false` (default) | `component_discarded_events_total` + `throttle_threshold_discarded_total{threshold_type}` |
| `emit_events_discarded_per_key: true` only | adds `events_discarded_total{key}` |
| `emit_detailed_metrics: true` only | adds the Tier 3 metrics |
| Both `true` | all of the above |

Both flags default to `false`, so out of the box the transform emits only 4 bounded-cardinality metric series with zero overhead, regardless of how many unique keys exist.
All benchmarks: Criterion, 200 samples, 30s measurement, 5s warmup, 100K resamples, 1024 events/iteration.
A. Throughput by threshold type (no metrics)
- `events_only/under_limit` — `threshold: N` configs take this path; 25% faster than the initial SyncTransform (hot-path optimizations: inlined threshold checks, deferred VRL eval, sampled gauge emission)
- `json_bytes_only` — `EstimatedJsonEncodedSizeOf` per event (~109ns/event); no allocation, just arithmetic over the in-memory value tree
- `events_and_bytes`
- `vrl_tokens` — `Runtime::resolve()` dominates (~328ns/event); expected for interpreted eval
- `all_three_thresholds`
- `events_only/over_limit` — `component_discarded_events_total` (mandatory) + debug log
- `with_dropped_port` — routes to the `.dropped` output
- `high_cardinality_keys` — 100 unique `key_field` values

B. Metrics overhead (100 keys, events-only threshold)
- `metrics_both_off`
- `metrics_legacy_only` — `events_discarded_total{key}`; negligible
- `metrics_detailed_only`
- `metrics_both_on`
- `metrics_detailed_high_cardinality` (10K keys)
- `metrics_detailed_all_thresholds`

C. Key cardinality scaling (no metrics)
Measured for `events_only`, `events+bytes`, and `all_three`. Scaling is sublinear — 100× more keys only causes a 1.25-1.45× slowdown (DashMap O(1) amortized lookup).
D. Memory footprint per key
Even 10K tenants × 3 thresholds + detailed metrics uses under 5 MB — negligible vs Vector's baseline RSS (50-100 MB).
Impact assessment
What existing users get (zero config changes needed)
- `threshold: 100` continues to work unchanged
- `component_discarded_events_total` still emitted
- `throttle_threshold_discarded_total` added (bounded, max 3 series)

What new users can opt into
- `threshold.json_bytes`
- `threshold.tokens`
- `reroute_dropped`
- `emit_detailed_metrics`

All new features are additive and opt-in. No existing behavior changes.
Real-world usage scenarios
Each feature below is opt-in and independent. Pick only what you need — overhead is only paid for features you enable.
`threshold.json_bytes` — byte-aware rate limiting (+52% overhead)

Problem: Event-count throttling treats a 50-byte healthcheck and a 100 KB stack trace identically. Downstream services don't.
Scenario 1: Loki per-stream byte rate limits
Loki enforces a default `per_stream_rate_limit` of 3 MB/s. When a service emits a burst of large log events, event-count throttling won't prevent 429 rejections: 100 events of 100 KB each is 10 MB, far exceeding the 3 MB limit, even if you set `threshold: 100`. A byte threshold catches the burst before it reaches Loki, avoiding 429 cascades and the retry storms that follow.
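A sketch of such a config (the `stream` key field and sink details are illustrative, not from the PR; the 3 MB figure mirrors Loki's default per-stream limit mentioned above):

```yaml
transforms:
  throttle_loki:
    type: throttle
    inputs: ["app_logs"]
    key_field: "{{ stream }}"   # hypothetical field matching your Loki stream label
    window_secs: 1
    threshold:
      json_bytes: 3000000       # stay under Loki's 3 MB/s per-stream limit

sinks:
  loki:
    type: loki
    inputs: ["throttle_loki"]
    endpoint: "http://loki:3100"
    labels:
      stream: "{{ stream }}"
    encoding:
      codec: json
```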
Why +52% is worth it: The overhead comes from `EstimatedJsonEncodedSizeOf` — a fast recursive walk over the in-memory event value tree (~109ns/event, no serialization, no allocation). The percentage is relative to the highly optimized events-only baseline (214µs for 1024 events); in absolute terms, json_bytes still processes 3.14M events/sec. For any pipeline where downstream charges by bytes or enforces byte limits, this prevents far more expensive outcomes: 429 retry storms, Loki stream lockouts, or unexpected cloud billing spikes.

Scenario 2: Edge/IoT bandwidth-aware throttling
On edge devices with limited uplink (e.g., 1 Mbps satellite link), you need to throttle by actual payload size, not event count. A heartbeat event (50 bytes) and a firmware diagnostic dump (500 KB) should not be treated equally:
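A sketch for the uplink case (the budget value is illustrative arithmetic: 1 Mbps is 1,000,000 bits/s, or 125,000 bytes/s):

```yaml
transforms:
  throttle_uplink:
    type: throttle
    inputs: ["device_events"]
    window_secs: 1
    threshold:
      json_bytes: 125000   # ~1 Mbps uplink budget in bytes per window
```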
`reroute_dropped` — dead-letter routing (+3% on drop path only)

Problem: Today, throttled events are silently discarded. You have no way to replay them, audit what was lost, or route them to cheaper storage.
When `reroute_dropped: true` is set, throttled events are sent to a named `.dropped` output port instead of being discarded. The +3% overhead only applies to events that are actually being dropped (the happy path — events passing through — has zero overhead from this flag).

Scenario 1: Dead-letter queue for replay
Route throttled events to a file or S3 sink for later replay during off-peak hours:
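A sketch of the S3 replay buffer (bucket name and prefix are hypothetical):

```yaml
transforms:
  throttle:
    type: throttle
    inputs: ["in"]
    window_secs: 1
    threshold:
      json_bytes: 3000000
    reroute_dropped: true

sinks:
  replay_buffer:
    type: aws_s3
    inputs: ["throttle.dropped"]   # only throttled events land here
    bucket: "my-dropped-events"    # hypothetical bucket
    key_prefix: "throttled/%F/"
    encoding:
      codec: json
```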
During off-peak windows, replay from S3 back through Vector to recover dropped data — zero data loss, guaranteed byte-rate compliance.
Scenario 2: Audit trail for compliance
In regulated environments, you may need to prove what was dropped and why. Route dropped events to a local file with metadata:
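A sketch of the audit sink (path and date templating are illustrative):

```yaml
sinks:
  audit_trail:
    type: file
    inputs: ["throttle.dropped"]
    path: "/var/log/vector/throttled-%Y-%m-%d.log"
    encoding:
      codec: json   # events are preserved verbatim as JSON lines
```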
Every dropped event is preserved verbatim (byte-identical to the input — the throttle transform never modifies events). Compliance teams can verify exactly what was rate-limited.
Scenario 3: Overflow to cheaper storage tier
Route excess traffic to a cheaper destination instead of dropping entirely:
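A sketch of the tiering pattern (sink types and endpoints are illustrative; any pair of sinks works):

```yaml
sinks:
  primary:
    type: elasticsearch            # expensive hot tier
    inputs: ["throttle"]
    endpoints: ["http://es:9200"]

  overflow:
    type: aws_s3                   # cheap cold tier for throttled excess
    inputs: ["throttle.dropped"]
    bucket: "logs-overflow"        # hypothetical
    encoding:
      codec: json
```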
Why +3% is negligible: The overhead only fires on the drop path, and it's just routing an event to a second output buffer. Compared to the value of not losing data, this is effectively free.
`emit_detailed_metrics` — per-tenant observability (+75% overhead)

Problem: Without per-key metrics, you know that something is being throttled but not which tenant, which threshold, or how close other tenants are to their limits.
When `emit_detailed_metrics: true` is set, the transform emits:

- `throttle_events_discarded_total{key, threshold_type}` — which tenant hit which limit
- `throttle_events_processed_total{key}` — total events per tenant (passed + dropped)
- `throttle_bytes_processed_total{key}` — total byte volume per tenant
- `throttle_tokens_processed_total{key}` — total custom token cost per tenant
- `throttle_utilization_ratio{key, threshold_type}` — current usage / threshold (0.0-1.0+)

Scenario 1: Per-tenant dashboard for multi-tenant SaaS
You run a multi-tenant platform where each service has a logging quota. With detailed metrics piped to Prometheus + Grafana:
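A sketch of the wiring (the `internal_metrics` source and `prometheus_exporter` sink are standard Vector components; the `service` key field is illustrative):

```yaml
sources:
  vector_metrics:
    type: internal_metrics        # exposes Vector's own metrics as events

transforms:
  throttle_per_service:
    type: throttle
    inputs: ["in"]
    key_field: "{{ service }}"
    window_secs: 1
    threshold:
      events: 1000
    internal_metrics:
      emit_detailed_metrics: true

sinks:
  prometheus:
    type: prometheus_exporter
    inputs: ["vector_metrics"]
    address: "0.0.0.0:9598"
```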
Now you can build Grafana panels showing per-service:
- event rate vs quota (`throttle_events_processed_total` / threshold)
- byte volume (`throttle_bytes_processed_total`)
- drop counts (`throttle_events_discarded_total`)
- live utilization (`throttle_utilization_ratio`)

Scenario 2: Proactive alerting at 80% utilization
The `throttle_utilization_ratio` gauge lets you alert before throttling kicks in. Operators get advance warning to contact tenants, adjust quotas, or investigate runaway services — instead of finding out after data is already being dropped.
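A Prometheus alerting rule along these lines would fire at 80% utilization (rule and alert names are illustrative):

```yaml
groups:
  - name: vector-throttle
    rules:
      - alert: TenantNearThrottleQuota
        expr: throttle_utilization_ratio > 0.8   # 80% of any threshold
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Key {{ $labels.key }} is above 80% of its {{ $labels.threshold_type }} quota"
```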
Scenario 3: Cost attribution and chargeback
For platforms that bill tenants for logging usage, `throttle_bytes_processed_total{key}` provides per-tenant byte volume that maps directly to cloud logging cost.

Why +75% can be worth it: The overhead comes from updating 3-6 metric counters per event (each with a `key` tag requiring hash lookups in the metrics registry). This is significant, but:

- it replaces an external `log_to_metric` + `aggregate` chain that costs more than 75%
- cardinality is bounded by your `key_field` — if you have 50 services, that's ~300 metric series (50 × 6 metrics); if you have 10K keys, consider whether you actually need per-key visibility for all of them, or if you can use a higher-level grouping

When NOT to enable `emit_detailed_metrics`:

- `key_field` produces <500 unique values
- `key_field` produces 500-10K values
- `key_field` produces >10K values or is unbounded (e.g., user IDs) — use `throttle_threshold_discarded_total{threshold_type}` for aggregate visibility instead
- no `key_field` configured
Each feature adds its overhead independently. Here's the combined cost for common configurations:
- `threshold: 100` (existing)
- `json_bytes` only
- `json_bytes` + `reroute_dropped`
- `json_bytes` + `reroute_dropped` + `emit_detailed_metrics`
- `events` + `json_bytes` + `tokens` (all thresholds, no metrics)

The typical production config — `json_bytes` with `reroute_dropped` — adds ~52-55% relative overhead vs the optimized events-only baseline, while still processing 2.80M+ events/sec and solving real problems that event-count throttling cannot address.

How did you test this PR?
Unit Tests (22 tests)
Run with `cargo test -p vector --lib --features transforms-throttle -- transforms::throttle`. Tests cover: backward compat, all threshold types, dropped port routing, key independence, exclude condition bypass, VRL expression errors (defaults to cost 1), data integrity (events unmodified through throttle), completeness (no events lost/duplicated), metrics emission with correct tags, utilization tracking across windows, key cardinality scaling (10/100/1K keys), memory footprint measurement.
Integration Tests (7 tests)
Run with `cargo test --test integration --features throttle-integration-tests -- throttle`. These exercise a real `vector` binary via `assert_cmd`: config validation, stdin→stdout event flow, dropped port routing to a separate output, multi-threshold with `key_field`, backward compat simple threshold, exclude bypasses limit, data integrity verification.

Benchmarks (23 benchmarks)
Three groups: throughput (8), metrics overhead (6), key cardinality scaling (9). Criterion with 200 samples, 30s measurement, statistical significance testing.
E2E Tests
Run with `cargo vdev e2e test throttle-transform`. Docker Compose with 3 config variants (events-only, bytes, multi-threshold).
Static Analysis
`cargo clippy -p vector --features transforms-throttle -- -D warnings` runs clean.

Change Type
Is this a breaking change?
The legacy `threshold: <number>` syntax is fully preserved. The only observable change for existing configs is one new bounded metric (`throttle_threshold_discarded_total`).

Does this PR include user facing changes?
Yes. A changelog entry is added at `changelog.d/11854_throttle_multi_threshold.feature.md`.

References
Closes #11854