
feat(throttle transform): Multi-threshold rate limiting with dropped output port #24702

Open
szibis wants to merge 13 commits into vectordotdev:master from szibis:feat/throttle-multi-threshold

Conversation


@szibis szibis commented Feb 20, 2026

Summary

Add multi-dimensional rate limiting to the throttle transform with independent thresholds for event count, estimated JSON byte size, and custom VRL token expressions. Events are dropped when any configured threshold is exceeded.

Target release: 0.54.0

  • threshold.events — maximum events per window (backward compat with threshold: N)
  • threshold.json_bytes — estimated JSON byte size via EstimatedJsonEncodedSizeOf (zero serialization overhead)
  • threshold.tokens — VRL expression evaluated per event for custom cost (e.g. strlen(string!(.message)))
  • reroute_dropped — routes throttled events to a named .dropped output port for dead-letter routing
  • Per-key per-threshold observability metrics — opt-in via internal_metrics.emit_detailed_metrics with bounded-cardinality defaults

Motivation

The current throttle transform rate-limits by event count only. This falls short in real-world scenarios:

  • Loki sink users hit per-stream 3 MB byte rate limits, causing 429 cascades that event-count throttling cannot prevent
  • Cloud logging services (Datadog, CloudWatch, BigQuery) charge by ingested bytes, not events — a 100-byte log and a 100 KB log cost very differently
  • Edge/IoT deployments need bandwidth-aware throttling where network capacity is the constraint, not event rate
  • Multi-tenant platforms need per-service throttle visibility to understand which tenants consume quota and why

Architecture

%%{init: {'theme': 'base', 'themeVariables': { 'lineColor': '#000000', 'primaryTextColor': '#000000'}}}%%
flowchart LR
    subgraph INPUT
        E[Event]
    end

    subgraph THROTTLE["Throttle Transform"]
        direction TB
        EX{Exclude?}
        EL[Events Limiter<br/>GCRA]
        BL[Bytes Limiter<br/>GCRA]
        TL[Tokens Limiter<br/>VRL + GCRA]
        CHK{Any exceeded?}
    end

    subgraph OUTPUT
        P[Primary Output]
        D[Dropped Output<br/>reroute_dropped]
    end

    E --> EX
    EX -->|excluded| P
    EX -->|check| EL
    EL --> BL
    BL --> TL
    TL --> CHK
    CHK -->|all pass| P
    CHK -->|any exceeded| D

    style EX fill:#ffffff,stroke:#000000,stroke-width:2px
    style EL fill:#e8e8e8,stroke:#000000,stroke-width:2px
    style BL fill:#e8e8e8,stroke:#000000,stroke-width:2px
    style TL fill:#cccccc,stroke:#000000,stroke-width:3px
    style CHK fill:#ffffff,stroke:#000000,stroke-width:2px
    style P fill:#ffffff,stroke:#000000,stroke-width:2px
    style D fill:#dddddd,stroke:#000000,stroke-width:2px

Full backward compatibility

The threshold field uses #[serde(untagged)] enum deserialization to accept both the old integer syntax and the new object syntax:

#[serde(untagged)]
pub enum ThresholdConfig {
    Simple(u32),                    // threshold: 100
    Multi(MultiThresholdConfig),    // threshold: { events: 100, json_bytes: 500000 }
}

This means:

| Aspect | Old config | New config | Behavior |
| --- | --- | --- | --- |
| Config syntax | threshold: 100 | threshold: { events: 100 } | Both parse correctly |
| Rate limiting | 1 GCRA limiter (events) | Same path when only events configured | Identical |
| Metrics emitted | component_discarded_events_total | Same + throttle_threshold_discarded_total (bounded, 1 series for events-only) | Additive only |
| Performance | TaskTransform baseline | SyncTransform + hot-path optimizations (214µs) | 25% faster (Criterion-verified) |
| Memory | Governor DashMap | Same DashMap, same entries | No change |
| emit_events_discarded_per_key | Existing opt-in | Still works identically | No change |
| New fields | N/A | json_bytes, tokens, reroute_dropped, emit_detailed_metrics | All default to off/absent |

Zero migration needed. Every existing throttle config continues to work without changes. The only observable difference is:

  1. Slightly faster processing (SyncTransform vs TaskTransform)
  2. One new always-on metric (throttle_threshold_discarded_total{threshold_type="events"}) which is bounded to 1 series for existing configs

All new features are purely additive and default to disabled.


Combined event rate + byte throughput limiting

A single threshold block can enforce both an event rate cap and a byte throughput cap simultaneously. Each type runs its own independent GCRA limiter. An event is dropped the moment any limiter is exceeded:

transforms:
  rate_limit:
    type: throttle
    inputs: ["source"]
    window_secs: 60
    key_field: "{{ service }}"
    threshold:
      events: 5000          # event rate limit
      json_bytes: 3000000   # byte throughput limit

This covers two distinct failure modes in a single transform:

  • Event rate cap — prevents log flooding (e.g., tight loop logging 100K events/sec of tiny messages that individually pass byte limits)
  • Byte throughput cap — prevents volume spikes (e.g., 500 events/sec but each carrying 100KB stack traces that individually pass event count limits)

Neither threshold alone is sufficient. A service could bypass a byte-only limit by sending millions of tiny events, or bypass an event-only limit by sending few massive events. With both enforced, both attack vectors are covered.
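The drop decision can be sketched with a simplified fixed-window counter. This is an illustrative model only — the actual transform uses GCRA limiters (via the governor crate), and the `Window` type and field names here are invented for the sketch:

```rust
// Simplified fixed-window sketch of the "drop when ANY threshold is
// exceeded" rule. The real transform uses GCRA, not plain counters.
struct Window {
    events_max: u64,
    bytes_max: u64,
    events_used: u64,
    bytes_used: u64,
}

impl Window {
    fn new(events_max: u64, bytes_max: u64) -> Self {
        Self { events_max, bytes_max, events_used: 0, bytes_used: 0 }
    }

    /// Returns true if the event passes; false means it is dropped
    /// (or rerouted to `.dropped` when reroute_dropped is enabled).
    fn admit(&mut self, event_bytes: u64) -> bool {
        if self.events_used + 1 > self.events_max
            || self.bytes_used + event_bytes > self.bytes_max
        {
            return false; // any exceeded threshold drops the event
        }
        self.events_used += 1;
        self.bytes_used += event_bytes;
        true
    }
}

fn main() {
    // events: 3, json_bytes: 1000 — neither limit alone is sufficient.
    let mut w = Window::new(3, 1000);
    assert!(w.admit(50));   // tiny event passes
    assert!(w.admit(900));  // large event passes
    assert!(!w.admit(200)); // would exceed byte budget -> dropped
    assert!(w.admit(10));   // still under both budgets
    assert!(!w.admit(10));  // 4th admitted event -> event cap hit
    println!("ok");
}
```

Note how the byte limit catches the few-but-massive events while the event cap catches the many-but-tiny flood, matching the two failure modes above.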

A third dimension — a custom VRL token cost — can be added in the same definition:

    threshold:
      events: 5000
      json_bytes: 3000000
      tokens: 'strlen(string!(.message))'   # custom cost function

All three in one definition, one transform, one key_field — three independent limiters checked per event. Performance overhead for events+bytes combined is +71% vs events-only baseline, still processing ~2.80M events/sec.
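The per-event token cost path can be sketched as follows. The `token_cost` helper is hypothetical (the real code lives inside the transform), but the two behaviors it models are stated in the PR: a failed VRL evaluation defaults the cost to 1, and out-of-range values are clamped to u32::MAX rather than silently truncated:

```rust
// Sketch of the per-event token cost: VRL evaluates an expression such
// as strlen(string!(.message)). A failed evaluation defaults to cost 1,
// and i64 -> u32 conversion clamps instead of truncating.
fn token_cost(vrl_result: Result<i64, String>) -> u32 {
    match vrl_result {
        // Clamp i64 -> u32 instead of silently truncating.
        Ok(n) if n >= 0 => u32::try_from(n).unwrap_or(u32::MAX),
        // Negative or failed evaluation: default cost of 1.
        _ => 1,
    }
}

fn main() {
    assert_eq!(token_cost(Ok(42)), 42);
    assert_eq!(token_cost(Ok(i64::MAX)), u32::MAX); // clamped, not truncated
    assert_eq!(token_cost(Err("runtime error".into())), 1);
    println!("ok");
}
```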


Configuration examples

Old syntax (still works)

[transforms.simple]
type = "throttle"
inputs = ["source"]
threshold = 100
window_secs = 60

Multi-threshold with per-tenant keys

transforms:
  per_tenant:
    type: throttle
    inputs: ["source"]
    window_secs: 60
    key_field: "{{ service }}"
    threshold:
      events: 1000
      json_bytes: 500000
      tokens: 'strlen(string!(.message))'
    exclude: '.level == "error"'
    reroute_dropped: true
    internal_metrics:
      emit_detailed_metrics: true

Dropped output port (dead-letter routing)

transforms:
  rate_limit:
    type: throttle
    inputs: ["source"]
    threshold:
      events: 500
    reroute_dropped: true

sinks:
  dead_letter:
    type: file
    inputs: ["rate_limit.dropped"]
    path: "/var/log/vector/throttled/%Y-%m-%d.log"
    encoding:
      codec: json

Key design decisions

SyncTransform (not TaskTransform)

The original throttle uses TaskTransform (async Stream). This PR rewrites to SyncTransform because:

  • Enables multi-output ports via TransformOutputsBuf (required for reroute_dropped)
  • Eliminates async state machine overhead (measurable in benchmarks)
  • Pattern matches other multi-output transforms in the codebase (e.g., remap with dropped port)
  • DynClone requirement solved via ThrottleSyncTransform wrapper with lazy state initialization

Separate GCRA limiter per threshold type

Each threshold type gets its own independent governor RateLimiter. An event is dropped when any limiter is exceeded. Governor's check_key_n() consumes N tokens atomically, so byte-cost and token-cost events interact correctly with the GCRA algorithm.
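The n-token check can be sketched with a token bucket, which is behaviorally equivalent to GCRA for this purpose. This is not governor's actual implementation — the `Bucket` type, its fields, and the explicit `now` parameter are simplifications for illustration:

```rust
// Token-bucket sketch of check_key_n semantics: a single call must
// consume all N tokens atomically, or consume nothing at all.
struct Bucket {
    capacity: f64, // burst size, e.g. the configured threshold
    tokens: f64,   // currently available
    rate: f64,     // refill per second = threshold / window_secs
    last: f64,     // last refill timestamp (seconds)
}

impl Bucket {
    fn new(threshold: f64, window_secs: f64) -> Self {
        Self {
            capacity: threshold,
            tokens: threshold,
            rate: threshold / window_secs,
            last: 0.0,
        }
    }

    /// check_n: atomically consume `n` tokens at time `now`, or fail.
    fn check_n(&mut self, now: f64, n: f64) -> bool {
        self.tokens = (self.tokens + (now - self.last) * self.rate).min(self.capacity);
        self.last = now;
        if self.tokens >= n {
            self.tokens -= n; // all-or-nothing consumption
            true
        } else {
            false
        }
    }
}

fn main() {
    // e.g. threshold.json_bytes: 1000 over a 10-second window.
    let mut b = Bucket::new(1000.0, 10.0);
    assert!(b.check_n(0.0, 600.0));  // 600-byte event passes
    assert!(!b.check_n(0.0, 600.0)); // only 400 tokens left -> denied
    assert!(b.check_n(0.0, 400.0));  // a 400-byte event still fits
    assert!(b.check_n(5.0, 500.0));  // 5s later: 500 tokens refilled
    println!("ok");
}
```

The all-or-nothing consumption is what makes byte-cost and token-cost events interact correctly: a denied large event does not partially drain the budget for subsequent small events.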

EstimatedJsonEncodedSizeOf reuse

For json_bytes, we reuse Vector's existing EstimatedJsonEncodedSizeOf trait (already implemented for Event, LogEvent, all Value types with quickcheck tests). Zero allocation, zero serialization — just arithmetic over the in-memory value tree.
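The idea behind the estimator can be sketched as arithmetic over a value tree. The `Value` enum below is illustrative, not Vector's actual event value type, and the sketch ignores string escaping:

```rust
use std::collections::BTreeMap;

// Sketch of estimated-JSON-size arithmetic over an in-memory value
// tree: summed lengths, no serialization. Illustrative `Value` type.
enum Value {
    Null,
    Bool(bool),
    Int(i64),
    Str(String),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

fn estimated_size(v: &Value) -> usize {
    match v {
        Value::Null => 4,                         // "null"
        Value::Bool(b) => if *b { 4 } else { 5 }, // "true" / "false"
        Value::Int(i) => i.to_string().len(),     // digit count (sketch only;
                                                  // a real estimator avoids this allocation)
        Value::Str(s) => s.len() + 2,             // quotes (escapes ignored)
        Value::Array(items) => {
            2 + items.iter().map(estimated_size).sum::<usize>()
              + items.len().saturating_sub(1)     // brackets + commas
        }
        Value::Object(map) => {
            2 + map.iter()
                   .map(|(k, v)| k.len() + 3 + estimated_size(v)) // "key":
                   .sum::<usize>()
              + map.len().saturating_sub(1)       // braces + commas
        }
    }
}

fn main() {
    let mut m = BTreeMap::new();
    m.insert("message".to_string(), Value::Str("hi".to_string()));
    m.insert("n".to_string(), Value::Int(42));
    // Compact JSON would be {"message":"hi","n":42} -> 23 bytes.
    assert_eq!(estimated_size(&Value::Object(m)), 23);
    println!("ok");
}
```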

Deferred key string allocation

key_str is only materialized (to_owned()) when a metric actually needs to be emitted. On the happy path (events pass through, no metrics enabled), no String allocation occurs per event. This produced a measurable 5-7% throughput improvement.
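The pattern can be sketched with a lazy wrapper; the `LazyKey` type and its allocation counter are invented for the sketch, not names from the PR:

```rust
// Sketch of deferring the key String allocation until a metric emission
// actually needs an owned copy. On the happy path `owned` stays None
// and no per-event allocation happens.
struct LazyKey<'a> {
    raw: &'a str,
    owned: Option<String>,
    allocations: usize, // instrumentation for this sketch only
}

impl<'a> LazyKey<'a> {
    fn new(raw: &'a str) -> Self {
        Self { raw, owned: None, allocations: 0 }
    }

    /// Materialize the owned String only on first use.
    fn materialize(&mut self) -> &str {
        if self.owned.is_none() {
            self.owned = Some(self.raw.to_owned());
            self.allocations += 1;
        }
        self.owned.as_deref().unwrap()
    }
}

fn main() {
    // Happy path: event passes, no per-key metrics -> zero allocations.
    let happy = LazyKey::new("service-a");
    assert_eq!(happy.allocations, 0);

    // Drop path with per-key metrics: allocate once, reuse for every
    // metric emitted for this event.
    let mut dropped = LazyKey::new("service-b");
    let _ = dropped.materialize(); // e.g. events_discarded_total{key}
    let _ = dropped.materialize(); // e.g. throttle_utilization_ratio{key}
    assert_eq!(dropped.allocations, 1);
    println!("ok");
}
```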


Metrics: three-tier cardinality control

Tier 1: Always emitted (bounded cardinality — max 4 series total)

These are emitted regardless of configuration. Safe for any deployment.

| Metric | Type | Tags | Max series |
| --- | --- | --- | --- |
| component_discarded_events_total | Counter | component_id, intentional=true | 1 |
| throttle_threshold_discarded_total | Counter | threshold_type (events, json_bytes, tokens) | 3 |

Tier 2: Legacy opt-in (emit_events_discarded_per_key: true)

Backward compatible with existing behavior. Cardinality = O(unique keys).

| Metric | Type | Tags | Cardinality |
| --- | --- | --- | --- |
| events_discarded_total | Counter | key | O(keys) |

Tier 3: New detailed metrics (emit_detailed_metrics: true)

Full per-key per-threshold observability. Cardinality = O(keys × threshold_types).

| Metric | Type | Tags | Description |
| --- | --- | --- | --- |
| throttle_events_discarded_total | Counter | key, threshold_type | Drops per key per threshold |
| throttle_events_processed_total | Counter | key | Total events per key (passed + dropped) |
| throttle_bytes_processed_total | Counter | key | Cumulative JSON bytes per key |
| throttle_tokens_processed_total | Counter | key | Cumulative VRL token cost per key |
| throttle_utilization_ratio | Gauge | key, threshold_type | Current usage/threshold ratio (alert at 0.8) |

Metrics impact by configuration

| Configuration | Metrics emitted | Series at 100 keys × 3 thresholds | Throughput impact |
| --- | --- | --- | --- |
| Both flags false (default) | component_discarded_events_total + throttle_threshold_discarded_total{threshold_type} | 4 | 0% (baseline) |
| emit_events_discarded_per_key: true only | Above + events_discarded_total{key} | 104 | +1.0% |
| emit_detailed_metrics: true only | Tier 1 + all Tier 3 metrics | ~804 | +75% |
| Both flags true | All tiers combined | ~904 | +77% |

Both flags default to false, so out-of-the-box the transform emits only 4 bounded-cardinality metric series with zero overhead, regardless of how many unique keys exist.


Performance impact

All benchmarks: Criterion, 200 samples, 30s measurement, 5s warmup, 100K resamples, 1024 events/iteration.

A. Throughput by threshold type (no metrics)

| Benchmark | Time (µs) | Throughput | vs events baseline | Analysis |
| --- | --- | --- | --- | --- |
| events_only/under_limit | 214 | 4.78M/s | baseline | Backward-compat baseline. Existing threshold: N configs take this path. 25% faster than initial SyncTransform (hot-path optimizations: inlined threshold checks, deferred VRL eval, sampled gauge emission). |
| json_bytes_only | 326 | 3.14M/s | +52% | EstimatedJsonEncodedSizeOf per event (~109ns/event). No allocation, just arithmetic over the in-memory value tree. |
| events_and_bytes | 366 | 2.80M/s | +71% | Two governor calls + byte estimation. Additive. |
| vrl_tokens | 550 | 1.86M/s | +157% | VRL Runtime::resolve() dominates (~328ns/event). Expected for interpreted eval. |
| all_three_thresholds | 596 | 1.72M/s | +178% | Maximum config: 3 limiters + bytes + VRL. Still >1.7M events/sec. |
| events_only/over_limit | 397 | 2.58M/s | +85% | Discard path: governor rejection + component_discarded_events_total (mandatory) + debug log. |
| with_dropped_port | 414 | 2.48M/s | +93% | Over-limit + routing to .dropped output. |
| high_cardinality_keys (100) | 403 | 2.54M/s | +88% | Template rendering for key_field. |

B. Metrics overhead (100 keys, events-only threshold)

| Benchmark | Time (µs) | vs metrics-off | What it measures |
| --- | --- | --- | --- |
| metrics_both_off | 363 | baseline | 100 keys, no per-key metrics. |
| metrics_legacy_only | 367 | +1.0% | Just events_discarded_total{key}. Negligible. |
| metrics_detailed_only | 636 | +75% | Full Tier 3: 3 counters + 3 gauges per event. |
| metrics_both_on | 643 | +77% | Both tiers. Nearly same as detailed-only. |
| metrics_detailed_high_cardinality (10K keys) | 779 | +115% | 10K unique keys × detailed metrics. |
| metrics_detailed_all_thresholds | 964 | +166% | Maximum: detailed + 3 threshold types. Worst case. |

C. Key cardinality scaling (no metrics)

| Threshold config | 10 keys | 100 keys | 1000 keys | 10→1000 factor |
| --- | --- | --- | --- | --- |
| events_only | 355µs | 365µs | 444µs | 1.25× |
| events+bytes | 405µs | 424µs | 589µs | 1.45× |
| all_three | 646µs | 681µs | 922µs | 1.43× |

Scaling is sublinear — 100× more keys causes only a 1.25-1.45× slowdown (DashMap O(1) amortized lookup).

D. Memory footprint per key

| Config | Theoretical per-key | 10K keys total |
| --- | --- | --- |
| events_only (1 limiter) | ~104 bytes | ~1.0 MB |
| events+bytes (2 limiters) | ~208 bytes | ~2.0 MB |
| all_three (3 limiters) | ~312 bytes | ~3.0 MB |
| all_three + detailed_metrics | ~448 bytes | ~4.4 MB |

Even 10K tenants × 3 thresholds + detailed metrics uses under 5 MB — negligible vs Vector's baseline RSS (50-100 MB).


Impact assessment

What existing users get (zero config changes needed)

| Aspect | Before | After |
| --- | --- | --- |
| Config syntax | threshold: 100 | Still works identically |
| Processing path | TaskTransform (async) | SyncTransform (sync, 25% faster) |
| Metrics emitted | component_discarded_events_total | Same + throttle_threshold_discarded_total (bounded, max 3 series) |
| Performance | Baseline | No regression (Criterion: "Performance has improved") |
| Memory | Baseline | No change |

What new users can opt into

| Feature | Overhead | Use case |
| --- | --- | --- |
| threshold.json_bytes | +52% throughput | Loki byte limits, cloud logging cost control |
| threshold.tokens | +157% throughput | Custom cost functions (message length, field-based pricing) |
| reroute_dropped | +3% on drop path | Dead-letter routing, replay, audit |
| emit_detailed_metrics | +75% throughput | Per-tenant dashboards, utilization alerting |

All new features are additive and opt-in. No existing behavior changes.


Real-world usage scenarios

Each feature below is opt-in and independent. Pick only what you need — overhead is only paid for features you enable.

threshold.json_bytes — byte-aware rate limiting (+52% overhead)

Problem: Event-count throttling treats a 50-byte healthcheck and a 100 KB stack trace identically. Downstream services don't.

Scenario 1: Loki per-stream byte rate limits

Loki enforces a default per_stream_rate_limit of 3 MB/s. When a service emits a burst of large log events, event-count throttling won't prevent 429 rejections because 100 events of 100 KB each = 10 MB, far exceeding the 3 MB limit — even if you set threshold: 100.

transforms:
  loki_guard:
    type: throttle
    inputs: ["app_logs"]
    window_secs: 1
    key_field: "{{ stream }}"
    threshold:
      json_bytes: 3000000   # Match Loki's 3 MB/stream/sec limit

This catches the burst before it reaches Loki, avoiding 429 cascades and the retry storms that follow.

Why +52% is worth it: The overhead comes from EstimatedJsonEncodedSizeOf — a fast recursive walk over the in-memory event value tree (~109ns/event, no serialization, no allocation). The percentage is relative to the highly-optimized events-only baseline (214µs for 1024 events); in absolute terms, json_bytes still processes 3.14M events/sec. For any pipeline where downstream charges by bytes or enforces byte limits, this prevents far more expensive outcomes: 429 retry storms, Loki stream lockouts, or unexpected cloud billing spikes.

Scenario 2: Edge/IoT bandwidth-aware throttling

On edge devices with limited uplink (e.g., 1 Mbps satellite link), you need to throttle by actual payload size, not event count. A heartbeat event (50 bytes) and a firmware diagnostic dump (500 KB) should not be treated equally:

transforms:
  bandwidth_guard:
    type: throttle
    inputs: ["edge_telemetry"]
    window_secs: 10
    key_field: "{{ device_id }}"
    threshold:
      json_bytes: 125000   # ~100 Kbps sustained per device

reroute_dropped — dead-letter routing (+3% on drop path only)

Problem: Today, throttled events are silently discarded. You have no way to replay them, audit what was lost, or route them to cheaper storage.

When reroute_dropped: true is set, throttled events are sent to a named .dropped output port instead of being discarded. The +3% overhead only applies to events that are actually being dropped (the happy path — events passing through — has zero overhead from this flag).

Scenario 1: Dead-letter queue for replay

Route throttled events to a file or S3 sink for later replay during off-peak hours:

transforms:
  rate_limit:
    type: throttle
    inputs: ["source"]
    window_secs: 60
    key_field: "{{ service }}"
    threshold:
      events: 1000
      json_bytes: 3000000
    reroute_dropped: true

sinks:
  primary:
    type: loki
    inputs: ["rate_limit"]
    # ... normal Loki config

  replay_queue:
    type: aws_s3
    inputs: ["rate_limit.dropped"]
    bucket: "my-dead-letter-bucket"
    key_prefix: "throttled/{{ service }}/%Y-%m-%d/"
    encoding:
      codec: json

During off-peak windows, replay from S3 back through Vector to recover dropped data — zero data loss, guaranteed byte-rate compliance.

Scenario 2: Audit trail for compliance

In regulated environments, you may need to prove what was dropped and why. Route dropped events to a local file with metadata:

sinks:
  audit_trail:
    type: file
    inputs: ["rate_limit.dropped"]
    path: "/var/log/vector/audit/throttled/%Y-%m-%d.jsonl"
    encoding:
      codec: json

Every dropped event is preserved verbatim (byte-identical to the input — the throttle transform never modifies events). Compliance teams can verify exactly what was rate-limited.

Scenario 3: Overflow to cheaper storage tier

Route excess traffic to a cheaper destination instead of dropping entirely:

sinks:
  primary:
    type: elasticsearch
    inputs: ["rate_limit"]        # Premium: fast, indexed, searchable
    # ... expensive cluster

  overflow:
    type: aws_s3
    inputs: ["rate_limit.dropped"] # Budget: cold storage, query on demand
    bucket: "overflow-logs"
    encoding:
      codec: json
      compression: gzip

Why +3% is negligible: The overhead only fires on the drop path, and it's just routing an event to a second output buffer. Compared to the value of not losing data, this is effectively free.

emit_detailed_metrics — per-tenant observability (+75% overhead)

Problem: Without per-key metrics, you know that something is being throttled but not which tenant, which threshold, or how close other tenants are to their limits.

When emit_detailed_metrics: true is set, the transform emits:

  • throttle_events_discarded_total{key, threshold_type} — which tenant hit which limit
  • throttle_events_processed_total{key} — total events per tenant (passed + dropped)
  • throttle_bytes_processed_total{key} — total byte volume per tenant
  • throttle_tokens_processed_total{key} — total custom token cost per tenant
  • throttle_utilization_ratio{key, threshold_type} — current usage / threshold (0.0-1.0+)

Scenario 1: Per-tenant dashboard for multi-tenant SaaS

You run a multi-tenant platform where each service has a logging quota. With detailed metrics piped to Prometheus + Grafana:

transforms:
  tenant_throttle:
    type: throttle
    inputs: ["all_services"]
    window_secs: 60
    key_field: "{{ service }}"
    threshold:
      events: 5000
      json_bytes: 10000000
    internal_metrics:
      emit_detailed_metrics: true

sources:
  vector_metrics:
    type: internal_metrics

sinks:
  prometheus:
    type: prometheus_exporter
    inputs: ["vector_metrics"]

Now you can build Grafana panels showing per-service:

  • Event rate vs quota (throttle_events_processed_total / threshold)
  • Byte volume vs budget (throttle_bytes_processed_total)
  • Drop rate by threshold type (throttle_events_discarded_total)
  • Utilization heatmap across all tenants (throttle_utilization_ratio)

Scenario 2: Proactive alerting at 80% utilization

The throttle_utilization_ratio gauge lets you alert before throttling kicks in:

# Alert: tenant approaching 80% of byte limit
throttle_utilization_ratio{threshold_type="json_bytes"} > 0.8

# Alert: any tenant actively being throttled
rate(throttle_events_discarded_total[5m]) > 0

# Dashboard: top 10 tenants by byte consumption
topk(10, throttle_bytes_processed_total)

Operators get advance warning to contact tenants, adjust quotas, or investigate runaway services — instead of finding out after data is already being dropped.

Scenario 3: Cost attribution and chargeback

For platforms that bill tenants for logging usage, throttle_bytes_processed_total{key} provides per-tenant byte volume that maps directly to cloud logging cost:

# Monthly byte ingestion per service (for billing)
increase(throttle_bytes_processed_total[30d])

# Cost estimate at $0.50/GB (e.g., Datadog)
increase(throttle_bytes_processed_total[30d]) / 1e9 * 0.50

Why +75% can be worth it: The overhead comes from updating 3-6 metric counters per event (each with a key tag requiring hash lookups in the metrics registry). This is significant, but:

  1. You only enable this where you need tenant visibility — not on every throttle transform in your pipeline
  2. The alternative is worse — without per-key metrics, diagnosing throttling issues requires log diving, guesswork, or adding a separate log_to_metric + aggregate chain that costs more than 75%
  3. Cardinality is bounded by your key_field — if you have 50 services, that's ~300 metric series (50 × 6 metrics). If you have 10K keys, consider whether you actually need per-key visibility for all of them, or if you can use a higher-level grouping
  4. The base throughput is still 1.6M+ events/sec — even with detailed metrics on, the transform processes events far faster than most sinks can consume them

When NOT to enable emit_detailed_metrics

| Scenario | Recommendation |
| --- | --- |
| key_field produces <500 unique values | Safe to enable — bounded cardinality, manageable series count |
| key_field produces 500-10K values | Enable with monitoring — watch Prometheus scrape times and memory |
| key_field produces >10K values or is unbounded (for example, user IDs) | Don't enable — use the always-on throttle_threshold_discarded_total{threshold_type} for aggregate visibility instead |
| No key_field configured | Low value — only one key ("None"), so detailed metrics add just 6 series. Marginal benefit over Tier 1 metrics |

Combining features: the cost is additive, not multiplicative

Each feature adds its overhead independently. Here's the combined cost for common configurations:

| Configuration | Total overhead | Typical use case |
| --- | --- | --- |
| threshold: 100 (existing) | -25% (faster) | Legacy configs — free upgrade |
| json_bytes only | +52% | Loki/cloud byte limits without tenant visibility |
| json_bytes + reroute_dropped | +52% normal / +55% drop path | Byte limits + dead-letter routing |
| json_bytes + reroute_dropped + emit_detailed_metrics | +75-90% | Full stack: byte limits, dead-letter, per-tenant dashboards |
| events + json_bytes + tokens (all thresholds, no metrics) | +178% | Maximum rate limiting, no observability overhead |

The typical production config — json_bytes with reroute_dropped — adds ~52-55% relative overhead vs the optimized events-only baseline, while still processing 2.80M+ events/sec and solving real problems that event-count throttling cannot address.


How did you test this PR?

Unit Tests (22 tests)

cargo test -p vector --lib --features transforms-throttle -- transforms::throttle

Tests cover: backward compat, all threshold types, dropped port routing, key independence, exclude condition bypass, VRL expression errors (defaults to cost 1), data integrity (events unmodified through throttle), completeness (no events lost/duplicated), metrics emission with correct tags, utilization tracking across windows, key cardinality scaling (10/100/1K keys), memory footprint measurement.

Integration Tests (7 tests)

cargo test --test integration --features throttle-integration-tests -- throttle

Real vector binary via assert_cmd: config validation, stdin→stdout event flow, dropped port routing to separate output, multi-threshold with key_field, backward compat simple threshold, exclude bypasses limit, data integrity verification.

Benchmarks (23 benchmarks)

cargo bench --bench transform --features transform-benches -- throttle

Three groups: throughput (8), metrics overhead (6), key cardinality scaling (9). Criterion with 200 samples, 30s measurement, statistical significance testing.

E2E Tests

cargo vdev e2e test throttle-transform

Docker Compose with 3 config variants (events-only, bytes, multi-threshold).

Static Analysis

cargo clippy -p vector --features transforms-throttle -- -D warnings  # Clean

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

The legacy threshold: <number> syntax is fully preserved. The only observable change for existing configs is:

  1. ~25% throughput improvement (SyncTransform + hot-path optimizations vs TaskTransform)
  2. One new bounded-cardinality metric (throttle_threshold_discarded_total)

Does this PR include user facing changes?

  • Yes. Changelog fragment included at changelog.d/11854_throttle_multi_threshold.feature.md.
  • No.

References

| Issue/PR | Relationship |
| --- | --- |
| #11854 | Closes — original feature request for byte-based throttling |
| #14280 | Builds on — earlier attempt at byte throttling (this PR uses a better architecture) |

Closes #11854

@szibis szibis requested review from a team as code owners February 20, 2026 15:43
@github-actions github-actions bot added domain: transforms Anything related to Vector's transform components domain: external docs Anything related to Vector's external, public documentation labels Feb 20, 2026
@szibis szibis force-pushed the feat/throttle-multi-threshold branch 2 times, most recently from cadda68 to 7fde598 Compare February 20, 2026 16:10
…output port

Add multi-dimensional rate limiting to the throttle transform with
independent thresholds for event count, estimated JSON byte size,
and custom VRL token expressions. Events are dropped when any
configured threshold is exceeded.

New capabilities:
- `threshold.events` — maximum events per window (backward compat with `threshold: N`)
- `threshold.json_bytes` — estimated JSON byte size via EstimatedJsonEncodedSizeOf
- `threshold.tokens` — VRL expression for custom cost (e.g. `strlen(string!(.message))`)
- `reroute_dropped` — routes throttled events to a named `.dropped` output port
- Per-key per-threshold observability metrics (opt-in via `emit_detailed_metrics`)

The legacy `threshold: <number>` syntax remains fully backward compatible.

Closes vectordotdev#11854
@szibis szibis mentioned this pull request Feb 20, 2026
@szibis szibis force-pushed the feat/throttle-multi-threshold branch from 7fde598 to 10eeee3 Compare February 20, 2026 16:15
@github-actions github-actions bot added the domain: ci Anything related to Vector's CI environment label Feb 20, 2026
@OliviaShoup
hey @szibis thank you for the PR! i've made an editorial review card for a docs team member to take a look: https://datadoghq.atlassian.net/browse/DOCS-13474

@urseberry urseberry left a comment


Left non-blocking suggestions to replace "e.g." with "for example" per the Datadog public documentation guidelines.

szibis and others added 6 commits February 21, 2026 08:47
Co-authored-by: Ursula Chen <58821586+urseberry@users.noreply.github.com>
Co-authored-by: Ursula Chen <58821586+urseberry@users.noreply.github.com>
Co-authored-by: Ursula Chen <58821586+urseberry@users.noreply.github.com>
Co-authored-by: Ursula Chen <58821586+urseberry@users.noreply.github.com>
Co-authored-by: Ursula Chen <58821586+urseberry@users.noreply.github.com>
@szibis szibis force-pushed the feat/throttle-multi-threshold branch from fb8400a to 8f67480 Compare February 23, 2026 21:36
Add overflow guards for VRL token cost (i64 → u32) and json_bytes
(usize → u32) conversions. Values exceeding u32::MAX are now clamped
instead of silently truncating.
When check_thresholds short-circuits (e.g., events limiter denies),
subsequent limiters never consume tokens from the governor. Update
utilization tracking to only count consumption for limiters that were
actually checked, preventing drift between reported utilization and
actual governor bucket state.

Also clarify that tokens_threshold intentionally uses json_bytes as its
budget, with a comment explaining the coupling.
- Warn when event cost exceeds governor burst capacity (check_key_n)
- Defer VRL evaluate_tokens until after events limiter passes, avoiding
  expensive event.clone() on already-rejected events
- Sample gauge emissions every 100 events instead of per-event
  (gauges overwrite so less frequent emission is equivalent)
- Bound utilization HashMap to 10K keys to prevent unbounded memory
  growth from high-cardinality key fields
- Reduce String allocations: avoid HashMap key clone when entry exists,
  allocate key_str once in process() for all metric emissions
- Inline threshold checking into process() to enable early exits