Skip to content

Worker sometimes doesn’t pick up specific tasks (scheduler/API enqueued) until multiple restarts #544

@vishal12032000

Description

@vishal12032000

Summary

  • Worker processes some tasks (e.g., shipment_tasks.process_external_shipments) but ignores tracking tasks from other modules (scheduler/API enqueued). After multiple Swarm restarts (remove/re-add service), worker starts consuming them. Removing --max-tasks-per-child "5000" fixes it reliably.
  • Regression: stable for ~1 month; issue started yesterday. Today, after many restarts, tasks began processing again. I suspect tasks don’t register properly after child recycle.

Environment

  • Orchestrator: Docker Swarm
  • OS: Linux 5.4.0-216-generic
  • Python: 3.x
  • TaskIQ: 0.11.18
  • taskiq-redis: 1.1.1
  • taskiq-rate-limiter: 0.1.0b2
  • redis (python): 6.4.0
  • Broker: ListQueueBroker (Redis), queue_name="shipaxis"
  • Result backend: RedisAsyncResultBackend (Redis)
  • Scheduler: TaskiqScheduler + LabelScheduleSource
  • Env: ENV=prod (cron labels are gated on PROJECT_ENV == "prod")

Commands/Flags

  • Worker:
taskiq worker app.taskiq.worker:taskiq_broker \
  --workers 4 \
  --tasks-pattern "**/*_tasks.py" \
  --fs-discover \
  --log-level INFO \
  --max-async-tasks 60 \
  --max-prefetch 10 \
  --ack-type when_executed \
  --max-tasks-per-child "5000"   # problematic
  • Scheduler:
taskiq scheduler app.taskiq.scheduler:taskiq_scheduler \
  --tasks-pattern "**/*_tasks.py" \
  --fs-discover \
  --log-level INFO \
  --skip-first-run

What happens

  • Tracking tasks from courier modules (queued by scheduler or API) are enqueued to Redis but not consumed.
  • Other tasks (from a different module) are consumed normally.
  • After several Swarm restarts (remove service, re-add), tracking tasks begin processing.
  • Removing --max-tasks-per-child "5000" makes the problem disappear across redeploys.

Expected

  • Worker consistently registers and processes all discovered tasks across modules, with or without child recycling.

Actual

  • With --max-tasks-per-child set, the worker intermittently fails to “see” or process tracking tasks from certain modules, while other tasks work.

Repro steps (approx)

  1. Run worker with --fs-discover, --tasks-pattern "**/*_tasks.py", and --max-tasks-per-child "5000".
  2. Define tasks across multiple modules (e.g., shipment_tasks.py and courier/*_tasks.py).
  3. Enqueue tasks from both groups (some via scheduler, some via API).
  4. Observe that only the shipment task(s) run; tracking tasks remain queued.
  5. Restart/recreate services multiple times in Swarm → tracking tasks eventually start running.
  6. Remove --max-tasks-per-child → tracking tasks process reliably from first boot.

Timeline/Regression

  • Running fine for about a month.
  • Started failing yesterday without intentional code changes.
  • Today, after many restarts (rm/re-add in Swarm), tasks started processing again.
  • Strong suspicion: tasks aren’t registered correctly after child recycle.

Notes/Observations

  • Using fs-discovery; if a module errors at import, tasks won’t register. No obvious import errors at INFO log level.
  • Explicitly importing the task packages at worker startup (e.g., import app.tasks where __init__ imports courier/*_tasks.py) makes behavior deterministic (tasks always register).
  • Hypothesis: child recycling + fs-discovery/import timing leaves some children without the full registry after fork, or there’s an import/env race. Removing the per-child cap avoids recycle-related non-determinism.

Questions

  • Is --max-tasks-per-child supported with fs-discovery across multiple task modules?
  • Should fs-discovery be avoided when using child recycling, and explicit imports preferred?
  • Any recommended pattern to ensure a complete task registry after child recycle?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions