Summary
- The worker processes some tasks (e.g., shipment_tasks.process_external_shipments) but ignores tracking tasks from other modules (enqueued by the scheduler or the API). After multiple Swarm restarts (remove and re-add the service), the worker starts consuming them. Removing --max-tasks-per-child "5000" fixes it reliably.
- Regression: stable for ~1 month; the issue started yesterday. Today, after many restarts, tasks began processing again. I suspect tasks don’t register properly after a child recycle.
Environment
- Orchestrator: Docker Swarm
- OS: Linux 5.4.0-216-generic
- Python: 3.x
- TaskIQ: 0.11.18
- taskiq-redis: 1.1.1
- taskiq-rate-limiter: 0.1.0b2
- redis (python): 6.4.0
- Broker: ListQueueBroker (Redis), queue_name="shipaxis"
- Result backend: RedisAsyncResultBackend (Redis)
- Scheduler: TaskiqScheduler + LabelScheduleSource
- Env: ENV=prod (cron labels are gated on PROJECT_ENV == "prod")
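For context, here is a minimal sketch of the broker/scheduler wiring implied by the environment above (the module layout, env var name, and Redis URL are my assumptions, not the actual app.taskiq code):

```python
# Minimal sketch, assuming a single module that the worker/scheduler CLIs point at.
import os

from taskiq import TaskiqScheduler
from taskiq.schedule_sources import LabelScheduleSource
from taskiq_redis import ListQueueBroker, RedisAsyncResultBackend

REDIS_URL = os.environ.get("REDIS_URL", "redis://redis:6379/0")  # assumed env var

# Broker consumes from the "shipaxis" Redis list; results are stored in Redis.
taskiq_broker = ListQueueBroker(
    url=REDIS_URL,
    queue_name="shipaxis",
).with_result_backend(RedisAsyncResultBackend(redis_url=REDIS_URL))

# Scheduler reads cron/schedule labels from tasks registered on the broker.
taskiq_scheduler = TaskiqScheduler(
    broker=taskiq_broker,
    sources=[LabelScheduleSource(taskiq_broker)],
)
```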
Commands/Flags
- Worker:
taskiq worker app.taskiq.worker:taskiq_broker \
--workers 4 \
--tasks-pattern "**/*_tasks.py" \
--fs-discover \
--log-level INFO \
--max-async-tasks 60 \
--max-prefetch 10 \
--ack-type when_executed \
--max-tasks-per-child "5000" # problematic
- Scheduler:
taskiq scheduler app.taskiq.scheduler:taskiq_scheduler \
--tasks-pattern "**/*_tasks.py" \
--fs-discover \
--log-level INFO \
--skip-first-run
What happens
- Tracking tasks from courier modules (queued by scheduler or API) are enqueued to Redis but not consumed.
- Other tasks (from a different module) are consumed normally.
- After several Swarm restarts (remove service, re-add), tracking tasks begin processing.
- Removing --max-tasks-per-child "5000" makes the problem disappear across redeploys.
Expected
- Worker consistently registers and processes all discovered tasks across modules, with or without child recycling.
Actual
- With --max-tasks-per-child set, the worker intermittently fails to “see” or process tracking tasks from certain modules, while other tasks work.
Repro steps (approx)
- Run the worker with --fs-discover, --tasks-pattern "**/*_tasks.py", and --max-tasks-per-child "5000".
- Define tasks across multiple modules (e.g., shipment_tasks.py and courier/*_tasks.py); see the sketch after these steps.
- Enqueue tasks from both groups (some via the scheduler, some via the API).
- Observe that only the shipment task(s) run; tracking tasks remain queued.
- Restart/recreate services multiple times in Swarm → tracking tasks eventually start running.
- Remove --max-tasks-per-child → tracking tasks process reliably from first boot.
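Roughly how the two task groups are laid out (file paths, the courier task name, and the cron label below are illustrative; only process_external_shipments is named in this report):

```python
# app/tasks/shipment_tasks.py -- this group is consumed fine even in the failing state.
from app.taskiq.worker import taskiq_broker

@taskiq_broker.task
async def process_external_shipments() -> None:
    ...

# app/tasks/courier/dhl_tasks.py -- hypothetical tracking module; these tasks
# stay queued until enough restarts happen.
from app.taskiq.worker import taskiq_broker

@taskiq_broker.task(schedule=[{"cron": "*/10 * * * *"}])  # picked up by LabelScheduleSource
async def track_dhl_shipments() -> None:
    ...
```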
Timeline/Regression
- Running fine for about a month.
- Started failing yesterday without intentional code changes.
- Today, after many restarts (rm/re-add in Swarm), tasks started processing again.
- Strong suspicion: tasks aren’t registered correctly after child recycle.
Notes/Observations
- Using fs-discovery; if a module errors at import, its tasks won’t register. No obvious import errors at the INFO log level.
- Explicitly importing the task packages at worker startup (e.g., import app.tasks, where __init__ imports courier/*_tasks.py) makes behavior deterministic: tasks always register. A sketch of this workaround follows below.
- Hypothesis: child recycling plus fs-discovery/import timing leaves some children without the full task registry after a fork, or there is an import/env race. Removing the per-child cap avoids recycle-related non-determinism.
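A sketch of that explicit-import workaround, assuming the package layout used elsewhere in this report (the courier module names are invented):

```python
# app/tasks/__init__.py -- eagerly import every *_tasks module so the
# @taskiq_broker.task decorators run and the registry is complete,
# independent of fs-discovery and child-recycle timing.
from app.tasks import shipment_tasks  # noqa: F401
from app.tasks.courier import dhl_tasks, fedex_tasks  # noqa: F401  (hypothetical names)

# app/taskiq/worker.py -- at the bottom of the module, after taskiq_broker is
# defined (avoids a circular import), so registration happens at broker import
# time, before any worker child starts consuming.
import app.tasks  # noqa: F401
```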
Questions
- Is --max-tasks-per-child supported with fs-discovery across multiple task modules?
- Should fs-discovery be avoided when using child recycling, and explicit imports preferred?
- Any recommended pattern to ensure a complete task registry after child recycle?