Update collection stats (e.g. date ranges, counts, tags) in background job #3241
ikreymer
left a comment
Nicely done! Just left one comment about the list-to-set change.
For larger collections, as discussed, we probably want to add a status indicator that the collection is being updated, but otherwise all working well.
Tested deleting a crawl in a collection and saw that the index got updated too!
Working on status indicator now. Note to self: Check if public collections should also poll to receive updates / show "updating" status indicator
39c1aa0 to 461d73f
@ikreymer I've added a computed @SuaYoo This should now be good for you to do frontend work. Namely:
Thanks both!
Wherever updating collection counts, tags, and dates would block API responses, this commit moves those operations to an asyncio task instead. It also ensures those tasks aren't garbage collected before they complete. In addition, it consolidates updating counts, tags, and dates into a single function, `update_collection_stats`, that is used uniformly; previously it was inconsistent whether the collection's date range was always updated.
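For context on the garbage collection point: the event loop only keeps weak references to tasks, so a task created with `asyncio.create_task` and not stored anywhere can be collected before it finishes. A minimal sketch of the usual pattern follows, assuming a module-level set as the registry. The names `background_tasks` and `run_in_background` are hypothetical, and the `update_collection_stats` body here is a stand-in, not the function from this PR.

```python
import asyncio

# Hypothetical registry holding strong references to in-flight tasks.
background_tasks: set[asyncio.Task] = set()


def run_in_background(coro) -> asyncio.Task:
    """Schedule a coroutine and keep a strong reference until it finishes."""
    task = asyncio.create_task(coro)
    background_tasks.add(task)
    # Drop the reference once the task is done so the set does not grow forever.
    task.add_done_callback(background_tasks.discard)
    return task


async def update_collection_stats(coll_id: str, results: dict) -> None:
    # Stand-in for the real stats update (counts, tags, date range).
    await asyncio.sleep(0.01)
    results[coll_id] = "updated"


async def main() -> dict:
    results: dict = {}
    task = run_in_background(update_collection_stats("coll-1", results))
    # A request handler would return here without awaiting the task;
    # this sketch awaits it only so the result can be inspected.
    await task
    await asyncio.sleep(0)  # give the done callback a chance to run
    return results


results = asyncio.run(main())
```

The done callback keeps the registry from leaking references, while the set membership is what prevents the task from being collected mid-run.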
Also set the background job `ttlSecondsAfterFinished` to 0 so that the finalizer runs immediately. Otherwise updates appear to still be running long after the actual stats update has been committed, which results in an odd UX in the frontend.
…) (#3262) Adds an "updating" indicator to various places in the collection detail page when a collection's `runningUpdatesCount` is > 0.
756f193 to e571e6a
One thing I've noticed with the frontend in place is that the stats sometimes get updated a few seconds before the background job count updates. Is there much we can do about this, or is it just a consequence of how background jobs work?
This is due to the slight gap between when the business logic of the background job finishes (i.e. the collection stats update) and when the background job runs through the operator's finalizer and is completed by Kubernetes. I changed the TTL for how long background jobs are persisted after finishing and before being removed by Kubernetes to 0 seconds from its previous value of 90 seconds (where this lag was much more noticeable), but that still sometimes leaves a difference of a few seconds, as you've noticed. Before settling on computing the number of running update jobs, I initially tried to solve this timing issue by having the business logic of the update set a boolean on the collection indicating whether it was being updated. But since multiple actions can trigger updates at any given time, that introduced more potential for race conditions. I'm sure we could engineer a proper solution to get rid of the lag, but I suspect it would take more effort than it's worth to eliminate a few extra seconds of the spinner being shown.
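To illustrate the race described above, here is a small sketch comparing a single boolean flag with a count derived from the set of running jobs. Everything in it is hypothetical: the `Collection` class, `start_update`, `finish_update`, and the in-memory `running_jobs` set stand in for the real background job records and are not the PR's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    # Naive approach: one flag shared by every update job.
    is_updating: bool = False
    # Count-based approach: track each running job individually.
    running_jobs: set = field(default_factory=set)

    @property
    def running_updates_count(self) -> int:
        # Computed from the jobs themselves, so it cannot drift.
        return len(self.running_jobs)


def start_update(coll: Collection, job_id: str) -> None:
    coll.is_updating = True
    coll.running_jobs.add(job_id)


def finish_update(coll: Collection, job_id: str) -> None:
    # The flag is cleared even if another job is still running.
    coll.is_updating = False
    coll.running_jobs.discard(job_id)


coll = Collection()
start_update(coll, "job-a")
start_update(coll, "job-b")  # a second action triggers an overlapping update
finish_update(coll, "job-a")

flag_after_first = coll.is_updating             # flag says "done" too early
count_after_first = coll.running_updates_count  # count still sees job-b

finish_update(coll, "job-b")
```

With overlapping jobs, the flag reports the collection as idle as soon as the first job finishes, while the derived count stays above zero until the last job is gone.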
Tested with @emma-sg's frontend updates locally and on dev with some larger (<100 GB) collections, and all seems good! The user experience is nice in the end, good work :)
emma-sg
left a comment
Backend is working as expected, from my testing!
ikreymer
left a comment
Works well! The delay in the update is fine and hardly noticeable, as it just means a larger collection is still being updated. Tested with larger collections.
Fixes #3218
This PR moves updating of collections after changes (e.g. items being added or removed) to a background job, to ensure that collection API requests remain quick.
Changes
- runningUpdatesCount to collection detail and replay.json endpoints
- runningUpdatesCount > 0 (implemented by @emma-sg, thank you!)
- asyncio.create_task have also been updated so that they will not be garbage collected before they complete (see [Task]: Ensure asyncio tasks aren't garbage collected before they complete #3240 for more context and tracking of completing this across the rest of the backend)
Testing
Nightly test run: https://github.com/webrecorder/browsertrix/actions/runs/23769472591