Update collection stats (e.g. date ranges, counts, tags) in background job#3241

Merged: ikreymer merged 22 commits into main from issue-3218-update-collections-async on Apr 16, 2026

@tw4l tw4l commented Mar 30, 2026

Fixes #3218

This PR moves updating of collections after changes (e.g. items being added or removed) to a background job, to ensure that collection API requests remain quick.

Changes

  • New background job added to recalculate collection stats
  • Ensure all instances where collection statistics would be re-created as part of an API method now kick off a background job instead of awaiting (long-running processes such as org import still await instead)
  • Ensure that collections and collection dedupe indexes are fully updated following crawl deletion
  • Add computed runningUpdatesCount to collection detail and replay.json endpoints
  • Backend and nightly tests updated to account for the changes
  • Poll in the collection frontend to pick up updates after changes that kick off stats recalculation (e.g. adding or removing items), and display an "Updating" spinner while runningUpdatesCount > 0 (implemented by @emma-sg, thank you!)
  • A few places in the backend modules these changes touch that used asyncio.create_task have also been updated so that the tasks will not be garbage collected before they complete (see "[Task]: Ensure asyncio tasks aren't garbage collected before they complete", #3240, for more context and for tracking this work across the rest of the backend)

Testing

  • Spin up a Browsertrix instance
  • Create a collection with some items
  • Add and remove items and then verify that the collection stats update not long after, and an "Updating" spinner icon is shown while the updates are being applied
  • Verify that background jobs have been created and marked as successful in the database via API
  • Verify that deleting crawls from an item updates related collections (can check modified timestamp + verify that index update job is run for dedupe index)
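
When checking the steps above via the API rather than the UI, a small polling helper can wait until a collection reports no running updates. This is a rough sketch under stated assumptions: `wait_for_collection_updates` and the injected `fetch_collection` callable are hypothetical, and the request that actually retrieves the collection detail is left to the caller. Only the `runningUpdatesCount` field name comes from this PR.

```python
import time
from typing import Callable


def wait_for_collection_updates(
    fetch_collection: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the collection reports runningUpdatesCount == 0.

    fetch_collection is a hypothetical caller-supplied function that
    returns the collection detail response as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        coll = fetch_collection()
        if coll.get("runningUpdatesCount", 0) == 0:
            return coll
        if time.monotonic() >= deadline:
            raise TimeoutError("collection still updating after timeout")
        sleep(interval)


if __name__ == "__main__":
    # Fake responses standing in for successive API calls.
    fake_responses = iter(
        [{"runningUpdatesCount": 1}, {"runningUpdatesCount": 0}]
    )
    final = wait_for_collection_updates(
        lambda: next(fake_responses), interval=0.0
    )
    print(final)
```

Injecting `fetch_collection` and `sleep` keeps the helper independent of any particular HTTP client and makes it easy to exercise with canned responses.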

Nightly test run: https://github.com/webrecorder/browsertrix/actions/runs/23769472591

Comment thread backend/btrixcloud/colls.py Outdated
Comment thread backend/btrixcloud/colls.py Outdated

@ikreymer ikreymer left a comment

Nicely done! Just left one comment about the list-to-set change.

For larger collections, as discussed, we probably want to add a status indicator that the collection is being updated, but otherwise all working well.

Tested delete of crawl in collection and saw that index got updated too!

tw4l commented Apr 2, 2026

Working on status indicator now.

Note to self: Check if public collections should also poll to receive updates / show "updating" status indicator

@tw4l tw4l force-pushed the issue-3218-update-collections-async branch 3 times, most recently from 39c1aa0 to 461d73f Compare April 8, 2026 00:24
Comment thread frontend/src/pages/org/collection-detail/collection-detail.ts Outdated
@tw4l tw4l requested a review from ikreymer April 8, 2026 14:39
tw4l commented Apr 8, 2026

@ikreymer I've added a computed runningUpdatesCount to the CollOut and PublicCollOut responses, so that the frontend can determine whether a collection is updating from this count. For that to be a good user experience, I had to reduce the ttlSecondsAfterFinished value; otherwise the finalizer doesn't run until after that delay, and the job appears to still be running long after the collection has actually been updated. I've set it to ttlSecondsAfterFinished: 0, as I think the previous delay is mostly only useful for development and we can always bump the value in dev branches as needed.

@SuaYoo This should now be good for you to do frontend work. Namely:

  • Adding some visual indicator to the collection page that a collection is in the process of updating if its runningUpdatesCount > 0
  • Should also look at the public collection view to see if polling and/or the updating status icon is needed there as well

Thanks both!

tw4l added 14 commits April 15, 2026 17:23
Wherever updating collection counts, tags, and dates would block API
responses, this commit moves those operations instead to an asyncio
task. It also ensures those tasks aren't garbage collected before they
are completed.

In addition, this moves updating counts, tags, and dates into a single
function update_collection_stats that is used uniformly, as it was
previously inconsistent whether the collection's date range was always
updated.
ikreymer and others added 6 commits April 15, 2026 17:23
Also set background job ttlSecondsAfterFinished to 0 so that the
finalizer runs immediately. Otherwise it seems that updates
are still running far after the actual stats update has been
committed, which results in an odd UX in the frontend.
…) (#3262)

Adds an "updating" indicator to various places in the collection detail
page when a collection's `runningUpdatesCount` is > 0.
@emma-sg emma-sg force-pushed the issue-3218-update-collections-async branch from 756f193 to e571e6a Compare April 15, 2026 21:24
emma-sg commented Apr 15, 2026

One thing I've noticed with the frontend in place is that the stats get updated sometimes a few seconds before the background job count updates, is there much we might be able to do about this, or is it just a consequence of how background jobs work?

tw4l commented Apr 15, 2026

> One thing I've noticed with the frontend in place is that the stats get updated sometimes a few seconds before the background job count updates, is there much we might be able to do about this, or is it just a consequence of how background jobs work?

This is due to the slight gap between when the business logic of the background job finishes (i.e. the collection stats update) and when the background job runs through the operator's finalizer and is completed by Kubernetes. I changed the TTL for how long background jobs are persisted after finishing and before being removed by Kubernetes to 0 seconds from its previous value of 90 seconds (where this lag was much more noticeable), but that still leaves a difference of a few seconds sometimes, as you've noticed.

Before settling on computing the number of update jobs running, I initially tried to solve this timing issue by having the business logic of the update set a boolean on the collection indicating whether the collection was being updated or not, but since we can have multiple actions triggering updates happening at any given time, that introduced more of a possibility for race conditions.

I'm sure we could engineer a proper solution to get rid of the lag, but I suspect it'd be higher effort than might be worth it to eliminate a few extra seconds of the spinner being shown.
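
To illustrate the race described above, here is a simplified in-memory sketch. It is an analogy only: in this PR the count is computed from running background jobs, not stored on the collection, and the classes `FlagCollection` and `CountCollection` are hypothetical. With two overlapping updates, the boolean is cleared by whichever update finishes first, while a count stays above zero until the last one completes.

```python
import asyncio


class FlagCollection:
    """Tracks update state with a single boolean (the approach set aside)."""

    def __init__(self) -> None:
        self.updating = False

    async def update(self, duration: float) -> None:
        self.updating = True
        await asyncio.sleep(duration)
        # Clears the flag even if another update is still in progress.
        self.updating = False


class CountCollection:
    """Tracks how many updates are currently running."""

    def __init__(self) -> None:
        self.running_updates = 0

    async def update(self, duration: float) -> None:
        self.running_updates += 1
        try:
            await asyncio.sleep(duration)
        finally:
            self.running_updates -= 1


async def observe(coll, is_updating) -> bool:
    """Start a short and a long update, then read state after the short one."""
    short = asyncio.create_task(coll.update(0.01))
    long = asyncio.create_task(coll.update(0.2))
    await short  # the long update is still running at this point
    seen = is_updating(coll)
    await long
    return seen


async def main() -> tuple[bool, bool]:
    flag_seen = await observe(FlagCollection(), lambda c: c.updating)
    count_seen = await observe(
        CountCollection(), lambda c: c.running_updates > 0
    )
    return flag_seen, count_seen


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The flag-based reading reports that nothing is updating while the long update is still in flight, whereas the count-based reading remains correct.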

tw4l commented Apr 15, 2026

Tested with @emma-sg's frontend updates locally and on dev with some larger (<100 GB) collections, and all seems good! The user experience is nice in the end, good work :)

@emma-sg emma-sg left a comment

Backend is working as expected, from my testing!

@ikreymer ikreymer left a comment

Works well! The delay in the update is fine and hardly noticeable, as it just means a larger collection is still being updated. Tested with larger collections.

@ikreymer ikreymer merged commit b88d792 into main Apr 16, 2026
30 checks passed
@ikreymer ikreymer deleted the issue-3218-update-collections-async branch April 16, 2026 00:52
Development

Successfully merging this pull request may close these issues.

[Task]: Update collection counts, tags, and dates in background job

3 participants