fix(query): Remove duplicated rows when datasets shared Publicly in the hub page#6016
Mrudhulraj wants to merge 2 commits into
👋 Thanks for opening this pull request, @Mrudhulraj! It looks like the pull request description doesn't quite follow our template yet:
Filling out the template helps reviewers understand and triage your contribution faster. Please edit the description to complete it. This message will disappear automatically once the template is followed. You can find the template prompts by editing the description, or see CONTRIBUTING.md for the full contribution flow.
Add some specs to both PRs
Thanks, this is much more focused now. The core fix looks right to me: filtering I have two requests:
Optional cleanup: the condition rewrite is logically okay, but the PR could be smaller if we keep the existing WHERE logic and only change the JOIN. Also, the new comments have a couple typos and “join is skipped” is slightly misleading because the join remains present but is forced to match no rows when
What changes were proposed in this PR?
Issue - Duplicate datasets on hub landing page / hub search
Symptom: A user creates a dataset, makes it public, and grants another user explicit access. When the grantee browses the hub, the dataset appears twice in the search results.
Root cause:
DatasetSearchQueryBuilder.constructFromClause produced this SQL:
path: amber\src\main\scala\org\apache\texera\web\resource\dashboard\DatasetSearchQueryBuilder.scala:72
For a dataset that is both public AND explicitly shared with the user, the LEFT JOIN produces one row per matching dataset_user_access row, and the OR in the WHERE clause is true for every one of those rows, so the dataset is returned more than once.
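The duplication can be reproduced outside Texera with a small in-memory SQLite database. This is only a sketch under stated assumptions: the two-table schema is a simplified stand-in for the real tables, the query shape (join on did only, UID filter in the WHERE clause) is inferred from the description above rather than copied from DatasetSearchQueryBuilder, and it assumes the owner (uid 1) has an access row of their own alongside the grantee (uid 2).

```python
import sqlite3

# Hypothetical, simplified schema; the real tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (did INTEGER PRIMARY KEY, name TEXT, is_public INTEGER);
    CREATE TABLE dataset_user_access (did INTEGER, uid INTEGER, privilege TEXT);
    -- Dataset 1 is public. Assumption: owner (uid 1) and grantee (uid 2)
    -- each have their own access row.
    INSERT INTO dataset VALUES (1, 'shared-and-public', 1);
    INSERT INTO dataset_user_access VALUES (1, 1, 'WRITE'), (1, 2, 'READ');
""")

# Assumed shape of the old query: the join matches every access row of the
# dataset, and the user filter only appears in WHERE.
OLD_QUERY = """
    SELECT d.did, dua.uid, dua.privilege
    FROM dataset d
    LEFT JOIN dataset_user_access dua ON dua.did = d.did
    WHERE dua.uid = ? OR d.is_public = 1
    ORDER BY dua.uid
"""

# Search as the grantee (uid 2).
rows_old = conn.execute(OLD_QUERY, (2,)).fetchall()
print(rows_old)  # the same did appears once per access row
```

The owner's access row survives the WHERE clause through the is_public branch, and the grantee's own row survives through both branches, so the grantee sees dataset 1 twice.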
The same applies to workflows.
Fix 1 — DatasetSearchQueryBuilder.constructFromClause
Move the UID filter from the WHERE clause into the JOIN's ON clause so each dataset produces at most one joined row, and force the JOIN condition to FALSE when uid == null so the SELECT still references a valid table.
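A minimal sketch of the same idea against an in-memory SQLite database, not the actual jOOQ code in the PR. The schema, the sample rows, the search helper, and the exact WHERE condition are illustrative assumptions; the sketch also assumes at most one access row per (did, uid) pair, which is what makes "at most one joined row" hold. `1 = 0` stands in for SQL FALSE for portability across SQLite versions.

```python
import sqlite3

# Hypothetical, simplified schema; the real tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (did INTEGER PRIMARY KEY, name TEXT, is_public INTEGER);
    CREATE TABLE dataset_user_access (did INTEGER, uid INTEGER, privilege TEXT);
    INSERT INTO dataset VALUES
        (1, 'public-and-shared', 1),
        (2, 'private-shared', 0),
        (3, 'private-unshared', 0);
    -- uid 1 owns everything; uid 2 is granted READ on datasets 1 and 2.
    INSERT INTO dataset_user_access VALUES
        (1, 1, 'WRITE'), (1, 2, 'READ'),
        (2, 1, 'WRITE'), (2, 2, 'READ'),
        (3, 1, 'WRITE');
""")


def search(uid):
    """Return (did, privilege) rows visible to uid; None models a logged-out visitor."""
    if uid is None:
        # The table stays in FROM, but no access row can ever join.
        on_filter, params = "1 = 0", ()
    else:
        # The user filter lives in ON, so only this user's row can join.
        on_filter, params = "dua.uid = ?", (uid,)
    # Only the two fixed literals above are interpolated; uid is bound as a parameter.
    sql = f"""
        SELECT d.did, dua.privilege
        FROM dataset d
        LEFT JOIN dataset_user_access dua
          ON dua.did = d.did AND {on_filter}
        WHERE dua.uid IS NOT NULL OR d.is_public = 1
        ORDER BY d.did
    """
    return conn.execute(sql, params).fetchall()


grantee_rows = search(2)       # public-and-shared dataset appears once
anonymous_rows = search(None)  # only the public dataset, with NULL privilege
owner_rows = search(1)
print(grantee_rows, anonymous_rows, owner_rows)
```

With the filter in the ON clause, each dataset joins to at most the requesting user's own access row, so a dataset that is both public and shared is listed once, and a logged-out visitor sees public datasets with a NULL privilege.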
Why AND FALSE for uid == null?
The SELECT references DATASET_USER_ACCESS.PRIVILEGE. Without dataset_user_access in the FROM, the DB throws: missing FROM-clause entry for table "dataset_user_access".
AND FALSE keeps the table in the FROM while making the JOIN yield NULL access columns, which is the correct semantic for "no explicit grant".
Behavior matrix:
Any related issues, documentation, discussions?
Refs #5957
How was this PR tested?
Tested manually with database checks and UI workflow testing.
Was this PR authored or co-authored using generative AI tooling?
No AI tools were used in the process.