fix: Remove duplicated rows when dataset/workflows shared Publicly in the hub page #5962
Mrudhulraj wants to merge 2 commits into
Doesn't SELECT DISTINCT work? |
|
No, SELECT DISTINCT wouldn't work here when the dataset is both explicitly shared and publicly available. The LEFT JOIN produces one row per matching dataset_user_access row, and the OR then makes both branches true. |
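A minimal sketch of why DISTINCT does not help, using an in-memory SQLite database as a stand-in for the project's database. The table layout, the sample rows, and the shape of the old query are simplified assumptions for illustration, not the actual Texera schema or the SQL generated by DatasetSearchQueryBuilder.

```python
import sqlite3

# Hypothetical, simplified schema (assumption): one public dataset with two
# access rows, the owner (uid 1, WRITE) and the grantee (uid 2, READ).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (did INTEGER PRIMARY KEY, name TEXT, is_public INTEGER);
    CREATE TABLE dataset_user_access (
        did INTEGER, uid INTEGER, privilege TEXT, PRIMARY KEY (did, uid)
    );
    INSERT INTO dataset VALUES (10, 'demo', 1);
    INSERT INTO dataset_user_access VALUES (10, 1, 'WRITE'), (10, 2, 'READ');
""")

# Assumed shape of the old query: the uid filter sits in WHERE and is OR-ed
# with the public flag, so a public dataset keeps every joined access row.
OLD_QUERY = """
    SELECT {distinct} d.did, d.name, a.privilege
    FROM dataset d
    LEFT JOIN dataset_user_access a ON a.did = d.did
    WHERE d.is_public = 1 OR a.uid = ?
    ORDER BY a.privilege
"""

# Search as the grantee (uid 2), with and without DISTINCT.
plain_rows = conn.execute(OLD_QUERY.format(distinct=""), (2,)).fetchall()
distinct_rows = conn.execute(OLD_QUERY.format(distinct="DISTINCT"), (2,)).fetchall()

print(plain_rows)
print(distinct_rows)
```

Under these assumptions both queries return two rows for the same dataset. The rows differ in the selected privilege column (READ versus WRITE), so DISTINCT has nothing to collapse.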
|
Let me explain a bit further @carloea2 with schema : Table "texera_db.dataset_user_access" Column | Type | Nullable | Default Indexes: Foreign-key constraints: Table "texera_db.dataset" Foreign-key constraints: What we see is that when Also when After applying joins with PS:
This applies to workflows too! Hope this clarifies!! |
|
Thanks for working on this. Would you mind splitting the dataset and workflow fixes into separate PRs? They share the same root cause, but they touch different query builders and the workflow case is more complex because it also includes project access. I suggest:
For the first PR, please use Also, let’s keep the changes focused on the duplicate-row issue and avoid unrelated changes such as |
|
@Mrudhulraj Thanks. I hope @carloea2 can take the lead to review these PRs. @carloea2 After that, feel free to add a committer to review and merge them. |
What changes were proposed in this PR?
Issue - Duplicate datasets/workflows on hub landing page / hub search
Symptom: A user creates a dataset, makes it public, and grants another user explicit access. When the grantee browses the hub, the dataset appears twice in the search results.
Root cause:
DatasetSearchQueryBuilder.constructFromClause produced this SQL:
path: amber\src\main\scala\org\apache\texera\web\resource\dashboard\DatasetSearchQueryBuilder.scala:72
For a dataset that is both public AND explicitly shared with the user, the LEFT JOIN produces one row per matching dataset_user_access row and the OR makes both branches true.
The same applies to workflows.
Fix 1 — DatasetSearchQueryBuilder.constructFromClause
Move the UID filter from the WHERE clause into the JOIN's ON clause so each dataset produces at most one joined row, and force the JOIN condition to FALSE when uid == null so the SELECT still references a valid table.
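The effect of this change can be sketched with an in-memory SQLite database. Everything here is an illustrative assumption: the simplified schema, the (did, uid) primary key, the sample rows, the hypothetical search helper, and in particular the remaining WHERE clause, which this description does not spell out.

```python
import sqlite3

# Hypothetical, simplified schema (assumption): dataset 10 is public and
# shared with uid 2; dataset 11 is private and visible only to its owner.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (did INTEGER PRIMARY KEY, is_public INTEGER);
    CREATE TABLE dataset_user_access (
        did INTEGER, uid INTEGER, privilege TEXT, PRIMARY KEY (did, uid)
    );
    INSERT INTO dataset VALUES (10, 1), (11, 0);
    INSERT INTO dataset_user_access VALUES
        (10, 1, 'WRITE'), (10, 2, 'READ'), (11, 1, 'WRITE');
""")

def search(uid):
    """Hypothetical helper mirroring the fixed join pattern."""
    if uid is None:
        # Anonymous user: keep the table in FROM, but match no access row.
        on_clause, params = "a.did = d.did AND FALSE", ()
    else:
        # The uid filter lives in ON, so at most one access row joins per
        # dataset (given the assumed (did, uid) primary key).
        on_clause, params = "a.did = d.did AND a.uid = ?", (uid,)
    sql = f"""
        SELECT d.did, a.privilege
        FROM dataset d
        LEFT JOIN dataset_user_access a ON {on_clause}
        WHERE d.is_public = 1 OR a.uid IS NOT NULL  -- assumed visibility rule
        ORDER BY d.did
    """
    return conn.execute(sql, params).fetchall()

print(search(2))     # grantee
print(search(1))     # owner
print(search(None))  # anonymous
```

In this sketch the grantee sees the public dataset once with their own privilege, while the anonymous user sees it once with a NULL privilege, which is the "no explicit grant" case.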
Why AND FALSE for uid == null?
The SELECT references DATASET_USER_ACCESS.PRIVILEGE. Without dataset_user_access in the FROM, the DB throws: missing FROM-clause entry for table "dataset_user_access". AND FALSE keeps the table in the FROM while making the JOIN yield NULL access columns, which is the correct semantics for "no explicit grant".
Behavior matrix:
Fix 2 — WorkflowSearchQueryBuilder.toEntryImpl + WorkflowSearchQueryBuilder.constructFromClause
Apply the same JOIN pattern as in Fix 1 to workflows, and add null-safe getters to handle the now-NULL access columns:
path: amber\src\main\scala\org\apache\texera\web\resource\dashboard\WorkflowSearchQueryBuilder.scala
Any related issues, documentation, discussions?
Fixes #5957
How was this PR tested?
Tested manually with database checks and UI workflow testing.
Was this PR authored or co-authored using generative AI tooling?
No AI tools were used in the process.