fix(query): Remove duplicated rows when datasets shared Publicly in the hub page by Mrudhulraj · Pull Request #6016 · apache/texera

Mrudhulraj · 2026-06-29T02:18:43Z

What changes were proposed in this PR?

Issue - Duplicate datasets on hub landing page / hub search

Symptom: A user creates a dataset, makes it public, and grants another user explicit access. When the grantee browses the hub, the dataset appears twice in the search results.

Root cause:

DatasetSearchQueryBuilder.constructFromClause produced this SQL:

path: amber\src\main\scala\org\apache\texera\web\resource\dashboard\DatasetSearchQueryBuilder.scala:72

  SELECT DISTINCT ...
  FROM dataset
  LEFT JOIN dataset_user_access dua ON dua.did = dataset.did
  LEFT JOIN "user" ON ...
  WHERE (dua.uid = <ME>) OR (dataset.is_public = true)

For a dataset that is both public and explicitly shared with the user, the LEFT JOIN produces one row per matching dataset_user_access row (the join condition does not filter by uid), and the is_public branch of the OR keeps every one of those rows.
The same applies to workflows.
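
To make the fan-out concrete, here is a minimal in-memory sketch using plain Scala collections rather than jOOQ. The case classes, the sample uids, and the assumption that the owner also has a row in dataset_user_access are hypothetical and not taken from the Texera schema.

```scala
// Hypothetical in-memory model; names and sample data are illustrative only.
object FanOutSketch {
  final case class Dataset(did: Int, isPublic: Boolean)
  final case class Access(did: Int, uid: Int, privilege: String)

  // LEFT JOIN on did only: one output row per access row of the dataset,
  // or a single row without access columns when the dataset has none.
  def leftJoinOnDid(
      datasets: Seq[Dataset],
      access: Seq[Access]
  ): Seq[(Dataset, Option[Access])] =
    datasets.flatMap { d =>
      val matches = access.filter(_.did == d.did)
      if (matches.isEmpty) Seq((d, None)) else matches.map(a => (d, Some(a)))
    }

  // WHERE (dua.uid = me) OR (dataset.is_public = true), then DISTINCT
  // over the selected columns (did, privilege).
  def originalQuery(
      datasets: Seq[Dataset],
      access: Seq[Access],
      me: Int
  ): Seq[(Int, Option[String])] =
    leftJoinOnDid(datasets, access)
      .filter { case (d, a) => a.exists(_.uid == me) || d.isPublic }
      .map { case (d, a) => (d.did, a.map(_.privilege)) }
      .distinct

  def main(args: Array[String]): Unit = {
    val datasets = Seq(Dataset(1, isPublic = true))
    // Assumption: uid 10 is the owner, uid 20 is the grantee browsing the hub.
    val access = Seq(Access(1, 10, "WRITE"), Access(1, 20, "READ"))
    originalQuery(datasets, access, me = 20).foreach(println)
  }
}
```

With this sample data the query returns two rows for dataset 1, one per access row, because DISTINCT cannot merge rows whose privilege values differ.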

Fix 1 — DatasetSearchQueryBuilder.constructFromClause

Move the UID filter from the WHERE clause into the JOIN's ON clause so each dataset produces at most one joined row, and force the JOIN condition to FALSE when uid == null so the SELECT still references a valid table.

  val baseJoin = DATASET
    .leftJoin(DATASET_USER_ACCESS)
    .on(DATASET_USER_ACCESS.DID.eq(DATASET.DID))
    .and(if (uid == null) DSL.falseCondition() else DATASET_USER_ACCESS.UID.eq(uid))
    .leftJoin(USER)
    .on(USER.UID.eq(DATASET.OWNER_UID))

  val condition: Condition =
    if (uid == null) {
      DATASET.IS_PUBLIC.eq(true)
    } else if (includePublic) {
      DATASET.IS_PUBLIC.eq(true).or(DATASET_USER_ACCESS.UID.isNotNull)
    } else {
      DATASET_USER_ACCESS.UID.isNotNull
    }

  baseJoin.where(condition)

Why AND FALSE for uid == null?
The SELECT references DATASET_USER_ACCESS.PRIVILEGE. Without dataset_user_access in the FROM clause, the database throws: missing FROM-clause entry for table "dataset_user_access". AND FALSE keeps the table in the FROM clause while making the JOIN yield NULL access columns, which is the correct semantics for "no explicit grant".

Behavior matrix:

uid	includePublic	Matched datasets
null	(n/a)	Public only
not null	false	Datasets with explicit access of logged-in user only
not null	true	Public + logged-in explicit access (no duplicates)
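
The matrix above can be checked against a small in-memory model of the rewritten join. This is only a sketch in plain Scala collections, not the jOOQ code from the PR: the case classes, the `search` function, and the sample data are hypothetical, and it assumes (did, uid) is unique in dataset_user_access.

```scala
// Hypothetical in-memory model of the rewritten query; not the jOOQ code itself.
object FixedJoinSketch {
  final case class Dataset(did: Int, isPublic: Boolean)
  final case class Access(did: Int, uid: Int, privilege: String)

  def search(
      datasets: Seq[Dataset],
      access: Seq[Access],
      uid: Option[Int],
      includePublic: Boolean
  ): Seq[(Int, Option[String])] =
    datasets
      .flatMap { d =>
        // ON access.did = dataset.did AND access.uid = uid.
        // uid.contains(...) is always false for None, which plays the role of AND FALSE.
        val matches = access.filter(a => a.did == d.did && uid.contains(a.uid))
        val joined: Seq[Option[Access]] =
          if (matches.isEmpty) Seq(None) else matches.map(Some(_))
        joined.map(a => (d, a))
      }
      .filter { case (d, a) =>
        // Mirrors the three branches of the WHERE condition.
        uid match {
          case None                     => d.isPublic
          case Some(_) if includePublic => d.isPublic || a.isDefined
          case Some(_)                  => a.isDefined
        }
      }
      .map { case (d, a) => (d.did, a.map(_.privilege)) }
}
```

In this model, a public dataset that is also shared with the current user yields a single row carrying that user's privilege, and a public dataset without a grant yields a single row with an empty privilege.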

Any related issues, documentation, discussions?

Refs #5957

How was this PR tested?

Tested manually with database checks and UI workflow testing.

Was this PR authored or co-authored using generative AI tooling?

No AI tools were used in the process.


github-actions · 2026-06-29T02:18:54Z

👋 Thanks for opening this pull request, @Mrudhulraj!

It looks like the pull request description doesn't quite follow our template yet:

The What changes were proposed in this PR? section is empty; please fill it in.

Filling out the template helps reviewers understand and triage your contribution faster. Please edit the description to complete it. This message will disappear automatically once the template is followed.

You can find the template prompts by editing the description, or see CONTRIBUTING.md for the full contribution flow.

github-actions · 2026-06-29T02:18:56Z

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

Contributors with relevant context: @xuang7
You can notify them by mentioning @xuang7 in a comment.

carloea2 · 2026-06-29T04:02:19Z

Add some specs to both PRs

carloea2 · 2026-06-29T04:25:55Z

Thanks, this is much more focused now.

The core fix looks right to me: filtering DATASET_USER_ACCESS by the current uid in the JOIN should prevent the public-dataset join fan-out, since dataset_user_access is keyed by (did, uid).

I have two requests:

1. Please add a null-safe fallback for accessPrivilege.
After this change, public datasets without an explicit access row for the current user will have no joined DATASET_USER_ACCESS.PRIVILEGE, so record.get(...) can be null. The frontend type expects "READ" | "WRITE" | "NONE", so I think we should return PrivilegeEnum.NONE instead of null.

Example:

  val accessPrivilege =
    Option(record.get(DATASET_USER_ACCESS.PRIVILEGE, classOf[PrivilegeEnum]))
      .getOrElse(PrivilegeEnum.NONE)

2. Please add a small spec for the duplicate case:
- dataset is public
- dataset is explicitly shared with the current user
- search with includePublic=true
- result contains exactly one dataset row, with the current user's privilege
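
For the null-safe fallback in the first request, a self-contained sketch of the `Option(...)` pattern may help when writing the spec. Everything here is a stand-in: `Privilege`, `NoAccess` (playing the role of PrivilegeEnum.NONE), and `rawPrivilege` (playing the role of record.get) are hypothetical names, not Texera or jOOQ APIs.

```scala
// Hypothetical stand-ins for PrivilegeEnum and record.get(...).
object PrivilegeFallbackSketch {
  sealed trait Privilege
  case object Read extends Privilege
  case object Write extends Privilege
  case object NoAccess extends Privilege // stands in for PrivilegeEnum.NONE

  // Returns null when the LEFT JOIN matched no access row,
  // the way a Java-style record lookup would.
  def rawPrivilege(row: Map[String, Privilege]): Privilege =
    row.get("privilege").orNull

  // Option(null) is None, so getOrElse supplies the fallback.
  def accessPrivilege(row: Map[String, Privilege]): Privilege =
    Option(rawPrivilege(row)).getOrElse(NoAccess)
}
```

A row with no joined privilege maps to `NoAccess`, while a row with a joined privilege keeps its value.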

Optional cleanup: the condition rewrite is logically okay, but the PR could be smaller if we keep the existing WHERE logic and only change the JOIN. Also, the new comments have a couple typos and “join is skipped” is slightly misleading because the join remains present but is forced to match no rows when uid == null.

Mrudhulraj and others added 2 commits June 28, 2026 19:07

fix(dashboard): Remove duplicated rows when dataset shared Publicly i…

e7bd008

…n the hub page

Merge branch 'apache:main' into fix/dataset-fetch-query-fix

a16f6ad

github-actions Bot assigned Mrudhulraj Jun 29, 2026

github-actions Bot added engine fix labels Jun 29, 2026

Mrudhulraj changed the title ~~fix(query): Remove duplicated rows when dataset/workflows shared Publicly in the hub page~~ fix(query): Remove duplicated rows when datasets shared Publicly in the hub page Jun 29, 2026

This was referenced Jun 29, 2026

fix(query): Remove duplicated rows when workflows shared Publicly in the hub page #6017

Open

fix: Remove duplicated rows when dataset/workflows shared Publicly in the hub page #5962

Closed

xuang7 added the release/v1.2 back porting to release/v1.2 label Jun 29, 2026

Mrudhulraj wants to merge 2 commits into apache:main from Mrudhulraj:fix/dataset-fetch-query-fix
