
fix(query): Remove duplicated rows when datasets shared Publicly in the hub page #6016

Open
Mrudhulraj wants to merge 2 commits into apache:main from Mrudhulraj:fix/dataset-fetch-query-fix

Conversation

@Mrudhulraj Mrudhulraj commented Jun 29, 2026


What changes were proposed in this PR?

Issue - Duplicate datasets on hub landing page / hub search

Symptom: A user creates a dataset, makes it public, and grants another user explicit access. When the grantee browses the hub, the dataset appears twice in the search results.

Root cause:

DatasetSearchQueryBuilder.constructFromClause produced this SQL:

path: amber\src\main\scala\org\apache\texera\web\resource\dashboard\DatasetSearchQueryBuilder.scala:72

  SELECT DISTINCT ...
  FROM dataset
  LEFT JOIN dataset_user_access dua ON dua.did = dataset.did
  LEFT JOIN "user" ON ...
  WHERE (dua.uid = <ME>) OR (dataset.is_public = true)

For a dataset that is both public AND explicitly shared with the user, the LEFT JOIN produces one row per matching dataset_user_access row, and every one of those rows passes the WHERE because the is_public branch of the OR is true regardless of dua.uid.
The same applies to workflows.
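
The fan-out can be reproduced without a database. The sketch below is only an in-memory model of the query above, not the actual builder code: the case classes, the sample rows, and the assumption that both the owner (uid 10) and the grantee (uid 20) hold an access row are all hypothetical.

```scala
// In-memory model of the original query. All rows below are hypothetical.
object FanOutDemo {
  final case class Dataset(did: Int, isPublic: Boolean)
  final case class Access(did: Int, uid: Int, privilege: String)
  final case class ResultRow(did: Int, privilege: Option[String])

  val datasets: Seq[Dataset] = Seq(Dataset(did = 1, isPublic = true))

  // Assumption: the owner (uid 10) and the grantee (uid 20) each hold an access row.
  val access: Seq[Access] = Seq(Access(1, 10, "WRITE"), Access(1, 20, "READ"))

  // LEFT JOIN dataset_user_access dua ON dua.did = dataset.did
  def leftJoinOnDid: Seq[(Dataset, Option[Access])] =
    datasets.flatMap { d =>
      val matches = access.filter(_.did == d.did)
      if (matches.isEmpty) Seq((d, Option.empty[Access]))
      else matches.map(a => (d, Option(a)))
    }

  // WHERE (dua.uid = me) OR (dataset.is_public = true), then SELECT DISTINCT
  def originalSearch(me: Int): Seq[ResultRow] =
    leftJoinOnDid
      .filter { case (d, a) => a.exists(_.uid == me) || d.isPublic }
      .map { case (d, a) => ResultRow(d.did, a.map(_.privilege)) }
      .distinct
}
```

In this model, `FanOutDemo.originalSearch(20)` returns two rows for dataset 1. DISTINCT does not collapse them because the selected privilege column differs between the two joined rows.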

Fix 1 — DatasetSearchQueryBuilder.constructFromClause

Move the UID filter from the WHERE clause into the JOIN's ON clause so each dataset produces at most one joined row, and force the JOIN condition to FALSE when uid == null so the SELECT still references a valid table.

  val baseJoin = DATASET
    .leftJoin(DATASET_USER_ACCESS)
    .on(DATASET_USER_ACCESS.DID.eq(DATASET.DID))
    .and(if (uid == null) DSL.falseCondition() else DATASET_USER_ACCESS.UID.eq(uid))
    .leftJoin(USER)
    .on(USER.UID.eq(DATASET.OWNER_UID))

  val condition: Condition =
    if (uid == null) {
      DATASET.IS_PUBLIC.eq(true)
    } else if (includePublic) {
      DATASET.IS_PUBLIC.eq(true).or(DATASET_USER_ACCESS.UID.isNotNull)
    } else {
      DATASET_USER_ACCESS.UID.isNotNull
    }

  baseJoin.where(condition)

Why AND FALSE for uid == null?
The SELECT references DATASET_USER_ACCESS.PRIVILEGE. Without dataset_user_access in the FROM clause, the database throws: missing FROM-clause entry for table "dataset_user_access". AND FALSE keeps the table in the FROM clause while making the JOIN yield NULL access columns, which is the correct semantics for "no explicit grant".
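
As a rough illustration of the ON-clause behavior, here is a minimal in-memory sketch. It is not the jOOQ code: the names `OnClauseDemo` and `joinForUser` and the sample rows are hypothetical, and `uid = None` stands in for the `uid == null` branch that uses `DSL.falseCondition()`.

```scala
// In-memory model of the fixed JOIN. Names and sample rows are hypothetical.
object OnClauseDemo {
  final case class Dataset(did: Int, isPublic: Boolean)
  final case class Access(did: Int, uid: Int, privilege: String)

  val sampleDatasets: Seq[Dataset] = Seq(Dataset(did = 1, isPublic = true))
  val sampleAccess: Seq[Access] = Seq(Access(1, 10, "WRITE"), Access(1, 20, "READ"))

  // ON dua.did = dataset.did AND <uid condition>.
  // uid = None plays the role of falseCondition(): no access row can match.
  def joinForUser(
      datasets: Seq[Dataset],
      access: Seq[Access],
      uid: Option[Int]
  ): Seq[(Dataset, Option[String])] =
    datasets.flatMap { d =>
      val matches = access.filter(a => a.did == d.did && uid.contains(a.uid))
      if (matches.isEmpty) Seq((d, Option.empty[String])) // privilege column is NULL
      else matches.map(a => (d, Option(a.privilege)))
    }
}
```

With the sample rows, an anonymous call (`uid = None`) still yields one row per dataset with an empty privilege, while `uid = Some(20)` yields exactly one row carrying that user's privilege.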

Behavior matrix:

| uid | includePublic | Matched datasets |
|---|---|---|
| null | (n/a) | Public only |
| not null | false | Only datasets the logged-in user has explicit access to |
| not null | true | Public + the logged-in user's explicit access (no duplicates) |
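
The matrix can also be read as a single predicate over one dataset. The following is a sketch under assumptions, not code from this PR: `isMatched` is a hypothetical helper, `uid = None` models `uid == null`, and `hasGrant` means a dataset_user_access row exists for the (did, uid) pair.

```scala
// Hypothetical predicate mirroring the behavior matrix; not part of the PR.
object VisibilityMatrix {
  def isMatched(
      uid: Option[Int],       // None models uid == null
      includePublic: Boolean, // ignored when uid is None
      isPublic: Boolean,
      hasGrant: Boolean       // an access row exists for (did, uid)
  ): Boolean =
    uid match {
      case None                     => isPublic
      case Some(_) if includePublic => isPublic || hasGrant
      case Some(_)                  => hasGrant
    }
}
```

Because the predicate is evaluated once per dataset, a dataset that is both public and explicitly shared satisfies it once rather than once per branch.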

Any related issues, documentation, discussions?

Refs #5957

How was this PR tested?

Tested manually with database checks and UI workflow testing.

Was this PR authored or co-authored using generative AI tooling?

No AI tools were used in the process.

@github-actions

Contributor

👋 Thanks for opening this pull request, @Mrudhulraj!

It looks like the pull request description doesn't quite follow our template yet:

  • The What changes were proposed in this PR? section is empty; please fill it in.

Filling out the template helps reviewers understand and triage your contribution faster. Please edit the description to complete it. This message will disappear automatically once the template is followed.

You can find the template prompts by editing the description, or see CONTRIBUTING.md for the full contribution flow.

@github-actions

Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @xuang7
    You can notify them by mentioning @xuang7 in a comment.

@Mrudhulraj Mrudhulraj changed the title from "fix(query): Remove duplicated rows when dataset/workflows shared Publicly in the hub page" to "fix(query): Remove duplicated rows when datasets shared Publicly in the hub page" Jun 29, 2026
@carloea2

Contributor

Add some specs to both PRs

@carloea2

Contributor

Thanks, this is much more focused now.

The core fix looks right to me: filtering DATASET_USER_ACCESS by the current uid in the JOIN should prevent the public-dataset join fan-out, since dataset_user_access is keyed by (did, uid).

I have two requests:

  1. Please add a null-safe fallback for accessPrivilege.
    After this change, public datasets without an explicit access row for the current user will have no joined DATASET_USER_ACCESS.PRIVILEGE, so record.get(...) can be null. The frontend type expects "READ" | "WRITE" | "NONE", so I think we should return PrivilegeEnum.NONE instead of null.

    Example:

    val accessPrivilege =
      Option(record.get(DATASET_USER_ACCESS.PRIVILEGE, classOf[PrivilegeEnum]))
        .getOrElse(PrivilegeEnum.NONE)

  2. Please add a small spec for the duplicate case:

    • dataset is public
    • dataset is explicitly shared with the current user
    • search with includePublic=true
    • result contains exactly one dataset row, with the current user's privilege
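
To make both requests concrete, here is a rough sketch of what such a spec could check. It runs against an in-memory model rather than the real query builder or a test database; `searchIncludingPublic`, the `Privilege` objects, and the sample rows are hypothetical stand-ins, and it assumes (did, uid) is unique in dataset_user_access.

```scala
// In-memory sketch of the duplicate-case spec. All names here are hypothetical.
object DuplicateCaseSpecSketch {
  sealed trait Privilege
  case object READ extends Privilege
  case object WRITE extends Privilege
  case object NONE extends Privilege

  final case class Dataset(did: Int, isPublic: Boolean)
  final case class Access(did: Int, uid: Int, privilege: Privilege)
  final case class ResultRow(did: Int, accessPrivilege: Privilege)

  // Models the fixed query for a logged-in user with includePublic = true.
  def searchIncludingPublic(
      datasets: Seq[Dataset],
      access: Seq[Access],
      uid: Int
  ): Seq[ResultRow] =
    datasets.flatMap { d =>
      // JOIN filtered by (did, uid): at most one row, assuming that pair is unique.
      val grant: Option[Access] = access.find(a => a.did == d.did && a.uid == uid)
      val visible = d.isPublic || grant.isDefined
      // Null-safe fallback: no joined row means NONE, never null.
      if (visible) Seq(ResultRow(d.did, grant.map(_.privilege).getOrElse(NONE)))
      else Seq.empty
    }

  def run(): Unit = {
    // Dataset 1: public and shared with uid 20. Dataset 2: public, no grant.
    val datasets = Seq(Dataset(1, isPublic = true), Dataset(2, isPublic = true))
    val access = Seq(Access(1, 10, WRITE), Access(1, 20, READ))
    val rows = searchIncludingPublic(datasets, access, uid = 20)

    assert(rows.count(_.did == 1) == 1) // exactly one row
    assert(rows.find(_.did == 1).map(_.accessPrivilege).contains(READ))
    assert(rows.find(_.did == 2).map(_.accessPrivilege).contains(NONE))
  }
}
```

A real spec would presumably seed the same three facts (public dataset, explicit grant, includePublic=true) into the test database and assert on the search resource's response instead.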

Optional cleanup: the condition rewrite is logically okay, but the PR could be smaller if we keep the existing WHERE logic and only change the JOIN. Also, the new comments have a couple typos and “join is skipped” is slightly misleading because the join remains present but is forced to match no rows when uid == null.

@xuang7 xuang7 added the release/v1.2 back porting to release/v1.2 label Jun 29, 2026
Labels

engine, fix, release/v1.2 (back porting to release/v1.2)
