fix: Remove duplicated rows when dataset/workflows shared Publicly in the hub page by Mrudhulraj · Pull Request #5962 · apache/texera

Mrudhulraj · 2026-06-28T02:28:28Z

What changes were proposed in this PR?

Issue - Duplicate datasets/workflows on hub landing page / hub search

Symptom: A user creates a dataset, makes it public, and grants another user explicit access. When the grantee browses the hub, the dataset appears twice in the search results.

Root cause:

DatasetSearchQueryBuilder.constructFromClause produced this SQL:

path: amber\src\main\scala\org\apache\texera\web\resource\dashboard\DatasetSearchQueryBuilder.scala:72

  SELECT DISTINCT ...
  FROM dataset
  LEFT JOIN dataset_user_access dua ON dua.did = dataset.did
  LEFT JOIN "user" ON ...
  WHERE (dua.uid = <ME>) OR (dataset.is_public = true)

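For a dataset that is both public AND explicitly shared with the user, the LEFT JOIN produces one row per dataset_user_access row for that dataset, including rows that belong to other users. Because dataset.is_public = true already satisfies the WHERE clause, every one of those rows is kept, so the dataset appears more than once.
The same applies to workflows.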
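
The fan-out described above can be reproduced with a small in-memory model. The sketch below is not the jOOQ code from the PR: the case classes `Dataset` and `Access`, the helper `originalRows`, and any sample rows are hypothetical, and it stops before DISTINCT is applied.

```scala
object DuplicateRowsSketch {
  // Hypothetical, simplified stand-ins for dataset and dataset_user_access.
  final case class Dataset(did: Int, isPublic: Boolean)
  final case class Access(did: Int, uid: Int, privilege: String)

  // Rows surviving the original FROM/WHERE, as (did, privilege) pairs.
  // Models: LEFT JOIN dataset_user_access dua ON dua.did = dataset.did
  //         WHERE dua.uid = me OR dataset.is_public = true
  def originalRows(
      datasets: Seq[Dataset],
      access: Seq[Access],
      me: Int
  ): Seq[(Int, Option[String])] = {
    val joined: Seq[(Dataset, Option[Access])] = datasets.flatMap { d =>
      val matches = access.filter(_.did == d.did)
      // LEFT JOIN keeps the dataset with a NULL access side when nothing matches.
      if (matches.isEmpty) Seq((d, Option.empty[Access]))
      else matches.map(a => (d, Option(a)))
    }
    joined
      .filter { case (d, a) => a.exists(_.uid == me) || d.isPublic }
      .map { case (d, a) => (d.did, a.map(_.privilege)) }
  }
}
```

For a public dataset with access rows for two different users, `originalRows` returns two rows for that dataset regardless of which user is `me`, because the public branch of the OR accepts both joined rows.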

Fix 1 — DatasetSearchQueryBuilder.constructFromClause

Move the UID filter from the WHERE clause into the JOIN's ON clause so each dataset produces at most one joined row, and add a FALSE condition to the ON clause when uid == null so the SELECT still references a valid table.

  val baseJoin = DATASET
    .leftJoin(DATASET_USER_ACCESS)
    .on(DATASET_USER_ACCESS.DID.eq(DATASET.DID))
    .and(if (uid == null) DSL.falseCondition() else DATASET_USER_ACCESS.UID.eq(uid))
    .leftJoin(USER)
    .on(USER.UID.eq(DATASET.OWNER_UID))

  val condition: Condition =
    if (uid == null) {
      DATASET.IS_PUBLIC.eq(true)
    } else if (includePublic) {
      DATASET.IS_PUBLIC.eq(true).or(DATASET_USER_ACCESS.UID.isNotNull)
    } else {
      DATASET_USER_ACCESS.UID.isNotNull
    }

  baseJoin.where(condition)

Why AND FALSE for uid == null?
The SELECT references DATASET_USER_ACCESS.PRIVILEGE. Without dataset_user_access in the FROM clause, the database throws: missing FROM-clause entry for table "dataset_user_access". AND FALSE keeps the table in the FROM clause while making the JOIN yield NULL access columns, which is the correct semantics for "no explicit grant".

Behavior matrix:

uid	includePublic	Matched datasets
null	(n/a)	Public only
not null	false	Datasets with explicit access of logged-in user only
not null	true	Public + logged-in explicit access (no duplicates)
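
The behavior matrix can be checked against a small in-memory model of the revised join. This is a sketch under assumptions, not the PR's jOOQ code: `Dataset`, `Access`, and `fixedRows` are hypothetical names, and `uid` is an `Option[Int]` here, with `None` standing in for `uid == null`.

```scala
object FixedJoinSketch {
  // Hypothetical, simplified stand-ins for dataset and dataset_user_access.
  final case class Dataset(did: Int, isPublic: Boolean)
  final case class Access(did: Int, uid: Int, privilege: String)

  // Returns (did, privilege) pairs; a None privilege models a NULL access column.
  def fixedRows(
      datasets: Seq[Dataset],
      access: Seq[Access],
      uid: Option[Int],
      includePublic: Boolean
  ): Seq[(Int, Option[String])] = {
    val joined: Seq[(Dataset, Option[Access])] = datasets.map { d =>
      // ON dua.did = dataset.did AND dua.uid = <uid>, or AND FALSE when uid is null.
      // With a (did, uid) primary key, at most one access row can match.
      val mine = uid.flatMap(u => access.find(a => a.did == d.did && a.uid == u))
      (d, mine)
    }
    joined
      .filter { case (d, a) =>
        if (uid.isEmpty) d.isPublic
        else if (includePublic) d.isPublic || a.isDefined
        else a.isDefined
      }
      .map { case (d, a) => (d.did, a.map(_.privilege)) }
  }
}
```

Because the join side is built with `map` rather than `flatMap`, each dataset contributes exactly one joined row, so no combination of `uid` and `includePublic` can return the same `did` twice.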

Fix 2 — WorkflowSearchQueryBuilder.toEntryImpl + WorkflowSearchQueryBuilder.constructFromClause

Apply the same JOIN pattern from Fix 1 to workflows, and add null-safe getters to handle the access columns, which can now be NULL:
path: amber\src\main\scala\org\apache\texera\web\resource\dashboard\WorkflowSearchQueryBuilder.scala

  val privilege: String =
    Option(record.get(WORKFLOW_USER_ACCESS.PRIVILEGE, classOf[PrivilegeEnum]))
      .map(_.toString)
      .getOrElse("NONE")

  val ownerName: String =
    Option(record.into(USER).getName).getOrElse("")

  val ownerUid: Integer =
    Option(record.into(USER).getUid).getOrElse(0)
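
The null-safe getters rely on `Option(x)` turning a null reference into `None`. A minimal sketch of that behavior, using a hypothetical `JoinedRow` case class in place of the jOOQ record (the field names are assumptions for illustration only):

```scala
object NullSafeGetterSketch {
  // Hypothetical stand-in for a joined record whose columns may be NULL.
  final case class JoinedRow(privilege: String, ownerName: String, ownerUid: Integer)

  // Option(null) is None, so getOrElse supplies the fallback value.
  def privilegeOf(r: JoinedRow): String =
    Option(r.privilege).getOrElse("NONE")

  def ownerNameOf(r: JoinedRow): String =
    Option(r.ownerName).getOrElse("")

  def ownerUidOf(r: JoinedRow): Integer =
    Option(r.ownerUid).getOrElse(Integer.valueOf(0))
}
```

A row whose three fields are all null yields "NONE", an empty string, and 0; non-null values pass through unchanged.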

Any related issues, documentation, discussions?

Fixes #5957

How was this PR tested?

Tested manually with database checks and UI workflow testing.

Was this PR authored or co-authored using generative AI tooling?

No AI tools were used in the process.

github-actions · 2026-06-28T02:28:39Z

👋 Thanks for your first contribution to Texera, @Mrudhulraj!

If you're looking for a good place to start, browse issues labeled starter-task; they're scoped to be approachable for newcomers.

You can drive common housekeeping yourself by commenting one of these commands on its own line:

Issues. Comment /take to assign an open issue to yourself, or /untake to release it. You can find unclaimed work with the search filter is:issue is:open no:assignee.
Sub-issues. To link issues into a parent/child hierarchy, comment /sub-issue #5166 #5222 on the parent to attach those children (or /unsub-issue #5166 #5222 to detach them). From a child issue, comment /parent-issue #5166 to set its parent, or /unparent-issue to clear it (the current parent is detected automatically). References may be written as #5166 or as a bare 5166; cross-repository references are not supported.
Pull requests (author only). Comment /request-review @user to request a review from someone, or /unrequest-review @user to withdraw that request.

Each command must match exactly: /take this will not work, only /take does. For the full contribution flow, see CONTRIBUTING.md.

github-actions · 2026-06-28T02:28:40Z

👋 Thanks for opening this pull request, @Mrudhulraj!

It looks like the pull request description doesn't quite follow our template yet:

The What changes were proposed in this PR? section is empty; please fill it in.

Filling out the template helps reviewers understand and triage your contribution faster. Please edit the description to complete it. This message will disappear automatically once the template is followed.

You can find the template prompts by editing the description, or see CONTRIBUTING.md for the full contribution flow.

github-actions · 2026-06-28T02:28:45Z

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

Contributors with relevant context: @xuang7, @aglinxinyuan
You can notify them by mentioning @xuang7, @aglinxinyuan in a comment.

carloea2 · 2026-06-28T02:38:55Z

Doesn't SELECT DISTINCT work?

Mrudhulraj · 2026-06-28T17:51:03Z

No, SELECT DISTINCT wouldn't work here when the dataset is both explicitly shared and publicly available.

LEFT JOIN would produce one row per matching dataset_user_access row, then OR would make both branches true.
Applying SELECT DISTINCT would not work here as the privilege column in dataset_user_access will differ (WRITE vs NULL).
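
To illustrate the point, here is a minimal sketch in plain Scala collections rather than SQL; the row shape and the helper names are made up for the example. DISTINCT, like `distinct` on tuples, removes a row only when every selected column matches.

```scala
object DistinctSketch {
  // (did, name, privilege); a None privilege models NULL. Shape is hypothetical.
  type Row = (Int, String, Option[String])

  // Mirrors SELECT DISTINCT: rows collapse only if all columns are equal.
  def selectDistinct(rows: Seq[Row]): Seq[Row] = rows.distinct

  // For contrast: the number of datasets the rows actually describe.
  def datasetCount(rows: Seq[Row]): Int = rows.map(_._1).distinct.size
}
```

Two rows for the same did that differ only in privilege (WRITE vs NULL) both survive `selectDistinct`, even though `datasetCount` reports a single dataset.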

Mrudhulraj · 2026-06-28T18:34:35Z

Let me explain a bit further with the schema, @carloea2:

Table "texera_db.dataset_user_access"

Indexes:
"dataset_user_access_pkey" PRIMARY KEY, btree (did, uid)

Foreign-key constraints:
"dataset_user_access_did_fkey" FOREIGN KEY (did) REFERENCES dataset(did) ON DELETE CASCADE
"dataset_user_access_uid_fkey" FOREIGN KEY (uid) REFERENCES "user"(uid) ON DELETE CASCADE

Table "texera_db.dataset"

Foreign-key constraints:
"dataset_owner_uid_fkey" FOREIGN KEY (owner_uid) REFERENCES "user"(uid) ON DELETE CASCADE

What we see is that when a dataset has is_public = true, there is one row with the same did in dua (dataset_user_access) whose privilege is "NONE".

Also, when access to that dataset changes for one of the uids in dua, we modify the privilege for that uid.

After applying the joins with the uid set and is_public = true, we end up with duplicate rows: one with NONE (via is_public = true) that belongs to other users, and one for the same dataset from the given uid, which now has WRITE access.

PS: I see three possible ways forward:

Migrate the is_public attribute to dua (I am not sure whether this is needed), or
Discuss the modified query I propose, or
Set the privilege to either READ or WRITE for all users and disable explicit sharing of a dataset once it is shared publicly.

This applies to workflows too!

Hope this clarifies!!
cc: @chenlica

carloea2 · 2026-06-28T21:04:47Z

Thanks for working on this.

Would you mind splitting the dataset and workflow fixes into separate PRs? They share the same root cause, but they touch different query builders and the workflow case is more complex because it also includes project access.

I suggest:

PR 1: dataset duplicate fix only in DatasetSearchQueryBuilder.scala
PR 2: workflow duplicate fix only in WorkflowSearchQueryBuilder.scala

For the first PR, please use Refs #5957 instead of Fixes #5957, so the issue stays open until the workflow side is fixed too.

Also, let’s keep the changes focused on the duplicate-row issue and avoid unrelated changes such as the ownerName / ownerUid fallback defaults unless they are mandatory.

Mrudhulraj · 2026-06-29T02:53:46Z

@carloea2 I have raised two fresh PRs, #6016 and #6017. Once they are accepted, I will close this PR. Is that fine?

chenlica · 2026-06-29T06:12:01Z

@Mrudhulraj Thanks. I hope @carloea2 can take the lead to review these PRs. @carloea2 After that, feel free to add a committer to review and merge them.

xuang7 · 2026-06-29T18:08:28Z

Closing this PR since the changes have been split into #6016 and #6017.

Fix: Remove duplicated rows when dataset/workflows shared Publicly

363e4a1

github-actions Bot assigned Mrudhulraj Jun 28, 2026

github-actions Bot added engine fix labels Jun 28, 2026

Mrudhulraj changed the title ~~Fix(query): Remove duplicated rows when dataset/workflows shared Publicly in the hub page~~ fix: Remove duplicated rows when dataset/workflows shared Publicly in the hub page Jun 28, 2026

Merge branch 'apache:main' into fix/dataset-workkflow-fix

d853040

xuang7 closed this Jun 29, 2026

Mrudhulraj deleted the fix/dataset-workkflow-fix branch June 30, 2026 04:57

Mrudhulraj wants to merge 2 commits into apache:main from Mrudhulraj:fix/dataset-workkflow-fix
