fix: Remove duplicated rows when dataset/workflows shared Publicly in the hub page#5962

Closed
Mrudhulraj wants to merge 2 commits into apache:main from Mrudhulraj:fix/dataset-workkflow-fix

@Mrudhulraj Mrudhulraj commented Jun 28, 2026

What changes were proposed in this PR?

Issue - Duplicate datasets/workflows on hub landing page / hub search

Symptom: A user creates a dataset, makes it public, and grants another user explicit access. When the grantee browses the hub, the dataset appears twice in the search results.

Root cause:

DatasetSearchQueryBuilder.constructFromClause produced this SQL:

path: amber\src\main\scala\org\apache\texera\web\resource\dashboard\DatasetSearchQueryBuilder.scala:72

  SELECT DISTINCT ...
  FROM dataset
  LEFT JOIN dataset_user_access dua ON dua.did = dataset.did
  LEFT JOIN "user" ON ...
  WHERE (dua.uid = <ME>) OR (dataset.is_public = true)

For a dataset that is both public AND explicitly shared with the user, the LEFT JOIN produces one row per dataset_user_access row for that dataset. Because the dataset is public, the OR lets every one of those rows through the WHERE clause, not only the row that belongs to the current user.
The same applies to workflows.
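
The fan-out can be reproduced without a database. The following is a minimal sketch that models the original query with plain Scala collections; it is not the jOOQ code from this PR, and the names FanOutDemo, Dataset, Access, Row, and originalSearch, as well as the sample uids 10 and 20, are made up for illustration.

```scala
// Simplified in-memory model of the original query. Not the jOOQ code from the PR.
object FanOutDemo {
  final case class Dataset(did: Int, isPublic: Boolean)
  final case class Access(did: Int, uid: Int, privilege: String)
  // One result row: dataset id plus the joined privilege (None models SQL NULL).
  final case class Row(did: Int, privilege: Option[String])

  def originalSearch(datasets: Seq[Dataset], access: Seq[Access], me: Int): Seq[Row] =
    datasets.flatMap { d =>
      // LEFT JOIN dataset_user_access ON dua.did = dataset.did (no uid filter here).
      val matches = access.filter(_.did == d.did)
      // A LEFT JOIN keeps the dataset even when no access row matches.
      val joined: Seq[Option[Access]] =
        if (matches.isEmpty) Seq(None) else matches.map(Some(_))
      joined
        // WHERE (dua.uid = me) OR (dataset.is_public = true)
        .filter(a => a.exists(_.uid == me) || d.isPublic)
        .map(a => Row(d.did, a.map(_.privilege)))
    }.distinct // SELECT DISTINCT only removes rows that are equal in every column.

  def main(args: Array[String]): Unit = {
    // Hypothetical data: dataset 1 is public; uid 20 also has an explicit WRITE grant.
    val datasets = Seq(Dataset(1, isPublic = true))
    val access = Seq(Access(1, 10, "NONE"), Access(1, 20, "WRITE"))
    originalSearch(datasets, access, me = 20).foreach(println)
  }
}
```

With this sample data the search returns two rows for dataset 1, one per access row, and the distinct step keeps both because their privilege values differ.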

Fix 1 — DatasetSearchQueryBuilder.constructFromClause

Move the uid filter from the WHERE clause into the JOIN's ON clause so that each dataset produces at most one joined row. When uid == null, force the ON condition to FALSE so that dataset_user_access stays in the FROM clause and the SELECT still references a valid table.

  val baseJoin = DATASET
    .leftJoin(DATASET_USER_ACCESS)
    .on(DATASET_USER_ACCESS.DID.eq(DATASET.DID))
    .and(if (uid == null) DSL.falseCondition() else DATASET_USER_ACCESS.UID.eq(uid))
    .leftJoin(USER)
    .on(USER.UID.eq(DATASET.OWNER_UID))

  val condition: Condition =
    if (uid == null) {
      DATASET.IS_PUBLIC.eq(true)
    } else if (includePublic) {
      DATASET.IS_PUBLIC.eq(true).or(DATASET_USER_ACCESS.UID.isNotNull)
    } else {
      DATASET_USER_ACCESS.UID.isNotNull
    }

  baseJoin.where(condition)

Why AND FALSE for uid == null?
The SELECT references DATASET_USER_ACCESS.PRIVILEGE. Without dataset_user_access in the FROM clause, the database throws: missing FROM-clause entry for table "dataset_user_access". AND FALSE keeps the table in the FROM clause while making the JOIN yield NULL access columns, which is the correct semantics for "no explicit grant".

Behavior matrix:

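uid      | includePublic | Matched datasets
---------+---------------+---------------------------------------------------------------
null     | (n/a)         | Public only
not null | false         | Only datasets the logged-in user has explicit access to
not null | true          | Public + the logged-in user's explicit access (no duplicates)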
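
The behavior matrix can be checked against a small in-memory model of the fixed query. This is a sketch under assumptions, not the jOOQ code above: FixedJoinDemo, Dataset, Access, Row, and fixedSearch are illustrative names, an anonymous visitor is modeled as uid = None instead of null, and the sample data is invented.

```scala
// Simplified in-memory model of the fixed query. Not the jOOQ code from the PR.
object FixedJoinDemo {
  final case class Dataset(did: Int, isPublic: Boolean)
  final case class Access(did: Int, uid: Int, privilege: String)
  // One result row: dataset id plus the joined privilege (None models SQL NULL).
  final case class Row(did: Int, privilege: Option[String])

  // uid = None models an anonymous visitor (uid == null in the PR).
  def fixedSearch(
      datasets: Seq[Dataset],
      access: Seq[Access],
      uid: Option[Int],
      includePublic: Boolean
  ): Seq[Row] =
    datasets.flatMap { d =>
      // ON dua.did = dataset.did AND dua.uid = uid; with no uid the ON clause is FALSE.
      // (did, uid) is the primary key, so at most one access row can match.
      val grant: Option[Access] =
        uid.flatMap(u => access.find(a => a.did == d.did && a.uid == u))
      // WHERE clause, one case per line of the behavior matrix.
      val keep = uid match {
        case None                     => d.isPublic
        case Some(_) if includePublic => d.isPublic || grant.isDefined
        case Some(_)                  => grant.isDefined
      }
      if (keep) Seq(Row(d.did, grant.map(_.privilege))) else Seq.empty
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical data: dataset 1 is public and explicitly shared with uid 20.
    val datasets = Seq(Dataset(1, isPublic = true), Dataset(2, isPublic = false))
    val access = Seq(Access(1, 10, "NONE"), Access(1, 20, "WRITE"), Access(2, 20, "READ"))
    fixedSearch(datasets, access, Some(20), includePublic = true).foreach(println)
  }
}
```

Because the uid filter is applied while joining, each dataset contributes at most one row, so no de-duplication step is needed in this model.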

Fix 2 — WorkflowSearchQueryBuilder.toEntryImpl + WorkflowSearchQueryBuilder.constructFromClause

Apply the same JOIN pattern from Fix 1 to workflows, and add null-safe getters to handle the access columns that can now be NULL:
path: amber\src\main\scala\org\apache\texera\web\resource\dashboard\WorkflowSearchQueryBuilder.scala

  val privilege: String =
    Option(record.get(WORKFLOW_USER_ACCESS.PRIVILEGE, classOf[PrivilegeEnum]))
      .map(_.toString)
      .getOrElse("NONE")

  val ownerName: String =
    Option(record.into(USER).getName).getOrElse("")

  val ownerUid: Integer =
    Option(record.into(USER).getUid).getOrElse(0)
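
The null-safe pattern relies on Scala's Option(...) turning a null reference into None. A minimal standalone sketch, assuming a hypothetical JoinedRecord case class in place of the jOOQ record (none of the names below come from the Texera code base):

```scala
// Standalone illustration of the Option(...) fallback pattern.
object NullSafeDemo {
  // Hypothetical stand-in for a joined record whose columns may be NULL.
  final case class JoinedRecord(privilege: String, ownerName: String, ownerUid: Integer)

  // Option(x) is None when x is null, so getOrElse supplies the fallback.
  def privilegeOf(r: JoinedRecord): String = Option(r.privilege).getOrElse("NONE")
  def ownerNameOf(r: JoinedRecord): String = Option(r.ownerName).getOrElse("")
  def ownerUidOf(r: JoinedRecord): Int = Option(r.ownerUid).map(_.intValue).getOrElse(0)

  def main(args: Array[String]): Unit = {
    // All columns NULL, as for a public workflow without an explicit grant.
    val noGrant = JoinedRecord(null, null, null)
    println(privilegeOf(noGrant))
    println(ownerUidOf(noGrant))
  }
}
```

The fallback values mirror the ones in the snippet above; whether "NONE", "" and 0 are the right defaults is a design decision for the reviewers.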

Any related issues, documentation, discussions?

Fixes #5957

How was this PR tested?

Tested manually with database checks and UI workflow testing.

Was this PR authored or co-authored using generative AI tooling?

No AI tools were used in the process.

@github-actions

Contributor

👋 Thanks for your first contribution to Texera, @Mrudhulraj!

If you're looking for a good place to start, browse issues labeled starter-task; they're scoped to be approachable for newcomers.

You can drive common housekeeping yourself by commenting one of these commands on its own line:

  • Issues. Comment /take to assign an open issue to yourself, or /untake to release it. You can find unclaimed work with the search filter is:issue is:open no:assignee.
  • Sub-issues. To link issues into a parent/child hierarchy, comment /sub-issue #5166 #5222 on the parent to attach those children (or /unsub-issue #5166 #5222 to detach them). From a child issue, comment /parent-issue #5166 to set its parent, or /unparent-issue to clear it (the current parent is detected automatically). References may be written as #5166 or as a bare 5166; cross-repository references are not supported.
  • Pull requests (author only). Comment /request-review @user to request a review from someone, or /unrequest-review @user to withdraw that request.

Each command must match exactly: /take this will not work, only /take does. For the full contribution flow, see CONTRIBUTING.md.

@github-actions

Contributor

👋 Thanks for opening this pull request, @Mrudhulraj!

It looks like the pull request description doesn't quite follow our template yet:

  • The What changes were proposed in this PR? section is empty; please fill it in.

Filling out the template helps reviewers understand and triage your contribution faster. Please edit the description to complete it. This message will disappear automatically once the template is followed.

You can find the template prompts by editing the description, or see CONTRIBUTING.md for the full contribution flow.

@github-actions

Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @xuang7, @aglinxinyuan
    You can notify them by mentioning @xuang7, @aglinxinyuan in a comment.

@carloea2

Contributor

Doesn't SELECT DISTINCT work here?

@Mrudhulraj Mrudhulraj changed the title from "Fix(query): Remove duplicated rows when dataset/workflows shared Publicly in the hub page" to "fix: Remove duplicated rows when dataset/workflows shared Publicly in the hub page" Jun 28, 2026
@Mrudhulraj

Author

No, SELECT DISTINCT doesn't help here when a dataset is both explicitly shared and publicly available.

The LEFT JOIN produces one row per matching dataset_user_access row, and the OR then lets every one of those rows through.
SELECT DISTINCT cannot collapse them because the privilege column from dataset_user_access differs between the rows (WRITE vs NULL).

@Mrudhulraj

Mrudhulraj commented Jun 28, 2026

Author

Let me explain a bit further with the schema, @carloea2:

Table "texera_db.dataset_user_access"

  Column   |      Type      | Nullable |        Default
-----------+----------------+----------+------------------------
 did       | integer        | not null |
 uid       | integer        | not null |
 privilege | privilege_enum | not null | 'NONE'::privilege_enum

Indexes:
"dataset_user_access_pkey" PRIMARY KEY, btree (did, uid)

Foreign-key constraints:
"dataset_user_access_did_fkey" FOREIGN KEY (did) REFERENCES dataset(did) ON DELETE CASCADE
"dataset_user_access_uid_fkey" FOREIGN KEY (uid) REFERENCES "user"(uid) ON DELETE CASCADE

Table "texera_db.dataset"
    Column    |       Type        | Nullable |               Default
--------------+-------------------+----------+--------------------------------------
 did          | integer           | not null | nextval('dataset_did_seq'::regclass)
 owner_uid    | integer           | not null |
 name         | character varying | not null |
 is_public    | boolean           | not null | true

Foreign-key constraints:
"dataset_owner_uid_fkey" FOREIGN KEY (owner_uid) REFERENCES "user"(uid) ON DELETE CASCADE

What we see is that when a dataset has is_public = true, dua (dataset_user_access) contains a row with the same did whose privilege is "NONE".

Also, when access to that dataset changes for one of the uids in dua, we modify the privilege for that uid.

With the uid set and is_public = true, the joins therefore return duplicate rows: we get the NONE rows that belong to other users (let through by is_public = true) and the row for the same dataset from the logged-in uid, which now has WRITE access.

PS:

  1. I am not sure whether the is_public attribute needs to be migrated to dua, OR
  2. we discuss the modified query I propose, OR
  3. we set the privilege to READ/WRITE for all users and disable explicit sharing of a dataset once it is shared publicly.

This applies to workflows too!

Hope this clarifies!!
cc: @chenlica

@carloea2

Contributor

Thanks for working on this.

Would you mind splitting the dataset and workflow fixes into separate PRs? They share the same root cause, but they touch different query builders and the workflow case is more complex because it also includes project access.

I suggest:

  • PR 1: dataset duplicate fix only in DatasetSearchQueryBuilder.scala
  • PR 2: workflow duplicate fix only in WorkflowSearchQueryBuilder.scala

For the first PR, please use Refs #5957 instead of Fixes #5957, so the issue stays open until the workflow side is fixed too.

Also, let’s keep the changes focused on the duplicate-row issue and avoid unrelated changes such as the ownerName / ownerUid fallback defaults unless they are mandatory.

@Mrudhulraj

Author

@carloea2 I have raised two fresh PRs, #6016 and #6017. Once they are accepted, I will close this PR. Is that fine?

@chenlica

Contributor

@Mrudhulraj Thanks. I hope @carloea2 can take the lead to review these PRs. @carloea2 After that, feel free to add a committer to review and merge them.

@xuang7

xuang7 commented Jun 29, 2026

Contributor

Closing this PR since the changes have been split into #6016 and #6017.

@xuang7 xuang7 closed this Jun 29, 2026
@Mrudhulraj Mrudhulraj deleted the fix/dataset-workkflow-fix branch June 30, 2026 04:57
Development

Successfully merging this pull request may close these issues.

Join Fan-out issue: Publicly shared dataset/workflows rows duplicated in the hub. (Has RCA and suggested fix)

4 participants