[SPARK-57035][DOCS] Always target /docs/latest/ in DocSearch index#56080
[SPARK-57035][DOCS] Always target /docs/latest/ in DocSearch index#56080gengliangwang wants to merge 1 commit into
Conversation
c198a2e to
4774d18
Compare
### What changes were proposed in this pull request? Switch DocSearch to a single shared index built from https://spark.apache.org/docs/latest/, used by all release branches. - `docs/_config.yml`: rewrite the stale comment that pointed at the dead `github.com/algolia/docsearch-configs` repo. Document the new setup: the Algolia crawler at https://crawler.algolia.com/ indexes only `/docs/latest/` and tags every page with `version:latest`, so `facetFilters` stays pinned to `version:latest` on every branch. - `dev/create-release/release-tag.sh`: remove the two `sed` lines that rewrote `facetFilters` to `version:<release>` at release-cut and post-release-bump time. They are no longer needed (and stayed wrong on the last few releases, which is why the search box on https://spark.apache.org/docs/latest/ has been returning no results). ### Why are the changes needed? The legacy DocSearch v1 scheme crawled every released `/docs/<X.Y.Z>/` and assigned a `version:X.Y.Z` facet, so each release branch had to pin `facetFilters` to its own version. Since the SPARK-38122 migration to the new DocSearch infra, we no longer maintain per-version indexes. The release-script `sed` rewrites kept producing `version:<release>` filters that don't match anything in the new index, so post-release search on https://spark.apache.org/docs/latest/ returns empty results until the crawler config is manually re-pointed. Pinning to `version:latest` everywhere matches what the crawler tags and removes the manual release-time step entirely. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A - documentation config and release-script change only. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (claude-opus-4-7)
4774d18 to
4e40ad0
Compare
viirya
left a comment
There was a problem hiding this comment.
Thanks for picking this up, @gengliangwang. I went to verify the premise in the description and I think this PR needs a rework before it can land — the assumption that the Algolia index is latest-only doesn't match what the live index actually contains, and as written this change is a user-facing regression rather than a no-op. Details below.
The Algolia index still contains per-version facets. Querying the live apache_spark index with the public DocSearch key in _config.yml:
curl -s -X POST "https://rai69rxrsk-dsn.algolia.net/1/indexes/apache_spark/query" \
-H "X-Algolia-API-Key: d62f962a82bc9abb53471cb7b89da35e" \
-H "X-Algolia-Application-Id: RAI69RXRSK" \
-H "Content-Type: application/json" \
-d '{"query":"DataFrame","facets":["version"],"hitsPerPage":1}'
returns:
"facets": {
"version": {
"latest": 1075,
"4.1.2": 1075,
"4.1.1": 1075,
"4.1.0": 1075,
"4.0.0": 1066
}
}
and a follow-up query with facetFilters: ["version:4.1.2"] still returns 4.1.2-specific URLs (https://spark.apache.org/docs/4.1.2/sql-programming-guide.html#content, not /docs/latest/). So the crawler is indexing each released /docs/<X.Y.Z>/ and tagging it with the corresponding version facet — not "only /docs/latest/ tagged with version:latest" as the new comment claims.
Consistent with that, the currently published docs still ship per-version filters:
https://spark.apache.org/docs/latest/→'facetFilters': ["version:4.1.2"]https://spark.apache.org/docs/4.1.1/→'facetFilters': ["version:4.1.1"]
Why this matters for the PR:
-
The symptom in the description ("search on
/docs/latest/returns no results after a release") doesn't look like it's caused by the per-release sed inrelease-tag.sh. The index still has a populatedversion:latestfacet, andlatestcurrently points at 4.1.2 where search does work for me. It would be worth root-causing the original report (timing of thelatestsymlink flip vs. the next crawler run? a staleversion:latestsnapshot during a specific window?) before changing the script, otherwise we may be removing the wrong mechanism. -
After this PR, every new release branch and every released
/docs/<X.Y.Z>/HTML will hard-codefacetFilters: ["version:latest"]. That means searches performed from/docs/4.1.3/(or any future release) will start returning results from/docs/latest/— i.e. the search no longer stays on the user's release. That's the exact behavior SPARK-33479 set out to fix, and it contradicts the "Does this PR introduce any user-facing change? No" line in the description. -
The new comment ("indexes only
/docs/latest/", "tags every page withversion:latest", "no per-release update is required") will actively mislead the next release manager, because — going by the live index — version-specific search is still a supported and currently-working feature.
Suggested paths forward, depending on intent:
- If we want to keep release-specific search (status quo, matches what the index actually supports today): keep the two
sedlines inrelease-tag.sh, and only update the comment in_config.ymlto replace the deadalgolia/docsearch-configslink with a pointer to https://crawler.algolia.com/. The rest of this PR isn't needed. - If we genuinely want to drop release-specific search and always target
latest: this needs (a) a corresponding crawler-side config change so we stop spending crawler budget on/docs/<X.Y.Z>/, (b) an updated description that acknowledges the user-facing change ("search on/docs/<release>/will jump to/docs/latest/"), (c) a comment that says this is intentional rather than describing it as the crawler's behavior, and (d) likely cherry-picks to active release branches so theversion:X.Y.Zfilter there is also reset toversion:latestfor consistency.
Happy to help land either version once we agree on which direction is intended.
| # The DocSearch index is maintained by the Algolia crawler at https://crawler.algolia.com/. | ||
| # The crawler indexes only https://spark.apache.org/docs/latest/ and tags every page with | ||
| # `version:latest`. All release branches share this single index, so `facetFilters` stays | ||
| # pinned to `version:latest` everywhere and no per-release update is required. |
There was a problem hiding this comment.
This comment doesn't match the live index: querying apache_spark returns version facet values {latest, 4.1.0, 4.1.1, 4.1.2, 4.0.0}, and facetFilters: ["version:4.1.2"] still returns 4.1.2-specific URLs. The crawler isn't latest-only — release pages are still indexed and tagged with their version. As written this will mislead the next release manager into thinking version filters are unused; please either correct the description of the crawler's behavior, or — if the intent is to deliberately switch to latest-only — say so explicitly and link to the crawler-side change that makes it true.
There was a problem hiding this comment.
For the released pages, we will just keep using the indexes for those doc.
For new releases(4.1.3/4.2.0), we will start using the latest index only. Otherwise, release manager may forget to create new index and break the search function.
| # Set the release version in docs | ||
| sed -i".tmp1" 's/SPARK_VERSION:.*$/SPARK_VERSION: '"$RELEASE_VERSION"'/g' docs/_config.yml | ||
| sed -i".tmp2" 's/SPARK_VERSION_SHORT:.*$/SPARK_VERSION_SHORT: '"$RELEASE_VERSION"'/g' docs/_config.yml | ||
| sed -i".tmp3" "s/'facetFilters':.*$/'facetFilters': [\"version:$RELEASE_VERSION\"]/g" docs/_config.yml |
There was a problem hiding this comment.
Removing this rewrite means every future release branch (and the HTML shipped under /docs/<X.Y.Z>/) will ship facetFilters: ["version:latest"]. Combined with the live index still containing populated per-version facets, that turns release-page search into "jump to /docs/latest/" rather than staying on the user's release — a user-facing regression vs. the intent of SPARK-33479. If we do want this change, the PR description should reflect it; otherwise this sed (and the symmetric one below for R_NEXT_VERSION) should stay.
There was a problem hiding this comment.
updated PR description.
viirya
left a comment
There was a problem hiding this comment.
Thanks for the clarification and the description update, @gengliangwang — the two-phase intent ("existing released pages keep their per-version indexes; from 4.1.3 / 4.2.0 onward everything shares the latest index") is much clearer now. Before this lands, though, I'd like to push back on the direction rather than the wording, because I think pinning every future release to facetFilters: ["version:latest"] makes the on-release-page search quietly wrong in a way that's hard to recover from.
Why this worries me:
-
facetFilters: ["version:latest"]is not "no filter" — it's "only results taggedversion:latest". Whateverversion:latesthappens to point at when the user runs the query is what they get. After 4.2.0 ships and the/docs/latest/symlink flips, a user reading/docs/4.1.3/sql-programming-guide.htmland typing into its search box will get results pointing at 4.2.0 pages — silently, with no version-mismatch indicator. They click a result, land on a page where the API signature has changed, and don't realize they crossed a major boundary. "Search returns nothing" is loud and gets reported; "search returns the wrong version" is quiet and gets internalized as "the docs are bad". -
The frozen-HTML problem is one-way. Once
/docs/4.1.3/index.htmlships withfacetFilters: ["version:latest"]baked in, that HTML is immutable inspark-website. There is no recovery path if we later decide this was the wrong call — we can't go back and rewrite already-published release HTML to use a different filter. Every release we ship under this policy permanently inherits "search jumps to whatever latest is at query time". -
It moves the maintenance burden from a tracked place to an untracked place, rather than removing it. The motivation ("release manager may forget to create new index and break the search function") is a process problem. The fix being proposed isn't "make the process safer" — it's "remove the per-version contract so the process step is no longer needed". But the new model still requires the crawler to (a) keep populating
version:latestcorrectly forever, and (b) keep the existingversion:4.0.0/4.1.0/4.1.1/4.1.2facets alive for already-shipped HTML. That's at least as much ongoing crawler-side maintenance as before, except now it lives entirely outside the Spark repo, isn't reviewed, isn't version-controlled, and has no failure alarm visible to committers. Forgetting on the crawler side is just as easy as forgetting in the release script — and harder to notice. -
The original symptom hasn't actually been root-caused. The earlier version of the description attributed "search on /docs/latest/ returns no results after a release" to the per-release
sedrewrite. We established earlier in this thread that the live Algolia index does containversion:latestwith populated hits, and thatfacetFilters: ["version:latest"]does return results today — so that hypothesis doesn't fit. The updated description has wisely dropped the bug-fix framing, but that means we still don't know what actually broke search on/docs/latest/after the last release. Whatever the real cause is, this PR doesn't address it. I'd rather we diagnose the original report (crawler schedule vs.latestsymlink flip timing? a staleversion:latestsnapshot during a specific window? something on the crawler-config side?) before changing the contract for every future release.
Suggested alternatives I'd find easier to support:
-
(A) Minimal cleanup. Keep both
sedlines inrelease-tag.sh(preserve per-version search). Update only the comment in_config.ymlto replace the deadalgolia/docsearch-configslink with a pointer to https://crawler.algolia.com/. Open a separate JIRA to root-cause the post-release search outage on/docs/latest/— that's a real bug worth fixing, just not by this mechanism. -
(B) Drop per-version search explicitly. If after diagnosis we genuinely want to abandon version-scoped search, the cleaner expression of that intent is to remove the
facetFiltersline entirely (no filter → search across all indexed pages), with a comment that says so. That at least degrades gracefully ifversion:latestever stops being populated, and the "wrong version" failure mode becomes visible to the user (multi-version hits in the dropdown) rather than silent. It would still want a tracking link to the crawler-side change.
Happy to help land either direction. I just don't think shipping ["version:latest"] into every future release HTML is the right shape — it bakes a silent-failure mode into immutable artifacts, and it doesn't actually remove the maintenance burden it claims to remove.
What changes were proposed in this pull request?
dev/create-release/release-tag.shfrom rewriting'facetFilters'indocs/_config.ymlat release-cut and post-release-bump time. The line stays pinned to"version:latest"on every branch going forward.docs/_config.ymlto point at https://crawler.algolia.com/ instead of the legacy DocSearch v1 config repo.Why are the changes needed?
We are moving DocSearch to a single shared index built from https://spark.apache.org/docs/latest/, used by every release. With a shared index, all branches should pin
facetFiltersto"version:latest", so the per-release rewrite in the release script is no longer needed.The crawler-side change is being made separately on https://crawler.algolia.com/.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
N/A - documentation config and release-script change only.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-opus-4-7)