fix: don't flag content-rich pages that reference reCAPTCHA as bot-blocked by ABHA61 · Pull Request #1690 · adobe/spacecat-shared

ABHA61 · 2026-06-18T10:01:29Z

Problem

detectBotBlocker (powering the LLMO "Unlock Opportunities" / bot-blocker card via GET /sites/{siteId}/bot-blocker) reported real customer sites as "blocking Adobe LLM Optimizer from accessing public pages" when they were not.

Root cause: on a 200 OK, the generic branch flagged the page as a challenge if its HTML matched any of CHALLENGE_PATTERNS.general — including the bare /captcha/i and /recaptcha/i. Those match the ubiquitous "This site is protected by reCAPTCHA" disclosure badge / reCAPTCHA form widget, which appears on perfectly normal content pages. Result: { crawlable: false, type: 'unknown', confidence: 0.7, reason: 'Generic challenge patterns detected' }.
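The false positive can be reproduced in isolation. This is a minimal sketch, not the shipped code: `generalPatterns` is a hypothetical stand-in that contains only the two bare patterns named above, and the page body is synthetic.

```javascript
// Hypothetical stand-in for CHALLENGE_PATTERNS.general; only these two bare
// patterns are named in the description, the real list is longer.
const generalPatterns = [/captcha/i, /recaptcha/i];

// Synthetic content-rich page (well over 100 KB) that merely carries the badge.
const article = '<p>Flight deals and travel guides.</p>'.repeat(3000);
const badge = '<div class="recaptcha-text">This site is protected by reCAPTCHA</div>';
const html = `<html><body>${article}${badge}</body></html>`;

// Pre-fix behavior as described: any pattern match on a 200 counts as a challenge.
function matchesGenericPattern(body) {
  return generalPatterns.some((pattern) => pattern.test(body));
}

const flagged = matchesGenericPattern(html);
console.log(html.length > 100000, flagged); // prints: true true
```

The badge alone is enough to trip the check, regardless of how much real content surrounds it.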

Evidence it's a false positive

Confirmed on the affected customer sites by running the shipped function against the live pages and cross-checking the content scraper's own results:

The live bot-blocker check returned crawlable:false @0.7 ("Generic challenge patterns detected").
The only matched token was the benign badge (e.g. a recaptcha-text template / "protected by reCAPTCHA" notice) on 100KB+ content homepages.
The Puppeteer content scraper (which executes JS and actually fetches the pages) accessed the same sites fine — scrapeForbidden:false, thousands of URLs scraped with zero forbidden.

So the heuristic disagreed with the empirical scrape: nothing was actually being blocked.

Fix

Gate the generic-pattern check on an interstitial-shape heuristic (isLikelyInterstitial): on a 200, a challenge pattern only counts as a block when the page's visible text is content-thin. Visible text is measured by character length (not word count), so the check is not biased against languages that do not delimit words with spaces (CJK, Thai, etc.). The challenge pattern is evaluated first, so the content-shape strip only runs when a pattern actually matched.

✅ Content-rich page + reCAPTCHA badge → crawlable:true
✅ Thin challenge wall (reCAPTCHA / Press-and-Hold / GeeTest / Arkose / etc.) → still crawlable:false
✅ CDN-typed blocks (Cloudflare/Imperva/Akamai/Fastly/CloudFront) and 403/error-path detection → unchanged
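The gating described above can be sketched roughly as follows. Only the name `isLikelyInterstitial` and the blocked-result shape come from this PR; the threshold `MIN_CONTENT_CHARS`, the helper `visibleTextLength`, the wrapper `checkGeneric200`, the tag-stripping approach, and the shape of the crawlable result are assumptions made for illustration.

```javascript
// Assumed threshold; the PR description does not state the real value.
const MIN_CONTENT_CHARS = 2000;
// Hypothetical subset of the generic challenge patterns.
const GENERIC_PATTERNS = [/captcha/i, /recaptcha/i];

// Rough visible-text measure: drop non-rendered elements and tags, then count
// non-whitespace characters (not words), so CJK/Thai text is not undercounted.
function visibleTextLength(html) {
  return html
    .replace(/<(script|style|noscript|template)\b[\s\S]*?<\/\1\s*>/gi, ' ')
    .replace(/<[^>]*>/g, ' ')
    .replace(/\s+/g, '')
    .length;
}

function isLikelyInterstitial(html) {
  return visibleTextLength(html) < MIN_CONTENT_CHARS;
}

// Generic 200 branch: the cheap pattern test runs first, and the more
// expensive strip only runs when a pattern matched (&& short-circuits).
function checkGeneric200(html) {
  const matched = GENERIC_PATTERNS.some((pattern) => pattern.test(html));
  if (matched && isLikelyInterstitial(html)) {
    return {
      crawlable: false,
      type: 'unknown',
      confidence: 0.7,
      reason: 'Generic challenge patterns detected',
    };
  }
  return { crawlable: true }; // assumed shape for the non-blocked case
}

const badge = '<template class="recaptcha-text">protected by reCAPTCHA</template>';
const richPage = `<body>${'<p>Route maps and fare rules.</p>'.repeat(500)}${badge}</body>`;
const cjkPage = `<body><p>${'こんにちは世界。'.repeat(500)}</p>${badge}</body>`;
const thinWall = '<body><h1>Verify you are human</h1><div class="g-recaptcha"></div></body>';

console.log(checkGeneric200(richPage).crawlable); // true
console.log(checkGeneric200(cjkPage).crawlable); // true
console.log(checkGeneric200(thinWall).crawlable); // false
```

The CJK sample contains no spaces at all, so a word count would see a single "word" and classify the page as thin; counting characters avoids that.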

Validation

Ran the real detectBotBlocker against 43 onboarded domains, current vs. fixed: it flipped only the known false positives to crawlable:true and changed nothing else — high-confidence akamai @0.99 blocks and non-captcha 0.7 blocks were untouched (no over-correction, no OK→BLOCK).
Regression tests added: a content-rich page (including a non-Latin / CJK page) that references reCAPTCHA → crawlable:true; a thin captcha interstitial → still crawlable:false. All existing challenge tests (thin HTML) still pass.
npm test (lint + tests + c8 coverage gate, 100%) passes; bot-blocker-detect.js at 100%.
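A current-vs-fixed comparison like the one above can be summarized with a small diff helper. This is only an illustrative sketch: `classifyFlips`, the result-map shape, and the `.example` domains are hypothetical and are not the data or tooling used for the actual validation run.

```javascript
// Compare two runs keyed by domain and bucket each domain by how its
// crawlable verdict changed between the current and the fixed detector.
function classifyFlips(before, after) {
  const flips = { blockToOk: [], okToBlock: [], unchanged: [] };
  for (const [domain, prev] of Object.entries(before)) {
    const next = after[domain];
    if (!next) continue; // domain missing from the second run; skipped here
    if (prev.crawlable === next.crawlable) {
      flips.unchanged.push(domain);
    } else if (next.crawlable) {
      flips.blockToOk.push(domain);
    } else {
      flips.okToBlock.push(domain);
    }
  }
  return flips;
}

// Made-up sample results, not the 43 onboarded domains.
const before = {
  'badge-site.example': { crawlable: false, type: 'unknown', confidence: 0.7 },
  'cdn-blocked.example': { crawlable: false, type: 'akamai', confidence: 0.99 },
  'open-site.example': { crawlable: true },
};
const after = {
  'badge-site.example': { crawlable: true },
  'cdn-blocked.example': { crawlable: false, type: 'akamai', confidence: 0.99 },
  'open-site.example': { crawlable: true },
};

const flips = classifyFlips(before, after);
console.log(flips);
```

An empty `okToBlock` bucket corresponds to the "no OK→BLOCK" claim, and `blockToOk` should contain only the known false positives.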

Notes & limitations

Scope: only the generic 200 branch is gated. The CDN-specific challenge patterns (Just a moment…, Access Denied…Akamai, etc.) are precise/low-FP and are intentionally left ungated.
Known limit: this is a non-JS GET + regex. A JS-rendered SPA whose raw homepage shell is content-thin and references reCAPTCHA could still be flagged.

🤖 Generated with Claude Code

…ocked

detectBotBlocker treated any 200 page whose HTML matched the generic CHALLENGE_PATTERNS (including the bare /captcha/i and /recaptcha/i) as a bot challenge. The "protected by reCAPTCHA" disclosure badge and reCAPTCHA form widgets appear on normal content pages, so real customers (westjet.com, mazdausa.com, repsol.com/.es/.pt) were reported as "blocking Adobe LLM Optimizer" (crawlable:false, confidence 0.7) even though the content scraper accessed them fine (scrapeForbidden:false, thousands of URLs scraped with zero forbidden).

Gate the generic-pattern check on an interstitial-shape heuristic: a 200 body is only treated as a challenge when it is content-thin. Content-rich pages that merely reference a captcha now return crawlable:true, while thin challenge interstitials (reCAPTCHA, Press-and-Hold, GeeTest, Arkose, etc.) are still detected. CDN-typed blocks (Cloudflare/Imperva/Akamai/Fastly/CloudFront) and 403/error-path detection are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Address review feedback on the bot-blocker false-positive fix:

- Count characters, not space-delimited words, so the content-thin check is not biased against languages without word spaces (CJK/Thai/etc.). A content-rich Japanese/Korean/Chinese homepage with a reCAPTCHA badge was still misclassified as a thin challenge interstitial under word count.
- Evaluate the challenge pattern before the (more expensive) content-shape strip, so the strip only runs when a generic pattern actually matched.
- Add regression tests: a thin captcha interstitial still blocks, and a content-rich CJK page that references reCAPTCHA does not.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-18T11:33:11Z

This PR will trigger a patch release when merged.

- Collapse the duplicated inline rationale in the generic 200 branch; the isLikelyInterstitial helper and its JSDoc already explain the intent.
- Drop specific customer names from the test comment (keep it generic).

No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ABHA61 and others added 2 commits June 18, 2026 15:28

ABHA61 wants to merge 3 commits into main from fix/bot-blocker-recaptcha-false-positive
