fix: don't flag content-rich pages that reference reCAPTCHA as bot-blocked #1690

Open
ABHA61 wants to merge 3 commits into main from fix/bot-blocker-recaptcha-false-positive
ABHA61 commented Jun 18, 2026

Problem

detectBotBlocker (powering the LLMO "Unlock Opportunities" / bot-blocker card via GET /sites/{siteId}/bot-blocker) reported real customer sites as "blocking Adobe LLM Optimizer from accessing public pages" when they were not.

Root cause: on a 200 OK, the generic branch flagged the page as a challenge if its HTML matched any of CHALLENGE_PATTERNS.general — including the bare /captcha/i and /recaptcha/i. Those match the ubiquitous "This site is protected by reCAPTCHA" disclosure badge / reCAPTCHA form widget, which appears on perfectly normal content pages. Result: { crawlable: false, type: 'unknown', confidence: 0.7, reason: 'Generic challenge patterns detected' }.
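The false positive described above can be reproduced in isolation. The sketch below is illustrative only: `generalPatterns` is a hypothetical two-entry subset containing just the bare patterns named in this description, and `matchesGenericChallenge` and `contentPage` are made-up names, not code from `bot-blocker-detect.js`.

```javascript
// Hypothetical subset of CHALLENGE_PATTERNS.general: only the two bare
// patterns named in the PR description. The real list is assumed to be longer.
const generalPatterns = [/captcha/i, /recaptcha/i];

// Mirrors the pre-fix behavior on a 200 OK: any match counts as a challenge.
function matchesGenericChallenge(html) {
  return generalPatterns.some((pattern) => pattern.test(html));
}

// A content-rich page whose only captcha reference is the disclosure badge.
const contentPage = `<html><body>
  <main>${'<p>Long article paragraph about flights and fares.</p>'.repeat(200)}</main>
  <footer>This site is protected by reCAPTCHA.</footer>
</body></html>`;

console.log(matchesGenericChallenge(contentPage)); // true, despite the page being ordinary content
```

A single occurrence of the word in a footer is enough to trip the ungated check, regardless of how much real content surrounds it.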

Evidence it's a false positive

Confirmed on the affected customer sites by running the shipped function against the live pages and cross-checking the content scraper's own results:

  • The live bot-blocker check returned crawlable:false @0.7 ("Generic challenge patterns detected").
  • The only matched token was the benign badge (e.g. a recaptcha-text template / "protected by reCAPTCHA" notice) on 100KB+ content homepages.
  • The Puppeteer content scraper (which executes JS and actually fetches the pages) accessed the same sites fine — scrapeForbidden:false, thousands of URLs scraped with zero forbidden.

So the heuristic disagreed with the empirical scrape: nothing was actually being blocked.

Fix

Gate the generic-pattern check on an interstitial-shape heuristic (isLikelyInterstitial): on a 200, a challenge pattern only counts as a block when the page's visible text is content-thin. Visible text is measured by character length (not word count), so the check is not biased against languages that do not delimit words with spaces (CJK, Thai, etc.). The challenge pattern is evaluated first, so the content-shape strip only runs when a pattern actually matched.
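To see why character length is the safer measure, here is a small comparison on text that has no word spaces. `measure` and the sample sentence are hypothetical and exist only for illustration; the PR's actual measuring code may differ.

```javascript
// Hypothetical helper comparing the two ways of sizing visible text.
function measure(text) {
  return {
    // Space-delimited word count: undercounts scripts without word spaces.
    words: text.split(/\s+/).filter(Boolean).length,
    // Code-point count: independent of how the script delimits words.
    chars: [...text].length,
  };
}

// A long run of Japanese prose with no whitespace between words.
const japanese = '私たちは世界中のお客様に高品質な製品とサービスを提供しています。'.repeat(50);

const { words, chars } = measure(japanese);
console.log(words); // 1: the whole body looks like a single "word"
console.log(chars > 1000); // true: by length it is clearly content-rich
```

Under a word-count threshold this page would look as thin as a challenge wall; under a character threshold it does not.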

  • ✅ Content-rich page + reCAPTCHA badge → crawlable:true
  • ✅ Thin challenge wall (reCAPTCHA / Press-and-Hold / GeeTest / Arkose / etc.) → still crawlable:false
  • ✅ CDN-typed blocks (Cloudflare/Imperva/Akamai/Fastly/CloudFront) and 403/error-path detection → unchanged
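The gating logic can be sketched roughly as follows. This is not the shipped implementation: the threshold value, the tag-stripping regexes, the `checkGeneric200` wrapper, and the two-entry pattern list are all assumptions made for illustration. Only the name `isLikelyInterstitial` and the blocked result shape come from this description.

```javascript
// ASSUMPTION: the PR does not state the real threshold; 1500 is a placeholder.
const MAX_INTERSTITIAL_TEXT_CHARS = 1500;
// ASSUMPTION: stand-in for CHALLENGE_PATTERNS.general.
const GENERIC_PATTERNS = [/captcha/i, /recaptcha/i];

// Approximates visible text: drop script/style bodies, comments, and tags,
// then collapse whitespace and count code points (not words).
function visibleTextLength(html) {
  const text = html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    .replace(/<!--[\s\S]*?-->/g, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  return [...text].length;
}

// A page is interstitial-shaped when its visible text is content-thin.
function isLikelyInterstitial(html) {
  return visibleTextLength(html) < MAX_INTERSTITIAL_TEXT_CHARS;
}

// Hypothetical wrapper for the generic 200 branch.
function checkGeneric200(html) {
  // Pattern test first, so the costlier strip runs only after a match.
  const matched = GENERIC_PATTERNS.some((pattern) => pattern.test(html));
  if (matched && isLikelyInterstitial(html)) {
    return {
      crawlable: false,
      type: 'unknown',
      confidence: 0.7,
      reason: 'Generic challenge patterns detected',
    };
  }
  return { crawlable: true }; // ASSUMPTION: the real success shape may carry more fields
}

const richPage = `<main>${'<p>Real page content here.</p>'.repeat(300)}</main><footer>protected by reCAPTCHA</footer>`;
const thinWall = '<html><body><h1>Please complete the reCAPTCHA to continue</h1></body></html>';

console.log(checkGeneric200(richPage).crawlable); // true
console.log(checkGeneric200(thinWall).crawlable); // false
```

Because `&&` short-circuits, pages that match no generic pattern never pay for the stripping pass.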

Validation

  • Ran the real detectBotBlocker against 43 onboarded domains, current vs. fixed: it flipped only the known false positives to crawlable:true and changed nothing else — high-confidence akamai @0.99 blocks and non-captcha 0.7 blocks were untouched (no over-correction, no OK→BLOCK).
  • Regression tests added: a content-rich page (including a non-Latin / CJK page) that references reCAPTCHA → crawlable:true; a thin captcha interstitial → still crawlable:false. All existing challenge tests (thin HTML) still pass.
  • npm test (lint + tests + c8 coverage gate, 100%) passes; bot-blocker-detect.js at 100%.

Notes & limitations

  • Scope: only the generic 200 branch is gated. The CDN-specific challenge patterns (Just a moment…, Access Denied…Akamai, etc.) are precise/low-FP and are intentionally left ungated.
  • Known limit: this is a non-JS GET + regex. A JS-rendered SPA whose raw homepage shell is content-thin and references reCAPTCHA could still be flagged.

🤖 Generated with Claude Code

ABHA61 and others added 2 commits June 18, 2026 15:28
…ocked

detectBotBlocker treated any 200 page whose HTML matched the generic
CHALLENGE_PATTERNS (including the bare /captcha/i and /recaptcha/i) as a
bot challenge. The "protected by reCAPTCHA" disclosure badge and reCAPTCHA
form widgets appear on normal content pages, so real customers
(westjet.com, mazdausa.com, repsol.com/.es/.pt) were reported as
"blocking Adobe LLM Optimizer" (crawlable:false, confidence 0.7) even
though the content scraper accessed them fine (scrapeForbidden:false,
thousands of URLs scraped with zero forbidden).

Gate the generic-pattern check on an interstitial-shape heuristic: a 200
body is only treated as a challenge when it is content-thin. Content-rich
pages that merely reference a captcha now return crawlable:true, while
thin challenge interstitials (reCAPTCHA, Press-and-Hold, GeeTest, Arkose,
etc.) are still detected. CDN-typed blocks (Cloudflare/Imperva/Akamai/
Fastly/CloudFront) and 403/error-path detection are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Address review feedback on the bot-blocker false-positive fix:
- Count characters, not space-delimited words, so the content-thin check
  is not biased against languages without word spaces (CJK/Thai/etc.). A
  content-rich Japanese/Korean/Chinese homepage with a reCAPTCHA badge was
  still misclassified as a thin challenge interstitial under word count.
- Evaluate the challenge pattern before the (more expensive) content-shape
  strip, so the strip only runs when a generic pattern actually matched.
- Add regression tests: a thin captcha interstitial still blocks, and a
  content-rich CJK page that references reCAPTCHA does not.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

This PR will trigger a patch release when merged.

- Collapse the duplicated inline rationale in the generic 200 branch; the
  isLikelyInterstitial helper and its JSDoc already explain the intent.
- Drop specific customer names from the test comment (keep it generic).

No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>