fix: don't flag content-rich pages that reference reCAPTCHA as bot-blocked #1690
Open
ABHA61 wants to merge 3 commits into
…ocked

detectBotBlocker treated any 200 page whose HTML matched the generic CHALLENGE_PATTERNS (including the bare /captcha/i and /recaptcha/i) as a bot challenge. The "protected by reCAPTCHA" disclosure badge and reCAPTCHA form widgets appear on normal content pages, so real customers (westjet.com, mazdausa.com, repsol.com/.es/.pt) were reported as "blocking Adobe LLM Optimizer" (crawlable:false, confidence 0.7) even though the content scraper accessed them fine (scrapeForbidden:false, thousands of URLs scraped with zero forbidden).

Gate the generic-pattern check on an interstitial-shape heuristic: a 200 body is only treated as a challenge when it is content-thin. Content-rich pages that merely reference a captcha now return crawlable:true, while thin challenge interstitials (reCAPTCHA, Press-and-Hold, GeeTest, Arkose, etc.) are still detected. CDN-typed blocks (Cloudflare/Imperva/Akamai/Fastly/CloudFront) and 403/error-path detection are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review feedback on the bot-blocker false-positive fix:

- Count characters, not space-delimited words, so the content-thin check is not biased against languages without word spaces (CJK/Thai/etc.). A content-rich Japanese/Korean/Chinese homepage with a reCAPTCHA badge was still misclassified as a thin challenge interstitial under word count.
- Evaluate the challenge pattern before the (more expensive) content-shape strip, so the strip only runs when a generic pattern actually matched.
- Add regression tests: a thin captcha interstitial still blocks, and a content-rich CJK page that references reCAPTCHA does not.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
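For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of the gating described in these commits. It is not the code in bot-blocker-detect.js. Only the name isLikelyInterstitial comes from this PR. The pattern list, the `MAX_INTERSTITIAL_CHARS` cutoff, the `visibleText` helper, and the `detectGeneric200` wrapper are all assumptions made for illustration.

```javascript
// Hypothetical stand-ins: the real CHALLENGE_PATTERNS.general and the real
// content-thin threshold are not shown in this PR text.
const GENERIC_PATTERNS = [/captcha/i, /recaptcha/i];
const MAX_INTERSTITIAL_CHARS = 1500; // assumed cutoff, for illustration only

// Approximate the visible text by dropping scripts, styles, and tags.
function visibleText(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Content-thin check by character length rather than word count, so text
// without word spaces (CJK, Thai) is not undercounted.
function isLikelyInterstitial(html) {
  return visibleText(html).length < MAX_INTERSTITIAL_CHARS;
}

// Hypothetical wrapper for the generic 200 branch.
function detectGeneric200(html) {
  // Cheap regex first; the strip in isLikelyInterstitial only runs on a match.
  const matched = GENERIC_PATTERNS.some((re) => re.test(html));
  if (matched && isLikelyInterstitial(html)) {
    return {
      crawlable: false,
      type: 'unknown',
      confidence: 0.7,
      reason: 'Generic challenge patterns detected',
    };
  }
  return { crawlable: true };
}

const thinChallenge =
  '<html><body><h1>Verify you are human</h1><div class="g-recaptcha"></div></body></html>';
const richCjkPage =
  `<html><body><p>${'これは内容の豊富なホームページです。'.repeat(300)}</p>` +
  '<footer>This site is protected by reCAPTCHA</footer></body></html>';

console.log(detectGeneric200(thinChallenge).crawlable); // false
console.log(detectGeneric200(richCjkPage).crawlable); // true
```

Splitting on spaces, the CJK paragraph above would count as a single word, which is why a word-count threshold would have treated it as thin.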
This PR will trigger a patch release when merged.
- Collapse the duplicated inline rationale in the generic 200 branch; the isLikelyInterstitial helper and its JSDoc already explain the intent.
- Drop specific customer names from the test comment (keep it generic).

No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
detectBotBlocker (powering the LLMO "Unlock Opportunities" / bot-blocker card via GET /sites/{siteId}/bot-blocker) reported real customer sites as "blocking Adobe LLM Optimizer from accessing public pages" when they were not.

Root cause: on a 200 OK, the generic branch flagged the page as a challenge if its HTML matched any of CHALLENGE_PATTERNS.general, including the bare /captcha/i and /recaptcha/i. Those match the ubiquitous "This site is protected by reCAPTCHA" disclosure badge / reCAPTCHA form widget, which appears on perfectly normal content pages. Result: { crawlable: false, type: 'unknown', confidence: 0.7, reason: 'Generic challenge patterns detected' }.

Evidence it's a false positive
Confirmed on the affected customer sites by running the shipped function against the live pages and cross-checking the content scraper's own results:

- crawlable:false @0.7 ("Generic challenge patterns detected").
- recaptcha-text template / "protected by reCAPTCHA" notice on 100KB+ content homepages.
- scrapeForbidden:false, thousands of URLs scraped with zero forbidden.

So the heuristic disagreed with the empirical scrape: nothing was actually being blocked.
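The root cause can be reproduced in isolation. The snippet below is a rough sketch of the pre-fix behavior as this description states it, not the shipped function. `GENERIC_PATTERNS` and `naiveDetect` are hypothetical names, and the page content is made up for the example.

```javascript
// Hypothetical stand-in for the bare entries in CHALLENGE_PATTERNS.general.
const GENERIC_PATTERNS = [/captcha/i, /recaptcha/i];

// Pre-fix behavior as described: any generic pattern hit on a 200 body
// is reported as a block, regardless of how much content the page has.
function naiveDetect(html) {
  const matched = GENERIC_PATTERNS.some((re) => re.test(html));
  return matched
    ? {
        crawlable: false,
        type: 'unknown',
        confidence: 0.7,
        reason: 'Generic challenge patterns detected',
      }
    : { crawlable: true };
}

// A content-rich page that only carries the disclosure badge in its footer.
const article = 'Schedules, fares, and travel advisories for this season. '.repeat(500);
const contentPage =
  `<html><body><main><p>${article}</p></main>` +
  '<footer>This site is protected by reCAPTCHA</footer></body></html>';

const result = naiveDetect(contentPage);
console.log(result.crawlable); // false, although the page is ordinary content
console.log(result.reason); // Generic challenge patterns detected
```

The single footer sentence is enough to trip the bare patterns, which matches the mismatch between the detector output and the scraper results listed above.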
Fix
Gate the generic-pattern check on an interstitial-shape heuristic (isLikelyInterstitial): on a 200, a challenge pattern only counts as a block when the page's visible text is content-thin. Visible text is measured by character length (not word count), so the check is not biased against languages that do not delimit words with spaces (CJK, Thai, etc.). The challenge pattern is evaluated first, so the content-shape strip only runs when a pattern actually matched.

- Content-rich page that merely references a captcha → crawlable:true
- Thin challenge interstitial → crawlable:false
- CDN-typed blocks and 403/error-path detection → unchanged

Validation
- Ran detectBotBlocker against 43 onboarded domains, current vs. fixed: the fix flipped only the known false positives to crawlable:true and changed nothing else. High-confidence akamai @0.99 blocks and non-captcha 0.7 blocks were untouched (no over-correction, no OK→BLOCK).
- Regression tests: a content-rich CJK page that references reCAPTCHA → crawlable:true; a thin captcha interstitial → still crawlable:false. All existing challenge tests (thin HTML) still pass.
- npm test (lint + tests + c8 coverage gate, 100%) passes; bot-blocker-detect.js at 100%.

Notes & limitations
- Only the generic 200 branch is gated. The CDN-specific challenge patterns (Just a moment…, Access Denied…Akamai, etc.) are precise/low-FP and are intentionally left ungated.
- Detection is GET + regex. A JS-rendered SPA whose raw homepage shell is content-thin and references reCAPTCHA could still be flagged.

🤖 Generated with Claude Code