feat: add preserve_classes/preserve_tags whitelist to PruningContentFilter (#1900) #1904

Open
hafezparast wants to merge 1 commit into unclecode:develop from hafezparast:fix/maysam-pruning-preserve-whitelist-1900
Conversation

@hafezparast
Contributor

Summary

Adds preserve_classes and preserve_tags params to PruningContentFilter so users can protect specific elements (author names, timestamps, attribution) from being pruned.

Addresses #1900

The Problem

md-fit uses density-based scoring to strip boilerplate. Short metadata elements (usernames, bylines, timestamps) score low and get removed alongside actual boilerplate, so discussion pages lose the "who said what" context.

The Fix

Two new optional params on PruningContentFilter:

PruningContentFilter(
    preserve_classes=["author", "byline", "comment-header"],
    preserve_tags=["time", "cite"],
)

Whitelisted nodes skip scoring entirely and are always kept. Both params default to empty sets, so existing behavior is unchanged.
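
To illustrate the idea of skipping scoring for whitelisted nodes, here is a minimal standalone sketch. It is not the code from this PR: the `Node` type, the `prune` function, and the text-length threshold are hypothetical stand-ins for the filter's real DOM handling and density scoring.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy DOM node (hypothetical; the real filter works on parsed HTML)."""
    tag: str
    classes: frozenset = frozenset()
    text: str = ""
    children: list = field(default_factory=list)


def prune(node, min_text_len=20, preserve_classes=(), preserve_tags=()):
    """Return a pruned copy of `node`, or None if the node is dropped.

    The "score" here is just the node's own text length, an assumed
    stand-in for density-based scoring.
    """
    keep_classes = frozenset(preserve_classes)
    keep_tags = frozenset(preserve_tags)

    # Whitelisted nodes bypass scoring and are kept with their subtree.
    if node.tag in keep_tags or node.classes & keep_classes:
        return node

    kept = []
    for child in node.children:
        result = prune(child, min_text_len, keep_classes, keep_tags)
        if result is not None:
            kept.append(result)

    # Drop a node only if it has no surviving children and too little text.
    if not kept and len(node.text) < min_text_len:
        return None
    return Node(node.tag, node.classes, node.text, kept)


comment = Node("div", frozenset({"comment"}), children=[
    Node("span", frozenset({"author"}), "alice"),
    Node("time", text="2 hours ago"),
    Node("p", text="This is the actual comment body with enough text to survive."),
])

default_result = prune(comment)
whitelisted_result = prune(
    comment, preserve_classes=["author"], preserve_tags=["time"]
)

print([c.tag for c in default_result.children])      # ['p']
print([c.tag for c in whitelisted_result.children])  # ['span', 'time', 'p']
```

With the empty default whitelist, the short author and timestamp nodes are dropped; with the whitelist, they survive regardless of their score.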

Usage

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

# Protect author/byline metadata from density-based pruning
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(
            preserve_classes=["author", "byline", "username"],
            preserve_tags=["time", "cite"],
        )
    )
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://gist.github.com/...", config=config)
        print(result.markdown.fit_markdown)  # Now includes author names

asyncio.run(main())

Test plan

  • 20 unit tests passing
  • Existing pipeline tests pass (16/16)
  • Default behavior unchanged (empty whitelist)
  • Nav/footer still removed (excluded_tags runs before pruning)
  • Nonexistent classes/tags in whitelist are harmless
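
The last three checks can be pictured with a small standalone sketch. It assumes a flattened page model and a two-stage pipeline (tag exclusion first, then pruning); `Element`, `drop_excluded`, `prune_short`, and `run` are hypothetical names for illustration, not crawl4ai APIs or the PR's actual tests.

```python
from typing import NamedTuple


class Element(NamedTuple):
    """Flattened toy element (hypothetical; nesting is not modeled)."""
    tag: str
    classes: frozenset
    text: str


def drop_excluded(elements, excluded_tags):
    # Stage 1: tag exclusion runs first and ignores the whitelist.
    return [e for e in elements if e.tag not in excluded_tags]


def prune_short(elements, min_text_len,
                preserve_classes=frozenset(), preserve_tags=frozenset()):
    # Stage 2: keep whitelisted elements, otherwise apply a toy length score.
    return [
        e for e in elements
        if e.tag in preserve_tags
        or e.classes & preserve_classes
        or len(e.text) >= min_text_len
    ]


def run(page, **whitelist):
    survivors = drop_excluded(page, {"nav", "footer"})
    return [e.tag for e in prune_short(survivors, 20, **whitelist)]


page = [
    Element("nav", frozenset({"author"}), "Home | About"),
    Element("span", frozenset({"author"}), "alice"),
    Element("p", frozenset(), "A comment body that is long enough to pass the toy threshold."),
    Element("footer", frozenset(), "Copyright"),
]

baseline = run(page)
with_whitelist = run(page, preserve_classes=frozenset({"author"}))
harmless = run(
    page,
    preserve_classes=frozenset({"no-such-class"}),
    preserve_tags=frozenset({"blink"}),
)

print(baseline)        # ['p']
print(with_whitelist)  # ['span', 'p']
print(harmless)        # ['p']
```

In this model the `nav` element carries a whitelisted class but is still removed, because exclusion happens before the whitelist is ever consulted.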

🤖 Generated with Claude Code

…lecode#1900)

PruningContentFilter's density-based scoring strips short metadata
elements (author names, timestamps, attribution) alongside actual
boilerplate. Add opt-in whitelist params so users can protect specific
CSS classes or HTML tags from pruning.

- preserve_classes: set of CSS class names to always keep
- preserve_tags: set of HTML tag names to always keep
- Whitelisted nodes skip scoring entirely (score = always keep)
- Default: empty sets — no behavior change for existing users

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>