feat: add preserve_classes/preserve_tags whitelist to PruningContentFilter (#1900) by hafezparast · Pull Request #1904 · unclecode/crawl4ai

hafezparast · 2026-04-07T01:00:34Z

Summary

Adds preserve_classes and preserve_tags params to PruningContentFilter so users can protect specific elements (author names, timestamps, attribution) from being pruned.

Addresses #1900

The Problem

md-fit uses density-based scoring to strip boilerplate. Short metadata elements (usernames, bylines, timestamps) score low and get removed alongside actual boilerplate, so discussion pages lose "who said what".

The Fix

Two new optional params on PruningContentFilter:

PruningContentFilter(
    preserve_classes=["author", "byline", "comment-header"],
    preserve_tags=["time", "cite"],
)

Whitelisted nodes skip scoring entirely and are always kept. Both params default to empty sets, so existing behavior is unchanged.
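To illustrate the idea (not the actual crawl4ai implementation), here is a minimal stdlib-only sketch of a whitelist check that runs before scoring. The `Node` class, `toy_score`, `prune`, and the 0.5 threshold are all hypothetical stand-ins; the real filter works on parsed HTML and uses its own density metrics.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML element."""
    tag: str
    classes: tuple = ()
    text: str = ""


def toy_score(node):
    # Assumed scoring rule for illustration only: longer text scores higher,
    # so short bylines and timestamps land near zero.
    return min(len(node.text) / 100.0, 1.0)


def prune(nodes, threshold=0.5, preserve_classes=(), preserve_tags=()):
    """Keep nodes that are whitelisted or that score at least `threshold`."""
    classes = set(preserve_classes)
    tags = {t.lower() for t in preserve_tags}
    kept = []
    for node in nodes:
        # Whitelisted nodes bypass scoring and are always kept.
        if node.tag.lower() in tags or classes & set(node.classes):
            kept.append(node)
            continue
        if toy_score(node) >= threshold:
            kept.append(node)
    return kept


page = [
    Node("span", ("author",), "alice"),
    Node("time", (), "2 hours ago"),
    Node("p", (), "word " * 30),
    Node("div", ("share-bar",), "Share"),
]

# Empty whitelist: only the long paragraph survives.
default_tags = [n.tag for n in prune(page)]
print(default_tags)  # ['p']

# With a whitelist: the byline and timestamp are kept, the share bar is not.
kept_tags = [n.tag for n in prune(page, preserve_classes=["author"], preserve_tags=["time"])]
print(kept_tags)  # ['span', 'time', 'p']
```

In this sketch a whitelist entry that matches nothing has no effect, which mirrors the "nonexistent classes/tags are harmless" property the PR claims.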

Usage

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

# Protect byline-style elements from pruning (params added by this PR)
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(
            preserve_classes=["author", "byline", "username"],
            preserve_tags=["time", "cite"],
        )
    )
)

async def main():
    # `async with` must run inside a coroutine, so wrap the crawl in main()
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://gist.github.com/...", config=config)
        print(result.markdown.fit_markdown)  # Now includes author names

asyncio.run(main())

Test plan

20 unit tests passing
Existing pipeline tests pass (16/16)
Default behavior unchanged (empty whitelist)
Nav/footer still removed (excluded_tags runs before pruning)
Nonexistent classes/tags in whitelist are harmless

🤖 Generated with Claude Code

…lecode#1900)

PruningContentFilter's density-based scoring strips short metadata elements (author names, timestamps, attribution) alongside actual boilerplate. Add opt-in whitelist params so users can protect specific CSS classes or HTML tags from pruning.

- preserve_classes: set of CSS class names to always keep
- preserve_tags: set of HTML tag names to always keep
- Whitelisted nodes skip scoring entirely (score = always keep)
- Default: empty sets, so no behavior change for existing users

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hafezparast mentioned this pull request Apr 7, 2026

[Bug]: md-fit strips meaningful content metadata (usernames, attribution) not just boilerplate #1900

Open

hafezparast wants to merge 1 commit into unclecode:develop from hafezparast:fix/maysam-pruning-preserve-whitelist-1900
