LLM HTML Extractor Symfony Bundle

A Symfony bundle for extracting structured data from HTML using LLM (Large Language Model) providers, built around a plugin architecture.

Features

- LLM-Based Extraction: Uses LLM providers (starting with Jina Reader) to extract structured data from HTML
- Type-Safe DTOs: Define extraction schemas using PHP attributes on your DTOs
- Hybrid Extraction: Combine LLM extraction with code-based extraction, using the LLM for complex fields and DomCrawler/XPath for simple structured data
- Extensible: Plugin architecture allows custom extractors for specific use cases
- Cacheable: Built-in caching support for LLM responses
- Logging: Optional logging for LLM requests/responses and cache operations
- Configurable: Flexible configuration for different LLM providers and caching strategies

Installation

composer require llm-html-extractor/symfony-bundle

Configuration

Create or update config/packages/llm_html_extractor.yaml:

llm_html_extractor:
    llm_client:
        client: jina_reader  # or any service ID for custom client
        jina_reader:
            model: 'jinaai/readerlm-v2'  # or 'jinaai/readerlm-v1.5'
            default_temperature: 0.0  # Default temperature for LLM requests (0.0 = deterministic)
            default_max_tokens: 64000  # Default max tokens for LLM responses
            http_client:
                base_uri: 'https://r.jina.ai'
                api_key: '%env(JINA_API_KEY)%'
                timeout: 600
                headers:
                    X-Custom-Header: 'value'
    cache:
        enabled: true
        ttl: 36000  # 10 hours
        pool: 'cache.app'
    logs:
        enabled: true  # default: false
        logger: 'logger'  # service ID of the logger to use

Alternatively, you can use an existing HTTP client service:

llm_html_extractor:
    llm_client:
        client: jina_reader
        jina_reader:
            model: 'jinaai/readerlm-v2'
            http_client: 'my_custom_http_client_service'  # Service ID implementing HttpClientInterface

Using a Custom LLM Client

To use your own LLM client implementation, set the client option to your service ID:

llm_html_extractor:
    llm_client:
        client: 'app.my_custom_llm_client'  # Your service ID
    cache:
        enabled: true  # Will automatically wrap your client with caching

Your custom client must implement LlmHtmlExtractor\SymfonyBundle\Client\LlmClientInterface. The bundle will validate this during container compilation and throw a clear error if the interface is not implemented.

Logging

The bundle can log the following for debugging and monitoring:

- Request/Response Logging: When logs.enabled: true, all LLM requests and responses are logged at info level
- Cache Operations: Cache hits and misses are logged when both caching and logging are enabled
- Error Logging: Failed LLM requests are logged at error level with exception details

The decorators are applied in this order, innermost first:

1. Base LLM Client (e.g., JinaReaderLlmClient)
2. LoggingLlmClient (if logs enabled): logs requests/responses
3. CacheableLlmClient (if cache enabled): logs cache hits/misses

Because the cache wraps the logger, logged requests show only the actual LLM calls (cache misses), not responses served from the cache.
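
The effect of this ordering can be illustrated with a minimal, self-contained sketch. All class and method names below (SketchLlmClient, complete(), and so on) are hypothetical stand-ins, since this README does not show the real signature of LlmClientInterface; only the wrapping order is taken from the text above.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the bundle's client interface; the real method
// names and signatures are not shown in this README.
interface SketchLlmClient
{
    public function complete(string $prompt): string;
}

final class SketchBaseClient implements SketchLlmClient
{
    public int $calls = 0;

    public function complete(string $prompt): string
    {
        $this->calls++; // stands in for a real LLM request

        return 'response for: ' . $prompt;
    }
}

final class SketchLoggingClient implements SketchLlmClient
{
    /** @var list<string> */
    public array $log = [];

    public function __construct(private SketchLlmClient $inner) {}

    public function complete(string $prompt): string
    {
        $this->log[] = 'request: ' . $prompt;
        $response = $this->inner->complete($prompt);
        $this->log[] = 'response: ' . $response;

        return $response;
    }
}

final class SketchCachingClient implements SketchLlmClient
{
    /** @var array<string, string> */
    private array $cache = [];

    public function __construct(private SketchLlmClient $inner) {}

    public function complete(string $prompt): string
    {
        $key = hash('sha256', $prompt);
        if (!array_key_exists($key, $this->cache)) {
            // Cache miss: only now does the call reach the logging layer.
            $this->cache[$key] = $this->inner->complete($prompt);
        }

        return $this->cache[$key];
    }
}

// Wrap in the documented order: base, then logging, then cache outermost.
$base = new SketchBaseClient();
$logging = new SketchLoggingClient($base);
$client = new SketchCachingClient($logging);

$client->complete('extract title');
$client->complete('extract title'); // second call is answered by the cache
```

After both calls, the base client has been invoked once and the log holds a single request/response pair, because the repeated call never passes the cache layer.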

Usage

1. Define Your Extraction DTO

use LlmHtmlExtractor\SymfonyBundle\Attribute\AsLlmExtractableProperty;

class ArticleExtractionResult
{
    public function __construct(
        #[AsLlmExtractableProperty('Extract the article title')]
        public string $title,

        #[AsLlmExtractableProperty('Extract the author name')]
        public string $author,

        #[AsLlmExtractableProperty('Extract publication date in YYYY-MM-DD format')]
        public string $publishedAt,

        #[AsLlmExtractableProperty('Extract the main article content')]
        public string $content,
    ) {}
}
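
To make the attribute-based schema idea concrete, here is a rough sketch of how instructions attached to promoted constructor properties can be read back with PHP's Reflection API. It is not the bundle's actual implementation, which this README does not show: SketchExtractable, SketchArticle, and sketchCollectInstructions() are hypothetical names used so the example runs without the bundle installed.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for AsLlmExtractableProperty so the sketch runs
// without the bundle installed.
#[Attribute(Attribute::TARGET_PROPERTY | Attribute::TARGET_PARAMETER)]
final class SketchExtractable
{
    public function __construct(public string $description) {}
}

final class SketchArticle
{
    public function __construct(
        #[SketchExtractable('Extract the article title')]
        public string $title,

        #[SketchExtractable('Extract the author name')]
        public string $author,
    ) {}
}

/**
 * Collects "property name => extraction instruction" for a DTO class.
 *
 * @param class-string $className
 * @return array<string, string>
 */
function sketchCollectInstructions(string $className): array
{
    $instructions = [];
    foreach ((new ReflectionClass($className))->getProperties() as $property) {
        // Attributes on promoted constructor parameters are also visible
        // on the generated property.
        foreach ($property->getAttributes(SketchExtractable::class) as $attribute) {
            $instructions[$property->getName()] = $attribute->newInstance()->description;
        }
    }

    return $instructions;
}

$instructions = sketchCollectInstructions(SketchArticle::class);
```

A map like this is one plausible input for building the prompt sent to the LLM, with one instruction per DTO property.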

2. Use the Extraction Handler

use LlmHtmlExtractor\SymfonyBundle\Extractor\ExtractionHandler;

class ArticleScraper
{
    public function __construct(
        private ExtractionHandler $extractionHandler,
    ) {}

    public function scrape(string $html): ArticleExtractionResult
    {
        return $this->extractionHandler->handle(
            ArticleExtractionResult::class,
            $html
        );
    }
}

3. Create Custom Extractors (Optional)

For specific extraction needs, implement the FromHtmlExtractorInterface:

use LlmHtmlExtractor\SymfonyBundle\Extractor\FromHtmlExtractorInterface;
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\DependencyInjection\Attribute\AutoconfigureTag;

// Tags the class as an extractor with priority 50.
#[AutoconfigureTag('llm_extractor.extractor', ['priority' => 50])]
class CustomPdfUrlExtractor implements FromHtmlExtractorInterface
{
    // Collects the href of every link whose URL contains ".pdf".
    public function extract(string $html, array $context = []): mixed
    {
        $crawler = new Crawler($html);

        return $crawler->filterXPath('//a[contains(@href, ".pdf")]')
            ->each(fn (Crawler $node) => $node->attr('href'));
    }

    // Limits this extractor to the pdfUrls property of ArticleExtractionResult.
    public function supports(string $className, string $propertyName): bool
    {
        return $className === ArticleExtractionResult::class
            && $propertyName === 'pdfUrls';
    }
}
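
To see what the XPath expression in this extractor selects, the following sketch runs the same query against a small HTML sample. It uses PHP's built-in DOM extension instead of DomCrawler so it runs without Composer dependencies; the function name sketchExtractPdfUrls() and the sample markup are assumptions made for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Returns the href of every link whose href contains ".pdf".
 *
 * @return list<string>
 */
function sketchExtractPdfUrls(string $html): array
{
    $document = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them,
    // since real-world HTML is rarely strictly valid.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    $xpath = new DOMXPath($document);
    foreach ($xpath->query('//a[contains(@href, ".pdf")]') as $link) {
        if ($link instanceof DOMElement) {
            $urls[] = $link->getAttribute('href');
        }
    }

    return $urls;
}

$html = <<<'HTML'
<html><body>
  <a href="/files/report.pdf">Report</a>
  <a href="/about">About</a>
  <a href="https://example.com/paper.pdf?download=1">Paper</a>
</body></html>
HTML;

$pdfUrls = sketchExtractPdfUrls($html);
```

Note that contains() matches ".pdf" anywhere in the URL, so links with query strings after the file name are included as well.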

Supported LLM Providers

Currently supported:

Jina Reader (jinaai/readerlm-v2, jinaai/readerlm-v1.5)
- Uses the OpenAI-compatible chat completions endpoint exposed by vLLM (/openai/v1/chat/completions)
- Tested with Runpod serverless deployments
- Compatible with any vLLM deployment that follows the OpenAI API standard
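
For orientation, a request body in the OpenAI chat completions shape looks roughly like the sketch below. The field names (model, messages, temperature, max_tokens) come from the OpenAI API standard, and the values mirror the configuration defaults shown earlier. The helper name sketchBuildChatRequest() and the prompt wording are hypothetical; the prompt the bundle actually sends is not shown in this README.

```php
<?php

declare(strict_types=1);

/**
 * Builds a request body in the OpenAI chat completions shape.
 *
 * @return array<string, mixed>
 */
function sketchBuildChatRequest(
    string $model,
    string $prompt,
    float $temperature,
    int $maxTokens,
): array {
    return [
        'model' => $model,
        'messages' => [
            ['role' => 'user', 'content' => $prompt],
        ],
        'temperature' => $temperature,
        'max_tokens' => $maxTokens,
    ];
}

// Hypothetical prompt; values match the defaults from the configuration section.
$body = sketchBuildChatRequest(
    'jinaai/readerlm-v2',
    "Extract the article title from this HTML:\n<h1>Hello</h1>",
    0.0,
    64000,
);

// This JSON would be POSTed to <base_uri>/openai/v1/chat/completions.
$json = json_encode($body, JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES);
```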

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

michalboryczko/llm-html-extractor-symfony-bundle
