A powerful Symfony bundle for extracting structured data from HTML using LLM (Large Language Model) providers with a plugin architecture.
- LLM-Based Extraction: Uses LLM providers (starting with Jina Reader) to extract structured data from HTML
- Type-Safe DTOs: Define extraction schemas using PHP attributes on your DTOs
- Hybrid Extraction: Easily combine LLM extraction with code-based extraction - use AI for complex fields and DomCrawler/XPath for simple structured data
- Extensible: Plugin architecture allows custom extractors for specific use cases
- Cacheable: Built-in caching support for LLM responses
- Logging: Optional logging for LLM requests/responses and cache operations
- Configurable: Flexible configuration for different LLM providers and caching strategies
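To illustrate the code-based half of hybrid extraction, the following sketch pulls PDF links out of HTML with an XPath query. It uses PHP's built-in DOM extension instead of DomCrawler so that it runs without dependencies; the extractPdfUrls() function is hypothetical and not part of the bundle.

```php
<?php
declare(strict_types=1);

// Hypothetical helper, not part of the bundle: returns the href of every
// link whose URL contains ".pdf".
function extractPdfUrls(string $html): array
{
    $document = new DOMDocument();

    // Real-world HTML is rarely valid, so keep parser warnings internal.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $urls = [];
    foreach ($xpath->query('//a[contains(@href, ".pdf")]') as $link) {
        $urls[] = $link->getAttribute('href');
    }

    return $urls;
}

$html = '<html><body><p><a href="/files/report.pdf">Report</a> <a href="/about">About</a></p></body></html>';

// Only the first link matches; "/about" is skipped.
print_r(extractPdfUrls($html));
```

Fields like this are cheap and deterministic to extract in code, which leaves the LLM for fields that need interpretation.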
composer require llm-html-extractor/symfony-bundle

Create or update config/packages/llm_html_extractor.yaml:
llm_html_extractor:
    llm_client:
        client: jina_reader # or any service ID for custom client
        jina_reader:
            model: 'jinaai/readerlm-v2' # or 'jinaai/readerlm-v1.5'
            default_temperature: 0.0 # Default temperature for LLM requests (0.0 = deterministic)
            default_max_tokens: 64000 # Default max tokens for LLM responses
            http_client:
                base_uri: 'https://r.jina.ai'
                api_key: '%env(JINA_API_KEY)%'
                timeout: 600
                headers:
                    X-Custom-Header: 'value'
        cache:
            enabled: true
            ttl: 36000 # 10 hours
            pool: 'cache.app'
        logs:
            enabled: true # default: false
            logger: 'logger' # service ID of the logger to use

Alternatively, you can use an existing HTTP client service:
llm_html_extractor:
    llm_client:
        client: jina_reader
        jina_reader:
            model: 'jinaai/readerlm-v2'
            http_client: 'my_custom_http_client_service' # Service ID implementing HttpClientInterface

To use your own LLM client implementation, just set the client parameter to your service ID:
llm_html_extractor:
    llm_client:
        client: 'app.my_custom_llm_client' # Your service ID
        cache:
            enabled: true # Will automatically wrap your client with caching

Your custom client must implement LlmHtmlExtractor\SymfonyBundle\Client\LlmClientInterface. The bundle will validate this during container compilation and throw a clear error if the interface is not implemented.
The bundle provides comprehensive logging for debugging and monitoring:
- Request/Response Logging: When logs.enabled: true, all LLM requests and responses are logged at info level
- Cache Operations: Cache hits and misses are logged when both caching and logging are enabled
- Error Logging: Failed LLM requests are logged at error level with exception details
The decorators are applied in this order, from innermost to outermost:
- Base LLM Client (e.g., JinaReaderLlmClient)
- LoggingLlmClient (if logs enabled) - logs requests/responses
- CacheableLlmClient (if cache enabled) - logs cache hits/misses
This means logged requests show the actual LLM calls (cache misses), not cached responses.
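The ordering can be sketched with plain PHP classes. The names below (SketchLlmClient, FakeLlmClient, LoggingClient, CachingClient) and the complete() method are hypothetical stand-ins, not the bundle's actual API, and the array-backed cache and log simplify what would be a cache pool and a logger service.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the bundle's client interface; the real
// LlmClientInterface may declare different methods.
interface SketchLlmClient
{
    public function complete(string $prompt): string;
}

// Innermost layer: pretends to call the LLM.
final class FakeLlmClient implements SketchLlmClient
{
    public function complete(string $prompt): string
    {
        return 'response to: ' . $prompt;
    }
}

// Middle layer: records every call that reaches it.
final class LoggingClient implements SketchLlmClient
{
    /** @var string[] */
    public array $log = [];

    public function __construct(private SketchLlmClient $inner) {}

    public function complete(string $prompt): string
    {
        $this->log[] = 'request: ' . $prompt;

        return $this->inner->complete($prompt);
    }
}

// Outermost layer: answers from the cache when it can.
final class CachingClient implements SketchLlmClient
{
    /** @var array<string, string> */
    private array $cache = [];

    public function __construct(private SketchLlmClient $inner) {}

    public function complete(string $prompt): string
    {
        $key = hash('sha256', $prompt);
        if (!isset($this->cache[$key])) {
            // Cache miss: delegate inward, which passes through the logger.
            $this->cache[$key] = $this->inner->complete($prompt);
        }

        return $this->cache[$key];
    }
}

$logging = new LoggingClient(new FakeLlmClient());
$client = new CachingClient($logging);

$client->complete('Extract the article title'); // miss: reaches the logger
$client->complete('Extract the article title'); // hit: served from the cache

echo count($logging->log), PHP_EOL; // prints 1
```

Because the cache sits outside the logger, the second call never reaches the logging layer, which is why only one request is recorded.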
use LlmHtmlExtractor\SymfonyBundle\Attribute\AsLlmExtractableProperty;

class ArticleExtractionResult
{
    public function __construct(
        #[AsLlmExtractableProperty('Extract the article title')]
        public string $title,
        #[AsLlmExtractableProperty('Extract the author name')]
        public string $author,
        #[AsLlmExtractableProperty('Extract publication date in YYYY-MM-DD format')]
        public string $publishedAt,
        #[AsLlmExtractableProperty('Extract the main article content')]
        public string $content,
    ) {}
}

use LlmHtmlExtractor\SymfonyBundle\Extractor\ExtractionHandler;

class ArticleScraper
{
    public function __construct(
        private ExtractionHandler $extractionHandler,
    ) {}

    public function scrape(string $html): ArticleExtractionResult
    {
        return $this->extractionHandler->handle(
            ArticleExtractionResult::class,
            $html
        );
    }
}

For specific extraction needs, implement the FromHtmlExtractorInterface:
use LlmHtmlExtractor\SymfonyBundle\Extractor\FromHtmlExtractorInterface;
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\DependencyInjection\Attribute\AutoconfigureTag;

#[AutoconfigureTag('llm_extractor.extractor', ['priority' => 50])]
class CustomPdfUrlExtractor implements FromHtmlExtractorInterface
{
    public function extract(string $html, array $context = []): mixed
    {
        $crawler = new Crawler($html);

        // Collect the href of every link whose URL contains ".pdf"
        return $crawler->filterXPath('//a[contains(@href, ".pdf")]')
            ->each(fn (Crawler $node) => $node->attr('href'));
    }

    public function supports(string $className, string $propertyName): bool
    {
        // Only handle the pdfUrls property of ArticleExtractionResult
        return $className === ArticleExtractionResult::class
            && $propertyName === 'pdfUrls';
    }
}

Currently supported:
- Jina Reader (jinaai/readerlm-v2, jinaai/readerlm-v1.5)
  - Uses vLLM OpenAI API standard endpoint (/openai/v1/chat/completions)
  - Tested with Runpod serverless deployments
  - Compatible with any vLLM deployment following the OpenAI API standard
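For orientation, a request body in the OpenAI chat-completions format, which OpenAI-compatible vLLM servers accept, could be built roughly as follows. This is a sketch only: the buildChatPayload() helper and the prompt layout are hypothetical, and the bundle may structure its messages differently. The model name, temperature, and token limit reuse the values from the configuration example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: builds an OpenAI-style chat completion request body.
 * The prompt layout is an assumption, not the bundle's actual prompt.
 */
function buildChatPayload(string $html, string $instruction): string
{
    return json_encode([
        'model' => 'jinaai/readerlm-v2',
        'messages' => [
            ['role' => 'user', 'content' => $instruction . "\n\n" . $html],
        ],
        'temperature' => 0.0,
        'max_tokens' => 64000,
    ], JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES);
}

$payload = buildChatPayload('<h1>Hello</h1>', 'Extract the article title');

// A client would POST this body to the /openai/v1/chat/completions endpoint.
echo $payload, PHP_EOL;
```

Any deployment that accepts this request shape should be usable in the same way, which is what makes the provider list above open-ended.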
MIT
Contributions are welcome! Please feel free to submit a Pull Request.