Informatica provider: Add SQL auto-lineage and selective lineage control #66612

Open
cetingokhan wants to merge 3 commits into apache:main from cetingokhan:informatica-provider-v0.2.0

@cetingokhan
Contributor

Add automatic SQL lineage detection and per-task/DAG lineage control to the Informatica provider (apache-airflow-providers-informatica).

Previously the provider only supported manual lineage through explicit inlets/outlets declarations.
This PR extends it with:

Automatic SQL Lineage

  • Adds a lineage/sql_parser.py module that uses sqlglot to parse SQL and extract source and target tables from SELECT, INSERT INTO, CREATE TABLE AS SELECT, and MERGE INTO statements.
  • Adds lineage/resolver.py, which infers the SQL dialect from the task's connection ID string (e.g. postgres_conn_id → postgres, snowflake → snowflake) and resolves the parsed table references against the Informatica EDC catalog. It supports 13 dialects: PostgreSQL, MySQL, Snowflake, BigQuery, Databricks, Redshift, SQLite, Oracle, Trino, Presto, Hive, Spark, and MSSQL (T-SQL).
  • When auto_lineage_enabled = True (the new default), the listener automatically detects SQL operators, parses their SQL, and creates lineage links — no inlets/outlets required on the task.
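To illustrate the dialect inference described above, here is a minimal sketch that matches keywords inside a connection ID string. The mapping table and the `infer_dialect` name are assumptions made for this example, not the contents of lineage/resolver.py; the values are sqlglot dialect names.

```python
from typing import Optional

# Hypothetical keyword -> sqlglot dialect name table. The real table in
# lineage/resolver.py may use different keywords or a different order.
DIALECT_KEYWORDS = {
    "postgres": "postgres",
    "mysql": "mysql",
    "snowflake": "snowflake",
    "bigquery": "bigquery",
    "databricks": "databricks",
    "redshift": "redshift",
    "sqlite": "sqlite",
    "oracle": "oracle",
    "trino": "trino",
    "presto": "presto",
    "hive": "hive",
    "spark": "spark",
    "mssql": "tsql",  # sqlglot names the SQL Server dialect "tsql"
}


def infer_dialect(conn_id: str) -> Optional[str]:
    """Return the first dialect whose keyword occurs in conn_id, else None."""
    lowered = conn_id.lower()
    for keyword, dialect in DIALECT_KEYWORDS.items():
        if keyword in lowered:
            return dialect
    return None
```

Substring matching is order-sensitive: a connection ID containing two keywords resolves to whichever appears first in the table, so a real implementation may need explicit precedence rules.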

Fail-fast validation (two-phase listener)

  • Refactors InformaticaListener to a two-phase model:
    • on_task_instance_running — pre-validates and resolves all inlet/outlet URIs (manual) or parsed table references (auto) before the operator's execute() is called. Introduces InformaticaLineageResolutionError which immediately fails the task when any URI or table cannot be resolved in the Informatica catalog. Resolved EDC object IDs are cached in memory.
    • on_task_instance_success — creates lineage links using the cached IDs, avoiding a second round of EDC calls.
    • on_task_instance_failed — clears the cache to prevent stale state.
  • This prevents silent lineage gaps: tasks that reference catalog objects not present in EDC now fail clearly before execution rather than succeeding with missing lineage.
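The two-phase flow can be sketched with a plain dictionary standing in for the EDC catalog. Everything here is hypothetical (`TwoPhaseListenerSketch`, `LineageResolutionError`, the method names and the task key); it only mirrors the resolve-then-link behaviour described above and does not use the Airflow listener API or the provider's actual classes.

```python
from typing import Dict, List, Tuple


class LineageResolutionError(Exception):
    """Stand-in for InformaticaLineageResolutionError."""


class TwoPhaseListenerSketch:
    def __init__(self, catalog: Dict[str, str]) -> None:
        self._catalog = catalog  # fake EDC catalog: reference -> object ID
        self._cache: Dict[str, Dict[str, List[str]]] = {}
        self.links: List[Tuple[str, str]] = []  # created (source, target) links

    def _resolve(self, ref: str) -> str:
        try:
            return self._catalog[ref]
        except KeyError:
            raise LineageResolutionError(f"not found in catalog: {ref}") from None

    def on_running(self, task_key: str, inlets: List[str], outlets: List[str]) -> None:
        # Phase 1: resolve everything up front; any miss raises before caching.
        sources = [self._resolve(ref) for ref in inlets]
        targets = [self._resolve(ref) for ref in outlets]
        self._cache[task_key] = {"in": sources, "out": targets}

    def on_success(self, task_key: str) -> None:
        # Phase 2: link from cached IDs, with no second catalog lookup.
        resolved = self._cache.pop(task_key, None)
        if resolved is None:
            return
        for source in resolved["in"]:
            for target in resolved["out"]:
                self.links.append((source, target))

    def on_failed(self, task_key: str) -> None:
        # Drop cached IDs so a retry starts from a clean state.
        self._cache.pop(task_key, None)
```

Because both lists are resolved before the cache entry is written, a failed resolution leaves no partial state behind for that task.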

Selective lineage control

  • Adds lineage/selective.py with disable_informatica_lineage(task_or_dag) and enable_informatica_lineage(task_or_dag) helpers, exported from airflow.providers.informatica.lineage. These let users opt individual tasks or entire DAGs out of automatic lineage without touching inlets/outlets.
  • Adds disabled_for_operators config option to exclude entire operator classes (e.g. BashOperator) from lineage tracking via airflow.cfg.

New configuration options ([informatica] section in airflow.cfg):

  • auto_lineage_enabled (bool, default True) — enable/disable SQL auto-lineage globally.
  • disabled_for_operators (str, default "") — semicolon-separated FQCNs of operator classes to skip.
  • request_timeout (int, default 30) — timeout in seconds for EDC REST API calls.
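As a rough illustration of how these three options might be read, the sketch below parses an [informatica] section with the standard library's `configparser`. The provider itself would go through Airflow's configuration layer, so `load_settings`, `is_disabled`, and the sample operator paths are assumptions made for this example only.

```python
import configparser

# Sample airflow.cfg fragment; the operator class paths are illustrative.
SAMPLE_CFG = """
[informatica]
auto_lineage_enabled = True
disabled_for_operators = airflow.providers.standard.operators.bash.BashOperator;my_pkg.ops.NoisyOperator
request_timeout = 30
"""


def load_settings(text: str) -> dict:
    """Parse the [informatica] section, applying the documented defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["informatica"]
    raw = section.get("disabled_for_operators", fallback="")
    # Semicolon-separated FQCNs; blank entries and stray whitespace are dropped.
    disabled = {name.strip() for name in raw.split(";") if name.strip()}
    return {
        "auto_lineage_enabled": section.getboolean("auto_lineage_enabled", fallback=True),
        "disabled_for_operators": disabled,
        "request_timeout": section.getint("request_timeout", fallback=30),
    }


def is_disabled(operator_fqcn: str, settings: dict) -> bool:
    """Hypothetical counterpart of is_operator_disabled(): exact FQCN match."""
    return operator_fqcn in settings["disabled_for_operators"]
```

This version matches on the exact class path only; whether subclasses of a listed operator are also skipped is a design choice the description does not specify.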

Other changes

  • Adds example_dags/example_informatica_lineage.py demonstrating all four modes: auto-lineage, manual lineage, per-task disable, and operator-class exclusion.
  • Updates docs/guides/usage.rst with comprehensive documentation for all new features.
  • Adds is_operator_disabled() to conf.py for per-operator lookup.

Manual lineage still takes priority — if a task has any inlets or outlets defined, SQL parsing is skipped entirely.
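The precedence rule can be written as a small decision function. This is a sketch under the assumption that only the three outcomes below exist; `choose_lineage_mode` and its return values are hypothetical names, and the per-task and per-operator opt-outs are left out.

```python
from typing import Sequence


def choose_lineage_mode(
    inlets: Sequence[str],
    outlets: Sequence[str],
    auto_lineage_enabled: bool = True,
) -> str:
    """Pick how lineage is derived for one task."""
    # Any declared inlet or outlet wins, so SQL parsing is skipped entirely.
    if inlets or outlets:
        return "manual"
    # No manual declarations: fall back to SQL auto-lineage if enabled.
    if auto_lineage_enabled:
        return "auto"
    return "none"
```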

closes: #ISSUE


Was generative AI tooling used to co-author this PR?
  • Yes — GitHub Copilot (Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.3-Codex)

Generated-by: GitHub Copilot (Claude Sonnet 4.6) following the guidelines

@potiuk force-pushed the informatica-provider-v0.2.0 branch from f6677da to 144c81d on May 9, 2026, 20:30