
OpenLineage

The industry standard for real-time metadata collection and cross-platform data lineage.

OpenLineage is an open-source standard for the collection of lineage metadata from data pipelines. It defines a formal specification for capturing information about data producers, the transformations they perform, and the datasets they consume or produce. As of 2026, OpenLineage has established itself as the core protocol within the Modern Data Stack, enabling interoperability between data orchestrators like Apache Airflow, processing engines like Apache Spark, and cataloging tools. Its technical architecture centers on a JSON-based schema that utilizes 'Facets'—extensible metadata units that describe specific aspects of a run, job, or dataset. By decoupling the collection of lineage from its consumption, OpenLineage allows organizations to build a comprehensive map of their data ecosystem without vendor lock-in. In the 2026 market, it is the fundamental infrastructure for AI/ML pipeline audits, providing the 'paper trail' necessary for regulatory compliance and AI model reproducibility. Its event-based model supports asynchronous metadata emission, ensuring that lineage tracking does not become a bottleneck for high-throughput data processing environments.
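To make the event model concrete, below is a minimal sketch of a single lineage event expressed as a plain Python dictionary and posted over HTTP. The namespace, job and dataset names, producer URI, and consumer URL are all placeholder assumptions; real deployments would normally rely on an official client or integration rather than hand-built payloads, and the spec defines additional fields (such as schemaURL) not shown here.

```python
import uuid
from datetime import datetime, timezone

import requests  # any HTTP client works; requests is assumed to be installed

# A START event for a hypothetical job. Top-level field names follow the
# OpenLineage JSON spec: eventType, eventTime, run, job, inputs, outputs.
event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},  # globally unique run identifier
    "job": {
        "namespace": "example_namespace",  # placeholder namespace
        "name": "daily_orders_load",       # placeholder job name
    },
    "inputs": [{"namespace": "postgres://warehouse", "name": "public.orders"}],
    "outputs": [{"namespace": "s3://lake", "name": "curated/orders"}],
    "producer": "https://example.com/my-pipeline",  # URI identifying the emitter
}

# Consumers such as Marquez typically accept events on /api/v1/lineage.
OPENLINEAGE_URL = "http://localhost:5000"  # placeholder consumer endpoint
requests.post(
    f"{OPENLINEAGE_URL}/api/v1/lineage",
    json=event,
    timeout=10,
).raise_for_status()
```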
Key capabilities:
Facets: Extensible, atomic metadata units that allow custom metadata to be added without breaking the core schema (see the facet sketch after this list).
Run ID: A globally unique identifier (UUID) that tracks a single execution across multiple disparate systems (e.g., a Spark job triggered by Airflow).
Column-level lineage: The ability to track specific data fields from source to destination, derived by the SQL parser and carried in the column lineage facet.
Dataset versioning: Captures the state of a dataset at the time of a job run to ensure reproducibility.
dbt integration: Parses dbt manifest and catalog files to generate lineage without requiring code execution.
Data quality facets: Integration with tools such as Great Expectations to include data quality metrics directly in lineage events.
Asynchronous emission: Uses non-blocking HTTP or Kafka transport mechanisms to send metadata.
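As an illustration of the facet mechanism and column-level lineage described above, the sketch below shows an output dataset carrying a column lineage facet alongside a custom facet. The structure follows the published facet spec as best understood here; all namespaces, table and column names, the custom facet name, and the schema URLs are placeholders.

```python
# Sketch of an output dataset carrying two facets: the column-level lineage
# facet and a hypothetical custom facet. All names and URLs are placeholders.
output_dataset = {
    "namespace": "s3://lake",
    "name": "curated/orders",
    "facets": {
        # Column-level lineage: which input fields feed each output field.
        "columnLineage": {
            "_producer": "https://example.com/my-pipeline",
            "_schemaURL": "https://openlineage.io/spec/facets/ColumnLineageDatasetFacet.json",  # illustrative
            "fields": {
                "order_total": {
                    "inputFields": [
                        {
                            "namespace": "postgres://warehouse",
                            "name": "public.orders",
                            "field": "amount",
                        }
                    ]
                }
            },
        },
        # A custom facet (hypothetical name) carrying domain-specific metadata
        # without modifying the core schema.
        "myTeamOwnership": {
            "_producer": "https://example.com/my-pipeline",
            "_schemaURL": "https://example.com/schemas/MyTeamOwnershipFacet.json",  # hypothetical
            "owner": "data-platform@example.com",
        },
    },
}
```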
Getting started:
1. Identify your data producer (e.g., Apache Spark, Airflow, or dbt).
2. Choose a backend for metadata storage (e.g., Marquez, Amundsen, or Atlan).
3. Install the OpenLineage integration package for your specific environment via pip or Maven.
4. Configure the OPENLINEAGE_URL environment variable to point to your metadata consumer.
5. Set the OPENLINEAGE_NAMESPACE environment variable to logically group your data jobs.
6. Define custom Facets if you need to capture domain-specific metadata not in the core spec.
7. Initialize the OpenLineage listener within your application code or orchestrator config.
8. Execute a test run of your pipeline to emit initial lineage events (see the configuration sketch after this list).
9. Verify the JSON event structures using the OpenLineage proxy or validator.
10. Visualize the resulting lineage graph in your chosen backend UI.
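To make steps 4, 5, and 8 concrete, here is a minimal sketch that reads the two environment variables and emits a START/COMPLETE pair for a throwaway test job over plain HTTP. The default URL, namespace, job name, and producer URI are placeholders, and the official integration packages would normally emit these events automatically.

```python
import os
import uuid
from datetime import datetime, timezone

import requests  # assumed to be installed

# Steps 4 and 5: consumer endpoint and logical namespace come from the
# environment; the defaults here are placeholders for local experimentation.
OPENLINEAGE_URL = os.environ.get("OPENLINEAGE_URL", "http://localhost:5000")
OPENLINEAGE_NAMESPACE = os.environ.get("OPENLINEAGE_NAMESPACE", "dev")


def emit(event_type: str, run_id: str, job_name: str) -> None:
    """Send one lineage event to the configured consumer."""
    event = {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": OPENLINEAGE_NAMESPACE, "name": job_name},
        "producer": "https://example.com/manual-test",  # placeholder producer URI
    }
    requests.post(
        f"{OPENLINEAGE_URL}/api/v1/lineage", json=event, timeout=10
    ).raise_for_status()


# Step 8: a throwaway test run that should then appear in the backend UI.
run_id = str(uuid.uuid4())
emit("START", run_id, job_name="lineage_smoke_test")
emit("COMPLETE", run_id, job_name="lineage_smoke_test")
```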
User feedback: "Users praise the tool for its vendor neutrality and the depth of its Airflow integration, though some note the complexity of setting up custom backends."
