
The open-source standard for data lineage, metadata collection, and job observability.

Marquez is a scalable metadata server and visualization platform that aggregates, stores, and visualizes metadata about data production and consumption. It is the reference implementation of the OpenLineage standard and maintains a complete history of dataset evolution and job execution. Its core architecture uses a relational backend (PostgreSQL) and exposes a RESTful API for metadata ingestion and retrieval. Marquez tracks job runs, versions of both code and data schemas, and the physical location of datasets, which lets data engineers automate impact analysis and root cause identification across polyglot data stacks, including decentralized data mesh architectures. Its design philosophy centers on late-binding metadata, so it integrates with orchestrators such as Apache Airflow and execution engines such as Spark. As an LF AI & Data project, it operates under a neutral governance model, which supports its longevity and interoperability with other data lifecycle tools.
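To make the impact-analysis idea concrete, here is a minimal sketch of a downstream traversal over a lineage graph. It is a toy illustration only, not Marquez's implementation: the edge list, the node names, and the `downstream_impact` helper are all hypothetical.

```python
from collections import deque

# Hypothetical lineage edges (producer -> consumer); datasets and jobs alternate.
EDGES = [
    ("dataset:raw.orders", "job:clean_orders"),
    ("job:clean_orders", "dataset:staging.orders"),
    ("dataset:staging.orders", "job:daily_revenue"),
    ("dataset:staging.orders", "job:customer_ltv"),
    ("job:daily_revenue", "dataset:reports.revenue"),
]


def downstream_impact(edges, start):
    """Return every node reachable from `start`, in breadth-first order."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    seen = {start}
    queue = deque([start])
    affected = []
    while queue:
        node = queue.popleft()
        for nxt in children.get(node, []):
            if nxt not in seen:  # guard against revisiting shared nodes
                seen.add(nxt)
                affected.append(nxt)
                queue.append(nxt)
    return affected
```

Calling `downstream_impact(EDGES, "dataset:raw.orders")` lists every job and dataset that a change to the raw table could affect. Root cause identification is the same walk with the edges reversed.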
Native support for the OpenLineage spec, ensuring consistent metadata collection across Spark, Airflow, and Flink.
Tracks both when a change happened in the source system and when it was recorded in Marquez.
Detects and records changes in dataset schemas across every job run.
A React-based UI that allows users to traverse complex dependency trees and zoom into specific job nodes.
Allows attaching custom facets (JSON metadata) to job runs, such as data quality scores or resource usage.
Connects job and dataset nodes across different organizational boundaries and namespaces.
Dual-API approach: a REST API for high-throughput ingestion and a GraphQL endpoint for complex, nested metadata queries.
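The custom-facet and schema-tracking features above can be sketched as an OpenLineage-style run event. This is a minimal sketch under stated assumptions: the producer URI, the `example_dataQuality` facet name, its schema URL, and the spec version in the `schemaURL` values are illustrative and should be checked against the OpenLineage spec version you target.

```python
import json
import uuid
from datetime import datetime, timezone

PRODUCER = "https://example.com/my-pipeline"  # hypothetical producer URI


def build_run_event(namespace, job_name, output_dataset, fields, quality_score):
    """Build a COMPLETE run event with a schema facet and a custom run facet."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        # Assumed spec version; adjust to the one your tooling emits.
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
        "run": {
            "runId": str(uuid.uuid4()),
            "facets": {
                # Custom facet (hypothetical name and schema URL).
                "example_dataQuality": {
                    "_producer": PRODUCER,
                    "_schemaURL": "https://example.com/schemas/DataQualityRunFacet.json",
                    "score": quality_score,
                }
            },
        },
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [],
        "outputs": [
            {
                "namespace": namespace,
                "name": output_dataset,
                "facets": {
                    # Schema facet: one entry per column of the output dataset.
                    "schema": {
                        "_producer": PRODUCER,
                        "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json",
                        "fields": [{"name": n, "type": t} for n, t in fields],
                    }
                },
            }
        ],
    }
```

Serialized with `json.dumps`, an event of this shape is what an OpenLineage client would POST to the Marquez lineage endpoint (`/api/v1/lineage`). Emitting the field list on every run is what makes schema changes visible between runs.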
Clone the Marquez GitHub repository: git clone https://github.com/MarquezProject/marquez
Ensure Docker and Docker Compose are installed on the host machine.
Navigate to the project root and execute './docker/up.sh' to start the PostgreSQL database and API server.
Access the Marquez UI via browser at http://localhost:3000.
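Install the OpenLineage integration for your orchestrator (e.g., 'pip install openlineage-airflow'; on Airflow 2.7 and later, the 'apache-airflow-providers-openlineage' provider package takes its place).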
Configure environment variables for the orchestrator (OPENLINEAGE_URL and OPENLINEAGE_NAMESPACE).
Register your first Namespace using the Marquez REST API to logically group jobs.
Trigger a data pipeline run to generate initial metadata events.
Verify the job run and dataset appearance in the Marquez UI lineage graph.
Query the metadata programmatically using the GraphQL endpoint for custom reporting.
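The configuration, namespace-registration, and query steps above can be sketched with the Python standard library. The sketch only builds the requests and does not send them. It assumes the API listens on `http://localhost:5000`, that namespaces are registered with `PUT /api/v1/namespaces/{namespace}`, and that GraphQL is served at `/api/v1-beta/graphql`; verify these against your Marquez version. The helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

MARQUEZ_URL = "http://localhost:5000"  # assumed default API address


def openlineage_env(namespace):
    """Environment variables an OpenLineage integration reads."""
    return {"OPENLINEAGE_URL": MARQUEZ_URL, "OPENLINEAGE_NAMESPACE": namespace}


def namespace_request(namespace, owner, description=""):
    """Build (but do not send) a PUT request that registers a namespace."""
    body = json.dumps({"ownerName": owner, "description": description})
    return urllib.request.Request(
        url=f"{MARQUEZ_URL}/api/v1/namespaces/{quote(namespace, safe='')}",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def graphql_request(query):
    """Build (but do not send) a POST request to the assumed GraphQL path."""
    return urllib.request.Request(
        url=f"{MARQUEZ_URL}/api/v1-beta/graphql",
        data=json.dumps({"query": query}).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Standard GraphQL introspection query; works without knowing the schema.
INTROSPECTION = "{ __schema { queryType { name } } }"
```

Passing either request to `urllib.request.urlopen` sends it to a running instance. Starting with the introspection query avoids guessing field names before you have inspected the schema.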
Verified feedback from other users.
"Users praise Marquez for its strict adherence to the OpenLineage standard and its ability to provide clear visual insight into complex data pipelines. Some find the initial setup of metadata listeners to be complex."