
Lineage Ingestion

A large subset of connectors distributed with OpenMetadata include support for lineage ingestion. Lineage ingestion processes queries to determine upstream and downstream entities for data assets. Lineage is published to the OpenMetadata catalog when metadata is ingested.

Using the OpenMetadata user interface and API, you may trace the path of data across Tables, Pipelines, and Dashboards.

Lineage ingestion is specific to the type of Entity being processed. Below, we explain the ingestion process for each supported service.

The team is continuously working to increase the lineage coverage of the available services. Do not hesitate to reach out if you have any questions, issues or requests!

There are three lineage sources, split across different workflows, but all largely built around a Query Parser.

During the Metadata Ingestion workflow we check whether a Table is a View. For sources where we can obtain the query that generates the View (e.g., Snowflake lets us pick up the View query from the DDL), we keep that query for later processing.

After all Tables have been ingested in the workflow, it's time to parse all the queries generating Views. During the query parsing, we will obtain the source and target tables, search if the Tables exist in OpenMetadata, and finally create the lineage relationship between the involved Entities.

Let's go over this process with an example. Suppose we have the following DDL:
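The original DDL snippet is not reproduced here, so the statement below is a hypothetical reconstruction consistent with the table names used in this example. The regex-based extraction is a simplified illustration only; OpenMetadata relies on a full SQL parser, not regular expressions:

```python
import re

# Hypothetical view DDL; the real statement depends on the source database.
ddl = """
CREATE VIEW schema.my_view AS
SELECT a.id, b.value
FROM schema.table_a a
JOIN another_schema.table_b b ON a.id = b.id
"""

# Toy extraction of the target and source tables. A regex is only enough
# for this simplified sketch, not for real-world SQL.
target = re.search(r"CREATE\s+VIEW\s+([\w.]+)", ddl, re.IGNORECASE).group(1)
sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", ddl, re.IGNORECASE)

print(target)   # schema.my_view
print(sources)  # ['schema.table_a', 'another_schema.table_b']
```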

From this query we will extract the following information:

1. There are two source tables, represented by the strings schema.table_a and another_schema.table_b.
2. There is a target table, schema.my_view.

In this case we assume that the database connection requires us to write table names as <schema>.<table>. However, other formats are possible: sometimes we find just <table> in a query, or even <database>.<schema>.<table>.

The point is that we have only limited information with which to identify the Table Entity that the SQL refers to. To close this gap, we run a query against ElasticSearch using the Table FQN.
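One way to frame that lookup: since the catalog FQN is of the form service.database.schema.table while the SQL may only give us a suffix of it, the search can match on whatever suffix we know. The index fields and helper below are assumptions for illustration, not OpenMetadata's actual ElasticSearch schema:

```python
def build_fqn_search(table_ref: str, service: str) -> dict:
    """Build a hypothetical ES query for a table reference that may be
    <table>, <schema>.<table>, or <database>.<schema>.<table>.
    Field names are illustrative, not OpenMetadata's real mapping."""
    return {
        "query": {
            "bool": {
                "must": [
                    # Restrict to the service we are ingesting from.
                    {"term": {"service.name": service}},
                    # Wildcard-match the known suffix of the FQN.
                    {"wildcard": {"fullyQualifiedName": f"*{table_ref}"}},
                ]
            }
        }
    }

query = build_fqn_search("schema.table_a", "my_snowflake")
print(query["query"]["bool"]["must"][1]["wildcard"]["fullyQualifiedName"])
# *schema.table_a
```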

Once we have identified all the ingredients in OpenMetadata as Entities, we can call the Lineage API to add the relationship between the nodes.
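A sketch of the payload such a call could carry, assuming a `PUT /api/v1/lineage` endpoint with an edge between two resolved Entities. Field names should be verified against the Lineage API reference for your server version, and the UUIDs here are placeholders:

```python
import json

def lineage_edge(from_id: str, to_id: str, entity_type: str = "table") -> dict:
    # Payload shaped like an add-lineage request: one directed edge from
    # the upstream Entity to the downstream Entity. Treat the exact field
    # names as an assumption to check against the API docs.
    return {
        "edge": {
            "fromEntity": {"id": from_id, "type": entity_type},
            "toEntity": {"id": to_id, "type": entity_type},
        }
    }

# One edge per (source, target) pair found by the parser; placeholder ids.
payload = lineage_edge("uuid-of-table-a", "uuid-of-my-view")
print(json.dumps(payload))
```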

What we just described is the core process of identifying and ingesting lineage, and it will be reused (or partially reused) for the rest of the options as well.

When configuring an Ingestion with dbt information, we can parse the nodes in the Manifest JSON to get the data model lineage. Here we don't need to parse a query to obtain the source and target elements, but we still rely on querying ElasticSearch to identify the graph nodes as OpenMetadata Entities.
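The manifest walk can be sketched as follows. The fragment below mimics the `nodes` / `depends_on` layout of a dbt `manifest.json` (real files carry many more fields, and the project name `demo` is invented for the example):

```python
import json

# Minimal manifest.json fragment in the shape dbt produces.
manifest = json.loads("""
{
  "nodes": {
    "model.demo.my_view": {
      "resource_type": "model",
      "config": {"materialized": "table"},
      "depends_on": {"nodes": ["model.demo.table_a", "model.demo.table_b"]}
    }
  }
}
""")

# Collect (upstream, downstream) pairs from each model's dependencies;
# each node id is then resolved to an OpenMetadata Entity via ElasticSearch.
edges = [
    (upstream, node_id)
    for node_id, node in manifest["nodes"].items()
    if node["resource_type"] == "model"
    for upstream in node["depends_on"]["nodes"]
]
print(edges)
```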

Note that if a Model is not materialized, its data won't be ingested.

The main difference here is between sources that provide internal access to query logs and those that do not.

For services such as BigQuery and Snowflake, there are specific workflows (Usage & Lineage) that use the query log information. For sources not covered by those workflows, an alternative is to export the Query Logs yourself and run the workflow against that export.
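Reading such an export can be sketched as below. The column names are hypothetical; check the documentation for the exact file format your workflow version expects. Each logged query is then fed to the same parsing step used for view DDL:

```python
import csv
import io

# Hypothetical exported query log (inlined here for a self-contained example);
# the real expected columns are version-dependent.
log = io.StringIO(
    "query_text,database_name,schema_name\n"
    '"INSERT INTO schema.my_table SELECT * FROM schema.table_a",demo,schema\n'
)

# Collect the raw queries; each one goes through the Query Parser to
# produce source/target tables for lineage.
queries = [row["query_text"] for row in csv.DictReader(log)]
print(queries[0])
```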