Lineage Workflow

Learn how to configure the Lineage workflow from the UI to ingest Lineage data from your data sources.

Check out the documentation of the connector you are using to find out whether it supports the automated lineage workflow.

If your database service is not yet supported, you can use this same workflow by providing a Query Log file!

Learn how to do so 👇

Once the metadata ingestion runs correctly and we are able to explore the service Entities, we can add Entity Lineage information.

This will populate the Lineage tab on the Table Entity Page.

Table Entity Page

We can create a workflow that will obtain the query log and table creation information from the underlying database and feed it to OpenMetadata. The Lineage Ingestion will be in charge of obtaining this data.

From the Service Page, go to the Ingestions tab to add a new ingestion and click on Add Lineage Ingestion.

Add Ingestion

Here you can enter the Lineage Ingestion details:

Configure the Lineage Ingestion

Query Log Duration

Specify how many days of query logs the workflow should process to capture lineage data. For example, if you specify 2 as the value, the workflow will capture lineage information for the 48 hours prior to when the ingestion workflow is run.

Result Limit

Set the maximum number of query log results to process at a time.

After clicking Next, you will be redirected to the Scheduling form. This will be the same as for the Metadata Ingestion. Select your desired schedule and click on Deploy. The lineage pipeline will then be added to the Service Ingestions.

View Service Ingestion pipelines

In the connectors section we showcase how to run the metadata ingestion from a JSON/YAML file using the Airflow SDK or the CLI via metadata ingest. Running a lineage workflow is also possible using a JSON/YAML configuration file.

This is a good option if you wish to execute your workflow via the Airflow SDK or the CLI. With the CLI, a lineage workflow can be triggered with the command metadata ingest -c FILENAME.yaml. The serviceConnection config will be specific to your connector (you can find more information in the connectors section), while the sourceConfig for the lineage will be similar across all connectors.

After running a Metadata Ingestion workflow, we can run the Lineage workflow. The serviceName must be the same as the one used in the Metadata Ingestion, so that the ingestion bot can get the serviceConnection details from the server.

This is a sample config for BigQuery Lineage:
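A minimal sketch of what such a file might look like, assuming the bigquery-lineage source type and the DatabaseLineage sourceConfig type. The service name, host, and auth provider are placeholders, and the numeric values are only illustrative:

```yaml
source:
  type: bigquery-lineage
  # Must match the service name used by the Metadata Ingestion workflow
  serviceName: my_bigquery_service  # placeholder name
  sourceConfig:
    config:
      type: DatabaseLineage
      queryLogDuration: 1        # days of query logs to look back (illustrative)
      parsingTimeoutLimit: 300   # seconds allowed for parsing a query (illustrative)
      resultLimit: 1000          # limit for query log results (illustrative)
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: "<OpenMetadata host and port>"      # placeholder
    authProvider: "<OpenMetadata auth provider>"  # placeholder
```

No serviceConnection block appears in this sketch because, as noted above, the ingestion bot can retrieve it from the server through the serviceName.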

You can find all the definitions and types for the sourceConfig here.

queryLogDuration: Configuration to tune how far we want to look back in query logs to process lineage data in days.

parsingTimeoutLimit: Configuration to set the timeout for parsing the query in seconds.

filterCondition: Condition to filter the query history.

resultLimit: Configuration to set the limit for query logs.

queryLogFilePath: Configuration to set the file path for query logs.

databaseFilterPattern: Regex to only fetch databases that match the pattern.

schemaFilterPattern: Regex to only fetch schemas that match the pattern.

tableFilterPattern: Regex to only fetch tables that match the pattern.
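The options above could be combined as in the following sketch. It assumes that filter patterns are written as includes/excludes lists of regular expressions. The database, schema, and table names, the file path, and the numeric values are hypothetical:

```yaml
sourceConfig:
  config:
    type: DatabaseLineage
    queryLogDuration: 2          # look back two days in the query logs
    parsingTimeoutLimit: 300     # seconds
    resultLimit: 1000
    # filterCondition: "<condition on the query history>"  # syntax depends on the source
    # queryLogFilePath: /tmp/query_log/queries.csv         # hypothetical path to a query log file
    databaseFilterPattern:
      includes:
        - "^analytics$"            # hypothetical database name
    schemaFilterPattern:
      excludes:
        - "^information_schema$"   # hypothetical schema to skip
    tableFilterPattern:
      includes:
        - "^fact_.*"               # hypothetical table name prefix
```

The two commented-out keys are left as placeholders because their values depend on your source and environment.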

To send the metadata to OpenMetadata, the sink needs to be specified as type: metadata-rest.

The main property here is the openMetadataServerConfig, where you can define the host and security provider of your OpenMetadata installation.

For a simple, local installation using our docker containers, this looks like:
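As a hedged sketch, the workflowConfig for such a local setup might resemble the block below. The host and port assume the default local Docker deployment, the JWT token is a placeholder, and the authentication settings may differ depending on your OpenMetadata version:

```yaml
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api        # assumed default for a local Docker deployment
    authProvider: openmetadata                 # assumption; depends on your security setup
    securityConfig:
      jwtToken: "<ingestion bot JWT token>"    # placeholder, replace with your own token
```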

You can learn more about how to configure and run the Lineage Workflow to extract Lineage data from here.

After saving the YAML config, we will run the command the same way we did for the metadata ingestion, with metadata ingest -c FILENAME.yaml.