External Profiler Workflow

Consider a large database source with multiple databases and schemas that are maintained by different teams within your organization. You have created several database services in OpenMetadata for this source, each applying different filters to suit its use case. Now, instead of running a profiler pipeline for each service, you want to run a single profiler workflow for the entire source, irrespective of which OpenMetadata service an asset belongs to. This document explains how to achieve this.

You might also want to check out how to configure external sample data.

To run the Ingestion via the UI you'll need to use the OpenMetadata Ingestion Container, which comes shipped with custom Airflow plugins to handle the workflow deployment.

If, instead, you want to manage your workflows externally on your preferred orchestrator, you can check the following docs to run the Ingestion Framework anywhere.

To run the external profiler with external sample data, you will need to install the ingestion package along with the plugins for your connector, where <connector> is the name of the connector that you want to run against. Each connector's installation command is given on its documentation page.

For example, to run against Athena, we need to install the following plugins:

  • The athena plugin will bring all the requirements to connect to the Athena Service
  • The datalake plugin helps us connect to S3 to manage the sample data
  • The trino plugin will only be needed temporarily
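The Athena installation described above can be sketched in Python as follows. This is a minimal illustration, assuming the package is published as openmetadata-ingestion and that plugins are selected as pip extras; the helper names requirement_spec and install are hypothetical and not part of OpenMetadata.

```python
import subprocess
import sys

# Extras needed for external sample data, per the list above.
# The trino extra is described as a temporary requirement.
SAMPLE_DATA_EXTRAS = ("datalake", "trino")


def requirement_spec(connector: str) -> str:
    """Build the pip requirement string for a connector plus the sample-data extras."""
    extras = ",".join((connector, *SAMPLE_DATA_EXTRAS))
    return f"openmetadata-ingestion[{extras}]"


def install(connector: str) -> None:
    """Install the requirement into the environment of the running interpreter."""
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", requirement_spec(connector)]
    )
```

Calling install("athena") is then equivalent to running pip install "openmetadata-ingestion[athena,datalake,trino]" in a shell. For other connectors, check the connector's documentation page for the exact extras it expects.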

You will need to prepare a YAML file for the data profiler that matches your database source. Each connector's documentation describes how to define its profiler YAML file.

For example, if the data source is Snowflake, the YAML file follows the Snowflake connector's profiler configuration.
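As a rough sketch of the shape such a configuration usually takes, the structure can be written as a Python dictionary whose keys mirror the YAML keys. The layout below (source, processor, sink, workflowConfig) is an assumption based on the general OpenMetadata workflow format, not a copy of the Snowflake reference; the function name build_profiler_config is hypothetical, and all angle-bracket values are placeholders. Fields such as a service name or profiler options are left out, so check the Snowflake connector documentation for what your version requires.

```python
import json
from typing import Any, Dict


def build_profiler_config(host_port: str, jwt_token: str) -> Dict[str, Any]:
    """Return a profiler workflow config as a dict mirroring the YAML layout (assumed)."""
    return {
        "source": {
            "type": "snowflake",
            "serviceConnection": {
                "config": {
                    "type": "Snowflake",
                    # Placeholder credentials; replace with real values.
                    "username": "<username>",
                    "password": "<password>",
                    "account": "<account>",
                    "warehouse": "<warehouse>",
                }
            },
            # Marks this workflow as a profiler run.
            "sourceConfig": {"config": {"type": "Profiler"}},
        },
        "processor": {"type": "orm-profiler", "config": {}},
        "sink": {"type": "metadata-rest", "config": {}},
        "workflowConfig": {
            "openMetadataServerConfig": {
                "hostPort": host_port,
                "authProvider": "openmetadata",
                "securityConfig": {"jwtToken": jwt_token},
            }
        },
    }


def to_text(config: Dict[str, Any]) -> str:
    """Serialize the config as indented JSON for inspection."""
    return json.dumps(config, indent=2)
```

For instance, to_text(build_profiler_config("http://localhost:8585/api", "<jwt-token>")) prints the structure so it can be compared against the connector's documented YAML before saving the real file.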

One option for running the workflow externally is to use the metadata CLI.

After saving the YAML config, run the metadata profile command, passing the path to the file with the -c option.
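If the CLI call needs to be triggered from a script or scheduler, it can be wrapped as shown in this small sketch. It assumes the metadata executable is on the PATH after installing the ingestion package; the helper names profile_command and run_profile_cli are hypothetical.

```python
import subprocess
from typing import List


def profile_command(config_path: str) -> List[str]:
    """Build the argument list for the profiler CLI invocation."""
    return ["metadata", "profile", "-c", config_path]


def run_profile_cli(config_path: str) -> None:
    """Run the profiler CLI and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(profile_command(config_path), check=True)
```

Calling run_profile_cli("profiler.yaml") corresponds to typing metadata profile -c profiler.yaml in a shell, where profiler.yaml is a placeholder for your saved config.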

If you'd rather have a Python script take care of the execution, you can run the profiler workflow programmatically.
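A minimal sketch of such a script is shown below. It assumes the workflow class is importable as metadata.workflow.profiler.ProfilerWorkflow, which may differ between OpenMetadata releases, so verify the import path against your installed version. The function name run_profiler and the workflow_cls parameter are hypothetical; the parameter exists only so the class can be swapped out when the ingestion package is not installed.

```python
from typing import Any, Dict, Optional


def run_profiler(config: Dict[str, Any], workflow_cls: Optional[Any] = None) -> None:
    """Create, execute, and stop a profiler workflow from a parsed config dict."""
    if workflow_cls is None:
        # Assumed import path; requires the openmetadata-ingestion package.
        from metadata.workflow.profiler import ProfilerWorkflow

        workflow_cls = ProfilerWorkflow

    workflow = workflow_cls.create(config)
    try:
        workflow.execute()
        # Surface any failures recorded during execution as an exception.
        workflow.raise_from_status()
    finally:
        # Always release resources, even if the run failed.
        workflow.stop()
```

In practice the config dict would come from parsing the saved YAML file, for example with yaml.safe_load, before passing it to run_profiler.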