Data Quality as Code
Data Quality as Code enables you to programmatically build, run, and manage data quality tests within your ETL workflows using the OpenMetadata Python SDK. This approach allows data engineers and developers to integrate quality validation directly into their data pipelines, ensuring data is verified at every stage of its lifecycle.
Why Data Quality as Code?
Traditional data quality testing often requires manual configuration through UIs or separate workflow systems. Data Quality as Code brings several advantages:
- Integration with ETL workflows: Run data quality tests directly within your existing Python-based ETL pipelines
- Version control: Manage test definitions alongside your code in version control systems
- Developer-friendly: Use familiar Python syntax and IDE features for test development
- Programmatic control: Dynamically generate tests based on data discovery or metadata
- Immediate feedback: Validate data transformations before loading to destinations
- Shared responsibility: Data stewards define tests in OpenMetadata UI, engineers execute them in code
Key Features
TestRunner API
Execute data quality tests against tables cataloged in OpenMetadata:
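The runner pattern looks roughly like the sketch below. The class and method names here (`TestRunner`, `add_check`, `run`) are illustrative stand-ins, not the SDK's exact API; see the TestRunner reference for the real signatures.

```python
# Sketch of the TestRunner pattern: register named checks for a table
# cataloged under its fully qualified name, then execute them in one pass.
from dataclasses import dataclass, field


@dataclass
class TestCaseResult:
    name: str
    passed: bool
    message: str = ""


def _safe(fn, rows):
    """Run one check; a failing check must not abort the whole run."""
    try:
        return bool(fn(rows)), ""
    except Exception as exc:
        return False, str(exc)


@dataclass
class TestRunner:
    """Minimal stand-in: runs named checks against rows fetched for a table."""
    table_fqn: str
    checks: list = field(default_factory=list)  # (name, callable) pairs

    def add_check(self, name, fn):
        self.checks.append((name, fn))
        return self

    def run(self, rows):
        return [TestCaseResult(name, *_safe(fn, rows)) for name, fn in self.checks]


# "mysql_prod.ecommerce.orders" is a made-up example FQN.
runner = TestRunner("mysql_prod.ecommerce.orders")
runner.add_check("row_count_above_zero", lambda rows: len(rows) > 0)
runner.add_check("no_null_order_id",
                 lambda rows: all(r["order_id"] is not None for r in rows))

rows = [{"order_id": 1}, {"order_id": 2}]
results = runner.run(rows)
print([(r.name, r.passed) for r in results])
# → [('row_count_above_zero', True), ('no_null_order_id', True)]
```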
DataFrame Validation
Validate pandas DataFrames before loading them to destinations:
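As a sketch of the idea, the hand-rolled helper below gates a load on a few common checks (nulls, uniqueness, emptiness); it is not the SDK's own DataFrame-validation entry point, which is documented in the reference pages.

```python
# Illustrative only: validate a pandas DataFrame before loading it.
import pandas as pd


def validate_dataframe(df: pd.DataFrame) -> list[str]:
    """Return a list of failure messages; an empty list means the frame passed."""
    failures = []
    if df.empty:
        failures.append("table is empty")
    if df["customer_id"].isna().any():
        failures.append("customer_id contains nulls")
    if df["email"].duplicated().any():
        failures.append("email values are not unique")
    return failures


df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com"],  # duplicate email
})

failures = validate_dataframe(df)
if failures:  # gate the load on the validation result
    print("blocked:", failures)
# → blocked: ['email values are not unique']
```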
Multiple Test Definition Sources
Define tests in three flexible ways:
- Inline code: Define tests directly in your Python code
- From OpenMetadata: Load test definitions configured in the OpenMetadata UI
- From YAML files: Load test configurations from YAML workflow files
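For the YAML route, a workflow file has roughly the shape below. The field names follow the OpenMetadata test-suite workflow convention, but treat this fragment as illustrative and consult the Advanced Usage reference for the exact schema of your SDK version; the service and table names are made-up examples.

```yaml
# Illustrative test workflow file (shape only -- verify against the docs).
source:
  type: testsuite
  serviceName: mysql_prod
  sourceConfig:
    config:
      type: TestSuite
      entityFullyQualifiedName: mysql_prod.ecommerce.orders
processor:
  type: orm-test-runner
  config:
    testCases:
      - name: orders_row_count
        testDefinitionName: tableRowCountToBeBetween
        parameterValues:
          - name: minValue
            value: 1
```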
Comprehensive Test Library
Access all test cases supported by OpenMetadata, covering:
- Table tests: Row counts, column counts, custom SQL queries, table diffs
- Column tests: Null checks, uniqueness, regex patterns, value ranges, statistical metrics
Use Cases
1. ETL Data Validation
Validate data after extraction and transformation, before loading:
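The extract → transform → validate → load gate can be sketched as follows (stdlib only; the `validate` step stands in for a real SDK test run):

```python
# Quality gate between transform and load: abort before any bad rows land.
def extract():
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3.0"}]


def transform(rows):
    return [{**r, "amount": float(r["amount"])} for r in rows]


def validate(rows):
    """Return the names of failed checks; empty list means the gate is open."""
    checks = {
        "non_empty": len(rows) > 0,
        "amounts_positive": all(r["amount"] > 0 for r in rows),
    }
    return [name for name, ok in checks.items() if not ok]


def load(rows):
    print(f"loaded {len(rows)} rows")


rows = transform(extract())
failed = validate(rows)
if failed:
    raise RuntimeError(f"quality gate failed: {failed}")  # abort before load
load(rows)
# → loaded 2 rows
```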
2. Collaborative Quality Management
Data stewards define tests in the UI, engineers run them in pipelines:
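The division of labor can be sketched like this: the definitions live in the catalog and the pipeline only executes them. Here `fetch_test_cases` is a hypothetical stand-in backed by a local dict, not a real SDK call.

```python
# Pattern sketch: steward-defined tests, keyed by table FQN, executed in code.
CATALOG = {  # stand-in for what a data steward configured in the UI
    "mysql_prod.ecommerce.orders": [
        ("row_count_above", {"min": 1}),
        ("column_not_null", {"column": "order_id"}),
    ],
}


def fetch_test_cases(table_fqn):
    """Hypothetical lookup of steward-defined tests for a table."""
    return CATALOG.get(table_fqn, [])


def execute(test, params, rows):
    if test == "row_count_above":
        return len(rows) > params["min"]
    if test == "column_not_null":
        return all(r[params["column"]] is not None for r in rows)
    raise ValueError(f"unknown test: {test}")


rows = [{"order_id": 1}, {"order_id": 2}]
results = {
    test: execute(test, params, rows)
    for test, params in fetch_test_cases("mysql_prod.ecommerce.orders")
}
print(results)
# → {'row_count_above': True, 'column_not_null': True}
```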
3. Chunk-Based Validation
Validate large datasets processed in chunks:
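A minimal chunking loop looks like the sketch below (stdlib only; swap the per-chunk check for real SDK test cases). Failures are recorded per chunk so one bad chunk does not stop validation of the rest.

```python
# Validate a large dataset chunk by chunk and aggregate the results.
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def check_chunk(chunk):
    """Per-chunk check for this example: every value must be non-negative."""
    return all(v >= 0 for v in chunk)


data = range(10_000)
report = {"chunks": 0, "failed": 0}
for chunk in chunks(data, 1_000):
    report["chunks"] += 1
    if not check_chunk(chunk):
        report["failed"] += 1  # record and keep validating remaining chunks

print(report)
# → {'chunks': 10, 'failed': 0}
```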
Getting Started
Install the SDK and configure authentication to get started.
- TestRunner - Table Testing: Run data quality tests against tables in OpenMetadata.
- DataFrame Validation: Validate pandas DataFrames before loading to destinations.
- Test Definitions Reference: Complete reference of all available test types and their parameters.
- Advanced Usage: Learn advanced patterns including YAML workflows, custom configurations, and result publishing.
- Run our tutorials with examples: Learn by doing with our Jupyter Notebook examples.
Requirements
- Python 3.10 or higher
- openmetadata-ingestion package version 1.11.0.0 or later
- Access to an OpenMetadata instance (1.11.0 or later)
- Valid JWT token for authentication
Architecture
Data Quality as Code integrates seamlessly with OpenMetadata's existing data quality infrastructure:
- Test Definitions: Tests can be defined in code, loaded from OpenMetadata, or imported from YAML files
- Execution Engine: Leverages OpenMetadata's proven test execution engine
- Result Publishing: Test results can be published back to OpenMetadata for visualization and alerting
- Service Connections: Automatically uses service connections configured in OpenMetadata
Next Steps
Ready to get started? Follow the Getting Started guide to install the SDK and run your first data quality test.