OpenMetadata Documentation

DataFrame Validation

The DataFrameValidator class enables you to validate pandas DataFrames directly within your ETL workflows, before data reaches its destination. This allows you to catch data quality issues early, preventing bad data from contaminating your data warehouse or analytics systems.

DataFrame validation is ideal for:

  • Validating transformed data before loading to destinations
  • Processing large datasets in chunks with memory efficiency
  • Short-circuiting ETL pipelines on validation failures
  • Providing immediate feedback during data transformations
  • Publishing validation results back to OpenMetadata

Add test definitions to validate your DataFrame:
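Since the original code sample is not shown here, the sketch below illustrates the idea with plain pandas and hypothetical test-definition names and shapes — the real SDK's definition objects and identifiers may differ:

```python
import pandas as pd

# Hypothetical test definitions (names and structure are illustrative,
# not the SDK's real objects).
test_definitions = [
    {"test": "columnValuesToNotBeNull", "column": "customer_id"},
    {"test": "columnValuesToBeBetween", "column": "amount",
     "min": 0, "max": 10_000},
]

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "amount": [250.0, 9_999.0, 15.5],
})

def check(df: pd.DataFrame, definition: dict) -> bool:
    """Evaluate one definition against the DataFrame (sketch only)."""
    col = df[definition["column"]]
    if definition["test"] == "columnValuesToNotBeNull":
        return bool(col.notna().all())
    if definition["test"] == "columnValuesToBeBetween":
        return bool(col.between(definition["min"], definition["max"]).all())
    raise ValueError(f"unknown test {definition['test']}")

results = {d["test"]: check(df, d) for d in test_definitions}
```

Each definition names a test type, a target column, and any bounds; the validator evaluates them all against the DataFrame.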

Here's a complete example of validating transformed data in an ETL pipeline:
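The original example is not reproduced here, so the following is a minimal end-to-end sketch of the pattern — extract, transform, validate, then load — with the validation step written as a plain-pandas stand-in (the real `DataFrameValidator` API may differ):

```python
import pandas as pd

class ValidationError(Exception):
    """Raised to short-circuit the pipeline on a failed check."""

def extract() -> pd.DataFrame:
    # Stand-in for reading from a source system.
    return pd.DataFrame({"order_id": [1, 2, 3], "total": [10.0, -5.0, 7.5]})

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Drop refunds (negative totals) before loading.
    return df[df["total"] >= 0].reset_index(drop=True)

def validate(df: pd.DataFrame) -> None:
    if df["order_id"].isna().any():
        raise ValidationError("order_id contains nulls")
    if (df["total"] < 0).any():
        raise ValidationError("total contains negative values")

loaded = []  # stand-in destination

def load(df: pd.DataFrame) -> None:
    loaded.append(df)

df = transform(extract())
validate(df)  # raises ValidationError on bad data, halting the run
load(df)
```

Because `validate()` raises before `load()` runs, bad data never reaches the destination.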

Instead of defining tests in code, load tests that are configured in OpenMetadata:
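As a sketch of that workflow, the snippet below fetches test cases by table fully qualified name and runs them locally. `fetch_test_cases` is a hypothetical stand-in for the SDK call that retrieves tests configured in OpenMetadata; the real client API may differ:

```python
import pandas as pd

def fetch_test_cases(table_fqn: str) -> list:
    # Hypothetical: a real pipeline would query the OpenMetadata server
    # for the test cases attached to this table.
    return [
        {"test": "columnValuesToNotBeNull", "column": "email"},
    ]

def run_test(df: pd.DataFrame, case: dict) -> bool:
    if case["test"] == "columnValuesToNotBeNull":
        return bool(df[case["column"]].notna().all())
    raise ValueError(f"unsupported test {case['test']}")

df = pd.DataFrame({"email": ["a@x.io", "b@x.io"]})
cases = fetch_test_cases("postgres.sales.public.customers")
passed = all(run_test(df, c) for c in cases)
```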

This approach enables:

  • Separation of concerns: Data stewards define quality criteria in the UI; engineers execute them in code
  • Dynamic test updates: Test criteria changes don't require code deployments
  • Consistency: Same tests used for table validation and DataFrame validation

For large datasets that don't fit in memory, validate data in chunks:
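A minimal sketch of the chunked pattern, using a plain generator in place of the SDK's chunk handling: stream fixed-size slices and stop at the first failing chunk so only a bounded amount of data is ever in memory.

```python
import pandas as pd

def chunks(df: pd.DataFrame, size: int):
    """Yield fixed-size slices of the DataFrame."""
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size]

df = pd.DataFrame({"qty": range(10)})  # stand-in for a large dataset

checked = 0
for chunk in chunks(df, size=4):
    # Column-level check that is safe to run per chunk.
    if (chunk["qty"] < 0).any():
        raise ValueError("negative qty found; aborting load")
    checked += len(chunk)
```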

The run() method provides a cleaner approach with automatic chunk handling:
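The `run()` call below is a hypothetical stand-in mirroring the described behavior — it iterates chunks internally and short-circuits on the first failure; the real `DataFrameValidator.run()` signature may differ:

```python
import pandas as pd

class SketchValidator:
    """Illustrative validator; not the SDK class."""

    def __init__(self, checks):
        self.checks = checks  # list of callables: DataFrame -> bool

    def run(self, chunk_iter) -> bool:
        for chunk in chunk_iter:
            if not all(check(chunk) for check in self.checks):
                return False  # short-circuit on first failing chunk
        return True

df = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0]})
chunk_iter = (df.iloc[i:i + 2] for i in range(0, len(df), 2))

validator = SketchValidator(checks=[lambda c: bool((c["price"] > 0).all())])
ok = validator.run(chunk_iter)
```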

Use a context manager to ensure atomic transactions:
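The sketch below shows the transactional idea with a stdlib context manager: chunks are staged inside the `with` block and committed only if every chunk passed, so a failure leaves the destination untouched. The `transaction` helper here is illustrative, not the SDK's API:

```python
import pandas as pd
from contextlib import contextmanager

committed = []  # stand-in destination

@contextmanager
def transaction():
    staged = []
    yield staged              # chunks are staged inside the with-block
    committed.extend(staged)  # reached only if no exception was raised

df = pd.DataFrame({"amount": [5, 10, 15, 20]})

with transaction() as staged:
    for start in range(0, len(df), 2):
        chunk = df.iloc[start:start + 2]
        if (chunk["amount"] < 0).any():
            raise ValueError("validation failed; nothing is committed")
        staged.append(chunk)
```

If any chunk fails, the exception propagates out of the `with` block before the commit line runs — all-or-nothing behavior.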

As of version 1.11.0.0 of the SDK, DataFrameValidator supports only one failure mode: short circuit.

Future versions will add modes to report failing rows or skip failing batches.

After validation, publish results back to OpenMetadata for tracking and alerting:
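As a sketch of the publishing step: build a result payload and hand it to the client. `publish_test_result` is a hypothetical stand-in for the SDK call that writes results to the OpenMetadata server; the real payload schema and client method may differ:

```python
from datetime import datetime, timezone

sink = []  # stand-in for the OpenMetadata server

def publish_test_result(payload: dict) -> None:
    # Hypothetical: real code would send this to OpenMetadata.
    sink.append(payload)

result = {
    "testCase": "columnValuesToNotBeNull",
    "status": "Success",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}
publish_test_result(result)
```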

This enables:

  • Historical tracking of data quality trends
  • Alerting on validation failures
  • Visualization in OpenMetadata UI
  • Centralized data quality reporting

When using chunk-based validation, be aware of tests that require the full dataset.

Some tests analyze the entire dataset and may produce incorrect results when run on chunks:

  • TableRowCountToBeBetween: Counts rows in each chunk, not the full dataset
  • TableRowCountToEqual: Validates chunk size, not full dataset size
  • ColumnValuesSumToBeBetween: Sums values per chunk, not across all data
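
To see why, here is a plain-pandas illustration of the row-count case: each chunk reports its own count, so a bound that holds for the full dataset can fail for every individual chunk.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})
chunk_counts = [len(df.iloc[i:i + 4]) for i in range(0, len(df), 4)]

# A bound like "row count between 8 and 12" holds for the full dataset
# but fails for every individual chunk.
full_ok = 8 <= len(df) <= 12
chunk_ok = [8 <= n <= 12 for n in chunk_counts]
```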

The SDK will issue a warning when such tests are detected during chunk-based validation.

For datasets that don't fit in memory and require full-table tests:

  1. Use TestRunner to validate after loading
  2. Focus DataFrame validation on column-level tests that do not require aggregation
  3. Split validation into two phases:
    • During ETL: Validate column-level quality with DataFrameValidator
    • After loading: Validate table-level metrics with TestRunner

Example two-phase approach:
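Since the original example is not shown, the sketch below illustrates the two phases with plain pandas: per-chunk column checks during ETL, then a table-level row-count check after loading (phase 2 stands in for a TestRunner run against the loaded table):

```python
import pandas as pd

df = pd.DataFrame({"id": range(6), "value": [1, 2, 3, 4, 5, 6]})
warehouse = []  # stand-in destination

# Phase 1: column-level validation per chunk during ETL.
for start in range(0, len(df), 3):
    chunk = df.iloc[start:start + 3]
    if chunk["id"].isna().any():
        raise ValueError("phase 1 failed: id contains nulls")
    warehouse.append(chunk)

# Phase 2: table-level validation after loading (full row count).
loaded = pd.concat(warehouse, ignore_index=True)
row_count_ok = 5 <= len(loaded) <= 10
```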

  1. Validate before loading: Catch issues before contaminating your warehouse

  2. Use transactional chunk processing: Ensure atomic all-or-nothing behavior

  3. Leverage OpenMetadata tests: Let data stewards define quality criteria

  4. Publish results: Enable tracking and alerting

  5. Handle failures gracefully: Don't silently fail

  6. Use appropriate tests for chunks: Avoid full-table tests when processing chunks

Handle validation errors appropriately:
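A minimal error-handling sketch: log the failure and halt the pipeline rather than swallowing the error and loading bad data. `ValidationError` is illustrative, not an SDK class:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

class ValidationError(Exception):
    pass

def validate(df: pd.DataFrame) -> None:
    if df["amount"].isna().any():
        raise ValidationError("amount contains nulls")

df = pd.DataFrame({"amount": [1.0, None]})
halted = False
try:
    validate(df)
except ValidationError as exc:
    log.error("validation failed: %s", exc)
    halted = True  # in a real job: exit non-zero / re-raise to the orchestrator
```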