DataFrame Validation
The DataFrameValidator class enables you to validate pandas DataFrames directly within your ETL workflows, before data reaches its destination. This allows you to catch data quality issues early, preventing bad data from contaminating your data warehouse or analytics systems.
Overview
DataFrame validation is ideal for:
- Validating transformed data before loading to destinations
- Processing large datasets in chunks with memory efficiency
- Short-circuiting ETL pipelines on validation failures
- Providing immediate feedback during data transformations
- Publishing validation results back to OpenMetadata
Basic Usage
Creating a Validator
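The constructor's exact signature isn't shown on this page, so the snippet below uses a minimal hypothetical stand-in to illustrate the likely shape: a validator bound to a table's fully qualified name, holding a test list to be populated later. The class body here is an assumption, not the SDK's real implementation.

```python
# Hypothetical stand-in for the SDK's DataFrameValidator; the real class,
# import path, and constructor arguments may differ.
class DataFrameValidator:
    def __init__(self, table_fqn: str):
        # Fully qualified name of the table these results map to in OpenMetadata
        self.table_fqn = table_fqn
        # Test definitions added later via the "Adding Tests" step
        self.tests: list[dict] = []

validator = DataFrameValidator("warehouse.sales.orders")
```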
Adding Tests
Add test definitions to validate your DataFrame:
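A sketch of what test definitions might look like. The definition names (`columnValuesToBeNotNull`, `columnValuesToBeBetween`, `tableRowCountToBeBetween`) follow OpenMetadata's test definitions, but the dict shape and the way tests are attached to the validator are assumptions for illustration.

```python
# Illustrative test definitions; the exact structure the SDK expects may differ.
tests = [
    {"testDefinition": "columnValuesToBeNotNull", "column": "order_id"},
    {"testDefinition": "columnValuesToBeBetween",
     "column": "amount", "minValue": 0, "maxValue": 100_000},
    {"testDefinition": "tableRowCountToBeBetween",
     "minValue": 1, "maxValue": 1_000_000},
]
```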
Validating a DataFrame
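To make the mechanics concrete, here is a minimal functional sketch of what executing one test against a DataFrame involves; the result dict shape is hypothetical, not the SDK's actual result type.

```python
import pandas as pd

def run_not_null(df: pd.DataFrame, column: str) -> dict:
    # columnValuesToBeNotNull semantics: fail if any nulls are present
    nulls = int(df[column].isna().sum())
    return {
        "test": "columnValuesToBeNotNull",
        "column": column,
        "status": "Success" if nulls == 0 else "Failed",
        "nullCount": nulls,
    }

df = pd.DataFrame({"order_id": [1, 2, None], "amount": [10.0, 20.0, 30.0]})
result = run_not_null(df, "order_id")
print(result["status"])  # → Failed (one null order_id)
```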
Complete ETL Example
Here's a complete example of validating transformed data in an ETL pipeline:
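A compact sketch of the extract → transform → validate → load flow, with validation short-circuiting the pipeline before the load step. The `transform`, `validate`, and `load` functions are illustrative placeholders, not SDK APIs.

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Example transformation: default missing amounts to 0 and floor negatives
    out = df.copy()
    out["amount"] = out["amount"].fillna(0).clip(lower=0)
    return out

def validate(df: pd.DataFrame) -> list[str]:
    # Column-level checks mirroring columnValuesToBeNotNull / ...ToBeBetween
    failures = []
    if df["order_id"].isna().any():
        failures.append("columnValuesToBeNotNull failed for order_id")
    if not df["amount"].between(0, 100_000).all():
        failures.append("columnValuesToBeBetween failed for amount")
    return failures

def load(df: pd.DataFrame) -> None:
    pass  # write to the warehouse here

raw = pd.DataFrame({"order_id": [1, 2, 3], "amount": [50.0, None, -5.0]})
clean = transform(raw)
failures = validate(clean)
if failures:
    # Short-circuit: bad data never reaches the destination
    raise RuntimeError(f"Validation failed: {failures}")
load(clean)
```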
Using Tests from OpenMetadata
Instead of defining tests in code, load tests that are configured in OpenMetadata:
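A hedged sketch of the server-driven pattern: fetch test cases configured in OpenMetadata by table FQN, then execute them locally. The function name and the canned response below are hypothetical; the real SDK call for retrieving test cases may differ.

```python
# Hypothetical fetch; in the real flow this would query the OpenMetadata
# server for test cases attached to the given table.
def fetch_test_cases(table_fqn: str) -> list[dict]:
    # Canned response illustrating the expected shape
    return [
        {"name": "order_id_not_null",
         "testDefinition": "columnValuesToBeNotNull",
         "column": "order_id"},
    ]

tests = fetch_test_cases("warehouse.sales.orders")
```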
This approach enables:
- Separation of concerns: data stewards define quality criteria in the UI, while engineers execute them in code
- Dynamic test updates: Test criteria changes don't require code deployments
- Consistency: Same tests used for table validation and DataFrame validation
Chunk-Based Validation
For large datasets that don't fit in memory, validate data in chunks:
Method 1: Manual Chunk Validation
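A sketch of the manual pattern: iterate over chunks, run column-level checks on each, and stop at the first failure. In a real pipeline the chunks would typically come from `pd.read_csv(..., chunksize=...)` or a paginated query; here they are simulated by slicing.

```python
import pandas as pd

def validate_chunk(chunk: pd.DataFrame) -> bool:
    # Column-level check that is safe to evaluate per chunk
    return not chunk["amount"].lt(0).any()

# Simulate a chunked source by slicing one frame into pieces of 4 rows
df = pd.DataFrame({"amount": range(10)})
chunks = [df.iloc[i:i + 4] for i in range(0, len(df), 4)]

all_passed = True
for chunk in chunks:
    if not validate_chunk(chunk):
        all_passed = False
        break  # short-circuit on the first failing chunk
```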
Method 2: Using the run() Method
The run() method provides a cleaner approach with automatic chunk handling:
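The SDK's actual `run()` signature isn't documented here, so the helper below is a hypothetical sketch of what "automatic chunk handling" means: loop over chunks, collect a result per chunk, and short-circuit on failure.

```python
# Hypothetical sketch of a run()-style loop; the SDK's real signature
# and result objects may differ.
def run(chunks, check):
    results = []
    for i, chunk in enumerate(chunks):
        passed = check(chunk)
        results.append({"chunk": i, "status": "Success" if passed else "Failed"})
        if not passed:
            break  # short-circuit failure mode: stop at the first failure
    return results

results = run([[1, 2], [3, -1], [5]], check=lambda c: min(c) >= 0)
```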
Transaction-Safe Chunk Processing
Use a context manager to ensure atomic transactions:
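A self-contained sketch of the all-or-nothing idea using Python's `contextlib`: chunks are staged inside the context, and a validation failure in any chunk rolls back everything staged so far. The staging dict is a stand-in for whatever transactional destination the real pipeline uses.

```python
from contextlib import contextmanager

@contextmanager
def transaction(staged: dict):
    # Commit only if every chunk inside the block validated successfully
    try:
        yield staged
        staged["committed"] = True   # all chunks passed
    except Exception:
        staged.clear()               # roll back every staged chunk
        raise

staged = {"rows": []}
try:
    with transaction(staged):
        for chunk in ([1, 2], [3, -1]):
            if min(chunk) < 0:
                raise ValueError("validation failed")
            staged["rows"].extend(chunk)
except ValueError:
    pass

# staged is empty: the failing chunk rolled back earlier chunks too
```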
Failure Modes
As of SDK version 1.11.0.0, DataFrameValidator supports a single failure mode: short-circuit.
Future versions will add modes for reporting failing rows or skipping failing batches.
Working with Validation Results
Accessing Test Results
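An illustrative look at iterating over results and filtering for failures. The dict shape below is an assumption about what a result carries (test case name, status, a result message); the SDK's actual result objects may differ.

```python
# Illustrative result records; the SDK's actual result type may differ.
results = [
    {"testCase": "order_id_not_null", "status": "Success",
     "result": "0 null values"},
    {"testCase": "amount_in_range", "status": "Failed",
     "result": "2 values out of range"},
]

failed = [r for r in results if r["status"] == "Failed"]
for r in failed:
    print(f"{r['testCase']}: {r['result']}")
```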
Merging Results from Multiple Chunks
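A sketch of the merge rule you would want when combining per-chunk results: a test passes overall only if it passed in every chunk, so a single `Failed` is sticky. The merge helper is illustrative, not an SDK API.

```python
# Hypothetical merge; a "Failed" status in any chunk wins overall.
def merge_results(per_chunk: list[dict]) -> dict:
    merged: dict = {}
    for chunk_results in per_chunk:
        for name, status in chunk_results.items():
            if merged.get(name) == "Failed":
                continue  # a failure in an earlier chunk is final
            merged[name] = status
    return merged

merged = merge_results([
    {"order_id_not_null": "Success", "amount_in_range": "Success"},
    {"order_id_not_null": "Success", "amount_in_range": "Failed"},
])
```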
Publishing Results to OpenMetadata
After validation, publish results back to OpenMetadata for tracking and alerting:
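A hedged sketch of the publishing pattern. The client class and its method name below are stand-ins invented for illustration; the real SDK client for submitting test case results may look different.

```python
# Stand-in client invented for illustration; the real OpenMetadata client
# and its method names may differ.
class FakeOpenMetadataClient:
    def __init__(self):
        self.published = []

    def add_test_case_result(self, table_fqn: str, result: dict):
        # In the real flow this would call the OpenMetadata API
        self.published.append((table_fqn, result))

client = FakeOpenMetadataClient()
for result in [{"testCase": "order_id_not_null", "status": "Success"}]:
    client.add_test_case_result("warehouse.sales.orders", result)
```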
This enables:
- Historical tracking of data quality trends
- Alerting on validation failures
- Visualization in OpenMetadata UI
- Centralized data quality reporting
Important Considerations for Chunk-Based Validation
When using chunk-based validation, be aware of tests that require the full dataset:
Tests That Require Full Table
Some tests analyze the entire dataset and may produce incorrect results when run on chunks:
- TableRowCountToBeBetween: counts rows in each chunk, not the full dataset
- TableRowCountToEqual: validates chunk size, not full dataset size
- ColumnValuesSumToBeBetween: sums values per chunk, not across all data
The SDK will issue a warning when such tests are detected in a chunked validation run.
Recommended Approach
For datasets that don't fit in memory and require full-table tests:
- Use TestRunner to validate after loading
- Focus DataFrame validation on column-level tests that do not require aggregation
- Split validation into two phases:
- During ETL: Validate column-level quality with DataFrameValidator
- After loading: Validate table-level metrics with TestRunner
Example two-phase approach:
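A minimal sketch of the split: phase 1 runs chunk-safe column checks during ETL, and phase 2 runs a full-table check (here, total row count) only after all chunks are loaded. The functions are illustrative placeholders standing in for DataFrameValidator and TestRunner respectively.

```python
import pandas as pd

# Phase 1 (during ETL): column-level checks that are valid per chunk
def validate_chunk(chunk: pd.DataFrame) -> bool:
    return not chunk["amount"].isna().any()

# Phase 2 (after loading): full-table metric, e.g. tableRowCountToBeBetween
def validate_table(total_rows: int) -> bool:
    return 1 <= total_rows <= 1_000_000

df = pd.DataFrame({"amount": [1.0, 2.0, 3.0, 4.0]})
chunks = [df.iloc[:2], df.iloc[2:]]

loaded = 0
for chunk in chunks:
    if not validate_chunk(chunk):
        raise RuntimeError("chunk validation failed")  # short-circuit during ETL
    loaded += len(chunk)  # load(chunk) would go here

table_ok = validate_table(loaded)  # runs only once all chunks are loaded
```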
Best Practices
Validate before loading: Catch issues before contaminating your warehouse
Use transactional chunk processing: Ensure atomic all-or-nothing behavior
Leverage OpenMetadata tests: Let data stewards define quality criteria
Publish results: Enable tracking and alerting
Handle failures gracefully: Don't silently fail
Use appropriate tests for chunks: Avoid full-table tests when processing chunks
Error Handling
Handle validation errors appropriately:
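A short sketch of the "don't silently fail" practice: log every failure, then raise so the pipeline stops, letting the orchestrator decide whether to retry or alert. The helper is illustrative, not an SDK API.

```python
import logging

logger = logging.getLogger("etl")

def validate_or_raise(failures: list[str]) -> None:
    # Surface validation failures loudly instead of continuing silently
    if failures:
        for f in failures:
            logger.error("Validation failure: %s", f)
        raise ValueError(f"{len(failures)} validation failure(s)")

try:
    validate_or_raise(["amount out of range"])
except ValueError as exc:
    handled = str(exc)  # in a real pipeline: alert, then re-raise or abort
```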
Next Steps
- Review the Test Definitions Reference for all available tests
- Learn about TestRunner for validating tables after loading
- Explore Advanced Usage patterns and configurations