DataFrame Validation
The DataFrameValidator class enables you to validate pandas DataFrames directly within your ETL workflows, before data reaches its destination. This allows you to catch data quality issues early, preventing bad data from contaminating your data warehouse or analytics systems.
Overview
DataFrame validation is ideal for:
- Validating transformed data before loading to destinations
- Processing large datasets in chunks with memory efficiency
- Short-circuiting ETL pipelines on validation failures
- Providing immediate feedback during data transformations
- Publishing validation results back to OpenMetadata
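The validate-before-load and short-circuit behaviour listed above can be illustrated with a minimal sketch in plain pandas. It does not use the DataFrameValidator API; the `ValidationError` class, the check helpers, the `validate_then_load` function, and the `email` and `id` column names are all hypothetical and exist only for illustration.

```python
import pandas as pd


class ValidationError(Exception):
    """Raised when a DataFrame fails a quality check (hypothetical helper)."""


def check_not_null(df: pd.DataFrame, column: str) -> bool:
    # Passes only when the column has no missing values.
    return bool(df[column].notna().all())


def check_unique(df: pd.DataFrame, column: str) -> bool:
    # Passes only when every value in the column is distinct.
    return bool(df[column].is_unique)


def validate_then_load(df: pd.DataFrame, destination: list) -> None:
    # Run every check before touching the destination.
    checks = {
        "email not null": check_not_null(df, "email"),
        "id unique": check_unique(df, "id"),
    }
    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        # Short-circuit: nothing is loaded when any check fails.
        raise ValidationError(f"failed checks: {failed}")
    # The list stands in for a real destination such as a warehouse table.
    destination.extend(df.to_dict("records"))
```

Because the checks run before the load step, a failing DataFrame leaves the destination untouched, and the exception message names the checks that failed.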
Basic Usage
Creating a Validator
Adding Tests
Add test definitions to validate your DataFrame:
Validating a DataFrame
Complete ETL Example
Here’s a complete example of validating transformed data in an ETL pipeline:
Using Tests from OpenMetadata
Instead of defining tests in code, you can load tests that are configured in OpenMetadata. This approach offers:
- Separation of concerns: Data stewards define quality criteria in the UI, engineers execute them in code
- Dynamic test updates: Test criteria changes don’t require code deployments
- Consistency: Same tests used for table validation and DataFrame validation
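The idea of keeping test criteria outside the pipeline code can be sketched as data-driven checks. This is an illustration under stated assumptions, not how OpenMetadata stores or serves tests: the `TEST_SPECS` list stands in for definitions fetched from a catalog, and `CHECKS`, `run_specs`, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical test specs, standing in for definitions fetched from a catalog.
TEST_SPECS = [
    {"name": "not_null", "column": "email"},
    {"name": "between", "column": "age", "min": 0, "max": 120},
]

# Registry mapping a test name to the function that evaluates it.
CHECKS = {
    "not_null": lambda df, spec: bool(df[spec["column"]].notna().all()),
    "between": lambda df, spec: bool(
        df[spec["column"]].between(spec["min"], spec["max"]).all()
    ),
}


def run_specs(df: pd.DataFrame, specs: list) -> dict:
    # Map each spec to a pass/fail result keyed by test name and column.
    return {
        f'{spec["name"]}:{spec["column"]}': CHECKS[spec["name"]](df, spec)
        for spec in specs
    }
```

Changing a threshold in the specs changes what the pipeline enforces without editing `run_specs`, which is the property the "dynamic test updates" point describes.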
Next Steps
Chunk-Based Validation
Validate large DataFrames in memory-efficient chunks with transactional safety and automatic failure handling.
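As a rough illustration of the chunked approach, the sketch below validates a DataFrame slice by slice and commits rows only if every chunk passes. It uses plain pandas rather than the SDK; `iter_chunks`, `load_in_chunks`, the `is_valid` callback, and the list used as a destination are hypothetical names chosen for this example.

```python
import pandas as pd


def iter_chunks(df: pd.DataFrame, chunk_size: int):
    # Yield consecutive row slices of at most chunk_size rows.
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


def load_in_chunks(df, chunk_size, is_valid, destination) -> bool:
    staged = []
    for chunk in iter_chunks(df, chunk_size):
        if not is_valid(chunk):
            # Stop at the first failing chunk; staged rows are discarded.
            return False
        staged.extend(chunk.to_dict("records"))
    # Commit only after every chunk has passed.
    destination.extend(staged)
    return True
```

Staging rows in a Python list keeps the example short. A real pipeline would more likely write each chunk inside a database transaction and roll it back on failure, so that staged data does not accumulate in memory.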