Learn type checks, range checks, cross-reference validation, early validation benefits, and assert/try-except patterns for data validation.
What Is Data Validation?
Data validation is the process of checking whether data conforms to a set of predefined rules, constraints, or expectations before it is used in analysis, stored in a database, or passed to downstream systems. Validation acts as a gatekeeper: it prevents corrupt, incomplete, or nonsensical data from entering your pipeline.
At its core, validation answers one question: "Does this data look the way it should?" The three most common categories of validation are:
| Validation Type | What It Checks | Example |
| --- | --- | --- |
| Type validation | Data is of the correct type (int, float, str, date, etc.) | Age should be an integer, not a string |
| Range validation | Values fall within an acceptable range | Temperature between −50 and 60 °C |
| Cross-field validation | Logical consistency between two or more fields | end_date must be after start_date |
Exam Note: This course expects you to know the difference between type, range, and cross-field validation and to identify the appropriate method for a given scenario.
Validation Tools and Approaches
In Python-based data workflows, validation is typically implemented through:
Python logic: Custom if statements, assert, and try/except blocks
Schema checks: Defining the expected shape and types of data (column names, dtypes, constraints) and programmatically verifying incoming data against that schema
Library-based validators: Tools like pandas dtype checks, isinstance(), or dedicated validation libraries
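As a quick preview, here is a minimal sketch (the record layout is illustrative, not from a specific dataset) that touches all three approaches on a single record; each technique is covered in depth in the sections that follow.

# Minimal sketch: the three validation approaches side by side
record = {'age': 25, 'email': 'alice@example.com'}

# 1. Python logic: a plain if check with an explicit raise
if record['age'] < 0:
    raise ValueError("age cannot be negative")

# 2. Schema check: verify the expected fields are present
expected_fields = {'age', 'email'}
missing = expected_fields - record.keys()
if missing:
    raise ValueError(f"missing fields: {missing}")

# 3. Library/built-in validator: isinstance()
assert isinstance(record['age'], int), "age must be an integer"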
Type Checks: Ensuring Correct Data Types
Type validation ensures that every value in your dataset is of the expected Python or pandas type. When data is loaded from CSVs, APIs, or user input, types are often inferred — and frequently inferred incorrectly. A column that should contain integers may be read as strings. A date column may come through as plain text.
Using isinstance() in Python
The built-in isinstance() function checks whether an object belongs to a specified type (or tuple of types). It returns True or False.
# Basic isinstance() checks
age = 25
name = "Alice"
salary = 55000.50

print(isinstance(age, int))       # True
print(isinstance(name, str))      # True
print(isinstance(salary, float))  # True

# Check against multiple types
print(isinstance(age, (int, float)))  # True - accepts either

# Validation function using isinstance
def validate_record(record):
    """Validate a single data record."""
    errors = []
    if not isinstance(record['id'], int):
        errors.append("id must be an integer")
    if not isinstance(record['name'], str):
        errors.append("name must be a string")
    if not isinstance(record['score'], (int, float)):
        errors.append("score must be numeric")
    return errors
record = {'id': 1, 'name': "Alice", 'score': 92.5}
print(validate_record(record)) # [] (no errors)
bad_record = {'id': "one", 'name': 42, 'score': "high"}
print(validate_record(bad_record))
# ['id must be an integer', 'name must be a string', 'score must be numeric']
Using pandas dtype Checks
When working with DataFrames, you validate types through the .dtypes attribute or the .select_dtypes() method. You can also enforce types with .astype().
import pandas as pd
df = pd.DataFrame({
    'employee_id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [55000, 62000, 48000],
    'hire_date': ['2020-01-15', '2019-06-01', '2021-03-20']
})

# Inspect current dtypes
print(df.dtypes)
# employee_id     int64
# name           object
# salary          int64
# hire_date      object   <-- should be datetime!

# Define expected dtypes
expected_dtypes = {
    'employee_id': 'int64',
    'name': 'object',
    'salary': 'int64',
    'hire_date': 'datetime64[ns]'
}

# Validate and fix dtypes
def validate_dtypes(df, expected):
    """Check if DataFrame columns match expected dtypes."""
    mismatches = {}
    for col, dtype in expected.items():
        if col not in df.columns:
            mismatches[col] = "column missing"
        elif str(df[col].dtype) != dtype:
            mismatches[col] = f"expected {dtype}, got {df[col].dtype}"
    return mismatches
# Convert hire_date to datetime first
df['hire_date'] = pd.to_datetime(df['hire_date'])
print(validate_dtypes(df, expected_dtypes)) # {} (all match)
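The .select_dtypes() method mentioned earlier is useful for pulling out all columns of a given kind, for example to run numeric-only checks. A quick sketch, continuing with the DataFrame above:

# Select all numeric columns, then all object (string-like) columns
numeric_cols = df.select_dtypes(include='number').columns.tolist()
object_cols = df.select_dtypes(include='object').columns.tolist()
print(numeric_cols)  # ['employee_id', 'salary']
print(object_cols)   # ['name']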
Best Practice: Always check and explicitly convert dtypes immediately after loading data. Use pd.to_datetime() for dates, pd.to_numeric() for numbers, and .astype() for categorical or string conversions.
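For illustration, a minimal sketch of that load-then-convert pattern (the column names and values here are made up to stand in for a freshly loaded CSV):

import pandas as pd

# Hypothetical raw data as it might arrive from pd.read_csv()
df = pd.DataFrame({
    'hire_date': ['2020-01-15', '2019-06-01'],
    'salary': ['55000', '62000'],
    'department': ['HR', 'IT']
})

df['hire_date'] = pd.to_datetime(df['hire_date'])       # object -> datetime64[ns]
df['salary'] = pd.to_numeric(df['salary'])              # strings -> int64
df['department'] = df['department'].astype('category')  # object -> category
print(df.dtypes)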
Range Checks: Ensuring Values Fall Within Expected Ranges
Range validation ensures that numeric or date values fall within acceptable boundaries. Values outside these bounds are likely errors — for example, a human age of 250 or a percentage greater than 100.
import pandas as pd
# Simple range check function
def validate_range(value, min_val, max_val, field_name="value"):
    """Check if a value is within the expected range."""
    if value is None:
        return f"{field_name} is missing (None)"
    if not (min_val <= value <= max_val):
        return f"{field_name}={value} is out of range [{min_val}, {max_val}]"
    return None  # No error

# Test the function
print(validate_range(25, 0, 120, "age"))     # None (valid)
print(validate_range(-5, 0, 120, "age"))     # "age=-5 is out of range [0, 120]"
print(validate_range(105, 0, 100, "score"))  # "score=105 is out of range [0, 100]"

# Range validation on a pandas DataFrame
df = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [20, -3, 22, 200],
    'grade': [85, 92, 110, 78]
})
# Flag rows with out-of-range ages (valid: 0-120)
invalid_age = df[(df['age'] < 0) | (df['age'] > 120)]
print("Invalid ages:")
print(invalid_age)
#   student  age  grade
# 1     Bob   -3     92
# 3   Diana  200     78

# Flag rows with out-of-range grades (valid: 0-100)
invalid_grade = df[(df['grade'] < 0) | (df['grade'] > 100)]
print("Invalid grades:")
print(invalid_grade)
#    student  age  grade
# 2  Charlie   22    110

# Using .between() for cleaner range checks
df['age_valid'] = df['age'].between(0, 120)
df['grade_valid'] = df['grade'].between(0, 100)
print(df)
Common Pitfall: Be careful with floating-point comparisons. A value like 99.999999999 might technically be within range but could indicate a precision issue. Always consider how your data is produced when setting boundaries.
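One common mitigation, sketched below with an assumed tolerance value, is to allow a small epsilon at the boundaries so that harmless precision drift passes while real violations still fail:

# Range check with a small tolerance at the boundaries
# (eps=1e-6 is an illustrative choice, not a universal constant)
def in_range_with_tolerance(value, min_val, max_val, eps=1e-6):
    """Accept values within eps of the boundaries to absorb float drift."""
    return (min_val - eps) <= value <= (max_val + eps)

print(in_range_with_tolerance(100.0000000001, 0, 100))  # True - drift absorbed
print(in_range_with_tolerance(100.1, 0, 100))           # False - real violation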
Cross-Reference Checks: Validating Data Against Other Fields
Cross-field (or cross-reference) validation checks the logical relationship between two or more columns in a dataset. Unlike type or range checks that examine a single field in isolation, cross-field checks ensure that the combination of values makes sense.
Common examples include:
end_date must be greater than or equal to start_date
min_salary must be less than or equal to max_salary
If status is "shipped", then shipping_date must not be null
discount_price must be less than original_price
import pandas as pd
# Cross-field validation: end_date > start_date
df = pd.DataFrame({
    'project': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'start_date': pd.to_datetime(
        ['2024-01-01', '2024-03-15', '2024-06-01', '2024-09-01']
    ),
    'end_date': pd.to_datetime(
        ['2024-06-30', '2024-02-28', '2024-12-31', '2024-08-15']
    )
})
# Find rows where end_date is before start_date
invalid_dates = df[df['end_date'] < df['start_date']]
print("Projects with invalid date ranges:")
print(invalid_dates)
#   project start_date   end_date
# 1    Beta 2024-03-15 2024-02-28
# 3   Delta 2024-09-01 2024-08-15

# Cross-field validation function for multiple rules
def validate_cross_fields(df):
    """Apply multiple cross-field validation rules."""
    errors = []

    # Rule 1: end_date must be >= start_date
    mask = df['end_date'] < df['start_date']
    if mask.any():
        bad_rows = df[mask].index.tolist()
        errors.append(f"end_date < start_date at rows: {bad_rows}")

    # Rule 2: budget must be positive if status is 'active'
    if 'budget' in df.columns and 'status' in df.columns:
        mask2 = (df['status'] == 'active') & (df['budget'] <= 0)
        if mask2.any():
            bad_rows = df[mask2].index.tolist()
            errors.append(f"Active projects with non-positive budget at rows: {bad_rows}")

    return errors
errors = validate_cross_fields(df)
for e in errors:
print(f"VALIDATION ERROR: {e}")
Real-World Pattern: In production pipelines, cross-field validation rules are often stored in a configuration file or database table so they can be updated without changing code. Each rule is a function or expression that returns True/False for each row.
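A minimal sketch of that pattern, assuming the rules live in a plain dict of named expressions (in practice they might be loaded from YAML, JSON, or a database table). It reuses the project DataFrame from the example above:

# Each rule maps a name to an expression returning a boolean Series
# (True = row passes). The rule set here is illustrative.
CROSS_FIELD_RULES = {
    'end_after_start': lambda d: d['end_date'] >= d['start_date'],
}

def apply_rules(df, rules):
    """Run every rule and report the failing row indices per rule."""
    failures = {}
    for name, rule in rules.items():
        passed = rule(df)
        if not passed.all():
            failures[name] = df.index[~passed].tolist()
    return failures

print(apply_rules(df, CROSS_FIELD_RULES))  # {'end_after_start': [1, 3]}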
Benefits of Early Type Checks in Ingestion Scripts
Performing validation as early as possible in a data pipeline — ideally during ingestion (the point where data first enters your system) — offers several critical advantages:
| Benefit | Explanation |
| --- | --- |
| Fail fast | Bad data is caught immediately, before it contaminates downstream processes or storage |
| Lower debugging cost | Errors detected at the source are much easier to trace and fix than errors found in final reports |
| Data trust | Analysts and stakeholders can trust that data passing validation is clean and consistent |
| Reduced rework | Catching a type mismatch during ingestion prevents hours of wasted analysis on corrupt data |
| Cleaner pipelines | Downstream code can assume correct types, reducing defensive programming throughout the codebase |
# Example: validation during data ingestion
import pandas as pd

def ingest_csv(filepath, schema):
    """Load a CSV with upfront validation.

    Parameters:
        filepath (str): Path to the CSV file
        schema (dict): Expected column names mapped to dtypes

    Returns:
        pd.DataFrame: Validated DataFrame

    Raises:
        ValueError: If validation fails
    """
    df = pd.read_csv(filepath)

    # Step 1: Check all required columns exist
    missing_cols = set(schema.keys()) - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing columns: {missing_cols}")

    # Step 2: Convert and validate dtypes
    for col, dtype in schema.items():
        try:
            if dtype == 'datetime':
                df[col] = pd.to_datetime(df[col])
            elif dtype == 'numeric':
                df[col] = pd.to_numeric(df[col])
            else:
                df[col] = df[col].astype(dtype)
        except (ValueError, TypeError) as e:
            raise ValueError(f"Column '{col}' failed {dtype} conversion: {e}")

    # Step 3: Check for unexpected nulls
    null_counts = df[list(schema.keys())].isnull().sum()
    cols_with_nulls = null_counts[null_counts > 0]
    if not cols_with_nulls.empty:
        print(f"WARNING: Null values found:\n{cols_with_nulls}")

    return df
# Usage
schema = {
    'employee_id': 'int64',
    'name': 'object',
    'salary': 'numeric',
    'hire_date': 'datetime'
}
# df = ingest_csv('employees.csv', schema)
Using Assert Statements and Try/Except for Validation
Python provides two built-in mechanisms that are commonly used for data validation:
Assert Statements
An assert statement tests a condition and raises an AssertionError if the condition is False. Assertions are ideal for catching programming errors and enforcing assumptions during development.
# Assert statements for quick validation checks

# Basic assertions
data = {'age': 25, 'name': 'Alice', 'score': 88}

assert isinstance(data['age'], int), "Age must be an integer"
assert 0 <= data['age'] <= 120, "Age must be between 0 and 120"
assert len(data['name']) > 0, "Name cannot be empty"
assert 0 <= data['score'] <= 100, "Score must be between 0 and 100"

# Schema validation with assertions
def validate_schema(df, required_columns):
    """Assert that a DataFrame contains all required columns."""
    for col in required_columns:
        assert col in df.columns, f"Missing required column: {col}"
    print("Schema validation passed!")

# DataFrame shape assertions
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
assert df.shape[0] > 0, "DataFrame must not be empty"
assert df.shape[1] == 2, "DataFrame must have exactly 2 columns"
Important Caveat: assert statements are removed when Python runs with the -O (optimize) flag. Never use assert for validation that must always run in production. Use explicit if/raise instead.
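For comparison, here is a sketch of the age checks from above rewritten with explicit if/raise, so they still run under the -O flag:

# Production-safe equivalents of the assertions above
data = {'age': 25, 'name': 'Alice', 'score': 88}

if not isinstance(data['age'], int):
    raise TypeError("Age must be an integer")
if not (0 <= data['age'] <= 120):
    raise ValueError("Age must be between 0 and 120")
if len(data['name']) == 0:
    raise ValueError("Name cannot be empty")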
Try/Except for Robust Validation
The try/except pattern handles validation errors gracefully without crashing the program. This is essential in production data pipelines where you want to log errors and continue processing valid records.
import pandas as pd
# Try/except for type conversion validation
def safe_convert_to_numeric(series, column_name):
    """Attempt to convert a Series to numeric, returning a report."""
    try:
        converted = pd.to_numeric(series, errors='raise')
        print(f"Column '{column_name}': successfully converted to numeric")
        return converted
    except ValueError as e:
        print(f"Column '{column_name}': conversion failed - {e}")
        # Coerce bad values to NaN instead of failing
        converted = pd.to_numeric(series, errors='coerce')
        bad_count = converted.isna().sum() - series.isna().sum()
        print(f"  {bad_count} values coerced to NaN")
        return converted
# Example with mixed data
s = pd.Series(['10', '20', 'thirty', '40', 'N/A'])
result = safe_convert_to_numeric(s, 'quantity')
print(result)
# 0    10.0
# 1    20.0
# 2     NaN
# 3    40.0
# 4     NaN

# Comprehensive validation with try/except
def validate_and_clean(records):
    """Validate a list of records, returning valid and invalid sets."""
    valid = []
    invalid = []
    for i, rec in enumerate(records):
        try:
            # Type checks
            assert isinstance(rec['id'], int), "id must be int"
            assert isinstance(rec['value'], (int, float)), "value must be numeric"
            # Range checks
            assert rec['value'] >= 0, "value must be non-negative"
            valid.append(rec)
        except (AssertionError, KeyError) as e:
            invalid.append({'index': i, 'record': rec, 'error': str(e)})
    print(f"Valid: {len(valid)}, Invalid: {len(invalid)}")
    return valid, invalid
records = [
    {'id': 1, 'value': 100},
    {'id': 'two', 'value': 200},  # bad id type
    {'id': 3, 'value': -50},      # negative value
    {'id': 4, 'value': 400},
]
valid, invalid = validate_and_clean(records)
# Valid: 2, Invalid: 2
1.3.2 Data Integrity
Understand entity, referential, and domain integrity; schema validation; and data quality frameworks.
Concept of Data Integrity and Its Importance
Data integrity refers to the accuracy, consistency, and reliability of data throughout its lifecycle — from creation and storage through processing and retrieval. While validation checks individual values, integrity focuses on the broader question: "Can we trust this data as a whole?"
Data integrity encompasses four key dimensions:
| Integrity Type | Definition | Enforced By |
| --- | --- | --- |
| Entity integrity | Every row is uniquely identifiable | Primary keys |
| Referential integrity | Relationships between tables are consistent | Foreign keys |
| Domain integrity | Values belong to a defined set of acceptable values | Constraints, enums, check rules |
| User-defined integrity | Business-specific rules are satisfied | Custom validation logic |
Why Data Integrity Matters: Without integrity, analysis results are unreliable. A broken foreign key means your JOIN produces incorrect counts. A duplicated primary key means aggregations are inflated. Violations cascade through every query and report that touches the data.
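To see the inflation concretely, here is a small sketch (with made-up tables) where a duplicated key doubles a customer's order total after a join:

import pandas as pd

# customer 1 appears twice - an entity integrity violation
customers = pd.DataFrame({'customer_id': [1, 1, 2], 'name': ['A', 'A', 'B']})
orders = pd.DataFrame({'customer_id': [1, 1, 2], 'amount': [10, 20, 30]})

merged = customers.merge(orders, on='customer_id')
print(orders['amount'].sum())  # 60 - the true total
print(merged['amount'].sum())  # 90 - customer 1's orders counted twice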
Referential Integrity in Databases
Referential integrity ensures that relationships between tables remain valid. If table B references table A through a foreign key, then every foreign key value in table B must correspond to an existing primary key value in table A.
Common violations include:
Orphan records: A row in a child table references a parent record that does not exist (e.g., an order with a customer_id that has no matching customer)
Dangling references: A parent record is deleted while child records still reference it
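In pandas, a sketch of an orphan-record check might use .isin() to confirm that every foreign key has a matching parent row (the tables below are illustrative):

import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3]})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 2, 99]  # 99 has no matching customer
})

# Referential integrity: every orders.customer_id must exist in customers
orphans = orders[~orders['customer_id'].isin(customers['customer_id'])]
print("Orphan orders:")
print(orphans)
#    order_id  customer_id
# 2       103           99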
Entity integrity requires that every table has a primary key and that the primary key is unique and not null. This ensures that every record can be uniquely identified. Violations include duplicate IDs and null primary keys.
import pandas as pd
# Entity integrity checks
df = pd.DataFrame({
    'employee_id': [1, 2, 3, 2, None, 5],  # duplicate 2, null value
    'name': ['Alice', 'Bob', 'Charlie', 'Bob2', 'Eve', 'Frank'],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'IT']
})

def check_entity_integrity(df, pk_column):
    """Verify entity integrity for a given primary key column."""
    issues = []

    # Check 1: Primary key must not contain nulls
    null_count = df[pk_column].isna().sum()
    if null_count > 0:
        issues.append(f"NULL values in primary key: {null_count} row(s)")

    # Check 2: Primary key must be unique
    duplicate_count = df[pk_column].duplicated().sum()
    if duplicate_count > 0:
        issues.append(f"Duplicate primary keys: {duplicate_count} duplicate row(s)")
        issues.append(f"Duplicate values: {df[pk_column][df[pk_column].duplicated()].tolist()}")

    if issues:
        print("ENTITY INTEGRITY VIOLATIONS:")
        for issue in issues:
            print(f"  - {issue}")
    else:
        print("Entity integrity: OK")
    return len(issues) == 0

check_entity_integrity(df, 'employee_id')
# ENTITY INTEGRITY VIOLATIONS:
#   - NULL values in primary key: 1 row(s)
#   - Duplicate primary keys: 1 duplicate row(s)
#   - Duplicate values: [2.0]
Domain Integrity (Valid Values)
Domain integrity ensures that all values in a column come from a defined set of acceptable values (the "domain"). This includes data type restrictions, allowable value ranges, format patterns, and enumerated sets.
import pandas as pd
# Domain integrity: Allowed values for categorical columns
VALID_DEPARTMENTS = {'HR', 'IT', 'Sales', 'Marketing', 'Finance'}
VALID_STATUSES = {'active', 'inactive', 'on_leave'}
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'department': ['HR', 'IT', 'Accounting', 'Sales'],   # 'Accounting' not valid!
    'status': ['active', 'fired', 'active', 'on_leave'], # 'fired' not valid!
    'salary': [55000, 62000, -1000, 48000]               # negative salary!
})

def check_domain_integrity(df, domain_rules):
    """Check domain integrity against a set of rules.

    domain_rules: dict with column names as keys and either:
        - a set of valid values (for categorical columns)
        - a tuple (min, max) for numeric ranges
    """
    violations = []
    for col, rule in domain_rules.items():
        if col not in df.columns:
            violations.append(f"Column '{col}' not found")
            continue
        if isinstance(rule, set):
            # Categorical check: values must be in the allowed set
            invalid = df[~df[col].isin(rule)]
            if not invalid.empty:
                bad_vals = invalid[col].unique().tolist()
                violations.append(
                    f"'{col}' contains invalid values: {bad_vals}"
                )
        elif isinstance(rule, tuple) and len(rule) == 2:
            # Numeric range check
            min_val, max_val = rule
            invalid = df[~df[col].between(min_val, max_val)]
            if not invalid.empty:
                violations.append(
                    f"'{col}' has {len(invalid)} values outside [{min_val}, {max_val}]"
                )
    return violations
# Define domain rules
rules = {
    'department': VALID_DEPARTMENTS,
    'status': VALID_STATUSES,
    'salary': (0, 500000)
}
violations = check_domain_integrity(df, rules)
for v in violations:
print(f"DOMAIN VIOLATION: {v}")
# DOMAIN VIOLATION: 'department' contains invalid values: ['Accounting']
# DOMAIN VIOLATION: 'status' contains invalid values: ['fired']
# DOMAIN VIOLATION: 'salary' has 1 values outside [0, 500000]
Schema Validation Approaches
A data schema defines the expected structure of a dataset: column names, data types, constraints, and relationships. Schema validation checks incoming data against this specification to ensure conformity.
Schema validation typically covers:
Column presence: All required columns exist
Column types: Each column has the expected dtype
Non-null constraints: Required fields are not missing
Uniqueness: Columns that should be unique (e.g., IDs) have no duplicates
Value constraints: Domain rules, ranges, and patterns
import pandas as pd
# Comprehensive schema validation for DataFrames
class DataFrameSchema:
    """A simple schema validator for pandas DataFrames."""

    def __init__(self):
        self.columns = {}

    def add_column(self, name, dtype=None, nullable=True,
                   unique=False, allowed_values=None,
                   min_val=None, max_val=None):
        """Define expectations for a column."""
        self.columns[name] = {
            'dtype': dtype,
            'nullable': nullable,
            'unique': unique,
            'allowed_values': allowed_values,
            'min_val': min_val,
            'max_val': max_val
        }
        return self  # enable method chaining

    def validate(self, df):
        """Validate a DataFrame against this schema."""
        errors = []
        for col_name, rules in self.columns.items():
            # Check column exists
            if col_name not in df.columns:
                errors.append(f"Missing column: '{col_name}'")
                continue
            col = df[col_name]

            # Check nullable
            if not rules['nullable'] and col.isna().any():
                errors.append(f"'{col_name}' contains nulls but is non-nullable")

            # Check unique
            if rules['unique'] and col.duplicated().any():
                errors.append(f"'{col_name}' contains duplicate values but must be unique")

            # Check allowed values
            if rules['allowed_values'] is not None:
                invalid = col[~col.isin(rules['allowed_values']) & col.notna()]
                if not invalid.empty:
                    errors.append(
                        f"'{col_name}' has invalid values: {invalid.unique().tolist()}"
                    )

            # Check range
            if rules['min_val'] is not None:
                below = col[col < rules['min_val']]
                if not below.empty:
                    errors.append(f"'{col_name}' has values below {rules['min_val']}")
            if rules['max_val'] is not None:
                above = col[col > rules['max_val']]
                if not above.empty:
                    errors.append(f"'{col_name}' has values above {rules['max_val']}")
        return errors
# Define and use the schema
schema = DataFrameSchema()
schema.add_column('id', nullable=False, unique=True)
schema.add_column('name', nullable=False)
schema.add_column('age', min_val=0, max_val=120)
schema.add_column('status', allowed_values=['active', 'inactive'])
df = pd.DataFrame({
    'id': [1, 2, 2, 4],
    'name': ['Alice', None, 'Charlie', 'Diana'],
    'age': [25, 130, 35, -5],
    'status': ['active', 'inactive', 'pending', 'active']
})
errors = schema.validate(df)
for e in errors:
print(f"SCHEMA ERROR: {e}")
# SCHEMA ERROR: 'id' contains duplicate values but must be unique
# SCHEMA ERROR: 'name' contains nulls but is non-nullable
# SCHEMA ERROR: 'age' has values below 0
# SCHEMA ERROR: 'age' has values above 120
# SCHEMA ERROR: 'status' has invalid values: ['pending']
Data Quality Frameworks
A data quality framework provides a structured approach to measuring, monitoring, and improving data quality across an organization. Rather than ad-hoc checks, a framework defines consistent dimensions of quality that all data should meet.
The most widely recognized data quality dimensions are:
| Dimension | Definition | Example Check |
| --- | --- | --- |
| Accuracy | Data correctly represents the real-world entity | Customer address matches postal records |
| Completeness | All required data is present | No null values in mandatory fields |
| Consistency | Data does not contradict itself across sources | Total in summary matches sum of line items |
| Timeliness | Data is available when needed and is current | Records are updated within 24 hours of change |
| Uniqueness | No unintended duplicates exist | Each customer appears exactly once |
| Validity | Data conforms to the expected format and rules | Email addresses match a valid pattern |
import pandas as pd
# Data quality assessment function
def assess_data_quality(df, pk_column=None):
    """Generate a data quality report for a DataFrame."""
    report = {}

    # 1. Completeness: percentage of non-null values per column
    completeness = ((1 - df.isnull().mean()) * 100).round(2)
    report['completeness'] = completeness.to_dict()

    # 2. Uniqueness: percentage of unique values per column
    uniqueness = ((df.nunique() / len(df)) * 100).round(2)
    report['uniqueness'] = uniqueness.to_dict()

    # 3. Entity integrity: duplicate check on primary key
    if pk_column and pk_column in df.columns:
        dup_count = df[pk_column].duplicated().sum()
        null_pk = df[pk_column].isna().sum()
        report['entity_integrity'] = {
            'duplicates': int(dup_count),
            'null_keys': int(null_pk),
            'valid': dup_count == 0 and null_pk == 0
        }

    # 4. Row count and shape
    report['total_rows'] = len(df)
    report['total_columns'] = len(df.columns)
    return report
# Example usage
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'score': [85, 92, 78, None, 95],
    'dept': ['HR', 'IT', 'IT', 'HR', 'IT']
})
quality = assess_data_quality(df, pk_column='id')
print("=== Data Quality Report ===")
print(f"Rows: {quality['total_rows']}, Columns: {quality['total_columns']}")
print(f"\nCompleteness (% non-null):")
for col, pct in quality['completeness'].items():
print(f" {col}: {pct}%")
print(f"\nEntity Integrity: {quality['entity_integrity']}")
Putting It All Together: Data Integrity Checks Before Database Insertion
In a real pipeline, you validate data comprehensively before inserting it into a database. Here is a complete example that combines type, range, cross-field, entity, domain, and referential checks:
import pandas as pd
def validate_for_insertion(df):
    """Run all validation checks before inserting data into a database.

    Returns a tuple: (is_valid: bool, errors: list, clean_df: DataFrame)
    """
    errors = []

    # 1. Entity integrity - check primary key
    if df['id'].isna().any():
        errors.append("Primary key 'id' contains NULL values")
    if df['id'].duplicated().any():
        dupes = df[df['id'].duplicated()]['id'].tolist()
        errors.append(f"Duplicate IDs found: {dupes}")

    # 2. Type validation
    try:
        df['salary'] = pd.to_numeric(df['salary'], errors='raise')
    except ValueError:
        errors.append("'salary' contains non-numeric values")

    # 3. Range validation
    if (df['salary'] < 0).any():
        errors.append("Negative salary values detected")

    # 4. Domain validation
    valid_depts = {'HR', 'IT', 'Sales', 'Marketing'}
    invalid_depts = set(df['department'].dropna()) - valid_depts
    if invalid_depts:
        errors.append(f"Invalid departments: {invalid_depts}")

    # 5. Cross-field validation
    if 'start_date' in df.columns and 'end_date' in df.columns:
        bad_dates = df[df['end_date'] < df['start_date']]
        if not bad_dates.empty:
            errors.append(f"end_date < start_date in {len(bad_dates)} rows")

    # 6. Completeness check
    required_cols = ['id', 'name', 'department', 'salary']
    for col in required_cols:
        null_count = df[col].isna().sum()
        if null_count > 0:
            errors.append(f"Required column '{col}' has {null_count} null(s)")

    is_valid = len(errors) == 0
    return is_valid, errors, df
# Test it
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'department': ['HR', 'IT', 'Sales'],
    'salary': [55000, 62000, 48000]
})
is_valid, errors, clean_df = validate_for_insertion(df)
if is_valid:
    print("All checks passed. Safe to insert into database.")
else:
    print("Validation failed:")
    for e in errors:
        print(f"  - {e}")
Exam Tip: You may encounter scenarios where you need to identify which type of integrity is being violated (entity, referential, or domain). Focus on the definitions: entity = unique identifiers, referential = valid relationships, domain = valid values within a column.
Practice Quiz — Topic 1.3
Test your understanding with 10 multiple-choice questions. The correct answer and an explanation follow each question.
Q1. Which validation method checks whether a value is of the expected data type (e.g., integer, string, float)?
A) Range validation
B) Type validation
C) Cross-field validation
D) Referential validation
Correct: B) Type validation. Type validation uses checks like isinstance() or pandas .dtypes to ensure values are of the correct data type. Range validation checks boundaries, cross-field checks relationships between columns, and referential validation checks foreign key relationships.
Q2. What will isinstance(3.14, (int, float)) return?
A) True
B) False
C) TypeError
D) None
Correct: A) True. When a tuple of types is passed as the second argument, isinstance() returns True if the object is an instance of any type in the tuple. Since 3.14 is a float, and float is in the tuple, the result is True.
Q3. You have an orders table where each order references a customer_id. Some orders reference customer IDs that do not exist in the customers table. Which type of integrity is violated?
A) Entity integrity
B) Referential integrity
C) Domain integrity
D) Type integrity
Correct: B) Referential integrity. Referential integrity ensures that foreign key values in a child table correspond to valid primary key values in the parent table. Orders referencing non-existent customers are "orphan records" that violate referential integrity.
Q4. What is the primary risk of using assert for validation in production code?
A) Assert statements are slower than if/else
B) Assert statements cannot include custom error messages
C) Assert statements are disabled when Python runs with the -O flag
D) Assert statements can only check boolean values
Correct: C) When Python is invoked with the -O (optimize) flag, all assert statements are stripped from the bytecode and never executed. This means validation logic inside assert silently disappears in optimized production builds. Use explicit if/raise for critical validation.
Q5. A dataset has a start_date and end_date column. You need to ensure that the end date is always after the start date. Which validation method applies?
A) Type validation
B) Range validation
C) Cross-field validation
D) Domain validation
Correct: C) Cross-field validation. Cross-field (cross-reference) validation checks the logical relationship between two or more fields. Verifying that end_date > start_date requires comparing values across two columns, which is the definition of cross-field validation.
Q6. Which of the following is an example of entity integrity violation?
A) A salary column contains the value "N/A"
B) Two rows in a table have the same primary key value
C) An order references a customer_id that does not exist
D) A temperature reading is 999 degrees Celsius
Correct: B) Two rows with the same primary key. Entity integrity requires that every row has a unique, non-null primary key. Duplicate primary keys mean two records cannot be distinguished, violating entity integrity. Option A is a domain issue, C is referential, and D is a range issue.
Q7. What does the pandas method df['age'].between(0, 120) return?
A) A filtered DataFrame with only rows where age is between 0 and 120
B) A boolean Series indicating whether each value is between 0 and 120 (inclusive)
C) The count of values between 0 and 120
D) A new Series with values clipped to the range [0, 120]
Correct: B) A boolean Series. The .between() method returns a boolean Series of the same length as the original, with True where the value falls within the specified range (inclusive by default) and False otherwise. It does not filter or modify the data.
Q8. A column called status should only contain values from the set {"active", "inactive", "on_leave"}. A row contains the value "terminated". Which integrity type is violated?
A) Entity integrity
B) Referential integrity
C) Domain integrity
D) Cross-field integrity
Correct: C) Domain integrity. Domain integrity ensures that all values in a column come from a defined set of acceptable values. "terminated" is not in the allowed set, so it violates the domain constraint for the status column.
Q9. What is the main benefit of performing data validation early in the data pipeline (at ingestion)?
A) It makes the code run faster
B) It eliminates the need for database constraints
C) It prevents bad data from propagating through downstream processes
D) It automatically corrects invalid data
Correct: C) It prevents bad data from propagating downstream. Early validation ("fail fast") catches errors at the source before they contaminate storage, analysis, and reports. It does not automatically fix data or replace database constraints, but it dramatically reduces debugging costs and rework.
Q10. In the following code, what happens when the conversion fails? try: df['price'] = pd.to_numeric(df['price'], errors='raise') except ValueError: df['price'] = pd.to_numeric(df['price'], errors='coerce')
A) The program crashes with a ValueError
B) Non-numeric values are dropped from the DataFrame
C) Non-numeric values are replaced with NaN
D) Non-numeric values are replaced with 0
Correct: C) Non-numeric values are replaced with NaN. The try block attempts strict conversion (errors='raise'), which raises a ValueError if any value cannot be converted. The except block catches that error and re-runs the conversion with errors='coerce', which converts invalid values to NaN instead of failing.