Vector unit tests are your first line of defense against broken data transformations, catching issues before they impact production systems.

Let’s see it in action. Imagine you have a transform_user_data function that takes raw user events and outputs a structured format suitable for your data warehouse.

import json

import pandas as pd

def transform_user_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Transforms raw user event data into a structured format.
    """
    # Ensure required columns exist
    required_cols = ['user_id', 'event_name', 'timestamp', 'properties']
    for col in required_cols:
        if col not in df.columns:
            raise ValueError(f"Missing required column: {col}")

    # Convert timestamp to datetime objects
    df['timestamp'] = pd.to_datetime(df['timestamp'])

    # Extract specific properties if they exist
    if 'properties' in df.columns and df['properties'].notna().any():
        # properties may arrive as a JSON string, a dict, or None/NaN
        def parse_properties(value):
            if value is None or (isinstance(value, float) and pd.isna(value)):
                return {}
            if isinstance(value, dict):
                return value
            return json.loads(value)  # raises json.JSONDecodeError on malformed input

        try:
            parsed = df['properties'].apply(parse_properties)
            properties_df = pd.json_normalize(parsed.tolist())
            properties_df.index = df.index  # keep row alignment for the concat below
            df = pd.concat([df.drop('properties', axis=1), properties_df], axis=1)
        except (TypeError, json.JSONDecodeError):
            # Malformed or unexpected values: keep the raw column as strings
            df['properties'] = df['properties'].astype(str)

    # Rename columns for consistency
    df.rename(columns={'user_id': 'userID', 'event_name': 'eventName'}, inplace=True)

    # Add a derived field
    df['event_hour'] = df['timestamp'].dt.hour

    return df

# Example usage (for demonstration, not part of the test itself)
import json
raw_data = {
    'user_id': [1, 2, 3],
    'event_name': ['click', 'view', 'purchase'],
    'timestamp': ['2023-10-27T10:00:00Z', '2023-10-27T11:30:00Z', '2023-10-27T12:45:00Z'],
    'properties': ['{"item_id": "A", "price": 10}', '{}', '{"item_id": "B", "price": 25, "quantity": 2}']
}
raw_df = pd.DataFrame(raw_data)
transformed_df = transform_user_data(raw_df.copy())
print(transformed_df)

This transform_user_data function is designed to take a Pandas DataFrame representing raw events, parse timestamps, unpack nested JSON properties, rename fields, and add a new derived field (event_hour).

The core problem this solves is data drift and schema evolution. As your upstream data sources change (e.g., a new field is added, a timestamp format is altered, or a JSON structure shifts), your transformations can break silently, leading to corrupted or incomplete data in your downstream systems. Unit tests act as a contract, ensuring your transformation logic behaves as expected under various conditions.

Internally, the function relies on Pandas for its powerful data manipulation capabilities. It checks for essential columns, uses pd.to_datetime for date parsing, pd.json_normalize to flatten JSON structures, df.rename for schema alignment, and datetime accessor properties (.dt.hour) for feature engineering. Each step is a potential point of failure if the input data doesn’t conform to expectations.
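Each of those building blocks can be exercised in isolation before you test the full function. A minimal sketch with toy values (not taken from the pipeline above):

```python
import pandas as pd

# Parse an ISO-8601 timestamp with a 'Z' suffix: pd.to_datetime yields
# tz-aware UTC datetimes, and the .dt accessor exposes components like hour.
ts = pd.to_datetime(pd.Series(['2023-10-27T10:00:00Z']))
hour = ts.dt.hour.iloc[0]  # -> 10

# Flatten a list of dicts: one column per key, NaN where a key is absent.
records = [{'item_id': 'A', 'price': 10}, {'price': 25}]
flat = pd.json_normalize(records)
# flat.columns -> ['item_id', 'price']; flat['item_id'].iloc[1] is NaN
```

Seeing each primitive behave on trivial input makes it much easier to reason about which step broke when a real test fails.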

The key levers you control are the input data provided to the test and the assertions you make about the output. For instance, you’d write tests to cover:

  • Schema Validation: Does the output have the expected columns and data types?
  • Data Type Conversion: Are timestamps correctly parsed? Are numerical strings converted to numbers?
  • JSON Parsing: Does nested JSON in a properties column get correctly unpacked into distinct columns?
  • Edge Cases: What happens with empty input DataFrames, missing optional columns, or malformed JSON?
  • Derived Fields: Is the event_hour field accurately calculated from the timestamp?
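The missing-column contract in particular lends itself to pytest.mark.parametrize, so every required column gets its own test case. A sketch using a minimal stand-in, check_required (a hypothetical helper mirroring the validation at the top of transform_user_data, inlined here to keep the example self-contained; in a real suite you would call transform_user_data itself):

```python
import pandas as pd
import pytest

REQUIRED_COLS = ['user_id', 'event_name', 'timestamp', 'properties']

def check_required(df: pd.DataFrame) -> None:
    # Mirrors the required-column check inside transform_user_data.
    for col in REQUIRED_COLS:
        if col not in df.columns:
            raise ValueError(f"Missing required column: {col}")

@pytest.mark.parametrize("missing", REQUIRED_COLS)
def test_required_column_raises(missing):
    # Build a frame with every required column except the one under test.
    cols = {c: [] for c in REQUIRED_COLS if c != missing}
    with pytest.raises(ValueError, match=f"Missing required column: {missing}"):
        check_required(pd.DataFrame(cols))
```

Parametrizing keeps the failure report granular: if a refactor drops only the timestamp check, only that one case goes red.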

Here’s a basic unit test suite using pytest:

import pandas as pd
import pytest

from your_module import transform_user_data  # assuming the function lives in 'your_module.py'

def test_transform_user_data_basic():
    raw_data = {
        'user_id': [101, 102, 103],
        'event_name': ['page_load', 'click', 'submit'],
        'timestamp': ['2023-10-27T09:15:00Z', '2023-10-27T14:05:00Z', '2023-10-27T20:30:00Z'],
        'properties': ['{"page": "/home"}', '{"element": "button_A"}', '{"form_id": "contact"}']
    }
    input_df = pd.DataFrame(raw_data)

    expected_data = {
        'userID': [101, 102, 103],
        'eventName': ['page_load', 'click', 'submit'],
        'timestamp': pd.to_datetime(['2023-10-27T09:15:00Z', '2023-10-27T14:05:00Z', '2023-10-27T20:30:00Z']),
        'page': ['/home', None, None],
        'element': [None, 'button_A', None],
        'form_id': [None, None, 'contact'],
        'event_hour': [9, 14, 20]
    }
    expected_df = pd.DataFrame(expected_data)

    actual_df = transform_user_data(input_df.copy())

    # Use pandas' testing helpers for a robust DataFrame comparison.
    # check_dtype=False because .dt.hour yields int32 on most platforms,
    # while the literal list above produces int64.
    pd.testing.assert_frame_equal(
        actual_df.sort_index(axis=1),
        expected_df.sort_index(axis=1),
        check_dtype=False,
    )

def test_transform_user_data_missing_columns():
    raw_data = {
        'user_id': [101],
        'timestamp': ['2023-10-27T09:15:00Z'],
        'properties': ['{"page": "/home"}']
    }
    input_df = pd.DataFrame(raw_data)

    with pytest.raises(ValueError, match="Missing required column: event_name"):
        transform_user_data(input_df.copy())

def test_transform_user_data_malformed_json():
    raw_data = {
        'user_id': [101, 102],
        'event_name': ['page_load', 'click'],
        'timestamp': ['2023-10-27T09:15:00Z', '2023-10-27T14:05:00Z'],
        'properties': ['{"page": "/home"}', 'this is not json']
    }
    input_df = pd.DataFrame(raw_data)

    # With malformed JSON in any row, the function falls back to keeping the
    # whole 'properties' column as raw strings rather than raising. If you
    # would rather fail fast, assert on a raised error here instead.
    transformed_df = transform_user_data(input_df.copy())
    assert transformed_df['properties'].iloc[1] == 'this is not json'
    assert transformed_df['userID'].iloc[1] == 102  # other transformations still apply

def test_transform_user_data_empty_properties():
    raw_data = {
        'user_id': [101, 102],
        'event_name': ['page_load', 'click'],
        'timestamp': ['2023-10-27T09:15:00Z', '2023-10-27T14:05:00Z'],
        'properties': [None, '{}']
    }
    input_df = pd.DataFrame(raw_data)

    expected_data = {
        'userID': [101, 102],
        'eventName': ['page_load', 'click'],
        'timestamp': pd.to_datetime(['2023-10-27T09:15:00Z', '2023-10-27T14:05:00Z']),
        'event_hour': [9, 14]
    }
    expected_df = pd.DataFrame(expected_data)

    actual_df = transform_user_data(input_df.copy())
    pd.testing.assert_frame_equal(
        actual_df.sort_index(axis=1),
        expected_df.sort_index(axis=1),
        check_dtype=False,  # .dt.hour dtype varies by platform
    )

The most surprising thing about unit testing data transformations is how much time it saves downstream. Many teams treat transformations as "just code" and defer testing to integration or end-to-end suites, only to discover data quality issues in production dashboards or ETL failures. By writing specific, isolated tests for each transformation function, you create a safety net that catches regressions early and often, making your data pipelines far more robust.

When dealing with JSON properties, the pd.json_normalize function is a powerful tool, but it assumes a consistent structure. If your JSON can have deeply nested or varying keys, you might need to write custom logic to flatten it selectively or use a recursive approach before passing it to json_normalize, or explicitly define the record_path and meta arguments to guide its flattening.
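For instance, when events carry a list of nested records, record_path and meta tell json_normalize which list to explode into rows and which parent fields to carry along. A sketch with a hypothetical nested order event:

```python
import pandas as pd

# Hypothetical nested event: a list of line items plus order-level metadata.
event = {
    "order_id": "o-1",
    "items": [
        {"sku": "A", "qty": 1},
        {"sku": "B", "qty": 2},
    ],
}

# record_path names the list to explode into rows;
# meta names parent fields to repeat on each resulting row.
flat = pd.json_normalize(event, record_path="items", meta=["order_id"])
# flat has one row per item, with columns: sku, qty, order_id
```

Guiding the flattening explicitly like this is usually more predictable than letting json_normalize infer structure from heterogeneous records.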

The next challenge you’ll face is orchestrating these unit tests within a CI/CD pipeline to ensure they run automatically on every code change.

Want structured learning?

Take the full Vector course →