Feature columns are TensorFlow’s way of abstracting away the complexity of how raw input data is transformed into a format suitable for machine learning models. They’re not just a data structure; they’re a set of instructions for TensorFlow to perform specific transformations.
Let’s see this in action. Imagine you have a dataset with a categorical feature like "city" and a numerical feature like "age."
import tensorflow as tf
import pandas as pd

# Sample data
data = {
    'city': ['New York', 'London', 'Paris', 'New York', 'London'],
    'age': [25, 32, 45, 28, 35],
    'target': [0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)

# Define feature columns
city_vocab = ['New York', 'London', 'Paris']
city_column = tf.feature_column.categorical_column_with_vocabulary_list(
    'city', vocabulary_list=city_vocab
)
age_column = tf.feature_column.numeric_column('age')

# Create a feature layer from the feature columns
feature_layer = tf.keras.layers.DenseFeatures(
    [
        tf.feature_column.indicator_column(city_column),  # One-hot encode city
        age_column
    ]
)

# Prepare input data as a dictionary of tensors
input_data = {
    'city': tf.constant(['New York', 'London', 'Tokyo']),  # Tokyo is out-of-vocabulary
    'age': tf.constant([30, 40, 22])
}

# Pass data through the feature layer
processed_features = feature_layer(input_data)

print("Original Input:")
for key, value in input_data.items():
    print(f"{key}: {value.numpy()}")

print("\nProcessed Features:")
print(processed_features.numpy())
The output of processed_features shows the raw "age" values and the one-hot encoded "city" (an all-zero vector for "Tokyo", since it’s out-of-vocabulary). Note that DenseFeatures concatenates its columns sorted by name, so "age" comes before "city" in the output tensor. This feature_layer can then be plugged directly into a Keras model.
The problem feature columns solve is the "last mile" of data preparation: bridging the gap between raw, human-readable data and the numerical tensors your model expects. They encapsulate common transformations like one-hot encoding, embedding, bucketization, and normalization, making your data pipeline reproducible and model-agnostic. Instead of writing custom Python code for each transformation, you declare it using a feature column API. This allows you to define your features once and then easily reuse them across different models or experiments.
Internally, when you use tf.keras.layers.DenseFeatures, it compiles your declared feature columns into a computation graph. For a categorical_column_with_vocabulary_list, it builds a lookup table. When data comes in, it uses this table to map strings to indices. If you wrap that with indicator_column, it then converts those indices into dense, one-hot vectors. For numeric_column, it’s essentially a pass-through, but it ensures the data is in the correct shape and type. The DenseFeatures layer then concatenates all these transformed features into a single tensor, ready for your model’s dense layers.
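The lookup-then-one-hot path can be sketched with plain TensorFlow ops. This is a simplified illustration of what the layer wires up, not its actual implementation:

```python
import tensorflow as tf

vocab = ['New York', 'London', 'Paris']

# Lookup table mapping strings to indices; unknown keys fall through to -1.
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(vocab),
        values=tf.range(len(vocab), dtype=tf.int64)),
    default_value=-1)

cities = tf.constant(['New York', 'London', 'Tokyo'])
indices = table.lookup(cities)                        # [0, 1, -1]
city_one_hot = tf.one_hot(indices, depth=len(vocab))  # index -1 yields an all-zero row

ages = tf.constant([[30.0], [40.0], [22.0]])

# DenseFeatures concatenates the transformed columns (sorted by name, so age first).
features = tf.concat([ages, city_one_hot], axis=1)
print(features.numpy())
```

The all-zero row for "Tokyo" falls out naturally here: the table returns -1 for unknown keys, and tf.one_hot maps -1 to a vector of zeros.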
Out-of-vocabulary (OOV) handling for categorical_column_with_vocabulary_list is controlled by the num_oov_buckets parameter. By default (num_oov_buckets=0), any input value not in the vocabulary is mapped to default_value (-1), which is why "Tokyo" produced an all-zero one-hot vector above. If you instead set num_oov_buckets=1 during column creation, unseen values are hashed into a single shared bucket appended after the vocabulary. This is useful for handling rare or unseen categories without explicitly listing them, effectively turning them into one extra category.
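The modern analogue of this knob is the num_oov_indices argument on the StringLookup preprocessing layer, with one caveat: StringLookup reserves its OOV slots at the front of the index space rather than the back. A small sketch:

```python
import tensorflow as tf

lookup = tf.keras.layers.StringLookup(
    vocabulary=['New York', 'London', 'Paris'],
    num_oov_indices=1)  # one shared bucket for unseen values

# Index 0 is the OOV bucket; vocabulary entries start at index 1.
ids = lookup(tf.constant(['Tokyo', 'London', 'Berlin']))
print(ids.numpy())  # Tokyo and Berlin both land in the OOV bucket (index 0)
```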
Preprocessing layers, like tf.keras.layers.Normalization or tf.keras.layers.StringLookup, offer a more explicit and often more flexible way to perform these transformations. You can build a tf.keras.Sequential model using these layers, train them (e.g., compute mean and variance for normalization), and then integrate this preprocessing model into your main model. This provides better control and allows for stateful preprocessing (like learning vocabulary or statistics from data).
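As a sketch of that stateful flavor, a Normalization layer can learn the mean and variance of the "age" values from the sample data via adapt, then standardize new inputs:

```python
import numpy as np
import tensorflow as tf

ages = np.array([[25.0], [32.0], [45.0], [28.0], [35.0]])

norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(ages)  # computes and stores mean and variance from the data

standardized = norm(ages)
print(standardized.numpy())  # roughly zero mean, unit variance
```

Because the learned statistics live inside the layer, saving the model saves the preprocessing with it, so serving-time inputs get the exact same transformation.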
The most surprising thing is how feature columns, particularly when combined with DenseFeatures, adapt to multi-valued inputs. If a categorical feature carries a variable number of values per example (e.g., a user’s list of purchased items, fed as a SparseTensor), indicator_column produces a multi-hot count vector, while embedding_column pools the per-item embeddings according to its combiner argument ("sum", "mean", or "sqrtn", with "mean" as the default), so you get a fixed-width representation without any explicit padding logic.
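A minimal illustration of the multi-hot case, using the StringLookup layer on a ragged batch as an analogue of the indicator_column behavior (the item vocabulary here is made up):

```python
import tensorflow as tf

lookup = tf.keras.layers.StringLookup(
    vocabulary=['apple', 'bread', 'milk'],
    output_mode='multi_hot')  # index 0 is the OOV bucket

# Each example has a variable-length list of purchased items.
baskets = tf.ragged.constant([['apple', 'milk'], ['bread']])
multi_hot = lookup(baskets)
print(multi_hot.numpy())
```

Each ragged row collapses to one fixed-width vector, so downstream dense layers never see the variable lengths.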
The next step is understanding how to combine these feature columns with embedding layers for high-cardinality categorical features.
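As a teaser, here is that idea sketched with preprocessing layers rather than embedding_column; the vocabulary and embedding size are illustrative placeholders (real high-cardinality features may have millions of entries):

```python
import tensorflow as tf

lookup = tf.keras.layers.StringLookup(
    vocabulary=['New York', 'London', 'Paris'])
embed = tf.keras.layers.Embedding(
    input_dim=lookup.vocabulary_size(),  # vocabulary plus the OOV slot
    output_dim=4)                        # learned 4-dimensional representation

ids = lookup(tf.constant(['London', 'Tokyo']))  # Tokyo maps to the OOV index
vectors = embed(ids)
print(vectors.shape)  # (2, 4)
```

Instead of a sparse one-hot vector per category, each category gets a dense, trainable vector, which is what makes high-cardinality features tractable.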