TensorFlow’s TextVectorization layer is a surprisingly powerful tool that does more than just split text into words; it builds a vocabulary learned directly from your specific dataset via its adapt() method.

Let’s see it in action. Imagine we have a few sentences and want to turn them into numerical IDs that a model can understand.

import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Sample data
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog slept.",
    "A quick brown rabbit hops.",
    "The fox chased the rabbit."
]

# Initialize the TextVectorization layer
# max_tokens: limits the vocabulary size to the most frequent tokens
# output_sequence_length: pads or truncates sequences to a fixed length
vectorize_layer = TextVectorization(
    max_tokens=10,
    output_sequence_length=8
)

# Adapt the layer to the data to build the vocabulary
vectorize_layer.adapt(sentences)

# Get the vocabulary
print("Vocabulary:", vectorize_layer.get_vocabulary())

# Vectorize the sentences
vectorized_sentences = vectorize_layer(sentences)
print("Vectorized sentences:\n", vectorized_sentences)

The output shows the vocabulary and the numerical representation of our sentences. (The exact ordering of equally frequent tokens can vary between TensorFlow versions, so your vocabulary may differ slightly, but it will look like this:)

Vocabulary: ['', '[UNK]', 'the', 'quick', 'brown', 'fox', 'lazy', 'dog', 'rabbit', 'jumps']
Vectorized sentences:
 tf.Tensor(
[[2 3 4 5 9 1 2 6]
 [2 6 7 1 0 0 0 0]
 [1 3 4 8 1 0 0 0]
 [2 5 1 2 8 0 0 0]], shape=(4, 8), dtype=int64)

Notice that the vocabulary includes '' (the padding token, ID 0) and [UNK] (the out-of-vocabulary token, ID 1) by default. With max_tokens=10, we get only the 8 most frequent actual words, plus these two special tokens; every other word is mapped to the [UNK] ID. output_sequence_length=8 pads shorter sequences with 0 and truncates longer ones, so every output sequence is exactly 8 integers long.

The core problem TextVectorization solves is bridging the gap between raw text and numerical input for machine learning models. Models can’t directly process strings; they need numbers. This layer automates the process of:

  1. Tokenization: Breaking down text into smaller units (words, sub-words, characters). By default, the layer lowercases the text, strips punctuation, and then splits on whitespace.
  2. Vocabulary Creation: Building a mapping from these tokens to unique integer IDs.
  3. Encoding: Converting input text into sequences of these integer IDs.
  4. Padding/Truncation: Ensuring all sequences have the same length, which is crucial for batching data for neural networks.
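The four steps above can be sketched in plain Python. This is a simplified illustration of the pipeline, not TensorFlow’s actual implementation; in particular, IDs here are assigned in order of first appearance rather than by frequency:

```python
def tokenize(text):
    # Step 1: lowercase, strip (simple) punctuation, split on whitespace
    return text.lower().replace(".", "").split()

def build_vocab(texts):
    # Step 2: map each distinct token to a unique integer ID;
    # 0 is reserved for padding and 1 for unknown tokens
    vocab = {"": 0, "[UNK]": 1}
    for text in texts:
        for token in tokenize(text):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(text, vocab, sequence_length):
    # Steps 3 and 4: look up each token's ID (unknowns become 1),
    # then truncate or pad with 0 to a fixed length
    ids = [vocab.get(token, 1) for token in tokenize(text)]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

texts = ["The lazy dog slept.", "A quick brown rabbit hops."]
vocab = build_vocab(texts)
print(encode("The dog hops.", vocab, 8))
# [2, 4, 10, 0, 0, 0, 0, 0]
```

Unseen words like those in "unknown words here" would all encode to 1, the [UNK] ID.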

Internally, when you call .adapt(data), the layer iterates through your dataset. It tokenizes each piece of text, counts the frequency of each token, and then selects the max_tokens - 2 most frequent tokens to form the vocabulary. The remaining tokens are mapped to the [UNK] ID (usually 1). The '' token (usually 0) is reserved for padding.
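That frequency-based selection can be sketched with collections.Counter. This is a stand-in for what adapt() does internally, not its real code, and how TensorFlow breaks ties between equally frequent tokens is version-dependent:

```python
from collections import Counter

def adapt_sketch(texts, max_tokens):
    # Count token frequencies across the whole dataset
    counts = Counter()
    for text in texts:
        counts.update(text.lower().replace(".", "").split())
    # Reserve two slots: '' for padding (ID 0) and [UNK] (ID 1),
    # then keep only the max_tokens - 2 most frequent tokens
    kept = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + kept

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog slept.",
]
print(adapt_sketch(sentences, max_tokens=6))
```

With max_tokens=6, only the 4 most frequent words ('the', 'lazy', 'dog', and one of the single-occurrence words) survive; everything else will map to [UNK].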

The TextVectorization layer offers several configurable parameters to fine-tune its behavior:

  • max_tokens: As seen, this is the maximum size of the vocabulary. If you have a very large corpus, you might set this to 20,000 or 50,000 to capture most common words.
  • output_sequence_length: Determines the fixed length of the output integer sequences. If a tokenized sentence is longer, it’s truncated. If it’s shorter, it’s padded with the ID for the empty string (0).
  • standardize: Controls how text is cleaned before tokenization. Options include 'lower_and_strip_punctuation' (default), 'lower', 'strip_punctuation', or None.
  • split: Defines how tokens are separated. The default is 'whitespace', but you can use custom splitting logic.
  • ngrams: Allows you to generate n-grams (sequences of n tokens) instead of just individual tokens. For example, ngrams=2 would create bigrams.
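Of these, ngrams is the least intuitive: with ngrams=2 the layer emits unigrams and bigrams (multi-token grams joined by a space), not bigrams alone. A plain-Python sketch of that expansion (my own simplification; the ordering of the emitted grams here is illustrative, not necessarily TensorFlow’s):

```python
def ngram_expand(tokens, n):
    # Emit all grams of size 1 up to n, joining multi-token
    # grams with a single space
    grams = []
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[i:i + size]))
    return grams

print(ngram_expand(["the", "lazy", "dog"], 2))
# ['the', 'lazy', 'dog', 'the lazy', 'lazy dog']
```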

One aspect that often trips people up is how the max_tokens parameter interacts with the special tokens. If you set max_tokens=100, you don’t get 100 words. You get 98 words, plus the '' and [UNK] tokens. This means the 99th most frequent word in your data will be mapped to [UNK] if max_tokens is 100.

After vectorization, the output is a tensor of integers. These integers can then be fed directly into an embedding layer (tf.keras.layers.Embedding) in your neural network, which learns dense vector representations for each token ID.
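To make that hand-off concrete: an embedding layer is essentially a lookup table in which each token ID indexes one row of a weight matrix. A minimal sketch with plain Python lists and made-up numbers (the real tf.keras.layers.Embedding does this lookup as a tensor op and learns the rows by backpropagation):

```python
# Hypothetical 4-dimensional embeddings for a 5-token vocabulary
embedding_matrix = [
    [0.0, 0.0, 0.0, 0.0],  # ID 0: '' (padding)
    [0.1, 0.1, 0.1, 0.1],  # ID 1: [UNK]
    [0.9, 0.2, 0.4, 0.7],  # ID 2: 'the'
    [0.3, 0.8, 0.5, 0.1],  # ID 3: 'lazy'
    [0.6, 0.4, 0.9, 0.2],  # ID 4: 'dog'
]

def embed(token_ids):
    # Each integer ID simply selects its row of the matrix
    return [embedding_matrix[i] for i in token_ids]

print(embed([2, 3, 4, 0]))
```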

The next step after basic tokenization and encoding is often handling rare words or out-of-vocabulary terms more gracefully, perhaps through subword tokenization like WordPiece or SentencePiece, or by using pre-trained embeddings.
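As a taste of that direction, WordPiece-style tokenizers greedily match the longest known prefix of a word and mark continuation pieces with '##', so a rare word decomposes into known pieces instead of collapsing to [UNK]. A toy sketch of the greedy matching (the subword vocabulary here is invented for illustration):

```python
def wordpiece_sketch(word, subword_vocab):
    # Greedy longest-prefix matching, WordPiece-style
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker
            if piece in subword_vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece matched at this position
        start = end
    return pieces

subwords = {"jump", "##ing", "##s", "hop", "##ped"}
print(wordpiece_sketch("jumping", subwords))  # ['jump', '##ing']
print(wordpiece_sketch("hopped", subwords))   # ['hop', '##ped']
```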

Want structured learning?

Take the full TensorFlow course →