SQLite’s FTS5 module is surprisingly powerful. Its core strength lies less in raw speed than in how it represents text to make searching possible at all.

Let’s see FTS5 in action. Imagine a small books table:

CREATE TABLE books (
    id INTEGER PRIMARY KEY,
    title TEXT,
    author TEXT,
    content TEXT
);

To make content searchable with FTS5, we create a virtual table:

CREATE VIRTUAL TABLE books_fts USING fts5(title, author, content, tokenize='porter');

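Notice we’re not just indexing content. FTS5 indexes all specified columns. Now, populate it, copying id into the FTS table’s rowid so that search results can be joined back to books:

INSERT INTO books_fts (rowid, title, author, content)
SELECT id, title, author, content FROM books;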

Here’s a basic search:

SELECT title, author
FROM books_fts
WHERE content MATCH 'database';

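This query returns rows where the content column contains the word "database" (or, because of the stemming tokenizer, a word with the same stem, such as "databases"). The MATCH operator is the gateway to FTS5’s query language. FTS5’s documentation writes the same column-restricted search with the table name on the left and a column filter in the query string, as in books_fts MATCH 'content : database'; prefer that form if your SQLite build rejects a column name on the left of MATCH.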
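To try the steps so far end to end, here is a minimal sketch using Python’s built-in sqlite3 module. It assumes your Python’s SQLite library was compiled with FTS5; the two sample rows and the build_demo_db helper are invented for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with the books table and its FTS5 index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE books (
            id INTEGER PRIMARY KEY,
            title TEXT,
            author TEXT,
            content TEXT
        );
        CREATE VIRTUAL TABLE books_fts
            USING fts5(title, author, content, tokenize='porter');
    """)
    # Hypothetical sample rows, made up for this example.
    conn.executemany(
        "INSERT INTO books (title, author, content) VALUES (?, ?, ?)",
        [
            ("Storage Basics", "A. Writer", "How a database stores pages on disk."),
            ("Garden Notes", "B. Author", "Planting tomatoes in early spring."),
        ],
    )
    # Copy the rows into the FTS table, reusing books.id as the rowid.
    conn.execute(
        "INSERT INTO books_fts (rowid, title, author, content) "
        "SELECT id, title, author, content FROM books"
    )
    return conn


conn = build_demo_db()
# Table name on the left of MATCH; the "content :" prefix is a column filter.
rows = conn.execute(
    "SELECT title, author FROM books_fts "
    "WHERE books_fts MATCH 'content : database'"
).fetchall()
print(rows)
```

Only the first sample row mentions a database in its content, so it is the only one the query returns.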

The problem FTS5 solves is the naive approach to text search: SELECT * FROM books WHERE content LIKE '%database%'. This is slow on large tables because the leading wildcard prevents SQLite from using an ordinary index, so every row is read and its text scanned for the substring. FTS5 instead builds a searchable index ahead of time, comparable in purpose to a traditional B-tree index but keyed on words and their positions.
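One way to see the difference is to ask SQLite how it plans to run each query. The sketch below, again assuming an FTS5-enabled build, prints the EXPLAIN QUERY PLAN detail for both. The plan helper is a made-up name, and the exact wording of the plan text varies between SQLite versions, so treat the output as indicative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (
        id INTEGER PRIMARY KEY, title TEXT, author TEXT, content TEXT
    );
    CREATE VIRTUAL TABLE books_fts
        USING fts5(title, author, content, tokenize='porter');
""")


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    cursor = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[3] for row in cursor)


# LIKE with a leading wildcard: SQLite has to walk the whole table.
like_plan = plan("SELECT * FROM books WHERE content LIKE '%database%'")
# MATCH: the plan is delegated to an index chosen by the FTS5 virtual table.
fts_plan = plan("SELECT * FROM books_fts WHERE books_fts MATCH 'database'")

print(like_plan)
print(fts_plan)
```

The LIKE query is reported as a scan of books, while the MATCH query is reported as going through a virtual table index.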

Internally, FTS5 breaks your text into tokens. The tokenize='porter' option wraps the default unicode61 tokenizer with the Porter stemming algorithm, which strips common English suffixes. Words like "running" and "runs" are both reduced to the stem "run", so a search for "run" matches either of them. Stemming is purely rule-based, so an irregular form such as "ran" is left unchanged and will not match. The underlying unicode61 tokenizer is what discards punctuation and folds case.
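A quick experiment makes the stemming behaviour concrete. This sketch assumes an FTS5-enabled SQLite and uses a throwaway notes table with invented sentences. It searches for "run" and collects the rows that match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(body, tokenize='porter')")
conn.executemany(
    "INSERT INTO notes (body) VALUES (?)",
    [
        ("She is running late",),  # "running" stems to "run"
        ("He runs daily",),        # "runs" stems to "run"
        ("They ran home",),        # irregular form, stays "ran"
        ("RUN!",),                 # case folded, punctuation dropped
    ],
)

matched = [
    row[0]
    for row in conn.execute(
        "SELECT body FROM notes WHERE notes MATCH 'run' ORDER BY rowid"
    )
]
print(matched)
```

The "ran" sentence is the one that does not come back, because the suffix-stripping rules never map "ran" to "run".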

When you perform a MATCH query, FTS5 doesn’t scan your original content column. Instead, it consults its own index, which is structured as an inverted index. For every unique token (like "database", "run", "book"), the index stores a list of the rows that contain that token and, unless the table was created with a reduced detail option, the positions of the token within each row.

The MATCH operator parses the query string, tokenizes it using the same rules that were used to build the index, and then retrieves the rows that satisfy the query logic. For content MATCH 'database', it looks up the token for "database" in the inverted index and returns the rowids of all rows containing it.
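The idea of an inverted index is easy to sketch in plain Python. The toy below is a conceptual illustration only, not FTS5’s actual storage format. The tokenize, build_inverted_index, and lookup helpers are invented for this example, and the tokenizer is a crude stand-in with no stemming.

```python
from collections import defaultdict


def tokenize(text):
    """Crude tokenizer: lowercase, treat non-alphanumerics as separators."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def build_inverted_index(docs):
    """Map each token to {doc_id: [positions where the token occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(position)
    return index


def lookup(index, query):
    """Return the sorted ids of documents containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # Intersect the sets of document ids listed under each token.
    doc_sets = [set(index.get(token, {})) for token in tokens]
    return sorted(set.intersection(*doc_sets))


docs = {
    1: "The database is fast.",
    2: "A book about a database book.",
    3: "Gardening tips",
}
index = build_inverted_index(docs)

print(index["database"])                # which docs, and where in each
print(lookup(index, "database book"))   # docs containing both tokens
```

Answering a query never touches the original text: it is a dictionary lookup per token followed by a set intersection, which is why it stays fast as the collection grows.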

You can combine terms using boolean operators:

SELECT title
FROM books_fts
WHERE content MATCH 'database AND performance'; -- Documents containing both 'database' and 'performance'
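
Here is a small sketch of the boolean operators side by side, assuming an FTS5-enabled SQLite. The three rows and the titles helper are invented for illustration, and the queries search all columns rather than only content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE books_fts USING fts5(title, content, tokenize='porter')"
)
conn.executemany(
    "INSERT INTO books_fts (title, content) VALUES (?, ?)",
    [
        ("Tuning", "Database performance depends on indexes."),
        ("Intro", "A database stores structured data."),
        ("Racing", "Engine performance on the track."),
    ],
)


def titles(query):
    """Run an FTS5 query and return the matching titles in rowid order."""
    sql = "SELECT title FROM books_fts WHERE books_fts MATCH ? ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, (query,))]


both = titles("database AND performance")     # rows containing both terms
either = titles("database OR performance")    # rows containing at least one
only_db = titles("database NOT performance")  # first term but not the second

print(both, either, only_db)
```

The operators must be written in upper case; a lower-case and is treated as an ordinary search term.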

Or use proximity searches:

SELECT title
FROM books_fts
WHERE content MATCH 'NEAR(database performance, 5)'; -- 'database' and 'performance' with at most 5 tokens between them
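
FTS5 writes proximity queries in a function-style form, NEAR(term term, N), where N is the maximum number of tokens allowed between the terms and defaults to 10 when omitted. The sketch below assumes an FTS5-enabled SQLite; the docs table, its two rows, and the matching_rowids helper are invented to show where the cutoff falls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='porter')")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [
        # rowid 1: two tokens between the search terms
        ("database tuning improves performance",),
        # rowid 2: seven tokens between the search terms
        ("database one two three four five six seven performance",),
    ],
)


def matching_rowids(query):
    """Return the rowids that satisfy an FTS5 query, in rowid order."""
    sql = "SELECT rowid FROM docs WHERE docs MATCH ? ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, (query,))]


within_5 = matching_rowids("NEAR(database performance, 5)")
within_default = matching_rowids("NEAR(database performance)")

print(within_5, within_default)
```

With a limit of 5 only the first row qualifies; with the default limit both rows do.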

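FTS5 does not ship a built-in SQL function that tokenizes an arbitrary string, but the fts5vocab virtual table lets you inspect the tokens an index actually contains. For example, to list every term stored in books_fts along with the number of rows it appears in:

CREATE VIRTUAL TABLE books_fts_vocab USING fts5vocab(books_fts, row);
SELECT term, doc FROM books_fts_vocab ORDER BY term;

For a row whose text is "The quick brown fox jumps!", the Porter tokenizer contributes the terms brown, fox, jump, quick, and the.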
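
One way to check what really ends up in an index is the fts5vocab virtual table, which exposes the stored terms. This sketch assumes an FTS5-enabled SQLite and uses a throwaway demo table holding just the example sentence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE demo USING fts5(body, tokenize='porter')")
conn.execute("INSERT INTO demo (body) VALUES ('The quick brown fox jumps!')")

# An fts5vocab table of type "row" has one row per distinct term in the index.
conn.execute("CREATE VIRTUAL TABLE demo_vocab USING fts5vocab(demo, row)")

terms = [
    row[0]
    for row in conn.execute("SELECT term FROM demo_vocab ORDER BY term")
]
print(terms)
```

The listing shows the tokens after case folding, punctuation removal, and stemming, so "jumps!" appears as jump.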

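The most surprising aspect of FTS5 is how it handles ranking. Results are not sorted by relevance by default; without an ORDER BY clause the order is unspecified. To sort by relevance, add ORDER BY rank. The hidden rank column defaults to the built-in bm25() function, which scores a row from how often the query terms occur in it, the inverse document frequency of each term (rarer terms contribute more), and the row’s length relative to the average. Proximity between terms is not part of the score. bm25() returns lower (more negative) values for better matches, so a plain ascending ORDER BY rank puts the best match first. To weight columns differently, call the function directly, for instance ORDER BY bm25(books_fts, 10.0, 5.0, 1.0), or set the ranking function for a single query by adding AND rank MATCH 'bm25(10.0, 5.0, 1.0)' to the WHERE clause.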
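
To see relevance ordering in practice, the sketch below sorts matches with ORDER BY rank, which uses FTS5’s built-in bm25 scoring unless configured otherwise. It assumes an FTS5-enabled SQLite; the five rows are invented so that one short row repeats the search term and one long row mentions it once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE books_fts USING fts5(title, content)")
conn.executemany(
    "INSERT INTO books_fts (title, content) VALUES (?, ?)",
    [
        ("Dense", "database database database"),
        ("Sparse", "one database mention among many other words "
                   "about gardening and travel"),
        ("Cooking", "recipes for soup"),
        ("Hiking", "trails in the hills"),
        ("Music", "notes on jazz"),
    ],
)

# Ascending order on rank puts the best match first, because the
# built-in bm25 scores are lower (more negative) for better matches.
ranked = conn.execute(
    "SELECT title, rank FROM books_fts "
    "WHERE books_fts MATCH 'database' ORDER BY rank"
).fetchall()

for title, score in ranked:
    print(title, score)
```

The short row that repeats the term sorts ahead of the long row that mentions it once, and both scores are negative.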

The next step is to explore FTS5’s advanced features like prefix searching, phrase searching, and custom tokenizers.
