SQLite lets you define custom aggregate functions, which is way cooler than just using SUM() or AVG().
Let’s say we want to calculate the "variance" of a set of numbers. Standard SQL has AVG, SUM, COUNT, MIN, MAX, but not VARIANCE. We’ll build it.
First, we need a way for SQLite to call our custom function. This is done through the sqlite3_create_function API. In Python, we use the sqlite3 module’s create_aggregate method.
Here’s how we define the variance aggregate in Python:
import sqlite3
class Variance:
def __init__(self):
self.n = 0
self.mean = 0.0
self.m2 = 0.0
def step(self, value):
if value is None:
return
value = float(value)
self.n += 1
delta = value - self.mean
self.mean += delta / self.n
delta2 = value - self.mean
self.m2 += delta * delta2
def finalize(self):
if self.n < 2:
return None # Variance is undefined for less than 2 elements
return self.m2 / (self.n - 1) # Sample variance
conn = sqlite3.connect(':memory:')
conn.create_aggregate('variance', 1, Variance)
# Now let's use it
cursor = conn.cursor()
cursor.execute("CREATE TABLE numbers (val REAL)")
cursor.executemany("INSERT INTO numbers VALUES (?)", [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)])
conn.commit()
cursor.execute("SELECT variance(val) FROM numbers")
print(cursor.fetchone()[0])
cursor.execute("SELECT variance(val) FROM numbers WHERE val > 2")
print(cursor.fetchone()[0])
cursor.execute("SELECT variance(val) FROM numbers WHERE val > 10")
print(cursor.fetchone()[0])
This prints 2.5, 1.5, and None.
The Variance class is the core.
__init__initializes the state. We need to keep track of the count (n), the running mean (mean), and a sum of squared differences from the mean (m2). This is Welford’s online algorithm, which is numerically stable and avoids needing to store all values.step(self, value)is called for each row. It updates then,mean, andm2based on the currentvalue.finalize(self)is called once at the end. It computes the final variance from the accumulated state. For sample variance, we divide byn-1.
The conn.create_aggregate('variance', 1, Variance) call registers this Python class as a new SQL aggregate function named variance. The 1 indicates it takes one argument. SQLite then knows how to call the step and finalize methods when variance() is used in a query.
The CREATE TABLE and INSERT statements populate a simple table. The SELECT variance(val) FROM numbers query demonstrates its usage. The first call gives the variance of 1, 2, 3, 4, 5. The second filters to 3, 4, 5. The third filters to an empty set, correctly returning None because variance is undefined for fewer than two data points.
The beauty here is that the aggregation logic happens within the database engine, efficiently processing data as it’s scanned, without needing to pull intermediate results back to the application. You can define aggregates for any kind of stateful computation: string concatenation, statistical measures, or even custom JSON aggregation.
When you define a custom aggregate, you’re essentially providing a mini-program that the SQLite query engine can execute. This program has an initialization phase (__init__), an update phase for each row (step), and a finalization phase (finalize). The query engine manages the state of your aggregate object across rows, making it a powerful extension.
The most surprising thing is how easily you can bypass the standard SQL aggregate functions. You aren’t limited to what the SQL standard provides; you can implement complex logic directly in your database queries using Python (or C, if you’re embedding SQLite directly). This allows for significant optimization by keeping computation close to the data.
The next frontier is understanding how to handle ORDER BY within aggregate functions, which requires a different signature and a more complex state management.