The most surprising thing about graph database storage is that it’s fundamentally about minimizing data movement, not just about clever indexing.
Let’s look at Neo4j. Imagine you’re querying for friends of friends. In a traditional relational database, you’d be joining multiple tables, hopping back and forth between disk and memory, potentially reading the same data many times. Neo4j, however, stores data in a way that makes these "hops" incredibly efficient.
Here’s a simplified view of a Neo4j node and its relationships on disk. Conceptually, each node carries direct references to the relationships (edges) it participates in; in the actual store files, node records point into a separate relationship store, but the effect is the same: finding a node’s edges requires no index lookup.
// Conceptual representation, not actual file format
Node {
  id: 123,
  labels: ["Person"],
  properties: { name: "Alice" },
  relationships: [
    { type: "FRIENDS_WITH", direction: "OUT", target_node_id: 456 },
    { type: "WORKS_AT", direction: "OUT", target_node_id: 789 }
  ]
}
When Neo4j needs to find Alice’s friends, it reads Alice’s node record. The relationships point directly to the connected nodes. If a friend’s node is already in memory (thanks to recent access or caching), the lookup is near-instantaneous; if not, Neo4j knows exactly where on disk to find it from the target_node_id. This property is called "index-free adjacency": a traversal is direct pointer chasing through record references rather than a series of index lookups.
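This kind of traversal can be sketched in a few lines of Python. The in-memory dictionary below is a toy stand-in for the node store (Neo4j’s real record format chains relationships per node; the ids and names here just echo the conceptual example above):

```python
# Toy in-memory node store: node id -> record with outgoing adjacency lists.
# Illustrative only; not Neo4j's actual storage layout.
nodes = {
    123: {"name": "Alice", "out": {"FRIENDS_WITH": [456], "WORKS_AT": [789]}},
    456: {"name": "Bob",   "out": {"FRIENDS_WITH": [777]}},
    777: {"name": "Carol", "out": {}},
    789: {"name": "Acme",  "out": {}},
}

def friends_of_friends(node_id):
    """Two FRIENDS_WITH hops via direct id dereference -- no join, no index scan."""
    result = set()
    for friend in nodes[node_id]["out"].get("FRIENDS_WITH", []):
        for fof in nodes[friend]["out"].get("FRIENDS_WITH", []):
            result.add(nodes[fof]["name"])
    return result

print(friends_of_friends(123))  # {'Carol'}
```

Each hop is a constant-time dictionary dereference on the current node, which is the essence of index-free adjacency.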
TigerGraph takes a similar approach but with key differences in its underlying data model, often described as a "distributed native graph." Neo4j has traditionally been a single-server architecture whose clusters replicate the full graph to each instance rather than sharding it (sharding arrived later, via Fabric), whereas TigerGraph is designed from the ground up for partitioned, distributed execution.
Consider a query that spans multiple machines in a TigerGraph cluster. Each machine holds a partition of the graph. When a query needs to traverse a relationship that crosses these partitions, TigerGraph uses a sophisticated distributed messaging system. The "storage" here isn’t just about disk layout; it’s about how efficiently these distributed partitions can communicate and resolve traversals.
Here’s a glimpse of how a distributed traversal might work conceptually:
- Local Traversal: A query starts on machine A, finds a node, and needs to follow its outgoing edges.
- Partition Check: For each edge, TigerGraph checks if the target node resides on machine A or a different machine (B, C, etc.).
- Remote Fetch (if needed): If the target node is on machine B, a message is sent from A to B requesting the node’s data or the next hop from that node.
- Response & Continuation: Machine B processes the request and sends back the relevant information, allowing the traversal to continue.
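The four steps above can be simulated with a toy partition layout. The machine names, node ids, and message counting here are invented for illustration; TigerGraph’s actual messaging internals are considerably more involved:

```python
# Toy simulation of a distributed traversal (hypothetical layout, not
# TigerGraph's real protocol). Each "machine" owns a partition: node -> edges.
partitions = {
    "A": {1: [2, 3]},       # node 1 lives on A, with edges to nodes 2 and 3
    "B": {2: [4]},
    "C": {3: [], 4: []},
}

def owner(node_id):
    """Partition check: which machine holds this node?"""
    for machine, part in partitions.items():
        if node_id in part:
            return machine
    raise KeyError(node_id)

def traverse(start, local_machine):
    """Follow all outgoing edges, counting cross-machine 'remote fetches'."""
    messages = 0
    visited = []
    frontier = [(start, local_machine)]
    while frontier:
        node, machine = frontier.pop()
        visited.append(node)
        for target in partitions[machine][node]:
            target_machine = owner(target)
            if target_machine != machine:
                messages += 1  # remote fetch: an A -> B style round trip
            frontier.append((target, target_machine))
    return sorted(visited), messages

print(traverse(1, "A"))  # ([1, 2, 3, 4], 3)
```

Every edge whose target lives on another machine costs a message, which is exactly the overhead the next paragraph is about minimizing.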
The magic in TigerGraph’s storage is its ability to minimize the latency and overhead of these inter-machine communications. This involves strategies like intelligent data partitioning, efficient serialization of messages, and optimized network protocols.
The core problem both solve is the "join explosion" that hits relational systems when modeling highly connected data (the application-level "N+1 query" pattern is a symptom of the same mismatch). Instead of performing costly joins across separate tables, graph databases store relationships as direct pointers or references, so traversal cost scales with the number of relationships traversed, not with the total size of the dataset.
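A small sketch makes that scaling claim concrete. The edge table and counters below are invented for the example; the point is the asymptotic difference between scanning an edge table per hop and dereferencing a prebuilt adjacency list:

```python
# Contrast sketch: an unindexed join-style lookup scans the whole edge table
# per hop, while adjacency touches only the current node's own edges.
edge_table = [(1, 2), (1, 3), (2, 4), (5, 6), (6, 7), (7, 8)]  # (src, dst)

def neighbors_via_join(node):
    """Relational style without an index: every lookup scans the full table."""
    scanned, out = 0, []
    for src, dst in edge_table:
        scanned += 1
        if src == node:
            out.append(dst)
    return out, scanned

# Adjacency list built once; each lookup then costs O(degree), not O(|edges|).
adjacency = {}
for src, dst in edge_table:
    adjacency.setdefault(src, []).append(dst)

def neighbors_via_adjacency(node):
    out = adjacency.get(node, [])
    return out, len(out)

print(neighbors_via_join(1))       # ([2, 3], 6) -- scanned all 6 edges
print(neighbors_via_adjacency(1))  # ([2, 3], 2) -- touched only node 1's edges
```

Real databases mitigate the scan with indexes, but an index lookup per hop still grows with total data size, whereas adjacency cost grows only with the degree of the nodes actually visited.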
The levers you control are primarily around data modeling and query writing. For Neo4j, this means designing your node labels and relationship types effectively and understanding how Cypher queries translate into internal traversal plans. For TigerGraph, it involves partitioning strategies, understanding how your data is distributed across the cluster, and optimizing GSQL queries for distributed execution.
What most people don’t realize is that the performance benefits of graph databases are heavily influenced by the locality of data access, both within a single server (Neo4j) and across distributed nodes (TigerGraph). A query that jumps wildly across network partitions in a distributed system, even with efficient protocols, will be slower than one that stays local. This is why thoughtful partitioning and data modeling are paramount in distributed graph databases.
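One way to quantify that locality is the edge cut: the fraction of edges whose endpoints land on different machines. The tiny graph and placement maps below are invented for illustration, but the metric itself is a standard proxy for cross-partition traffic:

```python
# Edge-cut sketch: fraction of edges crossing partitions, a rough proxy for
# the inter-machine traffic a traversal-heavy workload will generate.
edges = [(1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (6, 4), (3, 4)]

def cross_partition_ratio(assignment):
    """assignment: node id -> machine id."""
    crossing = sum(1 for a, b in edges if assignment[a] != assignment[b])
    return crossing / len(edges)

# Naive round-robin placement splits both triangles across machines.
round_robin = {n: n % 2 for n in range(1, 7)}
# Community-aware placement keeps each triangle on a single machine.
community = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}

print(cross_partition_ratio(round_robin))  # 5/7: most hops go over the network
print(cross_partition_ratio(community))    # 1/7: only the bridge edge crosses
```

Same graph, same query workload, wildly different network cost, which is why partitioning is a first-class lever in distributed graph databases.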
The next frontier to explore is how these databases handle schema evolution and updates in a high-throughput, distributed environment.