The most surprising thing about data modeling is that the database technology you choose doesn’t actually dictate how you should model your data; your application’s needs do.

Let’s say you’re building a social media platform. Users post updates, and other users can "like" those posts.

Here’s how you might model that in SQL:

CREATE TABLE users (
    user_id INT PRIMARY KEY AUTO_INCREMENT,
    username VARCHAR(50) NOT NULL UNIQUE,
    email VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE posts (
    post_id INT PRIMARY KEY AUTO_INCREMENT,
    user_id INT NOT NULL,
    content TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(user_id)
);

CREATE TABLE likes (
    like_id INT PRIMARY KEY AUTO_INCREMENT,
    user_id INT NOT NULL,
    post_id INT NOT NULL,
    liked_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(user_id),
    FOREIGN KEY (post_id) REFERENCES posts(post_id),
    UNIQUE (user_id, post_id) -- Prevent duplicate likes
);

In this SQL model, each entity (users, posts, likes) gets its own table, and relationships are defined with foreign keys. When a user likes a post, a new row is inserted into the likes table. To count the likes on a post, you’d run SELECT COUNT(*) FROM likes WHERE post_id = ?. This model is normalized: data isn’t repeated unnecessarily, which is great for consistency and for preventing update anomalies.
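To make the normalized model concrete, here is a small runnable sketch using Python’s built-in sqlite3 module (SQLite stands in for MySQL here, so INTEGER PRIMARY KEY replaces AUTO_INCREMENT; the usernames and content are made up for illustration). It shows that a like is just an inserted row, that the UNIQUE constraint blocks duplicate likes at the database level, and that the like count is a simple aggregate query:

```python
import sqlite3

# Illustrative sketch: the same normalized schema, in SQLite dialect.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (
    user_id  INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE,
    email    TEXT NOT NULL UNIQUE
);
CREATE TABLE posts (
    post_id    INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    content    TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE likes (
    like_id  INTEGER PRIMARY KEY,
    user_id  INTEGER NOT NULL REFERENCES users(user_id),
    post_id  INTEGER NOT NULL REFERENCES posts(post_id),
    liked_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (user_id, post_id)  -- prevent duplicate likes
);
""")

conn.execute("INSERT INTO users (username, email) VALUES ('alice', 'a@example.com')")
conn.execute("INSERT INTO users (username, email) VALUES ('bob', 'b@example.com')")
conn.execute("INSERT INTO posts (user_id, content) VALUES (1, 'Just had a great day!')")

# Liking a post is just an insert; the UNIQUE constraint rejects repeats.
conn.execute("INSERT INTO likes (user_id, post_id) VALUES (2, 1)")
try:
    conn.execute("INSERT INTO likes (user_id, post_id) VALUES (2, 1)")
except sqlite3.IntegrityError:
    pass  # duplicate like blocked by the database, not by application code

count = conn.execute(
    "SELECT COUNT(*) FROM likes WHERE post_id = ?", (1,)
).fetchone()[0]
print(count)  # 1
```

Note that the duplicate-like rule lives in the schema itself, so every code path that writes to the likes table gets it for free.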

Now, consider the same scenario with NoSQL, specifically a document database like MongoDB. If your primary access pattern is fetching a post and all its associated data (including who liked it), you might embed the likes within the post document:

{
  "_id": "post123",
  "userId": "userABC",
  "content": "Just had a great day!",
  "createdAt": "2023-10-27T10:00:00Z",
  "likes": [
    {
      "userId": "userXYZ",
      "likedAt": "2023-10-27T10:05:00Z"
    },
    {
      "userId": "userPQR",
      "likedAt": "2023-10-27T10:15:00Z"
    }
  ]
}

Here, the likes are an array embedded directly in the post document. To get all likes for a post, you’d query the posts collection and read the likes array: db.posts.findOne({ _id: "post123" }). This is denormalized. It’s fantastic for read performance when you need all related data together, as it avoids joins. However, each new like rewrites a potentially large document, and to find all posts a specific user liked, you’d have to scan the entire posts collection.
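The trade-off is easy to see in code. This hedged sketch uses plain Python dicts to stand in for documents (a real system would go through a driver such as pymongo; the post IDs and user IDs are invented). The read path is one lookup, the write path rewrites part of a growing document, and the reverse query becomes a full scan:

```python
# Plain dicts simulating an embedded-document collection.
posts = [
    {"_id": "post123", "userId": "userABC", "content": "Just had a great day!",
     "likes": [{"userId": "userXYZ"}, {"userId": "userPQR"}]},
    {"_id": "post456", "userId": "userDEF", "content": "Another update",
     "likes": [{"userId": "userXYZ"}]},
]

# Read path: one lookup returns the post and all of its likes together.
post = next(p for p in posts if p["_id"] == "post123")
like_count = len(post["likes"])  # no join needed

# Write path: liking a post rewrites (part of) a potentially large document.
post["likes"].append({"userId": "userNEW"})

# Reverse query: "all posts userXYZ liked" forces a scan of every document,
# because the likes live inside posts rather than in their own collection.
liked_by_xyz = [p["_id"] for p in posts
                if any(l["userId"] == "userXYZ" for l in p["likes"])]
print(like_count, liked_by_xyz)  # 2 ['post123', 'post456']
```

In MongoDB itself the reverse query would need either a multikey index on likes.userId or a separate likes collection, which is exactly the normalization decision resurfacing.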

A hybrid approach might use SQL for user profiles and post metadata (ensuring referential integrity) but use a NoSQL store (like Redis) for high-volume, ephemeral data like "likes" or "view counts" that benefit from extremely fast in-memory operations and don’t require complex relational querying.
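As a rough sketch of that split, the snippet below keeps the durable entities in the relational store and routes the hot like counters through set operations. A plain dict stands in for Redis here; in production you’d use a client such as redis-py, and the comments name the real Redis commands each function would map to:

```python
# Source of truth: relational tables hold users and posts (stubbed here).
sql_posts = {1: {"author": "alice", "content": "Just had a great day!"}}

redis_store = {}  # stand-in for a Redis instance

def like(user_id: str, post_id: int) -> None:
    # Maps to: SADD likes:{post_id} {user_id} -- set semantics dedupe for free
    redis_store.setdefault(f"likes:{post_id}", set()).add(user_id)

def like_count(post_id: int) -> int:
    # Maps to: SCARD likes:{post_id} -- O(1) cardinality, no COUNT(*) query
    return len(redis_store.get(f"likes:{post_id}", set()))

like("bob", 1)
like("carol", 1)
like("bob", 1)  # duplicate like, silently absorbed by the set
print(like_count(1))  # 2
```

The design choice is the same as the SQL UNIQUE constraint, expressed as a data-structure property instead of a schema rule.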

The core problem data modeling solves is efficiently storing and retrieving data to meet application requirements. SQL’s strength lies in its ACID properties and ability to enforce complex relationships, making it ideal for transactional systems where data integrity is paramount (e.g., financial transactions, inventory management). NoSQL, in its various forms (document, key-value, column-family, graph), excels in scenarios demanding high scalability, flexible schemas, and specific access patterns, such as real-time analytics, content management, or IoT data ingestion.

A hybrid model leverages the strengths of both. For instance, you might store user account details and their primary posts in a relational database for strong consistency and complex querying. However, a user’s "feed" – a dynamic, personalized list of posts from people they follow – could be generated and stored in a denormalized fashion in a NoSQL database or even a specialized graph database to optimize for fast retrieval of this highly interconnected data.
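A common way to build such a feed is fan-out-on-write: when someone posts, the post ID is pushed onto a precomputed, denormalized feed list for each follower, so reading a feed becomes a single key lookup instead of a multi-table query. This is a hypothetical sketch (the follower graph and IDs are assumed data, not part of the example above):

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"]}  # who follows whom (assumed data)
feeds = defaultdict(list)                # follower -> newest-first post IDs

def publish(author: str, post_id: str) -> None:
    # Fan-out-on-write: copy the post ID into every follower's feed.
    for follower in followers.get(author, []):
        feeds[follower].insert(0, post_id)  # denormalized copy per reader

publish("alice", "post123")
publish("alice", "post789")
print(feeds["bob"])  # ['post789', 'post123']
```

The cost is extra writes per post, proportional to follower count, which is why very-high-follower accounts often fall back to fan-out-on-read.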

Many developers get hung up on choosing "the right database" before understanding their data and access patterns. The reality is that your application’s specific read and write patterns, query complexity, scalability needs, and consistency requirements are the true drivers of your data model decisions, not the database technology itself. Sometimes, a single SQL database can be modeled (using different tables, indexes, and views) to handle a wide range of access patterns, and other times, a polyglot persistence strategy using multiple NoSQL databases or a mix of SQL and NoSQL is the only way to achieve optimal performance and scalability.

When you move from modeling a single piece of data to modeling relationships between many pieces of data, you’ll start encountering the challenges of query performance and the trade-offs between consistency and availability.

Want structured learning?

Take the full System Design course →