epoll and kqueue aren’t just about doing I/O faster; they fundamentally change how you think about managing network connections.

Imagine you’re a waiter at a restaurant, and each table is a client connection. The old way (like select or poll) is like the waiter having to walk to every single table after every order to see if anyone needs anything. It’s exhausting and inefficient, especially when most tables are just sitting there.

// Old way: "Are you ready?" to everyone, all the time.
int max_fd = 0;
fd_set read_fds;

while (true) {
    // select() modifies the set in place, so it must be rebuilt every iteration.
    FD_ZERO(&read_fds);
    // ... FD_SET() every connected socket, tracking the highest fd in max_fd ...

    // This call blocks until *any* fd is ready, but doesn't hand us a list of them.
    int activity = select(max_fd + 1, &read_fds, NULL, NULL, NULL);

    if (activity < 0) { /* handle error */ }

    // Now, scan *all* fds to find out which ones are ready.
    for (int i = 0; i <= max_fd; i++) {
        if (FD_ISSET(i, &read_fds)) {
            // This fd is ready! Process it.
            // ... read/write ...
        }
    }
}

This select call is the bottleneck. On every iteration, both the kernel and your application have to scan every file descriptor (fd) you’ve registered, even if only one is active. That’s O(n) work per wakeup, and fd_set is capped at FD_SETSIZE (commonly 1024) descriptors, so this approach scales poorly as the number of connections grows.

Now, enter epoll (Linux) and kqueue (BSD/macOS). These are like having a smart notification system. Instead of checking every table, the waiter is given a buzzer for each table. When a buzzer rings, the waiter only goes to that specific table.

epoll in Action (Linux)

epoll uses a data structure managed by the kernel to track file descriptors you’re interested in. You don’t pass a big list of fds on every call; you tell epoll which fds to watch, and it tells you which ones are ready.

  1. Create an epoll instance:

    // There's no shell command for this; it's a syscall.
    // In C:
    int epfd = epoll_create1(0);
    

    This gives you a new epoll file descriptor (epfd). Think of this as your central notification hub.

  2. Add file descriptors to watch:

    // In C:
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLET; // Event to watch for (EPOLLIN = data ready to read)
                                   // EPOLLET = edge-triggered mode (more on this later)
    // Add a listening socket (e.g., for incoming connections)
    ev.data.fd = listen_fd;        // Stored by the kernel, handed back with each event
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
    // Add a connected client socket
    ev.data.fd = client_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
    

    You register your listen_fd and any client_fds you’ve accepted. You tell epoll what you’re interested in (e.g., EPOLLIN for incoming data).

  3. Wait for events:

    // In C:
    struct epoll_event events[MAX_EVENTS]; // Array to store ready events
    int num_events = epoll_wait(epfd, events, MAX_EVENTS, -1); // -1 means block indefinitely
    

    This is the magic. epoll_wait blocks until at least one of the registered fds has an event. It returns the number of events that occurred and populates the events array with the specific fds and their event types.

  4. Process ready events:

    // In C:
    for (int i = 0; i < num_events; i++) {
        if (events[i].data.fd == listen_fd) {
            // New connection! Accept it, make it non-blocking (required for
            // edge-triggered mode), and add it to epoll.
            int client_fd = accept(listen_fd, NULL, NULL);
            fcntl(client_fd, F_SETFL, fcntl(client_fd, F_GETFL, 0) | O_NONBLOCK);
            struct epoll_event new_ev;
            new_ev.events = EPOLLIN | EPOLLET;
            new_ev.data.fd = client_fd;
            epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &new_ev);
        } else {
            // Existing client has data to read or is ready to write.
            int fd = events[i].data.fd;
            if (events[i].events & EPOLLIN) {
                // Read data from fd
                char buffer[1024];
                ssize_t bytes_read = read(fd, buffer, sizeof(buffer));
                if (bytes_read > 0) {
                    // Process data...
                } else if (bytes_read == 0) {
                    // Client closed connection. Remove from epoll and close.
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                } else {
                    // Read error. Handle and potentially remove.
                }
            }
            // Handle other events like EPOLLOUT (write ready)
        }
    }
    

    You iterate only through the events that actually happened. This is a massive performance win.
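All four steps can be exercised end-to-end without any sockets; a pipe works just as well for a demonstration. The sketch below (Linux-only, since epoll is Linux-specific; the function name `epoll_pipe_demo` is invented here) writes to one end of a pipe and lets epoll report the other end as ready:

```c
#include <sys/epoll.h>
#include <unistd.h>

// End-to-end run of the four steps, using a pipe instead of sockets.
// Returns the number of bytes delivered through the epoll notification,
// or -1 on any failure.
ssize_t epoll_pipe_demo(void) {
    int p[2];
    if (pipe(p) < 0) return -1;          // p[0] = read end, p[1] = write end

    // 1. Create the epoll instance.
    int epfd = epoll_create1(0);
    if (epfd < 0) return -1;

    // 2. Register the read end for readability (level-triggered here).
    struct epoll_event ev = {0};
    ev.events = EPOLLIN;
    ev.data.fd = p[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev);

    // Make the read end "ready" by writing to the other end.
    write(p[1], "ping", 4);

    // 3. Wait: returns as soon as at least one registered fd is ready.
    struct epoll_event events[8];
    int n = epoll_wait(epfd, events, 8, 1000 /* ms timeout */);
    if (n != 1) return -1;

    // 4. Process: only the ready fd is reported; no scanning of all fds.
    char buf[16];
    ssize_t got = read(events[0].data.fd, buf, sizeof(buf));

    close(p[0]); close(p[1]); close(epfd);
    return got;
}
```

Note that epoll_wait reports exactly one ready fd here, and `events[0].data.fd` hands back the identifier we stashed at registration time.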

kqueue in Action (BSD/macOS)

kqueue is conceptually similar but has a slightly different API. It’s more general-purpose and can monitor more than just network sockets (e.g., file system events, process status).

  1. Create a kqueue instance:

    // In C:
    int kq = kqueue();
    

    This gives you a kq file descriptor.

  2. Register events: kqueue registers events through kevent(): you fill in one or more struct kevent entries (the "changelist") and submit them in a single call.

    // In C:
    struct kevent change;
    // For a listening socket: EVFILT_READ watches for data to read (e.g., incoming connections)
    // EV_ADD: Add this event filter.
    // EV_ENABLE: Enable it.
    // EV_SET: A macro that fills in the struct kevent fields.
    EV_SET(&change, listen_fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL); // last arg is udata (user data)
    
    // Submit the changelist to the kqueue. Passing NULL, 0 for the event
    // list means "register only, don't wait for events yet".
    int nev = kevent(kq, &change, 1, NULL, 0, NULL);
    if (nev < 0) { /* handle error */ }
    

    You register EVFILT_READ on your listen_fd. kqueue uses "filters" (EVFILT_READ, EVFILT_WRITE, EVFILT_VNODE for file system, etc.).

  3. Wait for events:

    // In C:
    struct kevent events[MAX_EVENTS];
    // The last parameter (NULL) is a timeout. NULL means block indefinitely.
    int num_events = kevent(kq, NULL, 0, events, MAX_EVENTS, NULL);
    if (num_events < 0) { /* handle error */ }
    

    kevent is used for both registering and waiting. Passing NULL, 0 as the changelist means "don’t register anything"; the events array and MAX_EVENTS tell it where to deliver ready events and how many at most.

  4. Process ready events:

    // In C:
    for (int i = 0; i < num_events; i++) {
        if (events[i].ident == listen_fd) { // ident is the file descriptor or other identifier
            // New connection! Accept it.
            int client_fd = accept(listen_fd, NULL, NULL);
    
            // Now, register this new client_fd with kqueue to watch for reads.
            struct kevent new_ev;
            EV_SET(&new_ev, client_fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);
            kevent(kq, &new_ev, 1, NULL, 0, NULL); // Submit the registration
        } else {
            // Existing client has data to read.
            int fd = (int)events[i].ident; // ident is a uintptr_t, so cast back to an fd
            if (events[i].filter == EVFILT_READ) {
                char buffer[1024];
                ssize_t bytes_read = read(fd, buffer, sizeof(buffer));
                if (bytes_read > 0) {
                    // Process data...
                } else if (bytes_read == 0) {
                    // Client closed connection. Remove from kqueue and close.
                    struct kevent del_ev;
                    EV_SET(&del_ev, fd, EVFILT_READ, EV_DELETE, 0, 0, NULL);
                    kevent(kq, &del_ev, 1, NULL, 0, NULL); // Submit deletion
                    close(fd);
                } else {
                    // Read error. Handle and potentially remove.
                }
            }
            // Handle EVFILT_WRITE, etc.
        }
    }
    

    You check the ident (the fd) and filter to see what happened. If it’s a new connection, you accept it and then register the new client_fd with kqueue itself.

Edge-Triggered vs. Level-Triggered

This is a crucial distinction, especially for epoll.

  • Level-Triggered (LT): This is the default and matches the behavior of select/poll. epoll_wait keeps reporting an fd as ready for as long as the condition holds. If you read only part of the pending data, epoll_wait will report EPOLLIN again on the next call. This is safer and easier to get right, but can cause redundant wakeups if you’re not careful (e.g., leaving EPOLLOUT registered on a socket that is almost always writable makes every epoll_wait call return immediately).
  • Edge-Triggered (ET): When epoll_wait reports an event, it means the state changed. If you read data, epoll_wait will not report EPOLLIN again until new data arrives. This requires you to read/write until you get EAGAIN (or EWOULDBLOCK) to ensure you’ve consumed all available data or written all you can. This is more efficient because you get fewer notifications, but it’s more complex to implement correctly. If you miss an event (e.g., you don’t read all the data before the next epoll_wait), you might never get notified about that data again until more arrives.

The common pattern for high-performance servers is to use ET mode with epoll and ensure you read/write in a loop until EAGAIN.
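That read-until-EAGAIN loop can be sketched as a small helper (a sketch; it assumes the fd is already non-blocking, and `drain_fd` is a name invented here):

```c
#include <errno.h>
#include <unistd.h>

// Drain an edge-triggered, non-blocking fd: keep reading until the kernel
// says there is nothing left (EAGAIN/EWOULDBLOCK). Returns total bytes read,
// 0 if nothing was pending or the peer closed, or -1 on a real error.
ssize_t drain_fd(int fd) {
    char buffer[1024];
    ssize_t total = 0;
    for (;;) {
        ssize_t n = read(fd, buffer, sizeof(buffer));
        if (n > 0) {
            total += n;              // process buffer[0..n) here
        } else if (n == 0) {
            return total;            // peer closed the connection
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return total;            // fully drained; safe to wait again
        } else if (errno != EINTR) {
            return -1;               // real error
        }                            // EINTR: interrupted, just retry
    }
}
```

Skipping the loop and reading once per notification is the classic ET bug: the leftover bytes sit in the kernel buffer with no further EPOLLIN until new data arrives.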

The Mental Model

The core idea is shifting the burden of polling from the application to the kernel. Instead of your application constantly asking "Is anything ready?", the kernel tells your application "This is ready."

This is why epoll and kqueue scale to hundreds of thousands or millions of connections. The epoll_wait or kevent call itself is O(1) in terms of the number of connections – it just checks its internal event queue. The work is proportional to the number of active connections, not the total number of connections.

The "state" of which connections are being watched is maintained by the kernel in the epoll or kqueue data structure. Your application only needs to track the state of connections that are actively doing something.

The most surprising thing is how kqueue consolidates event registration and retrieval into a single syscall. You use kevent for both "register this filter" and "tell me what’s happened." This unified interface for diverse event types (sockets, files, processes) is a powerful design choice that makes it incredibly flexible.

The next hurdle is understanding how to properly manage resource cleanup when connections are closed or errors occur, ensuring you always remove event registrations from epoll or kqueue.
