epoll and kqueue aren’t just about doing I/O faster; they fundamentally change how you think about managing network connections.
Imagine you’re a waiter at a restaurant, and each table is a client connection. The old way (like select or poll) is like the waiter having to walk to every single table after every order to see if anyone needs anything. It’s exhausting and inefficient, especially when most tables are just sitting there.
```c
// Old way: "Are you ready?" to everyone, all the time.
int max_fd = 0;
fd_set read_fds;

while (true) {
    // select() modifies the set in place, so it must be rebuilt on every iteration.
    FD_ZERO(&read_fds);
    // ... FD_SET() every connected socket into read_fds and track max_fd ...

    // This call blocks until *any* activity, but we don't know *which* fd.
    int activity = select(max_fd + 1, &read_fds, NULL, NULL, NULL);
    if (activity < 0) { /* handle error */ }

    // Now, loop through *all* fds to find out which ones are ready.
    for (int i = 0; i <= max_fd; i++) {
        if (FD_ISSET(i, &read_fds)) {
            // This fd is ready! Process it.
            // ... read/write ...
        }
    }
}
```
This `select` call is the bottleneck. On every call, the kernel scans every file descriptor (fd) you've passed in, even if only one is active, and then your application scans the whole set again to find the ready ones. That per-call linear scan scales poorly as the number of connections grows.
Now, enter epoll (Linux) and kqueue (BSD/macOS). These are like having a smart notification system. Instead of checking every table, the waiter is given a buzzer for each table. When a buzzer rings, the waiter only goes to that specific table.
epoll in Action (Linux)
epoll uses a data structure managed by the kernel to track file descriptors you’re interested in. You don’t pass a big list of fds on every call; you tell epoll which fds to watch, and it tells you which ones are ready.
- Create an epoll instance:

  ```c
  // There's no shell command for this; it's a syscall.
  int epfd = epoll_create1(0);
  ```

  This gives you a new epoll file descriptor (`epfd`). Think of this as your central notification hub.
- Add file descriptors to watch:

  ```c
  struct epoll_event ev;
  ev.events = EPOLLIN | EPOLLET; // EPOLLIN = data ready to read
                                 // EPOLLET = edge-triggered mode (more on this later)

  // Add a listening socket (e.g., for incoming connections).
  ev.data.fd = listen_fd;
  epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

  // Add a connected client socket.
  ev.data.fd = client_fd;
  epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
  ```

  You register your `listen_fd` and any `client_fd`s you've accepted, and you tell epoll what you're interested in (e.g., `EPOLLIN` for incoming data). Setting `ev.data.fd` is what lets you recover the fd when the event comes back.
- Wait for events:

  ```c
  struct epoll_event events[MAX_EVENTS]; // Array to receive ready events
  int num_events = epoll_wait(epfd, events, MAX_EVENTS, -1); // -1 means block indefinitely
  ```

  This is the magic. `epoll_wait` blocks until at least one of the registered fds has an event. It returns the number of events that occurred and populates the `events` array with the specific fds and their event types.
- Process ready events:

  ```c
  for (int i = 0; i < num_events; i++) {
      if (events[i].data.fd == listen_fd) {
          // New connection! Accept it and add the new client_fd to epoll.
          int client_fd = accept(listen_fd, NULL, NULL);
          struct epoll_event new_ev;
          new_ev.events = EPOLLIN | EPOLLET;
          new_ev.data.fd = client_fd;
          epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &new_ev);
      } else {
          // Existing client has data to read or is ready to write.
          int fd = events[i].data.fd;
          if (events[i].events & EPOLLIN) {
              // Read data from fd.
              char buffer[1024];
              ssize_t bytes_read = read(fd, buffer, sizeof(buffer));
              if (bytes_read > 0) {
                  // Process data...
              } else if (bytes_read == 0) {
                  // Client closed the connection. Remove from epoll and close.
                  epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                  close(fd);
              } else {
                  // Read error. Handle and potentially remove.
              }
          }
          // Handle other events like EPOLLOUT (write ready).
      }
  }
  ```

  You iterate only through the events that actually happened. This is a massive performance win.
kqueue in Action (BSD/macOS)
kqueue is conceptually similar but has a slightly different API. It’s more general-purpose and can monitor more than just network sockets (e.g., file system events, process status).
- Create a kqueue instance:

  ```c
  int kq = kqueue();
  ```

  This gives you a `kq` file descriptor.
- Register events:

  kqueue uses a two-step process: you fill in `struct kevent` entries describing what you want, then submit them via `kevent()`.

  ```c
  struct kevent change;
  // For a listening socket, EVFILT_READ fires when there's something to read
  // (here, an incoming connection).
  // EV_ADD adds this event filter; EV_ENABLE enables it.
  // EV_SET is a helper macro that fills in the struct; the final NULL is user data.
  EV_SET(&change, listen_fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);

  // Submit the change list to the kqueue.
  // The NULL, 0 in the eventlist slots means "don't wait for events yet."
  int nev = kevent(kq, &change, 1, NULL, 0, NULL);
  if (nev < 0) { /* handle error */ }
  ```

  You register `EVFILT_READ` on your `listen_fd`. kqueue uses "filters" (`EVFILT_READ`, `EVFILT_WRITE`, `EVFILT_VNODE` for file system events, etc.).
- Wait for events:

  ```c
  struct kevent events[MAX_EVENTS];
  // The NULL, 0 in the changelist slots means "don't register anything, just wait."
  // The final NULL is a timeout; NULL means block indefinitely.
  int num_events = kevent(kq, NULL, 0, events, MAX_EVENTS, NULL);
  if (num_events < 0) { /* handle error */ }
  ```

  `kevent` is used for both registering and waiting; here it waits for up to `MAX_EVENTS` events.
- Process ready events:

  ```c
  for (int i = 0; i < num_events; i++) {
      if (events[i].ident == listen_fd) { // ident is the fd (or other identifier)
          // New connection! Accept it.
          int client_fd = accept(listen_fd, NULL, NULL);
          // Register the new client_fd with kqueue to watch for reads.
          struct kevent new_ev;
          EV_SET(&new_ev, client_fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);
          kevent(kq, &new_ev, 1, NULL, 0, NULL); // Submit the registration
      } else {
          // Existing client has data to read.
          int fd = (int)events[i].ident;
          if (events[i].filter == EVFILT_READ) {
              char buffer[1024];
              ssize_t bytes_read = read(fd, buffer, sizeof(buffer));
              if (bytes_read > 0) {
                  // Process data...
              } else if (bytes_read == 0) {
                  // Client closed the connection. Remove from kqueue and close.
                  struct kevent del_ev;
                  EV_SET(&del_ev, fd, EVFILT_READ, EV_DELETE, 0, 0, NULL);
                  kevent(kq, &del_ev, 1, NULL, 0, NULL); // Submit the deletion
                  close(fd);
              } else {
                  // Read error. Handle and potentially remove.
              }
          }
          // Handle EVFILT_WRITE, etc.
      }
  }
  ```

  You check the `ident` (the fd) and `filter` to see what happened. If it's a new connection, you accept it and then register the new `client_fd` with kqueue itself.
Edge-Triggered vs. Level-Triggered
This is a crucial distinction, especially for epoll.
- Level-Triggered (LT): This is epoll's default and the behavior of `select`/`poll`. `epoll_wait` will keep telling you an event is pending until you've fully processed it: if you read only part of the data, `epoll_wait` will report `EPOLLIN` again on the next call. This is safer and easier to get right, though it means redundant wakeups if you leave data unconsumed.
- Edge-Triggered (ET): When `epoll_wait` reports an event, it means the state *changed*. If you read some data, `epoll_wait` will not report `EPOLLIN` again until *new* data arrives. This requires you to read/write until you get `EAGAIN` (or `EWOULDBLOCK`) to ensure you've consumed all available data or written all you can. You get fewer notifications, which is more efficient, but it's more complex to implement correctly: if you don't drain the socket before the next `epoll_wait`, you might never be notified about the leftover data until more arrives.
The common pattern for high-performance servers is to use ET mode with epoll and ensure you read/write in a loop until EAGAIN.
The Mental Model
The core idea is shifting the burden of polling from the application to the kernel. Instead of your application constantly asking "Is anything ready?", the kernel tells your application "This is ready."
This is why epoll and kqueue scale to hundreds of thousands or millions of connections. An `epoll_wait` or `kevent` call doesn't rescan the full interest list; it just drains the kernel's internal ready queue. The work per wakeup is proportional to the number of active connections, not the total number of registered ones.
The "state" of which connections are being watched is maintained by the kernel in the epoll or kqueue data structure. Your application only needs to track the state of connections that are actively doing something.
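This "register once, wait many times" model can be seen concretely in a toy example: one fd registered a single time, then observed across repeated `epoll_wait` calls with no re-registration. (A sketch: `wait_twice` is an illustrative name, and a pipe stands in for a socket.)

```c
#include <sys/epoll.h>
#include <unistd.h>

// Registers one fd exactly once, then waits twice without re-registering.
// Returns the number of epoll_wait calls that reported the fd ready.
static int wait_twice(void) {
    int p[2];
    if (pipe(p) < 0) return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev); // kernel now tracks this fd

    int ready = 0;
    for (int round = 0; round < 2; round++) {
        write(p[1], "x", 1);                   // make the read end readable
        struct epoll_event out;
        if (epoll_wait(epfd, &out, 1, 1000) == 1 && out.data.fd == p[0])
            ready++;
        char c;
        read(p[0], &c, 1);                     // drain so the next round starts clean
    }
    close(p[0]); close(p[1]); close(epfd);
    return ready;
}
```

Contrast this with the `select` loop at the top, where the interest set lives in the application and has to be rebuilt and re-passed on every single call.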
The most surprising thing is how kqueue consolidates event registration and retrieval into a single syscall. You use kevent for both "register this filter" and "tell me what’s happened." This unified interface for diverse event types (sockets, files, processes) is a powerful design choice that makes it incredibly flexible.
The next hurdle is understanding how to properly manage resource cleanup when connections are closed or errors occur, ensuring you always remove event registrations from epoll or kqueue.
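As a starting point for that cleanup discipline, a tiny helper makes the deregistration step hard to forget. (The helper name is mine. Linux does drop an fd from an epoll interest list once the last reference to its open file description is closed, but with `dup`'ed fds that moment is hard to reason about, so an explicit `EPOLL_CTL_DEL` is the safe habit.)

```c
#include <sys/epoll.h>
#include <unistd.h>

// Always deregister before closing; the epoll_ctl may fail with ENOENT
// if the fd was already removed, which is harmless here.
static void close_connection(int epfd, int fd) {
    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
    close(fd);
}
```

Routing every teardown path (EOF, read error, write error, timeout) through one helper like this also keeps the epoll set and your own per-connection state from drifting apart.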