nginx_thundering_herd_accept

Summary

  • Old issue: every thread or process blocked in accept() on the same socket was woken when a connection arrived. Fixed in kernel 2.6 and later.

  • select/epoll had a similar issue to accept(). Also fixed as of some kernel version.

  • epoll_create before fork: same issue as accept(), also fixed as of some kernel version, but it has another problem. Never do this; see below.

  • epoll_create after fork: still has the thundering herd issue. See below for how nginx solves it.

epoll_create before fork (never use it)

If epoll_create is called before fork, all child processes share the same epoll instance in the kernel. When a connection arrives, the kernel wakes only one of the processes waiting on that instance (the waiters are queued with WQ_FLAG_EXCLUSIVE); the others stay asleep, so there is no thundering herd.

The big problem with this setup is that when one process creates an fd and adds it to the shared epoll instance, any of the processes may be woken when an event fires on it. A process CANNOT add a private fd to epoll that only it gets notified about.
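The effect can be observed with a small Linux-only Python sketch (Python's select.epoll wraps the epoll syscalls; this is not nginx code). The function name and the two helper pipes are made up for the illustration. The child registers a socket that only it owns, yet the parent, polling the shared instance, receives the event for it.

```python
import os
import select
import socket


def parent_sees_child_fd(timeout=2.0):
    """Return (fd number registered by the child, events seen by the parent)."""
    ep = select.epoll()                   # created BEFORE fork: shared by both processes
    ready_r, ready_w = os.pipe()          # child -> parent: "fd N is registered"
    done_r, done_w = os.pipe()            # parent -> child: "you may exit now"
    pid = os.fork()
    if pid == 0:
        try:
            a, b = socket.socketpair()    # these descriptors exist only in the child
            ep.register(a.fileno(), select.EPOLLIN)
            b.send(b"x")                  # make `a` readable
            os.write(ready_w, str(a.fileno()).encode())
            os.read(done_r, 1)            # keep `a` open until the parent has polled
        finally:
            os._exit(0)
    os.close(ready_w)
    child_fd = int(os.read(ready_r, 16))
    events = ep.poll(timeout)             # the parent waits on the shared instance
    os.write(done_w, b"x")
    os.waitpid(pid, 0)
    for fd in (ready_r, done_r, done_w):
        os.close(fd)
    ep.close()
    return child_fd, events
```

The parent gets an event carrying a descriptor number that belongs to the child's fd table, which it cannot read or accept from. This is the reason the shared-instance layout is unsuitable for per-worker connections.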

epoll_create after fork

This is what nginx uses. Each worker needs private fds: an established connection added to epoll should be processed only by the worker that created it, not by any other worker. So each worker has its own epoll instance.

Each worker adds the listening fd (created by the master, mostly for a VIP) to its own epoll instance, so without some countermeasure every worker could be woken when a new connection is set up.
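The unmitigated case can be reproduced with a short Linux-only Python sketch. It is an illustration under simplifying assumptions, not nginx code: the function name and the pipes are hypothetical, and the workers deliberately never call accept(), so the pending connection stays visible to all of them.

```python
import os
import select
import socket


def count_woken_workers(num_workers=3, timeout=2.0):
    """Return how many workers were woken by a single incoming connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))       # any free loopback port
    listener.listen(16)

    ready_r, ready_w = os.pipe()          # worker -> parent: "registered"
    result_r, result_w = os.pipe()        # worker -> parent: b"1" woken, b"0" timed out
    pids = []
    for _ in range(num_workers):
        pid = os.fork()
        if pid == 0:
            try:
                ep = select.epoll()       # created AFTER fork: private to this worker
                ep.register(listener.fileno(), select.EPOLLIN)
                os.write(ready_w, b"r")
                woken = bool(ep.poll(timeout))
                os.write(result_w, b"1" if woken else b"0")
            finally:
                os._exit(0)
        pids.append(pid)

    os.close(ready_w)
    registered = 0
    while registered < num_workers:       # wait until every worker has registered
        chunk = os.read(ready_r, num_workers)
        if not chunk:                     # a worker died early
            break
        registered += len(chunk)

    client = socket.create_connection(listener.getsockname())  # ONE connection
    for pid in pids:
        os.waitpid(pid, 0)
    report = os.read(result_r, num_workers)
    client.close()
    listener.close()
    for fd in (ready_r, result_r, result_w):
        os.close(fd)
    return report.count(b"1")
```

Calling count_woken_workers(3) reports that all three workers saw the single connection, although only one of them could actually accept it.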

How nginx solves this issue:

  • option1: accept mutex lock (deprecated)
    All workers share one accept mutex. On each iteration of its epoll_wait() loop, a worker first tries to take the mutex. If it gets it, it adds the listening fd to its epoll instance with epoll_ctl; otherwise it removes the listening fd. So at any time only one worker has the listening fd registered.

  • option2: SO_REUSEPORT (best one, balanced by the kernel)
    SO_REUSEPORT (since kernel 3.9): with this option set, the kernel picks one listening socket, and therefore one process, by hash(src ip, src port, dst ip, dst port) and wakes only that process. The balancing is hash-based and done in the kernel, with no lock in user space. nginx always uses this if available (kernel support plus reuseport on the listen directive). Before fork, the master creates a separate fd (same ip+port) for each worker (see clone_listening). After fork, each worker adds its own fd to its own epoll instance, which is created after fork. So for each listing_op_t nginx creates as many sockets as there are workers, and each worker has its own epoll instance and its own fd.

  • option3: new epoll flag EPOLLEXCLUSIVE (since kernel 4.5) (better one, no balancing)
    EPOLLEXCLUSIVE avoids the thundering herd in certain scenarios: when the same file descriptor is registered in multiple epoll instances, only one process is woken. With it, the nginx master does not create a separate fd (VIP) per worker; all workers use the same one (same fd number). Each epoll instance is still created after fork, and every worker adds the same fd to its own instance. When a new connection arrives, only the first waiter on the wait queue is woken.
    nginx uses it if SO_REUSEPORT is not available.
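The accept mutex loop from option1 can be sketched in Python as follows (Linux only). This is a simplified illustration, not nginx's implementation: a non-blocking flock() on a temporary file stands in for the shared mutex, the short poll timeout stands in for the delay before a worker retries, and all function and parameter names are invented for the example.

```python
import fcntl
import os
import select
import socket
import tempfile


def mutex_worker(listener, lock_path, result_w, tick=0.1, max_idle_ticks=10):
    lock_fd = os.open(lock_path, os.O_RDWR)  # opened after fork so flock() excludes other workers
    ep = select.epoll()                      # private epoll instance
    registered = False
    accepted = idle = 0
    while idle < max_idle_ticks:
        try:
            fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # try-lock, never blocks
            held = True
        except BlockingIOError:
            held = False
        if held and not registered:
            ep.register(listener.fileno(), select.EPOLLIN)  # got the mutex: listen
            registered = True
        elif not held and registered:
            ep.unregister(listener.fileno())                # lost the mutex: stop listening
            registered = False
        events = ep.poll(tick)
        if held:
            if events:                       # only the mutex holder ever accepts
                conn, _ = listener.accept()
                conn.close()
                accepted += 1
            fcntl.flock(lock_fd, fcntl.LOCK_UN)
        idle = 0 if events else idle + 1
    os.write(result_w, b"%d\n" % accepted)


def run_accept_mutex_demo(num_workers=3, connections=6):
    """Return the number of connections accepted by each worker."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(64)
    lock_file = tempfile.NamedTemporaryFile()  # stand-in for the shared accept mutex
    result_r, result_w = os.pipe()
    pids = []
    for _ in range(num_workers):
        pid = os.fork()
        if pid == 0:
            try:
                mutex_worker(listener, lock_file.name, result_w)
            finally:
                os._exit(0)
        pids.append(pid)
    os.close(result_w)
    addr = listener.getsockname()
    clients = [socket.create_connection(addr) for _ in range(connections)]
    data = b""
    while True:
        chunk = os.read(result_r, 4096)      # EOF once every worker has exited
        if not chunk:
            break
        data += chunk
    for pid in pids:
        os.waitpid(pid, 0)
    for c in clients:
        c.close()
    os.close(result_r)
    listener.close()
    lock_file.close()
    return [int(n) for n in data.split()]
```

Every connection is accepted exactly once, but the split between workers is uneven: whichever worker keeps winning the lock does most of the work. The sketch makes no attempt to rebalance this.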

nginx chooses among these options in the following order

  • SO_REUSEPORT (needs both kernel support and the reuseport parameter in the config, e.g. listen 1.1.1.1:80 reuseport)
  • EPOLLEXCLUSIVE (needs kernel support)
  • accept mutex (needs to be enabled by the user: events {accept_mutex on;})
  • None