.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

.. _napi:

====
NAPI
====

NAPI is the event handling mechanism used by the Linux networking stack.
The name NAPI no longer stands for anything in particular [#]_.

In basic operation the device notifies the host about new events
via an interrupt.
The host then schedules a NAPI instance to process the events.
The device may also be polled for events via NAPI without receiving
interrupts first (:ref:`busy polling<poll>`).

NAPI processing usually happens in the software interrupt context,
but there is an option to use :ref:`separate kernel threads<threaded>`
for NAPI processing.

All in all, NAPI abstracts away from the drivers the context and configuration
of event (packet Rx and Tx) processing.

Driver API
==========

The two most important elements of NAPI are the struct napi_struct
and the associated poll method. struct napi_struct holds the state
of the NAPI instance while the method is the driver-specific event
handler. The method will typically free Tx packets that have been
transmitted and process newly received packets.
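
A rough sketch of how these two elements usually fit together is shown below.
The ``mydrv_*`` structure and function names are hypothetical; only
struct napi_struct and the poll method signature come from the NAPI API:

.. code-block:: c

  /* Hypothetical per-vector driver state - the NAPI instance is
   * typically embedded next to the queues it services.
   */
  struct mydrv_vector {
      struct napi_struct napi;
      struct mydrv_rx_ring *rxr;
      struct mydrv_tx_ring *txr;
      int idx;
  };

  /* The poll method - the driver-specific event handler. */
  static int mydrv_poll(struct napi_struct *napi, int budget);
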

.. _drv_ctrl:

Control API
-----------

netif_napi_add() and netif_napi_del() add/remove a NAPI instance
from the system. The instances are attached to the netdevice passed
as argument (and will be deleted automatically when the netdevice is
unregistered). Instances are added in a disabled state.

napi_enable() and napi_disable() manage the disabled state.
A disabled NAPI can't be scheduled and its poll method is guaranteed
to not be invoked. napi_disable() waits for ownership of the NAPI
instance to be released.

The control APIs are not idempotent. Control API calls are safe against
concurrent use of datapath APIs but an incorrect sequence of control API
calls may result in crashes, deadlocks, or race conditions. For example,
calling napi_disable() multiple times in a row will deadlock.
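
The expected lifecycle, sketched with the hypothetical names from the example
above (``netdev`` being the driver's struct net_device; note that the exact
netif_napi_add() signature has varied between kernel versions - older kernels
also take a weight argument):

.. code-block:: c

  /* Typically in the driver's probe path; the instance starts disabled. */
  netif_napi_add(netdev, &v->napi, mydrv_poll);

  /* Typically in the driver's open path. */
  napi_enable(&v->napi);

  /* ... interrupts arrive, NAPI gets scheduled, mydrv_poll() runs ... */

  /* Typically in the driver's close and remove paths. */
  napi_disable(&v->napi);   /* waits for ownership to be released */
  netif_napi_del(&v->napi);
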

Datapath API
------------

napi_schedule() is the basic method of scheduling a NAPI poll.
Drivers should call this function in their interrupt handler
(see :ref:`drv_sched` for more info). A successful call to napi_schedule()
will take ownership of the NAPI instance.
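
For example (a sketch with hypothetical ``mydrv_*`` names; see
:ref:`drv_sched` for how scheduling interacts with interrupt masking):

.. code-block:: c

  static irqreturn_t mydrv_irq_handler(int irq, void *data)
  {
      struct mydrv_vector *v = data;

      /* Hand the event off to NAPI; the poll method does the work. */
      napi_schedule(&v->napi);
      return IRQ_HANDLED;
  }
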

Later, after NAPI is scheduled, the driver's poll method will be
called to process the events/packets. The method takes a ``budget``
argument - drivers can process completions for any number of Tx
packets but should only process up to ``budget`` number of
Rx packets. Rx processing is usually much more expensive.

In other words, for Rx processing the ``budget`` argument limits how many
packets the driver can process in a single poll. Rx specific APIs like page
pool or XDP cannot be used at all when ``budget`` is 0.
skb Tx processing should happen regardless of the ``budget``, but if
the argument is 0 the driver cannot call any XDP (or page pool) APIs.

.. warning::

   The ``budget`` argument may be 0 if the core tries to only process
   skb Tx completions and no Rx or XDP packets.

The poll method returns the amount of work done. If the driver still
has outstanding work to do (e.g. ``budget`` was exhausted)
the poll method should return exactly ``budget``. In that case,
the NAPI instance will be serviced/polled again (without the
need to be scheduled).

If event processing has been completed (all outstanding packets
processed) the poll method should call napi_complete_done()
before returning. napi_complete_done() releases the ownership
of the instance.

.. warning::

   The case of finishing all events and using exactly ``budget``
   must be handled carefully. There is no way to report this
   (rare) condition to the stack, so the driver must either
   not call napi_complete_done() and wait to be called again,
   or return ``budget - 1``.

   If the ``budget`` is 0, napi_complete_done() should never be called.
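
Putting the rules above together, a poll method can be structured roughly
as follows. This is only a sketch - the ``mydrv_*`` helpers are hypothetical
and IRQ unmasking is discussed in :ref:`drv_sched`:

.. code-block:: c

  static int mydrv_poll(struct napi_struct *napi, int budget)
  {
      struct mydrv_vector *v = container_of(napi, struct mydrv_vector, napi);
      int work_done;

      /* skb Tx completions are processed regardless of the budget. */
      mydrv_clean_tx_ring(v);

      /* budget == 0 means no Rx, XDP or page pool processing and
       * napi_complete_done() must not be called.
       */
      if (!budget)
          return 0;

      /* Process at most budget Rx packets. */
      work_done = mydrv_clean_rx_ring(v, budget);

      /* All events handled - release ownership; otherwise return
       * exactly budget so the instance is polled again.
       */
      if (work_done < budget && napi_complete_done(napi, work_done))
          mydrv_unmask_rxtx_irq(v->idx);

      return work_done;
  }
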

Call sequence
-------------

Drivers should not make assumptions about the exact sequencing
of calls. The poll method may be called without the driver scheduling
the instance (unless the instance is disabled). Similarly,
it's not guaranteed that the poll method will be called, even
if napi_schedule() succeeded (e.g. if the instance gets disabled).

As mentioned in the :ref:`drv_ctrl` section - napi_disable() and subsequent
calls to the poll method only wait for the ownership of the instance
to be released, not for the poll method to exit. This means that
drivers should avoid accessing any data structures after calling
napi_complete_done().

.. _drv_sched:

Scheduling and IRQ masking
--------------------------

Drivers should keep the interrupts masked after scheduling
the NAPI instance - until NAPI polling finishes, any further
interrupts are unnecessary.

Drivers which have to mask the interrupts explicitly (as opposed
to IRQ being auto-masked by the device) should use the napi_schedule_prep()
and __napi_schedule() calls:

.. code-block:: c

  if (napi_schedule_prep(&v->napi)) {
      mydrv_mask_rxtx_irq(v->idx);
      /* schedule after masking to avoid races */
      __napi_schedule(&v->napi);
  }

IRQ should only be unmasked after a successful call to napi_complete_done():

.. code-block:: c

  if (budget && napi_complete_done(&v->napi, work_done)) {
      mydrv_unmask_rxtx_irq(v->idx);
      return min(work_done, budget - 1);
  }

napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage
of guarantees given by being invoked in IRQ context (no need to
mask interrupts). napi_schedule_irqoff() will fall back to napi_schedule() if
IRQs are threaded (such as if ``PREEMPT_RT`` is enabled).

Instance to queue mapping
-------------------------

Modern devices have multiple NAPI instances (struct napi_struct) per
interface. There is no strong requirement on how the instances are
mapped to queues and interrupts. NAPI is primarily a polling/processing
abstraction without specific user-facing semantics. That said, most networking
devices end up using NAPI in fairly similar ways.

NAPI instances most often correspond 1:1:1 to interrupts and queue pairs
(a queue pair is a set of a single Rx and a single Tx queue).

In less common cases a NAPI instance may be used for multiple queues
or Rx and Tx queues can be serviced by separate NAPI instances on a single
core. Regardless of the queue assignment, however, there is usually still
a 1:1 mapping between NAPI instances and interrupts.

It's worth noting that the ethtool API uses "channel" terminology where
each channel can be either ``rx``, ``tx`` or ``combined``. It's not clear
what constitutes a channel; the recommended interpretation is to understand
a channel as an IRQ/NAPI which services queues of a given type. For example,
a configuration of 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel is expected
to utilize 3 interrupts, 2 Rx and 2 Tx queues.

User API
========

User interactions with NAPI depend on the NAPI instance ID. The instance IDs
are only visible to the user through the ``SO_INCOMING_NAPI_ID`` socket option.
It's not currently possible to query the IDs used by a given device.
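
For example, the ID of the NAPI instance which delivered packets to a socket
can be read back roughly as follows (a sketch assuming the usual socket
headers and a connected or receiving socket ``fd``):

.. code-block:: c

  unsigned int napi_id = 0;
  socklen_t len = sizeof(napi_id);

  /* An ID of 0 means no NAPI ID has been recorded for this socket. */
  if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len) == 0)
      printf("NAPI ID: %u\n", napi_id);
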

Software IRQ coalescing
-----------------------

NAPI does not perform any explicit event coalescing by default.
In most scenarios batching happens due to IRQ coalescing which is done
by the device. There are cases where software coalescing is helpful.

NAPI can be configured to arm a repoll timer instead of unmasking
the hardware interrupts as soon as all packets are processed.
The ``gro_flush_timeout`` sysfs configuration of the netdevice
is reused to control the delay of the timer, while
``napi_defer_hard_irqs`` controls the number of consecutive empty polls
before NAPI gives up and goes back to using hardware IRQs.

.. _poll:

Busy polling
------------

Busy polling allows a user process to check for incoming packets before
the device interrupt fires. As is the case with any busy polling it trades
off CPU cycles for lower latency (production uses of NAPI busy polling
are not well known).

Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
selected sockets or using the global ``net.core.busy_poll`` and
``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
also exists.
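
A minimal sketch of the per-socket variant, assuming an existing socket
``fd`` (the timeout value is illustrative and given in microseconds):

.. code-block:: c

  unsigned int busy_poll_usecs = 50;   /* illustrative value */

  if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                 &busy_poll_usecs, sizeof(busy_poll_usecs)))
      perror("SO_BUSY_POLL");
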

IRQ mitigation
---------------

While busy polling is supposed to be used by low latency applications,
a similar mechanism can be used for IRQ mitigation.

Very high request-per-second applications (especially routing/forwarding
applications and especially applications using AF_XDP sockets) may not
want to be interrupted until they finish processing a request or a batch
of packets.

Such applications can pledge to the kernel that they will perform a busy
polling operation periodically, and the driver should keep the device IRQs
permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
socket option. To avoid system misbehavior the pledge is revoked
if ``gro_flush_timeout`` passes without any busy poll call.

The NAPI budget for busy polling is lower than the default (which makes
sense given the low latency intention of normal busy polling). This is
not the case with IRQ mitigation, however, so the budget can be adjusted
with the ``SO_BUSY_POLL_BUDGET`` socket option.
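
A sketch of an application opting into this mode, assuming an existing socket
``fd`` and sufficiently recent kernel and libc headers (values are
illustrative):

.. code-block:: c

  int prefer = 1;    /* pledge to busy poll; keep device IRQs masked */
  int budget = 64;   /* raise the busy poll budget; illustrative value */

  setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &prefer, sizeof(prefer));
  setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET, &budget, sizeof(budget));
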

.. _threaded:

Threaded NAPI
-------------

Threaded NAPI is an operating mode that uses dedicated kernel
threads rather than software IRQ context for NAPI processing.
The configuration is per netdevice and will affect all
NAPI instances of that device. Each NAPI instance will spawn a separate
thread (called ``napi/${ifc-name}-${napi-id}``).

It is recommended to pin each kernel thread to a single CPU, the same
CPU as the one which services the interrupt. Note that the mapping
between IRQs and NAPI instances may not be trivial (and is driver
dependent). The NAPI instance IDs will be assigned in the opposite
order than the process IDs of the kernel threads.

Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
netdev's sysfs directory.

.. rubric:: Footnotes

.. [#] NAPI was originally referred to as New API in 2.4 Linux.