| Netdev features mess and how to get out from it alive |
| ===================================================== |
| |
| Author: |
| Michał Mirosław <mirq-linux@rere.qmqm.pl> |
| |
| |
| |
| Part I: Feature sets |
| ====================== |
| |
| Long gone are the days when a network card would just take and give packets |
| verbatim. Today's devices add multiple features and bugs (read: offloads) |
| that relieve an OS of various tasks like generating and checking checksums, |
| splitting packets, classifying them. Those capabilities and their state |
| are commonly referred to as netdev features in Linux kernel world. |
| |
| There are currently three sets of features relevant to the driver, and |
| one used internally by network core: |
| |
| 1. netdev->hw_features set contains features whose state may possibly |
| be changed (enabled or disabled) for a particular device by user's |
| request. This set should be initialized in ndo_init callback and not |
| changed later. |
| |
| 2. netdev->features set contains features which are currently enabled |
| for a device. This should be changed only by network core or in |
| error paths of ndo_set_features callback. |
| |
| 3. netdev->vlan_features set contains features whose state is inherited |
| by child VLAN devices (limits netdev->features set). This is currently |
| used for all VLAN devices whether tags are stripped or inserted in |
| hardware or software. |
| |
| 4. netdev->wanted_features set contains feature set requested by user. |
| This set is filtered by ndo_fix_features callback whenever it or |
| some device-specific conditions change. This set is internal to |
| networking core and should not be referenced in drivers. |
| |
| |
| |
| Part II: Controlling enabled features |
| ======================================= |
| |
| When current feature set (netdev->features) is to be changed, new set |
| is calculated and filtered by calling ndo_fix_features callback |
| and netdev_fix_features(). If the resulting set differs from current |
| set, it is passed to ndo_set_features callback and (if the callback |
| returns success) replaces value stored in netdev->features. |
| NETDEV_FEAT_CHANGE notification is issued after that whenever current |
| set might have changed. |
| |
| The following events trigger recalculation: |
| 1. device's registration, after ndo_init returned success |
| 2. user requested changes in features state |
| 3. netdev_update_features() is called |
| |
| ndo_*_features callbacks are called with rtnl_lock held. Missing callbacks |
| are treated as always returning success. |
| |
| A driver that wants to trigger recalculation must do so by calling |
| netdev_update_features() while holding rtnl_lock. This should not be done |
| from ndo_*_features callbacks. netdev->features should not be modified by |
| driver except by means of ndo_fix_features callback. |
| |
| |
| |
| Part III: Implementation hints |
| ================================ |
| |
| * ndo_fix_features: |
| |
| All dependencies between features should be resolved here. The resulting |
| set can be reduced further by networking core imposed limitations (as coded |
| in netdev_fix_features()). For this reason it is safer to disable a feature |
| when its dependencies are not met instead of forcing the dependency on. |
| |
| This callback should not modify hardware nor driver state (should be |
| stateless). It can be called multiple times between successive |
| ndo_set_features calls. |
| |
| Callback must not alter features contained in NETIF_F_SOFT_FEATURES or |
| NETIF_F_NEVER_CHANGE sets. The exception is NETIF_F_VLAN_CHALLENGED but |
| care must be taken as the change won't affect already configured VLANs. |
| |
| * ndo_set_features: |
| |
| Hardware should be reconfigured to match passed feature set. The set |
| should not be altered unless some error condition happens that can't |
| be reliably detected in ndo_fix_features. In this case, the callback |
| should update netdev->features to match resulting hardware state. |
| Errors returned are not (and cannot be) propagated anywhere except dmesg. |
| (Note: successful return is zero, >0 means silent error.) |
| |
| |
| |
| Part IV: Features |
| =================== |
| |
| For current list of features, see include/linux/netdev_features.h. |
| This section describes semantics of some of them. |
| |
| * Transmit checksumming |
| |
| For complete description, see comments near the top of include/linux/skbuff.h. |
| |
| Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM. |
| It means that device can fill TCP/UDP-like checksum anywhere in the packets |
| whatever headers there might be. |
| |
| * Transmit TCP segmentation offload |
| |
| NETIF_F_TSO_ECN means that hardware can properly split packets with CWR bit |
| set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6). |
| |
| * Transmit DMA from high memory |
| |
| On platforms where this is relevant, NETIF_F_HIGHDMA signals that |
| ndo_start_xmit can handle skbs with frags in high memory. |
| |
| * Transmit scatter-gather |
| |
| Those features say that ndo_start_xmit can handle fragmented skbs: |
| NETIF_F_SG --- paged skbs (skb_shinfo()->frags), NETIF_F_FRAGLIST --- |
| chained skbs (skb->next/prev list). |
| |
| * Software features |
| |
| Features contained in NETIF_F_SOFT_FEATURES are features of networking |
| stack. Driver should not change behaviour based on them. |
| |
| * LLTX driver (deprecated for hardware drivers) |
| |
| NETIF_F_LLTX is meant to be used by drivers that don't need locking at all, |
| e.g. software tunnels. |
| |
| This is also used in a few legacy drivers that implement their |
| own locking, don't use it for new (hardware) drivers. |
| |
| * netns-local device |
| |
| NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between |
| network namespaces (e.g. loopback). |
| |
| Don't use it in drivers. |
| |
| * VLAN challenged |
| |
| NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN |
| headers. Some drivers set this because the cards can't handle the bigger MTU. |
| [FIXME: Those cases could be fixed in VLAN code by allowing only reduced-MTU |
| VLANs. This may be not useful, though.] |
| |
| * rx-fcs |
| |
| This requests that the NIC append the Ethernet Frame Checksum (FCS) |
| to the end of the skb data. This allows sniffers and other tools to |
| read the CRC recorded by the NIC on receipt of the packet. |
| |
| * rx-all |
| |
| This requests that the NIC receive all possible frames, including errored |
| frames (such as bad FCS, etc). This can be helpful when sniffing a link with |
| bad packets on it. Some NICs may receive more packets if also put into normal |
| PROMISC mode. |
| |
| * rx-gro-hw |
| |
| This requests that the NIC enables Hardware GRO (generic receive offload). |
| Hardware GRO is basically the exact reverse of TSO, and is generally |
| stricter than Hardware LRO. A packet stream merged by Hardware GRO must |
| be re-segmentable by GSO or TSO back to the exact original packet stream. |
| Hardware GRO is dependent on RXCSUM since every packet successfully merged |
| by hardware must also have the checksum verified by hardware. |