.. SPDX-License-Identifier: GPL-2.0

=====================================
Network Devices, the Kernel, and You!
=====================================


Introduction
============
The following is a random collection of documentation regarding
network devices.

struct net_device lifetime rules
================================
Network device structures need to persist even after the module is unloaded
and must be allocated with alloc_netdev_mqs() and friends.
If the device has registered successfully, it will be freed on last use
by free_netdev(). This is required to handle the pathological case cleanly
(example: ``rmmod mydriver </sys/class/net/myeth/mtu``).

alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
private data which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(netdev_priv()) then it is up to the module exit handler to free that.
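
The ownership rule above can be pictured with a rough userspace model (this
is not kernel code). The private area shares one allocation with the device
structure, so freeing the device frees it too, while anything the driver
attaches to that area by pointer needs its own free. All ``sketch_*`` names,
the 32-byte alignment and the ``fw_name`` field are assumptions made for
this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_ALIGN 32u	/* alignment picked for this sketch */

/* Stand-in for struct net_device; the real structure is far larger. */
struct sketch_netdev {
	char name[16];
};

/* Hypothetical driver state kept in the private area. */
struct sketch_priv {
	int link_up;
	char *fw_name;	/* separately allocated, NOT freed with the device */
};

/* Offset of the private area: struct size rounded up to SKETCH_ALIGN. */
static size_t sketch_priv_offset(void)
{
	return (sizeof(struct sketch_netdev) + SKETCH_ALIGN - 1) &
	       ~(size_t)(SKETCH_ALIGN - 1);
}

/* One zeroed allocation holds both the device and its private data. */
static struct sketch_netdev *sketch_alloc_netdev(size_t sizeof_priv)
{
	return calloc(1, sketch_priv_offset() + sizeof_priv);
}

/* Models netdev_priv(): the private area starts after the device struct. */
static void *sketch_netdev_priv(struct sketch_netdev *dev)
{
	assert(dev != NULL);
	return (char *)dev + sketch_priv_offset();
}

/* Frees device and private area together; attached pointers are untouched. */
static void sketch_free_netdev(struct sketch_netdev *dev)
{
	free(dev);
}
```

A driver following this model would ``free(priv->fw_name)`` in its exit path
before calling ``sketch_free_netdev(dev)``.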

There are two groups of APIs for registering struct net_device.
The first group can be used in normal contexts where ``rtnl_lock`` is not
already held: register_netdev(), unregister_netdev().
The second group can be used when ``rtnl_lock`` is already held:
register_netdevice(), unregister_netdevice(), free_netdev().

Simple drivers
--------------

Most drivers (especially device drivers) handle the lifetime of struct
net_device in a context where ``rtnl_lock`` is not held (e.g. driver probe
and remove paths).

In that case the struct net_device registration is done using
the register_netdev() and unregister_netdev() functions:

.. code-block:: c

  int probe()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    dev = alloc_netdev_mqs(...);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* ... do all device setup before calling register_netdev() ...
     */

    err = register_netdev(dev);
    if (err)
      goto err_undo;

    /* net_device is visible to the user! */

    return 0;

  err_undo:
    /* ... undo the device setup ... */
    free_netdev(dev);
    return err;
  }

  void remove()
  {
    unregister_netdev(dev);
    free_netdev(dev);
  }

Note that after calling register_netdev() the device is visible in the system.
Users can open it and start sending / receiving traffic immediately,
or run any other callback, so all initialization must be done prior to
registration.

unregister_netdev() closes the device and waits for all users to be done
with it. The memory of struct net_device itself may still be referenced
by sysfs but all operations on that device will fail.

free_netdev() can be called after unregister_netdev() returns or when
register_netdev() failed.

Device management under RTNL
----------------------------

Registering struct net_device while in a context which already holds
the ``rtnl_lock`` requires extra care. In those scenarios most drivers
will want to make use of struct net_device's ``needs_free_netdev``
and ``priv_destructor`` members for freeing of state.

Example flow of netdev handling under ``rtnl_lock``:

.. code-block:: c

  static void my_setup(struct net_device *dev)
  {
    dev->needs_free_netdev = true;
  }

  static void my_destructor(struct net_device *dev)
  {
    struct my_device_priv *priv = netdev_priv(dev);

    some_obj_destroy(priv->obj);
    some_uninit(priv);
  }

  int create_link()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    ASSERT_RTNL();

    dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* Implicit constructor */
    err = some_init(priv);
    if (err)
      goto err_free_dev;

    priv->obj = some_obj_create();
    if (!priv->obj) {
      err = -ENOMEM;
      goto err_some_uninit;
    }
    /* End of constructor, set the destructor: */
    dev->priv_destructor = my_destructor;

    err = register_netdevice(dev);
    if (err)
      /* register_netdevice() calls destructor on failure */
      goto err_free_dev;

    /* If anything fails now unregister_netdevice() (or unregister_netdev())
     * will take care of calling my_destructor and free_netdev().
     */

    return 0;

  err_some_uninit:
    some_uninit(priv);
  err_free_dev:
    free_netdev(dev);
    return err;
  }

If struct net_device.priv_destructor is set it will be called by the core
some time after unregister_netdevice(); it will also be called if
register_netdevice() fails. The callback may be invoked with or without
``rtnl_lock`` held.

There is no explicit constructor callback; the driver "constructs" the private
netdev state after allocating it and before registration.

Setting struct net_device.needs_free_netdev makes the core call free_netdev()
automatically after unregister_netdevice() when all references to the device
are gone. It only takes effect after a successful call to register_netdevice()
so if register_netdevice() fails the driver is responsible for calling
free_netdev().

free_netdev() is safe to call on error paths right after unregister_netdevice()
or when register_netdevice() fails. Parts of the netdev (de)registration process
happen after ``rtnl_lock`` is released, therefore in those cases free_netdev()
will defer some of the processing until ``rtnl_lock`` is released.

Devices spawned from struct rtnl_link_ops should never free the
struct net_device directly.

.ndo_init and .ndo_uninit
~~~~~~~~~~~~~~~~~~~~~~~~~

``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
registration and de-registration, under ``rtnl_lock``. Drivers can use
those e.g. when parts of their init process need to run under ``rtnl_lock``.

``.ndo_init`` runs before device is visible in the system, ``.ndo_uninit``
runs during de-registering after device is closed but other subsystems
may still have outstanding references to the netdevice.

MTU
===
Each network device has a Maximum Transmission Unit (MTU). The MTU does not
include any link layer protocol overhead. Upper layer protocols must
not pass a socket buffer (skb) to a device to transmit with more data
than the MTU. For example, on Ethernet with the standard MTU of 1500
bytes, the actual skb will contain up to 1514 bytes because of the
Ethernet header. Devices should allow for the 4 byte VLAN header as well.

Segmentation Offload (GSO, TSO) is an exception to this rule. The
upper layer protocol may pass a large socket buffer to the device
transmit routine, and the device will break that up into separate
packets based on the current MTU.

MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as a mechanism to size receive
buffers, but the device should allow packets with a VLAN header. With
the standard Ethernet MTU of 1500 bytes, the device should allow up to
1518 byte packets (1500 + 14 header + 4 tag). The device may drop,
truncate, or pass up oversize packets, but dropping oversize packets
is preferred.
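
The arithmetic above can be written down as a small helper. This is a
userspace sketch only: the ``SKETCH_*`` constants and function names are
made up for the example and simply repeat the 14 byte Ethernet header and
4 byte VLAN tag sizes quoted in the text.

```c
#include <assert.h>

#define SKETCH_ETH_HLEN  14u	/* dst MAC + src MAC + EtherType */
#define SKETCH_VLAN_HLEN  4u	/* one VLAN tag */

/* The two overheads used in the text add up to 18 bytes. */
static_assert(SKETCH_ETH_HLEN + SKETCH_VLAN_HLEN == 18u,
	      "header + tag overhead");

/* Largest untagged skb length the stack may hand to the transmit routine. */
static unsigned int sketch_max_tx_len(unsigned int mtu)
{
	return mtu + SKETCH_ETH_HLEN;
}

/* Largest packet the device should still accept on receive for this MTU. */
static unsigned int sketch_max_rx_len(unsigned int mtu)
{
	return mtu + SKETCH_ETH_HLEN + SKETCH_VLAN_HLEN;
}
```

For an MTU of 1500 these helpers give 1514 and 1518 bytes, matching the
figures above. A driver sizing receive buffers from the MTU would use the
larger value.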


struct net_device synchronization rules
=======================================
ndo_open:
	Synchronization: rtnl_lock() semaphore.
	Context: process

ndo_stop:
	Synchronization: rtnl_lock() semaphore.
	Context: process
	Note: netif_running() is guaranteed false

ndo_do_ioctl:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	This is only called by network subsystems internally,
	not by user space calling ioctl as it was before
	linux-5.14.

ndo_siocbond:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	Used by the bonding driver for the SIOCBOND family of
	ioctl commands.

ndo_siocwandev:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	Used by the drivers/net/wan framework to handle
	the SIOCWANDEV ioctl with the if_settings structure.

ndo_siocdevprivate:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	This is used to implement SIOCDEVPRIVATE ioctl helpers.
	These should not be added to new drivers, so don't use them.

ndo_eth_ioctl:
	Synchronization: rtnl_lock() semaphore.
	Context: process

ndo_get_stats:
	Synchronization: rtnl_lock() semaphore, dev_base_lock rwlock, or RCU.
	Context: atomic (can't sleep under rwlock or RCU)

ndo_start_xmit:
	Synchronization: __netif_tx_lock spinlock.

	When the driver sets NETIF_F_LLTX in dev->features this will be
	called without holding netif_tx_lock. In this case the driver
	has to lock by itself when needed.
	The locking there should also properly protect against
	set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
	Don't use it for new drivers.

	Context: Process with BHs disabled or BH (timer),
		 will be called with interrupts disabled by netconsole.

	Return codes:

	* NETDEV_TX_OK everything ok.
	* NETDEV_TX_BUSY Cannot transmit packet, try later.
	  Usually a bug, means queue start/stop flow control is broken in
	  the driver. Note: the driver must NOT put the skb in its DMA ring.

ndo_tx_timeout:
	Synchronization: netif_tx_lock spinlock; all TX queues frozen.
	Context: BHs disabled
	Notes: netif_queue_stopped() is guaranteed true

ndo_set_rx_mode:
	Synchronization: netif_addr_lock spinlock.
	Context: BHs disabled
286
Stephen Hemmingerbea33482007-10-03 16:41:36 -0700287struct napi_struct synchronization rules
288========================================
289napi->poll:
Mauro Carvalho Chehab482a4362020-04-30 18:04:04 +0200290 Synchronization:
291 NAPI_STATE_SCHED bit in napi->state. Device
Ben Hutchingsb3cf6542012-04-05 14:39:47 +0000292 driver's ndo_stop method will invoke napi_disable() on
Stephen Hemmingerbea33482007-10-03 16:41:36 -0700293 all NAPI instances which will do a sleeping poll on the
294 NAPI_STATE_SCHED napi->state bit, waiting for all pending
295 NAPI activity to cease.
Mauro Carvalho Chehab482a4362020-04-30 18:04:04 +0200296
297 Context:
298 softirq
299 will be called with interrupts disabled by netconsole.
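
How a single state bit can serialize polling and disabling may be easier to
see in a simplified userspace model built on C11 atomics. This is only a
sketch of the idea, not the kernel implementation: the ``sketch_*`` names
are invented, and the busy-wait stands in for the sleeping poll that the
text describes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_napi {
	atomic_flag sched;	/* models the NAPI_STATE_SCHED bit */
};

static void sketch_napi_init(struct sketch_napi *n)
{
	atomic_flag_clear(&n->sched);
}

/* Only the caller that flips the bit from clear to set may run the poll. */
static bool sketch_napi_schedule(struct sketch_napi *n)
{
	return !atomic_flag_test_and_set(&n->sched);
}

/* Poll finished: release the bit so the instance can be scheduled again. */
static void sketch_napi_poll_done(struct sketch_napi *n)
{
	atomic_flag_clear(&n->sched);
}

/*
 * Wait until no poll owns the bit, then keep it set so nothing can be
 * scheduled afterwards. Real code sleeps between attempts instead of
 * spinning.
 */
static void sketch_napi_disable(struct sketch_napi *n)
{
	while (atomic_flag_test_and_set(&n->sched))
		;
}

/* Release the bit taken by sketch_napi_disable(). */
static void sketch_napi_enable(struct sketch_napi *n)
{
	atomic_flag_clear(&n->sched);
}
```

In this model a second schedule attempt fails while a poll is pending, and
every schedule attempt fails between ``sketch_napi_disable()`` and
``sketch_napi_enable()``.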