Linux Socket Filtering aka Berkeley Packet Filter (BPF)
=======================================================

Introduction
------------

Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
Though there are some distinct differences between the BSD and Linux
kernel filtering, when we speak of BPF or LSF in the Linux context, we
mean the very same mechanism of filtering in the Linux kernel.

BPF allows a user-space program to attach a filter onto any socket and
allow or disallow certain types of data to come through the socket. LSF
follows exactly the same filter code structure as BSD's BPF, so referring
to the BSD bpf.4 manpage is very helpful in creating filters.

On Linux, BPF is much simpler than on BSD. One does not have to worry
about devices or anything like that. You simply create your filter code,
send it to the kernel via the SO_ATTACH_FILTER option, and if your filter
code passes the kernel's checks, you immediately begin filtering data on
that socket.

You can also detach filters from your socket via the SO_DETACH_FILTER
option. This will probably not be used much, since when you close a socket
that has a filter on it, the filter is automagically removed. The other,
less common case is attaching a different filter to a socket that already
has one: the kernel takes care of removing the old filter and placing the
new one in its place, assuming the new filter passed the checks; otherwise
the old filter remains on that socket.

The SO_LOCK_FILTER option allows locking the filter attached to a socket.
Once set, the filter cannot be removed or changed. This allows a process
to set up a socket, attach a filter, lock it, and then drop privileges
while being assured that the filter will be kept until the socket is
closed.

The biggest user of this construct might be libpcap. Issuing a high-level
filter command like `tcpdump -i em1 port 22` passes through the libpcap
internal compiler that generates a structure that can eventually be loaded
via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
displays what is being placed into this structure.

Although we were only speaking about sockets here, BPF in Linux is used
in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places
such as the team driver, the PTP code, etc. where BPF is being used.

 [1] Documentation/userspace-api/seccomp_filter.rst

Original BPF paper:

Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
architecture for user-level packet capture. In Proceedings of the
USENIX Winter 1993 Conference (USENIX'93). USENIX Association,
Berkeley, CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]

Structure
---------

User space applications include <linux/filter.h> which contains the
following relevant structures:

struct sock_filter {	/* Filter block */
	__u16	code;	/* Actual filter code */
	__u8	jt;	/* Jump true */
	__u8	jf;	/* Jump false */
	__u32	k;	/* Generic multiuse field */
};

Such a structure is assembled as an array of 4-tuples, each containing
a code, jt, jf and k value. jt and jf are jump offsets, and k is a
generic value to be used for a provided code.

struct sock_fprog {			/* Required for SO_ATTACH_FILTER. */
	unsigned short		   len;	/* Number of filter blocks */
	struct sock_filter __user *filter;
};

For socket filtering, a pointer to this structure (as shown in the
follow-up example) is passed to the kernel through setsockopt(2).
| 82 | |
Example
-------

#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/filter.h>
/* ... */

/* From the example above: tcpdump -i em1 port 22 -dd */
struct sock_filter code[] = {
	{ 0x28,  0,  0, 0x0000000c },
	{ 0x15,  0,  8, 0x000086dd },
	{ 0x30,  0,  0, 0x00000014 },
	{ 0x15,  2,  0, 0x00000084 },
	{ 0x15,  1,  0, 0x00000006 },
	{ 0x15,  0, 17, 0x00000011 },
	{ 0x28,  0,  0, 0x00000036 },
	{ 0x15, 14,  0, 0x00000016 },
	{ 0x28,  0,  0, 0x00000038 },
	{ 0x15, 12, 13, 0x00000016 },
	{ 0x15,  0, 12, 0x00000800 },
	{ 0x30,  0,  0, 0x00000017 },
	{ 0x15,  2,  0, 0x00000084 },
	{ 0x15,  1,  0, 0x00000006 },
	{ 0x15,  0,  8, 0x00000011 },
	{ 0x28,  0,  0, 0x00000014 },
	{ 0x45,  6,  0, 0x00001fff },
	{ 0xb1,  0,  0, 0x0000000e },
	{ 0x48,  0,  0, 0x0000000e },
	{ 0x15,  2,  0, 0x00000016 },
	{ 0x48,  0,  0, 0x00000010 },
	{ 0x15,  0,  1, 0x00000016 },
	{ 0x06,  0,  0, 0x0000ffff },
	{ 0x06,  0,  0, 0x00000000 },
};

#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

struct sock_fprog bpf = {
	.len = ARRAY_SIZE(code),
	.filter = code,
};

sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (sock < 0)
	/* ... bail out ... */

ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
if (ret < 0)
	/* ... bail out ... */

/* ... */
close(sock);

The above example code attaches a socket filter for a PF_PACKET socket
in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
be dropped for this socket.

The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments,
and SO_LOCK_FILTER, which prevents the filter from being detached, takes
an integer value of 0 or 1.
| 143 | |
Note that socket filters are not restricted to PF_PACKET sockets; they
can also be used on other socket families.

Summary of system calls:

 * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER,   &val, sizeof(val));

Normally, most use cases for socket filtering on packet sockets will be
covered by libpcap's high-level syntax, so as an application developer
you should stick to that. libpcap wraps its own layer around all of this.

Writing such a filter "by hand" can be an alternative when i) using or
linking to libpcap is not an option, ii) the required BPF filters use
Linux extensions that are not supported by libpcap's compiler, iii) a
filter might be more complex and not cleanly implementable with libpcap's
compiler, or iv) particular filter code should be optimized differently
than libpcap's internal compiler does. For example, xt_bpf and cls_bpf
users might have requirements that could result in more complex filter
code, or one that cannot be expressed with libpcap (e.g. different return
codes for various code paths). Moreover, BPF JIT implementors may wish to
manually write test cases and thus need low-level access to BPF code as
well.

BPF engine and instruction set
------------------------------

Under tools/bpf/ there's a small helper tool called bpf_asm which can
be used to write low-level filters for the example scenarios mentioned in
the previous section. The asm-like syntax mentioned here has been
implemented in bpf_asm and will be used for further explanations (instead
of dealing with less readable opcodes directly; the principles are the
same). The syntax is closely modelled after Steven McCanne's and Van
Jacobson's BPF paper.

The BPF architecture consists of the following basic elements:

  Element          Description

  A                32 bit wide accumulator
  X                32 bit wide X register
  M[]              16 x 32 bit wide misc registers aka "scratch memory
                   store", addressable from 0 to 15

A program, that is translated by bpf_asm into "opcodes", is an array that
consists of the following elements (as already mentioned):

  op:16, jt:8, jf:8, k:32

The element op is a 16 bit wide opcode that has a particular instruction
encoded. jt and jf are two 8 bit wide jump targets, one for the condition
"jump if true", the other one for "jump if false". Finally, element k
contains a miscellaneous argument that can be interpreted in different
ways depending on the given instruction in op.
| 198 | |
The instruction set consists of load, store, branch, alu, miscellaneous
and return instructions that are also represented in bpf_asm syntax. The
following table lists all available bpf_asm instructions and what their
underlying opcodes, as defined in linux/filter.h, stand for:

  Instruction      Addressing mode      Description

  ld               1, 2, 3, 4, 12       Load word into A
  ldi              4                    Load word into A
  ldh              1, 2                 Load half-word into A
  ldb              1, 2                 Load byte into A
  ldx              3, 4, 5, 12          Load word into X
  ldxi             4                    Load word into X
  ldxb             5                    Load byte into X

  st               3                    Store A into M[]
  stx              3                    Store X into M[]

  jmp              6                    Jump to label
  ja               6                    Jump to label
  jeq              7, 8, 9, 10          Jump on A == <x>
  jneq             9, 10                Jump on A != <x>
  jne              9, 10                Jump on A != <x>
  jlt              9, 10                Jump on A < <x>
  jle              9, 10                Jump on A <= <x>
  jgt              7, 8, 9, 10          Jump on A > <x>
  jge              7, 8, 9, 10          Jump on A >= <x>
  jset             7, 8, 9, 10          Jump on A & <x>

  add              0, 4                 A + <x>
  sub              0, 4                 A - <x>
  mul              0, 4                 A * <x>
  div              0, 4                 A / <x>
  mod              0, 4                 A % <x>
  neg                                   !A
  and              0, 4                 A & <x>
  or               0, 4                 A | <x>
  xor              0, 4                 A ^ <x>
  lsh              0, 4                 A << <x>
  rsh              0, 4                 A >> <x>

  tax                                   Copy A into X
  txa                                   Copy X into A

  ret              4, 11                Return

The next table shows addressing formats from the 2nd column:

  Addressing mode  Syntax               Description

  0                x/%x                 Register X
  1                [k]                  BHW at byte offset k in the packet
  2                [x + k]              BHW at the offset X + k in the packet
  3                M[k]                 Word at offset k in M[]
  4                #k                   Literal value stored in k
  5                4*([k]&0xf)          Lower nibble * 4 at byte offset k in the packet
  6                L                    Jump label L
  7                #k,Lt,Lf             Jump to Lt if true, otherwise jump to Lf
  8                x/%x,Lt,Lf           Jump to Lt if true, otherwise jump to Lf
  9                #k,Lt                Jump to Lt if predicate is true
  10               x/%x,Lt              Jump to Lt if predicate is true
  11               a/%a                 Accumulator A
  12               extension            BPF extension

The Linux kernel also has a couple of BPF extensions that are used along
with the class of load instructions by "overloading" the k argument with
a negative offset + a particular extension offset. The result of such a
BPF extension is loaded into A.
| 267 | |
Possible BPF extensions are shown in the following table:

  Extension        Description

  len              skb->len
  proto            skb->protocol
  type             skb->pkt_type
  poff             Payload start offset
  ifidx            skb->dev->ifindex
  nla              Netlink attribute of type X with offset A
  nlan             Nested Netlink attribute of type X with offset A
  mark             skb->mark
  queue            skb->queue_mapping
  hatype           skb->dev->type
  rxhash           skb->hash
  cpu              raw_smp_processor_id()
  vlan_tci         skb_vlan_tag_get(skb)
  vlan_avail       skb_vlan_tag_present(skb)
  vlan_tpid        skb->vlan_proto
  rand             prandom_u32()

These extensions can also be prefixed with '#'.

Examples for low-level BPF:

** ARP packets:

  ldh [12]
  jne #0x806, drop
  ret #-1
  drop: ret #0

** IPv4 TCP packets:

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #6, drop
  ret #-1
  drop: ret #0

** (Accelerated) VLAN w/ id 10:

  ld vlan_tci
  jneq #10, drop
  ret #-1
  drop: ret #0

** icmp random packet sampling, 1 in 4:

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #1, drop
  # get a random uint32 number
  ld rand
  mod #4
  jneq #1, drop
  ret #-1
  drop: ret #0

** SECCOMP filter example:

  ld [4]                  /* offsetof(struct seccomp_data, arch) */
  jne #0xc000003e, bad    /* AUDIT_ARCH_X86_64 */
  ld [0]                  /* offsetof(struct seccomp_data, nr) */
  jeq #15, good           /* __NR_rt_sigreturn */
  jeq #231, good          /* __NR_exit_group */
  jeq #60, good           /* __NR_exit */
  jeq #0, good            /* __NR_read */
  jeq #1, good            /* __NR_write */
  jeq #5, good            /* __NR_fstat */
  jeq #9, good            /* __NR_mmap */
  jeq #14, good           /* __NR_rt_sigprocmask */
  jeq #13, good           /* __NR_rt_sigaction */
  jeq #35, good           /* __NR_nanosleep */
  bad: ret #0             /* SECCOMP_RET_KILL_THREAD */
  good: ret #0x7fff0000   /* SECCOMP_RET_ALLOW */

The above example code can be placed into a file (here called "foo"), and
then be passed to the bpf_asm tool for generating opcodes, output that
xt_bpf and cls_bpf understand and that can be loaded directly. Example
with the above ARP code:

$ ./bpf_asm foo
4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,

In copy-and-paste C-like output:

$ ./bpf_asm -c foo
{ 0x28,  0,  0, 0x0000000c },
{ 0x15,  0,  1, 0x00000806 },
{ 0x06,  0,  0, 0xffffffff },
{ 0x06,  0,  0, 0000000000 },

In particular, as usage with xt_bpf or cls_bpf can result in more complex
BPF filters that might not be obvious at first, it's good to test filters
before attaching them to a live system. For that purpose, there's a small
tool called bpf_dbg under tools/bpf/ in the kernel source directory. This
debugger allows for testing BPF filters against given pcap files, single
stepping through the BPF code on the pcap's packets, and doing BPF machine
register dumps.

Starting bpf_dbg is trivial and just requires issuing:

# ./bpf_dbg

If input and output do not equal stdin/stdout, bpf_dbg takes an
alternative stdin source as a first argument, and an alternative stdout
sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.

Other than that, a particular libreadline configuration can be set via
the file "~/.bpf_dbg_init", and the command history is stored in the file
"~/.bpf_dbg_history".

Interaction in bpf_dbg happens through a shell that also has auto-completion
support (follow-up example commands starting with '>' denote the bpf_dbg
shell). The usual workflow would be to ...

> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
  Loads a BPF filter from standard output of bpf_asm, or transformed via
  e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT
  debugging (next section), this command creates a temporary socket and
  loads the BPF code into the kernel. Thus, this will also be useful for
  JIT developers.

> load pcap foo.pcap
  Loads a standard tcpdump pcap file.

> run [<n>]
bpf passes:1 fails:9
  Runs through all packets from a pcap to account for how many passes and
  fails the filter will generate. A limit of packets to traverse can be
  given.

> disassemble
l0:	ldh [12]
l1:	jeq #0x800, l2, l5
l2:	ldb [23]
l3:	jeq #0x1, l4, l5
l4:	ret #0xffff
l5:	ret #0
  Prints out BPF code disassembly.

> dump
/* { op, jt, jf, k }, */
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 3, 0x00000800 },
{ 0x30, 0, 0, 0x00000017 },
{ 0x15, 0, 1, 0x00000001 },
{ 0x06, 0, 0, 0x0000ffff },
{ 0x06, 0, 0, 0000000000 },
  Prints out C-style BPF code dump.

> breakpoint 0
breakpoint at: l0:	ldh [12]
> breakpoint 1
breakpoint at: l1:	jeq #0x800, l2, l5
  ...
  Sets breakpoints at particular BPF instructions. Issuing a `run` command
  will walk through the pcap file continuing from the current packet and
  break when a breakpoint is hit (another `run` will continue from the
  currently active breakpoint, executing the next instructions):

> run
-- register dump --
pc:       [0]                       <-- program counter
code:     [40] jt[0] jf[0] k[12]    <-- plain BPF code of current instruction
curr:     l0:	ldh [12]            <-- disassembly of current instruction
A:        [00000000][0]             <-- content of A (hex, decimal)
X:        [00000000][0]             <-- content of X (hex, decimal)
M[0,15]:  [00000000][0]             <-- folded content of M (hex, decimal)
-- packet dump --                   <-- current packet from pcap (hex)
len: 42
  0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
 32: 00 00 00 00 00 00 0a 3b 01 01
(breakpoint)
>

> breakpoint
breakpoints: 0 1
  Prints currently set breakpoints.

> step [-<n>, +<n>]
  Performs single stepping through the BPF program from the current pc
  offset. Thus, on each step invocation, the above register dump is issued.
  This can go forwards and backwards in time; a plain `step` will break
  on the next BPF instruction, thus +1. (No `run` needs to be issued here.)

> select <n>
  Selects a given packet from the pcap file to continue from. Thus, on
  the next `run` or `step`, the BPF program will be evaluated against
  the pre-selected packet. Numbering starts just as in Wireshark with
  index 1.

> quit
#
  Exits bpf_dbg.

JIT compiler
------------

The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
PowerPC, ARM, ARM64, MIPS and s390, and it can be enabled through
CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
attached filter from user space or for internal kernel users if it has
previously been enabled by root:

  echo 1 > /proc/sys/net/core/bpf_jit_enable

For JIT developers doing audits etc., each compile run can output the
generated opcode image into the kernel log via:

  echo 2 > /proc/sys/net/core/bpf_jit_enable

Example output from dmesg:

[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3

When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently
set to 1 and setting any other value will fail. This is even the case
for setting bpf_jit_enable to 2, since dumping the final JIT image into
the kernel log is discouraged, and introspection through bpftool (under
tools/bpf/bpftool/) is the generally recommended approach instead.

In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
generating disassembly out of the kernel log's hexdump:

# ./bpf_jit_disasm
70 bytes emitted from JIT compiler (pass:3, flen:6)
ffffffffa0069c8f + <x>:
   0:	push %rbp
   1:	mov %rsp,%rbp
   4:	sub $0x60,%rsp
   8:	mov %rbx,-0x8(%rbp)
   c:	mov 0x68(%rdi),%r9d
  10:	sub 0x6c(%rdi),%r9d
  14:	mov 0xd8(%rdi),%r8
  1b:	mov $0xc,%esi
  20:	callq 0xffffffffe0ff9442
  25:	cmp $0x800,%eax
  2a:	jne 0x0000000000000042
  2c:	mov $0x17,%esi
  31:	callq 0xffffffffe0ff945e
  36:	cmp $0x1,%eax
  39:	jne 0x0000000000000042
  3b:	mov $0xffff,%eax
  40:	jmp 0x0000000000000044
  42:	xor %eax,%eax
  44:	leaveq
  45:	retq

Issuing option `-o` will "annotate" the resulting assembler instructions
with their raw opcodes, which can be very useful for JIT developers:

# ./bpf_jit_disasm -o
70 bytes emitted from JIT compiler (pass:3, flen:6)
ffffffffa0069c8f + <x>:
   0:	push %rbp
	55
   1:	mov %rsp,%rbp
	48 89 e5
   4:	sub $0x60,%rsp
	48 83 ec 60
   8:	mov %rbx,-0x8(%rbp)
	48 89 5d f8
   c:	mov 0x68(%rdi),%r9d
	44 8b 4f 68
  10:	sub 0x6c(%rdi),%r9d
	44 2b 4f 6c
  14:	mov 0xd8(%rdi),%r8
	4c 8b 87 d8 00 00 00
  1b:	mov $0xc,%esi
	be 0c 00 00 00
  20:	callq 0xffffffffe0ff9442
	e8 1d 94 ff e0
  25:	cmp $0x800,%eax
	3d 00 08 00 00
  2a:	jne 0x0000000000000042
	75 16
  2c:	mov $0x17,%esi
	be 17 00 00 00
  31:	callq 0xffffffffe0ff945e
	e8 28 94 ff e0
  36:	cmp $0x1,%eax
	83 f8 01
  39:	jne 0x0000000000000042
	75 07
  3b:	mov $0xffff,%eax
	b8 ff ff 00 00
  40:	jmp 0x0000000000000044
	eb 02
  42:	xor %eax,%eax
	31 c0
  44:	leaveq
	c9
  45:	retq
	c3

For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a
useful toolchain for developing and testing the kernel's JIT compiler.

BPF kernel internals
--------------------
Internally, for the kernel interpreter, a different instruction set
format with similar underlying principles to the BPF described in the
previous paragraphs is being used. However, the instruction set format
is modelled closer to the underlying architecture to mimic native
instruction sets, so that better performance can be achieved (more
details later). This new ISA is called 'eBPF' or 'internal BPF'
interchangeably. (Note: eBPF, which originates from [e]xtended BPF, is
not the same as BPF extensions! While eBPF is an ISA, BPF extensions
date back to classic BPF's 'overloading' of the
BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)

It is designed to be JITed with a one-to-one mapping, which can also open
up the possibility for GCC/LLVM compilers to generate optimized eBPF code
through an eBPF backend that performs almost as fast as natively compiled
code.

The new instruction set was originally designed with the possible goal in
mind to write programs in "restricted C" and compile them into eBPF with
an optional GCC/LLVM backend, so that it can just-in-time map to modern
64-bit CPUs with minimal performance overhead over two steps, that is,
C -> eBPF -> native code.

Currently, the new format is being used for running user BPF programs,
which includes seccomp BPF, classic socket filters, the cls_bpf traffic
classifier, the team driver's classifier for its load-balancing mode,
netfilter's xt_bpf extension, the PTP dissector/classifier, and much more.
They are all internally converted by the kernel into the new instruction
set representation and run in the eBPF interpreter. For in-kernel handlers,
this all works transparently by using bpf_prog_create() for setting up the
filter, and bpf_prog_destroy() for destroying it, respectively. The macro
BPF_PROG_RUN(filter, ctx) transparently invokes the eBPF interpreter or
JITed code to run the filter. 'filter' is a pointer to struct bpf_prog
that we got from bpf_prog_create(), and 'ctx' is the given context (e.g.
the skb pointer). All constraints and restrictions from bpf_check_classic()
apply before a conversion to the new layout is done behind the scenes!

Currently, the classic BPF format is being used for JITing on most 32-bit
architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64 and
arm32 perform JIT compilation from the eBPF instruction set.

Some core changes of the new internal format:

- Number of registers increases from 2 to 10:

  The old format had two registers, A and X, and a hidden frame pointer.
  The new layout extends this to 10 internal registers and a read-only
  frame pointer. Since 64-bit CPUs pass arguments to functions via
  registers, the number of args from an eBPF program to an in-kernel
  function is restricted to 5, and one register is used to accept the
  return value from an in-kernel function. Natively, x86_64 passes the
  first 6 arguments in registers, aarch64/sparcv9/mips64 have 7 - 8
  registers for arguments; x86_64 has 6 callee saved registers, and
  aarch64/sparcv9/mips64 have 11 or more callee saved registers.

Alexei Starovoitov | e4ad403 | 2014-06-10 17:44:06 +0200 | [diff] [blame] | 623 | Therefore, eBPF calling convention is defined as: |
Alexei Starovoitov | 9a985cd | 2014-03-28 18:58:26 +0100 | [diff] [blame] | 624 | |
Alexei Starovoitov | e4ad403 | 2014-06-10 17:44:06 +0200 | [diff] [blame] | 625 | * R0 - return value from in-kernel function, and exit value for eBPF program |
| 626 | * R1 - R5 - arguments from eBPF program to in-kernel function |
Alexei Starovoitov | 9a985cd | 2014-03-28 18:58:26 +0100 | [diff] [blame] | 627 | * R6 - R9 - callee saved registers that in-kernel function will preserve |
| 628 | * R10 - read-only frame pointer to access stack |
| 629 | |
Alexei Starovoitov | e4ad403 | 2014-06-10 17:44:06 +0200 | [diff] [blame] | 630 | Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, |
| 631 | etc, and eBPF calling convention maps directly to ABIs used by the kernel on |
Alexei Starovoitov | 9a985cd | 2014-03-28 18:58:26 +0100 | [diff] [blame] | 632 | 64-bit architectures. |
| 633 | |
| 634 | On 32-bit architectures JIT may map programs that use only 32-bit arithmetic |
| 635 | and may let more complex programs to be interpreted. |
| 636 | |
Alexei Starovoitov | e4ad403 | 2014-06-10 17:44:06 +0200 | [diff] [blame] | 637 | R0 - R5 are scratch registers and eBPF program needs spill/fill them if |
| 638 | necessary across calls. Note that there is only one eBPF program (== one |
| 639 | eBPF main routine) and it cannot call other eBPF functions, it can only |
| 640 | call predefined in-kernel functions, though. |
Alexei Starovoitov | 9a985cd | 2014-03-28 18:58:26 +0100 | [diff] [blame] | 641 | |
| 642 | - Register width increases from 32-bit to 64-bit: |
| 643 | |
| 644 | Still, the semantics of the original 32-bit ALU operations are preserved |
Alexei Starovoitov | e4ad403 | 2014-06-10 17:44:06 +0200 | [diff] [blame] | 645 | via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower |
Alexei Starovoitov | 9a985cd | 2014-03-28 18:58:26 +0100 | [diff] [blame] | 646 | subregisters that zero-extend into 64-bit if they are being written to. |
| 647 | That behavior maps directly to x86_64 and arm64 subregister definition, but |
| 648 | makes other JITs more difficult. |

  32-bit architectures run 64-bit internal BPF programs via the interpreter.
  Their JITs may convert BPF programs that only use 32-bit subregisters into
  the native instruction set and let the rest be interpreted.

  Operation is 64-bit, because on 64-bit architectures, pointers are also
  64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
  so 32-bit eBPF registers would otherwise require defining a register-pair
  ABI; thus, there would be no direct eBPF register to HW register mapping,
  and the JIT would need to do combine/split/move operations for every
  register in and out of the function, which is complex, bug prone and slow.
  Another reason is the use of atomic 64-bit counters.

- Conditional jt/jf targets replaced with jt/fall-through:

  While the original design has constructs such as "if (cond) jump_true;
  else jump_false;", they are replaced with alternative constructs like
  "if (cond) jump_true; /* else fall-through */".

- Introduces bpf_call insn and register passing convention for zero overhead
  calls from/to other kernel functions:

  Before an in-kernel function call, the internal BPF program needs to
  place function arguments into R1 to R5 registers to satisfy the calling
  convention, then the interpreter will take them from registers and pass
  them to the in-kernel function. If R1 - R5 registers are mapped to CPU
  registers that are used for argument passing on the given architecture,
  the JIT compiler doesn't need to emit extra moves. Function arguments will
  be in the correct registers and the BPF_CALL instruction will be JITed as
  a single 'call' HW instruction. This calling convention was picked to
  cover common call situations without performance penalty.

  After an in-kernel function call, R1 - R5 are reset to unreadable and R0
  has the return value of the function. Since R6 - R9 are callee saved,
  their state is preserved across the call.

  For example, consider three C functions:

  u64 f1() { return (*_f2)(1); }
  u64 f2(u64 a) { return f3(a + 1, a); }
  u64 f3(u64 a, u64 b) { return a - b; }

  GCC can compile f1, f3 into x86_64:

  f1:
    movl $1, %edi
    movq _f2(%rip), %rax
    jmp *%rax
  f3:
    movq %rdi, %rax
    subq %rsi, %rax
    ret

  Function f2 in eBPF may look like:

  f2:
    bpf_mov R2, R1
    bpf_add R1, 1
    bpf_call f3
    bpf_exit

  If f2 is JITed and the pointer stored to '_f2', the calls f1 -> f2 -> f3
  and returns will be seamless. Without JIT, the __bpf_prog_run()
  interpreter needs to be used to call into f2.

  For practical reasons all eBPF programs have only one argument 'ctx' which
  is already placed into R1 (e.g. on __bpf_prog_run() startup) and the
  programs can call kernel functions with up to 5 arguments. Calls with 6 or
  more arguments are currently not supported, but these restrictions can be
  lifted if necessary in the future.

  On 64-bit architectures all registers map to HW registers one to one. For
  example, the x86_64 JIT compiler can map them as ...

    R0 - rax
    R1 - rdi
    R2 - rsi
    R3 - rdx
    R4 - rcx
    R5 - r8
    R6 - rbx
    R7 - r13
    R8 - r14
    R9 - r15
    R10 - rbp

  ... since the x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument
  passing and rbx, r12 - r15 are callee saved.

  Then the following internal BPF pseudo-program:

    bpf_mov R6, R1 /* save ctx */
    bpf_mov R2, 2
    bpf_mov R3, 3
    bpf_mov R4, 4
    bpf_mov R5, 5
    bpf_call foo
    bpf_mov R7, R0 /* save foo() return value */
    bpf_mov R1, R6 /* restore ctx for next call */
    bpf_mov R2, 6
    bpf_mov R3, 7
    bpf_mov R4, 8
    bpf_mov R5, 9
    bpf_call bar
    bpf_add R0, R7
    bpf_exit

  After JIT to x86_64, the code may look like:

    push %rbp
    mov %rsp,%rbp
    sub $0x228,%rsp
    mov %rbx,-0x228(%rbp)
    mov %r13,-0x220(%rbp)
    mov %rdi,%rbx
    mov $0x2,%esi
    mov $0x3,%edx
    mov $0x4,%ecx
    mov $0x5,%r8d
    callq foo
    mov %rax,%r13
    mov %rbx,%rdi
    mov $0x2,%esi
    mov $0x3,%edx
    mov $0x4,%ecx
    mov $0x5,%r8d
    callq bar
    add %r13,%rax
    mov -0x228(%rbp),%rbx
    mov -0x220(%rbp),%r13
    leaveq
    retq

  Which in this example is equivalent in C to:

    u64 bpf_filter(u64 ctx)
    {
        return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
    }

  In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
  arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in the proper
  registers and place their return value into '%rax' which is R0 in eBPF.
  Prologue and epilogue are emitted by the JIT and are implicit in the
  interpreter. R0 - R5 are scratch registers, so an eBPF program needs to
  preserve the data across calls itself (in R6 - R9 or on the stack), as
  defined by the calling convention.

  For example the following program is invalid:

    bpf_mov R1, 1
    bpf_call foo
    bpf_mov R0, R1
    bpf_exit

  After the call the registers R1 - R5 contain junk values and cannot be
  read. An in-kernel eBPF verifier is used to validate internal BPF programs.

Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
functions. Both original BPF and the new format use two operand
instructions, which helps to do a one-to-one mapping between an eBPF insn
and an x86 insn during JIT.

The input context pointer for invoking the interpreter function is generic,
its content is defined by a specific use case. For seccomp, register R1
points to seccomp_data; for converted BPF filters, R1 points to a skb.

A program that is translated internally consists of the following elements:

  op:16, jt:8, jf:8, k:32    ==>    op:8, dst_reg:4, src_reg:4, off:16, imm:32

So far 87 internal BPF instructions have been implemented. The 8-bit 'op'
opcode field has room for new instructions. Some of them may use 16/24/32
byte encoding. New instructions must be multiples of 8 bytes to preserve
backward compatibility.

Internal BPF is a general purpose RISC instruction set. Not every register
and every instruction is used during translation from original BPF to the
new format. For example, socket filters do not use the 'exclusive add'
instruction, but tracing filters may do so to maintain counters of events.
Register R9 is not used by socket filters either, but more complex filters
may run out of registers and would have to resort to spill/fill to the
stack.

Internal BPF can be used as a generic assembler for last step performance
optimizations; socket filters and seccomp are using it as an assembler.
Tracing filters may use it as an assembler to generate code from the
kernel. In-kernel usage may not be bounded by security considerations,
since generated internal BPF code may be optimizing an internal code path
and is not exposed to user space. Safety of internal BPF can come from a
verifier (TBD). In such use cases as described, it may be used as a safe
instruction set.

Just like the original BPF, the new format runs within a controlled
environment, is deterministic and the kernel can easily prove that. The
safety of the program can be determined in two steps: the first step does a
depth-first-search to disallow loops and other CFG validation; the second
step starts from the first insn and descends all possible paths. It
simulates execution of every insn and observes the state change of
registers and stack.

eBPF opcode encoding
--------------------

eBPF reuses most of the opcode encoding from classic BPF to simplify
conversion of classic BPF to eBPF. For arithmetic and jump instructions the
8-bit 'code' field is divided into three parts:

  +----------------+--------+--------------------+
  |   4 bits       |  1 bit |   3 bits           |
  | operation code | source | instruction class  |
  +----------------+--------+--------------------+
  (MSB)                                     (LSB)

The three LSB bits store the instruction class, which is one of:

  Classic BPF classes:     eBPF classes:

  BPF_LD    0x00           BPF_LD    0x00
  BPF_LDX   0x01           BPF_LDX   0x01
  BPF_ST    0x02           BPF_ST    0x02
  BPF_STX   0x03           BPF_STX   0x03
  BPF_ALU   0x04           BPF_ALU   0x04
  BPF_JMP   0x05           BPF_JMP   0x05
  BPF_RET   0x06           [ class 6 unused, for future if needed ]
  BPF_MISC  0x07           BPF_ALU64 0x07

When BPF_CLASS(code) == BPF_ALU or BPF_JMP, the 4th bit encodes the source
operand ...

  BPF_K     0x00
  BPF_X     0x08

 * in classic BPF, this means:

  BPF_SRC(code) == BPF_X - use register X as source operand
  BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

 * in eBPF, this means:

  BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
  BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

... and the four MSB bits store the operation code.

If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one
of:

  BPF_ADD   0x00
  BPF_SUB   0x10
  BPF_MUL   0x20
  BPF_DIV   0x30
  BPF_OR    0x40
  BPF_AND   0x50
  BPF_LSH   0x60
  BPF_RSH   0x70
  BPF_NEG   0x80
  BPF_MOD   0x90
  BPF_XOR   0xa0
  BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
  BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
  BPF_END   0xd0  /* eBPF only: endianness conversion */

If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of:

  BPF_JA    0x00
  BPF_JEQ   0x10
  BPF_JGT   0x20
  BPF_JGE   0x30
  BPF_JSET  0x40
  BPF_JNE   0x50  /* eBPF only: jump != */
  BPF_JSGT  0x60  /* eBPF only: signed '>' */
  BPF_JSGE  0x70  /* eBPF only: signed '>=' */
  BPF_CALL  0x80  /* eBPF only: function call */
  BPF_EXIT  0x90  /* eBPF only: function return */
  BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
  BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
  BPF_JSLT  0xc0  /* eBPF only: signed '<' */
  BPF_JSLE  0xd0  /* eBPF only: signed '<=' */

So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
and eBPF. There are only two registers in classic BPF, so it means A += X.
In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and the analogous
dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF.

Classic BPF uses the BPF_MISC class to represent A = X and X = A moves.
eBPF uses the BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
BPF_MISC operations in eBPF, class 7 is used as BPF_ALU64 to mean
exactly the same operations as BPF_ALU, but with 64-bit wide operands
instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
dst_reg = dst_reg + src_reg

Classic BPF wastes the whole BPF_RET class to represent a single 'ret'
operation. Classic BPF_RET | BPF_K means copy imm32 into the return
register and perform a function exit. eBPF is modeled to match the CPU, so
BPF_JMP | BPF_EXIT in eBPF means function exit only. The eBPF program needs
to store the return value into register R0 before doing a BPF_EXIT. Class 6
in eBPF is currently unused and reserved for future use.

For load and store instructions the 8-bit 'code' field is divided as:

  +--------+--------+-------------------+
  | 3 bits | 2 bits |   3 bits          |
  |  mode  |  size  | instruction class |
  +--------+--------+-------------------+
  (MSB)                            (LSB)

The size modifier is one of ...

  BPF_W   0x00    /* word */
  BPF_H   0x08    /* half word */
  BPF_B   0x10    /* byte */
  BPF_DW  0x18    /* eBPF only, double word */

... which encodes the size of the load/store operation:

  B  - 1 byte
  H  - 2 byte
  W  - 4 byte
  DW - 8 byte (eBPF only)

The mode modifier is one of:

  BPF_IMM  0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
  BPF_ABS  0x20
  BPF_IND  0x40
  BPF_MEM  0x60
  BPF_LEN  0x80  /* classic BPF only, reserved in eBPF */
  BPF_MSH  0xa0  /* classic BPF only, reserved in eBPF */
  BPF_XADD 0xc0  /* eBPF only, exclusive add */
eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
(BPF_IND | <size> | BPF_LD) which are used to access packet data.

They had to be carried over from classic BPF to retain the strong
performance of socket filters running in the eBPF interpreter. These
instructions can only be used when the interpreter context is a pointer to
'struct sk_buff' and have seven implicit operands. Register R6 is an
implicit input that must contain a pointer to the sk_buff. Register R0 is
an implicit output which contains the data fetched from the packet.
Registers R1-R5 are scratch registers and must not be used to store data
across BPF_ABS | BPF_LD or BPF_IND | BPF_LD instructions.

These instructions have an implicit program exit condition as well. When an
eBPF program tries to access data beyond the packet boundary, the
interpreter will abort the execution of the program. JIT compilers
therefore must preserve this property. src_reg and imm32 fields are
explicit inputs to these instructions.

For example:

  BPF_IND | BPF_W | BPF_LD means:

    R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
    and R1 - R5 are scratched.

Unlike the classic BPF instruction set, eBPF has generic load/store
operations:

  BPF_MEM | <size> | BPF_STX:  *(size *) (dst_reg + off) = src_reg
  BPF_MEM | <size> | BPF_ST:   *(size *) (dst_reg + off) = imm32
  BPF_MEM | <size> | BPF_LDX:  dst_reg = *(size *) (src_reg + off)
  BPF_XADD | BPF_W  | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
  BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg

Where size is one of: BPF_B, BPF_H, BPF_W or BPF_DW. Note that 1 and 2 byte
atomic increments are not supported.

eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
of two consecutive 'struct bpf_insn' 8-byte blocks and is interpreted as a
single instruction that loads a 64-bit immediate value into a dst_reg.
Classic BPF has a similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
a 32-bit immediate value into a register.

eBPF verifier
-------------
The safety of an eBPF program is determined in two steps.

The first step does a DAG check to disallow loops and other CFG validation.
In particular, it will detect programs that have unreachable instructions
(though the classic BPF checker allows them).

The second step starts from the first insn and descends all possible paths.
It simulates execution of every insn and observes the state change of
registers and stack.

At the start of the program the register R1 contains a pointer to context
and has type PTR_TO_CTX.
If the verifier sees an insn that does R2=R1, then R2 now has type
PTR_TO_CTX as well and can be used on the right hand side of an expression.
If R1=PTR_TO_CTX and the insn is R2=R1+R1, then R2=SCALAR_VALUE,
since addition of two valid pointers makes an invalid pointer.
(In 'secure' mode the verifier will reject any type of pointer arithmetic
to make sure that kernel addresses don't leak to unprivileged users.)

If a register was never written to, it's not readable:
  bpf_mov R0 = R2
  bpf_exit
will be rejected, since R2 is unreadable at the start of the program.

After a kernel function call, R1-R5 are reset to unreadable and
R0 has the return type of the function.

Since R6-R9 are callee saved, their state is preserved across the call.
  bpf_mov R6 = 1
  bpf_call foo
  bpf_mov R0 = R6
  bpf_exit
is a correct program. If there was R1 instead of R6, it would have
been rejected.

Load/store instructions are allowed only with registers of valid types,
which are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and
alignment checked.
For example:
  bpf_mov R1 = 1
  bpf_mov R2 = 2
  bpf_xadd *(u32 *)(R1 + 3) += R2
  bpf_exit
will be rejected, since R1 doesn't have a valid pointer type at the time of
execution of the instruction bpf_xadd.

At the start R1 type is PTR_TO_CTX (a pointer to generic 'struct
bpf_context'). A callback is used to customize the verifier to restrict
eBPF program access to only certain fields within the ctx structure with
specified size and alignment.

For example, the following insn:
  bpf_ld R0 = *(u32 *)(R6 + 8)
intends to load a word from address R6 + 8 and store it into R0.
If R6=PTR_TO_CTX, via the is_valid_access() callback the verifier will know
that offset 8 of size 4 bytes can be accessed for reading, otherwise
the verifier will reject the program.
If R6=PTR_TO_STACK, then the access should be aligned and be within
stack bounds, which are [-MAX_BPF_STACK, 0). In this example the offset is
8, so it will fail verification, since it's out of bounds.

The verifier will allow an eBPF program to read data from the stack only
after it wrote into it.
The classic BPF verifier does a similar check with the M[0-15] memory slots.
For example:
  bpf_ld R0 = *(u32 *)(R10 - 4)
  bpf_exit
is an invalid program.
Though R10 is a correct read-only register and has type PTR_TO_STACK
and R10 - 4 is within stack bounds, there were no stores into that location.

Pointer register spill/fill is tracked as well, since four (R6-R9)
callee saved registers may not be enough for some programs.

Allowed function calls are customized with
bpf_verifier_ops->get_func_proto(). The eBPF verifier will check that
registers match argument constraints. After the call, register R0 will be
set to the return type of the function.

Function calls are the main mechanism to extend functionality of eBPF
programs. Socket filters may let programs call one set of functions,
whereas tracing filters may allow a completely different set.

If a function is made accessible to an eBPF program, it needs to be thought
through from a safety point of view. The verifier will guarantee that the
function is called with valid arguments.

seccomp vs socket filters have different security restrictions for classic
BPF. Seccomp solves this with a two stage verifier: the classic BPF
verifier is followed by the seccomp verifier. In the case of eBPF, one
configurable verifier is shared for all use cases.

See details of the eBPF verifier in kernel/bpf/verifier.c

Register value tracking
-----------------------
In order to determine the safety of an eBPF program, the verifier must track
the range of possible values in each register and also in each stack slot.
This is done with 'struct bpf_reg_state', defined in
include/linux/bpf_verifier.h, which unifies tracking of scalar and pointer
values. Each register state has a type, which is either NOT_INIT (the
register has not been written to), SCALAR_VALUE (some value which is not
usable as a pointer), or a pointer type. The types of pointers describe
their base, as follows:
    PTR_TO_CTX          Pointer to bpf_context.
    CONST_PTR_TO_MAP    Pointer to struct bpf_map. "Const" because arithmetic
                        on these pointers is forbidden.
    PTR_TO_MAP_VALUE    Pointer to the value stored in a map element.
    PTR_TO_MAP_VALUE_OR_NULL
                        Either a pointer to a map value, or NULL; map accesses
                        (see section 'eBPF maps', below) return this type,
                        which becomes a PTR_TO_MAP_VALUE when checked != NULL.
                        Arithmetic on these pointers is forbidden.
    PTR_TO_STACK        Frame pointer.
    PTR_TO_PACKET       skb->data.
    PTR_TO_PACKET_END   skb->data + headlen; arithmetic forbidden.
    PTR_TO_SOCKET       Pointer to struct bpf_sock_ops, implicitly refcounted.
    PTR_TO_SOCKET_OR_NULL
                        Either a pointer to a socket, or NULL; socket lookup
                        returns this type, which becomes a PTR_TO_SOCKET when
                        checked != NULL. PTR_TO_SOCKET is reference-counted,
                        so programs must release the reference through the
                        socket release function before the end of the program.
                        Arithmetic on these pointers is forbidden.
However, a pointer may be offset from this base (as a result of pointer
arithmetic), and this is tracked in two parts: the 'fixed offset' and
'variable offset'. The former is used when an exactly-known value (e.g. an
immediate operand) is added to a pointer, while the latter is used for
values which are not exactly known. The variable offset is also used in
SCALAR_VALUEs, to track the range of possible values in the register.
The verifier's knowledge about the variable offset consists of:
* minimum and maximum values as unsigned
* minimum and maximum values as signed
* knowledge of the values of individual bits, in the form of a 'tnum': a
  u64 'mask' and a u64 'value'. 1s in the mask represent bits whose value
  is unknown; 1s in the value represent bits known to be 1. Bits known to
  be 0 have 0 in both mask and value; no bit should ever be 1 in both. For
  example, if a byte is read into a register from memory, the register's
  top 56 bits are known zero, while the low 8 are unknown - which is
  represented as the tnum (0x0; 0xff). If we then OR this with 0x40, we
  get (0x40; 0xbf), then if we add 1 we get (0x0; 0x1ff), because of
  potential carries.

Besides arithmetic, the register state can also be updated by conditional
branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true'
branch it will have a umin_value (unsigned minimum value) of 9, whereas in
the 'false' branch it will have a umax_value of 8. A signed compare (with
BPF_JSGT or BPF_JSGE) would instead update the signed minimum/maximum
values. Information from the signed and unsigned bounds can be combined;
for instance if a value is first tested < 8 and then tested s> 4, the
verifier will conclude that the value is also > 4 and s< 8, since the
bounds prevent crossing the sign boundary.
Wang YanQing | 68625b7 | 2018-05-10 11:09:21 +0800 | [diff] [blame] | 1164 | |
Edward Cree | 0cbf474 | 2017-08-07 15:30:09 +0100 | [diff] [blame] | 1165 | PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all |
| 1166 | pointers sharing that same variable offset. This is important for packet range |
Wang YanQing | 68625b7 | 2018-05-10 11:09:21 +0800 | [diff] [blame] | 1167 | checks: after adding a variable to a packet pointer register A, if you then copy |
| 1168 | it to another register B and then add a constant 4 to A, both registers will |
| 1169 | share the same 'id' but the A will have a fixed offset of +4. Then if A is |
| 1170 | bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is |
| 1171 | now known to have a safe range of at least 4 bytes. See 'Direct packet access', |
| 1172 | below, for more on PTR_TO_PACKET ranges. |
| 1173 | |
Edward Cree | 0cbf474 | 2017-08-07 15:30:09 +0100 | [diff] [blame] | 1174 | The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of |
| 1175 | the pointer returned from a map lookup. This means that when one copy is |
| 1176 | checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. |
| 1177 | As well as range-checking, the tracked information is also used for enforcing |
| 1178 | alignment of pointer accesses. For instance, on most systems the packet pointer |
| 1179 | is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump |
over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
| 1181 | pointer will have a variable offset known to be 4n+2 for some n, so adding the 2 |
| 1182 | bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through |
| 1183 | that pointer are safe. |
Joe Stringer | a610b66 | 2018-10-02 13:35:41 -0700 | [diff] [blame] | 1184 | The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common |
| 1185 | to all copies of the pointer returned from a socket lookup. This has similar |
| 1186 | behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but |
| 1187 | it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly |
represents a reference to the corresponding 'struct sock'. To ensure that the
reference is not leaked, it is imperative to NULL-check the reference and,
in the non-NULL case, pass the valid reference to the socket release function.
Edward Cree | 0cbf474 | 2017-08-07 15:30:09 +0100 | [diff] [blame] | 1191 | |
Alexei Starovoitov | f9c8d19 | 2016-05-05 19:49:13 -0700 | [diff] [blame] | 1192 | Direct packet access |
| 1193 | -------------------- |
| 1194 | In cls_bpf and act_bpf programs the verifier allows direct access to the packet |
| 1195 | data via skb->data and skb->data_end pointers. |
| 1196 | Ex: |
| 1197 | 1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ |
| 1198 | 2: r3 = *(u32 *)(r1 +76) /* load skb->data */ |
| 1199 | 3: r5 = r3 |
| 1200 | 4: r5 += 14 |
| 1201 | 5: if r5 > r4 goto pc+16 |
| 1202 | R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp |
  6: r0 = *(u16 *)(r3 +12) /* access bytes 12 and 13 of the packet */
| 1204 | |
This 2-byte load from the packet is safe to do, since the program author
checked 'if (skb->data + 14 > skb->data_end) goto err' at insn #5, which
| 1207 | means that in the fall-through case the register R3 (which points to skb->data) |
| 1208 | has at least 14 directly accessible bytes. The verifier marks it |
| 1209 | as R3=pkt(id=0,off=0,r=14). |
| 1210 | id=0 means that no additional variables were added to the register. |
| 1211 | off=0 means that no additional constants were added. |
| 1212 | r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. |
| 1213 | Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points |
| 1214 | to the packet data, but constant 14 was added to the register, so |
| 1215 | it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14) |
| 1216 | which is zero bytes. |
| 1217 | |
| 1218 | More complex packet access may look like: |
Edward Cree | 0cbf474 | 2017-08-07 15:30:09 +0100 | [diff] [blame] | 1219 | R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp |
Alexei Starovoitov | f9c8d19 | 2016-05-05 19:49:13 -0700 | [diff] [blame] | 1220 | 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ |
| 1221 | 7: r4 = *(u8 *)(r3 +12) |
| 1222 | 8: r4 *= 14 |
| 1223 | 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ |
| 1224 | 10: r3 += r4 |
| 1225 | 11: r2 = r1 |
| 1226 | 12: r2 <<= 48 |
| 1227 | 13: r2 >>= 48 |
| 1228 | 14: r3 += r2 |
| 1229 | 15: r2 = r3 |
| 1230 | 16: r2 += 8 |
| 1231 | 17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ |
| 1232 | 18: if r2 > r1 goto pc+2 |
Edward Cree | 0cbf474 | 2017-08-07 15:30:09 +0100 | [diff] [blame] | 1233 | R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp |
Alexei Starovoitov | f9c8d19 | 2016-05-05 19:49:13 -0700 | [diff] [blame] | 1234 | 19: r1 = *(u8 *)(r3 +4) |
| 1235 | The state of the register R3 is R3=pkt(id=2,off=0,r=8) |
| 1236 | id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some |
| 1237 | offset within a packet and since the program author did |
| 1238 | 'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8). |
Edward Cree | 0cbf474 | 2017-08-07 15:30:09 +0100 | [diff] [blame] | 1239 | The verifier only allows 'add'/'sub' operations on packet registers. Any other |
| 1240 | operation will set the register state to 'SCALAR_VALUE' and it won't be |
Alexei Starovoitov | f9c8d19 | 2016-05-05 19:49:13 -0700 | [diff] [blame] | 1241 | available for direct packet access. |
Operation 'r3 += rX' may overflow and become less than the original
skb->data, therefore the verifier has to prevent that. So when it sees the
'r3 += rX' instruction and rX may hold a value wider than 16 bits, any
subsequent bounds check of r3 against skb->data_end will not give 'range'
information, so attempts to read through the pointer will give an
"invalid access to packet" error.
Alexei Starovoitov | f9c8d19 | 2016-05-05 19:49:13 -0700 | [diff] [blame] | 1247 | Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is |
Edward Cree | 0cbf474 | 2017-08-07 15:30:09 +0100 | [diff] [blame] | 1248 | R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits |
| 1249 | of the register are guaranteed to be zero, and nothing is known about the lower |
| 1250 | 8 bits. After insn 'r4 *= 14' the state becomes |
| 1251 | R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit |
| 1252 | value by constant 14 will keep upper 52 bits as zero, also the least significant |
| 1253 | bit will be zero as 14 is even. Similarly 'r2 >>= 48' will make |
| 1254 | R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign |
| 1255 | extending. This logic is implemented in adjust_reg_min_max_vals() function, |
| 1256 | which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice |
| 1257 | versa) and adjust_scalar_min_max_vals() for operations on two scalars. |
Alexei Starovoitov | f9c8d19 | 2016-05-05 19:49:13 -0700 | [diff] [blame] | 1258 | |
The end result is that a BPF program author can access the packet directly
using normal C code such as:
| 1261 | void *data = (void *)(long)skb->data; |
| 1262 | void *data_end = (void *)(long)skb->data_end; |
  struct ethhdr *eth = data;
| 1264 | struct iphdr *iph = data + sizeof(*eth); |
| 1265 | struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); |
| 1266 | |
| 1267 | if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) |
| 1268 | return 0; |
| 1269 | if (eth->h_proto != htons(ETH_P_IP)) |
| 1270 | return 0; |
| 1271 | if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) |
| 1272 | return 0; |
  if (udp->dest == htons(53) || udp->source == htons(9)) /* ports are big-endian */
| 1274 | ...; |
which makes such programs easier to write compared to the LD_ABS insn
and significantly faster.
| 1277 | |
Alexei Starovoitov | 99c55f7 | 2014-09-26 00:16:57 -0700 | [diff] [blame] | 1278 | eBPF maps |
| 1279 | --------- |
'maps' are a generic storage facility for data of different types, used for
sharing data between the kernel and user space.
| 1282 | |
The maps are accessed from user space via the BPF syscall, which has the
following commands:
| 1284 | - create a map with given type and attributes |
| 1285 | map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size) |
| 1286 | using attr->map_type, attr->key_size, attr->value_size, attr->max_entries |
| 1287 | returns process-local file descriptor or negative error |
| 1288 | |
| 1289 | - lookup key in a given map |
| 1290 | err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size) |
| 1291 | using attr->map_fd, attr->key, attr->value |
| 1292 | returns zero and stores found elem into value or negative error |
| 1293 | |
| 1294 | - create or update key/value pair in a given map |
| 1295 | err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size) |
| 1296 | using attr->map_fd, attr->key, attr->value |
| 1297 | returns zero or negative error |
| 1298 | |
| 1299 | - find and delete element by key in a given map |
| 1300 | err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size) |
| 1301 | using attr->map_fd, attr->key |
| 1302 | |
- to delete a map: close(fd)
  An exiting process will delete its maps automatically
| 1305 | |
User space programs use this syscall to create/access maps that eBPF programs
are concurrently updating.

Maps can have different types: hash, array, bloom filter, radix-tree, etc.
| 1310 | |
| 1311 | The map is defined by: |
| 1312 | . type |
| 1313 | . max number of elements |
| 1314 | . key size in bytes |
| 1315 | . value size in bytes |
| 1316 | |
Edward Cree | 0cbf474 | 2017-08-07 15:30:09 +0100 | [diff] [blame] | 1317 | Pruning |
| 1318 | ------- |
| 1319 | The verifier does not actually walk all possible paths through the program. For |
| 1320 | each new branch to analyse, the verifier looks at all the states it's previously |
| 1321 | been in when at this instruction. If any of them contain the current state as a |
| 1322 | subset, the branch is 'pruned' - that is, the fact that the previous state was |
| 1323 | accepted implies the current state would be as well. For instance, if in the |
| 1324 | previous state, r1 held a packet-pointer, and in the current state, r1 holds a |
| 1325 | packet-pointer with a range as long or longer and at least as strict an |
| 1326 | alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't |
| 1327 | have been used by any path from that point, so any value in r2 (including |
| 1328 | another NOT_INIT) is safe. The implementation is in the function regsafe(). |
| 1329 | Pruning considers not only the registers but also the stack (and any spilled |
| 1330 | registers it may hold). They must all be safe for the branch to be pruned. |
| 1331 | This is implemented in states_equal(). |
| 1332 | |
Alexei Starovoitov | 51580e7 | 2014-09-26 00:17:02 -0700 | [diff] [blame] | 1333 | Understanding eBPF verifier messages |
| 1334 | ------------------------------------ |
| 1335 | |
The following are a few examples of invalid eBPF programs and the verifier
error messages as seen in the log:
| 1338 | |
| 1339 | Program with unreachable instructions: |
| 1340 | static struct bpf_insn prog[] = { |
| 1341 | BPF_EXIT_INSN(), |
| 1342 | BPF_EXIT_INSN(), |
| 1343 | }; |
| 1344 | Error: |
| 1345 | unreachable insn 1 |
| 1346 | |
| 1347 | Program that reads uninitialized register: |
| 1348 | BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), |
| 1349 | BPF_EXIT_INSN(), |
| 1350 | Error: |
| 1351 | 0: (bf) r0 = r2 |
| 1352 | R2 !read_ok |
| 1353 | |
| 1354 | Program that doesn't initialize R0 before exiting: |
| 1355 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_1), |
| 1356 | BPF_EXIT_INSN(), |
| 1357 | Error: |
| 1358 | 0: (bf) r2 = r1 |
| 1359 | 1: (95) exit |
| 1360 | R0 !read_ok |
| 1361 | |
| 1362 | Program that accesses stack out of bounds: |
| 1363 | BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), |
| 1364 | BPF_EXIT_INSN(), |
| 1365 | Error: |
| 1366 | 0: (7a) *(u64 *)(r10 +8) = 0 |
| 1367 | invalid stack off=8 size=8 |
| 1368 | |
| 1369 | Program that doesn't initialize stack before passing its address into function: |
| 1370 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), |
| 1371 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), |
| 1372 | BPF_LD_MAP_FD(BPF_REG_1, 0), |
| 1373 | BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), |
| 1374 | BPF_EXIT_INSN(), |
| 1375 | Error: |
| 1376 | 0: (bf) r2 = r10 |
| 1377 | 1: (07) r2 += -8 |
| 1378 | 2: (b7) r1 = 0x0 |
| 1379 | 3: (85) call 1 |
| 1380 | invalid indirect read from stack off -8+0 size 8 |
| 1381 | |
| 1382 | Program that uses invalid map_fd=0 while calling to map_lookup_elem() function: |
| 1383 | BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), |
| 1384 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), |
| 1385 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), |
| 1386 | BPF_LD_MAP_FD(BPF_REG_1, 0), |
| 1387 | BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), |
| 1388 | BPF_EXIT_INSN(), |
| 1389 | Error: |
| 1390 | 0: (7a) *(u64 *)(r10 -8) = 0 |
| 1391 | 1: (bf) r2 = r10 |
| 1392 | 2: (07) r2 += -8 |
| 1393 | 3: (b7) r1 = 0x0 |
| 1394 | 4: (85) call 1 |
| 1395 | fd 0 is not pointing to valid bpf_map |
| 1396 | |
| 1397 | Program that doesn't check return value of map_lookup_elem() before accessing |
| 1398 | map element: |
| 1399 | BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), |
| 1400 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), |
| 1401 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), |
| 1402 | BPF_LD_MAP_FD(BPF_REG_1, 0), |
| 1403 | BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), |
| 1404 | BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), |
| 1405 | BPF_EXIT_INSN(), |
| 1406 | Error: |
| 1407 | 0: (7a) *(u64 *)(r10 -8) = 0 |
| 1408 | 1: (bf) r2 = r10 |
| 1409 | 2: (07) r2 += -8 |
| 1410 | 3: (b7) r1 = 0x0 |
| 1411 | 4: (85) call 1 |
| 1412 | 5: (7a) *(u64 *)(r0 +0) = 0 |
| 1413 | R0 invalid mem access 'map_value_or_null' |
| 1414 | |
| 1415 | Program that correctly checks map_lookup_elem() returned value for NULL, but |
| 1416 | accesses the memory with incorrect alignment: |
| 1417 | BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), |
| 1418 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), |
| 1419 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), |
| 1420 | BPF_LD_MAP_FD(BPF_REG_1, 0), |
| 1421 | BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), |
| 1422 | BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), |
| 1423 | BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0), |
| 1424 | BPF_EXIT_INSN(), |
| 1425 | Error: |
| 1426 | 0: (7a) *(u64 *)(r10 -8) = 0 |
| 1427 | 1: (bf) r2 = r10 |
| 1428 | 2: (07) r2 += -8 |
| 1429 | 3: (b7) r1 = 1 |
| 1430 | 4: (85) call 1 |
| 1431 | 5: (15) if r0 == 0x0 goto pc+1 |
| 1432 | R0=map_ptr R10=fp |
| 1433 | 6: (7a) *(u64 *)(r0 +4) = 0 |
| 1434 | misaligned access off 4 size 8 |
| 1435 | |
| 1436 | Program that correctly checks map_lookup_elem() returned value for NULL and |
| 1437 | accesses memory with correct alignment in one side of 'if' branch, but fails |
| 1438 | to do so in the other side of 'if' branch: |
| 1439 | BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), |
| 1440 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), |
| 1441 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), |
| 1442 | BPF_LD_MAP_FD(BPF_REG_1, 0), |
| 1443 | BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), |
| 1444 | BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), |
| 1445 | BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), |
| 1446 | BPF_EXIT_INSN(), |
| 1447 | BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1), |
| 1448 | BPF_EXIT_INSN(), |
| 1449 | Error: |
| 1450 | 0: (7a) *(u64 *)(r10 -8) = 0 |
| 1451 | 1: (bf) r2 = r10 |
| 1452 | 2: (07) r2 += -8 |
| 1453 | 3: (b7) r1 = 1 |
| 1454 | 4: (85) call 1 |
| 1455 | 5: (15) if r0 == 0x0 goto pc+2 |
| 1456 | R0=map_ptr R10=fp |
| 1457 | 6: (7a) *(u64 *)(r0 +0) = 0 |
| 1458 | 7: (95) exit |
| 1459 | |
| 1460 | from 5 to 8: R0=imm0 R10=fp |
| 1461 | 8: (7a) *(u64 *)(r0 +0) = 1 |
| 1462 | R0 invalid mem access 'imm' |
| 1463 | |
Joe Stringer | a610b66 | 2018-10-02 13:35:41 -0700 | [diff] [blame] | 1464 | Program that performs a socket lookup then sets the pointer to NULL without |
| 1465 | checking it: |
| 1467 | BPF_MOV64_IMM(BPF_REG_2, 0), |
| 1468 | BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), |
| 1469 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), |
| 1470 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), |
| 1471 | BPF_MOV64_IMM(BPF_REG_3, 4), |
| 1472 | BPF_MOV64_IMM(BPF_REG_4, 0), |
| 1473 | BPF_MOV64_IMM(BPF_REG_5, 0), |
| 1474 | BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), |
| 1475 | BPF_MOV64_IMM(BPF_REG_0, 0), |
| 1476 | BPF_EXIT_INSN(), |
| 1477 | Error: |
| 1478 | 0: (b7) r2 = 0 |
| 1479 | 1: (63) *(u32 *)(r10 -8) = r2 |
| 1480 | 2: (bf) r2 = r10 |
| 1481 | 3: (07) r2 += -8 |
| 1482 | 4: (b7) r3 = 4 |
| 1483 | 5: (b7) r4 = 0 |
| 1484 | 6: (b7) r5 = 0 |
| 1485 | 7: (85) call bpf_sk_lookup_tcp#65 |
| 1486 | 8: (b7) r0 = 0 |
| 1487 | 9: (95) exit |
| 1488 | Unreleased reference id=1, alloc_insn=7 |
| 1489 | |
| 1490 | Program that performs a socket lookup but does not NULL-check the returned |
| 1491 | value: |
| 1492 | BPF_MOV64_IMM(BPF_REG_2, 0), |
| 1493 | BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), |
| 1494 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), |
| 1495 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), |
| 1496 | BPF_MOV64_IMM(BPF_REG_3, 4), |
| 1497 | BPF_MOV64_IMM(BPF_REG_4, 0), |
| 1498 | BPF_MOV64_IMM(BPF_REG_5, 0), |
| 1499 | BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), |
| 1500 | BPF_EXIT_INSN(), |
| 1501 | Error: |
| 1502 | 0: (b7) r2 = 0 |
| 1503 | 1: (63) *(u32 *)(r10 -8) = r2 |
| 1504 | 2: (bf) r2 = r10 |
| 1505 | 3: (07) r2 += -8 |
| 1506 | 4: (b7) r3 = 4 |
| 1507 | 5: (b7) r4 = 0 |
| 1508 | 6: (b7) r5 = 0 |
| 1509 | 7: (85) call bpf_sk_lookup_tcp#65 |
| 1510 | 8: (95) exit |
| 1511 | Unreleased reference id=1, alloc_insn=7 |
| 1512 | |
Daniel Borkmann | 04caa48 | 2014-05-23 18:43:59 +0200 | [diff] [blame] | 1513 | Testing |
| 1514 | ------- |
| 1515 | |
| 1516 | Next to the BPF toolchain, the kernel also ships a test module that contains |
| 1517 | various test cases for classic and internal BPF that can be executed against |
| 1518 | the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and |
| 1519 | enabled via Kconfig: |
| 1520 | |
| 1521 | CONFIG_TEST_BPF=m |
| 1522 | |
After the module has been built and installed, the test suite can be
executed by loading the 'test_bpf' module via insmod or modprobe. Results
of the test cases, including timings in nsec, can be found in the kernel
log (dmesg).
| 1526 | |
Daniel Borkmann | 7924cd5 | 2013-12-11 23:43:45 +0100 | [diff] [blame] | 1527 | Misc |
| 1528 | ---- |
| 1529 | |
Trinity, the Linux syscall fuzzer, also has built-in support for BPF and
SECCOMP-BPF kernel fuzzing.
| 1532 | |
| 1533 | Written by |
| 1534 | ---------- |
| 1535 | |
| 1536 | The document was written in the hope that it is found useful and in order |
| 1537 | to give potential BPF hackers or security auditors a better overview of |
| 1538 | the underlying architecture. |
| 1539 | |
| 1540 | Jay Schulist <jschlst@samba.org> |
Alexei Starovoitov | f9c8d19 | 2016-05-05 19:49:13 -0700 | [diff] [blame] | 1541 | Daniel Borkmann <daniel@iogearbox.net> |
| 1542 | Alexei Starovoitov <ast@kernel.org> |