Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 1 | ==================================== |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 2 | Coherent Accelerator Interface (CXL) |
| 3 | ==================================== |
| 4 | |
| 5 | Introduction |
| 6 | ============ |
| 7 | |
| 8 | The coherent accelerator interface is designed to allow the |
| 9 | coherent connection of accelerators (FPGAs and other devices) to a |
| 10 | POWER system. These devices need to adhere to the Coherent |
| 11 | Accelerator Interface Architecture (CAIA). |
| 12 | |
| 13 | IBM refers to this as the Coherent Accelerator Processor Interface |
| 14 | or CAPI. In the kernel it's referred to by the name CXL to avoid |
| 15 | confusion with the ISDN CAPI subsystem. |
| 16 | |
| 17 | Coherent in this context means that the accelerator and CPUs can |
| 18 | both access system memory directly and with the same effective |
| 19 | addresses. |
| 20 | |
| 21 | |
| 22 | Hardware overview |
| 23 | ================= |
| 24 | |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 25 | :: |
| 26 | |
Christophe Lombard | f24be42 | 2017-04-12 16:34:07 +0200 | [diff] [blame] | 27 | POWER8/9 FPGA |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 28 | +----------+ +---------+ |
| 29 | | | | | |
| 30 | | CPU | | AFU | |
| 31 | | | | | |
| 32 | | | | | |
| 33 | | | | | |
| 34 | +----------+ +---------+ |
| 35 | | PHB | | | |
| 36 | | +------+ | PSL | |
| 37 | | | CAPP |<------>| | |
| 38 | +---+------+ PCIE +---------+ |
| 39 | |
Christophe Lombard | f24be42 | 2017-04-12 16:34:07 +0200 | [diff] [blame] | 40 | The POWER8/9 chip has a Coherently Attached Processor Proxy (CAPP) |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 41 | unit which is part of the PCIe Host Bridge (PHB). Linux manages |
| 42 | it through calls into OPAL; Linux doesn't directly program the |
| 43 | CAPP. |
| 44 | |
| 45 | The FPGA (or coherently attached device) consists of two parts: |
| 46 | the POWER Service Layer (PSL) and the Accelerator Function Unit |
| 47 | (AFU). The AFU is used to implement specific functionality behind |
| 48 | the PSL. The PSL, among other things, provides memory address |
| 49 | translation services to allow each AFU direct access to userspace |
| 50 | memory. |
| 51 | |
| 52 | The AFU is the core part of the accelerator (e.g. the compression |
| 53 | or crypto function). The kernel has no knowledge of the function |
| 54 | of the AFU. Only userspace interacts directly with the AFU. |
| 55 | |
| 56 | The PSL provides the translation and interrupt services that the |
| 57 | AFU needs. This is what the kernel interacts with. For example, if |
| 58 | the AFU needs to read a particular effective address, it sends |
| 59 | that address to the PSL; the PSL then translates it, fetches the |
| 60 | data from memory and returns it to the AFU. If the PSL has a |
| 61 | translation miss, it interrupts the kernel and the kernel services |
| 62 | the fault. The context in which this fault is serviced depends on |
| 63 | who owns that acceleration function. |
| 64 | |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 65 | - POWER8 and PSL Version 8 are compliant with CAIA Version 1.0. |
| 66 | - POWER9 and PSL Version 9 are compliant with CAIA Version 2.0. |
| 67 | |
Christophe Lombard | f24be42 | 2017-04-12 16:34:07 +0200 | [diff] [blame] | 68 | PSL Version 9 provides new features such as: |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 69 | |
Christophe Lombard | f24be42 | 2017-04-12 16:34:07 +0200 | [diff] [blame] | 70 | * Interaction with the nest MMU on the P9 chip. |
| 71 | * Native DMA support. |
| 72 | * Sending ASB_Notify messages for host thread wakeup. |
| 73 | * Atomic operations. |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 74 | * etc. |
Christophe Lombard | f24be42 | 2017-04-12 16:34:07 +0200 | [diff] [blame] | 75 | |
| 76 | Cards with a PSL9 won't work on a POWER8 system and cards with a |
| 77 | PSL8 won't work on a POWER9 system. |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 78 | |
| 79 | AFU Modes |
| 80 | ========= |
| 81 | |
| 82 | There are two programming modes supported by the AFU: dedicated |
| 83 | and AFU directed. An AFU may support one or both modes. |
| 84 | |
| 85 | When using dedicated mode, only one MMU context is supported. In |
| 86 | this mode, only one userspace process can use the accelerator at |
| 87 | a time. |
| 88 | |
| 89 | When using AFU directed mode, up to 16K simultaneous contexts can |
| 90 | be supported. This means up to 16K simultaneous userspace |
| 91 | applications may use the accelerator (although specific AFUs may |
| 92 | support fewer). In this mode, the AFU sends a 16-bit context ID |
| 93 | with each of its requests. This tells the PSL which context is |
| 94 | associated with each operation. If the PSL can't translate an |
| 95 | operation, the ID can also be accessed by the kernel so it can |
| 96 | determine the userspace context associated with an operation. |
| 97 | |
| 98 | |
| 99 | MMIO space |
| 100 | ========== |
| 101 | |
| 102 | A portion of the accelerator MMIO space can be directly mapped |
| 103 | from the AFU to userspace. Either the whole space can be mapped or |
| 104 | just a per context portion. The hardware is self-describing, hence |
| 105 | the kernel can determine the offset and size of the per context |
| 106 | portion. |
| 107 | |
| 108 | |
| 109 | Interrupts |
| 110 | ========== |
| 111 | |
| 112 | AFUs may generate interrupts that are destined for userspace. These |
| 113 | are received by the kernel as hardware interrupts and passed on to |
| 114 | userspace via the read syscall documented below. |
| 115 | |
| 116 | Data storage faults and error interrupts are handled by the kernel |
| 117 | driver. |
| 118 | |
| 119 | |
| 120 | Work Element Descriptor (WED) |
| 121 | ============================= |
| 122 | |
| 123 | The WED is a 64-bit parameter passed to the AFU when a context is |
| 124 | started. Its format is up to the AFU, hence the kernel has no |
| 125 | knowledge of what it represents. Typically it will be the |
| 126 | effective address of a work queue or status block where the AFU |
| 127 | and userspace can share control and status information. |
| 128 | |
| 129 | |
| 130 | |
| 131 | |
| 132 | User API |
| 133 | ======== |
| 134 | |
Christophe Lombard | 594ff7d | 2016-03-04 12:26:38 +0100 | [diff] [blame] | 135 | 1. AFU character devices |
Mauro Carvalho Chehab | 8f97986 | 2020-04-14 18:48:51 +0200 | [diff] [blame] | 136 | ^^^^^^^^^^^^^^^^^^^^^^^^ |
Christophe Lombard | 594ff7d | 2016-03-04 12:26:38 +0100 | [diff] [blame] | 137 | |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 138 | For AFUs operating in AFU directed mode, two character device |
| 139 | files will be created. /dev/cxl/afu0.0m will correspond to a |
| 140 | master context and /dev/cxl/afu0.0s will correspond to a slave |
| 141 | context. Master contexts have access to the full MMIO space an |
| 142 | AFU provides. Slave contexts have access to only the per process |
| 143 | MMIO space an AFU provides. |
| 144 | |
| 145 | For AFUs operating in dedicated process mode, the driver will |
| 146 | only create a single character device per AFU called |
| 147 | /dev/cxl/afu0.0d. This will have access to the entire MMIO space |
| 148 | that the AFU provides (like master contexts in AFU directed). |
| 149 | |
| 150 | The types described below are defined in include/uapi/misc/cxl.h |
| 151 | |
| 152 | The following file operations are supported on both slave and |
| 153 | master devices. |
| 154 | |
Masanari Iida | dc12f20 | 2015-07-06 23:41:57 +0900 | [diff] [blame] | 155 | A userspace library libcxl is available here: |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 156 | |
Michael Neuling | aee85fb | 2015-05-27 16:07:01 +1000 | [diff] [blame] | 157 | https://github.com/ibm-capi/libcxl |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 158 | |
Michael Neuling | aee85fb | 2015-05-27 16:07:01 +1000 | [diff] [blame] | 159 | This provides a C interface to this kernel API. |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 160 | |
| 161 | open |
| 162 | ---- |
| 163 | |
| 164 | Opens the device and allocates a file descriptor to be used with |
| 165 | the rest of the API. |
| 166 | |
| 167 | A dedicated mode AFU only has one context and only allows the |
| 168 | device to be opened once. |
| 169 | |
| 170 | An AFU directed mode AFU can have many contexts; the device can be |
| 171 | opened once for each context that is available. |
| 172 | |
| 173 | When all available contexts are allocated the open call will fail |
| 174 | and return -ENOSPC. |
| 175 | |
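The open behaviour described above can be sketched in C as follows. This is only an illustration: open_afu() is a hypothetical helper (it is not part of the kernel API or of libcxl), and the device path is whatever the caller passes in. From userspace the failure shows up as open() returning -1 with errno set to ENOSPC; the helper folds that into a negative errno return value.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Open an AFU character device.
 * Returns the file descriptor, or a negative errno value on failure.
 * open_afu() is a hypothetical helper used only for illustration.
 */
int open_afu(const char *path)
{
	int fd = open(path, O_RDWR);

	if (fd < 0) {
		int err = errno;	/* save before calling anything else */

		if (err == ENOSPC)
			fprintf(stderr, "%s: all contexts are in use\n", path);
		else
			fprintf(stderr, "%s: %s\n", path, strerror(err));
		return -err;
	}
	return fd;
}
```

A caller might try open_afu("/dev/cxl/afu0.0s") and, on -ENOSPC, wait for another process to close its context before retrying.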
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 176 | Note: |
| 177 | IRQs need to be allocated for each context, which may limit |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 178 | the number of contexts that can be created, and therefore |
| 179 | how many times the device can be opened. The POWER8 CAPP |
| 180 | supports 2040 IRQs and 3 are used by the kernel, so 2037 are |
| 181 | left. If 1 IRQ is needed per context, then only 2037 |
| 182 | contexts can be allocated. If 4 IRQs are needed per context, |
| 183 | then only 2037/4 = 509 contexts can be allocated. |
| 184 | |
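The arithmetic in the note above can be written as a small C function. The name max_contexts() and its parameters are hypothetical; the function only reproduces the division from the note and ignores any other limit (such as the number of contexts the AFU itself supports).

```c
/*
 * Upper bound on the number of contexts for a given IRQ budget.
 * max_contexts() is a hypothetical helper used only for illustration.
 */
unsigned int max_contexts(unsigned int total_irqs,
			  unsigned int kernel_irqs,
			  unsigned int irqs_per_context)
{
	/* No meaningful bound can be derived from these inputs. */
	if (irqs_per_context == 0 || kernel_irqs > total_irqs)
		return 0;

	/* Integer division rounds down: a partial context is unusable. */
	return (total_irqs - kernel_irqs) / irqs_per_context;
}
```

With the POWER8 figures from the note, max_contexts(2040, 3, 1) gives 2037 and max_contexts(2040, 3, 4) gives 509.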
| 185 | |
| 186 | ioctl |
| 187 | ----- |
| 188 | |
| 189 | CXL_IOCTL_START_WORK: |
| 190 | Starts the AFU context and associates it with the current |
| 191 | process. Once this ioctl is successfully executed, all memory |
| 192 | mapped into this process is accessible to this AFU context |
| 193 | using the same effective addresses. No additional calls are |
| 194 | required to map/unmap memory. The AFU memory context will be |
| 195 | updated as userspace allocates and frees memory. This ioctl |
| 196 | returns once the AFU context is started. |
| 197 | |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 198 | Takes a pointer to a struct cxl_ioctl_start_work |
| 199 | |
| 200 | :: |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 201 | |
| 202 | struct cxl_ioctl_start_work { |
| 203 | __u64 flags; |
| 204 | __u64 work_element_descriptor; |
| 205 | __u64 amr; |
| 206 | __s16 num_interrupts; |
| 207 | __s16 reserved1; |
| 208 | __s32 reserved2; |
| 209 | __u64 reserved3; |
| 210 | __u64 reserved4; |
| 211 | __u64 reserved5; |
| 212 | __u64 reserved6; |
| 213 | }; |
| 214 | |
| 215 | flags: |
| 216 | Indicates which optional fields in the structure are |
| 217 | valid. |
| 218 | |
| 219 | work_element_descriptor: |
| 220 | The Work Element Descriptor (WED) is a 64-bit argument |
| 221 | defined by the AFU. Typically this is an effective |
| 222 | address pointing to an AFU specific structure |
| 223 | describing what work to perform. |
| 224 | |
| 225 | amr: |
| 226 | Authority Mask Register (AMR), same as the powerpc |
| 227 | AMR. This field is only used by the kernel when the |
| 228 | corresponding CXL_START_WORK_AMR value is specified in |
| 229 | flags. If not specified the kernel will use a default |
| 230 | value of 0. |
| 231 | |
| 232 | num_interrupts: |
| 233 | Number of userspace interrupts to request. This field |
| 234 | is only used by the kernel when the corresponding |
| 235 | CXL_START_WORK_NUM_IRQS value is specified in flags. |
| 236 | If not specified the minimum number required by the |
| 237 | AFU will be allocated. The min and max number can be |
| 238 | obtained from sysfs. |
| 239 | |
| 240 | reserved fields: |
| 241 | For ABI padding and future extensions |
| 242 | |
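A minimal sketch of preparing the structure above, under stated assumptions: the struct is mirrored locally so the example compiles without kernel headers (real code would include the uapi header instead), the two flag values are written from memory and should be checked against that header, and prepare_start_work() is a hypothetical helper.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of the structure shown above, for illustration only. */
struct cxl_ioctl_start_work {
	uint64_t flags;
	uint64_t work_element_descriptor;
	uint64_t amr;
	int16_t num_interrupts;
	int16_t reserved1;
	int32_t reserved2;
	uint64_t reserved3;
	uint64_t reserved4;
	uint64_t reserved5;
	uint64_t reserved6;
};

/* Assumed flag values; take the real ones from the uapi header. */
#define CXL_START_WORK_AMR	0x0000000000000001ULL
#define CXL_START_WORK_NUM_IRQS	0x0000000000000002ULL

/*
 * Fill in a start-work request for the given WED.
 * Pass num_irqs < 0 to leave the choice to the kernel, which then
 * allocates the minimum number required by the AFU.
 * prepare_start_work() is a hypothetical helper.
 */
void prepare_start_work(struct cxl_ioctl_start_work *work,
			uint64_t wed, int16_t num_irqs)
{
	memset(work, 0, sizeof(*work));	/* keep reserved fields zeroed */
	work->work_element_descriptor = wed;
	if (num_irqs >= 0) {
		/* The field is only honoured when its flag is set. */
		work->flags |= CXL_START_WORK_NUM_IRQS;
		work->num_interrupts = num_irqs;
	}
}
```

The prepared structure would then be passed by pointer as the argument of the CXL_IOCTL_START_WORK ioctl on the opened AFU file descriptor. The amr field is left at 0 here, which matches the default described above.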
| 243 | CXL_IOCTL_GET_PROCESS_ELEMENT: |
| 244 | Get the current context id, also known as the process element. |
| 245 | The value is returned from the kernel as a __u32. |
| 246 | |
| 247 | |
| 248 | mmap |
| 249 | ---- |
| 250 | |
| 251 | An AFU may have an MMIO space to facilitate communication with the |
| 252 | AFU. If it does, the MMIO space can be accessed via mmap. The size |
| 253 | and contents of this area are specific to the particular AFU. The |
| 254 | size can be discovered via sysfs. |
| 255 | |
| 256 | In AFU directed mode, master contexts are allowed to map all of |
| 257 | the MMIO space and slave contexts are allowed to only map the per |
| 258 | process MMIO space associated with the context. In dedicated |
| 259 | process mode the entire MMIO space can always be mapped. |
| 260 | |
| 261 | This mmap call must be done after the START_WORK ioctl. |
| 262 | |
| 263 | Care should be taken when accessing MMIO space. Only 32-bit and 64-bit |
| 264 | accesses are supported by POWER8. Also, the AFU will be designed |
| 265 | with a specific endianness, so all MMIO accesses should consider |
| 266 | endianness (the endian(3) variants such as le64toh() and |
| 267 | be64toh() are recommended). These endian issues equally apply to |
| 268 | shared memory queues the WED may describe. |
| 269 | |
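As a sketch of the endianness advice above, the following C helpers wrap 64-bit accesses to a register of an AFU that is assumed, purely for the sake of the example, to be little-endian. The helper names are hypothetical; le64toh() and htole64() are the endian(3) functions provided by glibc.

```c
#define _DEFAULT_SOURCE		/* exposes le64toh()/htole64() on glibc */
#include <endian.h>
#include <stdint.h>

/*
 * Read a 64-bit register of a little-endian AFU and return the value
 * in host byte order. The volatile qualifier keeps the compiler from
 * eliding or merging the access. Hypothetical helper.
 */
uint64_t mmio_read_le64(const volatile uint64_t *reg)
{
	return le64toh(*reg);
}

/* Write a host-order value to a 64-bit register of a little-endian AFU. */
void mmio_write_le64(volatile uint64_t *reg, uint64_t value)
{
	*reg = htole64(value);
}
```

For a big-endian AFU the same pattern applies with be64toh() and htobe64(). The reg pointer would normally point into the region returned by mmap on the AFU file descriptor.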
| 270 | |
| 271 | read |
| 272 | ---- |
| 273 | |
| 274 | Reads events from the AFU. Blocks if no events are pending |
| 275 | (unless O_NONBLOCK is supplied). Returns -EIO in the case of an |
| 276 | unrecoverable error or if the card is removed. |
| 277 | |
| 278 | read() will always return an integral number of events. |
| 279 | |
| 280 | The buffer passed to read() must be at least 4K bytes. |
| 281 | |
| 282 | The result of the read will be a buffer of one or more events; |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 283 | each event is of type struct cxl_event and of varying size:: |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 284 | |
| 285 | struct cxl_event { |
| 286 | struct cxl_event_header header; |
| 287 | union { |
| 288 | struct cxl_event_afu_interrupt irq; |
| 289 | struct cxl_event_data_storage fault; |
| 290 | struct cxl_event_afu_error afu_error; |
| 291 | }; |
| 292 | }; |
| 293 | |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 294 | The struct cxl_event_header is defined as |
| 295 | |
| 296 | :: |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 297 | |
| 298 | struct cxl_event_header { |
| 299 | __u16 type; |
| 300 | __u16 size; |
| 301 | __u16 process_element; |
| 302 | __u16 reserved1; |
| 303 | }; |
| 304 | |
| 305 | type: |
| 306 | This defines the type of event. The type determines how |
| 307 | the rest of the event is structured. These types are |
| 308 | described below and defined by enum cxl_event_type. |
| 309 | |
| 310 | size: |
| 311 | This is the size of the event in bytes including the |
| 312 | struct cxl_event_header. The start of the next event can |
| 313 | be found at this offset from the start of the current |
| 314 | event. |
| 315 | |
| 316 | process_element: |
| 317 | Context ID of the event. |
| 318 | |
| 319 | reserved field: |
| 320 | For future extensions and padding. |
| 321 | |
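Because each header carries the size of its event, a buffer returned by read() can be walked event by event. The sketch below shows one way to do that; the header is mirrored locally so the example compiles without kernel headers, and count_events() is a hypothetical helper that only counts events and checks the size fields.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the header shown above, for illustration only. */
struct cxl_event_header {
	uint16_t type;
	uint16_t size;
	uint16_t process_element;
	uint16_t reserved1;
};

/*
 * Walk a buffer filled by read() and count the events in it.
 * Returns the number of events, or -1 if a size field is
 * inconsistent with the buffer length.
 * count_events() is a hypothetical helper.
 */
int count_events(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	int n = 0;

	while (off < len) {
		struct cxl_event_header hdr;

		if (len - off < sizeof(hdr))
			return -1;
		/* memcpy avoids unaligned access through a cast pointer. */
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.size < sizeof(hdr) || hdr.size > len - off)
			return -1;
		/* A real consumer would dispatch on hdr.type here. */
		off += hdr.size;	/* start of the next event */
		n++;
	}
	return n;
}
```

In practice buf would be a buffer of at least 4K bytes and len the value returned by read().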
| 322 | If the event type is CXL_EVENT_AFU_INTERRUPT then the event |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 323 | structure is defined as |
| 324 | |
| 325 | :: |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 326 | |
| 327 | struct cxl_event_afu_interrupt { |
| 328 | __u16 flags; |
| 329 | __u16 irq; /* Raised AFU interrupt number */ |
| 330 | __u32 reserved1; |
| 331 | }; |
| 332 | |
| 333 | flags: |
| 334 | These flags indicate which optional fields are present |
| 335 | in this struct. Currently all fields are mandatory. |
| 336 | |
| 337 | irq: |
| 338 | The IRQ number sent by the AFU. |
| 339 | |
| 340 | reserved field: |
| 341 | For future extensions and padding. |
| 342 | |
| 343 | If the event type is CXL_EVENT_DATA_STORAGE then the event |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 344 | structure is defined as |
| 345 | |
| 346 | :: |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 347 | |
| 348 | struct cxl_event_data_storage { |
| 349 | __u16 flags; |
| 350 | __u16 reserved1; |
| 351 | __u32 reserved2; |
| 352 | __u64 addr; |
| 353 | __u64 dsisr; |
| 354 | __u64 reserved3; |
| 355 | }; |
| 356 | |
| 357 | flags: |
| 358 | These flags indicate which optional fields are present in |
| 359 | this struct. Currently all fields are mandatory. |
| 360 | |
| 361 | addr: |
| 362 | The address that the AFU unsuccessfully attempted to |
| 363 | access. Valid accesses will be handled transparently by the |
| 364 | kernel but invalid accesses will generate this event. |
| 365 | |
| 366 | dsisr: |
| 367 | This field gives information on the type of fault. It is a |
| 368 | copy of the DSISR from the PSL hardware when the address |
| 369 | fault occurred. The form of the DSISR is as defined in the |
| 370 | CAIA. |
| 371 | |
| 372 | reserved fields: |
| 373 | For future extensions |
| 374 | |
| 375 | If the event type is CXL_EVENT_AFU_ERROR then the event structure |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 376 | is defined as |
| 377 | |
| 378 | :: |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 379 | |
| 380 | struct cxl_event_afu_error { |
| 381 | __u16 flags; |
| 382 | __u16 reserved1; |
| 383 | __u32 reserved2; |
| 384 | __u64 error; |
| 385 | }; |
| 386 | |
| 387 | flags: |
| 388 | These flags indicate which optional fields are present in |
| 389 | this struct. Currently all fields are mandatory. |
| 390 | |
| 391 | error: |
| 392 | Error status from the AFU. Defined by the AFU. |
| 393 | |
| 394 | reserved fields: |
| 395 | For future extensions and padding |
| 396 | |
Christophe Lombard | 594ff7d | 2016-03-04 12:26:38 +0100 | [diff] [blame] | 397 | |
| 398 | 2. Card character device (PowerVM guest only) |
Mauro Carvalho Chehab | 8f97986 | 2020-04-14 18:48:51 +0200 | [diff] [blame] | 399 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
Christophe Lombard | 594ff7d | 2016-03-04 12:26:38 +0100 | [diff] [blame] | 400 | |
| 401 | In a PowerVM guest, an extra character device is created for the |
| 402 | card. The device is only used to write (flash) a new image on the |
| 403 | FPGA accelerator. Once the image is written and verified, the |
| 404 | device tree is updated and the card is reset to reload the updated |
| 405 | image. |
| 406 | |
| 407 | open |
| 408 | ---- |
| 409 | |
| 410 | Opens the device and allocates a file descriptor to be used with |
| 411 | the rest of the API. The device can only be opened once. |
| 412 | |
| 413 | ioctl |
| 414 | ----- |
| 415 | |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 416 | CXL_IOCTL_DOWNLOAD_IMAGE / CXL_IOCTL_VALIDATE_IMAGE: |
Christophe Lombard | 594ff7d | 2016-03-04 12:26:38 +0100 | [diff] [blame] | 417 | Starts and controls flashing a new FPGA image. Partial |
| 418 | reconfiguration is not supported (yet), so the image must contain |
| 419 | a copy of the PSL and AFU(s). Since an image can be quite large, |
| 420 | the caller may have to iterate, splitting the image into smaller |
| 421 | chunks. |
| 422 | |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 423 | Takes a pointer to a struct cxl_adapter_image:: |
| 424 | |
Christophe Lombard | 594ff7d | 2016-03-04 12:26:38 +0100 | [diff] [blame] | 425 | struct cxl_adapter_image { |
| 426 | __u64 flags; |
| 427 | __u64 data; |
| 428 | __u64 len_data; |
| 429 | __u64 len_image; |
| 430 | __u64 reserved1; |
| 431 | __u64 reserved2; |
| 432 | __u64 reserved3; |
| 433 | __u64 reserved4; |
| 434 | }; |
| 435 | |
| 436 | flags: |
| 437 | These flags indicate which optional fields are present in |
| 438 | this struct. Currently all fields are mandatory. |
| 439 | |
| 440 | data: |
| 441 | Pointer to a buffer with part of the image to write to the |
| 442 | card. |
| 443 | |
| 444 | len_data: |
| 445 | Size of the buffer pointed to by data. |
| 446 | |
| 447 | len_image: |
| 448 | Full size of the image. |
| 449 | |
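The chunked iteration mentioned above could be organised along these lines. This is a sketch under assumptions: the struct is mirrored locally, next_image_chunk() is a hypothetical helper, and chunk_max is a limit chosen by the caller rather than a value taken from the driver. The helper only fills in the request; issuing the ioctl is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the structure shown above, for illustration only. */
struct cxl_adapter_image {
	uint64_t flags;
	uint64_t data;
	uint64_t len_data;
	uint64_t len_image;
	uint64_t reserved1;
	uint64_t reserved2;
	uint64_t reserved3;
	uint64_t reserved4;
};

/*
 * Describe the chunk of the image that starts at byte offset off.
 * Returns the number of bytes covered by the request, or 0 once off
 * has reached the end of the image.
 * next_image_chunk() is a hypothetical helper.
 */
size_t next_image_chunk(struct cxl_adapter_image *req,
			const unsigned char *image, size_t image_len,
			size_t off, size_t chunk_max)
{
	size_t n;

	if (off >= image_len || chunk_max == 0)
		return 0;
	n = image_len - off;
	if (n > chunk_max)
		n = chunk_max;	/* the last chunk may be shorter */

	memset(req, 0, sizeof(*req));	/* flags and reserved fields zeroed */
	req->data = (uint64_t)(uintptr_t)(image + off);
	req->len_data = n;
	req->len_image = image_len;
	return n;
}
```

A caller would loop, advancing off by the returned count and passing each prepared request to the download or validate ioctl, until the helper returns 0.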
| 450 | |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 451 | Sysfs Class |
| 452 | =========== |
| 453 | |
| 454 | A cxl sysfs class is added under /sys/class/cxl to facilitate |
| 455 | enumeration and tuning of the accelerators. Its layout is |
| 456 | described in Documentation/ABI/testing/sysfs-class-cxl |
| 457 | |
Michael Neuling | aee85fb | 2015-05-27 16:07:01 +1000 | [diff] [blame] | 458 | |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 459 | Udev rules |
| 460 | ========== |
| 461 | |
| 462 | The following udev rules could be used to create a symlink to the |
| 463 | most logical chardev to use in any programming mode (afuX.Yd for |
| 464 | dedicated, afuX.Ys for AFU directed), since the API is virtually |
Mauro Carvalho Chehab | 4d2e26a | 2019-04-10 08:32:42 -0300 | [diff] [blame] | 465 | identical for each:: |
Ian Munsie | a9282d0 | 2014-10-08 19:55:05 +1100 | [diff] [blame] | 466 | |
| 467 | SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b" |
| 468 | SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \ |
| 469 | KERNEL=="afu[0-9]*.[0-9]*s", SYMLINK="cxl/%b" |