| .. SPDX-License-Identifier: GPL-2.0 | 
 |  | 
 | ============== | 
 | Devlink Health | 
 | ============== | 
 |  | 
 | Background | 
 | ========== | 
 |  | 
 | The ``devlink`` health mechanism is targeted for Real Time Alerting, in | 
 | order to know when something bad happened to a PCI device. | 
 |  | 
 |   * Provide alert debug information. | 
 |   * Self healing. | 
 |   * If problem needs vendor support, provide a way to gather all needed | 
 |     debugging information. | 
 |  | 
 | Overview | 
 | ======== | 
 |  | 
 | The main idea is to unify and centralize driver health reports in the | 
 | generic ``devlink`` instance and allow the user to set different | 
 | attributes of the health reporting and recovery procedures. | 
 |  | 
 | The ``devlink`` health reporter: | 
 | Device driver creates a "health reporter" per each error/health type. | 
 | Error/Health type can be a known/generic (e.g. PCI error, fw error, rx/tx error) | 
 | or unknown (driver specific). | 
 | For each registered health reporter a driver can issue error/health reports | 
 | asynchronously. All health reports handling is done by ``devlink``. | 
 | Device driver can provide specific callbacks for each "health reporter", e.g.: | 
 |  | 
 |   * Recovery procedures | 
 |   * Diagnostics procedures | 
 |   * Object dump procedures | 
 |   * Out Of Box initial parameters | 
 |  | 
 | Different parts of the driver can register different types of health reporters | 
 | with different handlers. | 
 |  | 
 | Actions | 
 | ======= | 
 |  | 
 | Once an error is reported, devlink health will perform the following actions: | 
 |  | 
 |   * A log is being send to the kernel trace events buffer | 
 |   * Health status and statistics are being updated for the reporter instance | 
 |   * Object dump is being taken and saved at the reporter instance (as long as | 
 |     auto-dump is set and there is no other dump which is already stored) | 
 |   * Auto recovery attempt is being done. Depends on: | 
 |  | 
 |     - Auto-recovery configuration | 
 |     - Grace period vs. time passed since last recover | 
 |  | 
 | Devlink formatted message | 
 | ========================= | 
 |  | 
 | To handle devlink health diagnose and health dump requests, devlink creates a | 
 | formatted message structure ``devlink_fmsg`` and send it to the driver's callback | 
 | to fill the data in using the devlink fmsg API. | 
 |  | 
 | Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in | 
 | json-like format. The API allows the driver to add nested attributes such as | 
 | object, object pair and value array, in addition to attributes such as name and | 
 | value. | 
 |  | 
 | Driver should use this API to fill the fmsg context in a format which will be | 
 | translated by the devlink to the netlink message later. When it needs to send | 
 | the data using SKBs to the netlink layer, it fragments the data between | 
 | different SKBs. In order to do this fragmentation, it uses virtual nests | 
 | attributes, to avoid actual nesting use which cannot be divided between | 
 | different SKBs. | 
 |  | 
 | User Interface | 
 | ============== | 
 |  | 
 | User can access/change each reporter's parameters and driver specific callbacks | 
 | via ``devlink``, e.g per error type (per health reporter): | 
 |  | 
 |   * Configure reporter's generic parameters (like: disable/enable auto recovery) | 
 |   * Invoke recovery procedure | 
 |   * Run diagnostics | 
 |   * Object dump | 
 |  | 
 | .. list-table:: List of devlink health interfaces | 
 |    :widths: 10 90 | 
 |  | 
 |    * - Name | 
 |      - Description | 
 |    * - ``DEVLINK_CMD_HEALTH_REPORTER_GET`` | 
 |      - Retrieves status and configuration info per DEV and reporter. | 
 |    * - ``DEVLINK_CMD_HEALTH_REPORTER_SET`` | 
 |      - Allows reporter-related configuration setting. | 
 |    * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER`` | 
 |      - Triggers reporter's recovery procedure. | 
 |    * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST`` | 
 |      - Triggers a fake health event on the reporter. The effects of the test | 
 |        event in terms of recovery flow should follow closely that of a real | 
 |        event. | 
 |    * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE`` | 
 |      - Retrieves current device state related to the reporter. | 
 |    * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET`` | 
 |      - Retrieves the last stored dump. Devlink health | 
 |        saves a single dump. If an dump is not already stored by devlink | 
 |        for this reporter, devlink generates a new dump. | 
 |        Dump output is defined by the reporter. | 
 |    * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` | 
 |      - Clears the last saved dump file for the specified reporter. | 
 |  | 
 | The following diagram provides a general overview of ``devlink-health``:: | 
 |  | 
 |                                                    netlink | 
 |                                           +--------------------------+ | 
 |                                           |                          | | 
 |                                           |            +             | | 
 |                                           |            |             | | 
 |                                           +--------------------------+ | 
 |                                                        |request for ops | 
 |                                                        |(diagnose, | 
 |       driver                               devlink     |recover, | 
 |                                                        |dump) | 
 |     +--------+                            +--------------------------+ | 
 |     |        |                            |    reporter|             | | 
 |     |        |                            |  +---------v----------+  | | 
 |     |        |   ops execution            |  |                    |  | | 
 |     |     <----------------------------------+                    |  | | 
 |     |        |                            |  |                    |  | | 
 |     |        |                            |  + ^------------------+  | | 
 |     |        |                            |    | request for ops     | | 
 |     |        |                            |    | (recover, dump)     | | 
 |     |        |                            |    |                     | | 
 |     |        |                            |  +-+------------------+  | | 
 |     |        |     health report          |  | health handler     |  | | 
 |     |        +------------------------------->                    |  | | 
 |     |        |                            |  +--------------------+  | | 
 |     |        |     health reporter create |                          | | 
 |     |        +---------------------------->                          | | 
 |     +--------+                            +--------------------------+ |