| .. SPDX-License-Identifier: GPL-2.0 |
| |
| ============== |
| Devlink Health |
| ============== |
| |
| Background |
| ========== |
| |
| The ``devlink`` health mechanism targets real-time alerting: it lets the user |
| know when something bad has happened to a PCI device. Its goals are: |
| |
| * Providing debug information along with the alert. |
| * Self-healing. |
| * Providing a way to gather all needed debugging information when a problem |
| requires vendor support. |
| |
| Overview |
| ======== |
| |
| The main idea is to unify and centralize driver health reports in the |
| generic ``devlink`` instance and allow the user to set different |
| attributes of the health reporting and recovery procedures. |
| |
| The ``devlink`` health reporter: |
| A device driver creates a "health reporter" for each error/health type. |
| An error/health type can be known/generic (e.g. PCI error, fw error, rx/tx error) |
| or unknown (driver specific). |
| For each registered health reporter, a driver can issue error/health reports |
| asynchronously. All health report handling is done by ``devlink``. |
| A device driver can provide specific callbacks for each "health reporter", e.g.: |
| |
| * Recovery procedures |
| * Diagnostics procedures |
| * Object dump procedures |
| * Out-of-box initial parameters |
| |
| Different parts of the driver can register different types of health reporters |
| with different handlers. |
| |
| Actions |
| ======= |
| |
| Once an error is reported, devlink health performs the following actions: |
| |
| * A log is sent to the kernel trace events buffer |
| * Health status and statistics are updated for the reporter instance |
| * An object dump is taken and saved at the reporter instance (as long as |
| auto-dump is set and no other dump is already stored) |
| * An auto recovery attempt is made, depending on: |
| |
| - The auto-recovery configuration |
| - The grace period vs. the time passed since the last recovery |
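
The two conditions on the auto recovery attempt can be pictured with a small
userspace model. This is only a sketch of the rule as stated above, not the
kernel implementation: the ``reporter_model`` structure, its field names and
the ``should_auto_recover()`` helper are hypothetical, and time is assumed to
be a monotonic millisecond counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the reporter state relevant to auto recovery. */
struct reporter_model {
	bool auto_recover;         /* auto-recovery configuration */
	bool has_recovered;        /* true once at least one recovery happened */
	uint64_t grace_period_ms;  /* configured grace period */
	uint64_t last_recovery_ms; /* timestamp of the last recovery */
};

/*
 * Decide whether an auto recovery attempt should be made at time now_ms.
 * Assumes now_ms >= r->last_recovery_ms (monotonic clock).
 */
static bool should_auto_recover(const struct reporter_model *r,
				uint64_t now_ms)
{
	if (!r->auto_recover)
		return false;	/* disabled by configuration */
	if (!r->has_recovered)
		return true;	/* no previous recovery to compare against */
	/* Attempt recovery only once the grace period has elapsed. */
	return now_ms - r->last_recovery_ms >= r->grace_period_ms;
}
```

In this model, an error reported 200 ms after a recovery with a 500 ms grace
period does not trigger a new attempt, while one reported 500 ms or later does.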
| |
| Devlink formatted message |
| ========================= |
| |
| To handle devlink health diagnose and health dump requests, devlink creates a |
| formatted message structure ``devlink_fmsg`` and sends it to the driver's callback, |
| which fills in the data using the devlink fmsg API. |
| |
| Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in |
| a JSON-like format. The API allows the driver to add nested attributes such as |
| object, object pair and value array, in addition to attributes such as name and |
| value. |
| |
| The driver should use this API to fill the fmsg context in a format which |
| devlink later translates to a netlink message. When devlink needs to send |
| the data to the netlink layer using SKBs, it fragments the data between |
| different SKBs. To make this fragmentation possible, it uses virtual nest |
| attributes instead of actual nesting, which cannot be divided between |
| different SKBs. |
| |
| User Interface |
| ============== |
| |
| The user can access and change each reporter's parameters and invoke |
| driver-specific callbacks via ``devlink``, e.g. per error type (per health reporter): |
| |
| * Configure reporter's generic parameters (e.g. enable/disable auto recovery) |
| * Invoke recovery procedure |
| * Run diagnostics |
| * Object dump |
| |
| .. list-table:: List of devlink health interfaces |
| :widths: 10 90 |
| |
| * - Name |
| - Description |
| * - ``DEVLINK_CMD_HEALTH_REPORTER_GET`` |
| - Retrieves status and configuration info per DEV and reporter. |
| * - ``DEVLINK_CMD_HEALTH_REPORTER_SET`` |
| - Allows reporter-related configuration setting. |
| * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER`` |
| - Triggers reporter's recovery procedure. |
| * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST`` |
| - Triggers a fake health event on the reporter. The effects of the test |
| event in terms of recovery flow should follow closely that of a real |
| event. |
| * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE`` |
| - Retrieves current device state related to the reporter. |
| * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET`` |
| - Retrieves the last stored dump. Devlink health |
| saves a single dump. If a dump is not already stored by devlink |
| for this reporter, devlink generates a new dump. |
| Dump output is defined by the reporter. |
| * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` |
| - Clears the last saved dump file for the specified reporter. |
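
The single-dump behaviour described for ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
and ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` can be illustrated with a small
userspace model. This is a sketch under stated assumptions, not kernel code:
``dump_slot_model`` and the three helpers are hypothetical names, and a dump is
represented only by a numeric identifier.

```c
#include <stdbool.h>

/* Hypothetical model of a reporter's single dump slot. */
struct dump_slot_model {
	bool stored;          /* true while a dump is saved */
	unsigned int id;      /* identifier of the saved dump */
	unsigned int next_id; /* identifier for the next generated dump */
};

/* Error path: take a dump only if auto-dump is set and none is stored. */
static void on_error(struct dump_slot_model *s, bool auto_dump)
{
	if (auto_dump && !s->stored) {
		s->id = s->next_id++;
		s->stored = true;
	}
}

/* DUMP_GET: return the stored dump, generating a new one if none exists. */
static unsigned int dump_get(struct dump_slot_model *s)
{
	if (!s->stored) {
		s->id = s->next_id++;
		s->stored = true;
	}
	return s->id;
}

/* DUMP_CLEAR: drop the saved dump so that a later one can be stored. */
static void dump_clear(struct dump_slot_model *s)
{
	s->stored = false;
}
```

In this model, repeated errors keep the first dump until it is cleared, and a
get request after a clear produces a fresh dump.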
| |
| The following diagram provides a general overview of ``devlink-health``:: |
| |
| netlink |
| +--------------------------+ |
| | | |
| | + | |
| | | | |
| +--------------------------+ |
| |request for ops |
| |(diagnose, |
| driver devlink |recover, |
| |dump) |
| +--------+ +--------------------------+ |
| | | | reporter| | |
| | | | +---------v----------+ | |
| | | ops execution | | | | |
| | <----------------------------------+ | | |
| | | | | | | |
| | | | + ^------------------+ | |
| | | | | request for ops | |
| | | | | (recover, dump) | |
| | | | | | |
| | | | +-+------------------+ | |
| | | health report | | health handler | | |
| | +-------------------------------> | | |
| | | | +--------------------+ | |
| | | health reporter create | | |
| | +----------------------------> | |
| +--------+ +--------------------------+ |