.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism is targeted at real-time alerting, so that
the user knows when something bad has happened to a PCI device. It aims to:

* Provide alert debug information.
* Enable self-healing.
* Provide a way to gather all needed debugging information when a problem
  needs vendor support.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, firmware error,
rx/tx error) or unknown (driver-specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

* Recovery procedures
* Diagnostics procedures
* Object dump procedures
* OOB initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.
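
The relationship between reporters and their optional callbacks can be
pictured with a small userspace model. This is only a sketch: the names
``reporter_ops``, ``health_registry``, ``registry_add`` and ``registry_find``
are hypothetical and are not the kernel's ``devlink`` API.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical userspace model; NOT the kernel's devlink types. */
    struct reporter_ops {
        const char *name;             /* error/health type, e.g. "tx" or "fw" */
        int (*recover)(void *priv);   /* optional recovery procedure */
        int (*diagnose)(void *priv);  /* optional diagnostics procedure */
        int (*dump)(void *priv);      /* optional object dump procedure */
    };

    #define MAX_REPORTERS 8

    struct health_registry {
        const struct reporter_ops *ops[MAX_REPORTERS];
        size_t count;
    };

    /* Register one reporter; returns 0 on success, -1 if the table is full. */
    static int registry_add(struct health_registry *reg,
                            const struct reporter_ops *ops)
    {
        assert(reg != NULL && ops != NULL && ops->name != NULL);
        if (reg->count == MAX_REPORTERS)
            return -1;
        reg->ops[reg->count++] = ops;
        return 0;
    }

    /* Find a reporter by name; returns NULL if none is registered. */
    static const struct reporter_ops *
    registry_find(const struct health_registry *reg, const char *name)
    {
        assert(reg != NULL && name != NULL);
        for (size_t i = 0; i < reg->count; i++)
            if (strcmp(reg->ops[i]->name, name) == 0)
                return reg->ops[i];
        return NULL;
    }

    /* Two parts of a driver register different reporters and handlers. */
    static int tx_recover(void *priv) { (void)priv; return 0; }

    static const struct reporter_ops tx_ops = {
        .name = "tx",
        .recover = tx_recover,
    };
    static const struct reporter_ops fw_ops = {
        .name = "fw",   /* no callbacks provided for this reporter */
    };

In this model a missing callback is simply a ``NULL`` pointer, so a caller
has to check for it before invoking the operation.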

Actions
=======

Once an error is reported, ``devlink`` health performs the following actions:

* A log is sent to the kernel trace events buffer.
* Health status and statistics are updated for the reporter instance.
* An object dump is taken and saved at the reporter instance (as long as
  no other dump is already stored).
* An auto-recovery attempt is made, depending on:

  - The auto-recovery configuration
  - The grace period vs. the time passed since the last recovery
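
The sequence above can be sketched as a userspace model. Everything here is
an assumption made for illustration: the ``reporter_state`` fields, the
``health_report`` function, the millisecond clock, and the exact rule that
auto-recovery is skipped while less than one grace period has passed since
the previous recovery attempt. It is not the kernel implementation.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical model of one reporter instance; not a kernel struct. */
    struct reporter_state {
        bool auto_recover;          /* auto-recovery configuration */
        uint64_t grace_period_ms;   /* minimum gap between auto-recoveries */
        uint64_t last_recovery_ms;  /* time of the last recovery attempt */
        unsigned int error_count;   /* number of reported errors */
        unsigned int recover_count; /* number of recovery attempts */
        bool healthy;
        bool dump_stored;
        int (*recover)(void);       /* driver recovery callback, may be NULL */
    };

    /* Example callback that always succeeds. */
    static int always_ok(void) { return 0; }

    /* Handle one error report; returns true if recovery was attempted. */
    static bool health_report(struct reporter_state *r, uint64_t now_ms)
    {
        assert(r != NULL);

        /* 1. A trace log would be emitted here (omitted in this model). */

        /* 2. Update health status and statistics. */
        r->healthy = false;
        r->error_count++;

        /* 3. Take a dump only if none is stored yet. */
        if (!r->dump_stored)
            r->dump_stored = true;

        /* 4. Auto-recover, subject to configuration and grace period. */
        if (!r->auto_recover || r->recover == NULL)
            return false;
        if (r->recover_count > 0 &&
            now_ms - r->last_recovery_ms < r->grace_period_ms)
            return false;

        r->last_recovery_ms = now_ms;
        r->recover_count++;
        if (r->recover() == 0)
            r->healthy = true;
        return true;
    }

With a 500 ms grace period, a second error reported 200 ms after a recovery
leaves the reporter in the error state in this model, while one reported
600 ms later triggers another recovery attempt.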

User Interface
==============

The user can access and change each reporter's parameters and invoke its
driver-specific callbacks via ``devlink``, e.g. per error type (per health
reporter):

* Configure the reporter's generic parameters (e.g. disable/enable auto recovery)
* Invoke the recovery procedure
* Run diagnostics
* Retrieve an object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per device and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers a reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves diagnostics data from a reporter on a device.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health saves a single dump.
       If a dump is not already stored by devlink for this reporter,
       devlink generates a new dump. The dump output is defined by the
       reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump for the specified reporter.
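
The single-dump behaviour of the dump-get and dump-clear commands can be
illustrated with a minimal userspace model. The names ``dump_reporter``,
``dump_get`` and ``dump_clear`` and the placeholder dump text are hypothetical;
real dump contents are defined by the reporter.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define DUMP_MAX 64

    /* Hypothetical model: one reporter holding at most one stored dump. */
    struct dump_reporter {
        char dump[DUMP_MAX];
        bool dump_stored;
        unsigned int dumps_generated;  /* how many dumps were produced */
    };

    /* Stand-in for the reporter's dump callback. */
    static void generate_dump(struct dump_reporter *r)
    {
        r->dumps_generated++;
        snprintf(r->dump, sizeof(r->dump), "dump #%u", r->dumps_generated);
        r->dump_stored = true;
    }

    /* Return the stored dump, generating a new one only if none exists. */
    static const char *dump_get(struct dump_reporter *r)
    {
        assert(r != NULL);
        if (!r->dump_stored)
            generate_dump(r);
        return r->dump;
    }

    /* Discard the stored dump so that the next get produces a fresh one. */
    static void dump_clear(struct dump_reporter *r)
    {
        assert(r != NULL);
        r->dump_stored = false;
        r->dump[0] = '\0';
    }

Calling ``dump_get`` twice in a row returns the same stored dump; only a
``dump_clear`` in between causes a new dump to be generated.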

The following diagram provides a general overview of ``devlink-health``::

                                                        netlink
                                              +--------------------------+
                                              |                          |
                                              |            +             |
                                              |            |             |
                                              +--------------------------+
                                                           |request for ops
                                                           |(diagnose,
          mlx5_core                             devlink    |recover,
                                                           |dump)
         +--------+                           +--------------------------+
         |        |                           |    reporter|             |
         |        |                           |  +---------v----------+  |
         |        |   ops execution           |  |                    |  |
         |    <----------------------------------+                    |  |
         |        |                           |  |                    |  |
         |        |                           |  + ^------------------+  |
         |        |                           |    | request for ops     |
         |        |                           |    | (recover, dump)     |
         |        |                           |    |                     |
         |        |                           |  +-+------------------+  |
         |        |     health report         |  | health handler     |  |
         |        +------------------------------->                   |  |
         |        |                           |  +--------------------+  |
         |        |    health reporter create |                          |
         |        +---------------------------->                         |
         +--------+                           +--------------------------+