.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism is targeted at real-time alerting, so that
the user knows when something bad has happened to a PCI device. It aims to:

  * Provide alert debug information.
  * Enable self healing.
  * Provide a way to gather all needed debugging information when a problem
    needs vendor support.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All health report handling is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * Out Of Box initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.
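
The relationship between a reporter and its optional callbacks can be
sketched in plain user-space C. This is only an illustrative model: every
``toy_*`` name below is hypothetical and is not the kernel's devlink API, and
the choice of ``-EOPNOTSUPP`` for a missing callback is an assumption of the
sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of per-reporter callbacks; not the kernel structures. */
struct toy_reporter_ops {
	const char *name;              /* error/health type, e.g. "tx" */
	int (*recover)(void *priv);    /* optional recovery procedure */
	int (*diagnose)(void *priv);   /* optional diagnostics procedure */
	int (*dump)(void *priv);       /* optional object dump procedure */
};

struct toy_reporter {
	const struct toy_reporter_ops *ops;
	void *priv;                    /* driver context passed to callbacks */
};

/* Run the recovery callback, or report that the reporter has none. */
int toy_reporter_recover(struct toy_reporter *r)
{
	if (!r->ops->recover)
		return -EOPNOTSUPP;
	return r->ops->recover(r->priv);
}

/* Run the diagnostics callback, or report that the reporter has none. */
int toy_reporter_diagnose(struct toy_reporter *r)
{
	if (!r->ops->diagnose)
		return -EOPNOTSUPP;
	return r->ops->diagnose(r->priv);
}

/* Example driver part: a TX reporter that only knows how to recover. */
int toy_tx_recover(void *priv)
{
	int *resets = priv;

	(*resets)++;                   /* stand-in for resetting a TX queue */
	return 0;
}

const struct toy_reporter_ops toy_tx_ops = {
	.name = "tx",
	.recover = toy_tx_recover,
};
```

Another part of the same driver could define a second ops table (for example
one that only fills in ``dump``), which is how different reporters end up with
different handlers.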

Actions
=======

Once an error is reported, devlink health will perform the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    auto-dump is set and no other dump is already stored)
  * An auto recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time passed since last recovery
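
The decision flow listed above can be modelled in a few lines of user-space
C. This is a simplified sketch, not the kernel implementation: the ``model_*``
names, the millisecond clock and the exact ordering of the checks are
assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; these names are not the kernel API. */
enum reporter_state { STATE_HEALTHY, STATE_ERROR };

enum report_result {
	REPORT_NO_AUTO_RECOVER,    /* auto recovery is disabled */
	REPORT_RECOVERY_SKIPPED,   /* still inside the grace period */
	REPORT_RECOVERED,          /* recovery attempt was made */
};

struct reporter_model {
	enum reporter_state state;
	bool auto_recover;
	bool auto_dump;
	bool dump_stored;
	uint64_t grace_period_ms;
	uint64_t last_recovery_ms;   /* valid only if recovery_count > 0 */
	unsigned int error_count;
	unsigned int recovery_count;
};

enum report_result model_report(struct reporter_model *r, uint64_t now_ms)
{
	/* Update health status and statistics (the trace log is omitted). */
	r->error_count++;
	r->state = STATE_ERROR;

	/* Keep a single dump: take one only if none is stored yet. */
	if (r->auto_dump && !r->dump_stored)
		r->dump_stored = true;

	/* Auto recovery depends on the reporter configuration... */
	if (!r->auto_recover)
		return REPORT_NO_AUTO_RECOVER;

	/* ...and on the grace period vs. time since the last recovery. */
	if (r->recovery_count > 0 &&
	    now_ms - r->last_recovery_ms < r->grace_period_ms)
		return REPORT_RECOVERY_SKIPPED;

	r->recovery_count++;
	r->last_recovery_ms = now_ms;
	r->state = STATE_HEALTHY;
	return REPORT_RECOVERED;
}
```

With a 500 ms grace period, a report at t=1000 ms is recovered, a second
report at t=1200 ms is left in the error state, and a third report at
t=2000 ms is recovered again.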

Devlink formatted message
=========================

To handle devlink health diagnose and health dump requests, devlink creates a
formatted message structure ``devlink_fmsg`` and sends it to the driver's callback
to fill the data in using the devlink fmsg API.

Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
json-like format. The API allows the driver to add nested attributes such as
object, object pair and value array, in addition to attributes such as name and
value.

The driver should use this API to fill the fmsg context in a format which will be
translated by devlink to the netlink message later. When devlink needs to send
the data using SKBs to the netlink layer, it fragments the data between
different SKBs. In order to do this fragmentation, it uses virtual nest
attributes, to avoid actual nesting, which cannot be divided between
different SKBs.
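
To make the json-like nesting concrete, here is a small user-space toy that
builds such a message as text. It is only a sketch of the idea: the ``toy_*``
helpers are hypothetical, they are not the devlink fmsg API, and they do not
model the netlink translation or the SKB fragmentation described above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical toy builder: it only mimics the nesting idea of devlink_fmsg. */
struct toy_fmsg {
	char buf[256];   /* rendered json-like text, always NUL-terminated */
	size_t len;      /* number of characters currently in buf */
};

/* Append raw text; input that would overflow the buffer is dropped. */
void toy_put(struct toy_fmsg *m, const char *s)
{
	size_t n = strlen(s);

	if (m->len + n < sizeof(m->buf)) {
		memcpy(m->buf + m->len, s, n + 1);
		m->len += n;
	}
}

/* Emit a comma unless the next item is the first one inside its nest. */
void toy_sep(struct toy_fmsg *m)
{
	char last = m->len ? m->buf[m->len - 1] : '{';

	if (last != '{' && last != '[')
		toy_put(m, ",");
}

void toy_obj_start(struct toy_fmsg *m) { toy_sep(m); toy_put(m, "{"); }
void toy_obj_end(struct toy_fmsg *m)   { toy_put(m, "}"); }
void toy_arr_end(struct toy_fmsg *m)   { toy_put(m, "]"); }

/* Open a named value array: "name":[ */
void toy_arr_pair_start(struct toy_fmsg *m, const char *name)
{
	char tmp[64];

	toy_sep(m);
	snprintf(tmp, sizeof(tmp), "\"%s\":[", name);
	toy_put(m, tmp);
}

/* Add a name/value pair: "name":value */
void toy_u32_pair(struct toy_fmsg *m, const char *name, unsigned int val)
{
	char tmp[64];

	toy_sep(m);
	snprintf(tmp, sizeof(tmp), "\"%s\":%u", name, val);
	toy_put(m, tmp);
}

/* Add a bare value, e.g. one element of a value array. */
void toy_u32(struct toy_fmsg *m, unsigned int val)
{
	char tmp[16];

	toy_sep(m);
	snprintf(tmp, sizeof(tmp), "%u", val);
	toy_put(m, tmp);
}
```

Starting from a zero-initialized ``struct toy_fmsg``, the call sequence
``toy_obj_start``, ``toy_u32_pair(m, "sqn", 3)``,
``toy_arr_pair_start(m, "regs")``, ``toy_u32(m, 1)``, ``toy_u32(m, 2)``,
``toy_arr_end``, ``toy_obj_end`` leaves ``{"sqn":3,"regs":[1,2]}`` in ``buf``.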

User Interface
==============

The user can access/change each reporter's parameters and driver-specific callbacks
via ``devlink``, e.g. per error type (per health reporter):

  * Configure the reporter's generic parameters (e.g. disable/enable auto recovery)
  * Invoke recovery procedure
  * Run diagnostics
  * Object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should follow closely that of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.

The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                   devlink                 |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+