.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism is targeted for Real Time Alerting, in
order to know when something bad happened to a PCI device.

  * Provide alert debug information.
  * Self healing.
  * If problem needs vendor support, provide a way to gather all needed
    debugging information.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
Device driver creates a "health reporter" per each error/health type.
Error/Health type can be a known/generic (e.g. PCI error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter a driver can issue error/health reports
asynchronously. All health reports handling is done by ``devlink``.
Device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * Out Of Box initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.

Actions
=======

Once an error is reported, devlink health will perform the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    auto-dump is set and no other dump is already stored)
  * An auto recovery attempt is made. This depends on:

    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery

Devlink formatted message
=========================

To handle devlink health diagnose and health dump requests, devlink creates a
formatted message structure ``devlink_fmsg`` and sends it to the driver's
callback, which fills in the data using the devlink fmsg API.

Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
json-like format. The API allows the driver to add nested attributes such as
object, object pair and value array, in addition to attributes such as name and
value.

The driver should use this API to fill the fmsg context in a format which
devlink later translates to the netlink message. When devlink needs to send
the data to the netlink layer using SKBs, it fragments the data between
different SKBs. To make this fragmentation possible, it uses virtual nest
attributes, avoiding actual nesting, which cannot be divided between
different SKBs.

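
The nesting rules described above (object, named pair, value array) can be
illustrated with a small userspace sketch. This is a toy model only, not the
kernel ``devlink_fmsg`` API: every ``toy_*`` name, the buffer size and the
sample field names are assumptions made for illustration. It loosely mirrors
the shape of kernel helpers such as ``devlink_fmsg_obj_nest_start()`` and
``devlink_fmsg_u32_pair_put()``, whose actual signatures should be taken from
``include/net/devlink.h``.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy userspace model of a JSON-like formatted message.
 * NOT the kernel API: all toy_* names are hypothetical.
 */
#define TOY_BUF_LEN	256
#define TOY_MAX_DEPTH	8

struct toy_fmsg {
	char buf[TOY_BUF_LEN];
	size_t len;
	int need_comma[TOY_MAX_DEPTH];	/* per nest: separator needed? */
	int depth;
};

/* Append a string to the message buffer. */
static void toy_raw(struct toy_fmsg *m, const char *s)
{
	size_t n = strlen(s);

	assert(m->len + n < TOY_BUF_LEN);
	memcpy(m->buf + m->len, s, n + 1);
	m->len += n;
}

/* Emit "," if the current nest already holds an item. */
static void toy_sep(struct toy_fmsg *m)
{
	if (m->need_comma[m->depth])
		toy_raw(m, ",");
	m->need_comma[m->depth] = 1;
}

/* Enter a new nest level ("{" or "["). */
static void toy_open(struct toy_fmsg *m, const char *bracket)
{
	toy_raw(m, bracket);
	assert(m->depth + 1 < TOY_MAX_DEPTH);
	m->depth++;
	m->need_comma[m->depth] = 0;
}

/* Leave the current nest level ("}" or "]"). */
static void toy_close(struct toy_fmsg *m, const char *bracket)
{
	assert(m->depth > 0);
	m->depth--;
	toy_raw(m, bracket);
}

/* Emit the name half of a name/value pair. */
static void toy_name(struct toy_fmsg *m, const char *name)
{
	toy_sep(m);
	toy_raw(m, "\"");
	toy_raw(m, name);
	toy_raw(m, "\":");
}

/* Emit a bare unsigned value (used inside arrays). */
static void toy_u32(struct toy_fmsg *m, unsigned int v)
{
	char num[16];

	toy_sep(m);
	snprintf(num, sizeof(num), "%u", v);
	toy_raw(m, num);
}

/* Emit a complete name/value pair. */
static void toy_u32_pair(struct toy_fmsg *m, const char *name, unsigned int v)
{
	char num[16];

	toy_name(m, name);
	snprintf(num, sizeof(num), "%u", v);
	toy_raw(m, num);
}

/* Hypothetical dump callback body: one object holding a pair and an array. */
void toy_dump_rx_queue(struct toy_fmsg *m)
{
	memset(m, 0, sizeof(*m));
	toy_sep(m);
	toy_open(m, "{");			/* object nest start */
	toy_u32_pair(m, "queue", 3);		/* name + value */
	toy_name(m, "cqe_errors");		/* array pair nest start */
	toy_open(m, "[");
	toy_u32(m, 17);
	toy_u32(m, 42);
	toy_close(m, "]");			/* array pair nest end */
	toy_close(m, "}");			/* object nest end */
}
```

Calling ``toy_dump_rx_queue()`` on a ``struct toy_fmsg`` leaves a JSON-like
string in ``buf``. The sketch keeps everything in one flat buffer; it does not
model the SKB fragmentation or the virtual nest attributes described above.
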

User Interface
==============

User can access/change each reporter's parameters and driver specific callbacks
via ``devlink``, e.g. per error type (per health reporter):

  * Configure reporter's generic parameters (like: disable/enable auto recovery)
  * Invoke recovery procedure
  * Run diagnostics
  * Object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should follow closely that of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.

The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+
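
The "health handler" path in the diagram, together with the list in the
Actions section, can be sketched as a small decision function. This is a
simplified userspace model under stated assumptions, not the kernel
implementation: the ``toy_*`` names, the millisecond timestamps and the rule
that a recovery is skipped while the grace period since the last recovery has
not yet elapsed are illustrative choices, and the order of the steps follows
the Actions list rather than the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-reporter state; field names are illustrative only. */
struct toy_reporter {
	bool auto_recover;		/* auto-recovery configuration */
	bool auto_dump;			/* auto-dump configuration */
	bool dump_stored;		/* a dump is already saved */
	uint64_t grace_period_ms;	/* minimum gap between recoveries */
	uint64_t last_recovery_ms;	/* time of the last recovery */
	unsigned int error_count;	/* statistics */
	unsigned int recovery_count;
};

/* What the handler decided to ask the driver to do. */
struct toy_actions {
	bool take_dump;
	bool try_recover;
};

/* Model of handling one health report at time now_ms. */
struct toy_actions toy_health_report(struct toy_reporter *r, uint64_t now_ms)
{
	struct toy_actions a = { false, false };

	/* Status and statistics are updated on every report. */
	r->error_count++;

	/* Only one dump is kept; take a new one only if none is stored. */
	if (r->auto_dump && !r->dump_stored) {
		a.take_dump = true;
		r->dump_stored = true;
	}

	/* Auto recovery depends on configuration and the grace period. */
	if (r->auto_recover) {
		bool in_grace = r->recovery_count > 0 &&
				now_ms - r->last_recovery_ms < r->grace_period_ms;

		if (!in_grace) {
			a.try_recover = true;
			r->recovery_count++;
			r->last_recovery_ms = now_ms;
		}
	}

	return a;
}
```

With a 500 ms grace period, a report at 1000 ms triggers both a dump and a
recovery in this model, a second report at 1200 ms triggers neither, and a
third at 1600 ms triggers a recovery only, because the stored dump has not
been cleared.
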