Documentation/networking/devlink/devlink-dpipe.rst - linux - Git at Google

 .. SPDX-License-Identifier: GPL-2.0

 =============
 Devlink DPIPE
 =============

 Background
 ==========

 While performing the hardware offloading process, much of the hardware
 specifics cannot be presented. These details are useful for debugging, and
 ``devlink-dpipe`` provides a standardized way to provide visibility into the
 offloading process.

 For example, the routing longest prefix match (LPM) algorithm used by the
 Linux kernel may differ from the hardware implementation. The pipeline debug
 API (DPIPE) is aimed at providing the user visibility into the ASIC's
 pipeline in a generic way.

 The hardware offload process is expected to be done in a way that the user
 should not be able to distinguish between the hardware vs. software
 implementation. In this process, hardware specifics are neglected. In
 reality those details can have lots of meaning and should be exposed in some
 standard way.

 This problem is made even more complex when one wishes to offload the
 control path of the whole networking stack to a switch ASIC. Due to
 differences in the hardware and software models some processes cannot be
 represented correctly.

 One example is the kernel's LPM algorithm which in many cases differs
 greatly to the hardware implementation. The configuration API is the same,
 but one cannot rely on the Forward Information Base (FIB) to look like the
 Level Path Compression trie (LPC-trie) in hardware.

 In many situations trying to analyze systems failure solely based on the
 kernel's dump may not be enough. By combining this data with complementary
 information about the underlying hardware, this debugging can be made
 easier; additionally, the information can be useful when debugging
 performance issues.

 Overview
 ========

 The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
 modeled as a graph of match/action tables. Each table represents a specific
 hardware block. This model is not new, first being used by the P4 language.

 Traditionally it has been used as an alternative model for hardware
 configuration, but the ``devlink-dpipe`` interface uses it for visibility
 purposes as a standard complementary tool. The system's view from
 ``devlink-dpipe`` should change according to the changes done by the
 standard configuration tools.

 For example, it’s quite common to  implement Access Control Lists (ACL)
 using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
 divided into TCAM regions. Complex TC filters can have multiple rules with
 different priorities and different lookup keys. On the other hand hardware
 TCAM regions have a predefined lookup key. Offloading the TC filter rules
 using TCAM engine can result in multiple TCAM regions being interconnected
 in a chain (which may affect the data path latency). In response to a new TC
 filter new tables should be created describing those regions.

 Model
 =====

 The ``DPIPE`` model introduces several objects:

   * headers
   * tables
   * entries

 A ``header`` describes packet formats and provides names for fields within
 the packet. A ``table`` describes hardware blocks. An ``entry`` describes
 the actual content of a specific table.

 The hardware pipeline is not port specific, but rather describes the whole
 ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

 Drivers can register and unregister tables at run time, in order to support
 dynamic behavior. This dynamic behavior is mandatory for describing hardware
 blocks like TCAM regions which can be allocated and freed dynamically.

 ``devlink-dpipe`` generally is not intended for configuration. The exception
 is hardware counting for a specific table.

 The following commands are used to obtain the ``dpipe`` objects from
 userspace:

   * ``table_get``: Receive a table's description.
   * ``headers_get``: Receive a device's supported headers.
   * ``entries_get``: Receive a table's current entries.
   * ``counters_set``: Enable or disable counters on a table.

 Table
 -----

 The driver should implement the following operations for each table:

   * ``matches_dump``: Dump the supported matches.
   * ``actions_dump``: Dump the supported actions.
   * ``entries_dump``: Dump the actual content of the table.
   * ``counters_set_update``: Synchronize hardware with counters enabled or
     disabled.

 Header/Field
 ------------

 In a similar way to P4 headers and fields are used to describe a table's
 behavior. There is a slight difference between the standard protocol headers
 and specific ASIC metadata. The protocol headers should be declared in the
 ``devlink`` core API. On the other hand ASIC meta data is driver specific
 and should be defined in the driver. Additionally, each driver-specific
 devlink documentation file should document the driver-specific ``dpipe``
 headers it implements. The headers and fields are identified by enumeration.

 In order to provide further visibility some ASIC metadata fields could be
 mapped to kernel objects. For example, internal router interface indexes can
 be directly mapped to the net device ifindex. FIB table indexes used by
 different Virtual Routing and Forwarding (VRF) tables can be mapped to
 internal routing table indexes.

 Match
 -----

 Matches are kept primitive and close to hardware operation. Match types like
 LPM are not supported due to the fact that this is exactly a process we wish
 to describe in full detail. Example of matches:

   * ``field_exact``: Exact match on a specific field.
   * ``field_exact_mask``: Exact match on a specific field after masking.
   * ``field_range``: Match on a specific range.

 The id's of the header and the field should be specified in order to
 identify the specific field. Furthermore, the header index should be
 specified in order to distinguish multiple headers of the same type in a
 packet (tunneling).

 Action
 ------

 Similar to match, the actions are kept primitive and close to hardware
 operation. For example:

   * ``field_modify``: Modify the field value.
   * ``field_inc``: Increment the field value.
   * ``push_header``: Add a header.
   * ``pop_header``: Remove a header.

 Entry
 -----

 Entries of a specific table can be dumped on demand. Each eentry is
 identified with an index and its properties are described by a list of
 match/action values and specific counter. By dumping the tables content the
 interactions between tables can be resolved.

 Abstraction Example
 ===================

 The following is an example of the abstraction model of the L3 part of
 Mellanox Spectrum ASIC. The blocks are described in the order they appear in
 the pipeline. The table sizes in the following examples are not real
 hardware sizes and are provided for demonstration purposes.

 LPM
 ---

 The LPM algorithm can be implemented as a list of hash tables. Each hash
 table contains routes with the same prefix length. The root of the list is
 /32, and in case of a miss the hardware will continue to the next hash
 table. The depth of the search will affect the data path latency.

 In case of a hit the entry contains information about the next stage of the
 pipeline which resolves the MAC address. The next stage can be either local
 host table for directly connected routes, or adjacency table for next-hops.
 The ``meta.lpm_prefix`` field is used to connect two LPM tables.

 .. code::

     table lpm_prefix_16 {
       size: 4096,
       counters_enabled: true,
       match: { meta.vr_id: exact,
                ipv4.dst_addr: exact_mask,
                ipv6.dst_addr: exact_mask,
                meta.lpm_prefix: exact },
       action: { meta.adj_index: set,
                 meta.adj_group_size: set,
                 meta.rif_port: set,
                 meta.lpm_prefix: set },
     }

 Local Host
 ----------

 In the case of local routes the LPM lookup already resolves the egress
 router interface (RIF), yet the exact MAC address is not known. The local
 host table is a hash table combining the output interface id with
 destination IP address as a key. The result is the MAC address.

 .. code::

     table local_host {
       size: 4096,
       counters_enabled: true,
       match: { meta.rif_port: exact,
                ipv4.dst_addr: exact},
       action: { ethernet.daddr: set }
     }

 Adjacency
 ---------

 In case of remote routes this table does the ECMP. The LPM lookup results in
 ECMP group size and index that serves as a global offset into this table.
 Concurrently a hash of the packet is generated. Based on the ECMP group size
 and the packet's hash a local offset is generated. Multiple LPM entries can
 point to the same adjacency group.

 .. code::

     table adjacency {
       size: 4096,
       counters_enabled: true,
       match: { meta.adj_index: exact,
                meta.adj_group_size: exact,
                meta.packet_hash_index: exact },
       action: { ethernet.daddr: set,
                 meta.erif: set }
     }

 ERIF
 ----

 In case the egress RIF and destination MAC have been resolved by previous
 tables this table does multiple operations like TTL decrease and MTU check.
 Then the decision of forward/drop is taken and the port L3 statistics are
 updated based on the packet's type (broadcast, unicast, multicast).

 .. code::

     table erif {
       size: 800,
       counters_enabled: true,
       match: { meta.rif_port: exact,
                meta.is_l3_unicast: exact,
                meta.is_l3_broadcast: exact,
                meta.is_l3_multicast, exact },
       action: { meta.l3_drop: set,
                 meta.l3_forward: set }
     }
	.. SPDX-License-Identifier: GPL-2.0

	=============
	Devlink DPIPE
	=============

	Background
	==========

	While performing the hardware offloading process, much of the hardware
	specifics cannot be presented. These details are useful for debugging, and
	``devlink-dpipe`` provides a standardized way to provide visibility into the
	offloading process.

	For example, the routing longest prefix match (LPM) algorithm used by the
	Linux kernel may differ from the hardware implementation. The pipeline debug
	API (DPIPE) is aimed at providing the user visibility into the ASIC's
	pipeline in a generic way.

	The hardware offload process is expected to be done in a way that the user
	should not be able to distinguish between the hardware vs. software
	implementation. In this process, hardware specifics are neglected. In
	reality those details can have lots of meaning and should be exposed in some
	standard way.

	This problem is made even more complex when one wishes to offload the
	control path of the whole networking stack to a switch ASIC. Due to
	differences in the hardware and software models some processes cannot be
	represented correctly.

	One example is the kernel's LPM algorithm which in many cases differs
	greatly to the hardware implementation. The configuration API is the same,
	but one cannot rely on the Forward Information Base (FIB) to look like the
	Level Path Compression trie (LPC-trie) in hardware.

	In many situations trying to analyze systems failure solely based on the
	kernel's dump may not be enough. By combining this data with complementary
	information about the underlying hardware, this debugging can be made
	easier; additionally, the information can be useful when debugging
	performance issues.

	Overview
	========

	The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
	modeled as a graph of match/action tables. Each table represents a specific
	hardware block. This model is not new, first being used by the P4 language.

	Traditionally it has been used as an alternative model for hardware
	configuration, but the ``devlink-dpipe`` interface uses it for visibility
	purposes as a standard complementary tool. The system's view from
	``devlink-dpipe`` should change according to the changes done by the
	standard configuration tools.

	For example, it’s quite common to implement Access Control Lists (ACL)
	using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
	divided into TCAM regions. Complex TC filters can have multiple rules with
	different priorities and different lookup keys. On the other hand hardware
	TCAM regions have a predefined lookup key. Offloading the TC filter rules
	using TCAM engine can result in multiple TCAM regions being interconnected
	in a chain (which may affect the data path latency). In response to a new TC
	filter new tables should be created describing those regions.

	Model
	=====

	The ``DPIPE`` model introduces several objects:

	* headers
	* tables
	* entries

	A ``header`` describes packet formats and provides names for fields within
	the packet. A ``table`` describes hardware blocks. An ``entry`` describes
	the actual content of a specific table.

	The hardware pipeline is not port specific, but rather describes the whole
	ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

	Drivers can register and unregister tables at run time, in order to support
	dynamic behavior. This dynamic behavior is mandatory for describing hardware
	blocks like TCAM regions which can be allocated and freed dynamically.

	``devlink-dpipe`` generally is not intended for configuration. The exception
	is hardware counting for a specific table.

	The following commands are used to obtain the ``dpipe`` objects from
	userspace:

	* ``table_get``: Receive a table's description.
	* ``headers_get``: Receive a device's supported headers.
	* ``entries_get``: Receive a table's current entries.
	* ``counters_set``: Enable or disable counters on a table.

	Table
	-----

	The driver should implement the following operations for each table:

	* ``matches_dump``: Dump the supported matches.
	* ``actions_dump``: Dump the supported actions.
	* ``entries_dump``: Dump the actual content of the table.
	* ``counters_set_update``: Synchronize hardware with counters enabled or
	disabled.

	Header/Field
	------------

	In a similar way to P4 headers and fields are used to describe a table's
	behavior. There is a slight difference between the standard protocol headers
	and specific ASIC metadata. The protocol headers should be declared in the
	``devlink`` core API. On the other hand ASIC meta data is driver specific
	and should be defined in the driver. Additionally, each driver-specific
	devlink documentation file should document the driver-specific ``dpipe``
	headers it implements. The headers and fields are identified by enumeration.

	In order to provide further visibility some ASIC metadata fields could be
	mapped to kernel objects. For example, internal router interface indexes can
	be directly mapped to the net device ifindex. FIB table indexes used by
	different Virtual Routing and Forwarding (VRF) tables can be mapped to
	internal routing table indexes.

	Match
	-----

	Matches are kept primitive and close to hardware operation. Match types like
	LPM are not supported due to the fact that this is exactly a process we wish
	to describe in full detail. Example of matches:

	* ``field_exact``: Exact match on a specific field.
	* ``field_exact_mask``: Exact match on a specific field after masking.
	* ``field_range``: Match on a specific range.

	The id's of the header and the field should be specified in order to
	identify the specific field. Furthermore, the header index should be
	specified in order to distinguish multiple headers of the same type in a
	packet (tunneling).

	Action
	------

	Similar to match, the actions are kept primitive and close to hardware
	operation. For example:

	* ``field_modify``: Modify the field value.
	* ``field_inc``: Increment the field value.
	* ``push_header``: Add a header.
	* ``pop_header``: Remove a header.

	Entry
	-----

	Entries of a specific table can be dumped on demand. Each eentry is
	identified with an index and its properties are described by a list of
	match/action values and specific counter. By dumping the tables content the
	interactions between tables can be resolved.

	Abstraction Example
	===================

	The following is an example of the abstraction model of the L3 part of
	Mellanox Spectrum ASIC. The blocks are described in the order they appear in
	the pipeline. The table sizes in the following examples are not real
	hardware sizes and are provided for demonstration purposes.

	LPM
	---

	The LPM algorithm can be implemented as a list of hash tables. Each hash
	table contains routes with the same prefix length. The root of the list is
	/32, and in case of a miss the hardware will continue to the next hash
	table. The depth of the search will affect the data path latency.

	In case of a hit the entry contains information about the next stage of the
	pipeline which resolves the MAC address. The next stage can be either local
	host table for directly connected routes, or adjacency table for next-hops.
	The ``meta.lpm_prefix`` field is used to connect two LPM tables.

	.. code::

	table lpm_prefix_16 {
	size: 4096,
	counters_enabled: true,
	match: { meta.vr_id: exact,
	ipv4.dst_addr: exact_mask,
	ipv6.dst_addr: exact_mask,
	meta.lpm_prefix: exact },
	action: { meta.adj_index: set,
	meta.adj_group_size: set,
	meta.rif_port: set,
	meta.lpm_prefix: set },
	}

	Local Host
	----------

	In the case of local routes the LPM lookup already resolves the egress
	router interface (RIF), yet the exact MAC address is not known. The local
	host table is a hash table combining the output interface id with
	destination IP address as a key. The result is the MAC address.

	.. code::

	table local_host {
	size: 4096,
	counters_enabled: true,
	match: { meta.rif_port: exact,
	ipv4.dst_addr: exact},
	action: { ethernet.daddr: set }
	}

	Adjacency
	---------

	In case of remote routes this table does the ECMP. The LPM lookup results in
	ECMP group size and index that serves as a global offset into this table.
	Concurrently a hash of the packet is generated. Based on the ECMP group size
	and the packet's hash a local offset is generated. Multiple LPM entries can
	point to the same adjacency group.

	.. code::

	table adjacency {
	size: 4096,
	counters_enabled: true,
	match: { meta.adj_index: exact,
	meta.adj_group_size: exact,
	meta.packet_hash_index: exact },
	action: { ethernet.daddr: set,
	meta.erif: set }
	}

	ERIF
	----

	In case the egress RIF and destination MAC have been resolved by previous
	tables this table does multiple operations like TTL decrease and MTU check.
	Then the decision of forward/drop is taken and the port L3 statistics are
	updated based on the packet's type (broadcast, unicast, multicast).

	.. code::

	table erif {
	size: 800,
	counters_enabled: true,
	match: { meta.rif_port: exact,
	meta.is_l3_unicast: exact,
	meta.is_l3_broadcast: exact,
	meta.is_l3_multicast, exact },
	action: { meta.l3_drop: set,
	meta.l3_forward: set }
	}