Framework of Fast Fault Detection for IP-based Networks

IP-based distributed systems are widely used, and the network is opaque to application-side systems. When an IP network connected to a distributed system encounters a fault that affects IP connectivity, the application system cannot quickly detect the fault. To detect the fault quickly, the application system needs to shorten its keepalive interval or deploy a detection mechanism, which adds overhead to the application system.

The describes the requirements for these applications. The most typical application scenario is IP-based NVMe. IP-based NVMe is the implementation of NVMe over Fabrics that best fits NVMe semantics, and it is the development trend for high-speed storage networks. IP-based NVMe for high-speed storage places high requirements on IP networks. In an IP-based NVMe network, when a failure that affects an IP connection occurs, for example an access link failure or a switch network failure that route convergence cannot repair, the NVMe connection cannot immediately detect the fault. In current implementations, such a failure can only be detected by keepalive timeout, so the interruption generally lasts more than 10 seconds. To speed up detection, hosts and storage devices can use fast keepalive or BFD. However, this introduces additional load on hosts and storage devices, which makes it difficult to use in large-scale IP-based NVMe deployments.

NoF: NVMe over Fabrics
FC: Fibre Channel
NVMe: Non-Volatile Memory Express
SAN: Storage Area Network

This document describes the framework using IP-based NVMe as a typical application. An IP-based NVMe network mainly includes three types of roles: an initiator (referred to as a host), a switch, and a target (referred to as a storage device). Initiators and targets are also referred to as client endpoints and server endpoints, respectively. Hosts and storage devices use the IP-based NVMe protocol to transmit data over the network to provide high-performance storage services.

This is a dual-plane NVMe over IP network, which applies to large-scale storage device access networks. Storage devices on the dual-homed access network provide NVMe services using two different IP addresses. When an access link (for example, the S1-SW7 link) or a network-side link (for example, the SW7-SW5 link) fails, H1 cannot access the IP address that S1 uses on its connection to SW7, and H1 cannot quickly detect the failure. Only after the keepalive timeout can H1 detect the failure and switch the NVMe connection to the IP address that S1 uses on its connection to SW8.

An IP-based NVMe SAN consists of storage devices, hosts, and switches. The storage device provides services, and the host initiates an NVMe connection to the storage device. That is, the host is the Client Endpoint, and the storage device is the Server Endpoint.

As a service provider, the server endpoint does not need to detect the status of the client. To let the network learn about the server, the server needs to advertise its information to the access switch. To reduce the complexity of the server endpoint, it is suggested to extend LLDP to support registration.

Figure 2: Server Endpoint
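The registration step described above can be sketched as a small table kept by the access switch. This is only an illustration under stated assumptions: the class and field names (`Registration`, `RegistrationTable`, `port`, `role`) and the role strings are hypothetical, and the sketch models the registered information in memory rather than the LLDP extension that would carry it.

```python
import ipaddress
from dataclasses import dataclass, field

# Hypothetical role names; the draft only says endpoints register "IP addresses and roles".
ROLES = {"client", "server"}


@dataclass(frozen=True)
class Registration:
    port: str  # local switch port the endpoint is attached to
    ip: str    # access IP address announced by the endpoint
    role: str  # "client" or "server"


@dataclass
class RegistrationTable:
    entries: dict = field(default_factory=dict)  # port -> Registration

    def register(self, port, ip, role):
        """Record (or overwrite) the endpoint registered on a port."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        # ip_address() raises ValueError for malformed addresses.
        addr = str(ipaddress.ip_address(ip))
        reg = Registration(port, addr, role)
        self.entries[port] = reg
        return reg

    def servers(self):
        """Return the IP addresses of all registered server endpoints."""
        return sorted(r.ip for r in self.entries.values() if r.role == "server")


table = RegistrationTable()
table.register("port-1", "192.0.2.10", "server")     # storage device
table.register("port-2", "198.51.100.20", "client")  # host
print(table.servers())
```

Keying the table by port is a design choice made for this sketch: it lets a later link-down event on a port be mapped directly to the registered address.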

The client needs to quickly obtain the IP reachability status of the server endpoint. To do so, the client sends a subscription request to the access switch. In addition, so that the network knows the location of the client endpoint, the client endpoint needs to register its information with the access switch. When the switch network detects a failure that affects an address the client endpoint has subscribed to, the access switch notifies that client endpoint of the fault state. To reduce the complexity of client endpoints, it is recommended that LLDP be extended to support subscriptions. For notification messages sent by the switch to client endpoints, it is recommended that the L2 extension protocol be used to control the notification scope.

Figure 3: Client Endpoint (the client endpoint sends a Subscribe Msg to the access switch; the access switch sends a Notification Msg to the client endpoint)
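From the client endpoint's point of view, the subscribe and notify exchange above might be handled as in the following sketch. It assumes a dual-homed target with two access IP addresses and a notification that carries the affected address plus a reachability flag; the `HostPaths` class, its method names, and the flag are hypothetical and are not defined by this framework.

```python
from dataclasses import dataclass, field


@dataclass
class HostPaths:
    """Tracks the target IP addresses a host may use, in order of preference."""
    targets: list
    down: set = field(default_factory=set)

    def subscriptions(self):
        # The host subscribes to every target address it may connect to.
        return list(self.targets)

    def active(self):
        # First target not currently reported as failed, or None if all are down.
        for ip in self.targets:
            if ip not in self.down:
                return ip
        return None

    def on_notification(self, ip, reachable):
        """Apply a notification from the access switch and return the path to use."""
        if ip in self.targets:  # ignore addresses the host never subscribed to
            if reachable:
                self.down.discard(ip)
            else:
                self.down.add(ip)
        return self.active()


# H1 reaches S1 through two addresses (documentation prefixes used as examples).
h1 = HostPaths(["192.0.2.7", "198.51.100.8"])
first = h1.active()
after_fault = h1.on_notification("192.0.2.7", reachable=False)
print(first, "->", after_fault)
```

With this structure the host can move the NVMe connection to the second address as soon as the notification arrives, instead of waiting for the keepalive timeout.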

Network devices, such as access switches, can quickly detect failures on local access links. However, the client endpoint that needs to learn of the failure may not be connected to the switch that detects it. Therefore, the detecting switch needs to synchronize the failure information to other switches so that they can notify the endpoints that subscribed to it. On a large-scale network, a reflector can be used to reduce the number of connections needed for information synchronization between switches. To ensure that synchronization messages are delivered reliably to other switches, a reliable transport protocol, such as TCP or QUIC, must be used.
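The saving from a reflector can be illustrated with simple arithmetic. The sketch below assumes a single dedicated reflector that is not one of the n switches, and compares it with a full mesh in which every pair of switches keeps one synchronization session; the function names are illustrative only.

```python
def full_mesh_sessions(n: int) -> int:
    """Sessions needed when every pair of n switches synchronizes directly."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


def reflector_sessions(n: int) -> int:
    """Sessions needed when each of n switches connects to one dedicated reflector."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n


for n in (8, 100):
    print(n, full_mesh_sessions(n), reflector_sessions(n))
```

For 8 switches this gives 28 sessions in a full mesh against 8 with a reflector, and for 100 switches 4950 against 100, so the session count grows linearly rather than quadratically.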

The following IP-based NVMe interaction example walks through the complete deployment process of this framework.

IP-based NVMe uses standard IP technology, and network deployments typically use existing IP technologies. For example, OSPF is usually deployed as the underlay protocol.

Hosts and storage devices are connected to the IP network. As shown in Figure 1, they may access the network in single-homing or dual-homing mode. The administrator assigns access IP addresses to the hosts and storage devices. In most scenarios, the routes to these addresses can be advertised through the underlay protocol. To let IP network devices learn about these access nodes, hosts and storage devices need to register their own network information, such as IP addresses and roles, with the access switches after accessing the network. In addition, the host needs to send a subscription request to the access switch to tell it which storage devices the host is interested in.

Hosts and storage devices are connected to different switches. To let these switches obtain the registration and subscription information of the hosts and storage devices, the information needs to be synchronized between the switches.

Figure 7: Information Advertisement (message sequence: the Subscribe Msg and the Register Msg received by the access switches are propagated between switches as Sync Msg; after Fault Detection, the failure is propagated between switches as Sync Msg and delivered to the subscribing endpoint as a Notification Msg)

After detecting a local failure, the switch calculates the IP addresses affected by the failure. If another access endpoint on the same switch has subscribed to an affected IP address, the switch notifies that access endpoint of the fault. In addition, the switch needs to synchronize the failed IP addresses to other switches on the network. After receiving the failed IP address information, the other switches notify the access endpoints that need the information.

When a link between network devices or a network device itself fails, routes converge on the network. Services may not be restored even after route convergence. For example, if the SW7-SW5 link shown in Figure 1 is faulty, H1 cannot access the IP address used by S1 to access SW7. In this case, after detecting the failure, the network device calculates the IP addresses affected by the failure. Then, the network device notifies the access endpoints that subscribed to those addresses of the failure information. As shown in Figure 1, SW1 calculates that the IP address used by S1 to connect to SW7 is unreachable. Therefore, SW1 notifies H1 of the failure so that H1 can quickly switch to another storage device.
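The access-link case described above can be sketched end to end as follows. This is a simplified model under several assumptions: the `Switch` class and its methods are hypothetical, a direct method call stands in for the reliable synchronization session between switches, the failed address is sent to every peer and filtered there against local subscriptions, and only a local port failure is modelled (not the routing calculation needed for network-side failures).

```python
from collections import defaultdict


class Switch:
    def __init__(self, name):
        self.name = name
        self.peers = []                      # switches this switch synchronizes with
        self.port_ips = {}                   # local port -> registered endpoint IP
        self.subscribers = defaultdict(set)  # target IP -> names of local clients
        self.sent = []                       # (client, failed_ip) notifications issued

    def register(self, port, ip):
        self.port_ips[port] = ip

    def subscribe(self, client, target_ip):
        self.subscribers[target_ip].add(client)

    def on_port_down(self, port):
        """Local fault detection: find the affected IP, notify, then synchronize."""
        failed_ip = self.port_ips.get(port)
        if failed_ip is None:
            return  # no endpoint registered on this port
        self._notify(failed_ip)
        for peer in self.peers:
            peer.on_sync(failed_ip)

    def on_sync(self, failed_ip):
        """Failure learned from another switch: notify local subscribers only."""
        self._notify(failed_ip)

    def _notify(self, failed_ip):
        for client in sorted(self.subscribers.get(failed_ip, ())):
            self.sent.append((client, failed_ip))


# S1 registers with SW7; H1 subscribes to that address at SW1.
sw1, sw7 = Switch("SW1"), Switch("SW7")
sw7.peers.append(sw1)
sw7.register("port-1", "192.0.2.7")
sw1.subscribe("H1", "192.0.2.7")

sw7.on_port_down("port-1")  # the S1-SW7 access link fails
print(sw1.sent)
```

In this run SW7 has no local subscriber for the failed address, so the only notification is the one SW1 issues to H1 after receiving the synchronized failure.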