<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-wang-ffd-framework-00" ipr="trust200902">
  <front>
    <title abbrev="FFD Framework">Framework of Fast Fault Detection for
    IP-based Networks</title>

    <author fullname="Haibo Wang" initials="H." surname="Wang">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code>100095</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>rainsword.wang@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Fengwei Qin" initials="F." surname="Qin">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>qinfengwei@chinamobile.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Lily Zhao" initials="L." surname="Zhao">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 3 Shangdi Information Road</street>

          <city>Beijing</city>

          <region/>

          <code>100085</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>Lily.zhao@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Shuanglong Chen" initials="S." surname="Chen">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code>100095</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>chenshuanglong@huawei.com</email>

        <uri/>
      </address>
    </author>

    <date day="24" month="October" year="2022"/>

    <abstract>
      <t>IP-based distributed systems and application-layer software often
      use heartbeats to maintain network topology status. However, heartbeat
      intervals are typically long, which prolongs fault detection time.
      IP-based storage networks are a typical example. When a fault occurs in
      an IP-based storage network, NVMe connections need to be switched over.
      Currently, no effective method is available for quick detection;
      switchover is triggered only by keepalive timeout, resulting in poor
      performance.</t>

      <t>This document defines a basic framework in which the network assists
      host devices in quickly detecting application connection failures
      caused by network faults.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>IP-based distributed systems are widely used, and the network is
      opaque to the application systems that run on them. When an IP network
      that connects a distributed system encounters a fault that affects IP
      connectivity, the application system cannot quickly detect the fault.
      To detect it quickly, the application system needs to shorten its
      keepalive interval or deploy a dedicated detection mechanism, either of
      which adds overhead to the application system.</t>

      <t><xref target="I-D.guo-ffd-requirement"/> describes the
      requirements for these applications. The most typical application
      scenario is IP-based NVMe.</t>

      <t>IP-based NVMe is an implementation of NVMe over Fabrics that best
      fits NVMe semantics, and it is the development trend for high-speed
      storage networks. IP-based NVMe for high-speed storage places high
      requirements on IP networks. In an IP-based NVMe network, when a
      failure that affects an IP connection occurs (for example, an access
      link failure, or a failure in the switching network that route
      convergence cannot repair), the NVMe connection cannot immediately
      detect the fault. In current implementations, such a failure can only
      be detected by keepalive timeout, so the outage generally lasts more
      than 10 seconds. To speed up detection, hosts and storage devices can
      use fast keepalive or BFD. However, these solutions introduce
      additional load on hosts and storage devices, making them difficult to
      use in large-scale IP-based NVMe deployments.</t>
    </section>

    <section title="Terminology">
      <t>NoF: NVMe over Fabrics</t>

      <t>FC: Fibre Channel</t>

      <t>NVMe: Non-Volatile Memory Express</t>

      <t>SAN: Storage Area Network</t>
    </section>

    <section title="Reference Models">
      <t>This document describes the framework using IP-based NVMe as a
      typical application.</t>

      <t>An IP-based NVMe network mainly includes three roles: an initiator
      (referred to as a host), a switch, and a target (referred to as a
      storage device). Initiators and targets are also referred to as the
      client endpoint and the server endpoint, respectively. Hosts and
      storage devices use the IP-based NVMe protocol to transmit data over
      the network to provide high-performance storage services.</t>

      <t><figure>
          <artwork align="center"><![CDATA[               +--+      +--+      +--+      +--+ 
   Host        |H1|      |H2|      |H3|      |H4| 
(Initiator)    +/-+      +-,+      +.-+      +/-+ 
                |         | '.   ,-`|         |   
                |         |   `',   |         |   
                |         | ,-`  '. |         |   
              +-\--+    +--`-+    +`'--+    +-\--+
              | SW1|    | SW2|    | SW3|    | SW4|
              +--,-+    +---,,    +,.--+    +-.--+
                  `.          `'.,`         .`    
                    `.   _,-'`    ``'.,   .`      
    IP              +--'`+            +`-`-+      
  Network           | SW5|            | SW6|      
                    +--,,+            +,.,-+      
                    .`   `'.,     ,.-``   ',      
                  .`         _,-'`          `.    
              +--`-+    +--'`+    `'---+    +-`'-+
              | SW7|    | SW8|    | SW9|    |SW10|
              +-.,-+    +-..-+    +-.,-+    +-_.-+
                | '.   ,-` |        | `.,   .' |  
                |   `',    |        |    '.`   |  
                | ,-`  '.  |        | ,-`  `', |  
  Storage      +-`+      `'\+      +-`+      +`'+ 
  (Target)     |S1|      |S2|      |S3|      |S4| 
               +--+      +--+      +--+      +--+ 
               Figure 1 : NVMe over IP-based Network
]]></artwork>
        </figure>This is a dual-plane NVMe over IP-based network, which is
      suited to large-scale storage access networks. Storage devices that are
      dual-homed to the access network provide NVMe services using two
      different IP addresses.</t>

      <t>When an access link (for example, the S1-SW7 link) or a network-side
      link (for example, the SW7-SW5 link) fails, H1 can no longer reach the
      IP address that S1 uses on its connection to SW7, and H1 cannot quickly
      detect the failure. Only after the keepalive timeout does H1 detect the
      failure and switch the NVMe connection to the IP address that S1 uses
      on its connection to SW8.</t>
    </section>

    <section title="Functional Components">
      <t>An IP-based NVMe SAN consists of storage devices, hosts, and
      switches. The storage device provides services, and the host initiates
      an NVMe connection to the storage device. That is, the host is the
      Client Endpoint, and the storage device is the Server Endpoint.</t>

      <section title="Server Endpoint (Storage Device)">
        <t>As a service provider, the server endpoint does not need to detect
        the status of the client. So that the network learns about the
        server, the server needs to advertise its information to the access
        switch.</t>

        <t>To reduce the complexity of the server endpoint, it is suggested
        that LLDP be extended to support registration.</t>

        <t><figure>
            <artwork align="center"><![CDATA[ +-----------+           +--------+ 
 | Server EP |           | Switch | 
 | (Storage) |           |        | 
 +----/------+           +----/---+ 
      |                       |     
      |    Register Msg       |     
      |---------------------->|     
      |                       |     
      \                       \     
   Figure 2 : Server Endpoint
]]></artwork>
          </figure></t>
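
        <t>The registration message could, for example, be carried in an
        LLDP organizationally specific TLV (TLV type 127, with a 7-bit type
        and 9-bit length header followed by a 3-octet OUI and a 1-octet
        subtype). The following Python sketch is only an illustration under
        stated assumptions: the OUI value, the subtype, the role codes, and
        the payload layout (one role octet plus an IPv4 address) are
        hypothetical and are not defined by this document.</t>

        <t><figure>
            <artwork><![CDATA[
```python
import ipaddress
import struct

ORG_SPECIFIC_TLV_TYPE = 127               # LLDP organizationally specific TLV
EXAMPLE_OUI = bytes([0xAA, 0xBB, 0xCC])   # placeholder, not an assigned OUI
REGISTER_SUBTYPE = 1                      # hypothetical subtype for registration
ROLE_CLIENT = 1                           # hypothetical role code (host)
ROLE_SERVER = 2                           # hypothetical role code (storage)


def encode_register_tlv(role, ip):
    """Build a registration TLV carrying a role and an IPv4 address."""
    addr = ipaddress.IPv4Address(ip).packed
    info = EXAMPLE_OUI + bytes([REGISTER_SUBTYPE, role]) + addr
    # TLV header: 7-bit type in the high bits, 9-bit length in the low bits.
    header = (ORG_SPECIFIC_TLV_TYPE << 9) | len(info)
    return struct.pack("!H", header) + info


def decode_register_tlv(data):
    """Parse a registration TLV and return (role, ip_string)."""
    (header,) = struct.unpack("!H", data[:2])
    tlv_type = header >> 9
    length = header & 0x1FF
    info = data[2:2 + length]
    if tlv_type != ORG_SPECIFIC_TLV_TYPE or len(info) != 9:
        raise ValueError("not a registration TLV")
    if info[:3] != EXAMPLE_OUI or info[3] != REGISTER_SUBTYPE:
        raise ValueError("unknown OUI or subtype")
    role = info[4]
    ip = str(ipaddress.IPv4Address(info[5:9]))
    return role, ip


tlv = encode_register_tlv(ROLE_SERVER, "192.0.2.10")
role, ip = decode_register_tlv(tlv)
```
]]></artwork>
          </figure></t>

        <t>In this sketch the storage device would send the TLV once its
        access link comes up, and the access switch would record the decoded
        role and address for later synchronization.</t>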
      </section>

      <section title="Client Endpoint (Host)">
        <t>The client needs to quickly obtain the IP reachability status of
        the server endpoint. To do so, the client sends a subscription
        request to the access switch. In addition, so that the network knows
        the location of the client endpoint, the client endpoint needs to
        register its information with the access switch. When the switching
        network detects a failure that affects a server endpoint the client
        has subscribed to, the access switch notifies the corresponding
        client endpoint of the fault state.</t>

        <t>Also, to reduce the complexity of client endpoints, it is
        recommended that LLDP be extended to support subscriptions. For
        notification messages sent by the switch to client endpoints, it is
        recommended that a Layer 2 protocol extension be used to limit the
        notification scope.</t>

        <t><figure>
            <artwork><![CDATA[ +-----------+           +--------+ 
 | Client EP |           | Switch | 
 |  (Host)   |           |        | 
 +----/------+           +----/---+ 
      |                       |     
      |    Register Msg       |     
      |---------------------->|     
      |                       |     
      |    Subscribe Msg      |     
      |---------------------->|     
      |                       |     
      |   Notification Msg    |     
      |<----------------------|     
      |                       |     
      \                       \     
    Figure 3 : Client Endpoint
]]></artwork>
          </figure></t>
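
        <t>The register, subscribe, and notify exchange above can be pictured
        as a small state table on the access switch. The Python sketch below
        is a minimal illustration, not a protocol definition: the class name,
        method names, and the tuple used for a notification are all
        assumptions introduced for this example.</t>

        <t><figure>
            <artwork><![CDATA[
```python
from collections import defaultdict


class AccessSwitch:
    """Hypothetical model of the endpoint-facing state on an access switch."""

    def __init__(self, name):
        self.name = name
        self.registered = {}                    # endpoint -> (role, ip)
        self.subscriptions = defaultdict(set)   # watched IP -> endpoints
        self.outbox = []                        # notifications to send

    def register(self, endpoint, role, ip):
        """Handle a Register Msg from a locally attached endpoint."""
        self.registered[endpoint] = (role, ip)

    def subscribe(self, endpoint, target_ip):
        """Handle a Subscribe Msg; only registered endpoints may subscribe."""
        if endpoint not in self.registered:
            raise KeyError(f"{endpoint} has not registered with {self.name}")
        self.subscriptions[target_ip].add(endpoint)

    def on_reachability_change(self, ip, reachable):
        """Queue a Notification Msg for every endpoint watching this IP."""
        state = "reachable" if reachable else "unreachable"
        for endpoint in sorted(self.subscriptions.get(ip, ())):
            self.outbox.append((endpoint, ip, state))


sw1 = AccessSwitch("SW1")
sw1.register("H1", "client", "192.0.2.1")
sw1.subscribe("H1", "198.51.100.7")
sw1.on_reachability_change("198.51.100.7", reachable=False)
sw1.on_reachability_change("203.0.113.9", reachable=False)  # nobody watches
```
]]></artwork>
          </figure></t>

        <t>Keying the subscription table by the watched IP address keeps the
        lookup on a fault event proportional to the number of interested
        endpoints rather than to the number of attached hosts.</t>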
      </section>

      <section title="Network Device">
        <t>Network devices, such as access switches, can quickly detect
        failures on local access links. However, the client endpoint that
        needs to learn of a failure may not be connected to the switch that
        detects it. Therefore, the switch that detects the failure needs to
        synchronize the information to other switches so that they can notify
        the endpoints that need it.</t>

        <t>On a large-scale network, a reflector can be used to reduce the
        number of connections needed for information synchronization between
        switches.</t>

        <t>To ensure that synchronization messages are reliably delivered to
        other switches, a reliable transport protocol, such as TCP or QUIC,
        must be used.</t>

        <t><figure>
            <artwork><![CDATA[ +--------+    +-----------+   +--------+ 
 | Switch |    | Reflector |   | Switch | 
 +----/---+    +-----/-----+   +---/----+ 
      |              |             |      
      |   Sync Msg   |             |      
      |------------->|   Sync Msg  |      
      |              |------------>|      
      \              \             \      
    Figure 4 : Network Device
]]></artwork>
          </figure></t>
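
        <t>The effect of a reflector on the number of synchronization
        sessions can be shown with simple arithmetic: a full mesh of n
        switches needs n(n-1)/2 sessions, whereas n switches attached to one
        reflector need n. The Python sketch below illustrates this together
        with a toy relay; the class and function names are hypothetical, and
        the in-memory callbacks stand in for the reliable transport sessions
        that a real deployment would use.</t>

        <t><figure>
            <artwork><![CDATA[
```python
def full_mesh_sessions(n):
    """Sessions needed when every switch peers with every other switch."""
    return n * (n - 1) // 2


def reflector_sessions(n):
    """Sessions needed when every switch peers only with one reflector."""
    return n


class Reflector:
    """Hypothetical relay that forwards a Sync Msg to all other switches."""

    def __init__(self):
        self.clients = {}  # switch name -> delivery callback

    def attach(self, name, deliver):
        self.clients[name] = deliver

    def sync(self, sender, message):
        # Forward to everyone except the switch that originated the message.
        for name, deliver in self.clients.items():
            if name != sender:
                deliver(message)


received = {"SW1": [], "SW7": [], "SW8": []}
reflector = Reflector()
for switch_name, inbox in received.items():
    reflector.attach(switch_name, inbox.append)

reflector.sync("SW7", {"origin": "SW7", "unreachable": ["198.51.100.7"]})
```
]]></artwork>
          </figure></t>

        <t>For the ten switches drawn in Figure 1, the two functions give 45
        sessions for a full mesh and 10 sessions with a single reflector.</t>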
      </section>
    </section>

    <section title="Procedures">
      <t>This section uses an IP-based NVMe interaction example to walk
      through the complete deployment process of this framework.</t>

      <section title="Network Deployment">
        <t>IP-based NVMe runs over standard IP technology, so network
        deployments typically use existing IP technologies. For example, OSPF
        is usually deployed as the underlay routing protocol.</t>
      </section>

      <section title="Hosts and Storage Devices">
        <t>Hosts and storage devices are connected to the IP network. As shown
        in Figure 1, they may access the network in single-homing or
        dual-homing mode. The administrator assigns access IP addresses to the
        hosts and storage devices. In most scenarios, routes to these
        addresses can be advertised through the underlay protocol.</t>

        <t>So that IP network devices learn about these access nodes, hosts
        and storage devices need to register their own network information,
        such as IP addresses and roles, with the access switches after
        accessing the network. In addition, the host needs to send a
        subscription request to the access switch to tell it which storage
        devices the host is interested in.</t>
      </section>

      <section title="Status Information Sync and Notification">
        <t>Hosts and storage devices are connected to different switches. So
        that every switch obtains the registration and subscription
        information of these hosts and storage devices, the information must
        be synchronized between the switches.<figure align="center">
            <artwork><![CDATA[ +------+        +--------+   +-----------+   +--------+     +---------+
 | Host |        | Switch |   | Reflector |   | Switch |     | Storage |
 +--/---+        +----/---+   +-----/-----+   +---/----+     +----/----+
    |  Register Msg   |             |             |               |     
    |---------------->|             |             |               |     
    |  Subscribe Msg  |             |             |               |     
    |---------------->|  Sync Msg   |             |               |     
    |                 |------------>|   Sync Msg  |               |     
    |                 |             |------------>|  Register Msg |     
    |                 |             |   Sync Msg  |<--------------|     
    |                 |   Sync Msg  |<------------|               |     
    |                 |<------------|             |--/            |     
    |                 |             |             |  |Fault       |     
    |                 |             |             |  |Detection   |     
    |                 |             |   Sync Msg  |<--            |     
    |                 |   Sync Msg  |<------------|               |     
    | Notification Msg|<------------|             |               |     
    |<----------------|             |             |               |     
    \                 \             \             \               \     
            Figure 5 : Information Advertisement
]]></artwork>
          </figure></t>

        <t>After detecting a local failure, the switch calculates the IP
        addresses affected by the failure. If another endpoint attached to
        the same switch has subscribed to an affected IP address, the switch
        notifies that endpoint of the fault. In addition, the switch needs to
        synchronize the failed IP addresses to the other switches on the
        network. After receiving this information, the other switches notify
        the access endpoints that need it.</t>
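
        <t>The steps in the preceding paragraph can be sketched as three
        small functions: one on the detecting switch that maps a failed
        access port to the IP addresses registered through it, one that
        builds the synchronization message, and one on a receiving switch
        that selects the local subscribers to notify. This is a minimal
        Python sketch; the function names, the port identifier, and the
        dictionary-shaped message are assumptions made for illustration.</t>

        <t><figure>
            <artwork><![CDATA[
```python
def affected_ips(registrations, failed_port):
    """Return the IPs registered through the access port that failed.

    registrations maps a local port name to the list of endpoint IPs
    that registered over that port.
    """
    return sorted(registrations.get(failed_port, []))


def build_sync_message(origin, ips):
    """Build the (hypothetical) Sync Msg body sent to other switches."""
    return {"origin": origin, "unreachable": list(ips)}


def endpoints_to_notify(subscriptions, sync_message):
    """Pick local endpoints whose watched IPs appear in the Sync Msg.

    subscriptions maps a local endpoint name to the set of IPs it watches.
    """
    failed = set(sync_message["unreachable"])
    result = {}
    for endpoint, watched in subscriptions.items():
        hit = sorted(watched & failed)
        if hit:
            result[endpoint] = hit
    return result


# SW7 loses the access link on which S1 registered 198.51.100.7.
sw7_registrations = {"port1": ["198.51.100.7"], "port2": ["198.51.100.9"]}
message = build_sync_message("SW7", affected_ips(sw7_registrations, "port1"))

# SW1 receives the Sync Msg and checks its local subscriptions.
sw1_subscriptions = {"H1": {"198.51.100.7"}, "H2": {"203.0.113.9"}}
to_notify = endpoints_to_notify(sw1_subscriptions, message)
```
]]></artwork>
          </figure></t>

        <t>In this example only H1 is selected, because H2 has not
        subscribed to any of the addresses carried in the message.</t>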

        <t>When a link between network devices, or a network device itself,
        fails, routes converge on the network. Services may remain
        unavailable even after route convergence. For example, if the SW7-SW5
        link shown in Figure 1 fails, H1 cannot access the IP address that S1
        uses on its connection to SW7. In this case, after detecting the
        failure, the network device calculates the IP addresses affected by
        the failure and then notifies the access endpoints that need the
        failure information. In Figure 1, SW1 calculates that the IP address
        used by S1 to connect to SW7 is unreachable. Therefore, SW1 notifies
        H1 of the failure so that H1 can quickly switch to another storage
        device.</t>
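
        <t>One way to picture the calculation of affected IP addresses after
        a network-side failure is a reachability check over the remaining
        topology. The Python sketch below uses a breadth-first search as a
        stand-in for that calculation; a real switch would more likely
        consult its routing table after convergence. The link list is an
        assumed single-plane simplification of Figure 1, and the function
        names and addresses are hypothetical.</t>

        <t><figure>
            <artwork><![CDATA[
```python
from collections import deque


def reachable_nodes(links, start, failed_link=None):
    """Return the set of switches reachable from start without failed_link."""
    failed = frozenset(failed_link) if failed_link else None
    adjacency = {}
    for a, b in links:
        if failed is not None and frozenset((a, b)) == failed:
            continue  # skip the failed link, in either direction
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def unreachable_ips(links, start, failed_link, ip_location):
    """List IPs whose access switch can no longer be reached from start.

    ip_location maps an endpoint IP to the access switch where the
    endpoint registered it (learned from synchronization messages).
    """
    alive = reachable_nodes(links, start, failed_link)
    return sorted(ip for ip, sw in ip_location.items() if sw not in alive)


# Assumed simplification of one plane of Figure 1.
LINKS = [("SW1", "SW5"), ("SW3", "SW5"), ("SW5", "SW7"), ("SW5", "SW9")]
IP_LOCATION = {"198.51.100.7": "SW7", "198.51.100.9": "SW9"}

lost = unreachable_ips(LINKS, "SW1", ("SW7", "SW5"), IP_LOCATION)
```
]]></artwork>
          </figure></t>

        <t>With the SW7-SW5 link removed, only the address registered at SW7
        is reported as unreachable from SW1, which matches the example in the
        text.</t>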
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>NA</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This document has no IANA actions.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="Informative References">
      <?rfc include="reference.I-D.guo-ffd-requirement"?>
    </references>
  </back>
</rfc>
