<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-wang-ffd-framework-01" ipr="trust200902">
  <front>
    <title abbrev="Framework of FFD for IP-based Network">Framework of Fast
    Fault Detection for IP-based Network</title>

    <author fullname="Haibo Wang" initials="H." surname="Wang">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code>100095</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>rainsword.wang@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Fengwei Qin" initials="F." surname="Qin">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>qinfengwei@chinamobile.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Lily Zhao" initials="L." surname="Zhao">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 3 Shangdi Information Road</street>

          <city>Beijing</city>

          <region/>

          <code>100085</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>Lily.zhao@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Shuanglong Chen" initials="S." surname="Chen">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code>100095</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>chenshuanglong@huawei.com</email>

        <uri/>
      </address>
    </author>

    <date day="11" month="March" year="2023"/>

    <abstract>
      <t>IP-based distributed systems and application-layer software often
      use heartbeats to track the status of the network topology. However,
      the heartbeat interval is typically long, which prolongs fault
      detection. IP-based storage networks are a typical example of this
      scenario. When a fault occurs in an IP-based storage network, NVMe
      connections need to be switched over. Currently, no effective method
      is available for quick detection; switchover is triggered only by
      keepalive timeout, which degrades performance.</t>

      <t>This document defines a basic framework in which the network assists
      host devices in quickly detecting application connection failures
      caused by network faults.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>Today, distributed systems based on network communication are widely
      used. Heartbeats are a common technique for ensuring that both ends of
      a distributed system can perceive faults. However, relying on
      heartbeats to detect whether the peer is faulty has its own
      challenges: if the heartbeat interval is too short, transient network
      disturbances may be misjudged as faults; if the interval is too long,
      a fault may go undetected for a long time.</t>

      <t>IP-based NVMe, distributed storage, and cluster computing are
      typical application scenarios for such technologies.</t>

      <t><xref target="I-D.guo-ffd-requirement"/> describes the problems
      of the current IP-based NVMe solution. On an IP-based storage area
      network, if the access link of a storage device is faulty, hosts cannot
      access the storage device. Because the host cannot directly detect the
      fault, it has to wait for the keepalive (KA) timeout. To speed up fault
      detection, hosts and storage devices can implement fast KA or BFD.
      However, this approach introduces additional cost on hosts and storage
      devices and is hard to use in large-scale IP-based storage area
      networks. In fact, the IP network can detect these faults directly, so
      the IP network can assist the access endpoints in quickly perceiving
      the fault and thus recovering service quickly.</t>
    </section>

    <section title="Terminology">
      <t>NoF: NVMe over Fabrics</t>

      <t>FC: Fibre Channel</t>

      <t>NVMe: Non-Volatile Memory Express</t>

      <t>SAN: Storage Area Network</t>
    </section>

    <section title="Reference Models">
      <t>The framework described here applies to systems in which the
      terminals are directly connected to the IP network.</t>

      <t><figure>
          <artwork align="center"><![CDATA[ +--------+    +-----------+     +-----------+    +--------+ 
 |Terminal|----| IP Network|-----| IP Network|----|Terminal| 
 | Device |    |   Device  |     |   Device  |    | Device |
 +--------+    +-----------+     +-----------+    +--------+              
             Figure 1 : Basic framework
]]></artwork>
        </figure>Terminals are connected to the IP network and establish IP
      connections through the reachability that the network provides. When
      the connection path fails, the terminals cannot detect the failure
      quickly: they detect it only after the keepalive timeout expires, and
      only then can they carry out service protection processing. This delay
      may be relatively long. Therefore, the network needs to notify the
      terminal devices of failures that will break IP connections between
      terminals, such as access port failures and internal network failures,
      so that the terminal devices can respond quickly and perform the
      corresponding service processing.</t>

      <t>Figure 1 shows an abstract model. In actual use, as introduced in
      Section 1, there are scenarios such as IP-based NVMe, distributed
      storage, and cluster computing. IP-based NVMe is used as the typical
      scenario here; the processing behavior in other scenarios is
      similar.</t>

      <t>An IP-based storage area network mainly includes three types of
      roles:</t>

      <t>o Initiator: a terminal device, also called the host.</t>

      <t>o Switch: a network device that provides network access for terminal
      devices.</t>

      <t>o Target: also a terminal device, also known as the storage
      device.</t>

      <t><figure>
          <artwork align="center"><![CDATA[               +--+      +--+      +--+      +--+ 
   Host        |H1|      |H2|      |H3|      |H4| 
(Initiator)    +/-+      +-,+      +.-+      +/-+ 
                |         | '.   ,-`|         |   
                |         |   `',   |         |   
                |         | ,-`  '. |         |   
              +-\--+    +--`-+    +`'--+    +-\--+
              | SW1|    | SW2|    | SW3|    | SW4|
              +--,-+    +---,,    +,.--+    +-.--+
                  `.          `'.,`         .`    
                    `.   _,-'`    ``'.,   .`      
    IP              +--'`+            +`-`-+      
  Network           | SW5|            | SW6|      
                    +--,,+            +,.,-+      
                    .`   `'.,     ,.-``   ',      
                  .`         _,-'`          `.    
              +--`-+    +--'`+    `'---+    +-`'-+
              | SW7|    | SW8|    | SW9|    |SW10|
              +-.,-+    +-..-+    +-.,-+    +-_.-+
                | '.   ,-` |        | `.,   .' |  
                |   `',    |        |    '.`   |  
                | ,-`  '.  |        | ,-`  `', |  
  Storage      +-`+      `'\+      +-`+      +`'+ 
  (Target)     |S1|      |S2|      |S3|      |S4| 
               +--+      +--+      +--+      +--+ 
               Figure 2 : Large-scale SAN
]]></artwork>
        </figure>Figure 2 shows a typical IP-based NVMe dual-plane storage
      area network. When the access link of a storage device fails, the host
      needs to detect the failure quickly so that the NVMe connection
      initiated by the host can quickly switch to the backup path.</t>
    </section>

    <section title="Functional Components">
      <t>An IP-based NVMe SAN consists of storage devices, hosts, and
      switches. Hosts and storage devices need to obtain the required fault
      information from the IP network. Switches need to synchronize locally
      detected fault information across the IP network so that other switches
      can learn of the faults and notify the hosts or storage devices that
      require the fault information.</t>

      <section title="Storage Device">
        <t>As the server side, storage devices provide storage access services
        for hosts. If a storage device is connected to an IP network and is
        interested in the status of other devices, the storage device can
        initiate a subscription request to the connected switch to obtain
        status notifications of other devices from the access switch.</t>

        <t>To reduce the complexity of storage device implementation and
        improve device security, it is recommended that LLDP be extended so
        that the storage device can subscribe to the access switch, and that
        the new L2 protocol extension also be used by the switch to notify
        the storage device of status information.</t>

        <t><figure align="center">
            <artwork><![CDATA[  +-------+                  +------+  
  |Storage|                  |Switch|  
  +-------+                  +------+  
      |      Subscribe Msg      |      
      | ----------------------->|      
      |                         |      
      |     Notification Msg    |      
      | <-----------------------|      
      |                         |      
      |                         |
      Figure 3 : Storage Device
]]></artwork>
          </figure></t>
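
        <t>The subscription exchange in Figure 3 can be sketched as follows.
        This is a minimal illustration rather than a protocol definition:
        the class name SubscriptionTable, its method names, and the use of
        port names and IP address strings as keys are assumptions made for
        this example and are not defined by this document.</t>

        <t><figure>
            <artwork><![CDATA[from collections import defaultdict


class SubscriptionTable:
    """Hypothetical switch-side record of who watches which IP."""

    def __init__(self):
        # watched IP address -> set of local access ports
        self._watchers = defaultdict(set)

    def subscribe(self, port, watched_ip):
        """Handle a Subscribe Msg received on a local access port."""
        self._watchers[watched_ip].add(port)

    def ports_to_notify(self, faulty_ip):
        """Ports that should receive a Notification Msg."""
        return sorted(self._watchers.get(faulty_ip, ()))


table = SubscriptionTable()
table.subscribe("port1", "192.0.2.10")
table.subscribe("port2", "192.0.2.10")
print(table.ports_to_notify("192.0.2.10"))
]]></artwork>
          </figure></t>

        <t>Keying the table by the watched IP address lets the switch find
        the interested ports with a single lookup when a fault is
        reported.</t>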
      </section>

      <section title="Host">
        <t>As a client accessing a storage device, the host needs to be able
        to quickly obtain the service status of the storage device. When the
        host receives a notification from the switch that the storage device
        has failed, the host quickly tears down the connection in use and
        switches to the redundant storage device.</t>

        <t>The recommended protocol on the host side is the same as that on
        the storage device.</t>

        <t><figure>
            <artwork><![CDATA[+-------+                  +------+
|  HOST |                  |Switch|
+-------+                  +------+
    |      Subscribe Msg      |    
    | ----------------------->|    
    |                         |    
    |     Notification Msg    |    
    | <-----------------------|    
    |                         |    
    |                         |    
     Figure 4 : Host Device
]]></artwork>
          </figure></t>
      </section>

      <section title="Network Device">
        <t>The switch can quickly detect local failures or network failures
        and can calculate the affected IP addresses based on these failures.
        The switch synchronizes the affected IP address information to the
        other switches in the IP network. After a switch obtains the fault
        information, it needs to notify the hosts that require it so that
        they can quickly switch to the redundant storage device.</t>

        <t><figure>
            <artwork><![CDATA[+------+                  +------+
|Switch|                  |Switch|
+------+                  +------+
   |    Information Sync     |    
   | ----------------------->|    
   |                         |    
   |                         |    
   |                         |    
    Figure 5 : Network Device
]]></artwork>
          </figure></t>
      </section>
    </section>

    <section title="Procedures">

      <section title="Network Deployment">
        <t>The IP-based SAN uses standard Ethernet technology. Network
        deployments typically use current IP technologies. For example,
        OSPF is usually deployed as the underlay protocol.</t>
      </section>

      <section title="Storage and Host Access">
        <t>Hosts and storage devices are connected to the Ethernet network.
        The administrator assigns access IP addresses to the hosts and storage
        devices. In most scenarios, routes to these addresses can be
        advertised through the underlay protocol. In addition, after hosts
        and storage devices go online, they need to send subscription
        requests to the switch to obtain the status information of the target
        device.</t>

        <t>So that hosts and storage devices do not need to be aware of
        additional IP addresses, it is recommended that LLDP be used to carry
        this message.</t>
      </section>

      <section title="Status Information Sync and Notification">
        <t>When hosts and storage devices go online, the switch calculates
        an initial state for these devices and synchronizes that state across
        the IP network.</t>

        <t>After detecting a local fault, the switch needs to notify the other
        access devices that need the fault information. In addition, the
        switch needs to synchronize the fault information to the other
        switches on the network. To ensure that synchronization messages are
        delivered reliably to the other switches, a reliable transport
        protocol, such as TCP or QUIC, must be used. For large-scale IP
        networks, hierarchical synchronization can be used to reduce the
        number of sessions between switches.</t>
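
        <t>The effect of hierarchical synchronization on the session count
        can be illustrated with a short calculation. The two-level layout
        assumed here, in which access switches hold sessions only to a small
        set of relay switches and the relays are fully meshed with each
        other, is one possible arrangement chosen for this example; this
        document does not mandate it, and the function names are
        illustrative.</t>

        <t><figure>
            <artwork><![CDATA[def full_mesh_sessions(n_switches):
    """Sessions needed if every switch peers with every other."""
    return n_switches * (n_switches - 1) // 2


def two_level_sessions(n_access, n_relays):
    """Sessions if access switches peer only with relay switches.

    Relays are assumed to be fully meshed among themselves.
    """
    return n_access * n_relays + full_mesh_sessions(n_relays)


# 100 access switches plus 2 relay switches (102 in total).
print(full_mesh_sessions(102))     # 5151
print(two_level_sessions(100, 2))  # 201
]]></artwork>
          </figure></t>

        <t>Under these assumptions the session count grows linearly with
        the number of access switches instead of quadratically.</t>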

        <t>The synchronization information about the host and storage devices
        is application-layer information.</t>

        <t><figure align="center">
            <artwork><![CDATA[+-------+           +----+      +------+      +----+         +-------+
|  HOST |-----------|TOR1|------|Spine1|------|TOR3|---------|Storage|
+---/---+           +-/--+      +--/---+      +-/--+         +---/---+
    |---------------->|  Info Sync |  Info Sync |<---------------|    
    |  Subscribe Msg  |----------->|<-----------|  Subscribe Msg |    
    |                 |<-----------|----------->|                |    
    |<----------------|  Info Sync |  Info Sync |                |    
    |Notification Msg |            |            |                |    
    |                 |            |            |                |    
            Figure 6 : Information Advertisement
]]></artwork>
          </figure></t>

        <section title="Access Link Failure">
          <t>When an access link fails, the access switch can detect the
          failure. Based on the faulty link, the access switch can calculate
          the IP address of the affected device. The access switch advertises
          the faulty IP address information to the other local devices that
          need to be aware of the fault. At the same time, the switch
          synchronizes the calculated affected IP address information to the
          other switches in the IP network.</t>

          <t>After a switch receives synchronized faulty IP address
          information from other switches, it needs to notify the local
          access devices that require the fault information.</t>
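
          <t>The access link failure handling described above can be sketched
          as a small simulation. The Switch class, its attribute names, and
          the rule that a switch does not re-synchronize information received
          from a peer (which presumes direct sessions between the switches
          involved) are assumptions made for this example only.</t>

          <t><figure>
              <artwork><![CDATA[class Switch:
    """Hypothetical access switch model for illustration."""

    def __init__(self, name):
        self.name = name
        self.peers = []      # switches to synchronize with
        self.port_ips = {}   # local port -> attached IP addresses
        self.watchers = {}   # watched IP -> subscribed local ports
        self.sent = []       # (port, faulty IP) notifications sent

    def notify_local(self, ips):
        """Send a Notification Msg to each interested local port."""
        for ip in ips:
            for port in sorted(self.watchers.get(ip, ())):
                self.sent.append((port, ip))

    def on_link_down(self, port):
        """Local access link failed: notify locally, then sync."""
        ips = sorted(self.port_ips.get(port, ()))
        self.notify_local(ips)
        for peer in self.peers:
            peer.on_sync(ips)

    def on_sync(self, ips):
        """Fault information received from another switch."""
        self.notify_local(ips)  # assumed: no further re-sync


tor1, tor3 = Switch("TOR1"), Switch("TOR3")
tor3.peers.append(tor1)
tor3.port_ips["port7"] = {"192.0.2.10"}   # storage device
tor1.watchers["192.0.2.10"] = {"port1"}   # subscribed host
tor3.on_link_down("port7")
print(tor1.sent)
]]></artwork>
            </figure></t>

          <t>In this run the host behind port1 of TOR1 is notified of the
          faulty address without waiting for a keepalive timeout.</t>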
        </section>

        <section title="Network Link or Device Failure">
          <t>ECMP or redundant link protection is usually deployed to protect
          against this type of failure.</t>

          <t>However, when a fault occurs from which the network cannot
          converge, the access switch can detect it quickly by deploying
          detection technology. It can also calculate the IP addresses
          affected by the fault and then perform the same actions as for an
          access link failure.</t>
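
          <t>One way to calculate the affected IP addresses for a network
          failure is sketched below. It assumes that the switch learns which
          prefixes have become unreachable (for example, from its routing
          table) and checks the subscribed addresses against them; this
          document does not specify the calculation, and the function name
          is illustrative.</t>

          <t><figure>
              <artwork><![CDATA[import ipaddress


def ips_in_lost_prefixes(watched_ips, lost_prefixes):
    """Return the watched IPs covered by an unreachable prefix."""
    nets = [ipaddress.ip_network(p) for p in lost_prefixes]
    return sorted(
        ip for ip in watched_ips
        if any(ipaddress.ip_address(ip) in net for net in nets)
    )


watched = ["192.0.2.10", "198.51.100.7"]
print(ips_in_lost_prefixes(watched, ["192.0.2.0/24"]))
]]></artwork>
            </figure></t>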
        </section>
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>To limit the scope within which this information is communicated and
      to reduce the negative impact of possible information flooding, it is
      suggested that the Subscribe Msg and Notification Msg considered in
      this framework be implemented through an L2 protocol extension, so that
      the sending and receiving of these messages are controlled only by the
      access network device within the domain. In addition, the network
      device is not allowed to forward these messages; it is only allowed to
      receive or send them as needed.</t>

      <t>The communication protocol between network devices can be secured
      with commonly used security mechanisms, including but not limited to
      TCP-AO and TLS.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This document makes no request of IANA.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="Informative References">
      <?rfc include="reference.I-D.guo-ffd-requirement"?>
    </references>
  </back>
</rfc>
