<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-guo-ffd-requirement-02" ipr="trust200902">
  <front>
    <title abbrev="Requirements of FFD for IP-based Networks">Requirements
    of Fast Fault Detection for IP-based Networks</title>

    <author fullname="Liang Guo" initials="L" surname="Guo">
      <organization>CAICT</organization>

      <address>
        <postal>
          <street>No.52, Hua Yuan Bei Road, Haidian District,</street>

          <city>Beijing</city>

          <region/>

          <code>100191</code>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>guoliang1@caict.ac.cn</email>

        <uri/>
      </address>
    </author>

    <author fullname="Yi Feng" initials="Y" surname="Feng">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street>12 Chegongzhuang Street, Xicheng District</street>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>fengyiit@chinamobile.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Jizhuang Zhao" initials="J" surname="Zhao">
      <organization>China Telecom</organization>

      <address>
        <postal>
          <street>South District of Future Science and Technology, Changping
          District</street>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>zhaojzh@chinatelecom.cn</email>

        <uri/>
      </address>
    </author>

    <author fullname="Fengwei Qin" initials="F" surname="Qin">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street>12 Chegongzhuang Street, Xicheng District</street>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>qinfengwei@chinamobile.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Lily Zhao" initials="L" surname="Zhao">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 3 Shangdi Information Road, Haidian District</street>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>Lily.zhao@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Haibo Wang" initials="H." surname="Wang">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>rainsword.wang@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Wei Quan" initials="W" surname="Quan">
      <organization>Beijing Jiaotong University</organization>

      <address>
        <postal>
          <street>3 Shangyuan Cun, Haidian District</street>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>weiquan@bjtu.edu.cn</email>

        <uri/>
      </address>
    </author>

    <author fullname="Hongyi Huang" initials="H." surname="Huang">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code>100095</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>hongyi.huang@huawei.com</email>

        <uri/>
      </address>
    </author>

    <date day="01" month="March" year="2024"/>

    <workgroup>Network Working Group</workgroup>

    <keyword>fault detection</keyword>

    <keyword>heartbeat</keyword>

    <abstract>
      <t>IP-based distributed systems and application-layer software often
      use heartbeats to maintain network topology status. However, the
      heartbeat interval is typically long, which prolongs system fault
      detection time. This document describes the requirements for a fast
      fault detection solution for IP-based networks.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>In the face of ever-expanding data volumes, even a powerful
      single-server system cannot meet the requirements of data analysis
      and storage. At the same time, with the growth of Ethernet bandwidth
      and network scale, distributed systems that communicate over the
      network have emerged and developed rapidly. Heartbeat is a common
      topology-maintenance technique used in distributed systems and
      application-layer software. However, if the heartbeat interval is
      set too short, transient network congestion may lead to misjudgment;
      if it is set too long, fault judgment is slow. In general, the
      parameters must be balanced against various conditions. IP-based
      NVMe, distributed storage, and cluster computing are used in core
      application scenarios, and the requirements on performance and on
      limiting the impact of faults on services keep increasing. This
      document describes application scenarios and capability requirements
      for fast fault detection in scenarios such as IP-based NVMe,
      artificial intelligence, and distributed storage.</t>
    </section>

    <section anchor="Terminology" title="Terminology">
      <t>AI: Artificial Intelligence</t>

      <t>FC: Fibre Channel</t>

      <t>HPC: High-Performance Computing</t>

      <t>NVMe: Non-Volatile Memory Express</t>

      <t>IP-based NVMe: transporting NVMe over Ethernet using RDMA or
      TCP</t>

      <t>NoF: NVMe over Fabrics</t>
    </section>

    <section title="Use Cases">

      <section anchor="IP-NVMe" title="IP-based NVMe">
        <t>For a long time, key storage applications with high performance
        requirements have mainly been carried over FC networks. As
        transmission rates increased, the medium evolved from HDDs to
        solid-state storage and the protocol evolved from SATA to NVMe.
        The emergence of new NVMe technologies brings new opportunities.
        With the development of the NVMe protocol, its application scope
        has been extended from PCIe to other fabrics, solving the problems
        of NVMe scale-out and transmission distance. The block storage
        protocol uses NoF to replace SCSI, reducing the number of protocol
        interactions between application hosts and storage systems. The
        end-to-end NVMe protocol greatly improves performance.</t>

        <t>The fabrics of NoF include Ethernet, Fibre Channel, and
        InfiniBand. Comparing FC-NVMe to Ethernet- or InfiniBand-based
        alternatives generally comes down to weighing the advantages and
        disadvantages of the networking technologies. Fibre Channel
        fabrics are noted for their lossless data transmission,
        predictable and consistent performance, and reliability, so large
        enterprises tend to favor FC storage for mission-critical
        workloads. But Fibre Channel requires special equipment and
        storage networking expertise to operate and can be more costly
        than IP-based alternatives. Like FC, InfiniBand is a lossless
        network requiring special hardware. IP-based NVMe storage products
        tend to be more plentiful than FC-NVMe-based options, and most
        storage startups focus on IP-based NVMe. But unlike FC, an
        Ethernet switch does not notify endpoints of device status
        changes. When a device is faulty, the host relies on the NVMe link
        heartbeat mechanism and takes tens of seconds to complete service
        failover.</t>

        <t><figure>
            <artwork align="center"><![CDATA[   +--------------------------------------+
   |          NVMe Host Software          |
   +--------------------------------------+
   +--------------------------------------+
   |   Host Side Transport Abstraction    |
   +--------------------------------------+

      /\      /\      /\      /\      /\
     /  \    /  \    /  \    /  \    /  \
      FC      IB     RoCE    iWARP   TCP
     \  /    \  /    \  /    \  /    \  /
      \/      \/      \/      \/      \/

   +--------------------------------------+
   |Controller Side Transport Abstraction |
   +--------------------------------------+
   +--------------------------------------+
   |          NVMe SubSystem              |
   +--------------------------------------+
Figure 1: NVMe SubSystem
]]></artwork>
          </figure>This section describes the application scenarios and
        capability requirements of the IP-based NVMe storage that implements
        fast fault detection similar to FC.</t>

        <t>The NVMe over RDMA or IP-based network in storage includes three
        types of roles: an initiator (referred to as a host), a switch, and a
        target (referred to as a storage device). Initiators and targets are
        also referred to as endpoint devices.</t>

        <t><figure>
            <artwork align="center"><![CDATA[                 +--+      +--+      +--+      +--+
     Host        |H1|      |H2|      |H3|      |H4|
  (Initiator)    +/-+      +-,+      +.-+      +/-+
                  |         | '.   ,-`|         |
                  |         |   `',   |         |
                  |         | ,-`  '. |         |
                +-\--+    +--`-+    +`'--+    +-\--+
                | SW |    | SW |    | SW |    | SW |
                +--,-+    +---,,    +,.--+    +-.--+
                    `.          `'.,`         .`
                      `.   _,-'`    ``'.,   .`
         IP           +--'`+            +`-`-+
    Network           | SW |            | SW |
                      +--,,+            +,.,-+
                      .`   `'.,     ,.-``   ',
                    .`         _,-'`          `.
                +--`-+    +--'`+    `'---+    +-`'-+
                | SW |    | SW |    | SW |    | SW |
                +-.,-+    +-..-+    +-.,-+    +-_.-+
                  | '.   ,-` |        | `.,   .' |
                  |   `',    |        |    '.`   |
                  | ,-`  '.  |        | ,-`  `', |
    Storage      +-`+      `'\+      +-`+      +`'+
    (Target)     |S1|      |S2|      |S3|      |S4|
                 +--+      +--+      +--+      +--+
Figure 2: NVMe over IP-based Network
]]></artwork>
          </figure></t>

        <t>Hosts and storage devices are connected to the network
        separately; to achieve high reliability, each host and storage
        device is connected to dual network planes simultaneously. The
        host can read and write data once an NVMe connection is
        established between the host and the storage device.</t>

        <t>When a storage device link fails during operation, the host
        cannot detect the fault status of the indirectly connected device
        at the transport layer. Based on the IP-based NVMe protocol, the
        host uses the NVMe heartbeat to detect the status of the storage
        device. The heartbeat message interval is 5 s, so it takes tens of
        seconds to determine that the storage device is faulty and to
        perform service switchover using the multipath software. The
        failure tolerance time of core applications cannot be met. To
        obtain the best customer experience and meet business reliability
        requirements, fault detection and failover for IP-based NVMe need
        to be enhanced.</t>
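        <t>As a rough, hypothetical illustration of the latency gap
        described above (the 5 s heartbeat interval comes from this
        document; the missed-heartbeat threshold of 3 and the switch-side
        timings are assumed example values, not values defined by
        NVMe):</t>

        <t><figure>
            <artwork><![CDATA[
```python
# Hypothetical latency comparison. The 5 s heartbeat interval is
# from this document; the missed-heartbeat threshold of 3 and the
# switch-side timings below are assumed example values.

def heartbeat_detection_time_s(interval_s, missed_threshold):
    """Worst-case time to declare a peer dead via heartbeats."""
    return interval_s * missed_threshold

def switch_notification_time_s(link_detect_s, sync_s, notify_s):
    """Link-layer detection plus fault synchronization and notify."""
    return link_detect_s + sync_s + notify_s

heartbeat = heartbeat_detection_time_s(5.0, 3)
fast_path = switch_notification_time_s(0.01, 0.05, 0.05)

print(heartbeat)            # 15.0 -- far beyond a 1 s target
print(round(fast_path, 2))  # 0.11 -- within the 1 s target
```
]]></artwork>
          </figure></t>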

        <t>Storage systems offer active-active solutions, and a proposal
        under discussion in NVMe uses the second active path to convey
        fault information and drive switchover at the source node.
        However, this only addresses local link faults at the storage
        node; it cannot address unconverged network faults. In storage
        application deployments, independent dual-plane networking may be
        used, and a single-plane device may fail. In this case, network
        convergence cannot be performed completely.</t>

        <t>This document therefore proposes a fast fault detection
        solution with switch participation. The scheme utilizes the
        ability of switches to detect faults quickly at the physical and
        link layers, allows the switches to synchronize the detected fault
        information across the IP network, and then notifies the fault
        status to the endpoint devices.</t>

        <t>Fault detection procedure: the host detects the fault status of
        the storage device and quickly switches to the standby path.<list
            style="numbers">
            <t>If a storage fault occurs, the access switch detects the
            fault at the network layer or link layer.</t>

            <t>The switch synchronizes the status to other switches on the
            network.</t>

            <t>The switch notifies the storage fault information to the
            hosts.</t>

            <t>The host quickly disconnects from the storage device and
            triggers the multipath software to switch services to the
            redundant path. The fault should be detected within 1 s.</t>
          </list><figure>
            <artwork align="center"><![CDATA[   +----+       +-------+     +-------+    +-------+
   |Host|       |Switch |     |Switch |    |Storage|
   +----+       +-------+     +-------+    +-------+
      |             |            |-+           |
      |             |            |1|           |
      |             |            |-+           |
      |             |<----2------|             |
      |             |            |             |
      |<----3-------|            |             |
      |             |            |             |
      |<----4-------|------------|-----------> |
      |             |            |             |
Figure 3: Switches interact with hosts and storage devices
]]></artwork>
          </figure></t>
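        <t>The steps above can be sketched as the following event flow (a
        minimal illustrative model with hypothetical class and message
        names; this document does not define a wire format or API):</t>

        <t><figure>
            <artwork><![CDATA[
```python
# Sketch of the four-step procedure: the access switch detects the
# fault (1), synchronizes it to peer switches (2), the switches
# notify attached hosts (3), and the host fails over (4).
# All names here are hypothetical illustrations.

class Host:
    def __init__(self, paths):
        self.paths = list(paths)   # ordered, active path first
        self.active = self.paths[0]

    def on_fault_notification(self, failed_target):
        # Step 4: drop the failed connection and let the multipath
        # logic switch to the redundant path.
        if self.active == failed_target:
            self.paths.remove(failed_target)
            self.active = self.paths[0]

class Switch:
    def __init__(self):
        self.peers = []            # other switches in the network
        self.hosts = []            # locally attached endpoints

    def on_link_down(self, target):
        # Step 1: local physical/link-layer fault detection.
        self.synchronize(target)   # Step 2
        self.notify_hosts(target)  # Step 3

    def synchronize(self, target):
        for sw in self.peers:
            sw.notify_hosts(target)

    def notify_hosts(self, target):
        for h in self.hosts:
            h.on_fault_notification(target)

host = Host(paths=["S1-planeA", "S1-planeB"])
sw_a, sw_b = Switch(), Switch()
sw_a.peers = [sw_b]
sw_b.hosts = [host]
sw_a.on_link_down("S1-planeA")
print(host.active)   # S1-planeB: host switched to the standby path
```
]]></artwork>
          </figure></t>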
      </section>

      <section title="Distributed Storage">
        <t>Distributed storage cluster devices are interconnected through a
        network (back-end IP network) to establish a cluster. When a link
        fault on a node or node fault occurs in the storage cluster, other
        nodes in the storage cluster cannot detect the fault status of the
        indirectly connected devices through the transport layer. Based on the
        IP protocol, management or master nodes in a storage cluster use
        heartbeats to detect the status of storage nodes. It takes 10 seconds
        or more to determine whether a storage device is faulty and switch
        services to another normal storage node. Services cannot be accessed
        during the fault. To achieve the best customer experience and service
        reliability, we need to enhance the fault detection and failover of
        IP-based cluster nodes.</t>

        <t><figure>
            <artwork align="center"><![CDATA[    Storage      +--+      +--+      +--+      +--+
    cluster      |S1|      |S2|      |S3|      |S4|
                 +--+      +--+      +--+      +--+
                  |           '.   ,-`          |
                  |            .`',_            |
                  |    _ ..--`       `'--.._    |
                +-\--+                       +-\--+
                | SW |                       | SW |
                +--,-+_                     _+-.--+
                    `. `'--..._   _ .. -- '`_.`
                      `.    _,-'` -._     .`
    BACK Storage      +--'`+         +`-`-+
    IP Network        | SW |         | SW |
                      +----+         +----+
Figure 4: Distributed storage
]]></artwork>
          </figure></t>

        <t>The fast fault detection solution in this proposal can be used
        in this scenario. The solution takes advantage of the switch's
        ability to quickly detect faults at the physical layer and link
        layer, and allows the switch to synchronize fault information
        detected on the IP network. The switch then notifies the storage
        cluster management or master node of the fault status.</t>

        <t>Fault detection procedure: <list style="numbers">
            <t>If a storage fault occurs, the access switch detects the
            fault at the network layer or link layer.</t>

            <t>The switch synchronizes the status to other switches on the
            network.</t>

            <t>The switch notifies the storage fault information to the
            storage management or master node. The fault should be detected
            within 1s.</t>
          </list><figure>
            <artwork><![CDATA[   +------+       +-------+     +-------+    +-------+
   |master|       |Switch |     |Switch |    |Storage|
   +------+       +-------+     +-------+    +-------+
      |               |            |-+           |
      |               |            |1|           |
      |               |            |-+           |
      |               |<----2------|             |
      |               |            |             |
      |<----3---------|            |             |
      |               |            |             |

Figure 5: Switches interact with controller
]]></artwork>
          </figure></t>
      </section>

      <section title="Cluster Computing">
        <t>In cluster computing scenarios, for example, HPC cluster
        applications and AI cluster applications, cluster node faults and
        failures may occur on any node at any time. However, for a
        high-performance computing task, once a fault occurs, the entire task
        needs to be re-scheduled. However, It takes several minutes for the
        management node to detect the node fault status. During this period,
        new jobs may be scheduled to the faulty node, causing task execution
        failure.</t>

        <t>The fast fault detection solution in this proposal can be used in
        this scenario. The fault can be detected within seconds.</t>

        <t><figure>
            <artwork><![CDATA[   +-----------------+       +-------+     +-------+    +----------+
   | Management/     |       |Switch |     |Switch |    | Computer |
   | Scheduling node |       |       |     |       |    | node     |
   +-----------------+       +-------+     +-------+    +----------+
      |                          |            |-+           |
      |                          |            |1|           |
      |                          |            |-+           |
      |                          |<----2------|             |
      |                          |            |             |
      |<----3--------------------|            |             |
      |                          |            |             |

Figure 6: Switches interact with HPC cluster]]></artwork>
          </figure>The fault detection procedure is similar to that of
        distributed storage, as shown in Figure 6.</t>
      </section>
    </section>

    <section title="Requirements">
      <t>In distributed Ethernet systems and cross-network connection
      scenarios, the following requirements are identified to accelerate
      failover:</t>

      <t><list style="numbers">
          <t>A network device can detect link or network failure.</t>

          <t>A network device can synchronize the failure to other network
          devices.</t>

          <t>A network device can notify local access endpoints of local
          or remote failure information.</t>

          <t>A network device sends a notification to the endpoints when
          it detects, or is notified of the detection of, any failure to
          which the endpoints have subscribed.</t>
        </list></t>
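      <t>Requirement 4 above can be sketched as a simple subscription
      model (a hypothetical API; this document does not define message
      formats):</t>

      <t><figure>
          <artwork><![CDATA[
```python
# Minimal sketch of requirement 4: endpoints subscribe to the
# failures they care about, and a network device notifies only
# those subscribers on detection or on a peer's synchronized
# report. The API below is a hypothetical illustration.
from collections import defaultdict

class FaultNotifier:
    def __init__(self):
        self.subs = defaultdict(set)   # resource -> endpoints

    def subscribe(self, endpoint, resource):
        self.subs[resource].add(endpoint)

    def on_failure(self, resource, delivered):
        # Called on local detection or when a peer device
        # synchronizes the failure; notify subscribers only.
        for ep in sorted(self.subs[resource]):
            delivered.append((ep, resource))

n = FaultNotifier()
n.subscribe("H1", "S1")
n.subscribe("H2", "S2")
events = []
n.on_failure("S1", events)
print(events)   # [('H1', 'S1')]: only the S1 subscriber is told
```
]]></artwork>
        </figure></t>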
    </section>

    <section title="Security Considerations">
      <t>The functions described in this document are mainly used in
      limited networks, and their use needs to be deployed by the
      operator, who should control the scope of use.</t>

      <t>This requirement involves network devices sending notification
      messages to endpoint devices, which requires the cooperation of the
      endpoint devices. In addition, to limit the range of notification
      messages, it is recommended that network devices use Layer 2
      messages to implement the notification function, so that
      notification messages are confined to the access range of the
      access nodes and notification flooding is avoided. Furthermore,
      given the scope of the required function, notification messages
      should only be generated by access network devices and should not
      be forwarded by other network devices; network devices therefore
      also need to control the receiving and publishing behavior of these
      messages.</t>

      <t>Synchronization messages between network devices are based on
      sessions between the devices, and message encryption and
      authentication can be performed for these sessions using mature,
      existing technology.</t>
    </section>

    <section title="IANA Considerations">
      <t>This document has no IANA actions.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="Informative References"/>
  </back>
</rfc>
