<?xml version="1.0" encoding="utf-8"?>
<?xml-model href="rfc7991bis.rnc"?>

<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<rfc
  xmlns:xi="http://www.w3.org/2001/XInclude"
  category="std"
  docName="draft-wang-idr-dpf-00"
  ipr="trust200902"
  obsoletes=""
  updates=""
  submissionType="IETF"
  xml:lang="en"
  version="3"
  consensus="true">

  <front>
    <title abbrev="DPF">BGP Deterministic Path Forwarding (DPF)</title>

    <seriesInfo name="Internet-Draft" value="draft-wang-idr-dpf-00"/>
   
    <author fullname="Kevin Wang" initials="K" surname="Wang">
      <organization>HPE</organization>
      <address>
        <email>kevin.wang@hpe.com</email>  
      </address>
    </author>
   
    <author fullname="Michal Styszynski" initials="M" surname="Styszynski">
      <organization>HPE</organization>
      <address>
        <email>mlstyszynski@juniper.net</email>  
      </address>
    </author>
   
    <author fullname="Wen Lin" initials="W" surname="Lin">
      <organization>HPE</organization>
      <address>
        <email>wen.lin@hpe.com</email>  
      </address>
    </author>
   
    <author fullname="Mahesh Subramaniam" initials="M" surname="Subramaniam">
      <organization>HPE</organization>
      <address>
        <email>mahesh-kumar.subramaniam@hpe.com</email>  
      </address>
    </author>
   
    <author fullname="Thomas Kampa" initials="T" surname="Kampa">
      <organization>Audi</organization>
      <address>
        <email>thomas.kampa@audi.de</email>  
      </address>
    </author>
   
    <author fullname="Diptanshu Singh" initials="D" surname="Singh">
      <organization>Oracle Cloud Infrastructure</organization>
      <address>
        <email>diptanshu.singh@oracle.com</email>  
      </address>
    </author>
   
    <date year="2025"/>
    <area>Routing</area>
    <workgroup>IDR</workgroup>

    <keyword>BGP</keyword>

    <abstract>
      <t>
        Modern data center (DC) fabrics typically employ Clos topologies with
	External BGP (EBGP) for plain IPv4/IPv6 routing.
	While hop-by-hop EBGP routing is simple and scalable, it
	provides only a single best-effort forwarding service for all types of traffic. This
	single best-effort service might be insufficient for increasingly diverse
	traffic requirements in modern DC environments. For
	example, loss- and latency-sensitive AI/ML flows may demand stronger Service Level Agreements (SLAs)
	than general-purpose traffic. Duplication schemes standardized in
	protocols such as the Parallel Redundancy Protocol (PRP) require
	disjoint forwarding paths to avoid single points
	of failure. Congestion avoidance may require more deterministic forwarding behavior.
      </t>

      <t>
	This document introduces BGP Deterministic Path Forwarding (DPF), a mechanism that
	partitions the physical fabric into multiple logical fabrics. Flows can be mapped to
	different logical fabrics based on their specific requirements, enabling deterministic
	forwarding behavior within the data center.
      </t>
    </abstract>
 
  </front>

  <middle>
    
    <section>
      <name>Introduction</name>
      <t>
        Modern data center (DC) fabrics typically employ Clos topologies with
	External BGP (EBGP) <xref target="RFC7938"/> for plain IPv4/IPv6 routing.
	While hop-by-hop EBGP routing is simple and scalable, it
	provides only a single best-effort forwarding service for all types of traffic. This
	single best-effort service might be insufficient for increasingly diverse
	traffic requirements in modern DC environments. For
	example, loss- and latency-sensitive AI/ML flows may demand stronger Service Level Agreements (SLAs)
	than general-purpose traffic. Duplication schemes standardized in
	protocols such as the Parallel Redundancy Protocol (PRP)
	<xref target="IEC62439-3"/> require disjoint forwarding paths to avoid single points
	of failure. Congestion avoidance may require more deterministic forwarding behavior.
      </t>

      <t>
        Traditionally, traffic engineering requirements like these can be served using technologies
	like RSVP-TE <xref target="RFC3209"/> or Segment Routing <xref target="RFC8402"/>
	in MPLS networks. However, for the reasons stated in <xref target="RFC7938"/>,
	modern data centers mostly use IP routing with EBGP as their sole routing protocol. BGP DPF
	is a lightweight traffic engineering alternative designed specifically for IP Clos
	fabrics that use EBGP as the routing protocol. It partitions the physical fabric into multiple
	logical fabrics by coloring the EBGP sessions running on the fabric links. Routes are also
	colored so that they are only advertised and received over the matching colored EBGP sessions.
	Together, they provide a certain level of deterministic forwarding behavior for the flows to
	satisfy the diverse traffic requirements of today's data centers.
      </t>


      <section>
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
          "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT
          RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
          interpreted as described in BCP 14 <xref target="RFC2119"/>
          <xref target="RFC8174"/> when, and only when, they appear in
          all capitals, as shown here.</t>
      </section>

    </section>
    
    <section>
      <name>BGP DPF</name>

      <t>
        BGP DPF uses BGP session coloring and route coloring to direct flows to different logical
	fabrics.
      </t>
      
      <section>
	<name>BGP Session Coloring</name>
	
	<t>
          <xref target="logical-fabric"/> shows how a physical fabric is partitioned into two logical
	  fabrics, the red fabric and the blue fabric. Leaf1 and Leaf2 can communicate using the red
	  fabric via Spine1, or using the blue fabric via Spine2. Links Spine1-Leaf1 and Spine1-Leaf2
	  belong to the red fabric, and links Spine2-Leaf1 and Spine2-Leaf2 belong
	  to the blue fabric. Instead of coloring the links directly, BGP DPF colors the EBGP sessions
	  running on the corresponding links. The color of an EBGP session is configured on both ends
	  separately, using the Color Extended Community as defined in
	  <xref target="RFC9012" sectionFormat="of" section="4.3"/>.
	</t>

	<t>
	  There are two modes for session coloring: strict mode and loose mode. In
	  strict mode, the EBGP session MUST NOT reach the Established state unless both ends
	  are configured with the same color. In loose mode, mismatched colors on the two ends
	  of an EBGP session SHALL NOT prevent the session from coming up.
	</t>
	
	<figure anchor="logical-fabric">
          <name>Divide one physical fabric into two logical fabrics</name>
          <artwork type="ascii-art">
            <![CDATA[
                 +---------+           +---------+
                 | Spine 1 |           | Spine 2 |
                 |  (red)  |           |  (blue) |
                 +---------+           +---------+
                      | \                 / |
                      |    \           /    |
                      |   red \     / blue  |
                  red |          /          | blue
                      |       /    \        |
                      |    /           \    |
                      | /                 \ |
                 +---------+           +---------+
                 | Leaf 1  |           | Leaf 2  |
                 +---------+           +---------+
            ]]>
          </artwork>
	</figure>
	
	<section>
	  <name>Strict Mode</name>

	  <t>
	    When running in the strict session coloring mode, a BGP speaker uses the Capability
	    Advertisement procedures from <xref target="RFC5492"/> to determine whether the color
	    configured locally matches the color configured on the remote end. When a color is
	    configured for an EBGP session locally, the BGP speaker sends the SESSION-COLOR
	    capability in the OPEN message. The fields in the Capability Optional Parameter
	    are set as follows. The Capability Code field is set to TBD. The Capability Length
	    field is set to 4. The Capability Value field is set to the 4-octet Color Value
	    of the Color Extended Community, as
	    defined in <xref target="RFC9012" sectionFormat="of" section="4.3"/>. Note that even
	    though the BGP session is colored using a Color Extended Community, the only
	    field used is the Color Value of the Color Extended Community; the Flags field
	    is ignored. That is why only the 4-octet Color Value is included in the SESSION-COLOR
	    Capability.
	    The SESSION-COLOR capability format is shown in <xref target="session_color"/>:
	  </t>

	  <figure anchor="session_color">
            <name>SESSION-COLOR Capability</name>
            <artwork type="ascii-art">
              <![CDATA[
   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |Cap Code = TBD |Cap Length = 4 |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |                        Color Value                            |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
              ]]>
            </artwork>
	  </figure>
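	  <t>
	    As a non-normative illustration, the encoding above can be sketched as follows.
	    The capability code used here is a placeholder chosen for the example only,
	    since the actual code is TBD.
	  </t>

```python
import struct

SESSION_COLOR_CAP_CODE = 0xFF  # placeholder only: the real Capability Code is TBD

def encode_session_color(color_value: int) -> bytes:
    """Encode the SESSION-COLOR capability: 1-octet code,
    1-octet length (always 4), 4-octet Color Value."""
    return struct.pack("!BBI", SESSION_COLOR_CAP_CODE, 4, color_value)

def decode_session_color(data: bytes) -> int:
    """Decode a SESSION-COLOR capability and return its Color Value."""
    code, length = struct.unpack_from("!BB", data)
    if code != SESSION_COLOR_CAP_CODE or length != 4:
        raise ValueError("not a SESSION-COLOR capability")
    return struct.unpack_from("!I", data, 2)[0]
```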

	  <t>
	    When receiving the OPEN message for an EBGP
	    session, the BGP speaker matches the SESSION-COLOR capability against its locally
	    configured session color. The session colors are considered a match if one of the
	    following conditions holds:
	  </t>

	  <dl newline="true">
	    <dt>
	      No color on both ends:
	    </dt>
	    <dd>
	      The received OPEN message has no SESSION-COLOR capability and the EBGP session is
	      not configured with a color.
	    </dd>
	    <dt>
	      Same color on both ends:
	    </dt>
	    <dd>
	      The received OPEN message has a SESSION-COLOR capability and its color is the
	      same as the session color configured locally for the EBGP session.
	    </dd>
	  </dl>

	  <t>
	    All other cases MUST be considered a session color mismatch.
	    When a session color mismatch is detected, the BGP speaker MUST reject the session
	    by sending a Color Mismatch NOTIFICATION message (Error Code 2, Error Subcode TBD) to the peer BGP
	    speaker.
	  </t>
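	  <t>
	    The strict-mode matching rule reduces to a simple comparison in which
	    "no color" is represented as an absent value. The following is a
	    non-normative sketch; the helper names are illustrative, not part of
	    this specification.
	  </t>

```python
from typing import Optional

def session_colors_match(local: Optional[int], remote: Optional[int]) -> bool:
    """Strict mode: colors match if neither end is colored, or if both
    ends are configured with the same Color Value."""
    return local == remote

def on_open_received(local: Optional[int], remote: Optional[int]) -> str:
    """Illustrative action when an OPEN message arrives for an EBGP session."""
    if session_colors_match(local, remote):
        return "continue session establishment"
    # Reject the session with a NOTIFICATION:
    # Error Code 2 (OPEN Message Error), Error Subcode TBD (Color Mismatch).
    return "send NOTIFICATION and close"
```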
        </section>
	
	<section>
	  <name>Loose Mode</name>

	  <t>
	    The strict session coloring mode ensures that an Established EBGP session has
	    matching session colors on both ends, which helps detect color
	    misconfigurations early. However, exchanging session colors through a Capability
	    in the BGP OPEN message requires a session flap whenever the session color changes.
	    To address this session flap issue, the loose session coloring mode is introduced.
	    In loose mode, session colors are not carried in the
	    BGP OPEN message, so changing the session color does not lead to a session
	    flap. In this case, if the colors configured on the two ends of the EBGP session mismatch,
	    the routes received over the session will match the color of the remote end but
	    mismatch the color of the local end, as described in <xref target="route-coloring"/>.
	    A route received with a mismatched color MUST NOT be accepted.
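	  <t>
	    The loose-mode inbound check can be sketched as follows. This is a
	    non-normative illustration, assuming (per the route coloring rules) that
	    uncolored routes may arrive over a session of any color.
	  </t>

```python
from typing import Optional

def accept_route(session_color: Optional[int],
                 route_color: Optional[int]) -> bool:
    """Loose mode inbound check: a route carrying a Color Extended
    Community that mismatches the locally configured session color
    is rejected; uncolored routes are acceptable over any session."""
    if route_color is None:
        return True
    return route_color == session_color
```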

	  <t>
	    <xref target="I-D.ietf-idr-dynamic-cap"/> allows Capabilities to be exchanged
	    without flapping the session. That might allow the loose mode to be phased out
	    gradually once the Dynamic Capability is widely deployed.
	  </t>
        </section>
      </section>

      <section anchor="route-coloring">
	<name>Route Coloring</name>

        <t>
	  Once the EBGP sessions are colored accordingly, the physical fabric is
	  partitioned into multiple logical fabrics. Routes can also be colored at the
	  egress leaves to indicate which EBGP sessions (or which logical fabrics)
	  they should be advertised over.
        </t>

	<section>
	  <name>Route Coloring at the Egress Leaf</name>

	  <t>
	    There are several ways to color a route at an egress leaf:
	  </t>

	  <dl newline="true">
            <dt>
	      One color: 
	    </dt>
	    <dd>
	      When a route is configured with one color at the egress leaf, it is advertised
	      over EBGP sessions of the same color, as well as over uncolored EBGP sessions, with the
	      corresponding Color Extended Community attached. This is the simplest way
	      to make use of the logical fabrics.
	    </dd>
            <dt>
	      One primary color and one backup color:
	    </dt>
	    <dd>
	      When a route is configured with one primary color and one backup color at the
	      egress leaf, it is
	      advertised over the EBGP sessions of the primary color, with the primary
	      Color Extended Community and an AIGP metric  <xref target="RFC7311"/>
	      of value zero. It is also advertised
	      over the EBGP sessions of the backup color, with the backup
	      Color Extended Community. In case there are uncolored sessions, the route
	      is also advertised over the uncolored sessions, without Color Extended
	      Community. The AIGP metric will help the receiving node to
	      identify the primary colored paths. This allows traffic to fall back to the backup
	      logical fabric when the primary logical fabric fails.
	    </dd>
            <dt>
	      One primary color and all-colors as backup colors:
	    </dt>
	    <dd>
	      When a route is configured with one primary color and all-colors as backup colors
	      at the egress leaf,
	      it is advertised over the EBGP sessions of the primary color, with the primary
	      Color Extended Community and an AIGP metric of value 0. It is also advertised
	      over the EBGP sessions of all other colors, with the Color Extended Community
	      same as the corresponding session color. In case there are uncolored sessions,
	      the route is also advertised over the uncolored sessions, without Color Extended
	      Community. The AIGP metric will help the receiving nodes to
	      identify the primary colored paths. By specifying all-colors as backup colors,
	      traffic can be spread over all remaining logical fabrics when the primary fabric
	      fails. In the single backup color approach, traffic from the failed primary logical
	      fabric might congest the backup fabric. By spreading the failed primary logical
	      fabric traffic to all backup logical fabrics, the chance of congestion on the backup
	      logical fabrics will be significantly reduced.
	    </dd>
            <dt>
	      All-colors:
	    </dt>
	    <dd>
	      When a route is configured with all-colors at the egress leaf,
	      it is advertised over the EBGP sessions
	      with any color, with the Color Extended Community same as the corresponding
	      session color. In case there are uncolored sessions,
	      the route is also advertised over the uncolored sessions, without Color Extended
	      Community. This allows the ingress router to map different flows of the route to
	      different logical fabrics.
	    </dd>
            <dt>
	      No color:
	    </dt>
	    <dd>
	      An uncolored route from the egress leaf can be advertised over
	      EBGP sessions with any color or no color.
	      It is advertised without a Color Extended Community. Uncolored routes could be useful
	      for carrying routing protocol PDUs, which do not use much bandwidth but need to be
	      sent over any link regardless of the logical fabrics.
	    </dd>
	  </dl>

	  <t>
	    Since the AIGP metric is used in the primary/backup color cases, all BGP
	    speakers MUST support AIGP when DPF primary/backup protection is required.
	  </t>
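	  <t>
	    The egress advertisement rules above can be summarized as a per-session
	    decision. The following is a non-normative sketch; the configuration
	    model (an optional primary color, an optional backup color, or
	    all-colors) is an assumption made for illustration only, and uncolored
	    sessions are shown carrying the route without a Color Extended Community.
	  </t>

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Advert:
    color: Optional[int]  # Color Extended Community to attach (None = none)
    aigp: Optional[int]   # AIGP metric to attach (None = no AIGP attribute)

def egress_decision(session_color: Optional[int],
                    primary: Optional[int] = None,
                    backup: Optional[int] = None,
                    all_colors: bool = False) -> Optional[Advert]:
    """How one route is advertised over one EBGP session (None = not at all)."""
    uncolored_route = primary is None and backup is None and not all_colors
    if session_color is None or uncolored_route:
        # Uncolored sessions, and uncolored routes, carry no color community.
        return Advert(color=None, aigp=None)
    if primary is not None and session_color == primary:
        # Primary color: attach AIGP metric 0 when a backup exists, so the
        # receiver can identify the primary colored paths.
        return Advert(color=primary,
                      aigp=0 if (backup is not None or all_colors) else None)
    if (backup is not None and session_color == backup) or all_colors:
        # Backup color (or all-colors): attach the session's own color.
        return Advert(color=session_color, aigp=None)
    return None  # session color matches neither primary nor backup
```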
	</section>

	<section>
	  <name>Color Matching at the Spine and Super Spine</name>

	  <t>
	    At the transit nodes (Spines or Super Spines), the Color Extended Community of the
	    route is used to match against the EBGP session color to decide whether the route
	    should be advertised over the session:
	  </t>

	  <dl>
	    <dt>
	      Advertising over an uncolored EBGP session:
	    </dt>
	    <dd>
	      If the session is uncolored, the route is re-advertised following the existing
	      route advertisement rules defined in <xref target="RFC4271"/>.
	    </dd>
	    <dt>
	      Advertising over a colored EBGP session:
	    </dt>
	    <dd>
	      If the active route has no Color Extended Community, or has a Color Extended Community
	      that is the same as the session color, then the active route is advertised over the
	      session. If the active route has a Color Extended Community mismatching the session
	      color, then check whether there is an inactive route with a Color Extended Community
	      matching the session color. If so, advertise the active route over the session,
	      except that the AIGP attribute (if any) MUST be
	      stripped and the Color Extended Community MUST be replaced with the session's
	      Color Extended Community. Otherwise, do not advertise the route. Matching the session color
	      against the inactive routes is necessary because a backup route needs to be
	      re-advertised to the backup fabric. So, when a packet arrives from the backup fabric,
	      it is forwarded over the primary fabric to the destination, unless the primary fabric
	      is down.
	    </dd>
	  </dl>
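	  <t>
	    The transit re-advertisement rules above can be sketched per session as
	    follows. This is a non-normative illustration; route state is simplified
	    to the colors of the active path and of the inactive paths.
	  </t>

```python
from typing import Optional, Set

def transit_decision(session_color: Optional[int],
                     active_color: Optional[int],
                     inactive_colors: Set[int]) -> Optional[dict]:
    """Decide how a spine re-advertises the active route over one session.
    Returns None to suppress the advertisement, otherwise the attributes."""
    if session_color is None:
        # Uncolored session: normal RFC 4271 advertisement rules apply.
        return {"color": active_color, "strip_aigp": False}
    if active_color is None or active_color == session_color:
        return {"color": active_color, "strip_aigp": False}
    if session_color in inactive_colors:
        # An inactive (backup) path matches this session: advertise the
        # ACTIVE route, but strip AIGP and rewrite the color community.
        return {"color": session_color, "strip_aigp": True}
    return None  # no active or inactive route matches the session color
```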
	</section>

	<section>
	  <name>Flow Mapping at the Ingress Leaf</name>

	  <t>
	    At the ingress leaf, flows can be mapped to different logical fabrics based on
	    the route coloring approaches from the egress leaf:
	  </t>

	  <dl>
	    <dt>
	      One color:
	    </dt>
	    <dd>
	      When a route is configured with one color at the egress leaf, the ingress leaf
	      will receive the route from the EBGP session(s) with that color only. Flows
	      towards this destination will be mapped to the logical fabric of this color
	      only.
	    </dd>
	    <dt>
	      One primary color and one backup color:
	    </dt>
	    <dd>
	      When a route is configured with one color as primary color and one
	      color as backup color at the egress leaf, the ingress leaf will receive
	      the route from EBGP sessions of both the primary color and the backup
	      color. The routes received from the primary color sessions will be
	      preferred due to AIGP. The routes received from the backup color sessions
	      can be used as the backup paths. Flows towards this destination will be
	      mapped to the primary logical fabric. In case the primary logical fabric
	      fails, flows towards this destination will be mapped to the backup
	      logical fabrics. Note that fallback to the backup logical fabric could happen at
	      the ingress leaf as well as the spines and super spines.
	    </dd>
	    <dt>
	      One primary color and all-colors as backup color:
	    </dt>
	    <dd>
	      When a route is configured with one color as primary color and all-colors
	      as backup color at the egress leaf, the ingress leaf will receive the
	      route from EBGP sessions of all colors. The routes received from the
	      primary color sessions will be preferred due to AIGP. The routes received
	      from all other colored sessions can be used as backup paths.
	      Flows towards this destination will be
	      mapped to the primary logical fabric. In case the primary logical fabric
	      fails, flows towards this destination will be mapped to all backup
	      logical fabrics. Note fallback to backup logical fabrics could happen at
	      the ingress leaf as well as the spines and super spines.
	    </dd>
	    <dt>
	      All colors:
	    </dt>
	    <dd>
	      When a route is configured with all-colors at the egress leaf, the ingress
	      leaf will receive the route from EBGP sessions of all colors. The routes
	      from all sessions can be used to forward traffic. The ingress leaf can
	      map flows towards this destination to routes with different Color Extended
	      Communities, using mechanisms such as an Access Control List (ACL) filter.
	      The details of mapping different flows to different routes of the same
	      destination are out of the scope of this document.
	    </dd>
	  </dl>

	  <t>
	    Apart from mapping IP flows as described above, the ingress leaf could also
	    map VPN flows, such as EVPN-VXLAN flows, to different logical fabrics. For
	    example, the egress leaf can advertise multiple VXLAN tunnel endpoint routes,
	    each with its own color. When a VXLAN tunnel endpoint is chosen for a MAC
	    VRF at the ingress leaf, flows of that MAC VRF will be mapped to the
	    logical fabric corresponding to the color of the tunnel endpoint route.
	  </t>
	</section>
      </section>
    </section>
    
    <section>
      <name>Use Cases</name>

      <t>
        The most common use cases for BGP DPF are:
      </t>
      <ul spacing="normal">
        <li>AI/ML backend training DC networks</li>
        <li>AI/ML frontend DC Inference networks</li>
        <li>IP Storage networks</li>
        <li>DCI - Data Center Interconnect</li>
        <li>Industrial hybrid DC/Campus networks</li>
      </ul>
      <section anchor="backend-network">
        <name>AI/ML backend training Data Center network</name>

        <t>
          In the context of the AI/ML data centers (DC), especially where the training
          of LLM (Large Language Models)
          is the primary goal, there might be some challenges with traditional IP ECMP
	  packet spraying, such as out-of-order packet delivery due to the way load
	  balancing is performed, or inconsistent performance between different phases
	  of job execution. AI/ML training in a data center refers to the process of utilizing 
	  large-scale computing infrastructure to train machine learning models on massive datasets. 
	  This process can take weeks or sometimes months for larger models. 
	  LLM training is taking place in DCs with GPU-enabled servers interconnected in 
	  the Rail Optimized Design within the IP Clos scale-out fabrics. 
	  In such architectures, every GPU of the server is linked to a 400G/800G NIC card, which 
	  connects to a different ToR (Top of Rack) leaf Ethernet switch node. The typical AI 
	  training server uses eight GPUs, so each server requires eight NIC cards, each connecting 
	  to a different ToR. A typical Rail is based on eight 400G/800G/1.6Tbps switches, and 
	  rail-to-rail communication between strips is achieved through multiple spine nodes (typically 32 or 
	  more).
	</t>

        <t>
          The transport used by the GPU servers between the rails or within the rail is 
          either based on ROCEv2, or UEC transport (UET) in the future. The number of these flows per 
          GPU/NIC is sometimes limited. A single ROCEv2 flow can utilize a massive bandwidth, and 
          the characteristics of the flows may have very low entropy - the same UDP source and 
          destination ports are used by the ROCEv2 transport between the GPU servers during the given 
          Job-ID. This may lead to short-term congestion at the spines, triggering the DCQCN 
          reactive congestion control in the AI/DC fabric, with the PFC (Priority Flow Control) and 
          ECN (Explicit Congestion Notification) mechanisms activated to prevent frame loss. 
          Consequently, these mechanisms slow down the AI/ML session by temporarily reducing the 
          rate at the source GPU server and extending the time needed to complete the given Job-ID. 
          If congestion persists, frame loss may also occur, and the given Job-ID may need to be 
          restarted to be synced across all GPUs participating in the collective communication.  
          With packet spraying techniques or flow-based Dynamic Load Balancing, this is a less common 
          situation in a well-designed Ethernet/IP fabric, but the GPU servers' NIC cards must support
          out-of-order delivery. Additionally, it may still reduce performance or cause instability
          between Job-IDs or between tenants connected to the same AI/DC fabric.
        </t>

	<t>
	  This is where deterministic path pinning-based load balancing of flows can be 
	  applied, and where the BGP-DPF can be utilized to color the paths of a given tenant or a 
	  specific AI/ML workload, controlling how these paths are used. When the given ROCEv2 
	  traffic is identified through the destination QPAIR in the BTH header at the ToR Ethernet 
	  switch, it can be allocated to a specific DPF color ID using ingress enforcement rules or 
	  TCAM flow awareness at the ASIC level. The AI/ML flows can be load-balanced across 
	  different DPF fabric color IDs and remain on the specified fabric color for the duration 
	  of the AI/ML Job. Thanks to that, not only does the given AI workload get a dedicated 
	  fabric color ID, but it also becomes isolated from the other AI workloads, which offers 
	  more predictable performance results (consistent tail latency and same Job Completion Time
	  (JCT)) when compared to packet spraying based load balancing across all of the IP ECMP paths.
	</t>

	<t>
	  In this case, the probability of encountering congestion is also lower, as the given 
	  workload is assigned a dedicated path and is not competing with other AI workloads. 
	  When pinning the AI workload to a specific path, this means that there will be no packet 
	  reordering at the destination/target server, as the ROCEv2/UET packets will follow the 
	  same path from the beginning to the end of the given session.
	</t>

	<t>
	  The Rail Optimized Design shown in <xref target="backend-fabric"/> may also run two LLM training sessions
	  simultaneously from two different tenants. This is also where IP path diversity of the 
	  DPF comes into play - by simply coloring the two workloads from the two LLMs, we can 
	  forward them across a different set of spine switches.
	</t>

        <figure anchor="backend-fabric">
          <name>AI/ML backend training Data Center network</name>
          <artwork type="ascii-art">
            <![CDATA[
                  +-----------+
             +----|GPU-server1|---+
             |    +-----------+   |
             |       |            |
             |       |            |
             |       |            |
       +-----+-------+------------+-----rail1
       |  +--+---+ +-+----+      ++-----+  |
       |  +leaf1-+ +leaf2-+ .... +leaf8-+  |
       +---+----+-------------+--------+---+
 +---------+    |             |        +-----------+
 |              |             |                    |
 |              |             |                    |
 |              |             |                    |
 |              |             |                    |
Fab-A          Fab-A         Fab-B                Fab-B
 |              |             |                    |
 |              |             |                    |
 |              |             |                    |
++--------+   +-+-------+   +-+-------+   +--------++
|spine1   |...|spine16  |   |spine17  |...|spine32  |
++--------+   +-+-------+   +-+-------+   +--------++
 |              |             |                    |
 |              |             |                    |
 |              |             |                    |
Fab-A          Fab-A         Fab-B               Fab-B
 |              |             |                    |
 |              |             |                    |
 |              |             |                    |
 |              |             |                    |
 +-------+      |             |         +----------+
        ++------+-------------+---------+---+
        |  +------+ +------+      +------+  |
        |  +leaf9-+ +leaf10+ .... +leaf16+  |
        +---+---------+--------------+---rail2
            |         |              |
            |         |              |
            |         |              |
            |    +----+------+       |
            +----|GPU-server2+-------+
                 +-----------+
            ]]>
          </artwork>
        </figure>                        

	<t>
	  For example, 16 spines are allocated to the LLM-A training, and the other 16 spines are 
	  mapped to the LLM-B. Within each group of colored spines, IP ECMP with Dynamic Load 
	  Balancing can still operate on a per-flow or per-packet basis. With this approach, each 
	  tenant LLM receives half of the fabric's capacity, which can be reduced or increased if 
	  required. The given fabric colors fab-A and fab-B can also be
	  allocated to tenants enabled with EVPN-VXLAN overlays.
	</t>

	<t>
	  In summary, using BGP-DPF in a backend DC network can achieve:
	</t>
	<ul spacing="normal">
	  <li>
	    Predictable and more efficient load balancing of the AI/ML workloads with the path 
	    pinning (for example, the  ROCEv2 Op Code-based pinning or the destination ROCEv2 
	    QPAIR-based path pinning in case of the ROCEv2 traffic)
	  </li>
	  <li>
	    Isolation of the tenants inside the larger-scale AI/ML IP Clos fabric
	  </li>
	  <li>
	    Consistent performance and faster AI workload ramp-up time
	  </li>
	  <li>
	    Eliminated or greatly reduced reliance on PFC/ECN in the lossless fabric
	  </li>
	</ul>
      </section> 
      
      <section>
        <name>AI/ML frontend DC and the Inference network</name>

        <t> 
          In the context of an AI/ML data center, an inference network refers to the computing 
	  infrastructure and networking components optimized for running already trained machine 
	  learning models (inference) at scale. Its primary purpose is to deliver low-latency, 
	  high-throughput predictions for both real-time and batch workloads. For example, ChatGPT is a
	  large-scale inference application deployed in a data center environment that utilizes 
	  real-time data, yet it employs a generative AI model, such as GPT, which has been 
	  trained for several weeks in the training domain, as explained in <xref target="backend-network"/> above.
	</t>

	<t>
	  The reason we mention it is that in many cases, cloud or service providers will run 
	  inferences in parallel for multiple customers simultaneously. Multi-tenancy is likely to 
	  be used at the network level - for example, utilizing EVPN-VXLAN-based tenant isolation 
	  in the leaf/spine/super-spine IP Clos fabric, or using MAC-VRFs or Pure RT5 IPVPN. 
	  In such cases, many inference applications can be enabled simultaneously within the same 
	  physical fabric. 
	  In some cases, the tenant/customer may request to be fully isolated from the other 
	  tenants, not only from a control plane perspective but also from a data plane perspective 
	  when forwarding traffic between the two ToR switches.
	</t>

	<t>
	  For example, tenant-A and 
	  tenant-B may each be allocated to a different RT5 EVPN-VXLAN instance, and these 
	  instances are mapped to two different BGP-DPF color IDs. With this approach, the overlays 
	  of tenant A and tenant B will never overlap and will utilize different fabric spines. 
	  One outcome is that latency, which is critical for inference applications, 
	  becomes more predictable when the fabric paths for the two tenants are different, 
	  since the two overlays are more closely correlated with the underlay paths. In some cases, with the 
	  explicit definition of a backup color ID at the BGP-DPF level, fast convergence 
	  becomes an additional benefit for the frontend EVPN-VXLAN fabrics. 
        </t>
      </section>
      
      <section>
        <name>IP Storage networks with Fab-A/Fab-B path diversity</name>

	<t> 
          In the context of the DC, storage networks are a key component of the infrastructure, 
	  providing servers with scalable block or object storage. For block 
	  storage, such as NVMe-oF (using NVMe/RDMA or NVMe/TCP), the Fab-A/Fab-B design is 
	  often used, where Fabric-A serves as the primary and Fabric-B as the backup path for 
	  read and write operations on the remote storage arrays. A given server inside 
	  the DC typically has dedicated storage NICs; for redundancy, two NIC 
	  ports are generally used - one connected to Fab-A and another to Fab-B. As with 
	  traditional storage such as Fibre Channel (FC), the recommended approach is to ensure 
	  that the dedicated storage fabric supports complete path isolation, so that in case 
	  of failure at least one of the two fabrics remains available.
	</t>

	<t>
	  This is also where BGP DPF can 
	  help, by explicitly defining the IP storage paths for Fab-A and Fab-B. Besides the 
	  storage redundancy aspect, capacity planning is also essential: after a failover from 
	  A to B, the same read and write capacity must still be offered to all 
	  IP-fabric-connected servers. Each fabric is dimensioned to offer 100% capacity in the 
	  event of failure, while all operations are managed at the logical level using BGP DPF.
        </t>
      </section>
      
      <section>
        <name>DCI - Data Center Interconnect</name>

	<t> 
          For critical applications, disaster recovery plans usually require a second
	  availability zone for redundancy and resilience. Typical designs either replicate
	  persistent storage data, run the same application in parallel in a backup
	  location, or load-balance across multiple DCs.
	</t>

	<t>
	  When replicating data or 
	  synchronizing application state between two sites, it is sometimes also necessary to 
	  isolate the paths across long-distance connectivity. If the connection between DC1 and DC2
	  uses a full or partial mesh of links and the DCI solution uses EVPN-VXLAN or
	  pure IP connections, some workloads may require more deterministic communication, 
	  achieved by correlating the underlay and overlay when both use BGP as the IP routing 
	  protocol. One path may have better latency and jitter than the other between the two 
	  remote locations, so the administrator may decide to pin an EVPN-VXLAN instance 
	  (MAC-VRF and/or RT5 IP-VPN) to a carefully selected underlay path over the dark fiber 
	  connection. In this use case, we assume the DCI underlay runs 
	  EBGP <xref target="RFC7938"/>, with some links colored using BGP-DPF. EVPN-VXLAN can 
	  use EVPN-VXLAN to EVPN-VXLAN tunnel stitching, 
	  with the DCI underlay links colored by BGP-DPF as red and blue paths. 
	  Different MAC-VRFs and RT5 instances are assigned to different DPF colors to control 
	  the forwarding of workloads between the two DC locations.
	</t>

	<t>
	  The outcome of this use case is 
	  that the DCI administrator can anticipate failovers and allocate EVPN-VXLAN-connected 
	  workloads based on the capacity and performance (including latency and jitter) of the 
	  DCI links.
	</t>
      </section>
      
      <section> 
	<name>Industrial/factory hybrid DC/Campus networks</name>

	<t>
	  Industrial and factory automation is increasingly adopting distributed computing
	  concepts to leverage the benefits of virtualization and containerization. This change
	  often comes with a shift of applications into a remote DC, which imposes stringent
	  requirements on the networking infrastructure between the DC and the respective process.
	  These hybrid DC/campus networks require a high level of resiliency against failures,
	  as certain applications tolerate no frame loss. Duplication schemes such as
	  PRP <xref target="IEC62439-3"/> are leveraged in these scenarios to provide
	  zero loss in the face of failures, but they require disjoint paths to avoid any 
	  single point of failure.
	</t>

	<t>
	  When the campus and DC fabrics utilize modern solutions, such as 
	  EVPN-VXLAN overlays, IP ECMP from leaf to spine is frequently employed.
	  This can lead to both PRP duplicates being forwarded across the same spine,
	  bringing processes to a standstill during spine maintenance or a physical 
	  failure. Here, a BGP-DPF-based underlay network can guarantee that the EVPN-VXLAN 
	  overlays are always forwarded over their predefined nominal and backup paths,
	  providing disjoint paths across the fabric. The
	  primary and backup paths taken by PRP frames are well-defined, enabling fault-tolerant
	  communication, e.g., between robots on the shop floor and control applications running
	  in a distributed environment in the DC. With the PRP frames destined for LAN A and 
	  LAN B sent through EVPN-VXLAN MAC-VRF-A and MAC-VRF-B, over the diverse paths DPF 
	  color-A and DPF color-B, critical communication flows are controlled in terms of 
	  forwarding and recovery, achieving the deterministic behavior they require.
	</t>
      </section>
    </section>

    <section>
      <name>Operational Considerations</name>

      <t>
	When routes are colored with both primary and backup colors at the egress leaf,
	the operator must ensure that the network is strictly staged to avoid potential
	routing and forwarding loops. A strictly staged
	network ensures that a packet always moves to the next stage and never returns
	to a previous one. In a Clos topology with EBGP, staged routing is guaranteed by
	configuring the same AS number on all spines and super spines within the same
	stage; only the leaves have unique AS numbers.
      </t>
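
      <t>
	For illustration, a three-stage Clos with the staged AS numbering described
	above (the specific AS numbers are examples only) could be laid out as follows:
      </t>
      <figure>
	<name>Example of staged AS numbering in a Clos fabric</name>
	<artwork>
Super-spines: [AS 65100] [AS 65100]   (one shared AS per stage)
                  |  \   /  |
Spines:       [AS 65200] [AS 65200]   (one shared AS per stage)
                  |  \   /  |
Leaves:       [AS 65301] [AS 65302]   (unique AS per leaf)
	</artwork>
      </figure>
      <t>
	Because all nodes in a stage share a single AS, standard EBGP AS_PATH loop
	prevention discards any route that would re-enter that stage, so forwarding
	can only progress toward the next stage.
      </t>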
    </section>
    
    <section anchor="IANA">
      <name>IANA Considerations</name>
      <t>
	A new BGP Capability will be requested from the "Capability Codes" registry
	within the "IETF Review" range <xref target="RFC5492"/>.
      </t>

      <t>
	A new OPEN Message Error subcode named "Color mismatch" will be requested
	from the "OPEN Message Error subcodes" registry.
      </t>
    </section>
    
    <section anchor="Security">
      <name>Security Considerations</name>
      <t>
        An attacker modifying the Color Extended Community of a BGP UPDATE message
	could cause routes to be advertised to unintended logical fabrics,
	potentially leading to failed or suboptimal routing.
      </t>
    </section>
  </middle>

  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5492.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9012.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.4271.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7311.xml"/>

      </references>
 
      <references>
        <name>Informative References</name>
       
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3209.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8402.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7938.xml"/>
	<xi:include href="https://bib.ietf.org/public/rfc/bibxml3/reference.I-D.draft-ietf-idr-dynamic-cap-17.xml"/>
	<reference anchor="IEC62439-3">
	  <front>
	    <title>
	      Industrial communication networks – High availability automation networks –
	      Part 3: Parallel Redundancy Protocol (PRP) and High-availability Seamless Redundancy (HSR)
	    </title>
	    <author>
	      <organization>International Electrotechnical Commission</organization>
	    </author>
	    <date year="2016"/>
	  </front>
	  <seriesInfo name="IEC" value="62439-3:2016"/>
	</reference>
      </references>
    </references>

    <section>
      <name>Alternative Solutions</name>
      <t>
	An alternative way to achieve part of the BGP DPF functionality is to use
	BGP export and import policies. Instead of coloring the EBGP sessions and routes, one
	could use export policies to specify the session(s) on which a route
	should be advertised. On the receiving side, one could likewise use
	import policies to ensure a route is only accepted from certain EBGP sessions.
	This alternative approach was not chosen for the following reasons:
      </t>
	
      <ul>
        <li>
	  The policy configurations have to be maintained on each node and might need to
	  change when new routes are added.
	</li>
        <li>
	  Policy configurations are less intuitive than session coloring and could be
	  prone to configuration mistakes.
	</li>
        <li>
	  Certain DPF functionalities, such as the primary and backup logical fabrics,
	  might not be achievable using routing policies alone.
	</li>
       </ul>
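
      <t>
	As a rough sketch (the policy syntax below is illustrative only and not tied
	to any particular implementation), the policy-based alternative could look like:
      </t>
      <figure>
	<name>Hypothetical export/import policy alternative</name>
	<artwork>
# On the advertising leaf: export tenant-A routes only toward "red" spines
policy export-red
    match prefix-list tenant-A-routes
    action accept
apply export-red to sessions spine-red-1 spine-red-2

# On the receiving side: accept tenant-A routes only from "red" sessions
policy import-red
    match prefix-list tenant-A-routes
    action accept
apply import-red to sessions leaf-1
	</artwork>
      </figure>
      <t>
	This per-node, per-session configuration must be revisited whenever routes or
	sessions are added, which motivates the session-coloring approach of DPF.
      </t>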
    </section>

    <section anchor="Acknowledgements" numbered="false">
      <name>Acknowledgements</name>
      <t>
        TBD.
      </t>
    </section>
    
    <section anchor="Contributors" numbered="false">
      <name>Contributors</name>
      <contact initials="J." surname="Haas" fullname="Jeffrey Haas">
        <organization>HPE</organization>
        <address>
          <email>jeffrey.haas@hpe.com</email>
        </address>
      </contact>
    </section>
    
 </back>
</rfc>
