<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc2119 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
]>
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="std" docName="draft-yang-dmsc-distributed-model-00"
     ipr="trust200902">
  <front>
    <title abbrev="D">Distributed AI model architecture for microservices
    communication and computing power scheduling</title>

    <author fullname=" Hui Yang" initials="H" surname="Yang">
      <organization> Beijing University of Posts and
      Telecommunications</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <region>Beijing</region>

          <code/>

          <country>China</country>
        </postal>

        <email>yanghui@bupt.edu.cn</email>
      </address>
    </author>

    <author fullname="Tiankuo Yu" initials="TK" surname="Yu">
      <organization>Beijing University of Posts and
      Telecommunications</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <region>Beijing</region>

          <code/>

          <country>China</country>
        </postal>

        <email>yutiankuo@bupt.edu.cn</email>
      </address>
    </author>

    <date day="30" month="January" year="2025"/>

    <area>IETF Area</area>

    <workgroup>DMSC Working Group</workgroup>

    <keyword>distributed AI</keyword>

    <keyword>service architecture</keyword>

    <abstract>
      <t>This document describes a distributed AI micro-model computing
      power scheduling service architecture, in which large AI models are
      decomposed into micro-models that are deployed as microservices and
      scheduled across distributed computing resources.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>The Distributed AI Micromodel Computing Power Scheduling Service
      Architecture is a structured framework designed to address the
      challenges of scalability, flexibility, and efficiency in modern AI
      systems. By integrating model segmentation, micro-model deployment, and
      microservice orchestration, this architecture enables the effective
      allocation and management of computing resources across distributed
      environments. The primary focus lies in leveraging model segmentation to
      decompose large AI models into smaller, modular micro-models, which are
      executed collaboratively across distributed nodes.</t>

      <t>The architecture is organized into four tightly integrated layers,
      each with distinct roles and responsibilities that together ensure
      seamless functionality:</t>

      <t>Service Layer: This layer acts as the interface between the
      user-facing applications and the underlying system. It encapsulates AI
      capabilities as microservices, enabling modular deployment, elastic
      scaling, and independent version control. By routing user requests
      through service gateways, it ensures efficient interaction with back-end
      micro-models while balancing workloads. The service layer also
      facilitates collaboration between multiple micro-models, allowing them
      to function as part of a cohesive distributed system.</t>

      <t>Control Layer: The control layer is the central coordination hub,
      responsible for task scheduling, resource allocation, and the
      implementation of model segmentation strategies. It decomposes large AI
      models into smaller, manageable components, assigns tasks to specific
      nodes, and ensures synchronized execution across distributed
      environments. This layer dynamically balances compute and network
      resources while adapting to system demands, ensuring high efficiency for
      training and inference workflows.</t>

      <t>Computing Power Layer: As the execution core, this layer translates
      the decisions made by the control layer into distributed computation. It
      executes segmented micro-models on diverse hardware resources such as
      GPUs, CPUs, and accelerators, optimizing parallelism and fault
      tolerance. By coordinating with the control layer, it ensures that tasks
      are executed efficiently while leveraging distributed orchestration
      frameworks to handle diverse workloads.</t>

      <t>Data Layer: The data layer underpins the entire system by managing
      secure storage, access, and transmission of data. It provides the
      necessary datasets, intermediate results, and metadata required for
      executing segmented micro-models. Privacy protection mechanisms, such as
      federated learning and differential privacy, ensure data security and
      compliance, while distributed database operations guarantee consistent
      access and high availability across nodes.</t>

      <t>At the heart of this architecture is model segmentation, which serves
      as the foundation for effectively distributing computation and
      optimizing resource utilization. The control layer breaks down models
      into smaller micro-models using strategies such as layer-based,
      business-specific, or block-based segmentation. These micro-models are
      then deployed as independent services in the service layer, where they
      are dynamically scaled and orchestrated to meet real-time demands. The
      computing power layer executes these tasks using parallel processing
      techniques and advanced scheduling algorithms, while the data layer
      ensures secure and efficient data flow to support both training and
      inference tasks.</t>

      <t>By tightly integrating these layers, the architecture addresses
      critical challenges such as balancing compute and network resources,
      synchronizing distributed micro-models, and minimizing communication
      overhead. This cohesive design enables AI systems to achieve high
      performance, scalability, and flexibility across dynamic and
      resource-intensive workloads.</t>

      <t>This document outlines the design principles, key components, and
      operational advantages of the Distributed AI Micromodel Computing Power
      Scheduling Service Architecture, emphasizing how model segmentation,
      micro-models, and microservices form the foundation for scalable and
      efficient distributed AI systems.</t>

    </section>

    <section title="Conventions used in this document">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </section>

    <section title="Terminology">
      <t>TBD</t>
    </section>

    <section title="Scenarios and requirements">
      <t>This section first analyzes the scenario requirements of AI
      microservice models and then describes the end-to-end business flow
      of distributed micro-model services, from which the requirements on
      computing power scheduling, communication, and data handling are
      derived.</t>

      <section title="AI Microservice model scenario requirements">
        <t>At present, with the accelerated evolution of artificial
        intelligence technology, the scale and complexity of AI models
        continue to expand, and the traditional monolithic application or
        centralized reasoning and training mode is increasingly difficult to
        meet the rapidly changing business needs. Encapsulating AI
        capabilities as microservices can bring significant advantages in
        terms of system flexibility, scalability, and service governance. By
        decoupling models through microservices, a separate AI model service
        can avoid potential bottlenecks caused by deep coupling with the rest
        of the business logic, and can also scale elastic when requests or
        training load surges. For AI models, the iteration and upgrade speed
        is very fast. Microservice architecture makes it possible to coexist
        multiple versions of the model, grayscale release, and fast rollback,
        thereby reducing the impact on the overall system.</t>

        <t>AI microservice models often place extreme demands on computing
        power. On the one hand, training and inference usually involve
        massive data processing and high-density parallel computing, which
        requires the collaborative work of heterogeneous hardware resources
        such as GPUs, CPUs, FPGAs, and NPUs. On the other hand, when the
        model is large or the request volume is high, the computing power
        of a single machine is often insufficient, and computation must be
        distributed in parallel across multiple nodes, with resources
        reasonably released during idle periods to improve utilization.
        Such distributed training or inference usually relies on efficient
        communication strategies to synchronize model parameters or
        gradients; collective operations such as AllReduce or All-to-All
        are often used to reduce communication overhead and ensure model
        consistency.</t>
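
        <t>As an illustration of the AllReduce pattern mentioned above, the
        following sketch simulates a ring-based AllReduce in plain Python.
        This is a minimal single-process simulation, not part of the
        architecture: the chunking scheme and function names are
        illustrative assumptions. Each node ends up holding the sum of all
        nodes' gradient vectors after 2*(p-1) chunk transfers per node.</t>

        <figure>
          <artwork><![CDATA[
# Minimal single-process simulation of ring AllReduce (illustrative).
# grads[i] is node i's gradient vector; after the call, every node
# holds the element-wise sum of all vectors.

def ring_allreduce(grads):
    p = len(grads)               # number of simulated nodes
    size = len(grads[0]) // p    # chunk length (assumes divisibility)

    def chunk(i, c):
        return grads[i][c * size:(c + 1) * size]   # copies the slice

    # Phase 1: reduce-scatter. At step s, node i sends chunk
    # (i - s) mod p to its right neighbor, which accumulates it.
    # Sends are snapshotted first so each step uses pre-step values.
    for s in range(p - 1):
        sends = [((i - s) % p, chunk(i, (i - s) % p))
                 for i in range(p)]
        for i, (c, data) in enumerate(sends):
            dst = grads[(i + 1) % p]
            for k, v in enumerate(data):
                dst[c * size + k] += v

    # Phase 2: allgather. Node i now holds the fully reduced chunk
    # (i + 1) mod p; reduced chunks circle the ring, overwriting.
    for s in range(p - 1):
        sends = [((i + 1 - s) % p, chunk(i, (i + 1 - s) % p))
                 for i in range(p)]
        for i, (c, data) in enumerate(sends):
            grads[(i + 1) % p][c * size:(c + 1) * size] = data

    return grads

# Example: 4 nodes, each holding a length-8 vector filled with i.
g = [[float(i)] * 8 for i in range(4)]
ring_allreduce(g)
assert all(v == [6.0] * 8 for v in g)   # 0+1+2+3 = 6 everywhere
]]></artwork>
        </figure>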

        <t>In a distributed environment, the network plays a crucial role.
        A large number of model parameters and gradients must be exchanged
        frequently during computation, which places high demands on network
        bandwidth and latency. In large-scale cluster scenarios, the design
        of the network topology and the choice of communication framework
        cannot be ignored. Only in a high-bandwidth, low-latency network
        environment, combined with an appropriate communication library
        (such as NCCL or MPI), can a cluster fully exploit its computing
        potential and prevent communication from becoming the global
        performance bottleneck.</t>

      </section>

      <section title="Distributed Micro model Service Flow">
        <t>In the distributed AI micro-model computing power scheduling
        service architecture, the core of the business process is how to
        realize the multi-node layout and collaborative work of the model to
        ensure efficient parameter synchronization and communication.
        Typically, a model is trained and evaluated by a data scientist or
        algorithm engineer using a deep learning framework during development,
        and then container-ized or mirrored to package the model and its
        dependencies into a service that can be deployed independently. Then,
        these encapsulated model services are registered to the system's
        microservice management platform for subsequent unified scheduling and
        access. As AI models evolve rapidly, version management and grayscale
        releases are the norm. Small validation or quick rollback of new
        versions while keeping old ones online can minimize risk and ensure
        user experience.</t>
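
        <t>The following sketch shows how a packaged micro-model might be
        exposed as an independently deployable HTTP service. The Flask
        framework, the /v1/predict route, and the load_model() helper are
        assumptions made for illustration; the architecture itself does not
        mandate a particular web framework.</t>

        <figure>
          <artwork><![CDATA[
# Illustrative sketch: exposing a packaged micro-model as an HTTP
# microservice. Flask, the /v1/predict route, and load_model() are
# assumptions for illustration, not mandated by the architecture.
from flask import Flask, jsonify, request

app = Flask(__name__)

MODEL_VERSION = "1.0.0"   # surfaced so a gateway can route by version

def load_model():
    # Placeholder: in practice this would deserialize the micro-model
    # shipped inside the container image.
    return lambda features: sum(features)

model = load_model()

@app.route("/v1/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    return jsonify({"version": MODEL_VERSION,
                    "prediction": model(features)})

@app.route("/healthz")
def healthz():
    # Liveness endpoint used by the orchestrator for failover.
    return jsonify({"status": "ok", "version": MODEL_VERSION})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
]]></artwork>
        </figure>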

        <t>When the model is deployed to a distributed cluster, computing
        power orchestration and resource scheduling allocate computing
        resources such as GPUs or CPUs according to real-time load,
        business priority, and hardware topology, and container
        orchestration tools (such as Kubernetes) start the corresponding
        service instances on each node. When distributed cooperation is
        needed, frameworks such as NCCL or Horovod are used for
        inter-process communication. For inference scenarios, requests from
        upper-layer business systems or users usually arrive at an API
        gateway or service gateway first and are then dispatched to the
        target service instance according to load balancing or other
        routing policies. If distributed inference is needed, multiple
        nodes cooperate to perform segmented inference over the partitioned
        model, aggregate the results, and return them to the requester. For
        the training scenario, when a distributed training task is
        triggered, the scheduler allocates several worker nodes to the
        training job, which together complete data loading, forward and
        backward propagation, and AllReduce or All-to-All communication to
        synchronize model parameter updates. After training completes, the
        new model version is saved to the model repository or a
        corresponding storage medium, triggering the subsequent model
        release process.</t>
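
        <t>As a sketch of the training-side flow, the following worker
        skeleton uses the torch.distributed package to initialize a process
        group and synchronize gradients with AllReduce on every step. The
        backend choice, model shape, and launch method (for example,
        torchrun) are illustrative assumptions.</t>

        <figure>
          <artwork><![CDATA[
# Illustrative worker skeleton for the distributed training flow.
# Assumes a launcher such as torchrun sets the rendezvous environment
# (MASTER_ADDR, RANK, WORLD_SIZE, ...); model shape is arbitrary.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")   # "nccl" on GPU clusters
    world = dist.get_world_size()

    model = torch.nn.Linear(16, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        x = torch.randn(32, 16)        # each worker loads its shard
        loss = model(x).pow(2).mean()  # stand-in objective
        opt.zero_grad()
        loss.backward()
        # AllReduce: average gradients so every replica applies the
        # same synchronized update.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
]]></artwork>
        </figure>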

        <t>In this process, real-time monitoring and the elastic scaling
        mechanism play an important role in ensuring system stability and
        optimizing resource utilization. On the monitoring side, a unified
        data acquisition and analysis platform lets the system track core
        indicators such as GPU utilization, network traffic, and request
        latency for each service node, raise timely alarms in case of
        failures, performance bottlenecks, or insufficient resources, and
        perform automatic failover or take nodes offline. On the elastic
        scaling side, according to preset resource utilization thresholds
        or response time targets, the system dynamically enlarges the pool
        of model service instances or nodes when the number of requests
        surges or the training scale expands, and otherwise reclaims idle
        resources to keep global computing and storage cooperating
        efficiently.</t>
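
        <t>A minimal sketch of the threshold-based elastic scaling decision
        described above follows. The metric names, thresholds, and return
        convention are assumptions chosen for the example, not fixed by
        this architecture.</t>

        <figure>
          <artwork><![CDATA[
# Illustrative threshold-based scaling decision. Metric names and
# thresholds are assumptions chosen for the example.
from dataclasses import dataclass

@dataclass
class Metrics:
    gpu_utilization: float   # 0.0 - 1.0, averaged over the fleet
    p95_latency_ms: float    # 95th-percentile request latency

def scaling_decision(m, util_high=0.80, util_low=0.30,
                     latency_slo_ms=200.0):
    """Return +1 to scale out, -1 to scale in, 0 to hold."""
    if (m.gpu_utilization > util_high
            or m.p95_latency_ms > latency_slo_ms):
        return +1   # demand surge: add a service instance
    if (m.gpu_utilization < util_low
            and m.p95_latency_ms < latency_slo_ms / 2):
        return -1   # sustained idle: reclaim a replica
    return 0

assert scaling_decision(Metrics(0.92, 310.0)) == +1  # scale out
assert scaling_decision(Metrics(0.12, 40.0)) == -1   # scale in
]]></artwork>
        </figure>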

        <t>In addition, the distributed micro-model business flow needs to
        be combined with a data backflow mechanism. The large volume of
        logs, user feedback, and interaction information generated during
        inference can, provided privacy and compliance requirements are
        met, be returned to the data platform and further used to train new
        models or to optimize the performance of existing ones.</t>
      </section>
    </section>

    <section title="Key issues and challenges">
      <section title=" Balancing Compute and Network Resources under Constraints">
        <t>As AI models grow in scale and business demands intensify,
        single-node or single-cluster architectures often struggle to support
        high-intensity training and inference tasks. This leads to limitations
        in computational power or significant cost surges. Distributed
        training has emerged as a necessary approach to address these
        challenges by enabling the coordination of computing resources across
        multiple nodes and regions, thereby improving overall efficiency and
        fault tolerance. However, distributed deployment also introduces
        considerable complexity, such as handling heterogeneous hardware
        differences (e.g., GPU, CPU, FPGA) and balancing resource allocation
        across diverse network topologies and bandwidth conditions.</t>

        <t>One of the key difficulties lies in managing resource scarcity
        effectively. Dynamic scheduling and allocation must account for
        factors such as business priority, model scale, and real-time workload
        conditions. Strategies such as priority-based queuing, elastic
        scaling, and cross-cluster resource collaboration are crucial to
        maximizing service efficiency under these constraints. However,
        implementing these strategies often depends on sophisticated
        partitioning and parallelism approaches.</t>
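
        <t>The following sketch illustrates the priority-based queuing
        strategy mentioned above: jobs are dispatched in (priority,
        arrival) order and only when enough accelerators are free. The
        field names and GPU pool size are assumptions for the example.</t>

        <figure>
          <artwork><![CDATA[
# Illustrative priority-based queuing: jobs start in (priority,
# arrival) order, and only when enough GPUs are free. Field names
# and pool size are assumptions for the example.
import heapq
import itertools

class PriorityScheduler:
    def __init__(self, total_gpus):
        self.free_gpus = total_gpus
        self.queue = []                  # (priority, seq, job)
        self._seq = itertools.count()    # FIFO tie-breaker

    def submit(self, name, gpus, priority):
        # Lower number = higher priority.
        heapq.heappush(self.queue, (priority, next(self._seq),
                                    {"name": name, "gpus": gpus}))

    def dispatch(self):
        """Start every queued job, in priority order, that fits the
        free GPU pool; return the names of the started jobs."""
        started, deferred = [], []
        while self.queue:
            prio, seq, job = heapq.heappop(self.queue)
            if job["gpus"] <= self.free_gpus:
                self.free_gpus -= job["gpus"]
                started.append(job["name"])
            else:
                deferred.append((prio, seq, job))
        for item in deferred:            # keep unfit jobs queued
            heapq.heappush(self.queue, item)
        return started

sched = PriorityScheduler(total_gpus=8)
sched.submit("online-inference", gpus=2, priority=0)
sched.submit("nightly-training", gpus=8, priority=5)
assert sched.dispatch() == ["online-inference"]   # training waits
]]></artwork>
        </figure>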

        <t>In distributed training, model partitioning and parallelism play a
        pivotal role. By employing techniques like tensor slicing or computing
        power pipelining, models can be decomposed and distributed across
        multiple nodes, with each node handling specific submodules or slices.
        This approach is particularly effective in training scenarios, where
        workload distribution ensures that no single server becomes a
        bottleneck. Similarly, in inference scenarios, input data can flow
        through a sequence of model microservices in a pipelined processing
        framework, which helps to maximize the utilization of scattered
        computing resources.</t>
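
        <t>As an illustration of layer-based partitioning, the following
        sketch splits a stack of layers into contiguous stages, one per
        node, so that inference can flow through them as a pipeline. The
        layer sizes and stage count are assumptions for the example.</t>

        <figure>
          <artwork><![CDATA[
# Illustrative layer-based partitioning: split a layer stack into
# contiguous stages, one per node, for pipelined execution. Layer
# sizes and stage count are assumptions for the example.
import torch
import torch.nn as nn

def partition_layers(layers, num_stages):
    """Split a layer list into num_stages contiguous micro-models."""
    per_stage = (len(layers) + num_stages - 1) // num_stages
    return [nn.Sequential(*layers[i:i + per_stage])
            for i in range(0, len(layers), per_stage)]

layers = [nn.Linear(64, 64) for _ in range(8)]
stages = partition_layers(layers, num_stages=4)  # 2 layers per stage

# Pipelined inference: each stage would run on its own node, with the
# intermediate activation shipped over the network between stages.
x = torch.randn(1, 64)
for stage in stages:          # stands in for node-to-node hand-off
    x = stage(x)
print(x.shape)                # torch.Size([1, 64])
]]></artwork>
        </figure>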

        <t>Despite the potential benefits, these strategies are not without
        challenges. Distributed training inherently requires efficient
        synchronization and communication between nodes, which are further
        constrained by network resource availability. Moreover, achieving
        balance between computational and network resources demands meticulous
        planning and real-time adaptation. While model partitioning and
        parallel execution help alleviate pressure on individual servers and
        utilize idle nodes more effectively, they also add layers of
        complexity to resource coordination in distributed environments.</t>

      </section>

      <section title=" Data Collaboration Challenges under Block Isolation">
        <t>In distributed systems, large-scale data is often divided into
        multiple blocks that are stored and processed separately. While this
        improves data security and processing efficiency, it introduces
        significant challenges for data collaboration. When multiple nodes or
        microservice modules need to share or exchange data, strict
        coordination is required, including defining interfaces and call
        sequences in advance and managing consistency and concurrency control.
        The complexity increases further when cross-node dependencies exist
        between data blocks, making the scheduling, loading, and distribution
        of data one of the primary bottlenecks for system scalability and
        computational efficiency.</t>

        <t>A key difficulty lies in synchronizing data across distributed
        nodes while minimizing latency and avoiding bottlenecks. Cross-node
        dependencies require precise scheduling to ensure data arrives at the
        correct location and time without conflicts. As the scale of data and
        the number of nodes grow, the management overhead for maintaining
        these dependencies can increase exponentially, particularly when
        network bandwidth or latency constraints exacerbate delays.
        Additionally, ensuring data consistency across multiple data blocks
        during concurrent access or updates adds another layer of complexity.
        High levels of concurrency can increase the risk of inconsistencies,
        data races, and synchronization issues, demanding advanced mechanisms
        to enforce data integrity.</t>

        <t>Traditional distributed communication strategies, such as AllReduce
        and All-to-All, are widely used and remain effective in addressing
        certain data collaboration needs in training and inference tasks. For
        example, AllReduce is well-suited for data parallel scenarios, where
        all nodes compute on the same model with different data splits, and
        gradients or weights are synchronized via aggregation and broadcast.
        Similarly, All-to-All is valuable in more complex distributed tasks
        that require frequent intermediate data exchanges across nodes.
        However, these methods are not without limitations. As data and system
        complexity grow, they can lead to increased communication overhead,
        especially in scenarios where synchronization is uneven or poorly
        timed.</t>
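
        <t>To make the distinction concrete, the following single-process
        sketch simulates the All-to-All exchange: each of p nodes starts
        with p outgoing chunks and ends holding one chunk from every peer.
        Real deployments would instead use a collective such as
        torch.distributed.all_to_all over an NCCL or MPI backend.</t>

        <figure>
          <artwork><![CDATA[
# Single-process simulation of All-to-All (illustrative). Node i
# starts with p outgoing chunks, chunks[i][j] destined for node j,
# and ends holding one chunk from every peer.

def all_to_all(chunks):
    """Return received[dst][src] == chunks[src][dst] for all pairs."""
    p = len(chunks)
    return [[chunks[src][dst] for src in range(p)]
            for dst in range(p)]

before = [[f"{i}->{j}" for j in range(3)] for i in range(3)]
after = all_to_all(before)
assert after[2] == ["0->2", "1->2", "2->2"]   # node 2's inbox
]]></artwork>
        </figure>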

        <t>The effectiveness of traditional methods depends on careful tuning
        and precise execution. Poorly timed data exchanges can result in
        prolonged waiting times, underutilized resources, or even data
        mismatches. Although methods like AllReduce and All-to-All offer
        reliable frameworks for communication, their scalability and
        efficiency are often constrained by the challenges of cross-node
        synchronization, network variability, and system heterogeneity. These
        limitations highlight the need for continuous refinement and
        innovation in distributed communication and data collaboration
        strategies to overcome the challenges posed by block isolation.</t>
      </section>
    </section>

    <section title="Distributed solution based on model segmentation">
      <section title="Service layer">

      <t>The service layer serves as the central hub of a distributed AI
      system, connecting the front-end, business logic, and microservices to
      enable efficient interaction and seamless workflows. It hosts the core
      service logic, processes user and business requests, and coordinates the
      collaboration of multiple components.</t>

      <t>At the front-end layer, the service layer interacts with user-facing
      interfaces, which handle tasks such as user authentication, data input,
      and result presentation. These interfaces act as the entry point for
      system requests, routing them through APIs provided by a service
      gateway. The gateway manages external request routing, authentication,
      protocol translation, and load balancing to ensure smooth and efficient
      communication between the user interface and the back-end services.</t>
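
      <t>The following sketch illustrates the gateway's routing role
      described above: requests enter at a single point and are spread
      across healthy back-end micro-model instances. The instance list,
      health flags, and round-robin policy are assumptions for the
      example.</t>

      <figure>
        <artwork><![CDATA[
# Illustrative service gateway: round-robin routing across healthy
# back-end micro-model instances. Instance URLs and health flags are
# assumptions for the example.
import itertools

class ServiceGateway:
    def __init__(self, instances):
        # instances: list of (url, healthy) pairs, kept fresh by the
        # monitoring plane in a real deployment.
        self.instances = instances
        self._rr = itertools.cycle(range(len(instances)))

    def route(self):
        """Pick the next healthy instance; raise if none remain."""
        for _ in range(len(self.instances)):
            i = next(self._rr)
            url, healthy = self.instances[i]
            if healthy:
                return url
        raise RuntimeError("no healthy micro-model instance")

gw = ServiceGateway([("http://node-a:8080", True),
                     ("http://node-b:8080", False),  # failed over
                     ("http://node-c:8080", True)])
print(gw.route(), gw.route(), gw.route())
# -> node-a, node-c, node-a
]]></artwork>
      </figure>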
      </section>
    </section>

    <section anchor="iana" title="IANA Considerations">
      <t>TBD</t>
    </section>

    <section title="Acknowledgement">
      <t>TBD</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">&rfc2119;</references>
  </back>
</rfc>
