1 - Ant Group's Large-Scale Practice in Cost Reduction and Efficiency Improvement with Serverless

Koupleless’s practice in Ant Group for large-scale Serverless deployment to reduce costs and improve efficiency

Authors: Liu Yu, Zhao Zhenling, Liu Jing, Dai Wei, Sun Ren’en, et al.

Pain Points in Ant Group’s Business Operations

Over the past 20 years, Ant Group has experienced rapid evolution in microservices architecture, alongside explosive growth in the number and complexity of applications, leading to significant cost and efficiency issues:

  1. A large number of long-tail applications run at a CPU utilization below 10%, yet consume substantial resources because of multi-region high-availability requirements.
  2. Application builds and deployments are slow, averaging 10 minutes, which lowers development efficiency and rules out rapid scaling.
  3. Collaborative development forces features to be bundled and released together in a “catching the train” manner, so iterations block one another and collaboration and delivery are inefficient.
  4. Upgrades to business SDKs and some frameworks significantly disturb business teams, preventing infrastructure changes from staying low-impact or invisible to the business.
  5. Business assets are difficult to capitalize on and reuse, making it costly to build business middle platforms.

Use Cases of Koupleless in Ant Group

Consolidated Deployment for Cost Reduction

In enterprises it is commonly observed that 80% of applications (the long tail) serve only 20% of the traffic, and Ant Group is no exception.
Within Ant Group there is a large number of long-tail applications, each requiring at least three environments: pre-release, gray release, and production. Each environment must be deployed across at least three data centers, and each data center keeps at least two machines for high availability. As a result, many of these long-tail applications run at a CPU utilization below 10%.
By leveraging Koupleless, Ant Group streamlined its server infrastructure for long-tail applications, utilizing class delegation isolation, resource monitoring, and log monitoring technologies. This approach enabled the consolidated deployment of multiple applications, significantly reducing operational and resource costs while ensuring stability.
(Figure: Consolidated deployment; trimming machines)
This approach allows small applications to skip the traditional processes of creating a new application and requesting machines: they can be deployed directly onto a shared business platform, enabling rapid innovation for low-traffic services.

Modular Development for Ultimate Efficiency Improvement

Within Ant Group, many departments have applications with very large developer teams. With so many people, contention for environments, integration testing slots, and test resources has been severe, creating mutual blockages in which one person's delay holds up many others and making requirement delivery inefficient.
By using Koupleless, Ant Group has gradually refactored applications with large numbers of collaborators into foundation code plus modules for the different functional areas. The foundation code consolidates the various SDKs and common business interfaces and is maintained by dedicated personnel, while module code encapsulates the business logic specific to one functional domain and can call the foundation interfaces locally. Modules use hot deployment to achieve builds, releases, and scaling within tens of seconds, and module developers do not need to care about servers or infrastructure at all, giving ordinary applications a Serverless development experience at a very low adoption cost.
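The split can be pictured with the minimal sketch below. All names are hypothetical; in a real Koupleless module the reference to a foundation bean is normally obtained through the base-bean export/import mechanism described in the second case study, and a plain @Autowired is used here only to keep the sketch short.

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical interface defined in the foundation (base) code and exposed there as a Spring bean.
public interface AccountQueryFacade {
    boolean isActive(String userId);
}

// Hypothetical module-side controller: the business logic lives in the module, calls the
// foundation interface as a plain in-process dependency, and can be hot-deployed on its own.
@RestController
class CampaignController {

    @Autowired
    private AccountQueryFacade accountQueryFacade; // implementation provided by the foundation

    @GetMapping("/campaign/eligibility/{userId}")
    public boolean eligible(@PathVariable String userId) {
        return accountQueryFacade.isActive(userId);
    }
}
```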
Taking the finance business of Ant Group as an example, by decomposing applications into a foundation and multiple modules, it has achieved significant efficiency improvements in release operations, organizational collaboration, and cluster traffic isolation across multiple dimensions.
(Figure: Modular development speeds up delivery and improves efficiency)

For details on the evolution and practice of the Koupleless architecture in Ant Group's financial business, see: https://mp.weixin.qq.com/s/uN0SyzkW_elYIwi03gis-Q

Universal Base to Shield Infrastructure

Within Ant Group, frequent SDK upgrades and slow build/release processes have long been pain points. Leveraging the Koupleless universal base mode, Ant Group has enabled some applications to receive infrastructure upgrades with almost no perceptible impact, while reducing application build and release time from 600 seconds to 90 seconds.

(Figure: Shielding business teams from infrastructure)

In the Koupleless universal base mode, the base is started in advance and already contains common middleware, second-party, and third-party SDKs. Using the Koupleless build plugin, business applications are built into FatJar packages. For a new version release, the scheduler deploys the FatJar onto an empty base (one with no module installed), and servers still running old modules are asynchronously replaced with fresh empty bases.
A dedicated team maintains and upgrades the base, offering developers seamless infrastructure upgrades and a fast build and release experience.

Cost-Effective and Efficient Middle Platforms

Within Ant Group there are numerous middle-platform services; typical examples include the strategy, marketing, charity, search & recommendation, and advertising platforms of the various business lines. With Koupleless, these middle-platform services have gradually evolved toward a base + module delivery model. In this architecture, the base code consolidates common logic and defines a number of Service Provider Interfaces (SPIs), while modules are responsible for implementing these SPIs; traffic enters through the base code, which then calls the modules' SPI implementations.
In these middle-platform scenarios, modules are generally very lightweight, sometimes just a snippet of code. Most modules can be deployed and scaled within 5 seconds, and module developers do not need to concern themselves with the underlying infrastructure, enjoying the ultimate Serverless development experience.
Taking Ant Group's search and recommendation middle-platform service as an example: the service sinks common dependencies, general logic, and the workflow engine into the base and defines a set of SPIs, while the search and recommendation algorithms are implemented by individual module developers. The service has now integrated more than 1,000 modules, with an average time from code to deployment of less than one day, truly achieving a “write in the morning, go live in the evening” capability.
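This SPI-style contract can be sketched as follows. The names are illustrative (the real search and recommendation SPIs are internal to Ant Group): the base defines the interface and drives it from its workflow engine, while each hot-deployed module ships an implementation.

```java
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical SPI defined by the base and invoked from its workflow engine.
public interface RankingStrategy {
    List<String> rank(String userId, List<String> candidateItemIds);
}

// Hypothetical module-side implementation. With plain Java SPI it would be registered in
// META-INF/services/RankingStrategy; Koupleless-based middle platforms may use their own registry.
class RecencyRankingStrategy implements RankingStrategy {
    @Override
    public List<String> rank(String userId, List<String> candidateItemIds) {
        return candidateItemIds; // placeholder: keep the candidate order unchanged
    }
}

// Base-side lookup sketch: resolving implementations against the module's ClassLoader makes
// the implementations contributed by hot-deployed modules visible to the base.
class RankingStrategyLocator {
    static Iterable<RankingStrategy> implementationsFrom(ClassLoader moduleClassLoader) {
        return ServiceLoader.load(RankingStrategy.class, moduleClassLoader);
    }
}
```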

(Figure: Code goes live within one day)

Conclusion and Plans

After more than five years of evolution and refinement, Koupleless has been adopted across all business lines within Ant Group, supporting a quarter of the group's online traffic and delivering significant cost reduction and efficiency gains. Ant Group plans to further promote the Koupleless development model and to continue building out elasticity capabilities for an even better elastic experience and greener, lower-carbon operations. It will also keep contributing these capabilities to open source, aiming to build a first-class modular technology system together with the community, drive technical value creation across industries, and help enterprises reduce costs and improve efficiency.

2 - Alibaba International Digital Commerce Group: Middle-Platform Business Efficiency Tripled

Koupleless at Alibaba International Digital Commerce Group: middle-platform business efficiency tripled

Authors: Zhu Lin (Feng Yuan), Zhang Jianming (Ming Men)

Project Background

Over the past few years, the AIDC (Alibaba International Digital Commerce) business division has expanded into multiple countries and regions around the world. Its international e-commerce business follows two models, “cross-border” and “local-to-local,” carried by e-commerce platforms such as AE (AliExpress, cross-border), Lazada, Daraz, Miravia, and Trendyol. These e-commerce platforms are collectively referred to as “sites.”

(Figure: Background of Alibaba International Digital Commerce)

Across the e-commerce business, the core buyer and seller foundational links differ somewhat between sites, but they have far more in common. Abstracting a universal platform that can be reused at low cost across sites makes it possible to support the upper-layer businesses more efficiently. Over the past few years the foundational links have therefore been built in a platform-based way, following a model of one global business platform plus N business sites. The technical architecture has gone through five stages of iteration, evolving gradually from an initial centralized model in which the middle platform integrates the businesses to a decentralized model in which the businesses integrate the platform, and it can now largely support closed-loop iteration for both the global sites' businesses and the platform itself.

(Figure: Global business platform and business sites)

Logically, each site builds its personalized customization on top of the international middle platform, while in delivery and operations each site is split into an independent application that carries its own business traffic. Platform capabilities are integrated into the site applications as second-party packages, and the platform also provides an extension mechanism: site developers can override platform logic inside their site applications. This maximizes the autonomy of site business development and operations and, to a certain extent, preserves the reusability of platform capabilities. However, because the e-commerce sites are at different stages of development, the business models of local-to-local and cross-border differ, and business strategies keep changing, a contradiction has gradually formed between rapid business iteration and the after-the-fact consolidation of platform capabilities, mainly manifested in the following aspects:

  • Platform redundancy: because the platform follows an open, to-be-integrated strategy without hard constraints, demand iterations, even those that require changes to platform logic, are mostly closed off within individual sites. Work on platform capability consolidation, stability, performance, and openness is duplicated across sites, and the platform versions backing different sites drift further and further apart;
  • High site maintenance costs: each independently siloed site application maintains its own customized platform capabilities and takes on part of the “platform team's responsibilities,” which gradually increases the burden on site development teams and drives up labor costs;
  • Low development iteration efficiency: core applications build and deploy slowly. Taking the transaction site applications as an example, system startup stabilizes at 300s+, compilation at 150s+, image builds at 30s+, and container re-initialization and other scheduling-layer overhead at about 2 minutes. With more than 100 deployments per day in the development environment, cutting build and deployment time would markedly reduce development waiting time;

The next generation of the architecture therefore has to solve how, under a decentralized, business-integrated model, capability iteration can remain autonomous while platform versions stay unified. It also has to further reduce site development and operations costs and improve build and deployment efficiency, so that business developers can truly focus on customizing their own business logic. The Serverless philosophy emphasizes separation of concerns: business developers concentrate on business logic without paying much attention to the underlying platform. Applied to the problems above, it points to a promising solution: upgrade the platform from a set of second-party packages into a platform base application that is iterated in a unified way (including upgrades of the application runtime), and make the business site applications lightweight so that they focus only on customized logic, improving deployment efficiency and reducing maintenance costs. The overall logical architecture is as follows:

(Figure: Alibaba International R&D pain points)

Concept Elaboration

Serverless is commonly understood as “serverless architecture.” It is one of the core cloud-native technologies: users do not need to manage application operation and maintenance, and can develop and run applications without managing the underlying infrastructure. Cloud providers provision, configure, and manage the underlying compute infrastructure, decoupling applications from it, so developers can focus more on business logic, improving delivery capability and reducing cost. In implementation, traditional Serverless is essentially FaaS + BaaS: FaaS (Function as a Service) carries code snippets (functions) that can be created, invoked, and destroyed anytime, anywhere, and hold no state of their own; combined with BaaS (Backend as a Service), the two together realize the complete behavior of a Serverless service.

(Figure: The Serverless concept)

Under the traditional Serverless technology system, Java application architectures mostly benefit at the IaaS and containerization layers; Serverless itself cannot extend its coverage down into the JVM. For a complex Java monolithic application, however, the Serverless philosophy can be applied to further separate the business code in the Java stack from its infrastructure (middleware) dependencies. The Serverless transformation in this practice can be abstracted into the following process and objectives:

Java Serverless

Horizontally split a monolithic application into two layers:

  • Base: Some components and Lib packages that do not change frequently in business application iterations are sunk into a new application, which we call the base application, with the following characteristics:
    • The base can be published and maintained independently
    • Base application developers can uniformly upgrade middleware and component versions, without the upper layer App needing to be aware of the entire upgrade process, provided that compatibility is ensured
    • The base has reusability across different sites; a trading base can be shared by different site Apps like AE, Lazada, Daraz, etc.
  • Serverless App: to minimize the cost of business transformation, the App maintained by the business team keeps its independent release and operations responsibilities. After the Serverless transformation, business developers only need to focus on business code; the business application remains the externally visible service carrier (the JVM process).

Technical Implementation

(Figure: Alibaba International Serverless evolution)

The implementation process of the Serverless architecture evolution is divided into two parts:

  1. Redesign the application architecture layering and responsibility division under the Serverless architecture model to reduce the burden on the business and improve the efficiency of SRE (Site Reliability Engineering).
  2. Adopt new development frameworks, delivery models, and release & operations products in the areas of R&D, publishing, and operations to support rapid business iteration.

Application Architecture

Taking the Daraz foundational link as an example, the application architecture’s layered structure, interaction relationships, and team responsibilities are as follows:

(Figure: Alibaba International Serverless application architecture)

We logically layer the supporting architecture required for the complete delivery of a Serverless application and divide the development responsibilities, clearly defining the protocol standards for interaction between the App and the base.

Development Domain

(Figure: Alibaba International Serverless development and operations platform)

  • Constructed a Serverless runtime framework to drive the operation and interaction of “Base-Serverless App”
  • Collaborated with the Aone development platform team to build a complete set of release & operations product systems for the base and App under the Serverless model

voyager-serverless framework

voyager-serverless

The voyager-serverless framework is a self-developed R&D framework built on Koupleless technology. It provides a Serverless programming interface that allows business Apps to be dynamically loaded into a running base container (the ArkContainer). On top of Koupleless's module isolation capability, we have made in-depth customizations for the Alibaba Group technology stack.
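Because voyager-serverless is built on Koupleless, which in turn builds on SOFA Ark, the dynamic loading it describes can be pictured with the open-source client API. The sketch below is only our reading of that open-source capability, not the voyager-serverless API itself, and the exact SOFA Ark signatures may differ between versions.

```java
import java.io.File;

import com.alipay.sofa.ark.api.ArkClient;
import com.alipay.sofa.ark.api.ClientResponse;

// Sketch: install a business App artifact (an ark-biz jar) into the already-running base container.
public class BizInstaller {
    public static void install(String bizJarPath) throws Throwable {
        ClientResponse response = ArkClient.installBiz(new File(bizJarPath));
        System.out.println("install result: " + response.getMessage());
    }
}
```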

The entire framework provides the following key capabilities:

(Figure: Key capabilities of the Alibaba International Serverless framework)

Isolation and Runtime Principles

(Figure: Serverless isolation and runtime principles)

The framework implements ClassLoader isolation and SpringContext isolation between the base and the application modules. The startup process is divided into two phases and three steps, executed from the bottom up (a simplified code sketch follows the list):

  • Phase One: Base Startup
    • Step One: bootstrap startup, including the Kondyle and Pandora containers, which load Kondyle plugin and Pandora middleware plugin classes and objects
    • Step Two: Base application startup, internally ordered as follows:
      • Start ArkContainer, initialize Serverless-related components
      • Base application SpringContext initialization, loading base-owned classes, base Plugin classes, dependency package classes, middleware SDK classes, etc.
  • Phase Two: App Startup
    • Step Three: Serverless App startup, where the ArkContainer internal component accepts Fiber scheduling requests to download App artifacts and trigger App Launch
      • Create a BizClassLoader and set it as the thread context ClassLoader, then initialize the App's SpringContext, loading App-owned classes, base Plugin classes, dependency package classes, middleware SDK classes, etc.
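The two isolation dimensions can be pictured with the simplified sketch below (illustrative only; the real ArkContainer and BizClassLoader are considerably more involved): a dedicated ClassLoader is created for the App, set as the thread context ClassLoader, and a separate SpringContext is bootstrapped against it.

```java
import java.net.URL;
import java.net.URLClassLoader;

import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.context.ConfigurableApplicationContext;
import org.springframework.core.io.DefaultResourceLoader;

// Simplified sketch of step three: launching an App inside an already-running base JVM.
public class AppLauncher {

    public static ConfigurableApplicationContext launch(URL[] appJarUrls,
                                                        ClassLoader baseClassLoader,
                                                        Class<?> appBootClass,
                                                        String[] args) {
        // A dedicated ClassLoader for the App. In the real framework, delegation to the base is
        // selective (plugin-exported packages only), not the plain parent delegation used here.
        URLClassLoader bizClassLoader = new URLClassLoader(appJarUrls, baseClassLoader);

        ClassLoader previous = Thread.currentThread().getContextClassLoader();
        Thread.currentThread().setContextClassLoader(bizClassLoader);
        try {
            // A separate SpringContext for the App, resolving resources through the App ClassLoader,
            // so base beans and App beans live in fully isolated contexts.
            return new SpringApplicationBuilder(appBootClass)
                    .resourceLoader(new DefaultResourceLoader(bizClassLoader))
                    .run(args);
        } finally {
            Thread.currentThread().setContextClassLoader(previous);
        }
    }
}
```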

Communication Mechanism

In the Serverless mode, communication between the base and App can be achieved through in-process communication. Currently, two communication models are provided: SPI and Base Bean Service Export.

SPI is essentially an internal special implementation based on the standard Java SPI extension, which is not elaborated further in this article. Here, we focus on introducing Base Bean Service Export.

In general, the SpringContext of the base and the SpringContext of the App are completely isolated, with no parent-child relationship, so the App side cannot reference beans in the underlying base SpringContext through a regular @Autowired. In addition to sinking classes into the base, in some scenarios the base can also declare and expose its already-initialized bean objects for use by the upper-level App. The App can then obtain these initialized beans from the base SpringContext directly at startup, which also speeds up App startup. The process is as follows:

(Figure: Speeding up Java application startup)

  1. Users declare the beans that need to be exported in the base either through configuration or annotation.
  2. After the base startup is complete, the isolated container will automatically export the beans marked by the user to a buffer area, waiting for the App to start.
  3. When the App starts on the base, the App's SpringContext reads the beans that need to be imported from the buffer area during its initialization phase.
  4. Other components in the App's SpringContext can then @Autowired these exported beans normally (a minimal sketch follows this list).
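A minimal sketch of this export/import flow is shown below. It is purely illustrative: the class names are hypothetical, and the actual Koupleless/voyager-serverless mechanism drives registration and injection through the framework rather than through a hand-written static buffer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.beans.factory.config.ConfigurableListableBeanFactory;

// Hypothetical "buffer area" shared between the base and App class spaces. In the real
// framework this lives in classes owned by the base/ArkContainer, not in user code.
final class ExportedBeanBuffer {
    private static final Map<String, Object> EXPORTED = new ConcurrentHashMap<>();

    static void export(String name, Object bean) {    // filled after base startup (step 2)
        EXPORTED.put(name, bean);
    }

    static Object resolve(String name) {               // read during App context init (step 3)
        return EXPORTED.get(name);
    }
}

// Base side (steps 1-2): beans marked via configuration or annotation are published to the buffer.
class BaseBeanExporter {
    void exportRateLimiter(Object rateLimiterBean) {
        ExportedBeanBuffer.export("rateLimiter", rateLimiterBean);
    }
}

// App side (steps 3-4): during the App SpringContext initialization, the imported object is
// registered as a singleton so that other App beans can @Autowired it normally.
class AppBeanImporter {
    void importInto(ConfigurableListableBeanFactory appBeanFactory) {
        appBeanFactory.registerSingleton("rateLimiter", ExportedBeanBuffer.resolve("rateLimiter"));
    }
}
```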

Plugin Mechanism

The Serverless plugin mechanism ensures that classes required by the App at runtime are, by default, loaded from the base. The framework supports packaging the SDKs and second-party packages that the platform base provides for upper-level Apps into a plugin (an Ark Plugin), ultimately sinking the packages controlled by the middle platform into the base without requiring any changes to the upper-level business:

(Figure: Serverless plugin mechanism)
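The effect of declaring packages as plugin-exported can be sketched as a delegation rule inside the App's ClassLoader (simplified; the real Koupleless/SOFA Ark BizClassLoader has additional resolution stages and caching):

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Set;

// Simplified delegation sketch: classes in packages exported by base plugins are resolved
// against the base ClassLoader; everything else is loaded from the module's own jars.
class DelegatingBizClassLoader extends URLClassLoader {

    private final ClassLoader baseClassLoader;
    private final Set<String> exportedPackagePrefixes; // e.g. "com.example.middleware."

    DelegatingBizClassLoader(URL[] moduleJarUrls, ClassLoader baseClassLoader,
                             Set<String> exportedPackagePrefixes) {
        super(moduleJarUrls, null); // no implicit parent delegation: isolated by default
        this.baseClassLoader = baseClassLoader;
        this.exportedPackagePrefixes = exportedPackagePrefixes;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        for (String prefix : exportedPackagePrefixes) {
            if (name.startsWith(prefix)) {
                return baseClassLoader.loadClass(name); // sunk SDK classes come from the base
            }
        }
        return super.loadClass(name, resolve);          // App-owned classes come from the module jars
    }
}
```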

Middleware Adaptation

In the evolution of Serverless architecture, as the startup process of a complete application is split into base startup and App startup, the initialization logic of related middleware in phases one and two has also changed. We have tested and adapted commonly used middleware and product components on the international side. In summary, most issues arise from the fact that some middleware processes are not designed for scenarios with multiple ClassLoaders. Many classes/methods do not pass the ClassLoader object as a parameter, causing errors when initializing model objects, leading to abnormal context interactions.
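A typical adaptation looks like the sketch below (illustrative; the concrete middleware APIs differ): reflective lookups that implicitly use the SDK's own defining ClassLoader are changed to resolve against an explicitly passed or thread-context ClassLoader, so that model classes defined in the App can be found even though the SDK itself lives in the base.

```java
// Illustrative sketch of a middleware adaptation for multi-ClassLoader scenarios.
final class ClassLoaderAwareResolver {

    // Before: Class.forName without an explicit loader uses the caller's defining ClassLoader.
    // When the middleware SDK is sunk into the base, App-defined model classes are invisible here.
    static Class<?> resolveModelClassBroken(String className) throws ClassNotFoundException {
        return Class.forName(className);
    }

    // After: resolve against the caller-supplied ClassLoader, falling back to the thread context
    // ClassLoader (the App's BizClassLoader at App runtime).
    static Class<?> resolveModelClass(String className, ClassLoader callerClassLoader)
            throws ClassNotFoundException {
        ClassLoader loader = (callerClassLoader != null)
                ? callerClassLoader
                : Thread.currentThread().getContextClassLoader();
        return Class.forName(className, true, loader);
    }
}
```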

Development Support

We also provide a complete and easy-to-use set of supporting tools to facilitate developers in quickly integrating into the Serverless ecosystem:

(Figure: Serverless development tooling)

Release & Operations Domain

In addition to the development domain, the Serverless architecture also brings many new changes to the release and operations domain. Firstly, there is the splitting of development and operations layers, achieving separation of concerns and reducing development complexity:

(Figure: Serverless release and operations tooling)

  • Logical splitting: split the original application, isolating business code from infrastructure and sinking basic capabilities. For example, time-consuming middleware, heavyweight second-party libraries, and second-party libraries under platform control are sunk into the base, making the business application lightweight.
  • Independent evolution: after the split, the base and the business applications iterate independently. SREs can manage and upgrade infrastructure uniformly on the base, reducing or even eliminating the upgrade cost borne by the business.

(Figure: Serverless release and operations tooling)

We also collaborate with Aone, and voyager-serverless integrates into the Aone Serverless product technology system using the OSR (Open Serverless Runtime) standard protocol. With the help of new release models and deployment strategies, significant improvements have been achieved in App building speed and startup efficiency.

(Figure: Serverless release and operations tooling)

Improvement in Build Efficiency

  • Maven Build Optimization: since many dependencies have been sunk into the ready-made base, the number of second-party packages and class files that the App needs to build is reduced, shrinking the artifact and improving build efficiency.
  • Removal of Docker Builds: Since the artifacts deployed for business Apps under Serverless mode are lightweight Fat Jars, there is no need for Docker builds.

Improvement in Release Efficiency

In Serverless mode we use Surge + streaming release instead of traditional batched releases to further improve App release efficiency.

  • Batched release: the batched strategy moves on to the next batch once each batch has brought a certain number of new nodes online. For example, with 100 nodes and 10 batches, 10 new nodes are online after the first batch, 20 after the second, and so on.
  • Surge: the Surge release strategy accelerates business releases without affecting service availability:
    1) During the release, extra nodes are added in proportion to the Surge configuration. For instance, with 10 machines and Surge set to 50%, 5 machines are first added for the release.
    2) If the base is configured with an appropriately sized buffer, these 5 machines can be taken directly from the buffer to release the new version of the code.
    3) Once the number of new-version nodes reaches the expected total (10 machines in this example), the old nodes are taken offline directly, completing the release.
    When Surge is combined with streaming release and an appropriate number of buffered Runtimes, release efficiency is maximized.
  • Waterfall batched release: in the waterfall batched strategy, all machines in a batch must be deployed and online before the next batch starts; machines within a batch deploy in parallel, while batches run sequentially. For example, with 100 machines split into 10 batches of 10 machines each, the total deployment time is:

(Figure: Serverless streaming release)

  • Surge streaming release: during the release process, additional machines are allowed to join the update. The core idea is to increase the number of machines being updated in a single round while still guaranteeing availability. For example, with 100 machines and an availability requirement of ≥ 90% (so at least 90 machines, 100 * 90%, are online at any time), release scheduling with a surge of 5% proceeds as follows:

(Figure: Serverless streaming release)

(Figure: Serverless streaming release)

Using this new release model, we are fully implementing Surge releases in the daily and staging environments where development changes are most frequent, to accelerate the deployment of business apps.

  • Before the Serverless transformation:
    • To ensure that traffic is not affected during deployment, a staging environment typically retains two machines (replica = 2) and follows traditional batched releases (batch = 2), meaning each machine is updated in turn.
    • Assume the application startup time is 5 minutes, of which loading the frequently changing business code takes 1 minute and loading the platform and middleware components takes 4 minutes.
    • With the two machines updated one after the other, the total deployment time is 5 minutes (first machine) + 5 minutes (second machine) = 10 minutes.

(Figure: Serverless speed gains)

  • After completing the Serverless transformation and adopting Surge streaming release:
    • The staging environment for the App only needs to retain one machine (replica = 1), and the base is configured with a buffer of 1, meaning one empty base is retained for scheduling use by the App.
    • In terms of release strategy, the Surge percentage for the App environment is set to 100%.
    • Since only updates to the App’s Biz code are being released, the total deployment time is 1 minute, and the total cost of machines remains unchanged throughout the process.

(Figure: Serverless speed gains)

Additionally, we have configured a certain number of base buffers in the production environment to support rapid elastic scaling of site apps.

Summary and Outlook

We have completed the Serverless upgrade of the Daraz site's transaction, marketing, fulfillment, membership, and reverse-flow (returns) applications. Significant improvements have been achieved in three metrics: build time, single-application startup time, and staging-environment deployment time. In some cases we have even achieved application startup at the 10-second level.

(Figure: Alibaba International Serverless outlook)

Both in theory and in practice, this round of Serverless architecture upgrades has brought clear benefits and efficiency gains, and it makes the rapid iteration of subsequent business Apps much easier. Because the platform code has been sunk into a base application, the platform can now be released orthogonally to the business sites, essentially achieving a unified platform version for the foundational links. The separation of concerns has also freed business developers to focus more on their own business code. There are still challenges to address, such as the maturity of the development tooling, problem diagnosis, and optimizing production-environment costs through better base configuration. We will also participate deeply in the Koupleless open-source community to explore more adoption scenarios and practical experience.

Serverless has never been a single architectural form; it is above all a philosophy and a mode of production. Understanding and applying it helps us open up new ideas and new ways of solving problems.

3 - All Koupleless Enterprise Cases

All enterprise cases of Koupleless

Enterprises that have voluntarily registered as Koupleless users, ordered by the pinyin of their names (the order implies no ranking):


阿里国际数字商业集团

Use cases: universal base enabling second-level build and release for ordinary applications and imperceptible SDK upgrades. Detailed case study



阿里健康科技(中国)有限公司

https://www.alihealth.cn/

Use case: plugin-based application development



阿里妈妈

https://www.alimama.com/

Use cases: consolidated deployment, hot deployment



北京快手科技有限公司

https://www.kuaishou.com

Use cases: consolidated deployment, hot deployment



杭州涂鸦科技有限公司

https://tuya.com/

Use cases: consolidated deployment, hot deployment



浙江政采云 - 中国招标与采购网

https://www.zbytb.com/

Use cases: consolidated deployment, hot deployment



郑州易盛信息科技有限公司

https://www.esunny.com.cn/

Use cases: consolidated deployment, hot deployment



广东正元信息技术有限公司

https://www.fizzgate.com/

Use case: hot deployment



斑马信息科技有限公司

https://www.ebanma.com/

Use cases: class isolation, consolidated deployment



成都云智天下科技股份有限公司

https://www.yunzhitx.com/

Use cases: dynamic module deployment and undeployment



蚂蚁集团

https://www.antgroup.com/

Use cases: second-level application build and release, second-level elasticity, parallel iteration, and consolidated deployment, achieving resource cost reduction and R&D efficiency improvement. Detailed case study



南京爱福路汽车科技有限公司

https://www.f6car.cn/

Use case: consolidated deployment of multiple applications for cost reduction and efficiency improvement



深圳市诺安赛威科技有限公司

https://company.rfidworld.com.cn/Intro-137503.aspx

Use cases: dynamic hot deployment, hot undeployment, module isolation



山东网聪信息科技有限公司

https://www.gridnt.cn/

Use case: class isolation



宋城独木桥网络有限公司

https://www.dmqwl.com/

Use cases: dynamic hot deployment, hot undeployment, module isolation



网商银行

https://www.mybank.cn/

Use cases: second-level application build and release, second-level elasticity, parallel iteration, consolidated deployment



易宝支付有限公司

https://www.yeepay.com/

Use cases: plugin-based projects, e.g., dynamic deployment, undeployment, module isolation



江苏纵目信息科技有限公司

https://www.zmops.com/

Use case: consolidated deployment of multiple applications for cost reduction and efficiency improvement


上海一嗨信息技术服务有限公司

https://www.1hai.cn/

Use cases: consolidated deployment of multiple applications for cost reduction and efficiency improvement; module hot deployment for second-level build and release



华信永道(北京)科技股份有限公司

Use case: consolidated deployment of multiple applications for cost reduction and efficiency improvement

https://www.yondervision.com.cn/