Skip to main content

· 19 min read

Abstract: This blog attempts to talk about the challenges and best practices in the process of large-scale platform engineering from the perspectives of engineering, programing language, divide-and-conquer, modeling, automation, and collaborative culture. Hopefully, by sharing the concepts and practices of our platform engineering with more companies and teams, we can make some interesting changes happen together.

This blog is based on the platform engineering and automation practice of KusionStack in Ant Group.

1. Platform Engineering: Making Enterprise DevOps Happen

The concept of DevOps was proposed more than 10 years ago. From KVM to container to the cloud-native era, a large number of enterprises have invested in the DevOps wave to solve the dilemma of internal large-scale operation efficiency and platform construction efficiency. Most of them have fallen into some kind of Anti-Pattern based on an inappropriate understanding of DevOps, while some companies have explored their golden paths. I have experienced Anti-Patterns as shown in the figure below, where the Dev and Ops teams go their separate ways, or simply force the Dev team to handle Ops work independently. More typical classifications can be found here.

There are various reasons for difficulties achieving DevOps at scale within the enterprise, especially for companies that maintain their infrastructure within the enterprise while adopting technology platforms on the cloud. Among which, the following situations are quite common:

  • The dev team and the ops team are often isolated at work and cannot reach a consensus due to team silos, lack of leaders’ insight, etc.
  • The dev team leaders underestimate the professionalism, complexity and rapid changes of platform and infrastructure technology, operation and reliability work so that they push application developers to become experts with a simple DevOps understanding
  • The leaders established a DevOps team, but became just middle executors, failing to make the Dev and Ops teams move forward and work closely together
  • The platform dev team responds insufficiently to the business complexity due to the scale as well as to the tech complexity as a result of technological evolution, thus cannot provide effective support to application developers
  • ...

Unlike small teams working on cloud-hosted Infra services and DevOps-as-a-Service products, medium and large enterprises often need to establish an appropriate DevOps system based on their team structure and culture. Based on the successful cases, whether for Meta Dev team fully assumes the Ops function, or Google introduces the SRE team as the middle layer, Platform Engineering plays a very important role. Platform engineering aims to help enterprises build a self-service operation system for application developers, and tries to solve the following key problems through engineering tech means and workflows:

  • Design a reasonable abstraction to help application developers reduce the burden of cognizing the delivery, operation, and reliability based on Infra, platform and other technologies
  • Provide application developers with a unified working interface and space to avoid falling into fragmented product interfaces and complex workflows
  • Help developers carry out work quickly based on internal engineering platform through effective workflow and recommended path
  • Help developers manage the application life cycle through self-service CI, CD, CDRA products
  • Help the platform dev team to open fundamental platform capabilities in a simple, efficient and consistent manner
  • Create a culture of collaboration and sharing through training, propagation, operations, etc.

In fact, not everyone should be or could be an expert in this specific field, which turns out to be particularly hard as well. Even the experts from the platform technology team themselves are usually good at their own professional fields, especially nowadays confronting the wide adoption of cloud-native concepts and technologies, hundreds of thousands of application configurations brought by a large number of highly open and configurable platform technologies and the business complexity of the PaaS field as well as the requirements for high stability and unified governance. The purpose of platform engineering just lies in allowing application developers to participate in such large-scale DevOps work as easily and painlessly as possible. In Ant Group's practice, we tend to the following cooperative state, which is closer to Google's best practices in team structure and work mode. Platform developers and SREs become "Enablers" to support application developers to complete dev, delivery and operation in self-service. At the same time, the work results of application developers making applications deliverable and operational also become the basis for the ops team to handle application ops work. The SRE, application dev, and ops team periodically feedback the problems and pain points in the work process to the platform dev team to form a positive cycle.

2. Domain Language: A Pole of the Engineering Thought

Compared with a domain language there's no better way for open, self-service, domain-oriented business problem definitions, as well as meeting the enterprise's internal requirements of automation, low-security-risk, low noise, and easy governance. Just as there are staves for recording music, and time-series databases for storing time-series data, within the problem domain of platform engineering, a set of configuration and policy languages ​​are created to write and manage configurations and policies at scale. Different from high-level general purpose languages ​​with mixed paradigms and engineering capabilities, the core logic of such domain languages ​​is to solve the near-infinite variation and complexity of domain problems with a convergent and limited set of syntax and semantics, and to integrate the ideas and methods of large-scale complex configuration and policies writing into language features.

In the platform engineering practice of Ant Group, we have strengthened the client-side working mode. We write and maintain the models, orchestration, constraints and policies around the application ops life cycle in the shared codebase Konfig through the record and functional language KCL. KCL is a static and strongly typed language for application developers with programming ability, and provides the writing experience of a modern high-level language with limited functionality around domain purposes. Under such practices, KCL is not a language just for writing K-V pairs, but a language for platform engineering development. Application developers, SREs, and platform developers conduct dev collaboratively based on Konfig. They write configurations, and schema abstractions, functions, constraints and rules which are frequent and complex in the PaaS field through KCL native functions, that is, writing stable and scalable business models, business logic, error-proofing constraints, and environmental rules. The Konfig repository becomes a unified programming interface, workspace and business layer, while the KCL-oriented writing paradigm, which is secure and consistent, with low noise, low side effect and easy to automate, are more beneficial for long-term management and governance.

3. Divide and Conquer: Deconstructing the Scaling Problem

The idea of ​​divide and conquer is the key to solving the scaling problem, whose efficiency is reflected from MapReduce to Kubernetes. In the field of large-scale delivery and operation, the classic operation platform tries to use the built-in unified model, orchestration, and provision technology in a black-box product to deal with full-scale business scenarios. Such a practice can be started easily and quickly and turns out to be effective on a small scale. However, as the adoption rate of different business groups increases and different requirements are put forward, it gradually enters into a state of fatigue with the constantly growing platform technology.

In Ant Group's practice, Konfig monorepo is the programming workspace opened by the internal engineering platform to developers, helping application developers to write configurations and policies around the application operation life cycle with a unified programming interface and tech stack, to integrate with existing and future platform and infrastructure, to create and manage cloud-native environments and RBAC-based permissions on demand, and to manage the delivery workflow through GitOps. Konfig monorepo provides an independent white-box programming space for different scenarios, projects and applications, whose intrinsic scalability comes from:

  • Flexible, scalable, independent client-side engineering structure design
  • The automatic merging technology of isolated config blocks supports the arbitrary and scalable organization of config blocks
  • Static type system technology provides reusable and scalable type-based modeling and constraints in a modern programming language manner
  • Project-grained GitOps CI workflow definition support
  • Provision technology selection based on Kusion engine

Konfig monorepo provides divide-and-conquer, composable engineering structure design, code organization, modeling, workflow and provision tech selection support, meanwhile, it carries scalable business requirements with a consistent development model and workflow. The working method of the client-side ensures flexibility, scalability, and portability while reducing the steadily increasing pressure on the extension mechanisms of the server-side, such as Kubernetes API Machinery.

The following figure illustrates a typical automated workflow with a GitOps approach in the Konfig monorepo, starting from project-oriented code changes and reaching the runtime through configurable CI and CD processes. Compared to centralized black-box products, this mechanism is much more open, customizable and extensible, through which there’s no more necessity of designing clumsy configuration management portals for different business scenarios, projects, and applications.


4. Modeling: Marginal Revenue and the Long Tail

With the divide-and-conquer white-box engineering structure design, code organization, modeling, workflow and provision tech selection, we also need to consider how to work with platform APIs. The typical dispute within the enterprise is whether to face directly the platform details or to design an abstraction, which eventually grows into such kind of dispute between explicit and implicit concepts.

The abstract implicit method is a usual choice for platform engineers for non-expert end users, who want to design an easy-to-understand and easy-to-use application model or Spec abstraction, which is isolated from specific platform technical details, in order to reduce the end-user's burden on cognizing and enhances the performance of preventing errors by reducing detail perception. However, most operation platform developers tend to design a powerful and unified application model or Spec abstraction. In practice, they often encounter the following obstacles:

  • With the increasing adoption rate of different business groups within the enterprise, it is difficult to implement unified modeling. The most typical case in Ant Group is the huge difference between Infra projects and SaaS applications. Java-based SaaS applications are easy to be unified, while Infra applications often need to be designed separately.
  • With a large number of platform technologies in the enterprise, the unified model itself is difficult to be stabilized, especially in response to continuously changing business needs and platform demand growth driven by technology. In Ant Group's practice, delivery and operation are affected by various factors and thus have great instability. At the same time, the business requirements for deliverable, runtime, security, and instrumentation around the application operation life cycle are also increasing. Taking instrumentation as an example, the rapidly increasing demand for application runtime observability and SLO definition in the past two years directly drives changes in end-user usage.
  • The general problem of abstract models is that a reasonable model needs to be well designed for end users, meanwhile, the API details of the platform must be kept in sync.

In Ant Group's practice, we adopt an abstract model for end-users, which refers to application developers, and solve several key problems through the following ideas:

  • Modeling for typical applications or scenarios (such as Ant Group's Sofa application), these models are developed by platform developers together with platform SREs and maintained together with application developers to achieve a balance between user experience, cost and standard compatibility. In Ant Group's practice, the information entropy convergence ratio of the abstract model is about 1:5, and the marginal benefit of modeling investment is guaranteed through extensive high-frequency usage.
  • For non-typical user applications or scenarios, platform developers and platform SRE support application developers to design application models. Mechanisms such as KCL schema and mixin help users to model, abstract, inherit, combine, reuse, and reduce repetitive code. Such modeling design work is one of the key points in the field of application PaaS, and we need a more reasonable division of labor for such a scenario. Finally, a large number of "non-standard" platform applications were adopted and managed in an accordant way within Ant Group for the first time, which solved the long tail problem effectively. In a typical collaborative mode, platform developers and platform SREs write base components of platform capability, thus becoming “Enablers”, and helping application developers "build building blocks” quickly by using base components to complete their application models.
  • For platform technology, we provide the generation tool from platform API Spec to KCL schema code, natively support the compile-time selection of different Kubernetes API versions through combined compilation, and solve the flexible requirements of mapping application models to different versions of Kubernetes clusters in internal practice. At the same time, KCL supports the writing of in-schema constraints and independent environmental rules. In addition, KCL also provides the deprecated decorator to support the force deprecation of the model or model attribute. Through robust complete modeling and constraint mechanism on the client side, general problems such as configuration errors and schema drift are exposed at compile time. Due to the left-shifted problem found before runtime, runtime errors or failures are avoided while pushing to the cluster, which is also a necessary requirement for the stability of the production environment in the enterprise, especially in an enterprise with high-level risks.

Expert users of the underlying platform technology are usually very familiar with a specific technical domain and prefer to work in an explicit way that faces platform details. The KCL language provides the necessary dynamic and modular support and ensures stability through a static type system and constraint mechanisms. However, the explicit method cannot solve the problem that expert users are not familiar with the details of using cross-domain platform technologies, nor can it solve the problem of the scalability and complexity of platform technologies today. In Ant Group's small-scale YAML-based explicit engineering practice, facing a large number of highly open and configurable platform technologies, the complexity grows continuously with the utilization rate of platform technologies, and ends up in a rigid state that is hard to read, write, constrain, test, and maintain.

5. Automation: New Challenges

Automation is a classic domain in the field of infrastructure operation. With the wide and rapid adoption of cloud-native concepts and technologies, the ability to automatically integrate turns out to be the basic requirement of enterprise operation practice. Open source, highly configurable CI and CD technologies are gradually adopted by enterprises. The black-box "product" approach that cannot be integrated is gradually weakened and replaced by a flexible orchestration approach. The main advantages of this practice lie in its powerful customize orchestration and linking capabilities, high scalability as well as good portability. Especially in the Kubernetes ecosystem, the GitOps method has a high adoption rate and a natural affinity with configurable CI and CD technologies. Such changes are also promoting the gradual transformation of the work order and product centric workflow into a self-service and engineering efficiency platform centric workflow, and the operational capability of the production environment has become an important part of the automatic workflow. In the open source community, the technology innovation of the abstraction layer for different engineering efficiency platforms is also active in progress. The developers on the platform side hope to get through the CI and CD process applied to the cloud environment through the shortest cognitive and practical path.

In Ant Group's engineering practice, the engineering efficiency platform is deeply involved in the open automation practice of Konfig monorepo, and our practice direction is also highly aligned with the technological roadmap of the engineering efficiency platform. In the collaborative work of several people to dozens or even hundreds of people, workflow design for operation scenarios, high-frequency code submission and pipeline execution, real-time automated testing and deployment pose several challenges to the engineering efficiency platform. Especially the diverse businesses in monorepo require independent and powerful workflow customization and operation support, as well as parallel workflow execution capabilities with high real-time and strong SLO guarantee. The requirements of the single-repository mode are hugely different. Most configuration languages are interpreted languages, while KCL is designed as a compiled language, implemented by Rust, C and LLVM optimizer, to provide high-performance compile-time and runtime execution for large-scale KCL files. At the same time, KCL can be compiled to native code and wasm binary to meet various runtimes execution requirements. In addition, the storage and architecture design of Git is different from the Citc/Piper architecture, and not suitable for monorepo of large-scale code. Fortunately, we have not encountered a big problem with the scale of code today. Meanwhile, we’re working together to solve problems and hope to solve them gradually as the practice deepens.

6. Collaborative Culture: A More Important Thing

The above technologies, tools and mechanisms are very important, but I must say that for engineering and DevOps, the culture of collaboration, cooperation and sharing among groups and teams is more important, because it’s work composed of people, which means people and culture are actually the key points. Within an enterprise full of team silos or isolated teams, people tend to pursue a closed and bad engineering culture, thus we usually see a large number of private code bases and private documents, the judging and working method of small isolated groups. Teams that are supposed to work closely together actually do their own thing and chase after their own private goals. I think it’ll be very difficult to do anything at scale in a culture like this. So if your company or team plans to adopt DevOps at scale, the most important thing is to communicate extensively and to start the culture construction, because it’s not just such a kind of issue concerning a few persons, and it’s also quite difficult and unmanageable.

In Ant Group's practice, there are always various difficulties in the initial stage, and everyone's concerns about the self-service mechanism and collaborative culture are particularly obvious, such as "I need to write code?" "My code is actually in the same codebase with other teams?", "The job I'm in charge of is not easy, it won't work this way" are typical. Fortunately, we ended up with a virtual organization for common goals, with full support from partners and leaders, and we reached a consensus in the mass. Although the application developers are always complaining about the technology, workflow and mechanism, and hope to obtain a better user experience, which is understandable in deed, the application developers themselves are not the meant obstacle. The real obstacle firstly comes from the operation platform dev team itself. I see that the DevOps ideal of some companies turns into that the operation platform team does all the work for the application developers, even prevents users from accessing labor tools such as code and toolchains, and rushes to hide behind the existing GUI product, which deviates from the original intention and underestimates the capabilities and creativity of the users. Secondly, the obstacles also come from the technical leaders of some platform technical teams. It is difficult for them to give up focusing on the existing work that has lasted for many years, and then accept a new user service model. The feasible way is to let them understand the meaning and vision of the work, and gradually exert influence on them step by step and stage by stage.

7. Summary

After more than a year of practice, 400+ developers have directly participated in the code contribution of Konfig monorepo, and managed more than 1500 projects. Among them, the ratio of platform developers together with platform SREs to application developers is less than 1:9. Some of these application developers are the application owners themselves, and some are representatives of the application dev team, which is up to the application team itself. Through continuous automation capability building, 200-300 commits occur every day based on the Konfig monorepo, most of which are automated code modifications, about 1k pipeline task executions and nearly 10k KCL compilation executions. Today, if you compile the full code in Konfig once and output it, 300W+ lines of YAML text will be generated. A release operation process requires multiple compilations with different parameter combinations. Through the lightweight and portable code base and toolchains, we have completed a significant external private cloud delivery, eliminating the pain of transforming and porting a series of legacy operation platforms. Inside Ant Group, we have served several different operation scenarios, and are expanding the scale and exploring more possibilities.

Finally, I would like to talk about the next step. There’re still possibilities and potentialities for us to keep improving our technology and tools in terms of usability and user experience. More user feedback and continuous improvement are always needed, and there is no shortcut to a good user experience. In the aspect of testing, we provide simple integration testing methods, which play the role of smoke testing, however, it is far from enough. We are trying to ensure correctness based on schema constraints and environmental rules rather than tests. In terms of work interface, we hope to build an IDE-based dev workspace and continue to enhance and optimize the internal workflow experience. At the same time, we hope to continue to improve coverage and technical capabilities. In addition, we also hope to apply the practice method more widely in CI and operation workflow, as well as other scenarios, to shorten end-user awareness and end-to-end workflow. At present, KusionStack is still in the very early stage of open source, and there is a lot of work to do in the future. The most important thing is that we hope to share the concepts and practices of our platform engineering with more companies and teams, to bring about and witness some interesting changes together.

8. Reference

· 11 min read


本文撰写于 KusionStack 开源前夕,作者有感而发,回顾了团队从 Kusion 项目开发之初到现今成功走上开源之路的艰辛历程。当中既描述了作者及其团队做 Kusion 项目的初心和项目发展至今的成果,也表达了作者自身对团队的由衷感激,字里行间都散发着真情实感。

1. KusionStack 是什么?

KusionStack 是开源的可编程云原生协议栈!

Kusion 一词来源于 fusion(意为融合),希望通过一站式的技术栈融合运维体系的多个角色,提升运维基础设施的开放性、扩展性,从整体上降本增效。KusionStack 通过定义云原生可编程接入层,提供包括配置语言 KCL、模型界面、自动化工具、最佳实践在内的一整套解决方案,连通云原生基础设施与业务应用,连接定义和使用基础设施的各个团队,串联应用生命周期的研发、测试、集成、发布各个阶段,服务于云原生自动化系统建设,加速云原生落地。

2. 为了一个理想的运维体系

2019 年秋,MOSN 的工作已持续了近两年,期间我们逐步完成了在支付宝核心链路的形态验证。整个过程中除了 MOSN 本身面对的种种技术挑战和困难,所谓的云原生技术红利,实际上也已经掣肘于运维系统固化所造成的效率制约。


他希望 SRE 能通过一种专用语言来编写需求,通过写代码来定义基础设施的状态,而不是花费极大的精力在检查、发现、修复的循环上。基础设施团队则通过提供开放的可编程语言和工具支撑不同诉求的 SRE 团队,达到更高的整体 ROI。

我立刻意识到这和 Hashicorp 的 Terraform 神之相似(后来 Hashicorp 在 2021 年底上市,以超过 150 亿美元的市值成为迄今为止市值最高的一次开源 IPO)。另一方面,不同于 IaaS 交付场景,蚂蚁面对着大量更规模化、复杂度更高的云原生 PaaS 场景,又让我想到了 Google 内部运用专用语言、工具等技术开放 Borg 和相关的 运维能力的实践,当时感觉这是 一个既有意思又有挑战的事


3. 漫长的学习、探索与实践




屋漏偏逢连夜雨,期间又令人惋惜且无奈的经历了一些人事变化,同时由于种种原因,项目一度陷入了各种困境。整个 2020 年,我们在未知、纠结、无奈中度过……


4. 痛并快乐的孵化之旅

通过持续地布道、交流和沟通,我们逐步在基础设施技术团队和 SRE 团队找到了更多有共识的朋友。

同时在技术上,我们亦脱离了迷茫,真正意义上地启动了 Kusion 项目,也成功地从 PoC 过渡到了 MVP 的阶段。



5. 突破与进展

略过中间的种种探索和实践,回顾这段历程,在这一年多的时间里我们结合了编译技术、运维及平台技术,成功建立了一个基于 Kusion 可编程技术栈的运维体系。

在业务场景上,项目覆盖了从 IaaS 到 SaaS 的大量运维场景,截至目前共接入了 800+ 应用,覆盖 9 个 BG,21 个 BU,其中典型案例交付运维提效 90% 以上,这也是蚂蚁内部第一次将大量异构应用纳入到一整套运维技术栈。

在蚂蚁我们基于云原生容器和微服务技术深入探索了 DevOps、CICD 实践,完善了蚂蚁的云原生技术体系,逐步释放了云原生效率红利,同时形成了一个近 300 人的虚拟运维研发团队。

不同职能不同团队的参与者凝聚在一起解决各自所面对的问题,贡献了 30K+ commit 和 350K+ 行代码,有一些参与者自发成为 Kusion 的研发者 。我认为这些工程师文化理念和领域知识的积累带来了远超运维业务本身的价值。

此外,Kusion 也成为了可编程基线产品、云原生运维产品、多云交付产品等新一代运维产品的基础技术,成为蚂蚁运维体系架构升级的一部分。


同时,因蚂蚁内部运维场景较多且链路复杂,每个环节都需要最懂运维业务的 SRE 密切参与,与平台、应用研发协同工作,最终各环节联合在一起形成了一套完整的运维体系,在这样的思路下开放技术也会越来越重要。


6. 走上开源之路

在历经了一段内部探索之后,我们希望把 KusionStack 开源到技术社区。因为我们意识到自身面对的问题,其他公司、团队其实也同样正在面对。借助开源这件事,我们希望团队的这些工作成果能对更多人有所帮助。

当然,也受限于自身能力以及精力和资源的投入,我们希望能有更多朋友参与进来,与我们共同去完善 KusionStack,不论你是工作在云原生、运维自动化、编程语言或者是编译器中的哪一个领域,我们都非常期待和欢迎你的加入。

7. 期待与你共成长

这段经历对我来说异常宝贵,不仅仅是在于自身再一次在新的技术领域和蚂蚁的技术升级方面尝试了新的探索并实现了突破,更宝贵的是,自己还拥有了一段与一群人均 95 后的小伙伴一起将想法落地实现的奇幻历程。

在未来,Kusion 的朋友圈不再局限于蚂蚁内部,面向开源,我们期待着能有更多的社区朋友在 KusionStack 与我们共同成长!