From a6b93d9f0c6c34405d7c4cd649315a72a9a4e345 Mon Sep 17 00:00:00 2001 From: payall4u Date: Thu, 30 Nov 2023 02:07:02 +0000 Subject: [PATCH] deploy: c811bb6827258ea2d5a5b8afdeff5a8848609f6d --- .../index.html" | 4 +- .../index.html | 8 +- .../how-kujiale-adopt-ehpa/index.html | 8 +- .../index.html | 50 +- docs/best-practices/index.html | 4 +- docs/best-practices/index.xml | 306 ++++--- docs/contributing/code-standards/index.html | 8 +- docs/contributing/contributing/index.html | 6 +- docs/contributing/developer-guide/index.html | 6 +- docs/contributing/index.html | 4 +- docs/core-concept/architecture/index.html | 6 +- docs/core-concept/index.html | 4 +- docs/core-concept/index.xml | 160 ++-- .../resource-optimize-model/index.html | 6 +- .../timeseriees-forecasting-by-dsp/index.html | 44 +- docs/getting-started/index.html | 4 +- docs/getting-started/index.xml | 2 +- docs/getting-started/installation/index.html | 4 +- docs/getting-started/installation/index.xml | 78 +- .../installation-cli-tool/index.html | 6 +- .../installation/installation/index.html | 26 +- .../installation/quick-start/index.html | 8 +- docs/getting-started/introduction/index.html | 8 +- docs/index.html | 4 +- docs/index.xml | 769 +++++++----------- docs/mirror-resources/index.html | 4 +- .../index.html | 6 +- .../index.html | 8 +- .../index.html | 6 +- docs/proposals/index.html | 4 +- .../index.html | 8 +- docs/roadmap/index.html | 4 +- docs/roadmap/roadmap-2022/index.html | 6 +- docs/roadmap/roadmap-2023/index.html | 6 +- .../colocation-with-enhanced-qos/index.html | 4 +- .../colocation-with-enhanced-qos/index.xml | 14 +- .../index.html | 6 +- .../index.html | 6 +- .../index.html | 8 +- .../index.html | 6 +- .../index.html | 8 +- .../using-qos-ensurance/index.html | 10 +- .../dynamic-scheduler-plugin/index.html | 6 +- docs/tutorials/index.html | 4 +- .../how-to-develop-recommender/index.html | 6 +- .../hpa-recommendation/index.html | 6 +- .../idlenode-recommendation/index.html | 6 +- docs/tutorials/recommendation/index.html | 6 +- docs/tutorials/recommendation/index.xml | 207 +---- .../pv-recommendation/index.html | 91 --- .../recommendation-framework/index.html | 52 +- .../replicas-recommendation/index.html | 6 +- .../resource-recommendation/index.html | 6 +- .../service-recommendation/index.html | 97 --- .../index.html | 6 +- .../index.html | 6 +- .../using-time-series-prediction/index.html | 6 +- en/sitemap.xml | 2 +- search/index.html | 2 +- sitemap.xml | 2 +- .../index.html" | 4 +- .../index.html | 8 +- .../how-kujiale-adopt-ehpa/index.html | 8 +- .../index.html | 10 +- zh-cn/docs/best-practices/index.html | 4 +- zh-cn/docs/best-practices/index.xml | 4 +- .../contributing/code-standards/index.html | 8 +- .../docs/contributing/contributing/index.html | 6 +- .../contributing/developer-guide/index.html | 6 +- zh-cn/docs/contributing/index.html | 4 +- .../docs/core-concept/architecture/index.html | 6 +- zh-cn/docs/core-concept/index.html | 4 +- .../resource-optimize-model/index.html | 6 +- .../timeseriees-forecasting-by-dsp/index.html | 6 +- zh-cn/docs/getting-started/index.html | 4 +- zh-cn/docs/getting-started/index.xml | 4 +- .../getting-started/installation/index.html | 4 +- .../getting-started/installation/index.xml | 78 +- .../installation-cli-tool/index.html | 6 +- .../installation/installation/index.html | 26 +- .../installation/quick-start/index.html | 8 +- .../getting-started/introduction/index.html | 8 +- zh-cn/docs/index.html | 4 +- zh-cn/docs/index.xml | 445 +++------- zh-cn/docs/mirror-resources/index.html | 4 +- .../index.html | 6 +- .../index.html | 8 +- .../index.html | 6 +- zh-cn/docs/proposals/index.html | 4 +- .../index.html | 8 +- zh-cn/docs/roadmap/index.html | 4 +- zh-cn/docs/roadmap/roadmap-2022/index.html | 6 +- zh-cn/docs/roadmap/roadmap-2023/index.html | 6 +- .../colocation-with-enhanced-qos/index.html | 4 +- .../colocation-with-enhanced-qos/index.xml | 14 +- .../index.html | 6 +- .../index.html | 6 +- .../index.html | 8 +- .../index.html | 6 +- .../index.html | 8 +- .../using-qos-ensurance.zh/index.html | 10 +- .../dynamic-scheduler-plugin/index.html | 6 +- zh-cn/docs/tutorials/index.html | 4 +- zh-cn/docs/tutorials/index.xml | 43 - .../index.html | 36 +- .../how-to-develop-recommender/index.html | 6 +- .../hpa-recommendation/index.html | 6 +- .../idlenode-recommendation/index.html | 53 +- .../docs/tutorials/recommendation/index.html | 6 +- zh-cn/docs/tutorials/recommendation/index.xml | 300 ++----- .../pv-recommendation/index.html | 90 -- .../recommendation-framework/index.html | 54 +- .../replicas-recommendation/index.html | 6 +- .../resource-recommendation/index.html | 6 +- .../service-recommendation/index.html | 96 --- .../index.html | 6 +- .../index.html | 6 +- .../using-time-series-prediction/index.html | 6 +- zh-cn/search/index.html | 2 +- zh-cn/sitemap.xml | 2 +- 120 files changed, 1169 insertions(+), 2514 deletions(-) delete mode 100644 docs/tutorials/recommendation/pv-recommendation/index.html delete mode 100644 docs/tutorials/recommendation/service-recommendation/index.html delete mode 100644 zh-cn/docs/tutorials/recommendation/pv-recommendation/index.html delete mode 100644 zh-cn/docs/tutorials/recommendation/service-recommendation/index.html diff --git "a/blog/1/01/01/crane-v0.7\351\200\232\350\277\207\346\216\247\345\210\266\345\217\260\344\270\200\351\224\256\350\212\202\347\234\201\344\272\221\346\210\220\346\234\254/index.html" "b/blog/1/01/01/crane-v0.7\351\200\232\350\277\207\346\216\247\345\210\266\345\217\260\344\270\200\351\224\256\350\212\202\347\234\201\344\272\221\346\210\220\346\234\254/index.html" index 8b79499e3..d8ec0e221 100644 --- "a/blog/1/01/01/crane-v0.7\351\200\232\350\277\207\346\216\247\345\210\266\345\217\260\344\270\200\351\224\256\350\212\202\347\234\201\344\272\221\346\210\220\346\234\254/index.html" +++ "b/blog/1/01/01/crane-v0.7\351\200\232\350\277\207\346\216\247\345\210\266\345\217\260\344\270\200\351\224\256\350\212\202\347\234\201\344\272\221\346\210\220\346\234\254/index.html" @@ -5,13 +5,13 @@ 资源推荐框架 Recommendation Framework Crane 的资源推荐,副本推荐功能在腾讯内部落地帮助自研业务每月节省了大量的成本,取得了很好的效果,详情请见:https://mp.weixin.qq.com/s/1SeMzcf_VRvRysZ9NLI-Sw 。同时,我们认为自动分析集群资源找到浪费并给出优化建议是帮助企业降本的重要方法,引入更多的分析类型至关重要。 因此在 0.7.0 版本中,Crane 设计了 Recommendation Framework,它提供了一个可扩展的推荐框架以支持多种云资源的分析,并内置了多种推荐器:资源推荐,副本推荐,闲置资源推荐。Recommendation Framework 通过 RecommendationRule 和 Recommendation CRD 描述了如何进行资源的分析推荐。 智能推荐的规则 -apiVersion: analysis.crane.io/v1alpha1 kind: RecommendationRule metadata: name: workloads-rule labels: analysis.crane.io/recommendation-rule-preinstall: "true" spec: runInterval: 24h # 每24h运行一次 resourceSelectors: # 资源的信息 - kind: Deployment apiVersion: apps/v1 - kind: StatefulSet apiVersion: apps/v1 namespaceSelector: any: true # 扫描所有namespace recommenders: # 使用 Workload 的副本和资源推荐器 - name: Replicas - name: Resource 推荐的结果">Intelligent Autoscaling Practices Based on Effective HPA for Custom Metrics | Crane
\ No newline at end of file diff --git a/docs/best-practices/index.html b/docs/best-practices/index.html index 1cf6412e1..eebc20477 100644 --- a/docs/best-practices/index.html +++ b/docs/best-practices/index.html @@ -26,9 +26,7 @@
  • -
  • -
  • -
  • +
  • diff --git a/docs/best-practices/index.xml b/docs/best-practices/index.xml index 39d9587e1..efe2ddc4a 100644 --- a/docs/best-practices/index.xml +++ b/docs/best-practices/index.xml @@ -305,11 +305,11 @@ Prometheus is a popular open source monitoring system today, through which user- </span></span><span style="display:flex;"><span> <span style="color:#f92672">name</span>: <span style="color:#ae81ff">sample-app</span> </span></span></code></pre></div><h2 id="summary">Summary</h2> <p>Due to the complexity of production environments, multi-metric-based autoscaling (CPU/Memory/custom metrics) is often a common choice for production applications, so Effective HPA achieves the effectiveness of helping more businesses land horizontal autoscaling in production environments by covering multi-metric autoscaling with predictive algorithms.</p>Docs: How to optimize your application in FinOps era/docs/best-practices/how-to-optimize-your-application-resource/Mon, 01 Jan 0001 00:00:00 +0000/docs/best-practices/how-to-optimize-your-application-resource/ -<p>As more and more enterprises migrate their applications to the Kubernetes platform, it has gradually become an important entry point for resource orchestration and scheduling. As we all know, Kubernetes schedules applications based on the resource quotas requested by the applications, so how to properly configure application resource specifications has become the key to improving cluster utilization. This article will share how to correctly configure application resources based on the FinOps open-source project Crane, and how to promote resource optimization practices within the enterprise.</p> -<h2 id="kubernetes-how-to-manage-resources">Kubernetes How to manage resources</h2> -<h3 id="pod-resource-model">Pod Resource model</h3> -<p>In Kubernetes, the desired amount of resources for a Pod can be selectively set by specifying Request/Limit. When the resource Request is specified for a Container in a Pod, Kube-scheduler uses this information to determine which node to schedule the Pod on. When the resource Request and Limit are specified for a Container, kubelet ensures that the running container can access the requested resources through Cgroup parameters and does not use resources beyond the set limit. Kubelet also reserves system resources equal to the Request amount for the container to use. -example of resource configuration for a Pod:</p> +<p>随着越来越多的企业将应用程序迁移到 Kubernetes 平台,它逐渐成为了资源编排和调度的重要入口。众所周知,Kubernetes 会按照应用程序申请的资源配额进行调度,因此如何合理的配置应用资源规格就成为提升集群利用率的关键。这篇文章将会分享如何基于 FinOps 开源项目 Crane 正确的配置应用资源,以及如何在企业内推进资源优化的实践。</p> +<h2 id="kubernetes-如何管理资源">Kubernetes 如何管理资源</h2> +<h3 id="pod-资源模型">Pod 资源模型</h3> +<p>在 Kubernetes 中可以通过指定 Request/Limit 选择性的为 Pod 设定所需的资源数量。当为 Pod 中的 Container 指定了资源 Request 时, Kube-scheduler 就利用该信息决定将 Pod 调度到哪个节点上。当为 Container 指定了资源 Request 和 Limit 时,kubelet 会通过 Cgroup 参数确保运行的容器可以获取到申请的资源并且不会使用超出所设限制的资源。kubelet 还会为容器预留所 Request 数量的系统资源,供其使用。</p> +<p>以下是一个 Pod 的资源示例:</p> <pre tabindex="0"><code>apiVersion: v1 kind: Pod metadata: @@ -325,14 +325,14 @@ cpu: &#34;250m&#34; limits: memory: &#34;128Mi&#34; cpu: &#34;500m&#34; -</code></pre><p>Once the resource request amount is determined, the resource utilization formula for an application can be derived as follows: Utilization = Resource Usage / Resource Request.</p> -<p>Therefore, to improve the utilization of Pods, we need to configure reasonable resource requests.</p> -<h3 id="workload-resource-model">Workload Resource model</h3> -<p>A workload is an application that runs on Kubernetes, consisting of a group of Pods, such as Deployments and StatefulSets. The number of Pods is referred to as the workload&rsquo;s replica count.</p> -<p>The resource utilization formula for a workload is: Workload Utilization = (Pod1 Usage + Pod2 Usage + &hellip; PodN Usage) / (Request * Replicas).</p> -<p>As the formula shows, improving workload utilization can not only reduce the Request, but also reduce the Replicas.</p> -<h3 id="common-resource-configuration-issues">Common resource configuration issues</h3> -<p>The Canadian software company Densify summarized common resource configuration issues in &ldquo;12 RISK OF KUBERNETES RESOURCE MANAGEMENT&rdquo; [1]. In the table below, we have added an analysis dimension of replica counts based on their findings.</p> +</code></pre><p>在明确了资源的申请量后即可推导出应用的资源利用率公式:Utilization = 资源用量 Usage / 资源申请量 。</p> +<p>因此,为了提升 Pod 的利用率我们需要配置合理的资源 Request。</p> +<h3 id="workload-资源模型">Workload 资源模型</h3> +<p>Workload 是在 Kubernetes 上运行的应用程序。它由一组 Pod 组成,例如 Deployment 和 StatefulSet 统称为 Workload。Pod 的数量称为 Workload 的副本数。</p> +<p>Workload 的资源利用率公式:Workload Utilization = (Pod1 Usage + Pod2 Usage + &hellip; PodN Usage)/ (Request * Replicas)</p> +<p>从公式可知提升 Workload 利用率不仅可以降低 Request,也可以降低 Replicas。</p> +<h3 id="常见的资源配置问题">常见的资源配置问题</h3> +<p>加拿大软件公司 Densify 在《12 RISK OF KUBERNETES RESOURCE MANAGEMENT》[1]中总结了常见的资源配置问题。在下表中我们在它的基础上增加了副本数维度的分析。</p> <table> <thead> <tr> @@ -346,116 +346,114 @@ cpu: &#34;500m&#34; </thead> <tbody> <tr> -<td>Oversized</td> -<td>Excess CPU resources lead to more waste of nodes and resources</td> -<td>K8s scheduler may request excessive Memory resources, leading to more waste of nodes and resources.</td> -<td>Allowing Pods to request excessive CPU resources can create a &rsquo;noisy neighbor&rsquo; risk, affecting other Pods on the same node</td> -<td>Allowing Pods to request excessive Memory resources can create a &rsquo;noisy neighbor&rsquo; risk, which in turn can affect other Pods running on the same node</td> -<td>Excessive Pods can lead to more waste of nodes and resources</td> +<td>过大</td> +<td>多余的CPU资源导致更多节点和资源的浪费</td> +<td>调度器会申请过多Memory资源,导致更多节点和资源的浪费</td> +<td>允许Pod申请过多的CPU资源从而产生“吵闹邻居”风险,影响同一节点上的其他Pod</td> +<td>允许Pod申请过多的Memory资源从而产生“吵闹邻居”风险,从而影响同一节点上的其他Pod</td> +<td>多余的Pod会导致更多节点和资源的浪费</td> </tr> <tr> -<td>Undersized</td> -<td>This can lead to excessive stacking of Pods on nodes, and if all CPU resources are exhausted, it can result in contention and risk of CPU throttling at the node level</td> -<td>This can lead to excessive stacking of Pods on nodes, and if all Memory resources are exhausted, it can result in the risk of Pod termination (OOM Killer) at the node level</td> -<td>The Pod&rsquo;s CPU usage will be limited, and if the actual workload exceeds the limit, it can result in CPU throttling and performance degradation</td> -<td>The Pod&rsquo;s Memory usage will be limited, and if the actual workload exceeds the limit, it can trigger the OOM Killer to terminate processes</td> -<td>Having too few Pods can result in high utilization rates, leading to stability issues such as performance degradation and OOM Killer</td> +<td>过小</td> +<td>会导致在节点上过度堆叠Pod,如果所有CPU资源被用尽,则会在节点级别上产生争抢和CPU throttling的风险</td> +<td>会导致在节点上过度堆叠Pod,如果所有Memory资源都被用尽,则会在节点级别上产生Pod终止的风险(OOM Killer)</td> +<td>会限制Pod的CPU使用,如果实际业务压力超过Limit,会导致CPU throttling和性能下降</td> +<td>会限制Pod的Memory使用,如果实际业务压力超过Limit,会触发OOM Killer杀死进程</td> +<td>过少的Pod会带来过高的利用率,引发诸如性能下降,OOM Killer等稳定性问题</td> </tr> <tr> -<td>Unset</td> -<td>K8s scheduler will be uncertain about how many Pods can be scheduled in the cluster, and excessive stacking of Pods can create significant performance risks and uneven workloads</td> -<td>The scheduler will be uncertain about how many Pods can be scheduled in the cluster, which can lead to excessive stacking and the risk of Pods being OOM killed</td> -<td>Unconstrained Pods can amplify the &rsquo;noisy neighbor&rsquo; effect and create the risk of CPU throttling</td> -<td>Unconstrained Pods can amplify the &rsquo;noisy neighbor&rsquo; risk, and if the node&rsquo;s memory is exhausted, it can trigger the OOM Killer to terminate processes</td> +<td>不设置</td> +<td>调度器将不确定在集群中可以调度多少Pod,并且过度堆叠的Pod会产生显著的性能风险和不均匀的负载</td> +<td>调度器将不确定在集群中可以调度多少Pod,从而产生过度堆叠和Pod被OOM Kill的风险</td> +<td>Pod将不受约束,放大“吵闹邻居”效应,并产生CPU throttling的风险</td> +<td>Pod将不受约束,放大了“吵闹邻居”风险,如果节点内存耗尽,可能会导致OOM Killer启动</td> <td>N/A</td> </tr> </tbody> </table> -<p>As we can see, setting resource limits too low can lead to stability issues, while setting them too high only results in &ldquo;mere&rdquo; resource waste, which can be acceptable during periods of rapid business growth. This is the main reason why resource utilization rates are generally low for many businesses after migrating to the cloud. The following graph shows the resource usage of an application, with 30% resource waste between the peak historical usage of the Pod and its Request amount. -<img src="/images/resource-waste.jpg" alt="Resource Waste"></p> -<h2 id="application-resource-optimization-model">Application Resource Optimization Model</h2> -<p>After mastering Kubernetes&rsquo; resource model, we can further derive a resource optimization model for cloud-native applications:</p> +<p>大家可以发现资源设置过小会引发稳定性问题,而相比之下资源设置大一些“仅仅”会导致资源浪费,在业务快速发展时期这些浪费是可以接受的。这就是许多企业上云后资源利用率普遍偏低的主要原因。下图是一个应用的资源用量图表,该 Pod 的历史用量的峰值与它的申请量 Request 之间,有30%的资源浪费。</p> +<p><img src="/images/resource-waste.jpg" alt="Resource Waste"></p> +<h2 id="应用资源优化模型">应用资源优化模型</h2> +<p>掌握了 Kubernetes 的资源模型后,我们可以进一步推导出云原生应用的资源优化模型:</p> <p><img src="/images/resource-model.png" alt="Crane Overview"></p> -<p>The five lines in the graph from top to bottom are:</p> +<p>图中五条线从上到下分别是:</p> <ol> -<li>Node Capacity: The total amount of resources in all nodes in the cluster, corresponding to the Capacity of the cluster.</li> -<li>Allocated: The total amount of resources allocated by the application, corresponding to the Pod Request.</li> -<li>Weekly Peak: The peak resource usage of the application during a certain period in the past. Weekly peak can be used to predict future resource usage, and configuring resource specifications based on weekly peak has higher security and more general applicability.</li> -<li>Daily Average Peak: The peak resource usage of the application in the past day.</li> -<li>Mean: The average resource usage of the application, corresponding to Usage.</li> +<li>节点容量:集群中所有节点的资源总量,对应集群的 Capacity</li> +<li>已分配:应用申请的资源总量,对应 Pod Request</li> +<li>周峰值:应用在过去一段时间内资源用量的峰值。周峰值可以预测未来一段时间内的资源使用,通过周峰值配置资源规格的安全性较高,普适性更强</li> +<li>日均峰值:应用在近一天内资源用量的峰值</li> +<li>均值:应用的平均资源用量,对应 Usage</li> </ol> -<p>The idle resources can be divided into two categories:</p> +<p>其中资源的闲置分两类:</p> <ol> -<li>Resource Slack: The difference between Capacity and Request.</li> -<li>Usage Slack: The difference between Request and Usage.</li> +<li>Resource Slack:Capacity 和 Request 之间的差值</li> +<li>Usage Slack:Request 和 Usage 之间的差值</li> </ol> <p>Total Slack = Resource Slack + Usage Slack</p> -<p>The goal of resource optimization is to reduce Resource Slack and Usage Slack. The model provides four steps for reducing waste, in order from top to bottom:</p> +<p>资源优化的目标是 <strong>减少 Resource Slack 和 Usage Slack</strong>。模型中针对如何一步步减少浪费提供了四个步骤,从上到下分别是:</p> <ol> -<li>Improving packing rate: Improving the packing rate can bring the Capacity and Request closer together. There are many ways to achieve this, such as:<a href="/zh-cn/docs/tutorials/scheduling-pods-based-on-actual-node-load">Dynamic scheduler</a>、Tencent Cloud Native Node&rsquo;s node amplification function, etc.</li> -<li>Adjusting business specifications to reduce resource locking: Adjusting business specifications based on the weekly peak resource usage can reduce the Request to the weekly peak line.<a href="/docs/tutorials/recommendation/resource-recommendation">Resource recommendation</a> and <a href="/docs/tutorials/recommendation/replicas-recommendation">Replicas Recommendation</a>can help applications achieve this goal.</li> -<li>Adjusting business specifications + scaling to handle burst traffic: Based on the optimization of specifications, HPA can handle burst traffic to reduce the Request to the daily peak line. At this time, the target utilization rate of HPA is low, only to handle burst traffic, and automatic elasticity does not occur most of the time.</li> -<li>Adjusting business specifications + scaling to handle daily traffic changes: Based on the optimization of specifications, HPA can handle daily traffic to reduce the Request to the mean. At this time, the target utilization rate of HPA is equal to the average utilization rate of the application.</li> +<li>提升装箱率:提升装箱率能够让 Capacity 和 Request 更加接近。手段有很多,例如:<a href="/zh-cn/docs/tutorials/scheduling-pods-based-on-actual-node-load">动态调度器</a>、腾讯云原生节点的节点放大功能等</li> +<li>业务规格调整减少资源锁定:根据周峰值资源用量调整业务规格使的 Request 可以减少到周峰值线。<a href="/zh-cn/docs/tutorials/recommendation/resource-recommendation">资源推荐</a>和<a href="/zh-cn/docs/tutorials/recommendation/replicas-recommendation">副本推荐</a>可以帮助应用实现此目标。</li> +<li>业务规格调整+扩缩容兜底流量突发:在规格优化的基础上再通过 HPA 兜底突发流量使的 Request 可以减少到日均峰值线。此时 HPA 的目标利用率偏低,仅为应对突发流量,绝大多数时间内不发生自动弹性</li> +<li>业务规格调整+扩缩容应对日常流量变化:在规格优化的基础上再通过 HPA 应用日常流量使的 Request 可以减少到均值。此时 HPA 的目标利用率等于应用的平均利用率</li> </ol> -<p>Based on this model, the open-source project Crane provides dynamic scheduling, recommendation framework, intelligent elasticity, and mixed deployment capabilities, realizing an all-in-one FinOps cloud resource optimization platform. In this article, we will focus on the recommendation framework.</p> -<h2 id="optimizing-resource-configuration-through-the-crane-recommendation-framework">Optimizing resource configuration through the Crane recommendation framework</h2> -<p>The open-source project Crane has launched the Recommendation Framework, which automatically analyzes the operation of various resources in the cluster and provides optimization suggestions. By analyzing CPU/Memory monitoring data over a period of time and using resource recommendation algorithms, the Recommendation Framework provides resource configuration suggestions, allowing enterprises to make decisions based on the proposed configurations.</p> -<p>In the following example, we will demonstrate how to quickly start a full cluster resource recommendation.</p> -<p>Before embarking on this cost-cutting journey, you need to install Crane in your environment. Please refer to Crane&rsquo;s installation documentation for guidance.</p> -<h3 id="create-recommendationrule">Create RecommendationRule</h3> -<p>Here&rsquo;s a RecommendationRule example: workload-rule.yaml。</p> +<p>开源项目 Crane 基于这套模型,提供了动态调度、推荐框架、智能弹性、混部等技术能力,实现了一站式的 FinOps 云资源优化平台。本文我们将重点介绍推荐框架部分。</p> +<h2 id="通过-crane-推荐框架优化资源配置">通过 Crane 推荐框架优化资源配置</h2> +<p>开源项目 Crane 推出了推荐框架(RecommendationFramework)自动分析集群的各种资源的运行情况并给出优化建议。推荐框架通过分析过去一段时间的 CPU/Memory 监控数据,基于资源推荐算法给出资源配置的建议,企业可以基于建议配置进行决策。</p> +<p>下面我们通过一个例子介绍如何快速开始一次全集群的资源推荐。</p> +<p>在开始降本之旅之前,您需要在环境中安装 Crane,请参考 Crane 的安装文档。</p> +<h3 id="创建-recommendationrule">创建 RecommendationRule</h3> +<p>下面是一个 RecommendationRule 示例: workload-rule.yaml。</p> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">analysis.crane.io/v1alpha1</span> </span></span><span style="display:flex;"><span><span style="color:#f92672">kind</span>: <span style="color:#ae81ff">RecommendationRule</span> </span></span><span style="display:flex;"><span><span style="color:#f92672">metadata</span>: </span></span><span style="display:flex;"><span> <span style="color:#f92672">name</span>: <span style="color:#ae81ff">workloads-rule</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">spec</span>: -</span></span><span style="display:flex;"><span> <span style="color:#f92672">runInterval</span>: <span style="color:#ae81ff">24h </span> <span style="color:#75715e"># run once every 24 hours</span> -</span></span><span style="display:flex;"><span> <span style="color:#f92672">resourceSelectors</span>: <span style="color:#75715e"># information about resources</span> +</span></span><span style="display:flex;"><span> <span style="color:#f92672">runInterval</span>: <span style="color:#ae81ff">24h </span> <span style="color:#75715e"># 每24h运行一次</span> +</span></span><span style="display:flex;"><span> <span style="color:#f92672">resourceSelectors</span>: <span style="color:#75715e"># 资源的信息</span> </span></span><span style="display:flex;"><span> - <span style="color:#f92672">kind</span>: <span style="color:#ae81ff">Deployment</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">apps/v1</span> </span></span><span style="display:flex;"><span> - <span style="color:#f92672">kind</span>: <span style="color:#ae81ff">StatefulSet</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">apps/v1</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">namespaceSelector</span>: -</span></span><span style="display:flex;"><span> <span style="color:#f92672">any</span>: <span style="color:#66d9ef">true</span> <span style="color:#75715e"># scan all namespaces</span> -</span></span><span style="display:flex;"><span> <span style="color:#f92672">recommenders</span>: <span style="color:#75715e"># Use replica and resource recommenders for Workloads</span> +</span></span><span style="display:flex;"><span> <span style="color:#f92672">any</span>: <span style="color:#66d9ef">true</span> <span style="color:#75715e"># 扫描所有namespace</span> +</span></span><span style="display:flex;"><span> <span style="color:#f92672">recommenders</span>: <span style="color:#75715e"># 使用 Workload 的副本和资源推荐器</span> </span></span><span style="display:flex;"><span> - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Replicas</span> </span></span><span style="display:flex;"><span> - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Resource</span> -</span></span></code></pre></div><p>In this example:</p> +</span></span></code></pre></div><p>在该示例中:</p> <ul> -<li>Analysis recommendations are run every 24 hours, with the runInterval format set as an interval of time, such as 1h or 1m. Setting it to empty means running only once.</li> -<li>The resources to be analyzed are set through the resourceSelectors array. Each resourceSelector selects resources in the k8s cluster based on kind, apiVersion, and name. When name is not specified, it means all resources under the namespaceSelector.</li> -<li>The namespaceSelector defines the namespaces of the resources to be analyzed. &ldquo;any: true&rdquo; means selecting all namespaces.</li> -<li>The recommenders define which Recommender(s) should be used for analyzing the resources. Currently supported types are: recommenders.</li> -<li>The resource types and recommenders need to be matched. For example, the Resource Recommender only supports Deployments and StatefulSets by default. Please refer to the recommender&rsquo;s documentation for which resource types each Recommender supports.</li> +<li>每隔24小时运行一次分析推荐,runInterval格式为时间间隔,比如: 1h,1m,设置为空表示只运行一次。</li> +<li>待分析的资源通过配置 resourceSelectors 数组设置,每个 resourceSelector 通过 kind,apiVersion,name 选择 k8s 中的资源,当不指定 name 时表示在 namespaceSelector 基础上的所有资源</li> +<li>namespaceSelector 定义了待分析资源的 namespace,any: true 表示选择所有 namespace</li> +<li>recommenders 定义了待分析的资源需要通过哪些 Recommender 进行分析。目前支持的类型:recommenders</li> +<li>资源类型和 recommenders 需要可以匹配,比如 Resource 推荐默认只支持 Deployments 和 StatefulSets,每种 Recommender 支持哪些资源类型请参考 recommender 的文档</li> </ul> <ol> -<li>Create a RecommendationRule with the following command, and the recommendation will start immediately after creation.</li> +<li>通过以下命令创建 RecommendationRule,刚创建时会立刻开始一次推荐。</li> </ol> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>kubectl apply -f workload-rules.yaml -</span></span></code></pre></div><p>This example will perform resource and replica recommendations for Deployments and StatefulSets in all namespaces. -2. Check the recommendation progress of the RecommendationRule. Observe the progress of the recommendation task through Status.recommendations. The recommendation tasks are executed sequentially. If the lastStartTime of all tasks is the latest time and the message has a value, it indicates that the current recommendation has been completed.</p> +</span></span></code></pre></div><p>这个例子会对所有 namespace 中的 Deployments 和 StatefulSets 做资源推荐和副本数推荐。 +2. 检查 RecommendationRule 的推荐进度。通过 Status.recommendations 观察推荐任务的进度,推荐任务是顺序执行,如果所有任务的 lastStartTime 为最近时间且 message 有值,则表示这一次推荐完成</p> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>kubectl get rr workloads-rule </span></span></code></pre></div><ol start="3"> -<li>Query the recommendation results with the following command:</li> +<li>通过以下命令查询推荐结果:</li> </ol> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>kubectl get recommend -</span></span></code></pre></div><p>You can filter the Recommendation by the following labels, for example: kubectl get recommend -l analysis.crane.io/recommendation-rule-name=workloads-rule</p> -<h3 id="adjust-resource-configurations-based-on-optimization-recommendations-from-the-recommendation">Adjust resource configurations based on optimization recommendations from the Recommendation.</h3> -<p>For resource and replica recommendations, users can PATCH status.recommendedInfo to the Workload to update the resource configurations. For example:</p> +</span></span></code></pre></div><p>可通过以下 label 筛选 Recommendation,比如 kubectl get recommend -l analysis.crane.io/recommendation-rule-name=workloads-rule</p> +<h3 id="根据优化建议-recommendation-调整资源配置">根据优化建议 Recommendation 调整资源配置</h3> +<p>对于资源推荐和副本数推荐建议,用户可以 PATCH status.recommendedInfo 到 workload 更新资源配置,例如:</p> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-shell" data-lang="shell"><span style="display:flex;"><span>patchData<span style="color:#f92672">=</span><span style="color:#e6db74">`</span>kubectl get recommend workloads-rule-replicas-rckvb -n default -o jsonpath<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;{.status.recommendedInfo}&#39;</span><span style="color:#e6db74">`</span>;kubectl patch Deployment php-apache -n default --patch <span style="color:#e6db74">&#34;</span><span style="color:#e6db74">${</span>patchData<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span> </span></span></code></pre></div><h3 id="recommender">Recommender</h3> -<p>Currently, Crane supports the following Recommenders:</p> +<p>目前 Crane 支持了以下 Recommender:</p> <ul> -<li><a href="/docs/tutorials/recommendation/resource-recommendation"><strong>Resource Recommendation</strong></a>: By using the VPA algorithm to analyze the actual usage of applications, Crane recommends more appropriate resource configurations.</li> -<li><a href="/docs/tutorials/recommendation/replicas-recommendation"><strong>Replicas Recommendation</strong></a>: By using the HPA algorithm to analyze the actual usage of applications, Crane recommends more appropriate replica numbers.</li> -<li><a href="/docs/tutorials/recommendation/hpa-recommendation"><strong>HPA Recommendation</strong></a>: Scan the Workloads in the cluster and recommend HPA configurations for Workloads that are suitable for horizontal scaling.</li> -<li><a href="/docs/tutorials/recommendation/idlenode-recommendation"><strong>Idlenode Recommendation</strong></a>: By scanning the state and utilization of nodes in the cluster, Node recommendation helps users find idle Kubernetes nodes.</li> -<li><a href="/docs/tutorials/recommendation/service-recommendation"><strong>Service Recommendation</strong></a>: By scanning the running status of Services in the cluster, Service recommendation helps users find idle Kubernetes Services.</li> -<li><a href="/docs/tutorials/recommendation/pv-recommendation"><strong>PV Recommendation</strong></a>: By scanning the running status of PV in the cluster, PV recommendation helps users find idle Kubernetes PV.</li> +<li><a href="/zh-cn/docs/tutorials/recommendation/resource-recommendation"><strong>资源推荐</strong></a>: 通过 VPA 算法分析应用的真实用量推荐更合适的资源配置</li> +<li><a href="/zh-cn/docs/tutorials/recommendation/replicas-recommendation"><strong>副本数推荐</strong></a>: 通过 HPA 算法分析应用的真实用量推荐更合适的副本数量</li> +<li><a href="/zh-cn/docs/tutorials/recommendation/hpa-recommendation"><strong>HPA 推荐</strong></a>: 扫描集群中的 Workload,针对适合适合水平弹性的 Workload 推荐 HPA 配置</li> +<li><a href="/zh-cn/docs/tutorials/recommendation/idlenode-recommendation"><strong>闲置节点推荐</strong></a>: 扫描集群中的闲置节点</li> </ul> -<p>This article focuses on optimizing resource configurations for Workloads, therefore, the following section will focus on resource recommendations and replica recommendations.</p> -<h3 id="resource-recommendations">Resource recommendations</h3> -<p>Here&rsquo;s an example of resource recommendations:</p> +<p>本文重点讨论 Workload 的资源配置优化,因此下面重点介绍资源推荐和副本推荐。</p> +<h3 id="资源推荐">资源推荐</h3> +<p>以下是一个资源推荐结果的样例:</p> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">status</span>: </span></span><span style="display:flex;"><span> <span style="color:#f92672">recommendedInfo</span>: &gt;-<span style="color:#e6db74"> </span></span></span><span style="display:flex;"><span><span style="color:#e6db74"> </span> {<span style="color:#e6db74">&#34;spec&#34;</span>:{<span style="color:#e6db74">&#34;template&#34;</span>:{<span style="color:#e6db74">&#34;spec&#34;</span>:{<span style="color:#e6db74">&#34;containers&#34;</span>:[{<span style="color:#e6db74">&#34;name&#34;</span>:<span style="color:#e6db74">&#34;craned&#34;</span>,<span style="color:#e6db74">&#34;resources&#34;</span>:{<span style="color:#e6db74">&#34;requests&#34;</span>:{<span style="color:#e6db74">&#34;cpu&#34;</span>:<span style="color:#e6db74">&#34;150m&#34;</span>,<span style="color:#e6db74">&#34;memory&#34;</span>:<span style="color:#e6db74">&#34;256Mi&#34;</span>}}},{<span style="color:#e6db74">&#34;name&#34;</span>:<span style="color:#e6db74">&#34;dashboard&#34;</span>,<span style="color:#e6db74">&#34;resources&#34;</span>:{<span style="color:#e6db74">&#34;requests&#34;</span>:{<span style="color:#e6db74">&#34;cpu&#34;</span>:<span style="color:#e6db74">&#34;150m&#34;</span>,<span style="color:#e6db74">&#34;memory&#34;</span>:<span style="color:#e6db74">&#34;256Mi&#34;</span>}}}]}}}} @@ -469,18 +467,18 @@ cpu: &#34;500m&#34; </span></span><span style="display:flex;"><span> <span style="color:#f92672">reason</span>: <span style="color:#ae81ff">RecommendationReady</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">message</span>: <span style="color:#ae81ff">Recommendation is ready</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">lastUpdateTime</span>: <span style="color:#e6db74">&#39;2022-11-30T03:07:49Z&#39;</span> -</span></span></code></pre></div><p>recommendedInfo displays the recommended resource configuration, while currentInfo displays the current resource configuration. The format is JSON, and the recommended results can be updated to TargetRef using Kubectl Patch.</p> -<h4 id="compute-resource-specification-algorithm">Compute resource specification algorithm</h4> -<p>The resource recommendation process is completed in the following steps:</p> +</span></span></code></pre></div><p>recommendedInfo 显示了推荐的资源配置,currentInfo 显示了当前的资源配置,格式是 Json ,可以通过 Kubectl Patch 将推荐结果更新到 TargetRef</p> +<h4 id="计算资源规格算法">计算资源规格算法</h4> +<p>资源推荐按以下步骤完成一次推荐过程:</p> <ol> -<li>Obtain the CPU and memory usage history of the workload in the past week through monitoring data.</li> -<li>Based on the historical usage, use the VPA Histogram to take the P99 percentile and multiply it by an amplification factor.</li> -<li>OOM Protection: If there have been historical OOM events in the container, consider increasing memory appropriately when making memory recommendations.</li> -<li>Resource Specification Regularization: Round up the recommended results to the specified container specifications. -The basic principle is to set the Request slightly higher than the maximum historical usage based on historical resource usage, and consider factors such as OOM and Pod specifications.</li> +<li>通过监控数据,获取 Workload 过去一周的 CPU 和 Memory 历史用量。</li> +<li>基于历史用量通过 VPA Histogram 取 P99 百分位后再乘以放大系数</li> +<li>OOM 保护:如果容器存在历史的 OOM 事件,则考虑 OOM 时的内存适量增大内存推荐结果</li> +<li>资源规格规整:按指定的容器规格对推荐结果向上取整</li> </ol> -<h4 id="replica-recommendations">Replica recommendations</h4> -<p>Here&rsquo;s an example of replica recommendations:</p> +<p>基本原理是基于历史的资源用量,将 Request 配置成略高于历史用量的最大值并且考虑 OOM,Pod 规格等因素。</p> +<h4 id="副本推荐">副本推荐</h4> +<p>以下是一个副本推荐结果的样例:</p> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">status</span>: </span></span><span style="display:flex;"><span> <span style="color:#f92672">recommendedInfo</span>: <span style="color:#e6db74">&#39;{&#34;spec&#34;:{&#34;replicas&#34;:1}}&#39;</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">currentInfo</span>: <span style="color:#e6db74">&#39;{&#34;spec&#34;:{&#34;replicas&#34;:2}}&#39;</span> @@ -492,89 +490,89 @@ The basic principle is to set the Request slightly higher than the maximum histo </span></span><span style="display:flex;"><span> <span style="color:#f92672">reason</span>: <span style="color:#ae81ff">RecommendationReady</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">message</span>: <span style="color:#ae81ff">Recommendation is ready</span> </span></span><span style="display:flex;"><span> <span style="color:#f92672">lastUpdateTime</span>: <span style="color:#e6db74">&#39;2022-11-29T11:07:45Z&#39;</span> -</span></span></code></pre></div><p>The recommendedInfo displays the recommended replica count, and the currentInfo displays the current replica count in JSON format. The recommended results can be updated to TargetRef using Kubectl Patch.</p> -<p>The replica recommendation process is completed in the following steps:</p> +</span></span></code></pre></div><p>recommendedInfo 显示了推荐的副本数,currentInfo 显示了当前的副本数,格式是 Json ,可以通过 Kubectl Patch 将推荐结果更新到 TargetRef</p> +<p>副本推荐按以下步骤完成一次推荐过程:</p> <ol> -<li>Obtain the CPU and memory usage history of the workload in the past week through monitoring data.</li> -<li>Use the DSP algorithm to predict the future CPU usage for the next week.</li> -<li>Calculate the replica count for CPU and memory separately, and take the larger value.</li> +<li>通过监控数据,获取 Workload 过去一周的 CPU 和 Memory 历史用量。</li> +<li>用 DSP 算法预测未来一周 CPU 用量</li> +<li>分别计算 CPU 和 内存分别对应的副本数,取较大值</li> </ol> -<h4 id="compute-replica-algorithm">Compute replica algorithm</h4> -<p>Taking CPU as an example, assuming that the P99 of the historical CPU usage of the workload is 10 cores, and the Pod CPU Request is 5 cores, the target peak utilization is 50%. It can be inferred that 4 replicas are needed to meet the requirement of the peak utilization not being less than 50%.</p> -<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#ae81ff">replicas := int32(math.Ceil(workloadUsage / (TargetUtilization * float64(requestTotal))))</span> -</span></span></code></pre></div><h3 id="differences-with-the-community">Differences with the community</h3> -<p>According to the resource optimization model, the recommendation framework can reduce the Request of the application to the weekly peak, and the recommendation framework only provides specification recommendations without executing changes, which is more secure and applicable to more business types. If further Request reduction is needed, HPA and other solutions can be considered.</p> +<h4 id="计算副本算法">计算副本算法</h4> +<p>以 CPU 举例,假设工作负载 CPU 历史用量的 P99 是10核,Pod CPU Request 是5核,目标峰值利用率是50%,可知副本数是4个可以满足峰值利用率不小于50%。</p> +<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#ae81ff">replicas := int32(math.Ceil(workloadUsage / (TargetUtilization * float64(requestTotal) )))</span> +</span></span></code></pre></div><h3 id="和社区的差异">和社区的差异</h3> +<p>由资源优化模型可知,推荐框架能够将应用的 Request 降低到周峰值,并且推荐框架只做规格推荐,不执行变更,安全性更高、适用于更多业务类型。如果需要进一步降低 Request,可以考虑通过 HPA 等方案实现。</p> <table> <thead> <tr> <th></th> -<th>utilization rate</th> -<th>management configuration type</th> -<th>change type</th> +<th>利用率</th> +<th>管理配置类型</th> +<th>变更类型</th> </tr> </thead> <tbody> <tr> -<td>Community HPA</td> -<td>average utilization rate</td> -<td>replica number</td> -<td>automatic scaling</td> +<td>社区 HPA</td> +<td>平均利用率</td> +<td>副本数</td> +<td>自动变更</td> </tr> <tr> -<td>Community VPA</td> -<td>approximate peak utilization rate</td> -<td>resource Request</td> -<td>automatic scaling/recommendation</td> +<td>社区 VPA</td> +<td>近似峰值利用率</td> +<td>资源 Request</td> +<td>自动变更/建议</td> </tr> <tr> -<td>Crane recommendation framework</td> -<td>weekly peak utilization rate</td> -<td>replica number + resources Request</td> -<td>automatic scaling/recommendation</td> +<td>Crane 推荐框架</td> +<td>周峰值利用率</td> +<td>副本数+资源 Request</td> +<td>自动变更/建议</td> </tr> <tr> -<td>advantages of the recommendation framework</td> -<td>Although the weekly peak utilization rate provides relatively small cost reduction space, it is simple to configure, safer, and applicable to more types of applications.</td> -<td>Both replica number and resource Request can be recommended simultaneously, and adjustments can be made as needed.</td> -<td>Provide recommendation suggestions through CRD/Metric, which is convenient for integration into user systems. In the future, it will support automatic updates through CICD</td> +<td>推荐框架的优势</td> +<td>虽然周峰值利用率带来的降本空间较小,但是配置简单,更加安全,适用更多应用类型</td> +<td>可以同时推荐副本数+资源 Request,按需调整</td> +<td>提供CRD/Metric方式的推荐建议,方便集成用户的系统,未来支持通过CICD实现自动更新</td> </tr> </tbody> </table> -<h2 id="best-practices">Best practices</h2> -<p>FinOps recommends using an iterative approach to manage variable costs of cloud services. The continuous management iteration consists of three phases: cost observation (Inform), cost analysis (Recommend), and cost optimization (Operate). In the following section, we will introduce how to use Crane for K8S resource configuration management based on these three phases and the internal practice experience of Tencent.</p> -<h3 id="cost-monitoring--calculating-costsbenefits">Cost Monitoring&ndash;Calculating Costs/Benefits</h3> -<p>Cost observation is the core key to the cost reduction journey. Only by setting clear goals can cost reduction optimization be targeted. Therefore, users need to establish a monitoring and observation system for cluster resources to evaluate whether cost reduction and efficiency improvement are necessary. For example, what is the packing rate of the cluster? What is the average/peak utilization rate of the cluster? What is the resource usage distribution of each Namespace, and what is the average/peak utilization rate of each Workload?</p> -<h3 id="cost-analysis--establishing-systems">Cost Analysis&ndash;Establishing Systems</h3> -<p>The Crane recommendation framework provides a complete set of analysis and optimization tools for full-fledged analysis of cluster resources, and records the recommended results in CRD and Metrics for easy integration into business systems.</p> -<p>The practice within Tencent is as follows:</p> +<h2 id="最佳实践">最佳实践</h2> +<p>FinOps 建议采用迭代方法来管理云服务的可变成本。持续管理的迭代由三个阶段组成:成本观测(Inform)、 成本分析(Recommend)和 成本优化(Operate)。下面我们将基于这三个阶段+腾讯内部的实践经验介绍如何使用 Crane 实现 K8S 资源的配置管理。</p> +<h3 id="成本观测--计算成本收益">成本观测&ndash;计算成本/收益</h3> +<p>成本观测是降本之旅的核心关键。只有明确了目标,降本优化才会有的放矢。因此,用户需要建立集群资源的监控观测系统,来评估是否需要进行降本增效。例如,集群的装箱率是多少?集群的平均/峰值利用率是多少?Namespace 的资源用量分布,Workload 的平均/峰值利用率是多少?</p> +<h3 id="成本分析--建立系统">成本分析&ndash;建立系统</h3> +<p>Crane 的推荐框架提供了一整套分析优化的工具对集群资源进行全方位的分析,并且将推荐结果记录到 CRD 和 Metric,方便业务系统集成。</p> +<p>腾讯内部的实践是:</p> <ol> -<li>Use RecommendationRule to recommend resources and replicas for all workloads in the cluster, updated every 12 hours.</li> -<li>Display the complete recommendation results separately in the control interface.</li> -<li>Display resource/replica recommendations on the workload data display page.</li> -<li>Display observation data of the workload in Grafana charts.</li> -<li>Provide OpenAPI for businesses to obtain recommendations and optimize them according to business needs.</li> +<li>通过 RecommendationRule 对集群中所有的 Workload 进行资源和副本推荐,每12小时更新一次</li> +<li>在管控界面单独展示完整的推荐结果</li> +<li>在 Workload 数据展示页面展示资源/副本推荐</li> +<li>在 Grafana 图表中展示 Workload 的观测数据</li> +<li>提供 OpenAPI 让业务方获取推荐建议,按业务需求进行优化</li> </ol> -<h3 id="cost-optimization--progressive-recommendations">Cost Optimization&ndash;Progressive Recommendations</h3> -<p>The FinOps Foundation has defined a &ldquo;crawl, walk, run&rdquo; maturity method for FinOps, enabling enterprises to start small and gradually expand in scale, scope, and complexity. Similarly, the premise of cost reduction is to ensure stability, as changes in resource configuration and unreasonable configurations may affect business stability. User optimization processes should follow the same approach:</p> -<ol> -<li>Verify the accuracy of the configuration in the CI/CD environment before updating the production environment.</li> -<li>Optimize businesses with severe waste first, and then optimize businesses with relatively low configurations.</li> -<li>Optimize non-core businesses first, and then optimize core businesses.</li> -<li>Configure recommended parameters based on business characteristics: Online businesses require more resource buffers, while offline businesses can accept higher utilization rates.</li> -<li>The release platform prompts users with recommended configurations and updates only after confirmation to prevent unexpected online changes.</li> -<li>Some business clusters automatically update workload configurations based on recommended suggestions to achieve higher utilization rates.</li> -</ol> -<p>In the book &ldquo;Cloud FinOps&rdquo; which introduces FinOps, it shares an example of a Fortune 500 company optimizing resources through an automated system, with the following workflow:</p> +<h3 id="成本优化--渐进式推进">成本优化&ndash;渐进式推进</h3> +<p>FinOps 基金会定义了关于 FinOps 的“爬、走、跑”的成熟度方法,使企业能够从小处着手,并在规模、范围和复杂性上不断扩大。同样的,降本的前提是稳定性保证不受影响,资源配置的变更发布和不合理的配置可能会影响业务稳定性,用户的优化过程也要遵循同样的方式:</p> +<p>1.先在 CI/CD 环境验证配置的准确性再更新生产环境。 +2.先优化浪费严重的业务,再优化已经比较低配置的业务 +3.先优化非核心业务,再优化核心业务 +4.根据业务特征配置推荐参数:线上业务需要更多的资源 buffer 而离线业务则可以接受更高的利用率。 +5.发布平台通过提示用户建议的配置,让用户确认后再更新以防止意料之外的线上变更。 +6.部分业务集群通过自动化工具自动依据推荐建议更新 Workload 配置以实现更高的利用率。</p> +<p>在介绍 FinOps 的书籍《Cloud FinOps》中它分享了一个世界500强公司通过自动化系统进行资源优化的例子,工作流如下:</p> <p><img src="/images/resource-flow.png" alt="Resource flow"></p> -<p>Automated configuration optimization is considered an advanced stage in FinOps and is recommended for use in the advanced stages of FinOps implementation. However, you should consider tracking the recommendations and have the corresponding team manually implement the necessary changes.</p> -<h2 id="roadmap">Roadmap</h2> -<p>Whether or not resource optimization is needed, Crane can be used as a trial object when practicing FinOps. You can first understand the current state of the Kubernetes cluster through cost display, and choose the optimization method based on the problem. Resource configuration optimization, as introduced in this article, is the most direct and common method.</p> -<p>In the future, the Crane recommendation framework will evolve towards more accurate, intelligent, and rich goals:</p> -<p>-Integration with CI/CD frameworks: Automated configuration updates can further improve utilization rates compared to manual updates and are suitable for business scenarios with higher resource utilization rates. --Cost left shift: Discover and solve resource waste earlier through configuration optimization in the CI/CD stage. --Configuration recommendation based on application load characteristics: Identify load patterns and burst tasks based on algorithms and provide reasonable recommendations. --Resource recommendation for task types: Currently, more support is provided for long-running online businesses, but resource recommendations can also optimize configuration for task-type applications. --Analysis of more types of idle resources in Kubernetes: Scan idle resources in the cluster, such as Load Balancer/Storage/Node/GPU.</p> -<h2 id="appendix">Appendix</h2> +<p>自动的配置优化在 FinOps 中属于高级阶段,推荐在实践 FinOps 的高级阶段中使用。不过至少,你应该考虑跟踪你的推荐,并且让对应的团队手动执行所需的变更。</p> +<h2 id="展望未来">展望未来</h2> +<p>无论是否需要资源优化,当你希望实践 FinOps 时,Crane 都可以作为尝试对象。你可以首先通过集群的成本展示了解当前的 Kubernetes 集群的现状,并根据问题所在选择优化的方式,而本文介绍的资源配置优化是最直接和最常见的手段。</p> +<p>未来 Crane 的推荐框架将朝着更准确、更智能、更丰富的目标演进:</p> +<ul> +<li>集成 CI/CD 框架:相比手动更新,自动化方式的配置更新能进一步提升利用率,适用于对资源利用率更高的业务场景。</li> +<li>成本左移:在 CI/CD 阶段通过配置优化尽早的发现资源浪费并解决它们。</li> +<li>基于应用负载特征的配置推荐:基于算法识别负载规律型业务和突发任务型业务,并给出合理的推荐。</li> +<li>任务类型的资源推荐:目前支持的更多是 Long Running 的在线业务,任务类型的应用也可以通过资源推荐优化配置。</li> +<li>更多 Kubernetes 闲置资源类型的分析:扫描集群中闲置的资源,例如 Load Balancer/Storage/Node/GPU。</li> +</ul> +<h2 id="附录">附录</h2> <p>1.The Top 12 Kubernetes Resource Risks: K8s Best Practices: <a href="https://www.densify.com/resources/k8s-resource-risks">Top 12 Kubernetes Resource Risks</a></p>Docs: How Kujiale achieve autoscaling with Crane EHPA/docs/best-practices/how-kujiale-adopt-ehpa/Mon, 01 Jan 0001 00:00:00 +0000/docs/best-practices/how-kujiale-adopt-ehpa/ <p>The original article is:<a href="https://mp.weixin.qq.com/s/3X_hHbisynxDwWx9Lnbp-w">How Kujiale achieve autoscaling with Crane EHPA</a></p> \ No newline at end of file diff --git a/docs/contributing/code-standards/index.html b/docs/contributing/code-standards/index.html index 010884f1e..b859fac16 100644 --- a/docs/contributing/code-standards/index.html +++ b/docs/contributing/code-standards/index.html @@ -1,7 +1,7 @@ Code Standard | Crane