[proposal] build tsdb framework and interface for refactor of metric_cache in koordlet #586

Closed
jasonliu747 opened this issue Sep 5, 2022 · 6 comments

@jasonliu747
Member

jasonliu747 commented Sep 5, 2022

What is your proposal:
See #1241 for the full description.

Why is this needed:

Is there a suggested solution, if so, please add it:

@jasonliu747 jasonliu747 added kind/proposal Create a report to help us improve help wanted Extra attention is needed area/koordlet labels Sep 5, 2022
@jasonliu747 jasonliu747 added this to the someday milestone Sep 6, 2022
@jasonliu747
Member Author

/cc @LambdaHJ
Could you please add your earlier attempts below, along with the problems you ran into? Thanks!

@LambdaHJ
Contributor

When metrics are stored in a tsdb, some business fields have to be stored as labels, as shown below:

// Requires "strconv" and "github.com/nakabonne/tstorage".
func (ts *tsstorage) InsertPodResourceMetric(n *podResourceMetric) error {
	// Unix() returns seconds since the epoch; Second() only returns the second
	// within the current minute (0-59). This assumes n.Timestamp is a time.Time
	// and the storage is configured with second-level timestamp precision.
	timestamp := n.Timestamp.Unix()
	podLabel := tstorage.Label{Name: "name", Value: n.PodUID}

	rows := []tstorage.Row{
		{
			Metric:    "pod_resource_cpu",
			Labels:    []tstorage.Label{podLabel},
			DataPoint: tstorage.DataPoint{Value: n.CPUUsedCores, Timestamp: timestamp},
		},
		{
			Metric:    "pod_resource_memory",
			Labels:    []tstorage.Label{podLabel},
			DataPoint: tstorage.DataPoint{Value: n.MemoryUsedBytes, Timestamp: timestamp},
		},
	}
	for i := range n.GPUs {
		gpu := n.GPUs[i]
		// Use the same label keys for every GPU metric so that series can be
		// matched consistently ("DeviceUUID" and "deviceUUID" were mixed before).
		gpuLabels := []tstorage.Label{
			podLabel,
			{Name: "deviceUUID", Value: gpu.DeviceUUID},
			{Name: "Minor", Value: strconv.Itoa(int(gpu.Minor))},
		}
		rows = append(rows,
			tstorage.Row{
				Metric:    "pod_resource_gpu_memory",
				Labels:    gpuLabels,
				DataPoint: tstorage.DataPoint{Value: gpu.MemoryUsed, Timestamp: timestamp},
			},
			// NOTE: as in the original snippet, the two rows below also store
			// MemoryUsed. They presumably should store the GPU's total memory
			// and SM utilization instead.
			tstorage.Row{
				Metric:    "pod_resource_gpu_total_memory",
				Labels:    gpuLabels,
				DataPoint: tstorage.DataPoint{Value: gpu.MemoryUsed, Timestamp: timestamp},
			},
			tstorage.Row{
				Metric:    "pod_resource_gpu_smutil",
				Labels:    gpuLabels,
				DataPoint: tstorage.DataPoint{Value: gpu.MemoryUsed, Timestamp: timestamp},
			},
		)
	}
	return ts.db.InsertRows(rows)
}

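This requires that the data returned by a metrics query includes the labels.
However, the embedded tsdb we use does not include labels in its query results.
See nakabonne/tstorage#36.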
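
To make the requirement concrete, here is a minimal sketch of a query that returns labels together with the data points. It is a toy in-memory store written for illustration only; `memStore`, `Series`, `Select`, and the sample pod and device values are hypothetical names, not the tstorage or Prometheus API.

```go
package main

import "fmt"

// Label mirrors the name/value pairs attached to each row in the snippet above.
type Label struct {
	Name, Value string
}

// DataPoint is a single sample.
type DataPoint struct {
	Timestamp int64
	Value     float64
}

// Series is a hypothetical query result that carries its labels with the points.
type Series struct {
	Metric string
	Labels []Label
	Points []DataPoint
}

// memStore is an illustrative in-memory store, not a real tsdb.
type memStore struct {
	series map[string]*Series // keyed by metric name plus labels
}

func newMemStore() *memStore {
	return &memStore{series: map[string]*Series{}}
}

// seriesKey builds a map key from the metric name and its labels.
// Callers are assumed to pass labels in a consistent order.
func seriesKey(metric string, labels []Label) string {
	key := metric
	for _, l := range labels {
		key += "|" + l.Name + "=" + l.Value
	}
	return key
}

// Insert appends one point to the series identified by metric and labels.
func (s *memStore) Insert(metric string, labels []Label, p DataPoint) {
	key := seriesKey(metric, labels)
	sr, ok := s.series[key]
	if !ok {
		sr = &Series{Metric: metric, Labels: labels}
		s.series[key] = sr
	}
	sr.Points = append(sr.Points, p)
}

// hasLabels reports whether every label in want appears in got.
func hasLabels(got, want []Label) bool {
	for _, w := range want {
		found := false
		for _, g := range got {
			if g == w {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// Select returns every series of the metric whose labels include all matchers,
// each with its full label set and the points in [start, end).
func (s *memStore) Select(metric string, matchers []Label, start, end int64) []Series {
	var out []Series
	for _, sr := range s.series {
		if sr.Metric != metric || !hasLabels(sr.Labels, matchers) {
			continue
		}
		res := Series{Metric: sr.Metric, Labels: sr.Labels}
		for _, p := range sr.Points {
			if p.Timestamp >= start && p.Timestamp < end {
				res.Points = append(res.Points, p)
			}
		}
		if len(res.Points) > 0 {
			out = append(out, res)
		}
	}
	return out
}

func main() {
	store := newMemStore()
	pod := Label{"name", "pod-uid-1"}
	store.Insert("pod_resource_gpu_memory", []Label{pod, {"deviceUUID", "gpu-a"}}, DataPoint{100, 1024})
	store.Insert("pod_resource_gpu_memory", []Label{pod, {"deviceUUID", "gpu-b"}}, DataPoint{100, 2048})

	// Query by pod only; the deviceUUID label comes back with each series.
	for _, sr := range store.Select("pod_resource_gpu_memory", []Label{pod}, 0, 200) {
		fmt.Println(sr.Labels, sr.Points)
	}
}
```

Querying by the pod label alone returns one series per GPU, and each result still carries its `deviceUUID` label, which is what the metric cache needs in order to tell the devices apart.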

@zwzhang0107
Contributor

We could consider using the Prometheus tsdb library:
https://github.com/prometheus/prometheus/tree/main/tsdb

@LambdaHJ
Contributor

The Prometheus tsdb has to write to disk.
image
We may need to mount a hostPath volume.

@LambdaHJ
Contributor

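Affected module: metriccache
Main code changes: storage.go, storage_tables.go
Side effects: may affect callers of metriccache
Plan:
Use the Prometheus tsdb module to store the data.
Because the tsdb must write to disk, we plan to mount an emptyDir volume to handle disk reads and writes.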
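
Since the plan depends on a writable directory, a startup check along the following lines could surface a missing or read-only mount before the tsdb is opened. This is only a sketch using the Go standard library; `ensureWritableDir` and the directory name are hypothetical, and a temporary directory stands in for the emptyDir mount point.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureWritableDir creates dir if needed and verifies that a file can be
// written to it, which a disk-backed tsdb requires.
func ensureWritableDir(dir string) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return fmt.Errorf("create %s: %w", dir, err)
	}
	probe, err := os.CreateTemp(dir, ".write-probe-*")
	if err != nil {
		return fmt.Errorf("write probe in %s: %w", dir, err)
	}
	name := probe.Name()
	if err := probe.Close(); err != nil {
		return err
	}
	// Clean up the probe file so the data directory stays empty.
	return os.Remove(name)
}

func main() {
	// Hypothetical path; in a pod this would be the emptyDir mount point.
	dataDir := filepath.Join(os.TempDir(), "koordlet-tsdb-example")
	if err := ensureWritableDir(dataDir); err != nil {
		fmt.Println("tsdb data dir not usable:", err)
		return
	}
	fmt.Println("tsdb data dir ready:", dataDir)
}
```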

@FillZpp FillZpp moved this to 📋 Backlog in Koordinator Backlog Nov 29, 2022
@zwzhang0107 zwzhang0107 modified the milestones: someday, v1.3 Apr 11, 2023
@saintube saintube moved this from 📋 Backlog to 🏗 In progress in Koordinator Backlog Apr 18, 2023
@zwzhang0107 zwzhang0107 changed the title from "[proposal] metric_cache storage module design and refactoring" to "[proposal] refactor metric cache module in koordlet with tsdb storage" Apr 23, 2023
@zwzhang0107 zwzhang0107 changed the title from "[proposal] refactor metric cache module in koordlet with tsdb storage" to "[proposal] refactor metric_cache module in koordlet with tsdb-type storage" Apr 23, 2023
@zwzhang0107 zwzhang0107 changed the title from "[proposal] refactor metric_cache module in koordlet with tsdb-type storage" to "[proposal] refactor metric cache module in koordlet with tsdb-type storage" Apr 23, 2023
@zwzhang0107 zwzhang0107 changed the title from "[proposal] refactor metric cache module in koordlet with tsdb-type storage" to "[proposal] build tsdb framework and interface for refactor of metric_cache in koordlet" Apr 23, 2023
@zwzhang0107
Contributor

zwzhang0107 commented Apr 28, 2023

Resource consumption comparison: before and after saving the pod and container throttled ratio to tsdb.

We deployed koordlet v1.2 and the latest version with the tsdb refactor on the same node, and started 60 pods on it for data collection.

In v1.2, only the latest 30 minutes of metrics are saved in sqlite; in the latest version, the time range is extended to 24h.

image

image

image

So we now set the default time range to 12h. In the future, we should consider saving coarse-grained metrics for older data.
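
One possible reading of "coarse-grained" is to average older samples into fixed-width time buckets. The sketch below shows that idea only; `Point`, `Downsample`, and the 60-second step are assumptions for illustration and are not part of koordlet.

```go
package main

import "fmt"

// Point is one sample; Timestamp is in Unix seconds.
type Point struct {
	Timestamp int64
	Value     float64
}

// Downsample averages points into fixed-width buckets of step seconds.
// The input must be sorted by timestamp; each output point is stamped with
// the start of its bucket.
func Downsample(points []Point, step int64) []Point {
	if step <= 0 || len(points) == 0 {
		return nil
	}
	var out []Point
	bucket := points[0].Timestamp - points[0].Timestamp%step
	sum, n := 0.0, 0
	for _, p := range points {
		b := p.Timestamp - p.Timestamp%step
		if b != bucket {
			// The bucket changed: emit the average of the finished bucket.
			out = append(out, Point{bucket, sum / float64(n)})
			bucket, sum, n = b, 0, 0
		}
		sum += p.Value
		n++
	}
	return append(out, Point{bucket, sum / float64(n)})
}

func main() {
	raw := []Point{{0, 1}, {30, 3}, {60, 10}, {90, 20}}
	fmt.Println(Downsample(raw, 60)) // [{0 2} {60 15}]
}
```

Averaging is only one choice; for a metric such as throttled ratio, keeping the maximum per bucket may preserve spikes better.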

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Koordinator Backlog May 9, 2023