[pyroscope.java] Clean up JFR files left behind by previous instances of alloy to reduce the risk of filling up the disk when alloy is in a crash loop #2317
Conversation
Hi, thank you for the PR! I'm worried about deleting files in directories which Alloy doesn't "own", so it'd be nice if we use the --storage.path dir instead.
-const spyName = "alloy.java"
+const (
+	spyName       = "alloy.java"
+	processJfrDir = "/tmp"
It's odd that the process has been using /tmp for this. It'd make more sense to use the directory specified by the --storage.path cmd arg, similarly to other processes such as loki.file and prometheus.remote_write.

@grafana/grafana-alloy-profiling-maintainers Would you mind if we change pyroscope.java to use the storage path instead please? I don't feel comfortable with Alloy deleting anything outside of that directory.

Note that if we do this we'd have to update the docs too. They currently mention that tmp_dir is used. The docs also mention this:

> The asprof binary runs with root permissions. If you change the tmp_dir configuration to something other than /tmp, then you must ensure that the directory is only writable by root.

I suppose using the storage path still won't be a problem, since root should also have access to it? I'm not sure why the docs say "only writable by root" though.
Hi @ptodev, thanks for reviewing!

> It's odd that the process has been using /tmp for this. It'd make more sense to use the directory specified by the --storage.path cmd arg

Maybe this was to distinguish the storage directory on alloy containers from a directory on the profiled java process's container. tmp_dir needs to be a directory on the java container for async-profiler to work.

I can change the JFR files to be written to tmp_dir as well though, since right now only the async-profiler binaries are written to that path.
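A minimal sketch of that change, assuming a configurable tmp_dir value is threaded through to where the JFR path is built; the function name and file-name format below are illustrative, not the component's actual code:

```go
package javaprofiler

import (
	"fmt"
	"os"
	"path/filepath"
)

// jfrOutputPath builds the JFR output path under tmpDir instead of a
// hard-coded "/tmp", falling back to the platform temp dir when unset.
// Including Alloy's own PID lets files from previous, crashed instances be
// told apart from the file this instance will write.
func jfrOutputPath(tmpDir string, targetPID int) string {
	if tmpDir == "" {
		tmpDir = os.TempDir()
	}
	return filepath.Join(tmpDir, fmt.Sprintf("asprof-%d-%d.jfr", os.Getpid(), targetPID))
}
```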
> Hi, thank you for the PR! I'm worried about deleting files in directories which Alloy doesn't "own", so it'd be nice if we use the --storage.path dir instead.

Another option could be changing the file name to something less likely to conflict, like by prepending alloy- to it.
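For illustration, a sketch of that naming idea under the assumption that the clean-up then filters on the prefix; the prefix and helper names are hypothetical:

```go
package javaprofiler

import (
	"fmt"
	"os"
	"strings"
)

// jfrFilePrefix is an illustrative prefix; the component does not currently use it.
const jfrFilePrefix = "alloy-asprof-"

// jfrFileName builds a prefixed, per-instance JFR file name for a target PID.
func jfrFileName(targetPID int) string {
	return fmt.Sprintf("%s%d-%d.jfr", jfrFilePrefix, os.Getpid(), targetPID)
}

// isAlloyJFRFile reports whether a file name looks like one Alloy created,
// so clean-up never touches unrelated files in a shared directory like /tmp.
func isAlloyJFRFile(name string) bool {
	return strings.HasPrefix(name, jfrFilePrefix) && strings.HasSuffix(name, ".jfr")
}
```

Filtering on a dedicated prefix would keep the clean-up from ever considering files created by other programs in a shared directory.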
@@ -63,6 +71,16 @@ func newProfilingLoop(pid int, target discovery.Target, logger log.Logger, profi
 	return p
 }

+	p.wg.Add(1)
+	go func() {
Should this happen in a goroutine? I would expect that we want cleanup to complete before the regular loop begins, so starting this and allowing the internal scheduler to decide what goes first doesn't seem right.
Hi @dehaansa, I was thinking the clean-up can safely run in parallel to the loop as a small optimization. It has a check to make sure the JFR file eventually created by the loop isn't accidentally deleted during clean-up, since that file always has the same name for the current process running alloy.
Let me know if you think it's still worth changing though!
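To make the trade-off concrete, here is a minimal sketch of such a clean-up routine with the skip check described above; the directory handling, asprof- naming convention, and function names are assumptions, not the PR's actual implementation. It could be invoked either synchronously before the loop starts or from the goroutine shown in the diff:

```go
package javaprofiler

import (
	"os"
	"path/filepath"
	"strings"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
)

// cleanStaleJFRFiles deletes asprof JFR files left behind by previous Alloy
// instances. currentJFRName is the file name the running instance writes, so
// it is skipped even if it already exists on disk.
func cleanStaleJFRFiles(logger log.Logger, jfrDir, currentJFRName string) {
	entries, err := os.ReadDir(jfrDir)
	if err != nil {
		level.Warn(logger).Log("msg", "failed to list JFR directory", "dir", jfrDir, "err", err)
		return
	}
	for _, e := range entries {
		name := e.Name()
		if e.IsDir() || !strings.HasPrefix(name, "asprof-") || !strings.HasSuffix(name, ".jfr") {
			continue // not a JFR file produced by this component
		}
		if name == currentJFRName {
			continue // belongs to the current instance's profiling loop
		}
		if err := os.Remove(filepath.Join(jfrDir, name)); err != nil {
			level.Warn(logger).Log("msg", "failed to remove stale JFR file", "file", name, "err", err)
		}
	}
}
```

Calling it synchronously before p.wg.Add(1) would address the ordering concern, at the cost of a slightly slower start-up.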
PR Description
Cleans up unused JFR files left behind on profiling targets' filesystems by previously terminated instances of alloy. This helps reduce the risk of files piling up when alloy is constantly restarting, for example if kubernetes memory limits are too low.
Which issue(s) this PR fixes
Fixes #1960
Notes to the Reviewer
I tested this manually on our EKS cluster with amd64 linux nodes. I both manually created asprof files and used some left behind by terminated alloy pods to verify that this new code is executed once during start-up. I couldn't find any unit or integration tests for this component, so let me know if there are suggestions for further testing.
PR Checklist