[pyroscope.java] Clean-up JFR files left behind by previous instances of alloy to reduce risk of filling up disk when alloy's in crash loop #2317

Open
swar8080 wants to merge 1 commit into main from cleanup-old-jfr-files

swar8080

PR Description

Cleans up unused JFR files left on profiling targets' filesystems by previously terminated instances of Alloy. This reduces the risk of files piling up when Alloy is constantly restarting, for example when Kubernetes memory limits are too low.

Which issue(s) this PR fixes

Fixes #1960

Notes to the Reviewer

I tested this manually on our EKS cluster with amd64 Linux nodes. To confirm that the new code runs once during start-up, I used both manually created asprof files and files left behind by terminated Alloy pods.


I couldn't find any unit or integration tests for this component, so let me know if there are suggestions for further testing.

PR Checklist

  • CHANGELOG.md updated
  • Tests updated

@swar8080 swar8080 requested review from a team as code owners December 27, 2024 16:19
@CLAassistant

CLAassistant commented Dec 27, 2024

CLA assistant check
All committers have signed the CLA.

@swar8080 swar8080 force-pushed the cleanup-old-jfr-files branch from e2a36a1 to fe958a5 Compare December 27, 2024 16:23
…mize risk of filling up disk when alloy's in a crash loop
@swar8080 swar8080 force-pushed the cleanup-old-jfr-files branch from fe958a5 to 037e9bd Compare December 27, 2024 16:24
Contributor

@ptodev ptodev left a comment


Hi, thank you for the PR! I'm worried about deleting files in directories which Alloy doesn't "own", so it'd be nice if we use the --storage.path dir instead.

-const spyName = "alloy.java"
+const (
+	spyName = "alloy.java"
+	processJfrDir = "/tmp"
Contributor


It's odd that the process has been using /tmp for this. It'd make more sense to use the directory specified by the --storage.path cmd arg, similarly to other processes such as loki.file and prometheus.remote_write.

@grafana/grafana-alloy-profiling-maintainers Would you mind if we change pyroscope.java to use the storage path instead please? I don't feel comfortable with Alloy deleting anything outside of that directory.

Note that if we do this we'd have to update the docs too. They currently mention that tmp_dir is used. The docs also mention this:

> The asprof binary runs with root permissions. If you change the tmp_dir configuration to something other than /tmp, then you must ensure that the directory is only writable by root.

I suppose using the storage path still won't be a problem, since root should also have access to it? IDK why the docs say "only writable by root" though.

Author


Hi @ptodev, thanks for reviewing.

> It's odd that the process has been using /tmp for this. It'd make more sense to use the directory specified by the --storage.path cmd arg

Maybe this was to distinguish the storage directory on Alloy containers from a directory on the profiled Java process's container. tmp_dir needs to be a directory on the Java container for async-profiler to work.

I can change the JFR files to be written to tmp_dir as well, though. Right now only the async-profiler binaries are written to that path.

Author


> Hi, thank you for the PR! I'm worried about deleting files in directories which Alloy doesn't "own", so it'd be nice if we use the --storage.path dir instead.

Another option could be changing the file name to something less likely to conflict, for example by prepending alloy-.

@@ -63,6 +71,16 @@ func newProfilingLoop(pid int, target discovery.Target, logger log.Logger, profi
return p
}

p.wg.Add(1)
go func() {
Contributor


Should this happen in a goroutine? I would expect that we want cleanup to complete before the regular loop begins, so starting this and allowing the internal scheduler to decide what goes first doesn't seem right.
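The ordering asked for here can be sketched as follows: run the clean-up to completion on the calling goroutine, and only then start the loop in its own goroutine. `startAfterCleanup` and both callbacks are hypothetical stand-ins for illustration, not the PR's actual functions.

```go
package main

import (
	"fmt"
	"sync"
)

// startAfterCleanup runs cleanup synchronously, then starts loop in a new
// goroutine tracked by the returned WaitGroup. Because cleanup returns before
// the go statement executes, the scheduler cannot reorder the two.
func startAfterCleanup(cleanup, loop func()) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	cleanup()
	wg.Add(1)
	go func() {
		defer wg.Done()
		loop()
	}()
	return wg
}

func main() {
	var order []string
	wg := startAfterCleanup(
		func() { order = append(order, "cleanup") },
		func() { order = append(order, "loop") },
	)
	wg.Wait()
	fmt.Println(order) // prints [cleanup loop]
}
```

The trade-off is that loop start-up is delayed by however long the clean-up takes, which is the cost the parallel variant tries to avoid.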

Author


Hi @dehaansa, I was thinking the clean-up could safely run in parallel with the loop as a small optimization. It checks that the JFR file eventually created by the loop isn't accidentally deleted during clean-up, since that file always has the same name for the current process running Alloy.

Let me know if you think it's still worth changing though!
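A minimal sketch of that kind of check, assuming the file name embeds the PID of the Alloy process that created it. The naming scheme `asprof-<alloyPID>-<targetPID>.jfr` and the helpers `jfrFileName` and `isStale` are assumptions for illustration, not taken from the PR's code.

```go
package main

import (
	"fmt"
	"os"
)

// jfrFileName builds the hypothetical name "asprof-<alloyPID>-<targetPID>.jfr".
func jfrFileName(alloyPID, targetPID int) string {
	return fmt.Sprintf("asprof-%d-%d.jfr", alloyPID, targetPID)
}

// isStale reports whether name follows the scheme above and was written by an
// Alloy process other than currentAlloyPID. Names that do not follow the
// scheme are never considered stale, so unrelated files are left alone.
func isStale(name string, currentAlloyPID int) bool {
	var alloyPID, targetPID int
	if _, err := fmt.Sscanf(name, "asprof-%d-%d.jfr", &alloyPID, &targetPID); err != nil {
		return false
	}
	// Rebuild the name to reject trailing junk such as "asprof-1-2.jfr.bak".
	if jfrFileName(alloyPID, targetPID) != name {
		return false
	}
	return alloyPID != currentAlloyPID
}

func main() {
	self := os.Getpid()
	fmt.Println(isStale(jfrFileName(self, 42), self))   // our own file
	fmt.Println(isStale(jfrFileName(self+1, 42), self)) // another instance's file
}
```

With a check like this, a parallel clean-up can skip the current instance's file even if the loop creates it while the clean-up is still scanning the directory.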

Successfully merging this pull request may close these issues.

[pyroscope.java] Old JFR files accumulating on disk
4 participants