[pyroscope.java] Clean up JFR files left behind by previous instances of alloy to reduce the risk of filling up the disk when alloy is in a crash loop #2317
Conversation
Hi, thank you for the PR! I'm worried about deleting files in directories which Alloy doesn't "own", so it'd be nice if we use the --storage.path dir instead.
-const spyName = "alloy.java"
+const (
+	spyName       = "alloy.java"
+	processJfrDir = "/tmp"
It's odd that the process has been using /tmp for this. It'd make more sense to use the directory specified by the --storage.path cmd arg, similarly to other processes such as loki.file and prometheus.remote_write.

@grafana/grafana-alloy-profiling-maintainers Would you mind if we change pyroscope.java to use the storage path instead please? I don't feel comfortable with Alloy deleting anything outside of that directory.

Note that if we do this we'd have to update the docs too. They currently mention that tmp_dir is used. The docs also mention this:

> The asprof binary runs with root permissions. If you change the tmp_dir configuration to something other than /tmp, then you must ensure that the directory is only writable by root.

I suppose using the storage path still won't be a problem, since root should also have access to it? I'm not sure why the docs say "only writable by root" though.
Hi @ptodev, thanks for reviewing!

> It's odd that the process has been using /tmp for this. It'd make more sense to use the directory specified by the --storage.path cmd arg

Maybe this was to distinguish the storage directory on alloy containers from a directory on the profiled java process's container. tmp_dir needs to be a directory on the java container for async-profiler to work.

I can change the JFR files to be written to tmp_dir as well though, since right now only the async-profiler binaries are written to that path.
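A minimal sketch of that change, assuming a configurable tmp_dir value is threaded through to where the JFR path is built; the function name and file-name format below are illustrative, not the component's actual code:

```go
package javaprofiler

import (
	"fmt"
	"os"
	"path/filepath"
)

// jfrOutputPath builds the JFR output path under tmpDir instead of a
// hard-coded "/tmp", falling back to the platform temp dir when unset.
// Including Alloy's own PID lets files from previous, crashed instances be
// told apart from the file this instance will write.
func jfrOutputPath(tmpDir string, targetPID int) string {
	if tmpDir == "" {
		tmpDir = os.TempDir()
	}
	return filepath.Join(tmpDir, fmt.Sprintf("asprof-%d-%d.jfr", os.Getpid(), targetPID))
}
```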
> Hi, thank you for the PR! I'm worried about deleting files in directories which Alloy doesn't "own", so it'd be nice if we use the --storage.path dir instead.

Another option could be changing the file name to something less likely to conflict, like by prepending alloy- to it.
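For illustration, a sketch of that naming idea under the assumption that the clean-up then filters on the prefix; the prefix and helper names are hypothetical:

```go
package javaprofiler

import (
	"fmt"
	"os"
	"strings"
)

// jfrFilePrefix is an illustrative prefix; the component does not currently use it.
const jfrFilePrefix = "alloy-asprof-"

// jfrFileName builds a prefixed, per-instance JFR file name for a target PID.
func jfrFileName(targetPID int) string {
	return fmt.Sprintf("%s%d-%d.jfr", jfrFilePrefix, os.Getpid(), targetPID)
}

// isAlloyJFRFile reports whether a file name looks like one Alloy created,
// so clean-up never touches unrelated files in a shared directory like /tmp.
func isAlloyJFRFile(name string) bool {
	return strings.HasPrefix(name, jfrFilePrefix) && strings.HasSuffix(name, ".jfr")
}
```

Filtering on a dedicated prefix would keep the clean-up from ever considering files created by other programs in a shared directory.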
@@ -63,6 +71,16 @@ func newProfilingLoop(pid int, target discovery.Target, logger log.Logger, profi
 	return p
 }

+	p.wg.Add(1)
+	go func() {
Should this happen in a goroutine? I would expect that we want cleanup to complete before the regular loop begins, so starting this and allowing the internal scheduler to decide what goes first doesn't seem right.
Hi @dehaansa, I was thinking the clean-up can safely run in parallel to the loop as a small optimization. It has a check to make sure the JFR file eventually created by the loop isn't accidentally deleted during clean-up, since that file always has the same name for the current process running alloy.
Let me know if you think it's still worth changing though!
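To make the trade-off concrete, here is a minimal sketch of such a clean-up routine with the skip check described above; the directory handling, asprof- naming convention, and function names are assumptions, not the PR's actual implementation. It could be invoked either synchronously before the loop starts or from the goroutine shown in the diff:

```go
package javaprofiler

import (
	"os"
	"path/filepath"
	"strings"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
)

// cleanStaleJFRFiles deletes asprof JFR files left behind by previous Alloy
// instances. currentJFRName is the file name the running instance writes, so
// it is skipped even if it already exists on disk.
func cleanStaleJFRFiles(logger log.Logger, jfrDir, currentJFRName string) {
	entries, err := os.ReadDir(jfrDir)
	if err != nil {
		level.Warn(logger).Log("msg", "failed to list JFR directory", "dir", jfrDir, "err", err)
		return
	}
	for _, e := range entries {
		name := e.Name()
		if e.IsDir() || !strings.HasPrefix(name, "asprof-") || !strings.HasSuffix(name, ".jfr") {
			continue // not a JFR file produced by this component
		}
		if name == currentJFRName {
			continue // belongs to the current instance's profiling loop
		}
		if err := os.Remove(filepath.Join(jfrDir, name)); err != nil {
			level.Warn(logger).Log("msg", "failed to remove stale JFR file", "file", name, "err", err)
		}
	}
}
```

Calling it synchronously before p.wg.Add(1) would address the ordering concern, at the cost of a slightly slower start-up.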
PR Description
Cleans up unused JFR files left behind on profiling targets' filesystems by previously terminated instances of alloy. This helps reduce the risk of files piling up when alloy is constantly restarting, for example if kubernetes memory limits are too low.
Which issue(s) this PR fixes
Fixes #1960
Notes to the Reviewer
I tested this manually on our EKS cluster with amd64 linux nodes. I both manually created asprof files and used some left behind by terminated alloy pods to verify that this new code is executed once during start-up. I couldn't find any unit or integration tests for this component, so let me know if there are suggestions for further testing.
PR Checklist