[pyroscope.java] Clean-up JFR files left behind by previous instances of alloy to reduce risk of filling up disk when alloy's in crash loop #2317
base: main
```diff
@@ -7,6 +7,8 @@ import (
 	_ "embed"
 	"fmt"
 	"os"
+	"path/filepath"
+	"regexp"
 	"strconv"
 	"strings"
 	"sync"
```
```diff
@@ -23,7 +25,12 @@ import (
 	gopsutil "github.com/shirou/gopsutil/v3/process"
 )
 
-const spyName = "alloy.java"
+const (
+	spyName       = "alloy.java"
+	processJfrDir = "/tmp"
+)
+
+var jfrFileNamePattern = regexp.MustCompile("^asprof-\\d+-\\d+\\.jfr$")
 
 type profilingLoop struct {
 	logger log.Logger
```
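For reference, a quick, self-contained sketch (not part of the PR) of what the pattern matches; the sample file names are illustrative:

```go
package main

import (
	"fmt"
	"regexp"
)

var jfrFileNamePattern = regexp.MustCompile("^asprof-\\d+-\\d+\\.jfr$")

func main() {
	for _, name := range []string{
		"asprof-1234-5678.jfr",     // matches: alloy pid 1234, target pid 5678
		"asprof-1234-5678.jfr.tmp", // no match: trailing suffix
		"recording.jfr",            // no match: different naming scheme
	} {
		fmt.Printf("%-26s %v\n", name, jfrFileNamePattern.MatchString(name))
	}
}
```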
```diff
@@ -45,14 +52,15 @@ type profilingLoop struct {
 func newProfilingLoop(pid int, target discovery.Target, logger log.Logger, profiler *asprof.Profiler, output *pyroscope.Fanout, cfg ProfilingConfig) *profilingLoop {
 	ctx, cancel := context.WithCancel(context.Background())
 	dist, err := profiler.DistributionForProcess(pid)
+	jfrFileName := fmt.Sprintf("asprof-%d-%d.jfr", os.Getpid(), pid)
 	p := &profilingLoop{
 		logger:   log.With(logger, "pid", pid),
 		output:   output,
 		pid:      pid,
 		target:   target,
 		cancel:   cancel,
 		dist:     dist,
-		jfrFile:  fmt.Sprintf("/tmp/asprof-%d-%d.jfr", os.Getpid(), pid),
+		jfrFile:  filepath.Join(processJfrDir, jfrFileName),
 		cfg:      cfg,
 		profiler: profiler,
 	}
```
```diff
@@ -63,6 +71,16 @@ func newProfilingLoop(pid int, target discovery.Target, logger log.Logger, profi
 		return p
 	}
 
+	p.wg.Add(1)
+	go func() {
+		defer p.wg.Done()
+		// Clean up files that weren't removed by a previous instance of alloy
+		err := p.cleanupOldJFRFiles(jfrFileName)
+		if err != nil {
+			_ = level.Warn(p.logger).Log("msg", "failed cleaning-up java jfr files created by a previous instance of alloy", "err", err)
+		}
+	}()
+
 	p.wg.Add(1)
 	go func() {
 		defer p.wg.Done()
```

Review comment from **dehaansa** on the new goroutine:

> Should this happen in a goroutine? I would expect that we want the clean-up to complete before the regular loop begins, so starting this and letting the internal scheduler decide what goes first doesn't seem right.

Reply from the PR author:

> Hi @dehaansa, I was thinking the clean-up can safely run in parallel to the loop as a small optimization. It has a check to make sure the JFR file eventually created by the loop isn't accidentally deleted during clean-up, since that file always has the same name for the current process running alloy. Let me know if you think it's still worth changing though!
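If the reviewers prefer the synchronous approach, a minimal sketch (not part of the PR; a fragment reusing `p` and `jfrFileName` from the diff above) could run the clean-up to completion before the profiling loop starts:

```go
// Run the clean-up before starting the profiling loop, so goroutine
// scheduling cannot interleave the two. Failure is still only logged.
if err := p.cleanupOldJFRFiles(jfrFileName); err != nil {
	_ = level.Warn(p.logger).Log("msg", "failed cleaning-up java jfr files created by a previous instance of alloy", "err", err)
}

p.wg.Add(1)
go func() {
	defer p.wg.Done()
	// ... the regular profiling loop, unchanged ...
}()
```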
```diff
@@ -275,3 +293,22 @@ func (p *profilingLoop) alive() bool {
 	}
 	return err == nil && exists
 }
+
+func (p *profilingLoop) cleanupOldJFRFiles(myFileName string) error {
+	dir := asprof.ProcessPath(processJfrDir, p.pid)
+	files, err := os.ReadDir(dir)
+	if err != nil {
+		return err
+	}
+
+	for _, file := range files {
+		if !file.IsDir() && jfrFileNamePattern.MatchString(file.Name()) && file.Name() != myFileName {
+			_ = level.Debug(p.logger).Log("msg", "deleting jfr file created by previous alloy process", "file", file.Name())
+			err := os.Remove(filepath.Join(dir, file.Name()))
+			if err != nil {
+				return err
+			}
+		}
+	}
+	return nil
+}
```
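One detail worth calling out: `asprof.ProcessPath` resolves the directory inside the target process's mount namespace, so the clean-up scans the profiled container's /tmp rather than Alloy's own. A rough sketch of the idea behind such a helper (an assumption about its behavior, not the actual implementation):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
)

// processPath sketches what a helper like asprof.ProcessPath is assumed
// to do: resolve a path through procfs into the mount namespace of the
// process with the given pid.
func processPath(path string, pid int) string {
	return filepath.Join("/proc", strconv.Itoa(pid), "root", path)
}

func main() {
	fmt.Println(processPath("/tmp", 4242)) // /proc/4242/root/tmp
}
```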
Review comment from **ptodev**:

> It's odd that the process has been using `/tmp` for this. It'd make more sense to use the directory specified by the `--storage.path` cmd arg, similarly to other processes such as `loki.file` and `prometheus.remote_write`.
>
> @grafana/grafana-alloy-profiling-maintainers Would you mind if we change `pyroscope.java` to use the storage path instead, please? I don't feel comfortable with Alloy deleting anything outside of that directory.
>
> Note that if we do this, we'd have to update the docs too. They currently mention that `tmp_dir` is used, and that the directory is only writable by root.
>
> I suppose using the storage path still won't be a problem, since root should also have access to it? I don't know why the docs say "only writable by root", though.
Reply from the PR author:

> Hi @ptodev, thanks for reviewing. Maybe this was to distinguish the storage directory on alloy containers from a directory on the profiled java process's container. `tmp_dir` needs to be a directory on the java container for async-profiler to work.
>
> I can change the JFR files to be written to `tmp_dir` as well though, since right now only the async-profiler binaries are written to that path.
Follow-up comment:

> Another option could be changing the file name to something less likely to conflict, like by prepending `alloy-`.
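A minimal sketch of that option (illustrative only, mirroring the naming scheme from the diff with an `alloy-` prefix added):

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical variant: prefix the file name with "alloy-" so the
// clean-up only ever matches files Alloy itself created.
var jfrFileNamePattern = regexp.MustCompile("^alloy-asprof-\\d+-\\d+\\.jfr$")

func jfrFileName(alloyPid, targetPid int) string {
	return fmt.Sprintf("alloy-asprof-%d-%d.jfr", alloyPid, targetPid)
}

func main() {
	name := jfrFileName(1234, 5678)
	fmt.Println(name, jfrFileNamePattern.MatchString(name)) // alloy-asprof-1234-5678.jfr true
}
```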