[pyroscope.java] Clean-up JFR files left behind by previous instances of alloy to reduce risk of filling up disk when alloy's in crash loop #2317

Open
wants to merge 1 commit into
base: main
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -70,6 +70,8 @@ Main (unreleased)

- Fixed an issue where the `otelcol.processor.interval` could not be used because the debug metrics were not set to default. (@wildum)

- Fixed an issue where `pyroscope.java` did not remove unused JFR files created by previous Alloy instances. (@swar8080)

### Other changes

- Change the stability of the `livedebugging` feature from "experimental" to "generally available". (@wildum)
41 changes: 39 additions & 2 deletions internal/component/pyroscope/java/loop.go
@@ -7,6 +7,8 @@ import (
_ "embed"
"fmt"
"os"
"path/filepath"
"regexp"
"strconv"
"strings"
"sync"
@@ -23,7 +25,12 @@ import (
gopsutil "github.com/shirou/gopsutil/v3/process"
)

-const spyName = "alloy.java"
const (
	spyName       = "alloy.java"
	processJfrDir = "/tmp"
Contributor

It's odd that the process has been using /tmp for this. It'd make more sense to use the directory specified by the --storage.path cmd arg, similarly to other processes such as loki.file and prometheus.remote_write.

@grafana/grafana-alloy-profiling-maintainers Would you mind if we change pyroscope.java to use the storage path instead please? I don't feel comfortable with Alloy deleting anything outside of that directory.

Note that if we do this we'd have to update the docs too. They currently mention that tmp_dir is used. The docs also mention this:

> The asprof binary runs with root permissions. If you change the tmp_dir configuration to something other than /tmp, then you must ensure that the directory is only writable by root.

I suppose using the storage path still won't be a problem, since root should also have access to it? IDK why the docs say "only writable by root" though.

Author

Hi @ptodev, thanks for reviewing.

> It's odd that the process has been using /tmp for this. It'd make more sense to use the directory specified by the --storage.path cmd arg

Maybe this was to distinguish the storage directory on Alloy containers from a directory on the profiled Java process's container. tmp_dir needs to be a directory on the Java container for async-profiler to work.

I can change the JFR files to be written to tmp_dir as well though, since right now only the async-profiler binaries are written to that path.

Author

> Hi, thank you for the PR! I'm worried about deleting files in directories which Alloy doesn't "own", so it'd be nice if we use the --storage.path dir instead.

Another option could be changing the file name to something less likely to conflict, for example by prepending `alloy-`.
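
The file-name idea above could look roughly like the following. This is only a sketch: the `alloy-asprof` prefix and the helper names `jfrName` and `isStale` are hypothetical and do not appear in this PR. The point is that the pattern used for clean-up must change together with the name used for new files.

```go
package main

import (
	"fmt"
	"regexp"
)

// jfrPrefix is a hypothetical, more specific prefix for JFR files.
const jfrPrefix = "alloy-asprof"

// prefixedJfrPattern only matches names produced by jfrName below.
var prefixedJfrPattern = regexp.MustCompile(`^alloy-asprof-\d+-\d+\.jfr$`)

// jfrName builds the file name from the Alloy PID and the profiled PID.
func jfrName(alloyPid, targetPid int) string {
	return fmt.Sprintf("%s-%d-%d.jfr", jfrPrefix, alloyPid, targetPid)
}

// isStale reports whether name looks like a JFR file written by some
// Alloy instance other than the one that owns current.
func isStale(name, current string) bool {
	return prefixedJfrPattern.MatchString(name) && name != current
}
```

With this naming, a file such as `asprof-1-20.jfr` written by some other tool would no longer match the clean-up pattern.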

)

var jfrFileNamePattern = regexp.MustCompile("^asprof-\\d+-\\d+\\.jfr$")

type profilingLoop struct {
logger log.Logger
@@ -45,14 +52,15 @@ type profilingLoop struct {
func newProfilingLoop(pid int, target discovery.Target, logger log.Logger, profiler *asprof.Profiler, output *pyroscope.Fanout, cfg ProfilingConfig) *profilingLoop {
	ctx, cancel := context.WithCancel(context.Background())
	dist, err := profiler.DistributionForProcess(pid)
	jfrFileName := fmt.Sprintf("asprof-%d-%d.jfr", os.Getpid(), pid)
	p := &profilingLoop{
		logger:   log.With(logger, "pid", pid),
		output:   output,
		pid:      pid,
		target:   target,
		cancel:   cancel,
		dist:     dist,
-		jfrFile:  fmt.Sprintf("/tmp/asprof-%d-%d.jfr", os.Getpid(), pid),
		jfrFile:  filepath.Join(processJfrDir, jfrFileName),
		cfg:      cfg,
		profiler: profiler,
	}
@@ -63,6 +71,16 @@ func newProfilingLoop(pid int, target discovery.Target, logger log.Logger, profi
return p
}

	p.wg.Add(1)
	go func() {
Contributor

Should this happen in a goroutine? I would expect that we want cleanup to complete before the regular loop begins, so starting this and allowing the internal scheduler to decide what goes first doesn't seem right.

Author

Hi @dehaansa, I was thinking the clean-up can safely run in parallel with the loop as a small optimization. It has a check to make sure the JFR file eventually created by the loop isn't accidentally deleted during clean-up, since that file always has the same name for the current process running Alloy.

Let me know if you think it's still worth changing though!

		defer p.wg.Done()
		// Clean-up files that weren't removed by a previous instance of alloy
		err := p.cleanupOldJFRFiles(jfrFileName)
		if err != nil {
			_ = level.Warn(p.logger).Log("msg", "failed cleaning-up java jfr files created by a previous instance of alloy", "err", err)
		}
	}()

	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
@@ -275,3 +293,22 @@ func (p *profilingLoop) alive() bool {
}
return err == nil && exists
}

func (p *profilingLoop) cleanupOldJFRFiles(myFileName string) error {
	dir := asprof.ProcessPath(processJfrDir, p.pid)
	files, err := os.ReadDir(dir)
	if err != nil {
		return err
	}

	for _, file := range files {
		if !file.IsDir() && jfrFileNamePattern.MatchString(file.Name()) && file.Name() != myFileName {
			_ = level.Debug(p.logger).Log("msg", "deleting jfr file created by previous alloy process", "file", file.Name())
			err := os.Remove(filepath.Join(dir, file.Name()))
			if err != nil {
				return err
			}
		}
	}
	return nil
}