Skip to content

common: fix unclosed logrus logging pipes #4728

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lzap
Copy link
Contributor

@lzap lzap commented May 22, 2025

It looks like both CloudAPI and WeldrAPI consume memory, process can go up to several gigabytes pretty quickly just by running a simple script. This was reported few days ago: #4726

while sleep 0.001; do
    curl --unix-socket /run/weldr/api.socket -XGET http://localhost/api/
    curl --unix-socket /run/cloudapi/api.socket -XGET http://localhost/api/
done

I was able to quickly reproduce on Fedora 42, extracting core with gcore PID and running dlv core /usr/libexec/osbuild-composer/osbuild-composer core.PID revealed that there are tens of thousands of blocked goroutines all with the following stack:

(dlv) goroutine 70207
Switched from 67162 to 70207 (thread 885812)

(dlv) stack
0  0x00005607664f026e in runtime.gopark
   at /usr/lib/golang/src/runtime/proc.go:436
1  0x00005607664cf2b7 in runtime.selectgo
   at /usr/lib/golang/src/runtime/select.go:351
2  0x000056076653ec45 in io.(*pipe).read
   at /usr/lib/golang/src/io/pipe.go:57
3  0x000056076653f35a in io.(*PipeReader).Read
   at /usr/lib/golang/src/io/pipe.go:134
4  0x0000560766712c1e in bufio.(*Scanner).Scan
   at /usr/lib/golang/src/bufio/scan.go:219
5  0x00005607669576fd in github.com/sirupsen/logrus.(*Entry).writerScanner
   at /usr/src/debug/osbuild-composer-142-1.20250522git99f5fec.fc42.x86_64/vendor/github.com/sirupsen/logrus/writer.go:86
6  0x00005607669575a8 in github.com/sirupsen/logrus.(*Entry).WriterLevel.gowrap1
   at /usr/src/debug/osbuild-composer-142-1.20250522git99f5fec.fc42.x86_64/vendor/github.com/sirupsen/logrus/writer.go:57
7  0x00005607664f80a1 in runtime.goexit
   at /usr/lib/golang/src/runtime/asm_amd64.s:1700

There is a logrus logger method called Write and WriteLevel which create a new logging entry, create a PIPE and spawn a goroutine that is reading from that PIPE. The caller is expected to close the PIPE writer which was not happening in both main.go files.

But that was not the culprit of the problem, a new goroutine was created every single HTTP request. I tested both go-systemd library as well as httprouter library but they were scaling well and no problems there. Until I realized that both WeldrAPI and CloudAPI use a common logging proxy EchoLogrusLogger that had a Write method implemented as follows:

func (l *EchoLogrusLogger) Write(p []byte) (n int, err error) {
	return l.Logger.WithContext(l.Ctx).Writer().Write(p)
}

The Writer() call creates a new PIPE and a goroutine but is never closed, this method is used from both WeldrAPI (through the standard log package) as well as CloudAPI (through Echo logger) at least once per request. The obvious fix and the one we can use if we chose to backport this fix is:

func (l *EchoLogrusLogger) Write(p []byte) (n int, err error) {
    w := l.Logger.WithContext(l.Ctx).Writer()
    defer w.Close()
    return w.Write(p)
}

But this PR aims for better solution that prevents from spawning a new goroutine each Write call in the first place. Writer is created only once per logger, that is once for the WeldrAPI since it uses a singleton and once per request for CloudAPI, much better than every single event. Cheers!

@lzap lzap requested review from thozza and a team as code owners May 22, 2025 12:00
@lzap lzap requested review from mvo5 and supakeen and removed request for a team May 22, 2025 12:00
Copy link
Member

@thozza thozza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for figuring this out. There is one (probably a typo or copy&paste) 🐛 in the PR...

}

func (l *EchoLogrusLogger) ctxLogger() *logrus.Entry {
return l.ctxLogger()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have been something like return l.logger.WithContext(l.Ctx)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch yeah I was doing a last minute improvements and I was planning on storing the logger in the instance directly with context, but logrus has such a weird API because WithContext does not return a logger but an event. I will remove this helper method, it does not make the code any more clear.

Copy link
Contributor

@mvo5 mvo5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of a meta-question - AIUI this was an important issue but it looks like there is no test for the fix, is there a way to test for this?

@lzap
Copy link
Contributor Author

lzap commented May 22, 2025

A bit of a meta-question - AIUI this was an important issue but it looks like there is no test for the fix, is there a way to test for this?

Good point, I could add a test that just logs some data. The test would not test the fact we close the PIPE in all places tho. And I could do that if you ask me to, however, logrus is going away, this sprint I have a ticket to finish with logrus>slog refactoring. All logging code will be extracted into the common logging library which has extensive tests through and through.

https://github.com/osbuild/logging

In fact I already started the work.

@lzap
Copy link
Contributor Author

lzap commented May 22, 2025

For the record, I introduced the method in v111 (Spring 2024): 5ce8f65 and few weeks (v111) it was made effective by setting the default logger to this implementation: f3c0dae

Copy link
Contributor

@bcl bcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Copy link
Member

@thozza thozza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one question, otherwise the PR LGTM.

Comment on lines -181 to +209
return l.Logger.WithContext(l.Ctx).Writer().Write(p)
return l.writer.Write(p)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the retionale for this method to not use the context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good observation, Writer interface is only used for the global logger that is used in WeldrAPI. I am going to change all logging in composer to use slog that will properly handle context, this file will actually be deleted completely. Context was not in use in this case (was not set for the global logger).

@mvo5
Copy link
Contributor

mvo5 commented May 26, 2025

A bit of a meta-question - AIUI this was an important issue but it looks like there is no test for the fix, is there a way to test for this?

Good point, I could add a test that just logs some data. The test would not test the fact we close the PIPE in all places tho. And I could do that if you ask me to, however, logrus is going away, this sprint I have a ticket to finish with logrus>slog refactoring. All logging code will be extracted into the common logging library which has extensive tests through and through.
[..]

That is fine, if this code goes away very soon obviously we don't need to do busywork here. Mentioning why no tests is provided/needed in the PR description (and ideally commit message(s)) would be good IMHO. I mean, I think our default assumption is that code changes have tests (especially if they fix visible regressions) so knowing why they are omitted is very useful information.

@lzap
Copy link
Contributor Author

lzap commented May 26, 2025

Expect the first logging refactoring this week, I will follow the CRC case and do it in two waves - first I will use a logrus proxy and remove this dependency completely from the project, next I will refactor all log calls to make use of the cleaner slog API.

@lzap lzap requested a review from mvo5 May 26, 2025 12:26
@lzap
Copy link
Contributor Author

lzap commented May 26, 2025

@mvo5 could you give me green light here so I can start on the refactoring right away to avoid conflicts? thanks.

only rebased GH was down

@lzap lzap removed the request for review from supakeen May 26, 2025 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants