Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MemoryError seen on expire_distros execution #240

Merged
merged 1 commit into from
Nov 14, 2024

Conversation

cbouchar
Copy link
Contributor

When the beaker_expire_distros hourly cron job ran, the MemoryError is seen on the lab controller as well as Server. This same error is seen when executing the bkr CLI 'bkr distro-trees-list --limit=8000 --labcontroller=hostname | grep "ID:" | wc -l'
As it turns out, expire_distros is calling the same underlying get method. The error is sporadic. It is likely due to difficulty attaining a contiguous memory chunk for a lot of data. The solution is to get smaller chunks. Similar filters as the cli distro-trees-list are now allow thru expired_distros.py which passes them to the same get operation. These filters also allows for more variation for beaker_expire_distros. Stepping thru architectures seemed to be the best choice of filters for the chunks. When --arch=all, the expire_distros.py code knows to step thru a list of arch to perform removal instead of trying to do it all at once. So the cron job is updated to include --arch=all.

When the beaker_expire_distros hourly cron job ran, the
MemoryError is seen on the lab controller as well as Server.
This same error is seen when executing the bkr CLI
'bkr distro-trees-list --limit=8000 --labcontroller=hostname |
grep "ID:" | wc -l'
As it turns out, expire_distros is calling the same underlying
get method.  The error is sporadic.  It is likely due to
difficulty attaining a contiguous memory chunk for
a lot of data.  The solution is to get smaller chunks.  Similar
filters as the cli distro-trees-list are now allow thru
expired_distros.py which passes them to the same get operation.
These filters also allows for more variation for beaker_expire_distros.
Stepping thru architectures seemed to be the best choice of filters
for the chunks.  When --arch=all, the expire_distros.py code knows
to step thru a list of arch to perform removal instead of trying
to do it all at once. So the cron job is updated to include --arch=all.
Copy link
Collaborator

@JohnVillalovos JohnVillalovos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Carol. It looks good to me, though I am unfamiliar with this portion of the code base.

except KeyboardInterrupt:
pass
startmsg = str("INFO: expire_distros running with --lab-controller=" + options.lab_controller)
for i in range(1,len(sys.argv)):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea, but for sure not required:

startmsg += ' '.join(sys.argv[1:])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would work for sure, but do we want to report it that way? I mean, we are not using these options from argv, but rather parsed options (L157). I would work with that instead data structure instead.

@StykMartin
Copy link
Contributor

Hi,

I will review this during the day, please do not merge it yet.

Thank you

filter['family'] = options.family

arch_list = []
if options.arch:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean having all is the same as not filtering at all. We should prepare the value before and we can skip this if else completely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about this further and I think there is no point in having an alternative behavior between None arch and all. We are basically doing the same thing, the difference is whether we iterate over the dataset at once or not. I think we should unify this and just go per arch as doing it all at once will always be a problem on larger instances.

arch_list = []
if options.arch:
if options.arch == "all":
arch_list = [ "x86_64", "ppc", "ppc64le", "ppc64", "i386", "s390", "s390x", "aarch64", "ia64", "arm", "armhfp" ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question - this list is quite fragile, isn't there an API that can tell us what kind of architectures we have?

remove_all=False):
remove_all=False,
filter=None):
filter_on_arch = True if (filter is not None and filter and 'arch' in filter.keys()) else False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
filter_on_arch = True if (filter is not None and filter and 'arch' in filter.keys()) else False
filter_on_arch = filter and 'arch' in filter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could streamline this further by converting filter to dict if it is set to None.

@@ -108,21 +110,29 @@ def check_all_trees(ignore_errors=False,
else:
rdistro_trees = distro_trees

print('INFO: expire_distros to remove %d entries for arch %s' % (len(rdistro_trees),
filter['arch'] if (filter_on_arch) else 'unset'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do filter.get('arch', 'unset') and maybe skip the filter_on_arch variable altogether, it is only used for this purpose.

sys.exit(1)

if (len(distro_trees) == 0):
if (filter is None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have changed the business logic here. In this particular case it will report that distributions are missing if the filter is not set, but if we use `all' it will simply not work and even if the controller is empty we will not report anything as we are going arch by arch. My suggestion is to remove the non-filtered way altogether and always run arch by arch and we can write business logic to take that into account.

@cbouchar
Copy link
Contributor Author

I'm doing everything I can to get you my changesets. I really don't have time to redevelop and retest them all and this changeset is working very well in RH. We have cron jobs that drive this code(see below) where I added --arch=all. Since you left I also have added logging so we have to clue that cronjobs have failed. I don't want log files for each arch. The additionally benefit of this changeset is it opens up filters so operations folks can remove distros where they couldn't before.
#!/bin/sh
exec flock -n /var/run/beaker_expire_distros.cron.lock beaker-expire-distros --arch=all > /var/log/beaker-cron/beaker-expire-distros.out 2>&1

Copy link
Contributor

@StykMartin StykMartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against the pull request at all, I think this change is good and necessary.

I understand that with a large beaker instance full of different distros, the query will return a significant amount of distros for analysis and this can/will cause OOM.
This patch addresses that issue as we limit it per architecture, but what I wanted to do is tweak the code a bit to remove some of the edges. However, since you are trying to align your downstream with upstream, we can merge it as is and adjust it in a separate PR in the future.

@StykMartin StykMartin merged commit 7c2c336 into beaker-project:python-3 Nov 14, 2024
20 checks passed
@cbouchar
Copy link
Contributor Author

Martin: My heart is in the right place to get what I have to you. My operations person was ecstatic with this change since he additionally had the ability to manage the distros where he couldn't before. Opening up the filter to beaker-expire-distros makes the command so much more useful! Please go ahead and add comments and perhaps we can merge now and follow-up later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants