MemoryError seen on expire_distros execution #240
Conversation
When the beaker_expire_distros hourly cron job ran, a MemoryError was seen on the lab controller as well as on the server. The same error appears when executing the bkr CLI: `bkr distro-trees-list --limit=8000 --labcontroller=hostname | grep "ID:" | wc -l`. As it turns out, expire_distros calls the same underlying get method. The error is sporadic and is likely due to difficulty obtaining a contiguous memory chunk for such a large amount of data. The solution is to fetch smaller chunks. Filters similar to those of the CLI distro-trees-list are now allowed through expire_distros.py, which passes them to the same get operation. These filters also allow more variation for beaker_expire_distros. Stepping through architectures seemed to be the best choice of filter for the chunks. When --arch=all is given, the expire_distros.py code steps through a list of arches and performs the removal per arch instead of trying to do it all at once, so the cron job is updated to include --arch=all.
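The chunking approach described above can be sketched as follows. This is an illustrative stand-in, not the actual expire_distros.py code: `ALL_TREES`, `fetch_distro_trees`, and `expire` are hypothetical names, with `fetch_distro_trees` playing the role of the underlying get call that previously returned every tree at once.

```python
# Hypothetical sketch of per-architecture chunking; names are illustrative.
ALL_TREES = [{"arch": "x86_64"}, {"arch": "aarch64"}, {"arch": "x86_64"}]
ARCHES = ["x86_64", "ppc64le", "aarch64", "s390x", "i386"]

def fetch_distro_trees(filter):
    # Stand-in for the get call: returns a smaller result set per call
    # when an arch filter is supplied.
    if "arch" in filter:
        return [t for t in ALL_TREES if t["arch"] == filter["arch"]]
    return list(ALL_TREES)

def expire(arch=None):
    # "--arch=all" expands to a list, so each get call fetches one chunk.
    arch_list = ARCHES if arch == "all" else [arch]
    removed = 0
    for a in arch_list:
        trees = fetch_distro_trees({"arch": a} if a else {})
        removed += len(trees)  # stand-in for the actual removal work
    return removed
```

The memory footprint of each call is bounded by the largest single-arch result set rather than by the whole dataset.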
Thanks Carol. It looks good to me, though I am unfamiliar with this portion of the code base.
    except KeyboardInterrupt:
        pass
    startmsg = str("INFO: expire_distros running with --lab-controller=" + options.lab_controller)
    for i in range(1, len(sys.argv)):
Idea, but for sure not required:
startmsg += ' '.join(sys.argv[1:])
This would work for sure, but do we want to report it that way? I mean, we are not using these options from argv, but rather the parsed options (L157). I would work with that data structure instead.
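A minimal sketch of building the start message from optparse's parsed options rather than raw sys.argv, as suggested above. The option names here are illustrative, not the exact expire_distros.py option set:

```python
# Sketch: report parsed options instead of raw argv; names are illustrative.
from optparse import OptionParser

parser = OptionParser()
parser.add_option("--lab-controller", dest="lab_controller")
parser.add_option("--arch", dest="arch")
options, _args = parser.parse_args(
    ["--lab-controller=lab1.example.com", "--arch=all"])

# vars() on the optparse Values object yields the parsed-option dict.
startmsg = "INFO: expire_distros running with " + " ".join(
    "--%s=%s" % (name.replace("_", "-"), value)
    for name, value in sorted(vars(options).items())
    if value is not None)
```

This reports only options optparse actually understood, so typos and unknown flags never leak into the log line.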
Hi, I will review this during the day, please do not merge it yet. Thank you
    filter['family'] = options.family

    arch_list = []
    if options.arch:
I mean, having `all` is the same as not filtering at all. We should prepare the value beforehand, and then we can skip this if/else completely.
I have thought about this further and I think there is no point in having alternative behavior between a `None` arch and `all`. We are basically doing the same thing; the difference is whether we iterate over the dataset all at once or not. I think we should unify this and just go per arch, as doing it all at once will always be a problem on larger instances.
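The unification suggested here could look like the following sketch: treat an unset --arch the same as "all" and always iterate per architecture. `KNOWN_ARCHES` is an illustrative placeholder for however the arch list is actually obtained:

```python
# Sketch of the suggested unification; KNOWN_ARCHES is a placeholder.
KNOWN_ARCHES = ["x86_64", "ppc64le", "s390x", "aarch64", "i386"]

def arches_to_process(arch_option):
    # None and "all" behave identically: one chunk per arch,
    # never the whole dataset at once.
    if arch_option in (None, "all"):
        return list(KNOWN_ARCHES)
    return [arch_option]
```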
    arch_list = []
    if options.arch:
        if options.arch == "all":
            arch_list = [ "x86_64", "ppc", "ppc64le", "ppc64", "i386", "s390", "s390x", "aarch64", "ia64", "arm", "armhfp" ]
Question - this list is quite fragile, isn't there an API that can tell us what kind of architectures we have?
    remove_all=False):
    remove_all=False,
    filter=None):
    filter_on_arch = True if (filter is not None and filter and 'arch' in filter.keys()) else False
    filter_on_arch = True if (filter is not None and filter and 'arch' in filter.keys()) else False
    filter_on_arch = filter and 'arch' in filter
Perhaps we could streamline this further by converting `filter` to a `dict` if it is set to `None`.
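A sketch of that normalization: coerce `filter` to a dict up front so the arch check becomes plain truthiness. The signature is illustrative, not the full check_all_trees signature:

```python
# Sketch: normalize filter to a dict once; signature is illustrative.
def check_all_trees(remove_all=False, filter=None):
    filter = filter or {}  # None (or any falsy value) becomes an empty dict
    filter_on_arch = 'arch' in filter
    return filter_on_arch
```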
@@ -108,21 +110,29 @@ def check_all_trees(ignore_errors=False,
    else:
        rdistro_trees = distro_trees

    print('INFO: expire_distros to remove %d entries for arch %s' % (len(rdistro_trees),
          filter['arch'] if (filter_on_arch) else 'unset'))
We can do `filter.get('arch', 'unset')` and maybe skip the filter_on_arch variable altogether; it is only used for this purpose.
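For illustration, the `filter.get()` form suggested here, with hypothetical sample values rather than output from a real run:

```python
# Sketch of the suggested filter.get() form; sample values are illustrative.
filter = {'arch': 'x86_64'}
rdistro_trees = ['tree-a', 'tree-b']
msg = 'INFO: expire_distros to remove %d entries for arch %s' % (
    len(rdistro_trees), filter.get('arch', 'unset'))
```

With an empty filter, `filter.get('arch', 'unset')` falls back to `'unset'` with no separate boolean needed.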
    sys.exit(1)

    if (len(distro_trees) == 0):
        if (filter is None):
We have changed the business logic here. In this particular case it will report that distributions are missing if the filter is not set, but if we use `all` that check simply will not work: even if the controller is empty we will not report anything, as we are going arch by arch. My suggestion is to remove the non-filtered path altogether, always run arch by arch, and write the business logic to take that into account.
I'm doing everything I can to get you my changesets. I really don't have time to redevelop and retest them all, and this changeset is working very well in RH. We have cron jobs that drive this code (see below), where I added --arch=all. Since you left, I have also added logging so we have a clue when cron jobs have failed. I don't want log files for each arch. The additional benefit of this changeset is that it opens up filters so operations folks can remove distros where they couldn't before.
I'm not against the pull request at all, I think this change is good and necessary.
I understand that with a large beaker instance full of different distros, the query will return a significant amount of distros for analysis and this can/will cause OOM.
This patch addresses that issue as we limit it per architecture, but what I wanted to do is tweak the code a bit to remove some of the edges. However, since you are trying to align your downstream with upstream, we can merge it as is and adjust it in a separate PR in the future.
Martin: My heart is in the right place in getting what I have to you. My operations person was ecstatic with this change, since he additionally gained the ability to manage distros where he couldn't before. Opening up the filter to beaker-expire-distros makes the command so much more useful! Please go ahead and add comments, and perhaps we can merge now and follow up later.