MemoryError seen on expire_distros execution #240
Conversation
When the beaker_expire_distros hourly cron job ran, a MemoryError was seen on the lab controller as well as on the server. The same error appears when executing the bkr CLI: `bkr distro-trees-list --limit=8000 --labcontroller=hostname | grep "ID:" | wc -l`. As it turns out, expire_distros calls the same underlying get method. The error is sporadic and is likely due to difficulty obtaining a contiguous memory chunk for such a large amount of data. The solution is to fetch smaller chunks. Filters similar to those of the CLI distro-trees-list are now allowed through expire_distros.py, which passes them to the same get operation. These filters also allow more variation for beaker_expire_distros. Stepping through architectures seemed to be the best choice of filter for the chunks. When --arch=all is given, the expire_distros.py code steps through a list of arches and performs the removal per arch instead of trying to do it all at once, so the cron job is updated to include --arch=all.
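The chunking approach described above can be sketched as follows. This is an illustrative stand-in, not the actual expire_distros.py code: `ALL_TREES`, `fetch_distro_trees`, and `expire` are hypothetical names, with `fetch_distro_trees` playing the role of the underlying get call that previously returned every tree at once.

```python
# Hypothetical sketch of per-architecture chunking; names are illustrative.
ALL_TREES = [{"arch": "x86_64"}, {"arch": "aarch64"}, {"arch": "x86_64"}]
ARCHES = ["x86_64", "ppc64le", "aarch64", "s390x", "i386"]

def fetch_distro_trees(filter):
    # Stand-in for the get call: returns a smaller result set per call
    # when an arch filter is supplied.
    if "arch" in filter:
        return [t for t in ALL_TREES if t["arch"] == filter["arch"]]
    return list(ALL_TREES)

def expire(arch=None):
    # "--arch=all" expands to a list, so each get call fetches one chunk.
    arch_list = ARCHES if arch == "all" else [arch]
    removed = 0
    for a in arch_list:
        trees = fetch_distro_trees({"arch": a} if a else {})
        removed += len(trees)  # stand-in for the actual removal work
    return removed
```

The memory footprint of each call is bounded by the largest single-arch result set rather than by the whole dataset.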
Thanks Carol. It looks good to me, though I am unfamiliar with this portion of the code base.
    except KeyboardInterrupt:
        pass
    startmsg = str("INFO: expire_distros running with --lab-controller=" + options.lab_controller)
    for i in range(1, len(sys.argv)):
Idea, but for sure not required:
startmsg += ' '.join(sys.argv[1:])
This would work for sure, but do we want to report it that way? I mean, we are not using these options from argv, but rather the parsed options (L157). I would work with that data structure instead.
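A minimal sketch of building the start message from optparse's parsed options rather than raw sys.argv, as suggested above. The option names here are illustrative, not the exact expire_distros.py option set:

```python
# Sketch: report parsed options instead of raw argv; names are illustrative.
from optparse import OptionParser

parser = OptionParser()
parser.add_option("--lab-controller", dest="lab_controller")
parser.add_option("--arch", dest="arch")
options, _args = parser.parse_args(
    ["--lab-controller=lab1.example.com", "--arch=all"])

# vars() on the optparse Values object yields the parsed-option dict.
startmsg = "INFO: expire_distros running with " + " ".join(
    "--%s=%s" % (name.replace("_", "-"), value)
    for name, value in sorted(vars(options).items())
    if value is not None)
```

This reports only options optparse actually understood, so typos and unknown flags never leak into the log line.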
Hi, I will review this during the day, please do not merge it yet. Thank you
    filter['family'] = options.family

    arch_list = []
    if options.arch:
I mean, having `all` is the same as not filtering at all. We should prepare the value beforehand, and then we can skip this if/else completely.
I have thought about this further and I think there is no point in having alternative behavior between a `None` arch and `all`. We are basically doing the same thing; the difference is whether we iterate over the dataset all at once or not. I think we should unify this and just go per arch, as doing it all at once will always be a problem on larger instances.
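The unification suggested here could look like the following sketch: treat an unset --arch the same as "all" and always iterate per architecture. `KNOWN_ARCHES` is an illustrative placeholder for however the arch list is actually obtained:

```python
# Sketch of the suggested unification; KNOWN_ARCHES is a placeholder.
KNOWN_ARCHES = ["x86_64", "ppc64le", "s390x", "aarch64", "i386"]

def arches_to_process(arch_option):
    # None and "all" behave identically: one chunk per arch,
    # never the whole dataset at once.
    if arch_option in (None, "all"):
        return list(KNOWN_ARCHES)
    return [arch_option]
```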
    arch_list = []
    if options.arch:
        if options.arch == "all":
            arch_list = [ "x86_64", "ppc", "ppc64le", "ppc64", "i386", "s390", "s390x", "aarch64", "ia64", "arm", "armhfp" ]
Question - this list is quite fragile, isn't there an API that can tell us what kind of architectures we have?
    remove_all=False):
    remove_all=False,
    filter=None):
    filter_on_arch = True if (filter is not None and filter and 'arch' in filter.keys()) else False
    filter_on_arch = True if (filter is not None and filter and 'arch' in filter.keys()) else False
    filter_on_arch = filter and 'arch' in filter
Perhaps we could streamline this further by converting `filter` to a `dict` if it is set to `None`.
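A sketch of that normalization: coerce `filter` to a dict up front so the arch check becomes plain truthiness. The signature is illustrative, not the full check_all_trees signature:

```python
# Sketch: normalize filter to a dict once; signature is illustrative.
def check_all_trees(remove_all=False, filter=None):
    filter = filter or {}  # None (or any falsy value) becomes an empty dict
    filter_on_arch = 'arch' in filter
    return filter_on_arch
```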
@@ -108,21 +110,29 @@ def check_all_trees(ignore_errors=False,
    else:
        rdistro_trees = distro_trees

    print('INFO: expire_distros to remove %d entries for arch %s' % (len(rdistro_trees),
          filter['arch'] if (filter_on_arch) else 'unset'))
We can do `filter.get('arch', 'unset')` and maybe skip the filter_on_arch variable altogether; it is only used for this purpose.
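For illustration, the `filter.get()` form suggested here, with hypothetical sample values rather than output from a real run:

```python
# Sketch of the suggested filter.get() form; sample values are illustrative.
filter = {'arch': 'x86_64'}
rdistro_trees = ['tree-a', 'tree-b']
msg = 'INFO: expire_distros to remove %d entries for arch %s' % (
    len(rdistro_trees), filter.get('arch', 'unset'))
```

With an empty filter, `filter.get('arch', 'unset')` falls back to `'unset'` with no separate boolean needed.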
    sys.exit(1)

    if (len(distro_trees) == 0):
        if (filter is None):
We have changed the business logic here. In this particular case it will report that distributions are missing if the filter is not set, but if we use `all` that check simply will not work: even if the controller is empty we will not report anything, as we are going arch by arch. My suggestion is to remove the non-filtered path altogether, always run arch by arch, and write the business logic to take that into account.
I'm doing everything I can to get you my changesets. I really don't have time to redevelop and retest them all, and this changeset is working very well in RH. We have cron jobs that drive this code (see below), where I added --arch=all. Since you left, I have also added logging so we have a clue when cron jobs have failed. I don't want log files for each arch. The additional benefit of this changeset is that it opens up filters so operations folks can remove distros where they couldn't before.
I'm not against the pull request at all, I think this change is good and necessary.
I understand that with a large beaker instance full of different distros, the query will return a significant amount of distros for analysis and this can/will cause OOM.
This patch addresses that issue as we limit it per architecture, but what I wanted to do is tweak the code a bit to remove some of the edges. However, since you are trying to align your downstream with upstream, we can merge it as is and adjust it in a separate PR in the future.
Martin: My heart is in the right place in getting what I have to you. My operations person was ecstatic with this change, since he additionally gained the ability to manage distros where he couldn't before. Opening up the filter to beaker-expire-distros makes the command so much more useful! Please go ahead and add comments, and perhaps we can merge now and follow up later.