Merge branch 'main' into chaos_summary

krkn-chaos · Dec 11, 2024 · 9821565
2 parents b172a3c + 0c30d89
commit 9821565

Showing 16 changed files with 344 additions and 137 deletions.
diff --git a/ROADMAP.md b/ROADMAP.md
@@ -6,10 +6,11 @@ Following are a list of enhancements that we are planning to work on adding supp
 - [x] [Centralized storage for chaos experiments artifacts](https://github.com/krkn-chaos/krkn/issues/423)
 - [ ] [Support for causing DNS outages](https://github.com/krkn-chaos/krkn/issues/394)
 - [x] [Chaos recommender](https://github.com/krkn-chaos/krkn/tree/main/utils/chaos-recommender) to suggest scenarios having probability of impacting the service under test using profiling results 
-- [ ] Chaos AI integration to improve and automate test coverage
+- [ ] Chaos AI integration to improve test coverage while reducing fault space to save costs and execution time
 - [x] [Support for pod level network traffic shaping](https://github.com/krkn-chaos/krkn/issues/393)
 - [ ] [Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch](https://github.com/krkn-chaos/krkn/issues/124)
-- [ ] Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/krkn-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
-- [ ] Continue to improve [Chaos Testing Guide](https://krkn-chaos.github.io/krkn) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well the applications running on top it, are resilient and performant under chaotic conditions.
-- [ ] [Switch documentation references to Kubernetes](https://github.com/krkn-chaos/krkn/issues/495)
-- [ ] [OCP and Kubernetes functionalities segregation](https://github.com/krkn-chaos/krkn/issues/497)
+- [x] Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/krkn-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
+- [x] Continue to improve [Chaos Testing Guide](https://krkn-chaos.github.io/krkn) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well the applications running on top it, are resilient and performant under chaotic conditions.
+- [x] [Switch documentation references to Kubernetes](https://github.com/krkn-chaos/krkn/issues/495)
+- [x] [OCP and Kubernetes functionalities segregation](https://github.com/krkn-chaos/krkn/issues/497)
+- [x] [Krknctl - client for running Krkn scenarios with ease](https://github.com/krkn-chaos/krknctl)
diff --git a/docs/cluster_shut_down_scenarios.md b/docs/cluster_shut_down_scenarios.md
@@ -8,6 +8,7 @@ Current accepted cloud types:
 * [GCP](cloud_setup.md#gcp)
 * [AWS](cloud_setup.md#aws)
 * [Openstack](cloud_setup.md#openstack)
+* [IBMCloud](cloud_setup.md#ibmcloud)
 
 
 ```

diff --git a/docs/network_chaos.md b/docs/network_chaos.md
@@ -18,7 +18,7 @@ network_chaos:                                    # Scenario to create an outage
 ```
 
 ##### Sample scenario config for ingress traffic shaping (using a plugin)
-'''
+```
 - id: network_chaos
   config:
     node_interface_name:                            # Dictionary with key as node name(s) and value as a list of its interfaces to test
@@ -35,7 +35,7 @@ network_chaos:                                    # Scenario to create an outage
         bandwidth: 10mbit
     wait_duration: 120
     test_duration: 60
-  '''
+```
 
   Note: For ingress traffic shaping, ensure that your node doesn't have any [IFB](https://wiki.linuxfoundation.org/networking/ifb) interfaces already present. The scenario relies on creating IFBs to do the shaping, and they are deleted at the end of the scenario.
 

diff --git a/docs/node_scenarios.md b/docs/node_scenarios.md
@@ -4,14 +4,15 @@ The following node chaos scenarios are supported:
 
 1. **node_start_scenario**: Scenario to start the node instance.
 2. **node_stop_scenario**: Scenario to stop the node instance.
-3. **node_stop_start_scenario**: Scenario to stop and then start the node instance. Not supported on VMware.
+3. **node_stop_start_scenario**: Scenario to stop the node instance for specified duration and then start the node instance. Not supported on VMware.
 4. **node_termination_scenario**: Scenario to terminate the node instance.
 5. **node_reboot_scenario**: Scenario to reboot the node instance.
 6. **stop_kubelet_scenario**: Scenario to stop the kubelet of the node instance.
 7. **stop_start_kubelet_scenario**: Scenario to stop and start the kubelet of the node instance.
 8. **restart_kubelet_scenario**: Scenario to restart the kubelet of the node instance.
 9. **node_crash_scenario**: Scenario to crash the node instance.
 10. **stop_start_helper_node_scenario**: Scenario to stop and start the helper node and check service status.
+11. **node_disk_detach_attach_scenario**: Scenario to detach the node's non-root disks for a specified duration and then reattach them.
 
 
 **NOTE**: If the node does not recover from the node_crash_scenario injection, reboot the node to get it back to Ready state.
@@ -20,6 +21,8 @@ The following node chaos scenarios are supported:
 , node_reboot_scenario and stop_start_kubelet_scenario are supported on AWS, Azure, OpenStack, BareMetal, GCP
 , VMware and Alibaba.
 
+**NOTE**: node_disk_detach_attach_scenario is supported only on AWS and cannot detach the root disk.
+
 
 #### AWS
 
@@ -57,6 +60,8 @@ kind was primarily designed for testing Kubernetes itself, but may be used for l
 #### GCP
 Cloud setup instructions can be found [here](cloud_setup.md#gcp). Sample scenario config can be found [here](https://github.com/krkn-chaos/krkn/blob/main/scenarios/openshift/gcp_node_scenarios.yml).
 
+NOTE: The parallel option is not available for GCP because the API does not perform operations concurrently.
+
 
 #### Openstack
 

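The `node_disk_detach_attach_scenario` documented above (detach the non-root disks, wait for a duration, then reattach) can be sketched independently of any cloud SDK. The following is a minimal illustration only: `FakeCloud` and `disk_detach_attach` are hypothetical names, not part of krkn, and the in-memory dictionary stands in for real AWS calls.

```python
import logging
import time


class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud client (not the krkn AWS class)."""

    def __init__(self, root_device, attachments):
        self.root_device = root_device
        # Maps volume ID -> attachment record, using the same keys as the diff.
        self.attached = {a["VolumeId"]: a for a in attachments}

    def non_root_attachments(self):
        # The root disk is excluded, matching the documented limitation.
        return [a for a in self.attached.values() if a["Device"] != self.root_device]

    def detach(self, volume_id):
        self.attached.pop(volume_id)

    def attach(self, attachment):
        if attachment["VolumeId"] in self.attached:
            logging.info("Volume %s is already in use.", attachment["VolumeId"])
            return
        self.attached[attachment["VolumeId"]] = attachment


def disk_detach_attach(cloud, duration, sleep=time.sleep):
    """Detach every non-root volume, wait, then reattach from the saved details."""
    # Save the attachment details first: they are needed to reattach later.
    saved = cloud.non_root_attachments()
    if not saved:
        logging.error("Node has only the root disk attached")
        return False
    for attachment in saved:
        cloud.detach(attachment["VolumeId"])
    sleep(duration)
    for attachment in saved:
        cloud.attach(attachment)
    return True


if __name__ == "__main__":
    cloud = FakeCloud(
        "/dev/xvda",
        [
            {"VolumeId": "vol-root", "Device": "/dev/xvda", "InstanceId": "i-1"},
            {"VolumeId": "vol-data", "Device": "/dev/xvdb", "InstanceId": "i-1"},
        ],
    )
    print(disk_detach_attach(cloud, duration=0))
```

Capturing the attachment details before detaching is the important design point: once a volume is detached, its device name and instance ID are no longer available from the instance itself.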
diff --git a/docs/zone_outage.md b/docs/zone_outage.md
@@ -13,10 +13,12 @@ zone_outage:                                         # Scenario to create an out
   duration: 600                                      # Duration in seconds after which the zone will be back online.
   vpc_id:                                            # Cluster virtual private network to target.
   subnet_id: [subnet1, subnet2]                      # List of subnet-id's to deny both ingress and egress traffic.
+  default_acl_id: acl-xxxxxxxx                       # (Optional) ID of an existing network ACL to use instead of creating a new one. If provided, this ACL will not be deleted after the scenario.
 ```
 
 **NOTE**: vpc_id and subnet_id can be obtained from the cloud web console by selecting one of the instances in the targeted zone ( us-west-2a for example ).
 **NOTE**: Multiple zones will experience downtime in case of targeting multiple subnets which might have an impact on the cluster health especially if the zones have control plane components deployed.
+**NOTE**: default_acl_id can be obtained from the AWS VPC Console by selecting "Network ACLs" from the left sidebar ( the ID will be in the format 'acl-xxxxxxxx' ). Make sure the selected ACL has the desired ingress/egress rules for your outage scenario ( i.e., deny all ).
 
 ##### Debugging steps in case of failures
 In case of failures during the steps which revert back the network acl to allow traffic and bring back the cluster nodes in the zone, the nodes in the particular zone will be in `NotReady` condition. Here is how to fix it:

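The `default_acl_id` behaviour described in the zone outage docs (reuse an existing ACL and leave it in place afterwards, otherwise create one and clean it up) comes down to a small decision. The sketch below is an assumption about how such a choice could be expressed; `choose_acl` and the `create_acl` callback are hypothetical names, not krkn functions.

```python
def choose_acl(scenario_config, create_acl):
    """Return (acl_id, delete_after) for a zone outage run.

    scenario_config: dict with the keys from the sample YAML.
    create_acl: hypothetical callable that creates a deny-all ACL for a VPC
                and returns its ID.
    """
    default_acl_id = scenario_config.get("default_acl_id")
    if default_acl_id:
        # A user-supplied ACL is reused and must not be deleted afterwards.
        return default_acl_id, False
    # Otherwise a temporary ACL is created and should be removed at the end.
    return create_acl(scenario_config["vpc_id"]), True


if __name__ == "__main__":
    config = {"vpc_id": "vpc-123", "default_acl_id": "acl-xxxxxxxx"}
    print(choose_acl(config, create_acl=lambda vpc_id: "acl-temp"))
```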
diff --git a/krkn/scenario_plugins/native/node_scenarios/ibmcloud_plugin.py b/krkn/scenario_plugins/native/node_scenarios/ibmcloud_plugin.py
@@ -34,7 +34,16 @@ def __init__(self):
             self.service.set_service_url(service_url)
         except Exception as e:
             logging.error("error authenticating" + str(e))
-            sys.exit(1)
+
+
+    # Get the instance ID of the node
+    def get_instance_id(self, node_name):
+        node_list = self.list_instances()
+        for node in node_list:
+            if node_name == node["vpc_name"]:
+                return node["vpc_id"]
+        logging.error("Couldn't find node with name " + str(node_name) + ", you could try another region")
+        sys.exit(1)
 
     def delete_instance(self, instance_id):
         """

diff --git a/krkn/scenario_plugins/network_chaos/network_chaos_scenario_plugin.py b/krkn/scenario_plugins/network_chaos/network_chaos_scenario_plugin.py
@@ -42,19 +42,13 @@ def run(
                 test_egress = get_yaml_item_value(
                     test_dict, "egress", {"bandwidth": "100mbit"}
                 )
+
                 if test_node:
                     node_name_list = test_node.split(",")
+                    nodelst = common_node_functions.get_node_by_name(node_name_list, lib_telemetry.get_lib_kubernetes())
                 else:
-                    node_name_list = [test_node]
-                nodelst = []
-                for single_node_name in node_name_list:
-                    nodelst.extend(
-                        common_node_functions.get_node(
-                            single_node_name,
-                            test_node_label,
-                            test_instance_count,
-                            lib_telemetry.get_lib_kubernetes(),
-                        )
+                    nodelst = common_node_functions.get_node(
+                        test_node_label, test_instance_count, lib_telemetry.get_lib_kubernetes()
                     )
                 file_loader = FileSystemLoader(
                     os.path.abspath(os.path.dirname(__file__))
@@ -149,7 +143,10 @@ def run(
                 finally:
                     logging.info("Deleting jobs")
                     self.delete_job(joblst[:], lib_telemetry.get_lib_kubernetes())
-        except (RuntimeError, Exception):
+        except (RuntimeError, Exception) as e:
+            logging.error(
+                "NetworkChaosScenarioPlugin exiting due to Exception %s" % e
+            )
             scenario_telemetry.exit_status = 1
             return 1
         else:

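The node selection change in this plugin (explicit comma-separated node names take priority; otherwise nodes are gathered from one or more comma-separated label selectors) can be illustrated with a self-contained sketch. `FakeKube` and `select_nodes` are hypothetical names; only `list_killable_nodes` mirrors a call that appears in the diff. Two points are assumptions: the sketch raises when a named node is missing, whereas `get_node_by_name` in the diff logs and returns `None`, and it picks a random sample when fewer nodes are requested than matched, since that part of `get_node` is outside the hunk.

```python
import random


class FakeKube:
    """Hypothetical stand-in for the Kubernetes client used in the diff."""

    def __init__(self, nodes_by_label):
        self.nodes_by_label = nodes_by_label

    def list_killable_nodes(self, label_selector=None):
        if label_selector is None:
            return [n for nodes in self.nodes_by_label.values() for n in nodes]
        return list(self.nodes_by_label.get(label_selector, []))


def select_nodes(node_names, label_selector, count, kube):
    """Pick target nodes by explicit name if given, otherwise by label selector."""
    if node_names:
        requested = node_names.split(",")
        ready = set(kube.list_killable_nodes())
        missing = [n for n in requested if n not in ready]
        if missing:
            raise ValueError(f"Nodes not found or not Ready: {missing}")
        return requested
    nodes = []
    for selector in label_selector.split(","):
        nodes.extend(kube.list_killable_nodes(selector))
    if not nodes:
        raise ValueError("Ready nodes with the provided label selector do not exist")
    # Assumption: when fewer nodes are requested than matched, choose at random.
    return nodes if count >= len(nodes) else random.sample(nodes, count)


if __name__ == "__main__":
    kube = FakeKube({"role=worker": ["w1", "w2"], "role=infra": ["i1"]})
    print(select_nodes("w1,i1", "", 1, kube))
    print(select_nodes("", "role=worker,role=infra", 3, kube))
```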
diff --git a/krkn/scenario_plugins/node_actions/abstract_node_scenarios.py b/krkn/scenario_plugins/node_actions/abstract_node_scenarios.py
@@ -36,6 +36,20 @@ def helper_node_stop_start_scenario(self, instance_kill_count, node, timeout):
         self.helper_node_start_scenario(instance_kill_count, node, timeout)
         logging.info("helper_node_stop_start_scenario has been successfully injected!")
 
+    # Node scenario to detach and attach the disk
+    def node_disk_detach_attach_scenario(self, instance_kill_count, node, timeout, duration):
+        logging.info("Starting disk_detach_attach_scenario injection")
+        disk_attachment_details = self.get_disk_attachment_info(instance_kill_count, node)
+        if disk_attachment_details:
+            self.disk_detach_scenario(instance_kill_count, node, timeout)
+            logging.info("Waiting for %s seconds before attaching the disk" % (duration))
+            time.sleep(duration)
+            self.disk_attach_scenario(instance_kill_count, disk_attachment_details, timeout)
+            logging.info("node_disk_detach_attach_scenario has been successfully injected!")
+        else:
+            logging.error("Node %s has only root disk attached" % (node))
+            logging.error("node_disk_detach_attach_scenario failed!")
+
     # Node scenario to terminate the node
     def node_termination_scenario(self, instance_kill_count, node, timeout):
         pass

diff --git a/krkn/scenario_plugins/node_actions/aws_node_scenarios.py b/krkn/scenario_plugins/node_actions/aws_node_scenarios.py
@@ -12,7 +12,8 @@
 class AWS:
     def __init__(self):
         self.boto_client = boto3.client("ec2")
-        self.boto_instance = boto3.resource("ec2").Instance("id")
+        self.boto_resource = boto3.resource("ec2")
+        self.boto_instance = self.boto_resource.Instance("id")
 
     # Get the instance ID of the node
     def get_instance_id(self, node):
@@ -179,6 +180,72 @@ def delete_network_acl(self, acl_id):
 
             raise RuntimeError()
 
+    # Detach volume
+    def detach_volumes(self, volumes_ids: list):
+        for volume in volumes_ids:
+            try:
+                self.boto_client.detach_volume(VolumeId=volume, Force=True)
+            except Exception as e:
+                logging.error(
+                    "Detaching volume %s failed with exception: %s"
+                    % (volume, e)
+                )
+
+    # Attach volume
+    def attach_volume(self, attachment: dict):
+        try:
+            if self.get_volume_state(attachment["VolumeId"]) == "in-use":
+                logging.info(
+                    "Volume %s is already in use." % attachment["VolumeId"]
+                )
+                return
+            logging.info(
+                "Attaching the %s volume to instance %s."
+                % (attachment["VolumeId"], attachment["InstanceId"])
+            )
+            self.boto_client.attach_volume(
+                InstanceId=attachment["InstanceId"],
+                Device=attachment["Device"],
+                VolumeId=attachment["VolumeId"]
+            )
+        except Exception as e:
+            logging.error(
+                "Failed attaching disk %s to the %s instance. "
+                "Encountered following exception: %s"
+                % (attachment['VolumeId'], attachment['InstanceId'], e)
+            )
+            raise RuntimeError()
+
+    # Get IDs of node volumes
+    def get_volumes_ids(self, instance_id: list):
+        response = self.boto_client.describe_instances(InstanceIds=instance_id)
+        instance_attachment_details = response["Reservations"][0]["Instances"][0]["BlockDeviceMappings"]
+        root_volume_device_name = self.get_root_volume_id(instance_id)
+        volume_ids = []
+        for device in instance_attachment_details:
+            if device["DeviceName"] != root_volume_device_name:
+                volume_id = device["Ebs"]["VolumeId"]
+                volume_ids.append(volume_id)
+        return volume_ids
+
+    # Get volumes attachment details
+    def get_volume_attachment_details(self, volume_ids: list):
+        response = self.boto_client.describe_volumes(VolumeIds=volume_ids)
+        volumes_details = response["Volumes"]
+        return volumes_details
+
+    # Get root volume
+    def get_root_volume_id(self, instance_id):
+        instance_id = instance_id[0]
+        instance = self.boto_resource.Instance(instance_id)
+        root_volume_id = instance.root_device_name
+        return root_volume_id
+
+    # Get volume state
+    def get_volume_state(self, volume_id: str):
+        volume = self.boto_resource.Volume(volume_id)
+        state = volume.state
+        return state
 
 # krkn_lib
 class aws_node_scenarios(abstract_node_scenarios):
@@ -290,3 +357,49 @@ def node_reboot_scenario(self, instance_kill_count, node, timeout):
                 logging.error("node_reboot_scenario injection failed!")
 
                 raise RuntimeError()
+
+    # Get volume attachment info
+    def get_disk_attachment_info(self, instance_kill_count, node):
+        for _ in range(instance_kill_count):
+            try:
+                logging.info("Obtaining disk attachment information")
+                instance_id = (self.aws.get_instance_id(node)).split()
+                volumes_ids = self.aws.get_volumes_ids(instance_id)
+                if volumes_ids:
+                    vol_attachment_details = self.aws.get_volume_attachment_details(
+                        volumes_ids
+                    )
+                    return vol_attachment_details
+                return
+            except Exception as e:
+                logging.error(
+                    "Failed to obtain disk attachment information of %s node. "
+                    "Encountered following exception: %s." % (node, e)
+                )
+                raise RuntimeError()
+
+    # Node scenario to detach the volume
+    def disk_detach_scenario(self, instance_kill_count, node, timeout):
+        for _ in range(instance_kill_count):
+            try:
+                logging.info("Starting disk_detach_scenario injection")
+                instance_id = (self.aws.get_instance_id(node)).split()
+                volumes_ids = self.aws.get_volumes_ids(instance_id)
+                logging.info(
+                    "Detaching the %s volumes from instance %s "
+                    % (volumes_ids, node)
+                )
+                self.aws.detach_volumes(volumes_ids)
+            except Exception as e:
+                logging.error(
+                    "Failed to detach disk from %s node. Encountered following "
+                    "exception: %s." % (node, e)
+                )
+                logging.debug("")
+                raise RuntimeError()
+
+    # Node scenario to attach the volume
+    def disk_attach_scenario(self, instance_kill_count, attachment_details, timeout):
+        for _ in range(instance_kill_count):
+            for attachment in attachment_details:
+                self.aws.attach_volume(attachment["Attachments"][0])
diff --git a/krkn/scenario_plugins/node_actions/common_node_functions.py b/krkn/scenario_plugins/node_actions/common_node_functions.py
@@ -8,19 +8,28 @@
 node_general = False
 
 
+def get_node_by_name(node_name_list, kubecli: KrknKubernetes):
+    killable_nodes = kubecli.list_killable_nodes()
+    for node_name in node_name_list:
+        if node_name not in killable_nodes:
+            logging.info(
+                f"Node with provided name {node_name} does not exist or the node might "
+                "be in NotReady state."
+            )
+            return
+    return node_name_list
+
+
 # Pick a random node with specified label selector
-def get_node(node_name, label_selector, instance_kill_count, kubecli: KrknKubernetes):
-    if node_name in kubecli.list_killable_nodes():
-        return [node_name]
-    elif node_name:
-        logging.info(
-            "Node with provided node_name does not exist or the node might "
-            "be in NotReady state."
-        )
-    nodes = kubecli.list_killable_nodes(label_selector)
+def get_node(label_selector, instance_kill_count, kubecli: KrknKubernetes):
+
+    label_selector_list  = label_selector.split(",")
+    nodes = []
+    for label_selector in label_selector_list: 
+        nodes.extend(kubecli.list_killable_nodes(label_selector))
     if not nodes:
         raise Exception("Ready nodes with the provided label selector do not exist")
-    logging.info("Ready nodes with the label selector %s: %s" % (label_selector, nodes))
+    logging.info("Ready nodes with the label selector %s: %s" % (label_selector_list, nodes))
     number_of_nodes = len(nodes)
     if instance_kill_count == number_of_nodes:
         return nodes
@@ -35,22 +44,19 @@ def get_node(node_name, label_selector, instance_kill_count, kubecli: KrknKubern
 # krkn_lib
 # Wait until the node status becomes Ready
 def wait_for_ready_status(node, timeout, kubecli: KrknKubernetes):
-    resource_version = kubecli.get_node_resource_version(node)
-    kubecli.watch_node_status(node, "True", timeout, resource_version)
+    kubecli.watch_node_status(node, "True", timeout)
 
 
 # krkn_lib
 # Wait until the node status becomes Not Ready
 def wait_for_not_ready_status(node, timeout, kubecli: KrknKubernetes):
-    resource_version = kubecli.get_node_resource_version(node)
-    kubecli.watch_node_status(node, "False", timeout, resource_version)
+    kubecli.watch_node_status(node, "False", timeout)
 
 
 # krkn_lib
 # Wait until the node status becomes Unknown
 def wait_for_unknown_status(node, timeout, kubecli: KrknKubernetes):
-    resource_version = kubecli.get_node_resource_version(node)
-    kubecli.watch_node_status(node, "Unknown", timeout, resource_version)
+    kubecli.watch_node_status(node, "Unknown", timeout)
 
 
 # Get the ip of the cluster node