Skip to content

Commit

Permalink
Merge branch 'main' into chaos_summary
Browse files Browse the repository at this point in the history
  • Loading branch information
shahsahil264 authored Dec 11, 2024
2 parents b172a3c + 0c30d89 commit 9821565
Show file tree
Hide file tree
Showing 16 changed files with 344 additions and 137 deletions.
11 changes: 6 additions & 5 deletions ROADMAP.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,10 +6,11 @@ Following are a list of enhancements that we are planning to work on adding supp
- [x] [Centralized storage for chaos experiments artifacts](https://github.com/krkn-chaos/krkn/issues/423)
- [ ] [Support for causing DNS outages](https://github.com/krkn-chaos/krkn/issues/394)
- [x] [Chaos recommender](https://github.com/krkn-chaos/krkn/tree/main/utils/chaos-recommender) to suggest scenarios having probability of impacting the service under test using profiling results
- [ ] Chaos AI integration to improve and automate test coverage
- [] Chaos AI integration to improve test coverage while reducing fault space to save costs and execution time
- [x] [Support for pod level network traffic shaping](https://github.com/krkn-chaos/krkn/issues/393)
- [ ] [Ability to visualize the metrics that are being captured by Kraken and stored in Elasticsearch](https://github.com/krkn-chaos/krkn/issues/124)
- [ ] Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/krkn-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- [ ] Continue to improve [Chaos Testing Guide](https://krkn-chaos.github.io/krkn) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well the applications running on top it, are resilient and performant under chaotic conditions.
- [ ] [Switch documentation references to Kubernetes](https://github.com/krkn-chaos/krkn/issues/495)
- [ ] [OCP and Kubernetes functionalities segregation](https://github.com/krkn-chaos/krkn/issues/497)
- [x] Support for running all the scenarios of Kraken on Kubernetes distribution - see https://github.com/krkn-chaos/krkn/issues/185, https://github.com/redhat-chaos/krkn/issues/186
- [x] Continue to improve [Chaos Testing Guide](https://krkn-chaos.github.io/krkn) in terms of adding best practices, test environment recommendations and scenarios to make sure the OpenShift platform, as well the applications running on top it, are resilient and performant under chaotic conditions.
- [x] [Switch documentation references to Kubernetes](https://github.com/krkn-chaos/krkn/issues/495)
- [x] [OCP and Kubernetes functionalities segregation](https://github.com/krkn-chaos/krkn/issues/497)
- [x] [Krknctl - client for running Krkn scenarios with ease](https://github.com/krkn-chaos/krknctl)
1 change: 1 addition & 0 deletions docs/cluster_shut_down_scenarios.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ Current accepted cloud types:
* [GCP](cloud_setup.md#gcp)
* [AWS](cloud_setup.md#aws)
* [Openstack](cloud_setup.md#openstack)
* [IBMCloud](cloud_setup.md#ibmcloud)


```
Expand Down
4 changes: 2 additions & 2 deletions docs/network_chaos.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ network_chaos: # Scenario to create an outage
```

##### Sample scenario config for ingress traffic shaping (using a plugin)
'''
```
- id: network_chaos
config:
node_interface_name: # Dictionary with key as node name(s) and value as a list of its interfaces to test
Expand All @@ -35,7 +35,7 @@ network_chaos: # Scenario to create an outage
bandwidth: 10mbit
wait_duration: 120
test_duration: 60
'''
```

Note: For ingress traffic shaping, ensure that your node doesn't have any [IFB](https://wiki.linuxfoundation.org/networking/ifb) interfaces already present. The scenario relies on creating IFBs to do the shaping, and they are deleted at the end of the scenario.

Expand Down
7 changes: 6 additions & 1 deletion docs/node_scenarios.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,14 +4,15 @@ The following node chaos scenarios are supported:

1. **node_start_scenario**: Scenario to stop the node instance.
2. **node_stop_scenario**: Scenario to stop the node instance.
3. **node_stop_start_scenario**: Scenario to stop and then start the node instance. Not supported on VMware.
3. **node_stop_start_scenario**: Scenario to stop the node instance for specified duration and then start the node instance. Not supported on VMware.
4. **node_termination_scenario**: Scenario to terminate the node instance.
5. **node_reboot_scenario**: Scenario to reboot the node instance.
6. **stop_kubelet_scenario**: Scenario to stop the kubelet of the node instance.
7. **stop_start_kubelet_scenario**: Scenario to stop and start the kubelet of the node instance.
8. **restart_kubelet_scenario**: Scenario to restart the kubelet of the node instance.
9. **node_crash_scenario**: Scenario to crash the node instance.
10. **stop_start_helper_node_scenario**: Scenario to stop and start the helper node and check service status.
11. **node_disk_detach_attach_scenario**: Scenario to detach node disk for specified duration.


**NOTE**: If the node does not recover from the node_crash_scenario injection, reboot the node to get it back to Ready state.
Expand All @@ -20,6 +21,8 @@ The following node chaos scenarios are supported:
, node_reboot_scenario and stop_start_kubelet_scenario are supported on AWS, Azure, OpenStack, BareMetal, GCP
, VMware and Alibaba.

**NOTE**: node_disk_detach_attach_scenario is supported only on AWS and cannot detach root disk.


#### AWS

Expand Down Expand Up @@ -57,6 +60,8 @@ kind was primarily designed for testing Kubernetes itself, but may be used for l
#### GCP
Cloud setup instructions can be found [here](cloud_setup.md#gcp). Sample scenario config can be found [here](https://github.com/krkn-chaos/krkn/blob/main/scenarios/openshift/gcp_node_scenarios.yml).

NOTE: The parallel option is not available for GCP, the api doesn't perform processes at the same time


#### Openstack

Expand Down
2 changes: 2 additions & 0 deletions docs/zone_outage.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,10 +13,12 @@ zone_outage: # Scenario to create an out
duration: 600 # Duration in seconds after which the zone will be back online.
vpc_id: # Cluster virtual private network to target.
subnet_id: [subnet1, subnet2] # List of subnet-id's to deny both ingress and egress traffic.
default_acl_id: acl-xxxxxxxx # (Optional) ID of an existing network ACL to use instead of creating a new one. If provided, this ACL will not be deleted after the scenario.
```

**NOTE**: vpc_id and subnet_id can be obtained from the cloud web console by selecting one of the instances in the targeted zone ( us-west-2a for example ).
**NOTE**: Multiple zones will experience downtime in case of targeting multiple subnets which might have an impact on the cluster health especially if the zones have control plane components deployed.
**NOTE**: default_acl_id can be obtained from the AWS VPC Console by selecting "Network ACLs" from the left sidebar ( the ID will be in the format 'acl-xxxxxxxx' ). Make sure the selected ACL has the desired ingress/egress rules for your outage scenario ( i.e., deny all ).

##### Debugging steps in case of failures
In case of failures during the steps which revert back the network acl to allow traffic and bring back the cluster nodes in the zone, the nodes in the particular zone will be in `NotReady` condition. Here is how to fix it:
Expand Down
11 changes: 10 additions & 1 deletion krkn/scenario_plugins/native/node_scenarios/ibmcloud_plugin.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,16 @@ def __init__(self):
self.service.set_service_url(service_url)
except Exception as e:
logging.error("error authenticating" + str(e))
sys.exit(1)


# Get the instance ID of the node
def get_instance_id(self, node_name):
node_list = self.list_instances()
for node in node_list:
if node_name == node["vpc_name"]:
return node["vpc_id"]
logging.error("Couldn't find node with name " + str(node_name) + ", you could try another region")
sys.exit(1)

def delete_instance(self, instance_id):
"""
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -42,19 +42,13 @@ def run(
test_egress = get_yaml_item_value(
test_dict, "egress", {"bandwidth": "100mbit"}
)

if test_node:
node_name_list = test_node.split(",")
nodelst = common_node_functions.get_node_by_name(node_name_list, lib_telemetry.get_lib_kubernetes())
else:
node_name_list = [test_node]
nodelst = []
for single_node_name in node_name_list:
nodelst.extend(
common_node_functions.get_node(
single_node_name,
test_node_label,
test_instance_count,
lib_telemetry.get_lib_kubernetes(),
)
nodelst = common_node_functions.get_node(
test_node_label, test_instance_count, lib_telemetry.get_lib_kubernetes()
)
file_loader = FileSystemLoader(
os.path.abspath(os.path.dirname(__file__))
Expand Down Expand Up @@ -149,7 +143,10 @@ def run(
finally:
logging.info("Deleting jobs")
self.delete_job(joblst[:], lib_telemetry.get_lib_kubernetes())
except (RuntimeError, Exception):
except (RuntimeError, Exception) as e:
logging.error(
"NetworkChaosScenarioPlugin exiting due to Exception %s" % e
)
scenario_telemetry.exit_status = 1
return 1
else:
Expand Down
14 changes: 14 additions & 0 deletions krkn/scenario_plugins/node_actions/abstract_node_scenarios.py
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,20 @@ def helper_node_stop_start_scenario(self, instance_kill_count, node, timeout):
self.helper_node_start_scenario(instance_kill_count, node, timeout)
logging.info("helper_node_stop_start_scenario has been successfully injected!")

# Node scenario to detach and attach the disk
def node_disk_detach_attach_scenario(self, instance_kill_count, node, timeout, duration):
logging.info("Starting disk_detach_attach_scenario injection")
disk_attachment_details = self.get_disk_attachment_info(instance_kill_count, node)
if disk_attachment_details:
self.disk_detach_scenario(instance_kill_count, node, timeout)
logging.info("Waiting for %s seconds before attaching the disk" % (duration))
time.sleep(duration)
self.disk_attach_scenario(instance_kill_count, disk_attachment_details, timeout)
logging.info("node_disk_detach_attach_scenario has been successfully injected!")
else:
logging.error("Node %s has only root disk attached" % (node))
logging.error("node_disk_detach_attach_scenario failed!")

# Node scenario to terminate the node
def node_termination_scenario(self, instance_kill_count, node, timeout):
pass
Expand Down
115 changes: 114 additions & 1 deletion krkn/scenario_plugins/node_actions/aws_node_scenarios.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,8 @@
class AWS:
def __init__(self):
self.boto_client = boto3.client("ec2")
self.boto_instance = boto3.resource("ec2").Instance("id")
self.boto_resource = boto3.resource("ec2")
self.boto_instance = self.boto_resource.Instance("id")

# Get the instance ID of the node
def get_instance_id(self, node):
Expand Down Expand Up @@ -179,6 +180,72 @@ def delete_network_acl(self, acl_id):

raise RuntimeError()

# Detach volume
def detach_volumes(self, volumes_ids: list):
for volume in volumes_ids:
try:
self.boto_client.detach_volume(VolumeId=volume, Force=True)
except Exception as e:
logging.error(
"Detaching volume %s failed with exception: %s"
% (volume, e)
)

# Attach volume
def attach_volume(self, attachment: dict):
try:
if self.get_volume_state(attachment["VolumeId"]) == "in-use":
logging.info(
"Volume %s is already in use." % attachment["VolumeId"]
)
return
logging.info(
"Attaching the %s volumes to instance %s."
% (attachment["VolumeId"], attachment["InstanceId"])
)
self.boto_client.attach_volume(
InstanceId=attachment["InstanceId"],
Device=attachment["Device"],
VolumeId=attachment["VolumeId"]
)
except Exception as e:
logging.error(
"Failed attaching disk %s to the %s instance. "
"Encountered following exception: %s"
% (attachment['VolumeId'], attachment['InstanceId'], e)
)
raise RuntimeError()

# Get IDs of node volumes
def get_volumes_ids(self, instance_id: list):
response = self.boto_client.describe_instances(InstanceIds=instance_id)
instance_attachment_details = response["Reservations"][0]["Instances"][0]["BlockDeviceMappings"]
root_volume_device_name = self.get_root_volume_id(instance_id)
volume_ids = []
for device in instance_attachment_details:
if device["DeviceName"] != root_volume_device_name:
volume_id = device["Ebs"]["VolumeId"]
volume_ids.append(volume_id)
return volume_ids

# Get volumes attachment details
def get_volume_attachment_details(self, volume_ids: list):
response = self.boto_client.describe_volumes(VolumeIds=volume_ids)
volumes_details = response["Volumes"]
return volumes_details

# Get root volume
def get_root_volume_id(self, instance_id):
instance_id = instance_id[0]
instance = self.boto_resource.Instance(instance_id)
root_volume_id = instance.root_device_name
return root_volume_id

# Get volume state
def get_volume_state(self, volume_id: str):
volume = self.boto_resource.Volume(volume_id)
state = volume.state
return state

# krkn_lib
class aws_node_scenarios(abstract_node_scenarios):
Expand Down Expand Up @@ -290,3 +357,49 @@ def node_reboot_scenario(self, instance_kill_count, node, timeout):
logging.error("node_reboot_scenario injection failed!")

raise RuntimeError()

# Get volume attachment info
def get_disk_attachment_info(self, instance_kill_count, node):
for _ in range(instance_kill_count):
try:
logging.info("Obtaining disk attachment information")
instance_id = (self.aws.get_instance_id(node)).split()
volumes_ids = self.aws.get_volumes_ids(instance_id)
if volumes_ids:
vol_attachment_details = self.aws.get_volume_attachment_details(
volumes_ids
)
return vol_attachment_details
return
except Exception as e:
logging.error(
"Failed to obtain disk attachment information of %s node. "
"Encounteres following exception: %s." % (node, e)
)
raise RuntimeError()

# Node scenario to detach the volume
def disk_detach_scenario(self, instance_kill_count, node, timeout):
for _ in range(instance_kill_count):
try:
logging.info("Starting disk_detach_scenario injection")
instance_id = (self.aws.get_instance_id(node)).split()
volumes_ids = self.aws.get_volumes_ids(instance_id)
logging.info(
"Detaching the %s volumes from instance %s "
% (volumes_ids, node)
)
self.aws.detach_volumes(volumes_ids)
except Exception as e:
logging.error(
"Failed to detach disk from %s node. Encountered following"
"exception: %s." % (node, e)
)
logging.debug("")
raise RuntimeError()

# Node scenario to attach the volume
def disk_attach_scenario(self, instance_kill_count, attachment_details, timeout):
for _ in range(instance_kill_count):
for attachment in attachment_details:
self.aws.attach_volume(attachment["Attachments"][0])
38 changes: 22 additions & 16 deletions krkn/scenario_plugins/node_actions/common_node_functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,19 +8,28 @@
node_general = False


def get_node_by_name(node_name_list, kubecli: KrknKubernetes):
killable_nodes = kubecli.list_killable_nodes()
for node_name in node_name_list:
if node_name not in killable_nodes:
logging.info(
f"Node with provided ${node_name} does not exist or the node might "
"be in NotReady state."
)
return
return node_name_list


# Pick a random node with specified label selector
def get_node(node_name, label_selector, instance_kill_count, kubecli: KrknKubernetes):
if node_name in kubecli.list_killable_nodes():
return [node_name]
elif node_name:
logging.info(
"Node with provided node_name does not exist or the node might "
"be in NotReady state."
)
nodes = kubecli.list_killable_nodes(label_selector)
def get_node(label_selector, instance_kill_count, kubecli: KrknKubernetes):

label_selector_list = label_selector.split(",")
nodes = []
for label_selector in label_selector_list:
nodes.extend(kubecli.list_killable_nodes(label_selector))
if not nodes:
raise Exception("Ready nodes with the provided label selector do not exist")
logging.info("Ready nodes with the label selector %s: %s" % (label_selector, nodes))
logging.info("Ready nodes with the label selector %s: %s" % (label_selector_list, nodes))
number_of_nodes = len(nodes)
if instance_kill_count == number_of_nodes:
return nodes
Expand All @@ -35,22 +44,19 @@ def get_node(node_name, label_selector, instance_kill_count, kubecli: KrknKubern
# krkn_lib
# Wait until the node status becomes Ready
def wait_for_ready_status(node, timeout, kubecli: KrknKubernetes):
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "True", timeout, resource_version)
kubecli.watch_node_status(node, "True", timeout)


# krkn_lib
# Wait until the node status becomes Not Ready
def wait_for_not_ready_status(node, timeout, kubecli: KrknKubernetes):
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "False", timeout, resource_version)
kubecli.watch_node_status(node, "False", timeout)


# krkn_lib
# Wait until the node status becomes Unknown
def wait_for_unknown_status(node, timeout, kubecli: KrknKubernetes):
resource_version = kubecli.get_node_resource_version(node)
kubecli.watch_node_status(node, "Unknown", timeout, resource_version)
kubecli.watch_node_status(node, "Unknown", timeout)


# Get the ip of the cluster node
Expand Down
Loading

0 comments on commit 9821565

Please sign in to comment.