Update ADOPTERS.md with Litmus usage details #2191

Open
ksatchit opened this issue Oct 6, 2020 · 32 comments
Labels
project/community Issues raised by community members

Comments

@ksatchit
Member

ksatchit commented Oct 6, 2020

The LitmusChaos community is working to increase adoption of chaos engineering practices in the Kubernetes world and is focused on collaboration with other cloud-native projects. One way of tracking the project's reach is an ADOPTERS list. The purpose of this issue is to collect a list of organizations and individuals who use Litmus to power their chaos engineering practice, and to share their use cases and reasons for choosing Litmus.

Please comment on this issue with details like:

  • Applications/Workloads or Infra that are being subjected to chaos by Litmus
  • Why was Litmus chosen & how it is helping you (a brief description on the usecase)
  • Are you using it as part of devtest, CI/CD, in staging/pre-prod/prod or other
  • If you would like your name (as standalone user) or organization name to be added to the Adopters.md, please provide a preferred contact handle like github id, twitter id, linkedin id, website etc.

This information will be used to create a PR on the ADOPTERS.md file, which you can approve. Alternatively, feel free to create a PR and reference this issue!

@ksatchit ksatchit added the project/community Issues raised by community members label Oct 6, 2020
@divya-mohan0209

  • I am currently using LitmusChaos to demonstrate a POC for Chaos Engineering on Serverless Architecture.
  • I shall be presenting this at the DevFest Siberia 2020.
  • GitHub ID: divya-mohan0209, Twitter Handle: Divya_Mohan02

@barkardk
Contributor

barkardk commented Oct 6, 2020

I am currently using LitmusChaos as part of our QA cycle to verify resiliency and catch bugs. For now it is used only on AWS EKS and EC2 instances; we hope to expand it to Azure soon.
Litmus looked solid, easy to implement and, most of all, easy to customise.
GitHub ID: xkbarkar, NetApp Inc.

@keerthisagar40

keerthisagar40 commented Nov 4, 2020

  • Kubernetes pods hosted on both AWS and Azure.
  • We needed a clean way to introduce anomalies into the system to understand its behaviour; Litmus was the one that was clean and easy to use.
  • Using it as part of the QA cycle.
  • Akridata

@ishantanu

  • Currently working on using Litmus for introducing Chaos in Kubernetes clusters.
  • I was looking for a cloud-native way of introducing Chaos and after going through the details and other options, Litmus was probably the best fit.
  • Usage of Litmus is still in preliminary stages. A limited set of chaos experiments are used for testing resiliency. This will change in the future.
  • GitHub ID - ishantanu

@xunholy
Member

xunholy commented Nov 20, 2020

Applications/Workloads or Infra that are being subjected to chaos by Litmus:

  • Internal workload pods and storage resilience (OpenEBS). This is to test my built-in cluster resilience while running on arm64 architecture, and to build confidence in the design and overall architecture.

Why was Litmus chosen & how it is helping you (a brief description on the usecase):

  • I reviewed several chaos tools and felt that Litmus being associated with CNCF and being an open-source tool aligns with my own personal preference and values. It has a very active community and repository, and there was well-documented information that helped during the initial learning phases.

Are you using it as part of devtest, CI/CD, in staging/pre-prod/prod or other:

  • I'm using it to run my RPi Kubernetes cluster which is my home cluster. This is running my personal production workloads.

If you would like your name (as standalone user) or organization name to be added to the Adopters.md, please provide a preferred contact handle like github id, twitter id, linkedin id, website etc:

@olegch
Contributor

olegch commented Jan 25, 2021

Applications/Workloads or Infra that are being subjected to chaos by Litmus

  • Kublr-provisioned Kubernetes clusters; we apply Litmus chaos load to stress-test the clusters and identify weak spots and components prone to failure when customer applications stress the system.

Why was Litmus chosen & how it is helping you (a brief description on the usecase)

  • Litmus is a well-documented, well-supported open source tool with a great community and development team. It is flexible and allows us to adjust the chaos tests any way we need.

Are you using it as part of devtest, CI/CD, in staging/pre-prod/prod or other

  • This is currently used as part of development testing and ad hoc experiments, although we are working on including Litmus chaos tests in our standard automated QA process.

If you would like your name (as standalone user) or organization name to be added to the Adopters.md, please provide a preferred contact handle like github id, twitter id, linkedin id, website etc.

ajeshbaby added a commit to ajeshbaby/litmus that referenced this issue Jan 26, 2021
ajeshbaby added a commit to ajeshbaby/litmus that referenced this issue Jan 26, 2021
ajeshbaby added a commit that referenced this issue Jan 26, 2021
* Adding Kublr as adopter reference #2191
@imrajdas imrajdas pinned this issue Mar 12, 2021
@niebomin
Contributor

niebomin commented Apr 8, 2021

Please add VMware as adopter. Will add more description later. Use case is Chaos Engineering in CD.

@Jonsy13 Jonsy13 unpinned this issue Apr 21, 2021
@Jonsy13 Jonsy13 pinned this issue Apr 21, 2021
@ajeshbaby ajeshbaby unpinned this issue May 4, 2021
@asibece

asibece commented Jul 8, 2021

Why do we use Litmus.
To ensure resilience, detect bugs and test rollouts. We are still in the early stages.

How do we use Litmus.
Litmus is being used as part of dev/test cycles to catch bugs & verify resiliency.

Benefits in using Litmus.
Litmus is easy to use and to extend or develop for custom requirements, and it is a well-supported open source tool.

@SomeshJoshi19

Please consider the shared file here as adopter for Pravega to acknowledge usage of Litmus Chaos, thanks.
Pravega.md

@shilpa7252

Why do we use Litmus.
To inject network-related faults in Kubernetes environments.

How do we use Litmus.
Litmus is being used as part of QE testing

Benefits in using Litmus.
Litmus is easy to use and makes it easy to inject faults into the environment.

@nikhil-neu

We are using Litmus Chaos to inject faults in our AKS environments. Before arriving at Litmus we explored other tools, but found Litmus to be the most well-rounded one and the one that aligned most closely with the principles of chaos.
We use Litmus in our pre-prod environments, in the CI/CD stage, as a gate for releases.

The chaos-gated deployments make use of the built-in GitOps integration in Litmus.
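
As an illustration only (not the commenter's actual pipeline), a release gate of this kind can be sketched as a small check over the result of a chaos run. The sketch assumes the result is available as a dictionary shaped like a Litmus ChaosResult, with `status.experimentStatus.verdict` and `status.experimentStatus.probeSuccessPercentage`; the function name `release_allowed` and the threshold parameter are hypothetical, and the field names should be verified against the Litmus version in use.

```python
from typing import Any, Mapping


def release_allowed(chaos_result: Mapping[str, Any], min_probe_success: float = 100.0) -> bool:
    """Return True only if the experiment passed and probes met the threshold.

    Assumes a ChaosResult-like dict; anything missing is treated as a failure.
    """
    status = chaos_result.get("status", {}).get("experimentStatus", {})
    verdict = status.get("verdict", "Awaited")
    # The percentage may arrive as a string, so convert defensively.
    try:
        probe_success = float(status.get("probeSuccessPercentage", 0))
    except (TypeError, ValueError):
        probe_success = 0.0
    return verdict == "Pass" and probe_success >= min_probe_success


if __name__ == "__main__":
    # Hypothetical sample results for a passing and a failing run.
    passed = {"status": {"experimentStatus": {"verdict": "Pass", "probeSuccessPercentage": "100"}}}
    failed = {"status": {"experimentStatus": {"verdict": "Fail", "probeSuccessPercentage": "50"}}}
    print(release_allowed(passed))
    print(release_allowed(failed))
```

A CI job could exit non-zero when this returns False, which blocks the release stage without any Litmus-specific tooling in the pipeline itself.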

https://www.neudesic.com/

@chris-cmsoft

We have used Litmus to build out Chaos Engineering platforms with some of our large E-Commerce customers to improve resilience for big sales periods such as Black Friday.

We looked into quite a few tools, and Litmus provided us with the flexibility we needed, whilst bootstrapping many of the components we would have to write ourselves.

We also used Litmus Chaos experiments when discussing some of our customer's architecture constraints, and showing them real world cases of how to make Kubernetes more resilient.

  • One concrete use case was our customer wanting to build a cluster per app, whilst we wanted to build bigger clusters for easier management. We would use Litmus to show what application failure looks like on one part of the cluster, and show global resilience in their cluster when this happens.

The Litmus community and product have been a great addition to our tool stack and have provided many benefits for us.

@bbarin
Contributor

bbarin commented Dec 3, 2021

We have been using Litmus 2.X at iFood for a couple of months, replacing chaostoolkit, as it provides a wider range of experiments out of the box. We started by using it to validate the fallback mechanisms of critical services monthly. Right now, we are expanding its usage to inject failures that drop access to databases, Redis, Kafka and AWS services, so we can learn from them and take countermeasures to improve the critical services.
I hope Litmus becomes the de facto tool for implementing Chaos Engineering in a simple manner.
Github: bbarin
website: ifood.com.br

@vadheraju

vadheraju commented May 23, 2022

At FIS Global, we have embarked on a larger SRE program to transform platform teams from a purely operations focus to an SRE/automation culture and mindset. As part of that larger effort, Chaos/Resiliency Engineering has been identified as a key program to improve stability and availability, and thus the overall reliability of applications across the organization, and to provide a superior customer experience. We chose Litmus as our Chaos Engineering tool because it:

  • Fulfills all of our resiliency testing requirements
  • Has a good and responsive community
  • Has good documentation
  • Is built on a loosely coupled architecture
  • Has nice dashboard features
  • Exposes APIs to integrate with CI/CD pipelines

Where we are using Litmus

  • Currently we use it on applications/workloads, but the idea is to expand to infrastructure. For example, we use network latency to identify and understand the resiliency of an upstream application/component when a downstream application/component is slow, and pod delete under production load to understand the application's ability to self-heal.
  • Simulate experiments using Litmus to understand utilization of the JVM's key resources such as thread pool, connection pool, heap memory, etc.
  • Kafka resiliency: Kafka itself is a complex distributed system. We plan to use Litmus network and memory hog experiments to simulate latency between producer and broker, consumer and broker, and leader and follower, and to understand how the cluster behaves under memory and CPU pressure.
  • Integrate Litmus with CI/CD over APIs so that Chaos Testing can be autonomous
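
For readers unfamiliar with how the network latency and pod delete faults mentioned above are expressed, here is a minimal sketch that builds ChaosEngine manifests as plain Python dictionaries. It is not FIS Global's configuration. The helper name `chaos_engine`, the namespace, labels, service account naming, and tunable values are all assumptions for illustration, and the field layout follows the `litmuschaos.io/v1alpha1` ChaosEngine schema as commonly documented, so it should be checked against the installed Litmus version.

```python
import json
from typing import Dict


def chaos_engine(name: str, namespace: str, app_label: str,
                 experiment: str, env: Dict[str, str]) -> dict:
    """Build a ChaosEngine manifest for a single experiment (illustrative only)."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "engineState": "active",
            "appinfo": {"appns": namespace, "applabel": app_label, "appkind": "deployment"},
            # Assumed naming convention for the service account; adjust to your RBAC setup.
            "chaosServiceAccount": f"{experiment}-sa",
            "experiments": [{
                "name": experiment,
                "spec": {"components": {
                    # Experiment tunables are passed as name/value string pairs.
                    "env": [{"name": k, "value": v} for k, v in env.items()]
                }},
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical targets: slow down a downstream service, then delete pods.
    latency = chaos_engine("downstream-latency", "payments", "app=downstream",
                           "pod-network-latency",
                           {"NETWORK_LATENCY": "2000", "TOTAL_CHAOS_DURATION": "60"})
    pod_delete = chaos_engine("self-heal-check", "payments", "app=upstream",
                              "pod-delete",
                              {"TOTAL_CHAOS_DURATION": "30", "CHAOS_INTERVAL": "10"})
    print(json.dumps(latency, indent=2))
    print(json.dumps(pod_delete, indent=2))
```

The printed JSON can be applied to a cluster with the usual Kubernetes tooling; generating manifests from code rather than hand-editing them is one way to keep experiments consistent across services.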

@vraton

vraton commented Jun 9, 2022

At adidas, we started a new initiative some months ago to implement chaos engineering practices, in order to give the engineering teams a guide and tools for testing the resilience of their applications through chaos engineering. With this goal in mind, we started to define some best practices and processes to be shared with our engineering team, and we started to evaluate a few tools.

After an evaluation of different tools, we decided to go ahead with Litmus Chaos.
How are we using Litmus chaos:

  • Applications/Workloads or Infra that are being subjected to chaos by Litmus

    • Litmus chaos will be provided by our platform team as part of their services. It will be running on kubernetes and will be available for engineering teams.
    • Experiments such as pod deletion, network latency or packet loss, applied between functional dependencies like checkout & Payments, login, order creation...
    • Not applied in production yet.
  • Why was Litmus chosen & how it is helping you (a brief description of the usecase). We defined a set of priorities (with different values) and stoppers, analyzed the tooling, and selected the most valued one:

    • Prio 1 & Stoppers if not: Full detailed documentation in English available, API / Shared Libraries, Control Injecting Failure, Permissions scope isolated, Authorization, Chaos Scenarios - Parallel, Works with: Kubernetes, OpenSource
    • Prio 2: Installation and Management, Metrics / Reporting, Halt attack, Automatic rollback, High/admin permissions on the node, Chaos scenarios as code, Chaos attacks - Serial, Custom or Specialized Attacks, Custom or Specialized Scenarios, Works with: AWS
    • Prio 3: Access to the logs, Scheduling attacks, Health Checks, Application Attacks, Target Randomization, Network Attacks, VMs Attacks, Public API, Web UI
  • Are you using it as part of devtest, CI/CD, in staging/pre-prod/prod, or other

    • Staging/pre-prod
    • Planned to go to production and through CI/CD pipelines.
  • If you would like your name (as standalone user) or organization name to be added to the Adopters.md, please provide a preferred contact handle like GitHub id, Twitter id, LinkedIn id, website etc.

@eran-levy

We are utilizing Chaos Engineering for something else at the moment :) We found it very useful for building our engineers' confidence in responding to production incidents and for training them on cloud native engineering practices. Check out this article, where I elaborate more on our workshop: https://www.infoq.com/articles/chaos-engineering-cloud-native/

@jonathasb-cit

After an evaluation period of some Chaos Engineering tools, we chose Litmus because it is a more mature tool that would meet most of our needs. We are in the implementation, configuration, and process definition phase.
AB-Inbev's BEES is a huge project with hundreds of microservices. It has been a great challenge to adapt Litmus to this process, making customizations and counting on the help of the Litmus community to evolve the tool, and thus achieve our goal of making it available to the teams.
Some points that made us choose Litmus:

  • Based on K8S resources
  • SSO
  • Customization of attacks, attacks in parallel
  • Installation on multiple clusters
  • GitOps

@rutu-k

rutu-k commented Sep 14, 2022

At InfraCloud, we are using Litmus to develop Resiliency Frameworks.
Why do we use Litmus.
To simulate various Chaos scenarios using fault injection templates provided by Litmus. Litmus also helps to incorporate custom fault templates developed using AWS SSM documents.

How do we use Litmus.
Currently, we have tested different kinds of scenarios, including faults like pod deletion, network latency, resource stressing, network partitioning in databases, and many more.

Benefits in using Litmus.

  • Easy deployment.
  • Easy Fault injection.
  • Custom Grading for experiments
  • SSM integration helps to inject fault in both EKS and external AWS components.

Company website: https://www.infracloud.io/
Company GitHub: https://github.com/infracloudio

@tao12345666333

We practice chaos engineering using Litmus in the Apache APISIX Ingress.

Litmus also helped us find hidden bugs.

Project website: https://apisix.apache.org/
This is the text version of my online sharing content. https://dev.to/apisix/building-a-more-robust-apache-apisix-ingress-controller-with-litmus-chaos-3ldn

@abdiakhate

At Baobab Group, we use LitmusChaos to orchestrate chaos on Kubernetes to help developers and SREs find weaknesses in their application deployments.

We use it in the QA and preprod stages to see how the workloads and AWS resources behave under failure injection.

How do we use Litmus.
We use it on our Kubernetes workloads, with faults like pod deletion or CPU hog, and we plan to extend it to cloud services.

Benefits in using Litmus.

  • GitOps friendly
  • Integrate easily in cloud native environment.
  • Easy Fault injection.
  • Visualize chaos scenario

Company website: https://baobab.com/

@imrajdas imrajdas pinned this issue Jul 25, 2023
@amityt amityt unpinned this issue Aug 21, 2023
@ajeshbaby ajeshbaby pinned this issue Nov 2, 2023
@amityt amityt unpinned this issue Jan 19, 2024
@imrajdas imrajdas pinned this issue Jan 22, 2024
@Jonsy13 Jonsy13 unpinned this issue Jan 23, 2024
@Jonsy13 Jonsy13 pinned this issue Jan 23, 2024
@prithvi1307
Contributor

User comment by IFS
image (21)

@safeercm

Flipkart is an adopter of Litmus Chaos. In addition to using the core features, we have also built a VM chaos platform leveraging Litmus. The details are covered in this talk we gave at Chaos Carnival 2024: Building a Chaos Platform for Virtual Machines with OpenSource Tools

  • Applications/Workloads or Infra that are being subjected to chaos by Litmus
    • Stateless services running on our Kubernetes infrastructure
    • VMs running stateful workloads (using the VM Chaos platform built on top of Litmus)
  • Why was Litmus chosen & how it is helping you (a brief description on the usecase)
    • We did an exhaustive analysis of top opensource chaos tools based on various scenarios. Litmus chaos was a winner in terms of
      • Good User experience and interface
      • Stable and secure chaos infrastructure
      • Detailed documentation and active opensource community
      • Ease of modifying code ( We modified both backend and front end to suit our needs )
      • Pre-built kubernetes native experiments
    • Litmus helps us in testing out the failure scenarios - which helps in validating assumptions about failover capabilities of our infrastructure as well as validating run books and failure recovery
  • Are you using it as part of devtest, CI/CD, in staging/pre-prod/prod or other
    • Currently we are using this in pre-prod infrastructure.
  • If you would like your name (as standalone user) or organization name to be added to the Adopters.md, please provide a preferred contact handle like github id, twitter id, linkedin id, website etc.

@MichaelMorrisEst
Contributor

Why do we use Litmus.
We are using Litmus at Ericsson to perform resilience testing of our applications and to gain an understanding of how they perform in failure scenarios

How do we use Litmus.
We are using Litmus in pre production CI testing phase

Benefits in using Litmus.
Litmus is easy to use and provides a good level of functionality with the included fault scenarios, whilst the architecture allows for easily deploying custom faults if required.
It provides the means to easily test scenarios that would otherwise be difficult to test

@SahilKr24 SahilKr24 unpinned this issue Mar 14, 2024
@smitthakkar96
Contributor

smitthakkar96 commented Mar 14, 2024

Within Delivery Hero, two of our entities, Hungerstation and PedidosYa, have been leveraging Litmus to enhance the resilience of their services. We use various faults offered by Litmus, such as Network Latency, Network Corruption, etc. Using Litmus, the teams have been able to test mechanisms such as circuit breaking, fallbacks, scaling behaviour, context timeouts, etc. Building on this experience, we are currently developing an internal Chaos Engineering Platform, based on Litmus, as part of our Global Developer Platform initiative. This platform aims to standardize and elevate chaos engineering practices across all Delivery Hero verticals.

@akria18
Member

akria18 commented Mar 18, 2024

At Talend, we are using Litmus 2.x and Litmus 3.x within our pipeline and for weekly checks. Litmus was the solution we chose to help us on our journey with chaos engineering.

How do we use Litmus?

Litmus is deployed in our environment to validate our observability/security stack and to help promote our builds before they go live into production. We use it within a weekly job that utilizes Litmus as a chaos controller, along with a custom-built tool that collects results after injected experiments and sends them to Slack in report form for better resilience improvements in our observability/security stack.
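
To make the reporting step concrete, here is a small sketch of how collected experiment results might be turned into a Slack message. It is not Talend's tool. The function name `build_slack_report`, the result fields (`experiment`, `target`, `verdict`) and the sample data are hypothetical. The only external convention relied on is that Slack incoming webhooks accept a JSON body with a `text` field; the sketch builds that payload and leaves the HTTP call out.

```python
import json
from typing import Iterable, Mapping


def build_slack_report(results: Iterable[Mapping[str, str]]) -> dict:
    """Summarise experiment verdicts into a payload for a Slack incoming webhook."""
    rows = list(results)
    passed = sum(1 for r in rows if r["verdict"] == "Pass")
    lines = [f"Weekly chaos report: {passed}/{len(rows)} experiments passed"]
    for r in rows:
        marker = "PASS" if r["verdict"] == "Pass" else "FAIL"
        lines.append(f"[{marker}] {r['experiment']} on {r['target']}")
    return {"text": "\n".join(lines)}


if __name__ == "__main__":
    # Hypothetical results gathered after a weekly chaos job.
    sample = [
        {"experiment": "pod-delete", "target": "grafana", "verdict": "Pass"},
        {"experiment": "pod-network-latency", "target": "loki", "verdict": "Fail"},
    ]
    print(json.dumps(build_slack_report(sample), indent=2))
```

Keeping payload construction separate from the network call makes the report format easy to test without a Slack workspace.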

We have also started using it to validate our SLIs/SLOs and their runbooks. Additionally, we use it in our Jenkins pipeline when we want to promote builds to production after QA tests and validate that the new version supports newly injected turbulences, etc.

Benefits of using Litmus?

Litmus is a straightforward framework that provides multiple experiments and is easy to use by developers. It allows for the creation of specific chaos workflows depending on their needs.

@ajeshbaby ajeshbaby pinned this issue Mar 21, 2024
@neelanjan00 neelanjan00 unpinned this issue Apr 4, 2024
@alininja
Contributor

alininja commented May 7, 2024

Thank you for creating such wonderful software 🙏

Here are the requested details:

  • Applications/Workloads or Infra that are being subjected to chaos by Litmus
    • Example application hosted on OpenShift to investigate adding Litmus to an internal, production OpenShift cluster
  • Why was Litmus chosen & how it is helping you (a brief description on the usecase)
    • It was chosen because it provided:
      • a web interface
      • integrated with Kubernetes
      • installable via Helm chart
      • has sufficient chaos scenarios/experiments
    • We want to use it to stress all of the internally developed applications
  • Are you using it as part of devtest, CI/CD, in staging/pre-prod/prod or other
    • we will eventually integrate Litmus into staging/prod environments via CI/CD pipelines
  • If you would like your name (as standalone user) or organization name to be added to the Adopters.md, please provide a preferred contact handle like github id, twitter id, linkedin id, website etc.
    • No thank you

@prithvi1307 , I hope this is okay. Please let me know if there's anything else I can provide.

Thanks again for making such an awesome product 🙇

@dhruv5176

Leveraging Litmus Chaos Engineering in Kubernetes Infrastructure

We have a Kubernetes-based infrastructure pivotal to our operations, where reliability and resilience are paramount. Recognizing the need for robust testing methodologies, we turned to Litmus Chaos Engineering to fortify our systems against potential failures and to ensure seamless operations even under adverse conditions.

Why Litmus:
Litmus emerged as our tool of choice due to its comprehensive suite of chaos engineering capabilities tailored specifically for Kubernetes environments. Its versatility in orchestrating controlled chaos experiments aligns perfectly with our commitment to enhancing system reliability while maintaining agility.

Use Case and Implementation:
We have integrated Litmus Chaos Engineering into various stages of our development and deployment pipeline, spanning from development and testing to staging and production environments. Leveraging Litmus, we carefully craft and execute chaos experiments, closely observing how our infrastructure behaves under stress and ensuring it meets our predefined Service Level Objectives (SLOs) and Service Level Indicators (SLIs).

Achievements:
Our journey with Litmus Chaos Engineering has been marked by significant milestones:

  • Successful deployment of Chaos Center and Litmus Delegate, empowering us with centralized chaos management capabilities.
  • Establishment of secure access to Chaos Center through HTTPS, coupled with domain customization for enhanced usability.
  • Implementation of WAF ACL to restrict access to Chaos Center, ensuring secure interactions.
  • Integration of Azure SSO for streamlined user management and authentication.
  • Seamless connectivity between Chaos Center and target nodes, facilitating efficient chaos experimentation.
  • Execution of numerous successful experiments, validating the resilience and scalability of our infrastructure.

Next Steps:
As we continue to harness the power of Litmus Chaos Engineering, we remain committed to expanding our chaos engineering initiatives, further refining our chaos experiments, and continually enhancing the resilience of our Kubernetes infrastructure.

Contact Information:
Dhruv's LinkedIn profile.

We are excited about the possibilities that Litmus Chaos Engineering unlocks for us and look forward to sharing our insights and experiences with the community.

@ledbruno
Contributor

Nubank

Nubank is the world’s largest digital banking platform outside of Asia, serving over 100 million customers across Brazil, Mexico, and Colombia.

Applications/Workloads or Infra that are being subjected to chaos by Litmus

  • Money transfer services are covered by chaos experiments, which are critical flows from a business perspective.

Why was Litmus chosen & how it is helping you (a brief description on the use case)

  • Experiment/Failure types are aligned with our first needs.
  • Compatible with our Kubernetes environment.
  • It can be used through an API.
  • Probes can be easily integrated with our observability/monitoring tools.
  • It is helping us promote Chaos Engineering practices, perform game days and run scheduled experiments in staging and production.

How do we use Litmus

  • We use Litmus as a "Chaos engine" of an internal tool that is available for "product teams" to validate "service resilience hypothesis".
  • Engineers use it through NuCLI during "game days" or schedule experiments for continuous resilience.
  • Available on dozens of k8s clusters, for several countries/teams.
  • We use an abstraction JSON file as the "experiment definition" that relies on our own service/shard/alert canonical approach.

Benefits of Litmus

  • Compatible with our Kubernetes environment which enables "service oriented" chaos engineering at Nu.
  • Open Source.
  • Wide range of experiments that can be extended.
  • Active community on slack channel, which helps in solving issues.

@bjoky

bjoky commented Sep 2, 2024

At Infor we have a resilience team for one of our products. In that team we are using Litmus as the main tool for chaos engineering. Some of our reasons for choosing it were that it is an open-source tool with an active community and that it runs in Kubernetes.

Litmus was our entry point into chaos engineering, and it provided us with a palette of possible experiments and types of failures to choose from. We use it to simulate failures on workloads in Kubernetes environments for development, testing and pre-production, but not yet in production environments.

So far, we have mainly used Litmus for “game day” style workshops. We gather the team working with a component and we run a few experiments together with them. But we have also started using it for running automated experiments in controlled environments and are also looking into integrating it in our CI/CD pipeline.

Our experience with running Litmus and chaos engineering workshops has in general been positive. Besides running the tool, we have also put emphasis on the preparation and follow-up phases of our chaos experiments. We have found that the discussions about resilience and chaos engineering are of value to the developers and help create a culture of resilience that improves the quality of our product.

@alicicek1
Contributor

alicicek1 commented Sep 5, 2024

Wingie Enuygun Company

Wingie Enuygun Company is a leading travel and technology company providing seamless travel solutions across various platforms.

Why do we use Litmus

We use Litmus to identify bottlenecks in our systems, detect issues early, and foresee potential errors. This allows us to take proactive measures and maintain the resilience and performance of our infrastructure.

How do we use Litmus

Litmus is integrated into our QA cycles, where it plays a crucial role in catching bugs and verifying the overall resilience of our systems.

Benefits in using Litmus

Litmus chaos experiments are straightforward to implement and can be easily customized or extended to meet our specific requirements, enabling us to effectively manage and optimize our systems at Wingie Enuygun.

@Devendranathashok

Resilience is a key aspect in creating fault-tolerant environments, and leveraging tools like Litmus has been instrumental in automating resilience testing. Litmus has enabled us to simulate real-time chaos scenarios, allowing us to thoroughly verify the robustness of both our infrastructure and applications.

We began with a proof of concept (POC) on a playground cluster. While we explored other tools during this process, Litmus stood out significantly, not only in its capabilities but also due to its excellent user interface. Although we faced a few challenges during the initial setup of Litmus on OpenShift, the team provided timely support, helping us overcome these obstacles and successfully complete the POC.

Now, we've successfully deployed Litmus in a non-production cluster environment, and our SRE team is in the process of transitioning from manual chaos testing to automated chaos tests. This shift will enable us to schedule, automate, and efficiently track the outcomes of these tests, enhancing the resilience of our systems.

@imrajdas imrajdas pinned this issue Sep 9, 2024
@sidvijay18

At PokerBaazi, we leverage Litmus Chaos to subject critical components of our infrastructure to controlled chaos experiments. These include:

  1. Microservices Infrastructure: Our backend is designed as a microservices architecture, running on Kubernetes. We conduct experiments on inter-service communication, API latencies, and service resilience during node failures or resource constraints.
  2. Load Balancers and Networking: We simulate disruptions in networking, such as packet drops or DNS failures, to ensure our applications maintain connectivity and continue serving users.
  3. Application Workloads: High-demand applications like our gaming engine and payment/promotions APIs are put under stress to evaluate their fault tolerance and recovery mechanisms during peak loads or unexpected outages.

We chose Litmus Chaos for several compelling reasons:

  1. Kubernetes-Native Integration: Since our infrastructure is heavily Kubernetes-based, Litmus seamlessly integrates with our stack, making it a natural fit.
  2. Ease of Use and Open-Source: Litmus offers a user-friendly interface along with robust documentation, allowing our teams to adopt it quickly without steep learning curves.
  3. Custom Experiment Support: The ability to create tailored chaos experiments aligned with our specific workloads ensures we can target critical failure scenarios unique to our ecosystem.
  4. Community Support and Scalability: Being an open-source project with an active community, Litmus evolves rapidly, allowing us to leverage the latest chaos engineering methodologies and tools.

Litmus has been instrumental in identifying hidden weaknesses in our system, such as unexpected dependencies or cascading failures. This has enabled us to proactively address potential issues, enhance system resilience, and meet our uptime commitments.

We use Litmus Chaos in various environments to ensure robust testing at every stage of development:

  1. Development: Initial chaos experiments are conducted in isolated dev environments to identify basic resilience issues and ensure service fault tolerance during early-stage development.
  2. Staging/Pre-Production: In staging, we run more comprehensive chaos scenarios simulating real-world failures, such as pod crashes, resource exhaustion, or external API downtime, to ensure the production-like environment is resilient.
  3. Production: Selected, low-risk chaos experiments are conducted in production under strict supervision to verify real-time system robustness and validate SLAs in live conditions.

Litmus Chaos has transformed our approach to building and maintaining a highly resilient gaming platform, allowing us to deliver exceptional user experiences while preparing for the unexpected.
