akka-cluster-custom-downing

Introduction

Akka cluster has akka.cluster.auto-down-unreachable-after configuration property. It enables to down unreachable nodes automatically after specified duration. As the Akka documentation says, using auto-down feature is dangerous because it may cause split brain and lead to multiple clusters. The fundamental problem is that actual node down and temporary network partition cannot be distinguished from each other. You must realize that auto-downed node may be actually alive and one of the worst consequence is that resources that must not be shared are damaged by multiple Cluster Singleton or duplicate entity sharded by Cluster Sharding.

akka-cluster-custom-downing provides configurable auto-downing strategy you can choose based on your distributed application design. It lets you configure which nodes can be downed automatically and who is responsible to execute a downing action.

Theoretical background

It is noteworthy that there exists no perfect split brain resolving strategy. According to so called "FLP impossibility result", no deterministic consensus can be made in asynchronous system even if only one process fails. As you know, akka cluster is asynchronous system. It means that akka cluster cannot make a consensus about their views synchronised with gossip protocol if faulty processes exist.

It is always beneficial to clarify what kind of failure a distributed algorithm handles. Akka cluster has a failure detector and split brain resolver (and also auto-down) handles its failure. In this case the failure indicates the process crashes and do not differentiate crash from omission. Actually the failure detector cannot differentiate crash and omission. In the case of omission, i.e. when the process marked as faulty by failure detector actually are alive, some algorithm go wrong. For example auto-down feature of akka cluster is based on leader which may diverged under omission fault so may result in split brain.

Note that leader election in Akka cluster is based on asynchronously replicated state. A leader is the first member selected from sorted members that is not unreachable. Under network partition some partition may have different view of unreachable members from the others so leader will diverge.

Split brain resolver including akka-cluster-custom-downing basically resolve the problem in the way that,

do not depends on a leader (or leader only)
force some resolved processes to crash, i.e. omission fault is same as crash fault
not perfect as indicated in FLP impossibility, but occurrence of the imperfection is pretty rare

My slide may help you understand these things. I am sorry it is written in japanese. Let me know if there are good document explains split brain problem. http://www.slideshare.net/TanUkkii/akka-cluster-66880662

Status

akka-cluster-custom-downing is currently under development.

akka-cluster-custom-downing is for general purpose. In some case you find it is better to implement your system specific strategy by yourself. akka-cluster-custom-downing shows how to implement split brain resolving strategies so may be your help.

Distributed systems are challenging. Actually I do not use akka cluster or akka-cluster-custom-downing in production. I hope to know how you use them.

Installation

For sbt, add following lines to build.sbt.

resolvers += Resolver.bintrayRepo("tanukkii007", "maven")

libraryDependencies += "com.github.TanUkkii007" %% "akka-cluster-custom-downing" % "0.0.7"

Both Scala 2.11 and 2.12 are supported.

Usage of split brain resolving strategy

OldestAutoDowning

OldestAutoDowning automatically downs unreachable members. A node responsible to down is the oldest member of a specified role. If oldest-member-role is not specified, the oldest member among all cluster members fulfills its duty.

You can enable this strategy with following configuration.

akka.cluster.downing-provider-class = "tanukki.akka.cluster.autodown.OldestAutoDowning"

custom-downing {
  stable-after = 20s
  
  oldest-auto-downing {
    oldest-member-role = ""
    down-if-alone = true
  }
}

Unlike leader based downing strategy, the oldest based downing strategy is much safer. It is because the oldest member is uniquely determined by all members even if gossip is not converged, while different leader might be viewed by members under gossip unconvergence.

Downside of the oldest based downing strategy is loss of downing functionality when the oldest member itself fails. If down-if-alone is set to be true, such scenario can be avoided because the secondary oldest member will down the oldest member if the oldest member get unreachable alone.

QuorumLeaderAutoDowning

QuorumLeaderAutoDowning automatically downs unreachable nodes after specified duration if the number of remaining members are larger than or equal to configured quorum size. This strategy is same as static quorum strategy of Split Brain Resolver from Typesafe reactive platform. If down-if-out-of-quorum is set to be true, remaining members which number is under quorum size will shutdown ActorSystem by themselves. If role is specified, the number of remaining members in the role is used to be compared with quorum size.

akka.cluster.downing-provider-class = "tanukki.akka.cluster.autodown.QuorumLeaderAutoDowning"

custom-downing {
  stable-after = 20s
  
  quorum-leader-auto-downing {
    role = ""
    quorum-size = 0
    down-if-out-of-quorum = true
  }
}

Usage of Leader based downing strategies

It is not recommended that your auto-down strategy depends on a leader.

With careful consideration, however you may use auto-down feature safely for a specific design of distributed application.

If following conditions are met, nodes can be safely down automatically.

a node will be isolated from incoming requests if the node is detached from the other cluster members
a node mutates shared resources only if it receives a request

How this conditions are applied is vary according to application design.

LeaderAutoDowningRoles

LeaderAutoDowningRoles automatically downs nodes with specified roles. Like akka.cluster.AutoDowning, which is provided with Akka Cluster, a node responsible to down is the leader.

You can enable this strategy with following configuration.

akka.cluster.downing-provider-class = "tanukki.akka.cluster.autodown.LeaderAutoDowningRoles"

custom-downing {
  stable-after = 20s

  leader-auto-downing-roles {
    target-roles = [worker]
  }
}

RoleLeaderAutoDowningRoles

RoleLeaderAutoDowningRoles automatically downs nodes with specified roles. A node responsible to down is the role leader of a specified role.

You can enable this strategy with following configuration.

akka.cluster.downing-provider-class = "tanukki.akka.cluster.autodown.RoleLeaderAutoDowningRoles"

custom-downing {
  stable-after = 20s
  
  role-leader-auto-downing-roles {
    leader-role = "master"
    target-roles = [worker]
  }
}

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
img		img
project		project
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build.sbt		build.sbt
circle.yml		circle.yml
version.sbt		version.sbt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

akka-cluster-custom-downing

Introduction

Theoretical background

Status

Installation

Usage of split brain resolving strategy

OldestAutoDowning

QuorumLeaderAutoDowning

Usage of Leader based downing strategies

LeaderAutoDowningRoles

RoleLeaderAutoDowningRoles

About

Releases

Packages

Languages

License

fflatorre/akka-cluster-custom-downing

Folders and files

Latest commit

History

Repository files navigation

akka-cluster-custom-downing

Introduction

Theoretical background

Status

Installation

Usage of split brain resolving strategy

OldestAutoDowning

QuorumLeaderAutoDowning

Usage of Leader based downing strategies

LeaderAutoDowningRoles

RoleLeaderAutoDowningRoles

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages