
Commit

branching for 3.3.x
popduke committed Aug 26, 2024
1 parent b86efce commit 44f32c5
Showing 571 changed files with 10,707 additions and 187 deletions.
138 changes: 36 additions & 102 deletions website/docs/04_cluster/2_high_availability.md
@@ -3,132 +3,66 @@ sidebar_position: 2
title: "High Availability"
---

Due to the decentralized cluster-building capability provided by the Gossip-based protocol (base-cluster) we offer, BifroMQ StandardCluster nodes do not require additional service discovery components to establish a cluster. This eliminates
operational risks associated with single points of failure and enables the flexible scalability of the cluster, ensuring high availability across the entire system.
# High Availability

From an implementation perspective, BifroMQ internally decomposes the workload, forming logically independent subclusters for each type of load. Specifically, modules related to storage leverage the Raft algorithm to guarantee consistency
and high availability.
BifroMQ is designed to deliver high availability by leveraging a decentralized cluster-building capability provided by its Gossip-based protocol (base-cluster). This architecture eliminates the need for additional service discovery components to establish a cluster, thereby reducing the operational risks associated with single points of failure. The result is a system that can scale flexibly and maintain high availability across all nodes in the cluster.

Internally, BifroMQ decomposes workloads and organizes them into logically independent subclusters. Each type of load is managed by a separate subcluster, and modules related to storage use the Raft algorithm to ensure consistency and high availability.

## How to Enable High Availability in a Cluster

Currently, the BifroMQ StandardCluster employs a deployment model that encapsulates all workloads within a single process. As different subclusters have varying requirements for high availability, enabling high availability in the BifroMQ
cluster necessitates meeting the following conditions.
The BifroMQ StandardCluster employs a deployment model where all workloads are encapsulated within a single process. Given that different subclusters have varying requirements for high availability, enabling high availability in a BifroMQ cluster requires the following conditions to be met.

### Cluster Node Count

For the stateful distributed storage, base-kv, in a cluster, the availability of the service is ensured only if more than half of the nodes in the cluster are alive. Therefore, **the cluster deployment must have a node count greater than or
equal to 3**.
For the stateful distributed storage engine, base-kv, within a cluster, service availability is guaranteed only if more than half of the nodes in the cluster are alive. Therefore, **the cluster must have a minimum of 3 nodes** to ensure high availability.

### Configuration of base-cluster
### Clustering Configuration

The configurations related to base-cluster are centralized in the "clusterConfig" section of the configuration file:

| Parameter Name | Default Value | Recommended Value |
|----------------|---------------|-------------------|
| host           | Null          | An IP address of the node that is mutually accessible within the cluster. |
| port           | Null          | A port on the node that is mutually accessible within the cluster. |
| seedEndpoints  | Not Null      | The entry point a new node contacts when joining the cluster. It is recommended to list the endpoints of all nodes in the cluster, or a unified proxy address for all nodes. |
| Parameter Name    | Default Value | Recommended Value |
|-------------------|---------------|-------------------|
| env               | Test          | ***Prod*** or ***Test***, used to isolate clusters so that nodes from one cluster do not accidentally join another. |
| host              | Not Set       | An IP address of the node that is reachable by all other nodes in the cluster. |
| port              | Not Set       | A port on the node that is reachable by all other nodes in the cluster. |
| clusterDomainName | Not Set       | The domain name under which the cluster nodes are registered. |
| seedEndpoints     | Not Null      | A list of endpoints of existing nodes in the cluster. |
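
For reference, a minimal sketch of the `clusterConfig` section for one node of a three-node deployment is shown below. The addresses and port are placeholders, and the `seedEndpoints` format shown here (a comma-separated list of `host:port` entries) is an assumption that should be checked against the configuration reference for your version.

```
clusterConfig:
  env: Prod
  # this node's address, reachable by every other node in the cluster (placeholder)
  host: 192.168.0.11
  # cluster port, open between all nodes (placeholder)
  port: 8899
  # endpoints of existing or seed nodes; format assumed to be a comma-separated host:port list
  seedEndpoints: 192.168.0.11:8899,192.168.0.12:8899,192.168.0.13:8899
```

Typically the same section is used on every node, with only `host` differing per node.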

### Configuration of Replica Count

### Configuration of base-kv Replica Count
BifroMQ leverages base-kv to support three native MQTT load types:

BifroMQ's internal storage is divided into three parts: MQTT subscription routes, MQTT messages from connections with `cleanSession=false`, and retained messages. The corresponding module names in BifroMQ are: `dist-worker` for MQTT
subscription routes, `inbox-store` for MQTT messages with `cleanSession=false` connections, and `retain-store` for retained messages. Each module forms an independent base-kv subcluster.
- MQTT Dynamic Subscriptions (DistWorker)
- MQTT Offline Messages (InboxStore)
- MQTT Retained Messages (RetainStore)

The configuration for the number of replicas in base-kv is passed through system variables. To achieve high availability, the following system variables need to be modified:
The base-kv module uses the Raft protocol to ensure consistency and high availability of shard replicas. To achieve high availability, you must modify the following system variables:

| System Variable Name | Default Value | Recommended Value |
|--------------------------------|---------------|---------------------------------------|
| dist_worker_range_voter_count | 3 | At least 3, preferably an odd number. |
| inbox_store_range_voter_count | 1 | At least 3, preferably an odd number. |
| retain_store_range_voter_count | 3 | At least 3, preferably an odd number. |
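
These are BifroSysProp system variables; a minimal sketch, assuming they are supplied as JVM system properties when launching BifroMQ (the exact way your deployment passes JVM options may differ and should be verified):

```
# JVM options passed at startup; property names are taken from the table above
-Ddist_worker_range_voter_count=3
-Dinbox_store_range_voter_count=3
-Dretain_store_range_voter_count=3
```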

## Impact of High Availability

### Throughput of Messages with `cleanSession=false`

After enabling multiple replicas in inbox-store, each offline message write must wait for the Raft protocol to replicate the entry from the leader to a majority of members before it is acknowledged as successful. In addition, each message is physically written multiple times, once per replica. The replica count therefore affects both the response latency of messages with `cleanSession=false` and the overall throughput of the cluster.

During deployment, it is advisable to assess and set a reasonable number of replicas based on the actual usage scenario, striking a balance between high availability and performance.

## Analysis of Failure Scenarios and Recovery

Consider a three-node BifroMQ deployment with inbox-store configured for three replicas as an example.

### One Node Failure

#### Impact

According to the Raft protocol, the remaining two replicas can continue to function correctly. If the crashed node is the leader, the remaining two replicas will elect a new leader and continue to operate, ensuring no data loss.

#### Recovery

When the crashed node is restarted, or a new node is started in its place, BifroMQ automatically discovers it and adds it to the Raft cluster, restoring the three-replica configuration and high availability.

### Two Node Failure

#### Impact

According to the Raft protocol, the remaining single replica cannot form a majority quorum and therefore cannot operate normally.

#### Recovery

* When the crashed nodes are restarted, they reload their previously persisted replica data. The BifroMQ cluster discovers them and automatically adds them back to the Raft cluster, restoring the three-replica configuration and high availability with no data loss.
* Alternatively, two new nodes can be started. BifroMQ automatically discovers them and adds them to the Raft cluster, restoring the three-replica configuration and high availability.
***Note: If the remaining replicas after the crash are followers that have not synchronized to the leader's latest progress, this recovery method may result in the loss of some data that had already achieved Raft consensus and been written.***

### Three Node Failure

The impact and recovery are similar to the two-node failure scenario described above.

### Automatic Recovery Capability (Optional)

The Raft protocol requires that more than half of the nodes in the cluster must be alive for the system to operate. In the scenario where two nodes crash, as described above, the remaining single replica will be unable to function.

BifroMQ provides a configurable capability that allows the cluster to detect when a Raft group has lost its majority and to proactively shrink the Voter list in the Raft configuration, enabling the group to continue functioning.

Add the following parameters to the configuration file to override the existing policy:

```
stateStoreConfig:
  inboxStoreConfig:
    balanceConfig:
      balancers:
        - com.baidu.bifromq.inbox.store.balance.ReplicaCntBalancerFactory
        - com.baidu.bifromq.inbox.store.balance.RangeSplitBalancerFactory
        - com.baidu.bifromq.inbox.store.balance.RangeLeaderBalancerFactory
        - com.baidu.bifromq.inbox.store.balance.RecoveryBalancerFactory
  distWorkerConfig:
    balanceConfig:
      balancers:
        - com.baidu.bifromq.dist.worker.balance.ReplicaCntBalancerFactory
        - com.baidu.bifromq.dist.worker.balance.RecoveryBalancerFactory
  retainStoreConfig:
    balanceConfig:
      balancers:
        - com.baidu.bifromq.retain.store.balance.ReplicaCntBalancerFactory
        - com.baidu.bifromq.retain.store.balance.RecoveryBalancerFactory
```

***Note: As analyzed in the previous section, if the remaining minority replicas after a crash are followers that have not synchronized to the leader's latest progress, some data will be lost after automatic recovery. Before enabling this feature, assess whether this type of data loss is acceptable.***

### Configuration of StoreBalancers

BifroMQ's base-kv module implements decentralized management of the persistent service cluster, including capabilities such as initialization, sharding, load balancing, and recovery. The following StoreBalancers are built-in by default:

- **RangeBootstrapBalancer**: Initializes the cluster (same effect as the ***bootstrap*** setting in the config file).
- **RedundantEpochRemovalBalancer**: Cleans up redundant ranges.
- **RangeLeaderBalancer**: Balances the distribution of shard leaders across the cluster.
- **ReplicaCntBalancer**: Balances the number of shard replicas (both Voter and Learner) across the cluster based on the configured settings.
- **RangeSplitBalancer**: Splits shards according to predefined load strategies.
- **UnreachableReplicaRemovalBalancer**: Removes unreachable shard replicas.
- ~~**RecoveryBalancer**: Deprecated since version 3.3.0 and no longer in use.~~

The built-in Balancers are suitable for most use cases. However, for more complex operational scenarios and SLA requirements, users can customize StoreBalancers according to their needs, provided they have a deep understanding of the base-kv architecture. The BifroMQ team also offers professional consulting services in this area.

## Performance Impact

Enabling multiple replicas has the following performance implications under the default configuration:

- **DistWorker**: Optimized for read performance by default, with a maximum of 3 Voter replicas; the remaining replicas are Learners. As the number of nodes increases, subscription routing performance can scale horizontally.
- **InboxStore**: Configured to balance read and write performance through sharding. Since write performance is significantly affected by the number of Voter replicas, the default configuration uses a single replica. Users can adjust the number of Voter replicas in this setting to enhance availability as needed.
- **RetainStore**: Optimized for read performance by default, with a maximum of 3 Voter replicas; the remaining replicas are Learners. As the number of nodes increases, retained message matching performance can scale horizontally.
86 changes: 86 additions & 0 deletions website/docs/04_cluster/3_upgrade.md
@@ -0,0 +1,86 @@
---
sidebar_position: 3
title: "Upgrade"
---

# Upgrade Guide

BifroMQ supports multiple upgrade strategies to ensure smooth transitions between versions. This document provides a detailed guide on upgrading BifroMQ, including version compatibility, upgrade methods, and specific steps for each method.

## Version Compatibility

BifroMQ version compatibility is categorized into two main aspects: **data compatibility** and **inter-cluster RPC protocol compatibility**. The versioning scheme follows semantic versioning conventions:

- **x.y.z**:
- **x**: Major version number.
- **y**: Minor version number.
- **z**: Patch version number.

### Compatibility Rules

1. **Patch Version Upgrades (x.y.z)**:
- When only the patch version (z) changes, **both data compatibility and inter-cluster RPC protocol compatibility** are guaranteed.

2. **Minor Version Upgrades (x.y)**:
- When the minor version (y) changes, **inter-cluster RPC protocol compatibility** is guaranteed.
- **Data compatibility** may change, but any such changes will be explicitly documented.

3. **Major Version Upgrades (x)**:
- Major version upgrades may introduce **breaking changes** that affect both data compatibility and inter-cluster RPC protocol compatibility.
- **Rolling upgrades within the cluster are not supported** for major version changes unless explicitly stated in the release notes that both data compatibility and inter-cluster RPC protocol compatibility are maintained. Users will need to manage the migration of their applications and data from the old version cluster to the new version cluster independently.

## Upgrade Methods

BifroMQ supports two rolling upgrade methods:

1. **In-Place Upgrade**
2. **Replace Upgrade**

### In-Place Upgrade

This method is suitable when the new version maintains both data compatibility and inter-cluster RPC protocol compatibility with the existing version.

#### Steps for In-Place Upgrade

1. **Stop the Current Version**:
- Gracefully shut down the BifroMQ service.

2. **Preserve the Data Directory**:
- Ensure the `data` directory is retained.

3. **Update Files and Configurations**:
- Replace the necessary binaries and update the configuration files as required by the new version.

4. **Start the New Version**:
- Launch the new version of BifroMQ using the existing data.
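
A minimal sketch of these steps for one node, assuming the standard distribution layout with a `bin/standalone.sh` control script and the `data` directory kept in place under the installation (both assumptions; adjust to your setup):

```
# stop the old version gracefully
./bin/standalone.sh stop

# keep the existing data directory untouched, replace the binaries,
# and merge any configuration changes required by the new version

# start the new version against the preserved data directory
./bin/standalone.sh start
```

Repeat node by node, waiting for each node to rejoin the cluster before moving on.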

### Replace Upgrade

Use this method when the new version does not maintain data compatibility with the existing version but inter-cluster RPC protocol compatibility is still ensured.

#### Steps for Replace Upgrade

1. **Add New Version Nodes**:
- Gradually introduce nodes running the new version into the cluster (a configuration sketch for joining the existing cluster follows this list).

2. **Remove Old Version Nodes**:
- Sequentially remove nodes running the old version from the cluster until all nodes are upgraded to the new version.

3. **Automatic Data Migration**:
- During the upgrade, BifroMQ automatically handles data migration and synchronization between nodes.
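
As a sketch of step 1, a new-version node can be pointed at the existing cluster through its `clusterConfig`. The addresses below are placeholders, and the comma-separated `seedEndpoints` format is an assumption to be checked against the configuration reference for your version:

```
clusterConfig:
  env: Prod                 # must match the env of the existing cluster
  host: 192.168.0.14        # address of the new-version node (placeholder)
  port: 8899
  # point the new node at nodes that are still running the old version (placeholders)
  seedEndpoints: 192.168.0.11:8899,192.168.0.12:8899,192.168.0.13:8899
```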

#### Important Considerations for Replace Upgrade

- **Voter Configuration**:
- Ensure that the following BifroSysProp properties for `dist-worker`, `inbox-store`, and `retain-store` are configured with more than one Voter; otherwise, data migration will not happen:
- `dist_worker_range_voter_count`
- `inbox_store_range_voter_count`
- `retain_store_range_voter_count`

- **Node Upgrade Limitation**:
- Do not upgrade more than half of the nodes simultaneously to avoid making the cluster unavailable.

## Summary

- **In-Place Upgrade**: Suitable for upgrades within the same major and minor versions (x.y), where data and RPC protocol compatibility are maintained.
- **Replace Upgrade**: Necessary for upgrades across minor versions or where data compatibility may not be maintained. Follow the step-by-step process to ensure a smooth transition.
4 changes: 2 additions & 2 deletions website/docs/99_test_report/1_test_report.md
@@ -533,13 +533,13 @@ accept.
### Memory

* vm.max_map_count: Limits the number of VMAs (Virtual Memory Areas) that a process can have. It can be increased to
221184.
221184.

### Maximum Open Files

* nofile: Specifies the maximum number of files that a single process can open.
* nr_open: Specifies the maximum number of files that can be allocated per process, usually defaulting to 1024 * 1024 =
1048576.
1048576.
* file-max: Specifies the maximum number of files that the system kernel can open, with a default value of 185745.

### NetFilter Tuning
2 changes: 1 addition & 1 deletion website/docusaurus.config.js
@@ -48,7 +48,7 @@ const config = {
lastVersion: 'current',
versions: {
current: {
label: '3.2.x',
label: '3.3.x',
path: '',
},
},
@@ -1,6 +1,6 @@
{
"version.label": {
"message": "3.2.x",
"message": "3.3.x",
"description": "The label for version current"
},
"sidebar.tutorialSidebar.category.Get Started": {
