- Gateways: 2 gateway nodes with a shared IP, made highly available with keepalived
- Ansible-master: a control node that manages all nodes and executes the Ansible playbooks
- etcd: 3 nodes forming an etcd cluster running etcd v3.4.7
- K8S: 3 master nodes + 1 worker node
Note: The etcd servers are connected directly to the masters' API servers; this traffic is not load-balanced through the LBs.
The gateways are the only communication path to the nodes. In this architecture, all nodes sit on a private network, so there is no way to access them except through a gateway. In the other direction, all nodes reach the internet via NAT through the gateway node. As we can see, the gateway is therefore a potential single point of failure, so we need to make it as highly available as possible. For this purpose we can use keepalived.
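As a rough sketch, NAT on the gateway can be set up with iptables masquerading. The interface names and subnet below are assumptions, not values from this repo, and the block only writes the rules to a script rather than applying them:

```shell
# Hypothetical gateway NAT setup; eth0 = public interface, eth1 = private
# interface, 192.168.0.0/24 = the private node network (all assumed).
cat > gw-nat.sh <<'EOF'
#!/bin/bash
set -euo pipefail
# Let the kernel forward packets between interfaces
sysctl -w net.ipv4.ip_forward=1
# Masquerade traffic from the private network going out the public interface
iptables -t nat -A POSTROUTING -s 192.168.0.0/24 -o eth0 -j MASQUERADE
# Permit forwarding in both directions
iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
iptables -A FORWARD -i eth0 -o eth1 -m state --state RELATED,ESTABLISHED -j ACCEPT
EOF
chmod +x gw-nat.sh
```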
The other roles of the gateways are:
- Load balancing requests between the Kubernetes master nodes.
- Load balancing requests from the user to the cluster.
These gateways have two components:
- Keepalived: to make the gateways highly available, I installed keepalived and configured it to fail over the public IP.
- HAProxy: to load-balance client traffic to the ingress. (In this scenario no ingress has been implemented; this applies to production.)
You can deploy the gateways via the lb roles in the ansible folder.
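For illustration, minimal keepalived and HAProxy fragments for these roles might look like the following. The VIP, interface name, password, and master IPs are placeholders, not the repo's actual values:

```
# /etc/keepalived/keepalived.conf (sketch)
vrrp_instance VI_1 {
    state MASTER              # BACKUP on the second gateway
    interface eth0            # assumed public interface
    virtual_router_id 51
    priority 100              # lower on the backup node
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme    # placeholder
    }
    virtual_ipaddress {
        192.168.0.100         # assumed shared VIP
    }
}

# /etc/haproxy/haproxy.cfg (sketch): TCP passthrough to the kube-apiservers
frontend kube-apiserver
    bind *:6443
    mode tcp
    default_backend k8s-masters

backend k8s-masters
    mode tcp
    balance roundrobin
    server master-1 192.168.0.20:6443 check
    server master-2 192.168.0.21:6443 check
    server master-3 192.168.0.22:6443 check
```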
According to the official etcd documentation on hardware requirements for a small cluster, we need the following hardware:
The cluster has been initialized with three nodes and TLS authentication. All certificates have been issued with OpenSSL.
etcd has been deployed as a systemd service using the etcd binaries from the official repository.
This process, including the TLS authentication, has been automated, and you can use it via the etcd-implementation
According to the official etcd documentation:
etcd is designed to withstand machine failures.
So we can prepare plans for several etcd disaster scenarios. In all of them, the most important thing is to take etcd backups periodically.
Select one of the etcd nodes and execute the following command:
etcdctl --endpoints https://192.168.0.10:2379 --cert=/etc/etcd/ssl/peer.crt --key=/etc/etcd/ssl/peer.key --cacert=/etc/etcd/ssl/ca.crt snapshot save /home/ubuntu/snapshot-22-07-14.db
Note: It is strongly recommended to store the etcd snapshot on separate storage, such as object storage, to keep it safe.
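A periodic backup can be sketched as a small script driven by cron. The paths, the endpoint, and the object-storage upload command are assumptions; the `aws s3 cp` line is just one possible upload mechanism. The block only writes the script:

```shell
# Writes a hypothetical backup script; schedule it e.g. with:
#   0 2 * * * /usr/local/bin/etcd-backup.sh
cat > etcd-backup.sh <<'EOF'
#!/bin/bash
set -euo pipefail
SNAP="/var/backups/etcd/snapshot-$(date +%F).db"
mkdir -p "$(dirname "$SNAP")"
# Take the snapshot with the same TLS flags used elsewhere in this doc
etcdctl --endpoints https://192.168.0.10:2379 \
  --cert=/etc/etcd/ssl/peer.crt --key=/etc/etcd/ssl/peer.key \
  --cacert=/etc/etcd/ssl/ca.crt snapshot save "$SNAP"
# Copy off-node; assumed S3-compatible object storage and bucket name
aws s3 cp "$SNAP" "s3://etcd-backups/$(basename "$SNAP")"
EOF
chmod +x etcd-backup.sh
```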
In this scenario the cluster is still functional and can serve read and write requests. So before more failures happen and quorum is lost, we need to repair the failed node or replace it with a new one. Follow these steps:
- Stop the etcd.service on the node:
systemctl stop etcd.service
- Remove the node from the cluster:
etcdctl --endpoints=https://192.168.0.10:2379,https://192.168.0.11:2379,https://192.168.0.12:2379 --cert=/etc/etcd/ssl/peer.crt --key=/etc/etcd/ssl/peer.key --cacert=/etc/etcd/ssl/ca.crt member remove 6f03c26636586d04
- Add the new node (or the current node) back to the cluster:
etcdctl --endpoints=https://192.168.0.10:2379,https://192.168.0.11:2379,https://192.168.0.12:2379 --cert=/etc/etcd/ssl/peer.crt --key=/etc/etcd/ssl/peer.key --cacert=/etc/etcd/ssl/ca.crt member add etcd-3 --peer-urls=https://192.168.0.12:2380
- If you are doing this on a failed node that only lost its data directory, you only need to edit /etc/etcd/etcd.conf.yaml and set initial-cluster-state: existing. But if you are adding a replacement node, then after installing etcd and etcdctl you need to add the following configuration to /etc/etcd/etcd.conf.yaml:

data-dir: /var/lib/etcd/etcd-1.etcd
name: etcd-1
initial-advertise-peer-urls: https://192.168.0.10:2380
listen-peer-urls: https://192.168.0.10:2380,https://127.0.0.1:2380
advertise-client-urls: https://192.168.0.10:2379
listen-client-urls: https://192.168.0.10:2379,https://127.0.0.1:2379
initial-cluster-state: existing
initial-cluster: etcd-1=https://192.168.0.10:2380,etcd-2=https://192.168.0.11:2380,etcd-3=https://192.168.0.12:2380
client-transport-security:
  cert-file: /etc/etcd/ssl/server.crt
  key-file: /etc/etcd/ssl/server.key
  trusted-ca-file: /etc/etcd/ssl/ca.crt
peer-transport-security:
  cert-file: /etc/etcd/ssl/peer.crt
  key-file: /etc/etcd/ssl/peer.key
  trusted-ca-file: /etc/etcd/ssl/ca.crt
- Start the etcd.service:
systemctl start etcd.service
In this scenario, when the leader fails, the etcd cluster automatically starts an election to choose a new leader. Electing the replacement leader takes roughly the election timeout.
According to the etcd official documentation:
During the leader election the cluster cannot process any writes. Write requests sent during the election are queued for processing until a new leader is elected.
So after the new leader is elected, this scenario reduces to the minor follower failure case.
When the majority of the nodes have failed, the cluster has collapsed. So there are two options:
- Try to transform the scenario into a minor follower failure by getting some nodes back online!
- Initialize a new cluster on a healthy node and join new members to it!
Let's see the second option:
- First, choose one of the nodes to work on.
- Stop the etcd.service and delete the data directory on all nodes:
systemctl stop etcd.service
rm -rf /var/lib/etcd
- Restore the snapshot:
etcdctl --endpoints https://192.168.0.10:2379 --cert=/etc/etcd/ssl/peer.crt --key=/etc/etcd/ssl/peer.key --cacert=/etc/etcd/ssl/ca.crt --initial-cluster=etcd-1=https://192.168.0.10:2380,etcd-2=https://192.168.0.11:2380,etcd-3=https://192.168.0.12:2380 --initial-cluster-token=etcd-cluster-1 --initial-advertise-peer-urls=https://192.168.0.10:2380 --name=etcd-1 --skip-hash-check=true --data-dir /var/lib/etcd snapshot restore /home/ubuntu/snapshot-22-07-14.db
- Edit the /etc/etcd/etcd.conf.yaml configuration file and add the force-new-cluster flag. Make sure that initial-cluster-state is set to new:
data-dir: /var/lib/etcd
name: etcd-1
initial-advertise-peer-urls: https://192.168.0.10:2380
listen-peer-urls: https://192.168.0.10:2380,https://127.0.0.1:2380
advertise-client-urls: https://192.168.0.10:2379
listen-client-urls: https://192.168.0.10:2379,https://127.0.0.1:2379
initial-cluster-state: new
force-new-cluster: true
initial-cluster: etcd-1=https://192.168.0.10:2380,etcd-2=https://192.168.0.11:2380,etcd-3=https://192.168.0.12:2380
client-transport-security:
  cert-file: /etc/etcd/ssl/server.crt
  key-file: /etc/etcd/ssl/server.key
  trusted-ca-file: /etc/etcd/ssl/ca.crt
peer-transport-security:
  cert-file: /etc/etcd/ssl/peer.crt
  key-file: /etc/etcd/ssl/peer.key
  trusted-ca-file: /etc/etcd/ssl/ca.crt
- Start the etcd.service to initialize a new cluster:
systemctl start etcd.service
- Check that the etcd cluster has been initialized successfully:
etcdctl --endpoints=https://192.168.0.10:2379 --cert=/etc/etcd/ssl/peer.crt --key=/etc/etcd/ssl/peer.key --cacert=/etc/etcd/ssl/ca.crt endpoint health
- Now it's time to add new members:
etcdctl --endpoints=https://192.168.0.10:2379 --cert=/etc/etcd/ssl/peer.crt --key=/etc/etcd/ssl/peer.key --cacert=/etc/etcd/ssl/ca.crt member add etcd-2 --peer-urls=https://192.168.0.11:2380
The new member has been added to the cluster. We need to start etcd on the etcd-2 node, but remember that at this point the cluster only has two members in initial-cluster; if we start etcd with an extra initial member, we get a "member count is unequal" error.
- Edit the /etc/etcd/etcd.conf.yaml file:
data-dir: /var/lib/etcd
name: etcd-2
initial-advertise-peer-urls: https://192.168.0.11:2380
listen-peer-urls: https://192.168.0.11:2380,https://127.0.0.1:2380
advertise-client-urls: https://192.168.0.11:2379
listen-client-urls: https://192.168.0.11:2379,https://127.0.0.1:2379
initial-cluster-state: existing
initial-cluster: etcd-1=https://192.168.0.10:2380,etcd-2=https://192.168.0.11:2380
client-transport-security:
  cert-file: /etc/etcd/ssl/server.crt
  key-file: /etc/etcd/ssl/server.key
  trusted-ca-file: /etc/etcd/ssl/ca.crt
peer-transport-security:
  cert-file: /etc/etcd/ssl/peer.crt
  key-file: /etc/etcd/ssl/peer.key
  trusted-ca-file: /etc/etcd/ssl/ca.crt
It's time to start the service:
systemctl start etcd.service
- Repeat this process for other members.
Note: If your nodes are damaged and you have to use new nodes, you have to install etcd and etcdctl before the above process.
Unfortunately, this process has not been automated.
This cluster has 3 masters and 1 worker. For load balancing the masters, I used HAProxy. The cluster has been initialized via kubeadm.
This process has been automated via the following roles:
My first choice for implementing the StorageClass and local-path provisioning was the Kubernetes local-static-provisioner, but due to the PersistentVolume errors I got in the MariaDB replication setup, I switched to the Rancher local-path-provisioner.
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.22/deploy/local-path-storage.yaml
And testing:
kubectl create -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/examples/pvc/pvc.yaml
kubectl create -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/examples/pod/pod.yaml
For specific configuration, you can edit the ConfigMap named local-path-config in the local-path-storage namespace and change the config.json parameters:
data:
config.json: |-
{
"nodePathMap":[
{
"node":"DEFAULT_PATH_FOR_NON_LISTED_NODES",
"paths":["/opt/local-path-provisioner"]
},
{
"node":"k8s-worker-1",
"paths":["/disks"]
}
]
}
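With the provisioner in place, a claim only needs to reference the local-path storage class. A minimal example follows; the claim name and size are arbitrary illustrations, not from this repo:

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc              # arbitrary example name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 1Gi            # example size
```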
Helm can be installed via the official install script:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
For MariaDB replication I prefer to use the Bitnami MariaDB Helm chart.
- Add the chart repository:
helm repo add bitnami https://charts.bitnami.com/bitnami
- First, extract the values.yml file from the Helm chart:
helm show values bitnami/mariadb > values.yml
- Edit values.yml and install the chart. Things you need to set:
- StorageClass
- PVC capacity
- ...
helm install RELEASE_NAME bitnami/mariadb -f values.yml --set rootUser.password=<password> --set replication.password=<password>
After that, check the pods and the StatefulSet.
There is an example values.yml in this repo.
The upgrade process is one of the trickiest processes in k8s. You have to proceed with caution and follow the best practices and steps.
- Install the selected kubeadm version on the first master
- Check `kubeadm upgrade plan` and, if no manual intervention is needed, continue with the steps
- Run `kubeadm upgrade` in dry-run mode and check the result
- Run `kubeadm upgrade apply -f {{ kubernetes_version }}`
- Install the new kubelet package and restart the kubelet
- Install the new kubectl package
- Install the selected kubeadm version on the other master nodes
- Run `kubeadm upgrade node`
- Install the new kubelet package and restart the kubelet
- Install the new kubectl package
- Drain the worker node
- Install the selected kubeadm version on the worker node
- Run `kubeadm upgrade node`
- Install the new kubelet package and restart the kubelet
- Install the new kubectl package
- Uncordon the node
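The worker steps above can be sketched as a script. The node name and package version are examples, and the apt-mark/apt-get commands assume a Debian-based node; the block only writes the script:

```shell
# Hypothetical worker-upgrade script; adjust NODE and VERSION before use.
cat > upgrade-worker.sh <<'EOF'
#!/bin/bash
set -euo pipefail
NODE=k8s-worker-1        # example node name
VERSION=1.24.3-00        # example package version
# Cordon and drain the node (run from a machine with kubectl access)
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
# On the worker: upgrade kubeadm, then the node config, then kubelet/kubectl
apt-mark unhold kubeadm kubelet kubectl
apt-get update
apt-get install -y kubeadm="$VERSION"
kubeadm upgrade node
apt-get install -y kubelet="$VERSION" kubectl="$VERSION"
apt-mark hold kubeadm kubelet kubectl
systemctl restart kubelet
# Put the node back into service
kubectl uncordon "$NODE"
EOF
chmod +x upgrade-worker.sh
```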
This process has been automated and you can use the following roles:
At first, I wanted to have a virtual private IP handled by OpenStack, and to assign a floating IP to the virtual IP port so that the internet would be reachable via that port. But because of a bug in the OVN module of OpenStack, this scenario failed. So I had to assign two fixed public IPs, one to each node, and assign a virtual public IP to my gateway nodes to handle the HA and configure keepalived.
All the configuration was done successfully, but in the last step the floating IP that had been assigned to the virtual IP port had no ping and no connectivity.
One of the hardest challenges I faced was the etcd TLS authentication. I tried to issue the certs with OpenSSL, but they did not work. According to a GitHub issue, these problems are caused by the lack of good etcd documentation.
After much research and testing, I figured out the problems:
- Key usage: I needed to add the keyEncipherment key usage to the server CSR. According to the serveruser website, keyEncipherment means: "Certificate may be used to encrypt a symmetric key which is then transferred to the target. Target decrypts key, subsequently using it to encrypt & decrypt data between the entities."
- Extended key usage: this flag adds further restrictions to the certs. Server certs should have serverAuth, client certs should have clientAuth, and peer certs should have both.
Hostname is SAN: There is a section in csr which you can specify the hostname and IPs. I just used the IP variable.
Thanks to this article, which helped me a lot with troubleshooting.
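Putting the three fixes together, issuing a server cert with the right extensions can be sketched like this. The CN, IPs, file names, and the throwaway CA are illustrative only, not the repo's actual values:

```shell
# Illustrative only: a throwaway CA plus an etcd server cert carrying
# keyEncipherment, serverAuth, and IP SANs (all names/IPs are examples).
cat > server-openssl.cnf <<'EOF'
[req]
prompt = no
distinguished_name = dn
req_extensions = v3_req
[dn]
CN = etcd-1
[v3_req]
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
IP.1 = 192.168.0.10
IP.2 = 127.0.0.1
EOF
# Example CA (in the real setup the cluster CA already exists)
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca.key -out ca.crt \
  -subj "/CN=etcd-ca" -days 365
# Server key + CSR with the extensions above
openssl req -newkey rsa:2048 -nodes -keyout server.key -out server.csr \
  -config server-openssl.cnf
# Sign the CSR, copying the v3_req extensions into the final cert
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -out server.crt -days 365 -extensions v3_req -extfile server-openssl.cnf
```

You can verify the result with `openssl x509 -in server.crt -noout -text` and check for the Key Usage, Extended Key Usage, and SAN sections.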
For pulling images, because of sanctions, we need to use a proxy. The proxy can be set in the containerd service as an environment variable:
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target
[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd
Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999
Environment="HTTP_PROXY=http://<proxy_ip>:<port>/"
Environment="HTTPS_PROXY=http://<proxy_ip>:<port>/"
But this environment sets the proxy for the whole cluster! So I got errors when I deployed Calico: the calico-node pods couldn't communicate with the other components.
I solved this problem by adding NO_PROXY:
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target
[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd
Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999
Environment="HTTP_PROXY=http://<proxy_ip>:<port>/"
Environment="HTTPS_PROXY=http://<proxy_ip>:<port>/"
Environment="NO_PROXY=localhost,127.0.0.1,10.244.0.0/16,192.168.0.0/24,10.96.0.1/32"
Note: 10.96.0.1 (the kubernetes service ClusterIP) was the IP Calico couldn't connect to.
I deployed the Kubernetes local-storage provisioner and everything was fine. I added a 10 GB disk to the worker node, and the provisioner detected the disk automatically and created a new 10 GB PV.
I tried to deploy a very simple nginx with a PVC, but the pod got stuck in the Pending state because no persistent volume was available on any node.
In the end I understood that the problem was the capacity of the PV, which was 1110Mi and not 10Gi!
After upgrading the first master, everything went fine, but some pods were not deleted, and the cluster took a few minutes to delete those pods and pull the new images.
- Is TLS needed for etcd? Is it a best practice?
- Is it necessary to drain master nodes for upgrading?
- Monitoring
- Automatic etcd backup
- Automatic etcd disaster recovery
- Install StorageClass / provisioner / Helm via Ansible
- Remove clear-text passwords from Ansible files (use Vault)
- Comments in playbooks and docs for roles