Post Mortem 2020 10 01

Cassie Tarakajian edited this page Oct 7, 2020 · 1 revision

Post Mortem - 2020/10/01

Issue Summary

On Thursday, October 1, the p5.js editor was completely inaccessible from approximately 1:20PM ET to 1:45PM ET.

Timeline

(all times in Eastern Time)

  • 1:00PM Began upgrade of Kubernetes cluster nodes
  • 1:30PM Noticed that the site was down
  • 1:34PM Made public announcement via Twitter
  • 1:4? Added new cluster IP addresses to MongoDB Atlas IP whitelist
  • 1:45PM Site was back online

Root Cause

I upgraded the Kubernetes cluster nodes to try to get basic authentication working for the staging URL. Because the cluster has two nodes, I assumed the upgrade could be done at any time without site downtime. However, the upgrade spawned new VMs for the nodes, so they came back online with new IP addresses, which were not on the MongoDB database's IP whitelist. As a result, the service pods could not connect to the database.

Resolution and recovery

After checking the logs of the service pods, I noticed that they kept reporting that they couldn't connect to the Mongo database. I tried deleting the Kubernetes secret that holds the environment variables, and that didn't work. Eventually I Googled "gke mongodb" and found someone with a similar issue, and then I remembered that the database had an IP whitelist. I added the new IP addresses to the whitelist, restarted the pods, and then everything worked again.
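The check that would have caught this (are all node IP addresses covered by the database whitelist?) can be sketched in a few lines of Python. This is only an illustration, not something that was run during the incident: the function name `missing_from_allowlist` is hypothetical, the addresses are placeholders from the reserved documentation ranges, and gathering the real inputs (for example, node external IPs from `kubectl get nodes -o wide` and the entries from the Atlas whitelist) is left out.

```python
import ipaddress


def missing_from_allowlist(node_ips, allowlist):
    """Return the node IPs that no allowlist entry covers.

    node_ips:  iterable of IP address strings (hypothetical input).
    allowlist: iterable of IP or CIDR strings, e.g. "203.0.113.10"
               or "203.0.113.0/28" (hypothetical input).
    """
    # A bare address parses as a single-host network (/32 for IPv4).
    networks = [ipaddress.ip_network(entry, strict=False) for entry in allowlist]
    missing = []
    for ip in node_ips:
        addr = ipaddress.ip_address(ip)
        if not any(addr in net for net in networks):
            missing.append(ip)
    return missing


if __name__ == "__main__":
    # Placeholder values: the whitelist still holds the old node IPs,
    # while the upgraded nodes came back with new ones.
    old_allowlist = ["203.0.113.10", "203.0.113.11"]
    new_node_ips = ["198.51.100.7", "198.51.100.8"]
    for ip in missing_from_allowlist(new_node_ips, old_allowlist):
        print(f"{ip} is not whitelisted")
```

Running a check like this right after a node upgrade, before relying on the pods reconnecting, would surface the mismatch without having to dig through pod logs.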

Corrective and Preventative Measures

I changed the IP addresses of the cluster VMs to be static rather than ephemeral, but I won't know whether this actually fixes the issue until I try upgrading the nodes again. I suspect it will not, because of how rolling updates work: I think the new nodes will be created as new VMs with new IP addresses, and I will be left with two lingering static IP addresses that are not attached to any VM.

If the above doesn't work, then I could add Cloud NAT, which would require making the cluster private and would therefore involve downtime.