If You Run Ceph With Only 3 Nodes, You Are Asking For Trouble

When I first tried Rook Ceph, I installed it on one machine using Minikube. A one-node deployment is great for getting a quick feel for how Rook Ceph operates, or as a test environment for developing apps that use it. One node, however, is definitely not suitable for production data storage, where a single point of failure is unacceptable. The distributed nature of Ceph makes it shine when it runs on multiple nodes.
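
If you just want to kick the tires on a single node, a rough sketch of such a test install looks like this (assuming the example manifests shipped in the Rook repository; file names may differ between Rook versions):

$ # Single-node test install - for experimentation only, never for production
$ minikube start --disk-size=40g
$ git clone --single-branch --branch master https://github.com/rook/rook.git
$ cd rook/deploy/examples
$ kubectl create -f crds.yaml -f common.yaml -f operator.yaml
$ # cluster-test.yaml is the single-node example cluster definition
$ kubectl create -f cluster-test.yaml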

How many nodes do I need to run a production Ceph cluster?

Different people give different answers. Our experience suggests that a robust Ceph cluster needs at least four nodes. You might be tempted to run a production Ceph cluster with three nodes, which is the minimum the default settings allow: three monitors and three-way replicated pools spread across hosts. This works fine - until one of your nodes fails and your cluster is left vulnerable. Don’t get me wrong: Ceph is great at tolerating failures, and the cluster will keep running A-OK with one node offline. But what happens if you lose another node? Correlated node failures are not uncommon, and losing two of your three nodes costs you both monitor quorum and the minimum number of replicas your pools need to serve I/O, rendering your Ceph cluster unavailable.
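
If you want to see what your own cluster defaults to, you can query the replication settings from the Rook toolbox. A sketch, assuming a replicated block pool named replicapool (substitute your own pool names):

$ # Replica count and the minimum replicas required to keep serving I/O
$ kubectl exec -n rook-ceph -it deploy/rook-ceph-tools -- ceph osd pool get replicapool size
$ kubectl exec -n rook-ceph -it deploy/rook-ceph-tools -- ceph osd pool get replicapool min_size
$ # Or list every pool with its replication and CRUSH rule details
$ kubectl exec -n rook-ceph -it deploy/rook-ceph-tools -- ceph osd pool ls detail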

Let me illustrate this with an imperfect analogy. Say you are getting ready for your morning commute and discover that your car has a flat tire. Most cars carry a spare tire in the back, which is enough to get you to work that day and to a repair shop soon after. The general advice is that a spare tire should be driven at most 75-100 km (roughly 45-60 miles) and at a maximum speed of 80 km/h (50 mph). That is plenty for a short commute or a trip to the repair shop. But if you were planning a road trip, would you set off without a spare? Most people wouldn’t - they don’t want to be stranded in the middle of nowhere with a blown tire. So why would you let your precious data run without enough spare capacity?

I said it is an imperfect analogy because a 3-node Ceph cluster already has one spare node. More safety-critical systems like passenger trains, hazardous materials cargo trains (in sensible countries, that is - looking at you, US 👀) and roller coasters have multiple fail-safe mechanisms for emergencies. If your data integrity and storage availability are important to you, treating your cluster like a passenger train and adding another node saves you a lot of headache when a failure happens. With a 3-node cluster you are left scrambling to debug the first failure before another node failure renders your system unusable; with 4 nodes you have more time to fix or replace the failing node. Ceph will happily rewire the storage cluster in the background to recover from the failure with no downtime, and the remaining nodes can handle yet another failure.

Our advice comes from our experience working with Rook and Ceph over the years. However, we wanted to test what happens more empirically. The rest of this article talks about a simple experiment you can replicate on your own cluster if you would like.

Let’s test this

To verify our recommendations, we ran a simple experiment on a Ceph cluster we use to demonstrate Rook-Ceph’s capabilities.

The cluster is set up on Hetzner Cloud using Terraform and KubeOne. You can find the cluster creation code here; it sets up Kubernetes and creates the control plane and pool nodes. We also install Rook-Ceph using the Helm charts and manage them with Flux CD. The cluster consists of two control plane nodes and either three or four pool nodes. If this is too much information at once, don’t worry: we will publish blog posts explaining each component in the future. For now, let’s focus on the experiment.
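
As a rough sketch, the Rook-Ceph part of that setup boils down to two Helm releases, the operator and the cluster. The exact values we use live in the linked repository; values-cluster.yaml below is just a placeholder name:

$ # Add the Rook release chart repository
$ helm repo add rook-release https://charts.rook.io/release
$ # Install the Rook operator
$ helm install rook-ceph rook-release/rook-ceph --namespace rook-ceph --create-namespace
$ # Install the CephCluster and its pools via the cluster chart
$ helm install rook-ceph-cluster rook-release/rook-ceph-cluster --namespace rook-ceph \
    --set operatorNamespace=rook-ceph -f values-cluster.yaml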

First, we needed a script running frequently enough to verify that the system was running correctly. The script needed to do the basic operations required to check that a filesystem is working (list, read, write, delete), and it needed to run quickly enough to be repeated often. We settled on a simple bash script which we deployed as a Kubernetes CronJob running every minute. You can check the source code here. This is what the job essentially does:

  • List the files in the filesystem and count them.
  • Read 5 random files and check that the files are not corrupted using md5 signatures.
  • Write 5 random files.
  • Delete 5 random files.

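A minimal sketch of such a check script, assuming the shared filesystem is mounted at /store (the real script is linked above and differs in the details):

#!/bin/sh
# Sketch of the periodic check job; files are named NUMBER-MD5SUM of their content.
set -e
STORE=/store

echo "Reading directory"
ls "$STORE" > /tmp/files
echo "Number of files: $(wc -l < /tmp/files)"

# Read 5 random files; the printed md5 sum can be checked against the file name
for f in $(shuf -n 5 /tmp/files); do
  md5sum "$STORE/$f"
done

# Write 5 random 1 MiB files, named after their content hash
for i in $(seq 5); do
  n=$(shuf -i 0-99 -n 1)
  dd if=/dev/urandom of=/tmp/blob bs=1M count=1
  echo "Writing $n"
  mv /tmp/blob "$STORE/$n-$(md5sum /tmp/blob | cut -d' ' -f1)"
done

# Delete 5 random files
for f in $(shuf -n 5 /tmp/files); do
  echo "Deleting $f"
  rm -f "$STORE/$f"
done
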
The output of the script looks like this:

$ kubectl logs --namespace busybot jobs/busybot-28336211
Reading directory
Number of files: 61
d264e9c84f2066d2fbe12c5bca8a652a  /store/83-d264e9c84f2066d2fbe12c5bca8a652a
043009826aa708c7fc58333ad2032c2e  /store/97-043009826aa708c7fc58333ad2032c2e
ba3cbe080982de2968eb8ac4fe9ae854  /store/40-ba3cbe080982de2968eb8ac4fe9ae854
c03b01af3340b0b514cb1b04bfe98c46  /store/96-c03b01af3340b0b514cb1b04bfe98c46
4e7c72688c151e48e0009f308a1c9078  /store/22-4e7c72688c151e48e0009f308a1c9078
1+0 records in
1+0 records out
Writing 86
1+0 records in
1+0 records out
Writing 47
1+0 records in
1+0 records out
Writing 55
1+0 records in
1+0 records out
Writing 81
1+0 records in
1+0 records out
Writing 99
Deleting 19-c197ade11b49f5ee6625e1c63610814e
Deleting 40-ba3cbe080982de2968eb8ac4fe9ae854

The base case

The base case is a cluster operating normally. On both the three-node and the four-node clusters, the script took an average of 13.5 seconds per run. There were no warnings on the Ceph dashboard or in our Koor Data Control Center (KDCC).

The normal running state for the four-node cluster
The normal running state for the three-node cluster

If you do not have KDCC installed, you can check the Ceph status from the Rook toolbox or use the Rook kubectl plugin:

$ # The normal running state for the four-node cluster
$ kubectl exec -n rook-ceph -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     4dd0bbfe-4836-4203-bf52-d73e39303031
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 32m)
    mgr: a(active, since 29m), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 4 osds: 4 up (since 31m), 4 in (since 31m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 776 objects, 1.4 GiB
    usage:   4.4 GiB used, 116 GiB / 120 GiB avail
    pgs:     169 active+clean

  io:
    client:   853 B/s rd, 1 op/s rd, 0 op/s wr


$ # The normal running state for the three-node cluster
$ kubectl rook-ceph ceph status
Info: running 'ceph' command with args: [status]
  cluster:
    id:     bd41c104-90b2-4b9e-a43e-6708828e301c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 7m)
    mgr: a(active, since 4m), standbys: b
    mds: 2/2 daemons up, 2 hot standby
    osd: 3 osds: 3 up (since 6m), 3 in (since 6m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 2/2 healthy
    pools:   15 pools, 249 pgs
    objects: 1.38k objects, 3.6 GiB
    usage:   12 GiB used, 78 GiB / 90 GiB avail
    pgs:     249 active+clean

  io:
    client:   11 KiB/s rd, 341 B/s wr, 14 op/s rd, 0 op/s wr

One failure

When we manually turned off one of the machines, the four-node cluster first showed a warning because one of the mons was down. Ceph kept operating normally since enough mons remained to maintain quorum. The script ran normally and the md5 sums checked out.

The 4-node cluster showed a warning after losing one node
$ # A four node cluster after turning one node off
$ kubectl exec -n rook-ceph -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     4dd0bbfe-4836-4203-bf52-d73e39303031
    health: HEALTH_WARN
            1/3 mons down, quorum a,c
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 778/3183 objects degraded (24.442%), 55 pgs degraded, 127 pgs undersized
...

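To see exactly which daemons are down and on which hosts, ceph health detail and ceph osd tree (run from the toolbox, as above) come in handy:

$ # Show which mons/OSDs are down and which hosts they belong to
$ kubectl exec -n rook-ceph -it deploy/rook-ceph-tools -- ceph health detail
$ kubectl exec -n rook-ceph -it deploy/rook-ceph-tools -- ceph osd tree
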
As for the three-node cluster, when one of the machines was turned off, the cluster showed a warning, similar to the above, and the script ran normally.

The 3-node cluster showed a warning after losing one node
$ # A three node cluster after turning one node off
$ kubectl exec -n rook-ceph -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     bd41c104-90b2-4b9e-a43e-6708828e301c
    health: HEALTH_WARN
            1/3 mons down, quorum a,c
            1 osds down
            1 host (1 osds) down
            Degraded data redundancy: 549/1647 objects degraded (33.333%), 86 pgs degraded, 249 pgs undersized
...

Two failures

When another node was manually turned off, the four-node cluster remained online and kept handling the script. The logs indicate that no files were lost.

The 4-node cluster is still working after losing two nodes
$ # A four node cluster after turning two nodes off
$ kubectl exec -n rook-ceph -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     4dd0bbfe-4836-4203-bf52-d73e39303031
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            Reduced data availability: 77 pgs inactive, 17 pgs down
            Degraded data redundancy: 1573/3468 objects degraded (45.358%), 75 pgs degraded, 152 pgs undersized
...

On the other hand, the three-node cluster was not available, and the script stopped working.

$ # The three-node cluster stopped responding after two failures
$ kubectl exec -n rook-ceph -it deploy/rook-ceph-tools -- ceph status
2023-11-23T05:24:51.724+0000 7f9b2d8e4700  0 monclient(hunting): authenticate timed out after 300
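
Once failed nodes come back or are replaced, Ceph recovers the degraded data on its own. You can follow that recovery from the Rook toolbox; this is a generic sketch, not output from our experiment:

$ # Watch the degraded placement groups recover
$ kubectl exec -n rook-ceph -it deploy/rook-ceph-tools -- ceph health detail
$ kubectl exec -n rook-ceph -it deploy/rook-ceph-tools -- ceph -w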

Running with only 3 nodes is risky

Don’t let your storage become unavailable: make sure you have enough spare nodes. While a three-node cluster satisfies the minimum requirements to run Ceph and can handle one failure, adding one more node lets you survive a second failure as well. It’s as simple as that.

The configuration space for Rook Ceph is vast, so finding the best set of configs for your use case requires a bit of experience and experimentation. We are planning a series of blog posts that explore the different ways Rook and Ceph could be configured to fit your needs. In the meantime, if you would like us to take a look at your Ceph cluster, feel free to get in touch and we will be happy to help!

Koor Data Control Center is in beta. Take a look, and give it a try for free.

Zuhair AlSader November 22, 2023