2015-08-30

Ceph Cluster Thrash and Rebooting Nodes

I have a home cluster with low traffic volumes but terabytes of data - mostly photos. Ceph provides peace of mind that my data is resilient against failure, but my nodes are made with recycled equipment, so rebooting a node places considerable stress on the cluster when it is brought back in.

Cluster thrash occurs when activity on a Ceph cluster causes it to start timing out OSDs. It's bad because recovery processes will begin - further stressing the cluster and potentially timing out even more OSDs. Then some of those outed OSDs will attempt to rejoin, causing yet more load. The usual way to avoid cluster thrash is to properly resource the cluster in the first place to handle the loads. In my case that's not a justifiable expense - my nodes are recycled (the HDs are new) and my normal load doesn't stress the cluster much until a node reboots.

Here's how to deal with cluster thrash.

Tell Ceph not to rebalance the cluster when OSDs leave. The noout flag stops down OSDs from being marked out (which is what triggers rebalancing), and nodown stops them being marked down in the first place. If possible, do this before rebooting your storage node, but it can be done at any time.
ceph osd set nodown
ceph osd set noout
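You can confirm the flags are set by checking the OSD map:
ceph osd dump | grep flags
The flags line should now list nodown and noout.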

Temporarily disable access to the cluster. My cluster predominantly serves files, so stopping the network file system daemon does the job. Also unmount cephfs.
sudo service samba stop
sudo umount -lf /mnt/ceph
Your file service might not be Windows networking (so not Samba). Run this on the node that serves the files - which is not necessarily the same as the Ceph MDS or monitor nodes.
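It's worth a quick check that nothing is still holding the mount point (this assumes the same /mnt/ceph mount point used above):
mount | grep /mnt/ceph
No output means cephfs really is unmounted.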

Also, shut down the MDS service on each node that runs it. MDS is the process that oversees cephfs.
sudo service ceph stop mds 
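A quick way to confirm the MDS daemons have stopped is the mds status command - once the monitors notice the daemons are gone it should no longer report up:active:
ceph mds stat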

The reason for stopping file access is to prevent load being put on the cluster while it is recovering, and to avoid different versions of data coming to exist on the cluster. Ceph handles the latter situation very well on its own, but it takes some effort to do so. If disabling cluster access is not an option then see the tips at the end of the article.

Temporarily disable scrubbing since that takes resources:
ceph osd set noscrub
ceph osd set nodeep-scrub
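The set flags typically also show up as a health warning, which is expected and harmless while doing this sort of maintenance. A quick look at the status output confirms them:
ceph -s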

Add the OSDs back into the cluster one or two at a time and allow the cluster to stabilise as much as it can in between. Newly added OSDs will go through a process of peering. Wait until all placement groups have finished peering before adding another OSD to the cluster.
sudo service ceph start osd.<X>
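If you'd rather not stare at the monitor output, a rough loop like this does the waiting - it simply polls the placement group summary until nothing reports a peering state (adjust the sleep to taste):
while ceph pg stat | grep -q peering; do sleep 10; done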

This guide wouldn't be complete without listing the actions to re-enable normal operations on the cluster.
Restart the MDS service. I prefer to do this first because it takes some time before it's ready to serve my cephfs again. I watch the monitor until the IO ceases - or I keep attempting to mount cephfs until it eventually succeeds.
sudo service ceph start mds
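The mds status command shows when it's ready again. As a small sketch - assuming its output contains up:active once the MDS is serving - this loop saves watching it by hand:
until ceph mds stat | grep -q up:active; do sleep 5; done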

Remount cephfs
sudo mount /mnt/ceph
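If the mount fails because the MDS isn't quite ready, a simple retry loop does the "keep attempting to mount" mentioned above:
until sudo mount /mnt/ceph; do sleep 5; done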
Restart the Samba service
sudo service samba start

At this point the cluster is again serving files, but we still need to re-enable scrubbing and allow the cluster to rebalance when there are errors. Never run a cluster without deep-scrubbing because that's your defense against data corruption.
ceph osd unset noscrub
ceph osd unset nodeep-scrub
ceph osd unset noout
ceph osd unset nodown
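Before walking away it's worth confirming the flags are really gone and the cluster is heading back to health:
ceph osd dump | grep flags
ceph health
Expect recovery and scrubbing activity for a while before the health settles down.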
And you're done.

If you cannot afford to disable file access then all of the other tips might still be useful. In addition, if you have three or more replicas then you can also temporarily lower the minimum number of placement group replicas the cluster requires to be available. You should not put this lower than (number_of_replicas / 2) + 1 or you risk data inconsistency. Check the number of replicas by getting the size of the pools (in my case 3), record the usual value for min_size, and then use set to change the min_size temporarily.
ceph osd pool get <poolname> size
ceph osd pool get <poolname> min_size
ceph osd pool set <poolname> min_size 2
If you're using a fairly standard cephfs setup then there are actually two pools, called data and metadata. Change the min_size on both of them, but always check the size of each pool first because they might be different - I run more replicas of my metadata pool. A quick way to check both pools is shown below.
Don't forget to set the min_size back to whatever value you normally have it set to once the cluster stabilises.
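For reference, here's a quick way to record the current values on both pools in one go (assuming the standard data and metadata pool names mentioned above):
for pool in data metadata; do
    echo "== $pool =="
    ceph osd pool get $pool size
    ceph osd pool get $pool min_size
done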

It does look like a lot of work to reboot a node. In practice I don't reboot nodes all that often, and usually I don't need even half of the above tips to bring my current cluster back without cluster thrash. Since moving away from USB flash drives to spinning hard drives, I no longer have the cluster thrash problems I once did. Still, following the above tips does bring OSDs back into the cluster much quicker than letting it all happen by itself.

Happy cephing.