Ceph Cluster Thrash and Rebooting Nodes

I have a home cluster with low traffic volumes but terabytes of data - mostly photos. Ceph provides peace of mind that my data is resilient against failure, but my nodes are made with recycled equipment, so a node reboot places considerable stress on the cluster as the node is brought back in.

Cluster thrash occurs when activity on a Ceph cluster causes it to start timing out OSDs. It's bad because recovery processes will begin - further stressing the cluster and potentially also causing problems. Then some of those outed OSDs will attempt to rejoin causing yet more problems. The usual way to avoid cluster thrash is to properly resource the cluster in the first place to handle the loads. In my case that's not a justifiable expense - my nodes are recycled (HDs are new) and my normal load doesn't stress the cluster much until a node reboots.

Here's how to deal with cluster thrash.

Tell ceph to not rebalance the cluster due to OSDs leaving. If possible, do this before rebooting your storage node, but it can be done at any time.
ceph osd set nodown
ceph osd set noout

Temporarily disable access to the cluster. My cluster predominantly serves files so stopping the network file system daemon does the job. Also unmount cephfs.
sudo service samba stop
sudo umount -lf /mnt/ceph
Your file-sharing service might not be Windows networking (and so not Samba). Run this on the node that serves the files - which is not necessarily the same as the ceph mds or monitor nodes.

Also, shut down the MDS service on each node that runs it. MDS is the process that oversees cephfs.
sudo service ceph stop mds

The reason for stopping file access is to prevent load being put on the cluster while it is recovering and to avoid potentially causing different versions of data to exist on the cluster. Ceph handles the latter situation very well on its own but consumes some effort in doing so. If disabling cluster access is not an option then see the tips at the end of the article.

Temporarily disable scrubbing since that takes resources:
ceph osd set noscrub
ceph osd set nodeep-scrub
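
If you do this often, the four flags set so far can be toggled together. The following is only a sketch: the function name cluster_flags and the CEPH variable are made up for illustration, and it does nothing more than run the same ceph osd set / unset commands shown in this guide, one flag at a time.

```shell
#!/bin/sh
# Sketch only: cluster_flags and the CEPH variable are illustrative assumptions.
# Set CEPH="echo ceph" to print the commands instead of running them.
CEPH=${CEPH:-ceph}

# Usage: cluster_flags set      (before the reboot)
#        cluster_flags unset    (once the cluster is healthy again)
cluster_flags() {
    for flag in nodown noout noscrub nodeep-scrub; do
        $CEPH osd "$1" "$flag"
    done
}
```

Running it with CEPH="echo ceph" first is a cheap way to check what would be sent to the cluster.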

Add the OSDs back into the cluster one or two at a time and allow the cluster to stabilise as much as it can in between. Newly added OSDs will go through a process of peering. Wait until all placement groups have finished peering before adding another OSD to the cluster.
sudo service ceph start osd.<X>
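
The wait between OSDs can be scripted. This is a rough sketch rather than a tested tool: it assumes that ceph pg stat prints a summary of placement group states and that the word "peering" appears in that summary while any placement group is still peering. The function names and the 10 second polling interval are my own choices. Note that an empty or failed status is treated as "not peering", so check by hand if the monitors are unreachable.

```shell
#!/bin/sh
# Sketch only: helper names and the polling interval are illustrative assumptions.

# Succeeds (returns 0) if the given status text mentions a peering state.
is_peering() {
    printf '%s\n' "$1" | grep -q 'peering'
}

# Blocks until "ceph pg stat" no longer mentions any peering placement groups.
wait_for_peering() {
    while is_peering "$(ceph pg stat)"; do
        sleep 10
    done
}

# Usage, after starting an OSD:
#   sudo service ceph start osd.<X>
#   wait_for_peering
```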

This guide wouldn't be complete without listing the actions to re-enable normal operations on the cluster.
Re-start the MDS service. I prefer to do this first because it takes some time before it's ready to serve my cephfs again. I watch the monitor until the IO ceases - or I keep attempting to mount cephfs until it eventually succeeds.
sudo service ceph start mds

Remount cephfs
sudo mount /mnt/ceph
Restart the samba service
sudo service samba start

At this point the cluster is again serving files, but we still need to re-enable scrubbing and allow the cluster to re-balance when there are errors. Never run a cluster without deep-scrubbing because that's your defense against data corruption.
ceph osd unset noscrub
ceph osd unset nodeep-scrub
ceph osd unset noout
ceph osd unset nodown
And you're done.

If you cannot afford to disable file access then all of the other tips might still be useful. In addition, if you have three or more replicas then you can also temporarily lower the minimum number of placement group replicas the cluster requires to be available. You should not put this lower than (number_of_replicas  / 2) + 1 or you risk data inconsistency. Check the number of replicas by getting the size of the pools (in my case 3), record the usual value for min_size and then use set to change the min_size temporarily.
ceph osd pool get <poolname> size
ceph osd pool get <poolname> min_size
ceph osd pool set <poolname> min_size 2
If you're using a fairly standard cephfs setup then there are actually two pools, called data and metadata. Change the min_size on both of them, but always check the size of each pool first because they might be different. I run more replicas of my metadata pool.
Don't forget to set the min_size back to whatever value you normally have it set to once the cluster stabilises.
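
The lower bound on min_size is easy to get wrong under pressure, so here is the arithmetic as a small sketch. It assumes the division in (number_of_replicas / 2) + 1 is integer division, which matches the example above of min_size 2 for a pool of size 3. The helper name safe_min_size is hypothetical.

```shell
#!/bin/sh
# Sketch only: safe_min_size is a hypothetical helper name.
# Prints (replicas / 2) + 1 using integer division.
safe_min_size() {
    echo $(( $1 / 2 + 1 ))
}

safe_min_size 3   # prints 2
safe_min_size 5   # prints 3
```

Feed it the size reported for each pool, since the data and metadata pools may differ.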

It does look like a lot of work to reboot a node. In practice I don't reboot nodes all that often, and usually I don't need even half of the above tips to bring my current cluster back without cluster thrash. Since moving from USB flash drives to spinning hard drives, I no longer have the cluster thrash problems I once did. Even so, following the above tips does bring OSDs back into the cluster much quicker than letting it all happen by itself.

Happy cephing.


Bulletproof Processing: try…catch

Processing is stable and crashes are infrequent. However, my experience doing performances and exhibitions for IVX has shown me it is handy to add a bit of extra resiliency to sketches. This is especially true when I expect to run for extended periods of time or when the sketch runs largely unattended. The first tip I have is to use a Java try…catch structure inside the draw() method.

Since Processing is based on Java, why not take advantage of Java’s native error handling? A try…catch structure attempts to execute all code within the try part of the structure and only executes the catch block if there is a problem. This particular pattern helps if the errors are transient, meaning that the error will generally go away on its own.

void draw() {
    try {

        // Normal drawing code goes here

    } catch (Exception e) {
        println("draw(): " + e.getMessage());
        // Pause for a quarter second. Hope problem goes away
        try {
            Thread.sleep(250);
        } catch (Exception e1) {
        }
    }
}

This particular skeleton will catch any exceptions that occur in the draw() method, print them and then pause for 250 milliseconds. That quarter second is usually enough for a transient error to correct itself – perhaps a file loading or some memory becoming available. Either way, the sketch will pause for a short time and then attempt to resume itself. Without this exception handling, any exception will cause the sketch to stop running.

Note that the sleeping function itself can throw an exception so Java forces us to declare a try…catch block – even though the catch block is empty.

Using try…catch in the setup() method might be useful if the code will be run by others, but in that case make extra sure that error messages are informative. I would prefer that setup() fails outright since most setup() errors are not transient.

So, you’ll see that it does not take much extra code to add a good amount of resiliency to a Processing sketch. Give it a try. Java's website has good resources that take a more in-depth look at exception handling.


Generative Advertising with Feedback: Bahio Coffee

M&C Saatchi have combined generative design with feedback analysis to try to evolve the most engaging ad. The campaign is called Bahio Coffee. I think this is an interesting idea and would like to talk about the good, the bad and the ugly. It has been written about elsewhere, and the campaign has a good website for exploring its progress. A two-minute video overview is on YouTube.

The generative algorithm is given “copy, layout, fonts, colours and images” and this is expressed as a gene-string. There is still considerable human expertise that goes into the basic assets that are input into the Bahio generative system.

The feedback is attention – which appears to be tracked by watching the amount of eyeball engagement viewers have with the poster. Individual posters are scored by the amount of attention they get – with better posters having their genes preserved for future generations.

The algorithm to generate new posters is genetic. Mathematically, this is a method for searching a large multivariate space in a non-exhaustive manner. It works best when we assume the fitness landscape is hill-like – that is, there are smooth ways to improve towards a “best” poster. I suspect, though, that it has limited usefulness without some sort of similarity measurement between input possibilities: how does it determine that a small mutation on the “image” variable results in an image that is different in a similarly small way? For the relatively small number of images the campaign appears to run, however, this doesn’t appear to be a big problem.

However, in the offline world, we don’t yet have an easy way to target particular demographics. This technique is, without modification, limited to products that have a general appeal to everybody. Some control could be exercised by creative directors on the input materials to the generative system but then the evaluation is still going to be limited.

That doesn’t make this a bad approach at all. M&C Saatchi will learn a lot of useful things from conducting this experiment. So far, this is a form of multivariate testing, which is a technique already employed in the web world. It’s great to see experiments with transferring this to the offline world.