2014-09-10

Ceph on Thumbdrive Update: BTRFS and one more node.

A few things have happened to my Ceph cluster. The AspireOne netbook was really not up to the job. It is fine for just a few OSD processes but anything more was caused slow-downs resulting in cluster thrash. Amalgamating thumb drives using LVM was helpful… until I wanted to run CephFS. I was time to add another node to the mix.

Read about earlier stories about the Ceph on USB thumb drive cluster here:
Adding another node to Ceph is trivial. This was an old, but much more powerful laptop in every way. I’ve moved the mon and mds functions onto this laptop so now the AspireOne only runs OSD processes. When I add another node then I’ll also run a mon process on the AspireOne so that there is an odd number for quorum building.

The new node uses faster and larger USB keys. These were 32GB – which was both the fastest and cheapest price per GB available at my local PBTech. The new node currently runs two of these sticks in an OSD process each.

I also moved the cluster away from XFS to BTRFS. This was trivial and involved zero cluster downtime. Yes: zero. First ensure the cluster is reasonably healthy, then drop the weight of the OSD using:
ceph osd reweight OSDID 0.1
. Actually I got bored waiting – the cluster was healthy and the pools all had size 3 with min_size 2… so I just stopped the OSD process and removed it from the ceph. Don’t do that on a live cluster, especially where pools have few replicas. But, this was just for testing. Then...
sudo service ceph stop osd.X
ceph osd crush rm osd.X
ceph osd rm osd.X
ceph auth del osd.X


Then umount the backing storage and format it using BTRFS. Then I followed the instruction in my previous tutorial to add the storage back into Ceph. Wait for the cluster to heal before migrating another OSD from XFS to BTRFS.

The AspireOne node has three groups of 8GB keys, federated by USB hub. BTRFS is capable of spanning physical drives without LVM so I removed LVM once all groups had been migrated. Do read about the options for BTRFS stores because the choices matter. I went with RAID10 for the metadata and RAID0 for the data. RAID0 maybe gives better parallel IO performance because the extents are scattered among the drives but it does mean that all block devices in the FS effectively operate at the size of the smallest one. I can live with that.

Three BTRFS OSDs on the AspireOne is sometimes a bit much for that machine. Though, one cool thing about BTRFS was I extended a mounted BTRFS volume with a few more thumb drives, then restarted the osd process. Use
ceph osd crush reweight osd.X Y
to tell Ceph to reallocate space. That was instant storage expansion without downtime on the OSD long enough to trigger a recovery process. I did the whole process in less than ten minutes – and most of that was googling to find the correct BTRFS commands.

The cluster happily serves files to my desktop machine over CIFS. While it’s not a setup I’d recommend for production use it is kinda fun.