2014-08-14

Going to LVM for performance and graceful failures.

A few things happened in the world of the 12 USB drive netbook ceph node. Basically the netbook wasn't up to the job. Under any kind of reasonable stress (such as fifteen parallel untars of the kernel sources to and from the ceph filesystem) the node would spiral into cluster thrash. The major problem appeared to be OSDs being swapped out to virtual memory and timing out.

Aside: my install of Debian (Wheezy) came with a 3.2 kernel. Ceph likes a kernel version of 3.16 or greater. I compiled a 3.14 kernel since it was marked as long-term supported. My tip is to do this before you install ceph. Doing it afterwards resulted in kernel feature mismatches with some of my OSDs.

Back to the main problem. My USB configuration had introduced a new failure and performance domain. The AspireOne netbook has three USB ports, to each of which I attached a hub, and each hub has four USB keys: three hubs times four USB drives is 12 drives in total. Ideally I'd like to alter the crush map so that PGs don't replicate on the same USB hub. This looked easy enough in ceph: edit the crush map and introduce a bucket type called "bus" that sits between "osd" and "host", then change the default chooseleaf type to bus.
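For the record, here is a sketch of what that crush map change would have looked like. You pull the map with ceph osd getcrushmap, decompile it with crushtool -d, edit, recompile with crushtool -c and push it back with ceph osd setcrushmap. The ids, names and weights below are illustrative, not from my actual map, and the type list is trimmed:

```
# types gain a "bus" level between osd and host
type 0 osd
type 1 bus
type 2 host

# one bucket per USB hub (only one shown)
bus bus0 {
        id -10
        alg straw
        hash 0
        item osd.0 weight 1.000
        item osd.1 weight 1.000
        item osd.2 weight 1.000
        item osd.3 weight 1.000
}

# and in the replication rule, spread across buses instead of hosts
step chooseleaf firstn 0 type bus
```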

It turns out there was an easier way to solve both my problems: LVM. The Logical Volume Manager joins block devices together into a single logical volume. LVM can also stripe a logical volume's data across its underlying devices. However, it does mean that if a single USB key fails then the whole logical volume fails too ... and that OSD goes down. I can live with that.

Identical-looking flash drives are impossible to match with Linux block devices in the /dev folder. Since I was running ceph it was just easier to pull a USB key, see which OSD died and find the device associated with it. I let the cluster heal between each pull of a USB drive until I had a hub's worth of flash keys pulled. I then worked a USB hub at a time: bringing the new LVM-backed OSD into ceph before moving on to the next hub. Details follow.

Bring down the devices and remove them from ceph. Use ceph osd crush remove id, then ceph osd down id, then ceph osd rm id. Then stop the OSD process with /etc/init.d/ceph stop osd.id. It pays also to tidy up the authentication keys with ceph auth del osd.id or you'll have problems later. You can then safely unmount the device and get hacking with your favourite partition editor.
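As a concrete example, removing a hypothetical osd.3 goes something like this (the id is made up, and these need to run on a node with admin keys):

```
# take the OSD out of the crush map so data stops mapping to it
ceph osd crush remove osd.3
# mark it down and delete it from the cluster
ceph osd down 3
ceph osd rm 3
# stop the daemon itself (Wheezy-era sysvinit)
/etc/init.d/ceph stop osd.3
# remove its authentication key, or re-creating osd.3 later will fail
ceph auth del osd.3
# now the backing device can be unmounted
umount /var/lib/ceph/osd/ceph-3
```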

There are good resources for LVM online. The basics are: set up an LVM partition on each of your devices. Use pvcreate on each LVM partition to let LVM know it is a physical volume. Then create volume groups using vgcreate - I made a different volume group per USB hub. Then you can make the logical volume (i.e. the thing used by the OSD) from space on a volume group with lvcreate. The hierarchy is: pv - physical volumes, vg - volume groups, lv - logical volumes. I used the -i option on lvcreate to have LVM stripe data across the USB keys, because parallelism. If you've noticed a pattern in the create commands then bonus: the list commands follow the same pattern - pvs, vgs, lvs. Format the logical volume using your favourite filesystem, though ceph prefers XFS (maybe BTRFS).
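Put together, one hub's worth of keys looks something like this (device names, the volume group name and the 64K stripe size are illustrative, not my exact values):

```
# one physical volume per USB key on this hub
pvcreate /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
# one volume group per hub
vgcreate hub0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
# one logical volume spanning the group, striped across all four keys
lvcreate -i 4 -I 64 -l 100%FREE -n osd0 hub0
# ceph prefers XFS
mkfs.xfs /dev/hub0/osd0
```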

Once the logical volume is formatted then it's time to bring it back into ceph. I tried to do things the hard way and then gave up and used ceph-deploy instead. The commands used are described here.

A disadvantage with this setup is that LVM tends to scan for volumes before the USB drives are visible, so the drives would not automount. I solved this with a custom init.d script. While in /etc I also changed inittab to load ceph -w onto tty1 so that the machine boots directly into a status console.
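The inittab entry would look something along these lines (an untested reconstruction, not my exact line - the id field, runlevels and tty are whatever is free on your system):

```
# id:runlevels:action:process - respawn ceph -w on tty1
c9:2345:respawn:/bin/sh -c "exec /usr/bin/ceph -w >/dev/tty1 2>&1 </dev/tty1"
```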

Performance is much better with the new 4_LVMx3_OSD configuration compared with the 12_OSD cluster. Write speeds almost doubled for RADOS puts of multi-megabyte objects, and there is almost zero swapfile activity.
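If you want to reproduce that kind of comparison, rados ships a simple benchmark - not what I used for the numbers above, and the pool name here is made up:

```
# write 4MB objects for 60 seconds against a test pool, keep the objects
rados bench -p testpool 60 write --no-cleanup
# then benchmark sequential reads of the objects just written
rados bench -p testpool 60 seq
```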

I hope to soon test ceph filesystem performance on this setup before adding another node or two. I've glossed over many steps so let me know in the comments if you'd like details on any part of the process.

(I also wrote about the 12 USB drive OSD cluster with a particular focus on the ceph.conf settings)