2014-08-11

Ceph on USB thumb drives

Ceph is an open source distributed object store meant to work at huge scales on COTS (commercial off-the-shelf) hardware. It works in huge datacenters, so why not dust off an old netbook, plug in 12 USB flash drives and have at it?
The netbook is an Acer AspireOne (Intel Atom N270 1.6 GHz, 1 GB RAM, 160 GB HD). What follows are the config changes I made in ceph.conf before running the ceph-deploy command. The Ceph version is Firefly 0.81. Since this is a one-machine cluster, I needed to tell Ceph to replicate across OSDs and not across hosts.
[global]
osd crush chooseleaf type = 0

I messed up setting the default journal size. At first I thought: Pfft. Journal, make it tiny – it just robs space. And my 4MB (yes four megabytes) journal made the cluster unworkable. With the tiny journals and default settings I could never reliably keep more than two OSDs up and data throughput was terrible. I rebuilt with 512MB journals instead.
[osd]
osd journal size = 512 
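
To put the "it just robs space" worry in numbers, here is a rough back-of-the-envelope sketch. The 12 OSDs and the two journal sizes come from this post; the 8 GB stick size is only an assumption for illustration, and the function name is made up.

```python
# Rough journal overhead when every OSD keeps its journal on its own stick.
# ASSUMPTION: each stick is 8 GB; plug in your own size.

def journal_overhead(num_osds, journal_mb, stick_gb):
    """Return (total journal space in MB, fraction of raw capacity used by journals)."""
    total_journal_mb = num_osds * journal_mb
    raw_capacity_mb = num_osds * stick_gb * 1024
    return total_journal_mb, total_journal_mb / raw_capacity_mb

for size_mb in (4, 512):
    total_mb, fraction = journal_overhead(num_osds=12, journal_mb=size_mb, stick_gb=8)
    print(f"{size_mb} MB journals: {total_mb} MB total, {fraction:.2%} of raw capacity")
```

Under that assumption the 512 MB journals cost about 6 GB across the cluster, which is a small price next to a cluster that cannot keep its OSDs up.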

The machine was way underpowered, so I tuned a few other things. I turned off cephx authentication. There are risks to this, but this is a hobbyist project on a secured subnet.
[global]
auth cluster required = none
auth service required = none
auth client required = none

The cluster uses a ton of memory and CPU when recovering objects. It helps to limit this activity somewhat.
[osd]
osd max backfills = 1
osd recovery max active = 2 

And since things could get a bit slow I increased a few timeouts:
[osd]
osd op thread timeout = 180
osd op complaint time = 300
osd default notify timeout = 240
osd command thread timeout = 180
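
For convenience, the settings above can be gathered into one fragment. The sketch below is only one way to do it: it uses Python's standard configparser to render them, and it puts the cluster-wide options under [global], which is the section ceph.conf uses for settings shared by all daemons. The helper name render_ceph_conf is hypothetical, and the output is meant to be merged into the ceph.conf that ceph-deploy generates, not to replace it.

```python
import configparser
import io

# Option names and values are the ones listed in this post.
SETTINGS = {
    "global": {
        "osd crush chooseleaf type": "0",
        "auth cluster required": "none",
        "auth service required": "none",
        "auth client required": "none",
    },
    "osd": {
        "osd journal size": "512",
        "osd max backfills": "1",
        "osd recovery max active": "2",
        "osd op thread timeout": "180",
        "osd op complaint time": "300",
        "osd default notify timeout": "240",
        "osd command thread timeout": "180",
    },
}

def render_ceph_conf(settings):
    """Render a dict of sections into INI text with 'key = value' lines."""
    parser = configparser.ConfigParser()
    parser.read_dict(settings)
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()

print(render_ceph_conf(SETTINGS))
```

Keeping a single [osd] section instead of several repeated ones makes it easier to see at a glance what has been changed from the defaults.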

I was not able to get 2G flash keys to come up. Given the price of 8G sticks is only five bucks this isn’t much of a limitation. I suppose I could use LVM striping to join a bunch of 2G sticks together into a larger unit.

The speed is not all that quick. ceph -w reports a write speed of about 3 megabytes per second. That doesn’t sound like much, but the data pool I was testing on writes three copies of the data, six if you count journaling. A lot of things could affect speed: tiny memory, slow CPU, slow USB sticks and/or the USB bus being saturated.
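
The "three copies, six with journaling" arithmetic can be sketched as follows. This is a rough estimate, not a measurement: it assumes writes spread evenly over all 12 sticks, and the function name is made up for the example.

```python
def per_stick_write_rate(client_mb_s, replicas, journal_factor, num_osds):
    """Return (total MB/s hitting the sticks, average MB/s per stick).

    ASSUMPTION: writes are spread evenly across all OSDs.
    """
    total = client_mb_s * replicas * journal_factor
    return total, total / num_osds

# 3 MB/s seen by the client, 3 replicas, each write landing twice
# (journal plus data), 12 USB sticks.
total, per_stick = per_stick_write_rate(3.0, replicas=3, journal_factor=2, num_osds=12)
print(f"{total} MB/s written in total, about {per_stick} MB/s per stick")
```

So the 3 MB/s the client sees corresponds to roughly six times that much traffic crossing the USB bus.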

This config uses XFS on the USB sticks; Btrfs might perform better. While the speeds look poor, remember that Ceph OSDs don’t report that a write is successful until the object is written to both the media and the journal. I could probably mitigate this double write by having fewer OSDs (joining up groups of USB sticks with LVM stripes) and/or moving the journal to a different device. Right now Ceph is using the USB sticks for both data and journal.

I stress tested RADOS by adding objects until the store filled up. It’s robust and not a single OSD timed out of the pool. As I write this I am testing untarring the Linux kernel sources onto the Ceph filesystem. I’ll keep you posted.

My future plans are to expand the cluster utilising old hardware I have lying about. I’d like to add at least two more nodes – but they won’t necessarily be USB thumb drive backed.