Forcing AIO on Ceph OSD journals

My Ceph cluster doesn't run all that quick. On reads it was about 20% slower than my RAID5 NAS and writes were 4x slower! Ouch. A good part of that is probably down to using USB flash keys but...

Upon starting the OSDs I see this message:

journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
My OSDs are XFS backed which supports async writing to the journal, so let's set that up.

First, ssh into the ceph-deploy node and get the running version of the .conf file, replacing {headnode} with the hostname of the main monitor:

ceph-deploy --overwrite-conf config pull {headnode}
You can skip this step if your .conf is up to date.

Next edit the .conf file and add the following. If you already have an [osd] section then update it accordingly.

[osd]
journal aio = true
journal dio = true
journal block align = true
journal force aio = true
This will try to apply these settings to all OSDs. You can control this on a per-OSD basis by adding sections named after the OSD. E.g. for an OSD called osd.2:
[osd.2]
journal aio = true
journal dio = true
journal block align = true
journal force aio = true
And here's the official documentation.
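Before pushing the file anywhere it can help to see which journal settings actually ended up in a given section. Here's a minimal sketch of that; journal_settings is a hypothetical helper name (not part of Ceph), and it assumes a plain ini-style file with one setting per line.

```shell
# Hypothetical helper: print the "journal ..." settings found in one
# section of a ceph.conf-style file.
#   $1 = config file, $2 = section name without brackets (e.g. osd or osd.2)
journal_settings() {
    awk -v want="[$2]" '
        # A section header: remember whether it is the one we want.
        /^[ \t]*\[/ { gsub(/[ \t]/, ""); in_section = ($0 == want); next }
        # Inside the wanted section, print lines that start with "journal".
        in_section && /^[ \t]*journal[ _]/ { sub(/^[ \t]+/, ""); print }
    ' "$1"
}
```

For example, journal_settings ceph.conf osd should list the four lines added above, and journal_settings ceph.conf osd.2 lists any per-OSD overrides.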

Next, push the config back out to all the ceph nodes and restart your OSDs. Separate hostnames with spaces.

ceph-deploy --overwrite-conf config push {headnode} {cephhost1} {cephhost2} ....
A word of warning: the --overwrite-conf flag is destructive. I'll leave it to you to take backups.
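If you want a quick way to take that backup before the push, something like the following would do. It's only a sketch: backup_conf is a made-up helper name, and the timestamped .bak suffix is my own convention rather than anything ceph-deploy knows about.

```shell
# Hypothetical helper: copy a config file to a timestamped backup next to it.
# Prints the backup path on success, returns non-zero if the file is missing.
backup_conf() {
    conf="$1"
    [ -f "$conf" ] || { echo "no such file: $conf" >&2; return 1; }
    backup="${conf}.$(date +%Y%m%d-%H%M%S).bak"
    cp -p -- "$conf" "$backup" && printf '%s\n' "$backup"
}
```

On each node that would be run as backup_conf /etc/ceph/ceph.conf (the usual default location; adjust if yours lives elsewhere).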
Then SSH into the various nodes, restarting the ceph service as you go. I just restarted all ceph services:
sudo service ceph restart
But it's probably okay to just restart the OSDs:
sudo service ceph restart osd

I experienced an almost 2 times speed increase on writes until the journals fill. Still slow, but getting much better. My fault for not having better hardware!

Read more about my Ceph Cluster.


Howto Deep-Scrub on All Ceph Placement Groups

Ceph automatically takes care of deep-scrubbing all placement groups periodically. The exact timing of that is tunable but you're probably here because you want to force deep-scrubs.

The basic command for deep-scrubbing is:

ceph pg deep-scrub <pg-id>

and you can find the placement group ID using:
ceph pg dump

And if you want to instruct all placement groups to deep-scrub, use the same script from repairing inconsistent PGs. Basically loop over all the active PGs, instructing each to deep-scrub:

ceph pg dump | grep -i active | cut -f 1 | while read i; do ceph pg deep-scrub ${i}; done

The repair article explains how this line of script works.

You can be more specific about which PGs are deep-scrubbed by altering the grep -i active part of the script. For example, to only scrub active+clean PGs:
ceph pg dump | grep -i active+clean | cut -f 1 | while read i; do ceph pg deep-scrub ${i}; done
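The same loop can be wrapped up so the state is a parameter. This is a sketch under a couple of assumptions: pgs_in_state and deep_scrub_state are names I've made up, and it relies on ceph pg dump printing tab-separated rows with the PG ID in the first column, just like the one-liners above. It uses grep -F so the state is matched literally (and case-sensitively, unlike grep -i).

```shell
# Read `ceph pg dump` output on stdin and print the ID (first tab-separated
# column) of every PG whose line contains the given state string.
# grep -F matches the string literally, so "+" needs no escaping.
pgs_in_state() {
    grep -F -- "$1" | cut -f 1
}

# Ask every PG in the given state to deep-scrub (default: active+clean).
deep_scrub_state() {
    ceph pg dump | pgs_in_state "${1:-active+clean}" |
    while read -r pg; do
        ceph pg deep-scrub "$pg"
    done
}
```

Note that, as with the original grep, active+clean also matches longer states such as active+clean+scrubbing. Run deep_scrub_state with no arguments for active+clean PGs, or pass another state string.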

Some general caveats are in order. Repair your PGs before attempting to deep-scrub; it's safer to only scrub PGs that are active and clean. You can use

ceph pg dump_stuck
ceph health detail
to help find out what's going on. Here's a link to Ceph placement group statuses.
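If you script the deep-scrub, one cautious option is to refuse to start unless the cluster reports HEALTH_OK. A small sketch of that gate follows; cluster_is_healthy is a hypothetical name, and it takes the text printed by ceph health as its argument so it can be tried without a cluster.

```shell
# Hypothetical gate: succeed only if the given `ceph health` output
# starts with HEALTH_OK (anything else, e.g. HEALTH_WARN or HEALTH_ERR, fails).
cluster_is_healthy() {
    case "$1" in
        HEALTH_OK*) return 0 ;;
        *)          return 1 ;;
    esac
}
```

It would be used as cluster_is_healthy "$(ceph health)" && followed by your scrub loop. This is stricter than the advice above, since it also stops on warnings that have nothing to do with PGs.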

Good luck!


Bringing back an LVM backed volume

What can you do when an LVM backed logical volume goes offline? This happens on my slower netbook on an LVM logical volume spanning about 20 USB flash drives. Sometimes those PVs go missing and the filesystem stops! Here are the steps I take to fix this problem without a reboot. My volume group is called "usb" and my logical volume is called "osd.2".

Since my volume is part of a ceph cluster, I first make sure that the ceph OSD is stopped: service ceph stop osd.2. You probably don't need to do this since the OSD has probably exited once it saw errors on the filesystem.

Next, unmount the filesystem and mark the logical volume as inactive. We use the -f -l switches to force the dismount and lazily deal with the dismount in the background. Without those switches the umount might freeze.
umount -f -l /dev/mapper/usb-osd.2
Marking the logical volume as inactive can be done in two ways. Prefer the first method since it is more specific. The second method will mark inactive all dismounted logical volumes and that might be overkill.
lvchange -a n usb/osd.2 -or- vgchange -a n

At this point I unplug all the USB drives and check the hubs. Plug in the USB keys a few at a time and use pvscan as you go to ensure that each USB key is being recognised. If you have a dead USB key then try again in another port. If that doesn't work then check the hubs have power - even replug the hub. Failing that try a reboot. Failing that... attempt to repair the LVM volume some other way. Since ceph already replicates data I don't bother running the LVM backed logical volumes on RAID - I just overwrite the LV and make a new one from the remaining USB flash drives.
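While replugging, it's handy to have a running count of how many PVs the volume group can see. Here's a sketch: count_pvs_in_vg is a made-up helper that reads the output of pvs --noheadings -o pv_name,vg_name on stdin. It only counts rows whose PV name is a /dev path, on the assumption that a missing PV shows up as some kind of "unknown" placeholder instead.

```shell
# Hypothetical helper: count the PVs that belong to volume group $1.
# Expects `pvs --noheadings -o pv_name,vg_name` output on stdin.
# Rows whose first field is not a /dev path (e.g. a placeholder for a
# missing device) are ignored.
count_pvs_in_vg() {
    awk -v vg="$1" '$1 ~ /^\/dev\// && $2 == vg { n++ } END { print n + 0 }'
}
```

For my setup that would be pvs --noheadings -o pv_name,vg_name | count_pvs_in_vg usb, run after each batch of keys until the number matches how many drives are in the group.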

Once all the PVs have come back, run pvscan one last time and then vgscan. Now you should see your volume groups have all their PVs in place. Now it's time to reactivate the logical volumes. Both methods will work but again I prefer the first one since it is more specific.
lvchange -a y usb/osd.2 -or- vgchange -a y

All going well, the logical volume is now active. It's a good idea to do a filesystem consistency check before you remount the drive. Since I use XFS I'll carry on with the steps for that. You should use whatever tools work for your filesystem.
mount /dev/mapper/usb-osd.2 mounting the drive allows the journal to replay. That usually fixes any file inconsistency problems.
umount /dev/mapper/usb-osd.2 unmount the drive before checking.
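xfs_check /dev/mapper/usb-osd.2 to check the drive and use xfs_repair /dev/mapper/usb-osd.2 if there are any errors. Newer xfsprogs releases no longer ship xfs_check; there, xfs_repair -n /dev/mapper/usb-osd.2 does a read-only check instead.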

Now we're ready to mount the logical volume again: mount /dev/mapper/usb-osd.2

And since I'm running ceph I want to restart the OSD process: service ceph restart osd.2
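Pulling the whole procedure together, here's a rough sketch of the steps as two shell functions, one for before the replug and one for after. The function names, the run wrapper and the DRY_RUN switch are all my own inventions for illustration; by default it only prints the commands. It also uses xfs_repair -n as the read-only check in place of xfs_check, and it assumes the mount point is in /etc/fstab so that mount works with just the device name.

```shell
# Names from this article; override via the environment for another volume.
VG="${VG:-usb}"
LV="${LV:-osd.2}"
DEV="/dev/mapper/${VG}-${LV}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

# Print the command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = "0" ] || return 0
    "$@"
}

# Steps before unplugging and replugging the USB keys.
# Not chained with &&: the OSD may already have exited on its own.
lv_offline() {
    run service ceph stop "$LV"
    run umount -f -l "$DEV"
    run lvchange -a n "$VG/$LV"
}

# Steps once every PV is visible again. Each step must succeed
# before the next one runs.
lv_online() {
    run pvscan &&
    run vgscan &&
    run lvchange -a y "$VG/$LV" &&
    run mount "$DEV" &&           # mounting replays the XFS journal
    run umount "$DEV" &&
    run xfs_repair -n "$DEV" &&   # read-only check, changes nothing
    run mount "$DEV" &&
    run service ceph restart "$LV"
}
```

Call lv_offline, sort out the USB keys by hand, then call lv_online. If the read-only check reports problems the chain stops before the final mount, which is the point where you would run xfs_repair for real.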


Read more about my ceph cluster running on USB drives.