2016-01-22

Evicting and Flushing from Ceph Cache Tier/Cache Pools

Disaster! The OSDs backing my cache pools were reporting full. This occurred because the node carrying most of the backing pool crashed, leaving insufficient replicas of the backing pool. Even when that node was brought back online, recovery operations were going to take a long time. Here's what I did. The first thing was to set an absolute maximum size on the cache tier pool:
ceph osd pool set cachedata target_max_bytes ....
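
The dots stand in for an actual byte value, which depends on your hardware. As a rough sketch (cachedata is my pool name, the 100GiB figure below is purely illustrative):

ceph osd pool set cachedata target_max_bytes 107374182400
ceph osd pool get cachedata target_max_bytes

The second command just confirms the value took. As I understand it, the cache tiering agent won't flush or evict based on its ratio settings at all until target_max_bytes (or target_max_objects) has been set.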

The next thing was to start manually evicting objects from the pool. (Flushing writes dirty objects back to the backing pool; evicting boots out clean objects.) Flushing would need the backing pool up and able to accept new writes, but evicting would not. Evicting would free up space in the cache tier without the backing pool having stabilised.
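
Before choosing, it helps to know how much of the cache is actually dirty. On reasonably recent releases, ceph df detail reports a per-pool DIRTY count alongside the object total, which gives an idea of how many objects can simply be evicted versus how many would need a working backing pool to flush:

ceph df detail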

The standard command to evict objects is:

rados -p cachepool cache-flush-evict-all

I found that it locked up on me, complaining that some objects were locked. I also tried another variant, but that was not shrinking the pool either:

rados -p cachepool cache-try-flush-evict-all

My next trick was to use parallel (apt-get install parallel if you don't have it!) to try evicting objects one by one. I'd run this one-liner until satisfied that the cache pool had shrunk to a reasonable size and then Ctrl-C to terminate the evictions.

rados -p cachepool ls | parallel -j16 rados -p cachepool cache-evict {}

What this command does is list the contents of cachepool and hand each object name to parallel, which spawns an instance of rados to evict that object separately (the {} is replaced with the object name). The -j16 means run up to 16 rados processes at a time.
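
Something like this is handy for watching progress while the evictions run (same pool name; the 30-second interval is arbitrary):

watch -n 30 'rados -p cachepool ls | wc -l'

ceph df will also show the pool's used space falling as objects are evicted.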

For completeness, the other cache flushing and evicting commands that rados recognises are:

cache-flush 
cache-try-flush 
cache-evict 
cache-flush-evict-all
cache-try-flush-evict-all

I believe that variants with "try" in the name are non-blocking while the rest will block.
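
The single-object variants are also useful for testing against one object before letting something like the parallel loop above loose. With a made-up object name:

rados -p cachepool cache-try-flush 10000000abc.00000000
rados -p cachepool cache-evict 10000000abc.00000000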

Soon the SSD OSDs that back my cache tier were back under warning levels. My cluster continued recovering overnight and all the data lived happily ever after (at least until next time).

Ceph Cluster Diary January 2016

Another node has been added to my Ceph cluster. This is a second-hand dual Pentium D with 8GB of RAM and a 250GB SATA HDD. I removed from the box a dual-head Quadro graphics card, plus a RAID controller and SCSI2 drives that were not compatible with Linux.

Cluster speed is a concern, so three 220GB SSDs were installed in the box and a writeback cache tier was created around the cephfs data pool. The online examples were very useful.
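
Roughly, the steps from those examples look like this, assuming the cephfs data pool is called cephfs_data and the new SSD-backed pool is cachedata (check the current documentation before copying anything):

ceph osd tier add cephfs_data cachedata
ceph osd tier cache-mode cachedata writeback
ceph osd tier set-overlay cephfs_data cachedata
ceph osd pool set cachedata hit_set_type bloom

The last line gives the tier a hit set so the tiering agent can track which objects are being accessed.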

At first, the metadata pool was also cached, but I began to get cluster thrashing problems. I mitigated this, and reduced some data risk, by removing the metadata cache and instead adding a CRUSH rule that keeps two replicas of the metadata on SSDs and the final copy on HDDs.
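
A CRUSH rule along the following lines does that, assuming the CRUSH map already has separate ssd and hdd roots (the usual way of separating device types on this era of Ceph). For a size-3 pool it places two copies on hosts under the ssd root and the remaining copy under the hdd root; the rule name and ruleset number here are just placeholders:

rule metadata-ssd-hdd {
        ruleset 4
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 2 type host
        step emit
        step take hdd
        step chooseleaf firstn -2 type host
        step emit
}

The metadata pool is then pointed at the rule with something like ceph osd pool set cephfs_metadata crush_ruleset 4.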

There were also numerous page faults, so I gave up some space on one of the SSDs for a Linux swap partition and most of the page faults disappeared.
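
There is nothing Ceph-specific about that part. With a hypothetical spare partition /dev/sdc3 on one of the SSDs it is just the usual:

mkswap /dev/sdc3
swapon /dev/sdc3

plus a matching line in /etc/fstab so the swap comes back after a reboot.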

Most file operations are about three times faster than before. When the metadata was also cached there was about a 7x speed up, but the cluster was less reliable. My backing storage devices are mainly external USB hard drives running on old USBv1 hardware so any speed up is welcome.

The result is a much more reliable cluster that gives consistent enough speed to run virtual hard-drive files for some Virtual Machines that I occasionally run on my main desktop. Previously, those Virtual Machines had a tendency to crash when run from cephfs.

Early on, I did have a problem with the cache filling up, but I fixed that by applying more aggressive cache sizing policies. In particular, I set target_max_bytes to 85% of my SSD size:

ceph osd pool set cachedata target_max_bytes .....
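
Again the dots stand in for the real byte count; the arithmetic is simply 85% of the SSD capacity, so for a 220GB SSD that is roughly 0.85 x 220GB ≈ 187GB, i.e. something like:

ceph osd pool set cachedata target_max_bytes 187000000000

(Treat that figure as illustrative: the exact number depends on whether you count GB or GiB and on the pool's replication level.)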

I'm very pleased with the setup now. One or two more tweaks and I might be ready to begin retiring my dedicated NAS box and switch all my network storage to Ceph.