HA Virtualisation with Pacemaker and Ceph

I’ve recently begun playing with ceph’s rbd pool as a way to provide network block devices for libvirt guests managed through Pacemaker, having had success with drbd and iscsi. This post should be considered notes of my ongoing experiments, and not a hard-and-fast ‘howto’ for this concept. Nevertheless, it might be useful to someone!

Concepts

We’re using ceph rbd pool (RADOS block device) to offer up the storage for the VMs. If you’ve used iscsi before, this is a similar concept, but with a replicated, distributed backend for the data.

We’re using pacemaker with the VirtualDomain RA to manage libvirt (kvm) instances.

Hardware

You’ll need a minimum of 2 nodes to run the osd daemons (object store – i.e. the data) and 3 or more nodes (always an odd number, for quorum) for the mon daemons (ceph pool monitoring). The ceph documentation gives suggested hardware requirements for these.

You’ll also need 2 (or more) nodes for VMs to allow live migration. CPU and memory are important here.

I started using 4 nodes – 2 disk servers, and 2 vm servers. 3 (or more) nodes run pacemaker (to allow quorum) and one (or more) vm server hosts the extra mon daemon. This was mostly due to the hardware I had to hand falling into 2 categories of ‘fast disk’ and ‘fast cpu/mem’ – but the pool can expand later as needed.

Ceph Configuration

Follow the ceph documentation on setting up a pool – using mkcephfs (or ceph-deploy) to get things going.

Make sure that the path you point your osds at is the mountpoint of an xfs or btrfs filesystem.
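As a concrete (and entirely hypothetical) sketch, preparing an xfs-backed mountpoint for osd.0 on disksrv1 might look like this – the device name /dev/sdb1 is an assumption, substitute your own disk or partition:

```shell
# Format a dedicated device for the osd data (hypothetical device name)
mkfs.xfs -f /dev/sdb1
mkdir -p /srv/ceph/osd-0
# noatime and inode64 are commonly used xfs options for osd workloads
mount -o rw,noatime,inode64 /dev/sdb1 /srv/ceph/osd-0
# Make the mount persistent across reboots
echo '/dev/sdb1 /srv/ceph/osd-0 xfs rw,noatime,inode64 0 0' >> /etc/fstab
```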

I use the following /etc/ceph/ceph.conf:

[global]
auth supported = none # see also cephx
 
[mon]
mon data = /srv/ceph/mon-$id
[mon.0]
host = disksrv1
mon addr = 192.168.1.1:6789
[mon.1]
host = disksrv2
mon addr = 192.168.1.2:6789
[mon.2]
host = vmsrv1
mon addr = 192.168.1.3:6789
 
[osd]
osd data = /srv/ceph/osd-$id
osd journal = /srv/ceph/osd-$id/journal # In production, use an SSD or at least its own partition
osd journal size = 2000 # Only needed if using a file for the journal (2GB is a good start for GbE)
 
[osd.0]
    host = disksrv1
[osd.1]
    host = disksrv2

When you start ceph, run ceph status on one of the mons and wait for the cluster to settle – look for HEALTH_OK and check that the appropriate number of mons and osds are running.
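If you’d rather script that wait, a minimal sketch (assuming the ceph CLI is available on the mon node you run it from):

```shell
# Poll ceph status until the cluster reports HEALTH_OK
until ceph status | grep -q HEALTH_OK; do
  echo "waiting for ceph to settle..."
  sleep 5
done
ceph status  # final check: confirm the mon and osd counts look right
```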

Guest disk Creation

You now need to create a disk in the pool for the guest to use. This is done from any of the mon nodes with the following command – replacing [megabytes] with the disk size, and [poolname] and [guestname] with the names of your rbd pool and guest vm. If --pool is omitted, it will default to the rbd pool:

rbd create [guestname] --size [megabytes] --pool [poolname]

Note that rbd images are thinly provisioned – that is, no space is used until data is written to the image, and the size is only an upper limit. You can later change the size of a disk with:

rbd resize [guestname] --size [megabytes] --pool [poolname]
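Bear in mind that rbd resize only grows the block device – the filesystem inside the guest still has to be grown separately. For example, taking guest1 up to 20GB (the image and pool names are the ones used above):

```shell
# 20GB expressed in megabytes, as rbd expects
rbd resize guest1 --size $((20 * 1024)) --pool rbd
# Then, *inside the guest*, once the new size is visible, grow the
# filesystem with the appropriate tool, e.g.:
#   resize2fs /dev/vda1      # ext3/ext4
#   xfs_growfs /mountpoint   # xfs
```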

Full documentation on the rbd commands is available on the Ceph wiki.

By default, ceph pools have replication set to 2 – i.e. 2 copies of all data. If you are paranoid, you can increase this number, but be aware that this requires a corresponding increase in the number of osds, and will also incur a performance hit.
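Replication is a per-pool setting; for example, to keep 3 copies in the default rbd pool (assuming you have at least 3 osds to hold them):

```shell
# Set the pool's replication factor to 3, then read it back to confirm
ceph osd pool set rbd size 3
ceph osd pool get rbd size
```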

Libvirt configuration

RBD support in qemu has been around for a while – definitely in the 0.15.1 releases. Some distributions don’t compile with it enabled – in which case you need to compile it yourself with the --enable-rbd configure option.

To see if your version has rbd support, try the following command (after having created the rbd image above):

qemu-img info -f rbd rbd:[poolname]/[guestname]

You should see something like:

image: rbd:rbd/guest1
file format: rbd
virtual size: 1.0G (1073741824 bytes)
disk size: unavailable
cluster_size: 4194304

You need to ensure that your various libvirt daemons can communicate for migration. You can use TLS (recommended, using certificates for auth), ssh, or tcp with no auth (good for testing, but insecure). See the libvirt Remote documentation for details. Note that many distros also need you to tweak the libvirtd startup options to include --listen in /etc/sysconfig/libvirtd or /etc/default/libvirtd.
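A quick way to verify the transports before involving pacemaker is to point virsh at the remote daemon from each VM host – the hostname here is an example, substitute your other VM host:

```shell
# Each of these should list the remote host's domains without error;
# pick the transport you intend to migrate over.
virsh -c qemu+tls://vmsrv2/system list --all   # TLS, recommended
virsh -c qemu+ssh://vmsrv2/system list --all   # ssh tunnel
virsh -c qemu+tcp://vmsrv2/system list --all   # no auth, testing only
```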

Once communication is established, you need to create a guest libvirt XML configuration for each guest, and deploy this to all the VMs. The important addition is the inclusion of a disk of type ‘network’ with source protocol ‘rbd’ and the name set appropriately to [poolname]/[guestname] from the rbd commands above.

 <disk type='network' device='disk'>
      <source protocol='rbd' name='rbd/guest1'>
<!-- * Uncomment unless you have a copy of ceph.conf on the VM host *
         <host name='disksrv1' port='6789'/>
         <host name='disksrv2' port='6789'/>
         <host name='vmsrv1' port='6789'/>
-->
      </source>
      <target dev='vda' bus='virtio'/>
 </disk>
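Deploying that XML is just a matter of copying it to the same path on every VM host (the path matches the config parameter used in the pacemaker snippet below; the host list is an example) – a quick xmllint pass catches typos before pacemaker trips over them:

```shell
# Copy the guest definition to each VM host (hypothetical host list)
for host in vmsrv1 vmsrv2; do
  scp /etc/libvirtcfg/guest1.xml root@$host:/etc/libvirtcfg/
done
# Cheap sanity check that the XML is well-formed
xmllint --noout /etc/libvirtcfg/guest1.xml
```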

Pacemaker

I am assuming here that you are familiar with the workings of pacemaker. At present, this guide only covers using pacemaker to manage libvirt – though there are resource agents for monitoring ceph’s init daemons, and for ‘mounting’ rbd images, written by the excellent people at Hastexo. You should also ensure that your OS is automatically starting libvirtd, but not automatically starting any guests (e.g. libvirt-guests init.d scripts, libvirt autostart etc).

Be aware that if you are running pacemaker to monitor VirtualDomain guests AND ceph, you may need to put in place location rules to prevent ceph running on hosts with ‘VM hardware’ and libvirt running on hosts with ‘disk hardware’.
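A sketch of such constraints in crm syntax – osd-clone is a hypothetical clone of an osd-managing resource, and the node names are the ones from the examples above (note the mon on vmsrv1 is left alone, only the osds are pinned):

```
# Keep the osds off the VM-only hosts...
location osds-not-on-vm-hosts osd-clone \
  rule -inf: #uname eq vmsrv1 or #uname eq vmsrv2
# ...and keep the guest off the disk-only hosts
location guest1-not-on-disk-hosts guest1 \
  rule -inf: #uname eq disksrv1 or #uname eq disksrv2
```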

The relevant crm configuration snippet for each guest will look something like this:

primitive guest1 ocf:heartbeat:VirtualDomain \
  params config="/etc/libvirtcfg/guest1.xml" \
  hypervisor="qemu:///system" migration_transport="tls" meta allow-migrate="true" \
  op start timeout="300" op stop timeout="300" \
  op monitor depth="0" timeout="30" interval="10" \
  op migrate_from timeout="300" \
  op migrate_to timeout="300"
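With the resource live, pacemaker can drive the migration itself – a quick way to exercise it (resource name as above; vmsrv2 is assumed as the second VM host):

```shell
# Ask pacemaker to live-migrate the guest to the other VM host
crm resource migrate guest1 vmsrv2
crm resource status guest1
# Clear the temporary location constraint the migrate created
crm resource unmigrate guest1
```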

If you are used to running with DRBD and iscsi, this might seem quite short – however since libvirt is handling all of the rbd access, much of the complexity disappears.

Comments


#2 Christopher Barry on 05.18.13 at 8:32 pm

Nice post. I am planning a similar config/experiment, but have not acquired the requisite hardware as yet. I’m currently testing running vms using tgtd with iSER (iSCSI over RDMA with Infiniband), but want the security of ceph’s distributed block devices. I make the iSER connection from the host, then attach the vm to the block device on the host with virtio. The speeds in my config are exceptional, and I’m hitting an array of SSDs, but the safety factor is too scary for production.
Q: What’s your experience with the performance of your setup? Are you using 10GbE or GbE or some bonded config from your hosts? Also, what do you know about using ceph with IB? I’ve seen statements of people using IPoIB, but would really rather avoid that if possible.
Thanks,
-C
