As explained in my previous (small) blog post, I had to migrate a Gluster setup we have within the CentOS.org infra. As said in that previous blog post too, Gluster is really easy to install, and sometimes it can even "smell" too easy to be true. One thing to keep in mind when dealing with Gluster is that it's a "file-level" storage solution, so don't try to compare it with "block-level" solutions (so typically a NAS vs SAN comparison, even if "SAN" itself is the wrong term for such a discussion, as the SAN is what sits *between* your nodes and the storage itself, just as a reminder).

Within the CentOS.org infra, we have a multi-node Gluster setup that we use for multiple things at the same time. The Gluster volumes are used to store some files, but also (on different gluster volumes with different settings/ACLs) to host KVM virtual disks (qcow2). People who know me will say: "hey, but for performance reasons, isn't it faster to just dedicate, for example, a partition or a Logical Volume to each Virtual Machine, instead of using qcow2 images sitting on top of a filesystem?" and that's true. But with our limited number of machines, and a need to "move" Virtual Machines without a proper shared storage solution (and because in our setup those physical nodes *are* both glusterd servers and hypervisors), Gluster was an easy-to-use solution to get shared storage for those qcow2 images, and so be able to move Virtual Machines from one hypervisor to the other.

It was working, but not that fast ... I then heard about the fact that (obviously) accessing those qcow2 image files through FUSE wasn't efficient at all, but that Gluster had libgfapi, which can be used to "talk" directly to the gluster daemons, completely bypassing the need to mount your gluster volumes locally through FUSE. Thankfully, qemu-kvm from CentOS 6 is built against libgfapi so it can use that directly (and that's the reason why it's automatically installed when you install the KVM hypervisor components). The result? Better, but still not what we were expecting ...
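To give an idea of the difference (a minimal sketch; the hostname "gluster01", the volume "vmstorage" and the image names are hypothetical), the same qcow2 image can be reached through the local FUSE mount or directly through libgfapi with a gluster:// URI:

# through the local FUSE mount (slow path):
qemu-img info /mnt/gluster/vmstorage/vm01.qcow2
# directly through libgfapi, bypassing FUSE (fast path):
qemu-img info gluster://gluster01/vmstorage/vm01.qcow2
# creating a new virtual disk the same way:
qemu-img create -f qcow2 gluster://gluster01/vmstorage/vm02.qcow2 20G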

When trying to find the issue, I discussed with some folks in the #gluster irc channel (irc.freenode.net) and suddenly I understood something that is *not* so obvious about Gluster in distributed+replicated mode: people having dealt with storage solutions at the hardware level (or people using DRBD, which I did too in the past, and which I also liked a lot ...) expect the replication to happen automatically on the storage/server side, but that's not true for Gluster: in fact glusterd just exposes metadata to the gluster clients, which then know where to read/write (being "redirected" to the correct gluster nodes). That means that replication happens on the *client* side: in replicated mode, the client itself writes the same data twice: once to each server ...

So back to our example: as our nodes have two 1Gb/s Ethernet cards, one being a bridge used by the Virtual Machines and the other one "dedicated" to gluster, and as each node is itself both a glusterd server and a gluster client, I let you compute the max perf we could get for a write operation: 1Gbit/s, so ~125MB/s, divided by two (because of the client-side replication) => in theory ~62MB/s (then remove the tcp/gluster overhead and that drops to ~55MB/s).

How to solve that? Well, I tested that theory and confirmed it directly: with a volume in distributed (non-replicated) mode, write performance automatically doubled. So yes, Gigabit Ethernet had suddenly become the bottleneck for Gluster. Upgrading to 10Gb Ethernet wasn't something we could do, but, thanks to Justin Clift (and some other Gluster folks), we were able to find some "second hand" Infiniband hardware (10Gbps HCAs and a switch).

While Gluster has native/builtin rdma/Infiniband capabilities (see the "transport" option of the "gluster volume create" command), we had in our case to migrate existing Gluster volumes from plain TCP/Ethernet to Infiniband, while keeping the downtime as small as possible. That was my first experience with Infiniband, but it's not as hard as it seems, especially when you discover IPoIB (IP over Infiniband). From a Sysadmin POV, it's then just "yet another network interface", but a 10Gbps one now :)
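For reference, on a brand new setup that transport option would be used at volume creation time, as in this sketch (volume name, node names and brick paths are hypothetical; "tcp,rdma" enables both transports):

gluster volume create vmstorage replica 2 transport tcp,rdma \
  gluster01:/bricks/vmstorage gluster02:/bricks/vmstorage

In our case though, the volumes already existed with transport tcp, hence the IPoIB approach described below.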

The Gluster volume migration then goes like this (schedule an - obvious - downtime for it):

On all gluster nodes (assuming that we start from machines installed only with the @core group, so minimal ones):

yum groupinstall "Infiniband Support"
chkconfig rdma on
# stop your clients or other apps accessing the gluster volumes, as those will be stopped
service glusterd stop && chkconfig glusterd off && init 0

Then install the hardware in each server, connect all Infiniband cards to the (previously configured) IB switch, and power all servers back on. When the machines are back online, you "just" have to configure the ib interfaces. As in my case the machines were remote nodes, and I couldn't have a look at how they were physically cabled, I had to use some IB tools to see which port was connected (a tool like "ibv_devinfo" showed me which port was active/connected, while "ibdiagnet" shows you the topology and the other nodes/devices).
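A minimal sketch of that check (the hca_id mlx4_0 and the output shown are illustrative and trimmed; yours will differ):

ibv_devinfo
# hca_id: mlx4_0
#     ...
#         port:   1
#             state:          PORT_DOWN (1)
#         port:   2
#             state:          PORT_ACTIVE (4)

In our case it was port 2, so let's create the ifcfg-ib{0,1} devices (ib1 being the one we'll use):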

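# /etc/sysconfig/network-scripts/ifcfg-ib1 (assumed standard CentOS 6 path; adjust IPADDR per node)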
DEVICE=ib1  
TYPE=Infiniband  
BOOTPROTO=static  
BROADCAST=192.168.123.255  
IPADDR=192.168.123.2  
NETMASK=255.255.255.0  
NETWORK=192.168.123.0  
ONBOOT=yes  
NM_CONTROLLED=no  
CONNECTED_MODE=yes

The interesting part here is "CONNECTED_MODE=yes": people who already use iscsi know that Jumbo frames are really important if you have a dedicated VLAN (and an Ethernet switch that supports Jumbo frames too). As stated in the IPoIB kernel doc, you have two operation modes: datagram (2044 bytes MTU by default) or connected (up to 65520 bytes MTU). It's up to you to decide which one to use, but if you understood the Jumbo frames thing for iscsi, you already get the point.
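You can check afterwards which mode an interface actually runs in, and the resulting MTU (a quick sketch, ib1 being our interface; the ip link output shown is illustrative):

cat /sys/class/net/ib1/mode
# connected
ip link show ib1
# ... mtu 65520 qdisc pfifo_fast state UP ...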

An "ifup ib1" on all nodes will bring the interfaces up, and you can verify that everything works by pinging each of the other nodes, including with larger MTU values:

ping -s 16384 <other-node-on-the-infiniband-network>

If everything's fine, you can then decide to start gluster, *but* don't forget that gluster uses FQDNs (at least I hope that's how you initially configured your gluster setup: already on a dedicated segment, and using different FQDNs for the storage vlan). You just have to update your local resolver (internal DNS, local hosts files, whatever you want) to be sure that gluster will then use the new IP subnet on the Infiniband network. (If you haven't previously defined different hostnames for your gluster setup, you can "just" update that in the different /var/lib/glusterd/peers/* and /var/lib/glusterd/vols/<volume>/*.vol files.)
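For example, with local hosts files, the change could look like this on each node (hostnames are hypothetical; the IPs match the IPoIB subnet used above):

# /etc/hosts : gluster FQDNs now resolving to the Infiniband IPs
192.168.123.1  gluster01.storage.example.org
192.168.123.2  gluster02.storage.example.org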

Restart the whole gluster stack (on all gluster nodes) and verify that it works fine:

service glusterd start
gluster peer status
gluster volume status
# and if you're happy with the results :
chkconfig glusterd on

So, in a short summary:

  • Infiniband isn't that difficult (especially if you use IPoIB, which only adds a very small overhead)
  • Migrating gluster from Ethernet to Infiniband is also easy (especially if you carefully planned your initial design regarding IP subnet/VLAN/segment/DNS resolution, allowing for a "transparent" move)