<h1>Switching from Ethernet to Infiniband for Gluster access (or why we had to ...)</h1>
<p>Fabian Arrotin, 2014-11-24</p>
<p>As explained in my previous (small) blog post, I had to migrate a
<a href="http://www.gluster.org">Gluster</a> setup we have within CentOS.org Infra.
As said in that previous blog post too, Gluster is really easy to
install, and sometimes it can even "smell" too easy to be true. One
thing to keep in mind when dealing with Gluster is that it's a
"file-level" storage solution, so don't try to compare it with
"block-level" solutions (typically a NAS vs SAN comparison, even if
"SAN" itself is the wrong term for such a discussion, as a
<a href="http://en.wikipedia.org/wiki/Storage_area_network">SAN</a> is what sits
*between* your nodes and the storage itself, just a reminder).</p>
<p>Within <a href="http://www.centos.org">CentOS.org</a> infra, we have a multi-node
Gluster setup that we use for multiple things at the same time.
The Gluster volumes are used to store some files, but also to host
(on different gluster volumes with different settings/ACLs) KVM
virtual disks (qcow2). People who know me will say: "hey, but for
performance reasons, it's faster to just dedicate for example a
partition, or a Logical Volume, instead of using qcow2 images sitting on
top of a filesystem for Virtual Machines, right?" and that's true. But
with our limited number of machines, and a need to "move" Virtual
Machines without a proper shared storage solution (and because in our
setup, those physical nodes *are* both glusterd nodes and hypervisors),
Gluster was an easy to use solution (a short volume-creation sketch follows the list below) to:</p>
<blockquote>
<ul>
<li>Aggregate local SATA disks as a bigger shared drive</li>
<li>use <a href="https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/Creating_Distributed_Replicated_Volumes.html">replicated+distributed
mode</a>
to also have local resiliency for those VMs</li>
</ul>
</blockquote>
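<p>For reference, creating such a replicated+distributed volume is a single command; the sketch below uses hypothetical node names and brick paths (two replicas distributed over four bricks), so adapt it to your own layout:</p>
<div class="highlight"><pre># hypothetical example : 4 bricks, replica 2 => distributed+replicated (2x2)
gluster volume create vmstore replica 2 transport tcp \
  node1.storage:/gluster/brick1 node2.storage:/gluster/brick1 \
  node3.storage:/gluster/brick1 node4.storage:/gluster/brick1
gluster volume start vmstore
</pre></div>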
<p>It was working, but not that fast ... I then heard that
(obviously) accessing those qcow2 image files through fuse wasn't
efficient at all, but that Gluster had
<a href="http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt">libgfapi</a>,
which can be used to "talk" directly to the gluster daemons, completely
bypassing the need to mount your gluster volumes locally through fuse.
Thankfully, qemu-kvm from CentOS 6 is built against libgfapi so it can use
that directly (and that's the reason why it's automatically installed
when you install the KVM hypervisor components). Results? Better, but
still not what we were expecting ...</p>
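<p>To give an idea (a minimal sketch, with "gluster01.storage" and "vmstore" being purely hypothetical host/volume names), a qemu built against libgfapi can address an image straight through a gluster:// URL, without any fuse mount involved:</p>
<div class="highlight"><pre># create a qcow2 disk image directly on the gluster volume through libgfapi
qemu-img create -f qcow2 gluster://gluster01.storage/vmstore/test-vm.qcow2 20G
# the same gluster:// syntax can then be used as the disk source for qemu/libvirt,
# provided your qemu-kvm/libvirt versions support the gluster protocol
</pre></div>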
<p>When trying to find the issue, I discussed with some folks in the
#gluster irc channel (irc.freenode.net) and suddenly understood
something that is *not* so obvious about Gluster in
distributed+replicated mode: people who have dealt with replication
at the hardware/storage level (or with
<a href="http://www.drbd.org/">DRBD</a>, which I also used in the past, and
liked a lot ..) expect the replication to happen
automatically on the storage/server side, but that's not true for
Gluster: in fact glusterd just exposes metadata to the gluster clients,
which then know where to read/write (they are "redirected" to the correct
gluster nodes). That means that replication happens on the *client*
side: in replicated mode, the client itself writes the same
data twice, once to each server ...</p>
<p>So back to our example: as our nodes have 2*1Gb/s Ethernet cards, one
being a bridge used by the Virtual Machines and the other one
"dedicated" to gluster, and as each node is itself both a glusterd node and a
gluster client, you can guess the maximum write performance we could get:
1Gbit/s is roughly 125MB/s, divided by two because of the replication, so
in theory ~62MB/s (and once you remove the tcp/gluster overhead, that
drops to ~55MB/s).</p>
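<p>Or, as a quick back-of-the-envelope check (pure arithmetic, nothing gluster-specific here):</p>
<div class="highlight"><pre># 1Gbit/s ~= 125MB/s, and the client writes every block twice (replica 2)
echo "scale=1; (1000/8)/2" | bc
# => 62.5 (MB/s theoretical ceiling, before tcp/gluster overhead)
</pre></div>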
<p>How to solve that? Well, I tested that theory and confirmed it directly:
in <a href="https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/Creating_Distributed_Volumes.html">distributed
mode</a> only, write performance was automatically doubled. So yes, running
Gluster on Gigabit Ethernet had become the bottleneck. Upgrading to
10Gb Ethernet wasn't something we could do, but, thanks to <a href="https://twitter.com/realjustinclift">Justin
Clift</a> (and some other Gluster
folks), we were able to find some "second hand"
<a href="http://en.wikipedia.org/wiki/InfiniBand">Infiniband</a> hardware (10Gbps
HCAs and a switch).</p>
<p>While Gluster has native/builtin rdma/Infiniband capabilities (see the
"transport" option of the "gluster volume create" command), in our
case we had to migrate existing Gluster volumes from plain TCP/Ethernet to
Infiniband, while keeping the downtime as small as possible. That
was my first experience with Infiniband, but it's not as hard as it
seems, especially when you discover
<a href="https://www.kernel.org/doc/Documentation/infiniband/ipoib.txt">IPoIB</a> (IP
over Infiniband). So from a sysadmin POV, it's just "yet another network
interface", but a 10Gbps one now :)</p>
<p>The Gluster volume migration then goes like this (schedule the - obvious -
downtime for it):</p>
<p>On all gluster nodes (assuming that we start from machines installed
only with the @core group, so minimal ones):</p>
<div class="highlight"><pre><span></span><span class="n">yum</span> <span class="n">groupinstall</span> <span class="ss">"Infiniband Support"</span>
<span class="n">chkconfig</span> <span class="n">rdma</span> <span class="k">on</span>
<span class="o">#</span><span class="n">stop</span> <span class="n">your</span> <span class="n">clients</span> <span class="k">or</span> <span class="n">other</span> <span class="n">apps</span> <span class="n">accessing</span> <span class="n">gluster</span> <span class="n">volumes</span><span class="p">,</span> <span class="k">as</span> <span class="n">they</span> <span class="n">will</span> <span class="n">be</span> <span class="n">stopped</span>
<span class="n">service</span> <span class="n">glusterd</span> <span class="n">stop</span> <span class="o">&&</span> <span class="n">chkconfig</span> <span class="n">glusterd</span> <span class="k">off</span> <span class="o">&&</span> <span class="n">init</span> <span class="mi">0</span>
</pre></div>
<p>Then install the hardware in each server, connect all Infiniband cards
to the (previously configured) IB switch and power all servers back on.
When the machines are back online, you "just" have to configure the ib
interfaces. As in my case the machines were remote nodes and I couldn't
physically check how they were cabled, I had to use some IB tools to see
which port was connected (a tool like "ibv_devinfo" showed me which
port was active/connected, while "ibdiagnet" shows you the topology and
other nodes/devices). In our case it was port 2, so let's create the
ifcfg-ib{0,1} files (ib1 being the one we'll use):</p>
<div class="highlight"><pre><span></span><span class="n">DEVICE</span><span class="o">=</span><span class="n">ib1</span>
<span class="k">TYPE</span><span class="o">=</span><span class="n">Infiniband</span>
<span class="n">BOOTPROTO</span><span class="o">=</span><span class="k">static</span>
<span class="n">BROADCAST</span><span class="o">=</span><span class="mi">192</span><span class="p">.</span><span class="mi">168</span><span class="p">.</span><span class="mi">123</span><span class="p">.</span><span class="mi">255</span>
<span class="n">IPADDR</span><span class="o">=</span><span class="mi">192</span><span class="p">.</span><span class="mi">168</span><span class="p">.</span><span class="mi">123</span><span class="p">.</span><span class="mi">2</span>
<span class="n">NETMASK</span><span class="o">=</span><span class="mi">255</span><span class="p">.</span><span class="mi">255</span><span class="p">.</span><span class="mi">255</span><span class="p">.</span><span class="mi">0</span>
<span class="n">NETWORK</span><span class="o">=</span><span class="mi">192</span><span class="p">.</span><span class="mi">168</span><span class="p">.</span><span class="mi">123</span><span class="p">.</span><span class="mi">0</span>
<span class="n">ONBOOT</span><span class="o">=</span><span class="n">yes</span>
<span class="n">NM_CONTROLLED</span><span class="o">=</span><span class="k">no</span>
<span class="n">CONNECTED_MODE</span><span class="o">=</span><span class="n">yes</span>
</pre></div>
<p>The interesting part here is "CONNECTED_MODE=yes": people who
already use iscsi know that Jumbo frames are really important if
you have a dedicated VLAN (and if the Ethernet switch supports Jumbo
frames too). As stated in the <a href="https://www.kernel.org/doc/Documentation/infiniband/ipoib.txt">IPoIB kernel
doc</a>,
there are two operation modes: datagram (default, 2044 bytes MTU) or
connected (up to 65520 bytes MTU). It's up to you to decide which one to
use, but if you understood the Jumbo frames thing for iscsi, you get the
point already.</p>
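<p>If you want to make the larger MTU explicit (an assumption on my side that it fits your setup; the standard initscripts honour an MTU= line), you can simply add it to the same ifcfg-ib1 file next to CONNECTED_MODE:</p>
<div class="highlight"><pre>CONNECTED_MODE=yes
MTU=65520
</pre></div>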
<p>An "ifup ib1" on all nodes will bring the interfaces up and you can
verify that everything works by pinging each other node, including with
larger mtu values :</p>
<blockquote>
<p>ping -s 16384 \<other-node-on-the-infiniband-network></p>
</blockquote>
<p>If everything's fine, you can then decide to start gluster, *but* don't
forget that gluster uses FQDNs (at least I hope that's how you initially
configured your gluster setup: already on a dedicated segment, and using
different FQDNs for the storage vlan). You just have to update your local
resolver (internal DNS, local hosts files, whatever you want) to be sure
that gluster will then use the new IP subnet on the Infiniband network.
(If you haven't previously defined different hostnames for your gluster
setup, you can "just" update that in the various
/var/lib/glusterd/peers/* and /var/lib/glusterd/vols/*/*.vol files.)</p>
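<p>As an illustration (hostnames and addresses below are purely examples, not our real ones), it can be as simple as pointing the existing storage FQDNs to the new IPoIB subnet in /etc/hosts (or in your internal DNS) on every gluster node and client:</p>
<div class="highlight"><pre># /etc/hosts : gluster FQDNs now resolving to the IPoIB subnet
192.168.123.1   gluster01.storage.example.org
192.168.123.2   gluster02.storage.example.org
192.168.123.3   gluster03.storage.example.org
</pre></div>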
<p>Restart the whole gluster stack (on all gluster nodes) and verify that
it works fine :</p>
<div class="highlight"><pre><span></span><span class="nv">service</span> <span class="nv">glusterd</span> <span class="nv">start</span>
<span class="nv">gluster</span> <span class="nv">peer</span> <span class="nv">status</span>
<span class="nv">gluster</span> <span class="nv">volume</span> <span class="nv">status</span>
# <span class="nv">and</span> <span class="k">if</span> <span class="nv">you</span><span class="s1">'</span><span class="s">re happy with the results :</span>
<span class="nv">chkconfig</span> <span class="nv">glusterd</span> <span class="nv">on</span>
</pre></div>
<p>So, in a short summary:</p>
<ul>
<li>Infiniband isn't that difficult (especially if you use IPoIB, which
adds only a very small overhead)</li>
<li>Migrating gluster from Ethernet to Infiniband is also easy (especially
if you carefully planned your initial design regarding IP
subnet/VLAN/segment/DNS resolution, so that the move is "transparent")</li>
</ul>
<h1>Enabling IPv6 for guests on an Hetzner CentOS 5.5 xen dom0</h1>
<p>Fabian Arrotin, 2010-12-31</p>
<p>I was playing with IPv6 in the last days (started to use a tunnel from
<a href="http://www.tunnelbroker.net/">he.net</a> as my current ISP doesn't
support native IPv6 and doesn't plan to support it in a short time) and
wanted to add IPv6 to some of my CentOS Xen domU's running on a
<a href="http://www.hetzner.de">Hetzner</a> box. This part was a little bit more
difficult than for a standard network. Due to their internal network
design, Hetzner <a href="http://translate.google.be/translate?u=http%3A%2F%2Fwiki.hetzner.de%2Findex.php%2FZusaetzliche_IP-Adressen&sl=de&tl=en&hl=&ie=UTF-8">only
allow</a>
'routed' xen networks and not standard 'bridged' ones. What I used for
IPv4 was just binding the public IPs on the dom0 and configured all my
iptables rules there to forward/SNAT/DNAT to the appropriate domU. But
you know that NAT is gone with IPv6 so normally it's supposed to be
easier, right ? Well, yes and no, depending on your network layout. Even
after having enabled ipv6 forwarding (net.ipv6.conf.all.forwarding=1 ),
I was just able to ping the dom0 but not the guests behind. Hmm, that
reminds me the <a href="http://en.wikipedia.org/wiki/Proxy_arp">proxy ARP</a> that
was used for IPv4 but not existing anymore for IPv6 (gone too ...) . ARP
was (more or less, not technically correct but read the RFCs if you
enough time) replaced by
<a href="http://en.wikipedia.org/wiki/Neighbor_Discovery_Protocol">NDP</a> but I
don't see such option for IPv6. Well, a kernel feature called proxy_ndp
(net.ipv6.conf.all.proxy_ndp=1) exists on newer kernels (like for
example the 2.6.32.x that is used on RHEL6 , and so in CentOS 6) but not
on CentOS 5.5 (using a 2.6.18.x) kernel .. Hmmm ...</p>
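<p>Just to illustrate the difference (a sketch only, since the second knob is precisely what's missing on that dom0), on a newer kernel the whole routed IPv6 setup would boil down to two sysctl settings:</p>
<div class="highlight"><pre># on a 2.6.32+ kernel (RHEL6/CentOS 6) this would be enough :
sysctl -w net.ipv6.conf.all.forwarding=1
sysctl -w net.ipv6.conf.all.proxy_ndp=1
# on the CentOS 5.5 dom0 the proxy_ndp knob isn't available, hence the workaround below
</pre></div>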
<p>On the other side, I was looking for a 'workaround' possibly provided by
libvirt, but the version included in RHEL5/CentOS5 doesn't know what to
do with IPv6. Okay, so let's look at the Xen and kernel side at
the same time. Even if the proxy_ndp kernel feature is not present on my
CentOS 5.5 dom0, I can still 'advertise' my neighbors with the ip
command. Yes, it supports it: "ip -6 neighbor add proxy
your:ipv6:long:address::1 dev eth0"</p>
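<p>For example (with a documentation-prefix address, so purely illustrative), you can test that by hand and check the resulting proxy entry:</p>
<div class="highlight"><pre># advertise a guest IPv6 address on the dom0 public interface
ip -6 neighbor add proxy 2001:db8:0:1::dead dev eth0
# list the configured proxy entries
ip -6 neighbor show proxy
</pre></div>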
<p>So we just need to create a modified vif-route script (in fact I decided
to call it vif-route6) that will be used for ipv6 guests :</p>
<div class="highlight"><pre>#!/bin/bash
#============================================================================
# /etc/xen/scripts/vif-route6
# Script for configuring a vif in routed mode for IPv6 only
# Based on the existing vif-route script in /etc/xen/scripts and adapted for ipv6
#============================================================================

dir=$(dirname "$0")
. "$dir/vif-common.sh"

main_ip=$(dom0_ip)
main_ip6=$(ip -6 addr show eth0 | grep 'scope global' | sort | head -n 1 | awk '{print $2}' | cut -f 1 -d '/')

case "$command" in
online)
    ifconfig ${vif} ${main_ip} netmask 255.255.255.255 up
    ip -6 addr add ${main_ip6} dev ${vif}
    ipcmd='add'
    cmdprefix=''
    ;;
offline)
    do_without_error ifdown ${vif}
    ipcmd='del'
    cmdprefix='do_without_error'
    ;;
esac

if [ "${ip}" ] ; then
    # If we've been given a list of IP addresses, then add routes from dom0 to
    # the guest using those addresses.
    for addr in ${ip} ; do
        ${cmdprefix} ip -6 neighbor ${ipcmd} proxy ${addr} dev ${netdev:-eth0} 2>&1
        result=`${cmdprefix} ip -6 route ${ipcmd} ${addr} dev ${vif} src ${main_ip6} 2>&1`
    done
fi

handle_iptable

log debug "Successful vif-route $command for $vif."
if [ "$command" = "online" ]
then
    success
fi
</pre></div>
<p>OK, so now we just have to modify our xen domU config to add a vif
that will use that specific script, and give it the IPv6 address that
we'll assign to that domU (in /etc/xen/your-domU-name):</p>
<blockquote>
<p>vif = [ &lt;snip of the first vif&gt; ,
"mac=00:16:36:38:31:b8,vifname=test.ipv6,script=vif-route6,ip=2a01:4f8:100:4363::dead"
]</p>
</blockquote>
<p>You can now start your domU and configure it normally for IPv6 (obviously
using that 2a01:4f8:100:4363::dead IPv6 address and choosing the
dom0 main IPv6 address as gateway ...).</p>
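<p>Inside the domU, a static configuration is probably the easiest; a sketch of what /etc/sysconfig/network-scripts/ifcfg-eth0 could contain on a CentOS guest (the gateway value is your dom0 main IPv6 address, and the /64 assumes it sits in the same subnet):</p>
<div class="highlight"><pre>IPV6INIT=yes
IPV6ADDR=2a01:4f8:100:4363::dead/64
IPV6_DEFAULTGW=&lt;your dom0 main IPv6 address&gt;
</pre></div>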
<p>Hope this will help some people in the same situation (using a routed and
not a bridged network layout for xen).</p>