One of the hardest, but most important, things to do when building your cloud architecture is to eliminate single points of failure (SPoFs). What this means is that every mission-critical service should be able to survive the outage of any given server. Some companies, like Netflix, have taken this to an extreme and created a service called Chaos Monkey. The sole role of Chaos Monkey is to randomly take down servers. The overarching idea is that it forces everyone to design code and architecture with server failures in mind.
While this is a great idea, many companies don’t have the resources to implement such a system (much less maintain it). Usually, the first step towards eliminating SPoFs is to implement failover IPs on critical services. The complexity of doing this varies: for something like a load balancer it is rather simple, but for something like a database it is a lot more difficult.
In this article, we’ll focus on a useful, yet simple example of IP failover in a cloud context. We’ll set up two VMs with a floating public IP between them. This way if one goes down, the other one will take over.
Prerequisites
In order to set this up, you need the following:
- Two Linux VMs (I’m using ‘Ubuntu 14.04 Cloud Image’ from the marketplace)
- One VLAN that both VMs are connected to
- One Static IP subscription (don’t attach this to any server)
Configuring the servers
The first thing we need to do is to make sure the two servers can communicate with each other using hostnames. That means we need to either have proper internal DNS set up or have the hosts added to /etc/hosts. The latter is the easier solution, so we’ll go ahead and do that.
Let’s call our nodes ‘lb0’ and ‘lb1’ and assume that the internal IP of lb0 is 10.0.0.10/24, and 10.0.0.11/24 for lb1. Configuring static IPs is beyond the scope of this article, but it’s straightforward and instructions can be found here.
We can then simply run the following command on both servers to add name resolution:
[bash light="true"]
$ echo -e '10.0.0.10 lb0.local lb0\n10.0.0.11 lb1.local lb1' | sudo tee -a /etc/hosts
[/bash]
You might also need to remove all references to the local hostname that points to 127.0.0.1 or 127.0.1.1. By default, Ubuntu will create such a record during the installation.
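The article mentions removing those references but doesn’t show a command for it. Here’s a hypothetical sketch (not from the original article) of a small helper that strips Ubuntu’s default 127.0.1.1 line, written as a function so you can preview the result before overwriting /etc/hosts:

```shell
# Hypothetical helper: print a hosts file with Ubuntu's default
# "127.0.1.1 <hostname>" line removed. It does not modify the file itself.
strip_local_host() {
    sed '/^127\.0\.1\.1[[:space:]]/d' "$1"
}

# Preview the change, then apply it once you are happy with the output:
# strip_local_host /etc/hosts | sudo tee /etc/hosts.new
# sudo mv /etc/hosts.new /etc/hosts
```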
Now make sure the nodes can ping each other using their hostnames, i.e. run ping lb1 from lb0 and ping lb0 from lb1.
Assuming you were able to ping successfully, we can move on to configuring the failover.
Configuring failover
There are a number of different failover tools around for Linux. We’re going to use one called Heartbeat. Since it’s available in the regular repository, we can simply install it from there:
[bash light="true"]
$ sudo apt-get -y install heartbeat
[/bash]
With the package installed, we now need to configure it.
On both servers, we need to configure a shared secret. This is simply to avoid the situation where another server on the network can interfere and steal our shared resource. We do this by running the following (replace ‘YourSecret’ with a random string/password):
[bash light="true"]
$ sudo touch /etc/ha.d/authkeys
$ sudo chown hacluster:haclient /etc/ha.d/authkeys
$ sudo chmod 600 /etc/ha.d/authkeys
$ echo -e 'auth 1\n1 sha1 YourSecret' | sudo tee -a /etc/ha.d/authkeys
[/bash]
After we’ve configured the secret, we need to configure Heartbeat for our network structure. This is done in /etc/ha.d/ha.cf.
Here’s what the configuration file looks like (more details are available here). The ucast lines tell Heartbeat to send unicast heartbeats to each node’s address; Heartbeat ignores its own address, so the same file works on both servers:

node lb0.local
ucast eth0 10.0.0.10
node lb1.local
ucast eth0 10.0.0.11

To automate this, we would just need to run the following on both servers:

[bash light="true"]
$ sudo touch /etc/ha.d/ha.cf
$ echo -e 'node lb0.local\nucast eth0 10.0.0.10\nnode lb1.local\nucast eth0 10.0.0.11' | sudo tee -a /etc/ha.d/ha.cf
[/bash]
Now we need to add the shared resource. This is done in the file /etc/ha.d/haresources. This is where your public IP comes into play. Your IP will of course be different from mine, so let’s call it a.b.c.d. Let’s configure lb0 to be the master for this IP. To do that, we will add the following to both nodes:
[bash light="true"]
$ echo 'lb0.local IPaddr2::a.b.c.d/24/eth0' | sudo tee -a /etc/ha.d/haresources
[/bash]
Lastly, we need to start the heartbeat service. This is done by simply running the following on both machines:
[bash light="true"]
$ sudo service heartbeat restart
[/bash]
You should now be able to ping the heartbeat IP (a.b.c.d) from both machines.
By default, lb0 will be the preferred server; lb1 takes over only if lb0 is down.
Testing
On lb0, run the following command:
[bash light="true"]
$ ip a show
[/bash]
Under eth0, you should now see the failover IP listed. Now, if you stop the heartbeat service (sudo service heartbeat stop), the IP should disappear from lb0 and instead show up on lb1.
Once you’ve verified that the IP showed up on lb1, you can start the heartbeat service again (sudo service heartbeat start). After a few seconds, you should see the failover IP migrate back to lb0.
Making it useful
Now that we have a shared IP, let’s put it to use. To do this, let’s use something simple like an Nginx webserver.
First, we of course need to install Nginx.
[bash light="true"]
$ sudo apt-get install -y nginx
[/bash]
Next up, let’s add some sample data.
[bash light="true"]
$ sudo mkdir -p /www
$ sudo touch /www/index.html
$ echo "<html><h1>CloudSigma fail over test</h1></html>" | sudo tee -a /www/index.html
[/bash]
By default, Linux can only bind a service to an IP address that is present on one of its interfaces. This poses a problem here, since the IP will only be present on one node at a time. To remedy this, we need to change a sysctl setting on both servers:
[bash light="true"]
$ echo "net.ipv4.ip_nonlocal_bind=1" | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p
[/bash]
Lastly, we need to configure Nginx to serve the template page we created on the shared IP. To do this, we’ll create the file ‘/etc/nginx/sites-enabled/failovertest’ with the following content:
server {
    listen a.b.c.d:80;
    root /www;
    index index.html;
}
(Please note that this is a simple example, and not suitable for a production deployment.)
Now let’s run this on both servers:
[bash light="true"]
$ sudo touch /etc/nginx/sites-enabled/failovertest
$ echo -e 'server {\nlisten a.b.c.d:80;\nroot /www;\nindex index.html;\n}' | sudo tee -a /etc/nginx/sites-enabled/failovertest
$ sudo service nginx restart
[/bash]
We now have a fully redundant webserver that can survive the failure of one of the nodes. To test this, try running the following in a terminal window on your local computer (assuming you’re using Linux or Mac OS X):
[bash light="true"]
$ while true; do curl http://a.b.c.d; sleep 1; done
<html><h1>CloudSigma fail over test</h1></html>
<html><h1>CloudSigma fail over test</h1></html>
<html><h1>CloudSigma fail over test</h1></html>
...
[/bash]
If you look in the Nginx logs on lb0 (tail -f /var/log/nginx/*.log), you will see the activity. Now issue a reboot on the server. As it shuts down, the other server (lb1) will take over, and you will keep seeing results come in on your local computer.
Making it a lot more useful
While the above example is a great way to illustrate the setup, it isn’t very useful in production (if you have a simple static page, you might as well throw it up on a CDN). It is however very useful if you combine it with Nginx’s built-in Reverse Proxy functionality. This allows you to configure Nginx to load balance a pool of app servers on your internal network.
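As a sketch of that direction, the failovertest site could be replaced with something like the following. The backend addresses here are hypothetical; adjust them to your own app servers:

```nginx
# Hypothetical reverse-proxy variant of /etc/nginx/sites-enabled/failovertest:
# requests hitting the failover IP are load balanced across internal app servers.
upstream app_pool {
    server 10.0.0.20:8080;   # app server 1 (example address)
    server 10.0.0.21:8080;   # app server 2 (example address)
}

server {
    listen a.b.c.d:80;
    location / {
        proxy_pass http://app_pool;
        proxy_set_header Host $host;
    }
}
```

Deploy the same file on both lb0 and lb1 so either node can serve traffic when it holds the shared IP.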
If you have any questions or issues, please post a comment.
Happy hacking!