As part of the move to our new Sydney office we purchased a duplicate of our main router, a Cisco 1841. This was necessary as we wanted to have the network fully installed and tested before the move started, in order to avoid any nasty surprises. However, once the dust had settled we were left with a spare unit of a moderately sophisticated router, and I couldn’t help looking for something to do with it.
Most modern Cisco routers implement a protocol called Virtual Router Redundancy Protocol (VRRP). Originally intended to allow a second router to take over in the case of the main one dying, it can also be used to load-balance networks and maintain backup routes if a network connection fails. VRRP is the successor to Cisco’s proprietary HSRP, and has the advantage of being able to interact with non-Cisco devices such as Linux-based routers.
Another interesting feature of Cisco routers is the ability to monitor Service Level Agreements (SLAs), ensuring that certain levels of network performance and availability are maintained. These can track variables ranging from network lag to the ability to make VOIP calls.
Although not explicitly enunciated anywhere I can find, these two features can be linked together, enabling two (or more) routers to monitor the availability of upstream services and fail-over if the internet becomes unavailable. As this isn’t clearly documented anywhere the following is a brief tutorial on setting this up …
Let’s say our internal network is on 10.1.1.0/24. We have two routers on it, one that maintains the link to the internet via fibre and one that connects to our VOIP provider via SHDSL. However, the VOIP connection can also be used as a fall-back internet connection, and vice versa. In practice we have both failing over to each other, but for this example we’ll concentrate on the internet connection.
The standard configuration guides generally give the main gateway router the gateway IP, with that being taken over by the backup router. However, I prefer to give each router its own dedicated IP with the gateway IP ‘floating’. This means I can SSH into the router even when it is not the primary. The floating gateway IP on our network is 10.1.1.1. In both routers the interface 0/0 is internet-facing and 0/1 is the LAN.
First we need to set up SLAs to monitor not just the link itself but a well-known host upstream. The service provider’s DNS server is usually a good choice.
We’ll do the default router first (10.1.1.251). We’ll define two SLAs; one that pings the next-hop and one that pings the DNS server (obviously the IPs below are made up). The SLAs are numbered 1 and 2 respectively:
    ip sla 1
     icmp-echo 220.127.116.11 source-interface FastEthernet0/0
     timeout 2000
    ip sla schedule 1 life forever start-time now
    ip sla 2
     icmp-echo 18.104.22.168 source-interface FastEthernet0/0
     timeout 2000
    ip sla schedule 2 life forever start-time now
The fragments above define each SLA (including the source interface, which is important in multihomed routers), specify the timeout (in milliseconds) and then schedule it to run forever.
The next step is to register these as tracked objects, and define some parameters of what constitutes an outage (and a recovery):
    track 1 rtr 1 reachability
     delay down 60 up 60
    track 2 rtr 2 reachability
     delay down 60 up 60
Here we create a ‘tracked object’ for each SLA (which has the same ID as the SLA for clarity). The ‘delay down/up’ clause specifies how long (in seconds) the SLA must be down before the object is declared unreachable (and the corresponding delay when it returns). I’ve been fairly generous here as we’ve had problems in the past with ADSL connections disconnecting briefly and causing the connection to ‘flap’.
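To make the effect of the delays concrete, here is a toy Python model of a tracked object. It is only a sketch of the behaviour described above, not how IOS implements it; the class name, the 10-second probe spacing and the sample probe results are all made up for illustration.

```python
from typing import Optional


class DelayedTrack:
    """Toy model of a tracked object with 'delay down X up Y' (seconds)."""

    def __init__(self, delay_down: int = 60, delay_up: int = 60) -> None:
        self.delay_down = delay_down
        self.delay_up = delay_up
        self.state = "up"  # the state reported to VRRP
        self._pending_since: Optional[int] = None  # when the probe began disagreeing

    def update(self, now: int, probe_ok: bool) -> str:
        raw = "up" if probe_ok else "down"
        if raw == self.state:
            # Probe agrees with the reported state: cancel any pending change.
            self._pending_since = None
            return self.state
        if self._pending_since is None:
            self._pending_since = now
        delay = self.delay_down if raw == "down" else self.delay_up
        if now - self._pending_since >= delay:
            # The disagreement has lasted the full delay: flip the state.
            self.state = raw
            self._pending_since = None
        return self.state


def simulate(probe_results, step=10, **delays):
    """Feed one probe result every `step` seconds (hypothetical spacing)."""
    track = DelayedTrack(**delays)
    return [track.update(i * step, ok) for i, ok in enumerate(probe_results)]


# A 30-second blip never reaches VRRP; only a sustained outage does.
blip = simulate([True, False, False, False, True, True])
outage = simulate([True] + [False] * 8)
```

In this model the short blip leaves the object ‘up’ throughout, while the sustained outage flips it to ‘down’ only once the probe has been failing for a full 60 seconds.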
Now that we have a measure of network availability we can use it to cluster the two routers, with the second taking over from the first when the SLAs time out. We’ll configure the first router with a priority of 100 and the second with 95. If either of the SLAs fails we’ll drop the first router’s priority by 10, to 90, which is below the second router’s 95 and causes the second to take over.
On the first router we do:
    interface FastEthernet0/1
     ip address 10.1.1.251 255.255.255.0
     vrrp 1 ip 10.1.1.1
     vrrp 1 priority 100
     vrrp 1 track 1 decrement 10
     vrrp 1 track 2 decrement 10
     vrrp 1 timers advertise 5
The ‘timers advertise’ clause tells the router to send out updates only every 5 seconds; the default of once a second is a waste of bandwidth given the long SLA time-outs. The advertising periods must match on all routers.
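The advertisement interval also determines how long a backup waits before deciding the master has gone. As a rough worked example, the sketch below applies the master-down formula from the VRRP specification (RFC 3768); I’m assuming here that the router follows the RFC timers exactly, and the function name is my own.

```python
def master_down_interval(advertise_s: float, priority: int) -> float:
    """Seconds a backup waits without hearing the master (RFC 3768 formula)."""
    skew_time = (256 - priority) / 256  # higher priority means a shorter skew
    return 3 * advertise_s + skew_time


# The backup router in this example: priority 95, 5-second advertisements.
tuned = master_down_interval(5, 95)
# The same router with a 1-second advertisement interval, for comparison.
default = master_down_interval(1, 95)
print(f"5 s timers: {tuned:.2f} s, 1 s timers: {default:.2f} s")
```

So with 5-second advertisements the backup waits a little over 15 seconds before taking over, which is small next to the 60-second tracking delay.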
On the second we do:
    interface FastEthernet0/1
     ip address 10.1.1.252 255.255.255.0
     vrrp 1 ip 10.1.1.1
     vrrp 1 priority 95
     vrrp 1 timers advertise 5
And that’s pretty much it. Both routers will start to advertise their priority on the LAN and decide which one is the master.
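The election arithmetic is simple enough to sketch in a few lines of Python. This is an illustration only: it assumes preemption is enabled, so the router with the highest current priority always becomes master, and that ties are broken by the higher interface IP as in the VRRP specification. The dictionary layout and function names are made up for the example.

```python
from ipaddress import IPv4Address

# The two routers from this example; 'decrements' maps tracked object -> penalty.
ROUTERS = {
    "first": {"ip": "10.1.1.251", "priority": 100, "decrements": {1: 10, 2: 10}},
    "second": {"ip": "10.1.1.252", "priority": 95, "decrements": {}},
}


def effective_priority(router, failed_tracks):
    """Configured priority minus the decrement of every failed tracked object."""
    lost = sum(dec for track, dec in router["decrements"].items()
               if track in failed_tracks)
    return router["priority"] - lost


def elect_master(routers, failed_by_router):
    """Highest effective priority wins; the higher IP address breaks ties."""
    def rank(name):
        router = routers[name]
        failed = failed_by_router.get(name, set())
        return (effective_priority(router, failed), IPv4Address(router["ip"]))
    return max(routers, key=rank)


print(elect_master(ROUTERS, {}))              # all SLAs healthy
print(elect_master(ROUTERS, {"first": {2}}))  # DNS ping failing on the first router
```

With everything healthy the first router wins at 100 against 95; once either tracked object fails it drops to 90 and the second router takes over.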
Obviously in practice we do a lot more: in our case the second (VOIP) connection can also fail over to the internet one, timeouts are logged to a syslog host, and we run RIP internally to propagate routes automatically. Basic things such as NAT have also been elided here. But this should be enough to get you started.