Disaster recovery and fault tolerance

20 Feb 2009
By John Adams
NETWORKED security systems are as vulnerable to failure as any other computer network is. And the increased use of off-the-shelf computer equipment means that security electronics installers building networks need to pay close attention to the ability of their systems to handle failures of any kind.
SOME manufacturers may claim that off-the-shelf computer products are as reliable as the best proprietary DVR solutions but this is not the case. Most computer systems operate in an environment that does not demand perfect operation and there is less emphasis on reliability and onboard redundancy.

Better DVRs are custom-built for their role with powerful streaming capabilities, bulletproof operating systems, multiple ports and either mirrored HDDs or onboard RAID storage. They also have the ability to support multiple monitors and analogue CCTV cameras. As part of your disaster recovery plans you need to think hard about which type of solution will give the greatest reliability.

Having said this, there are also computer components and installation techniques that allow duplication of vital comms paths, power sources and administrative functions. Applying backup systems during planning is what will guarantee system integrity in the long term, whether you use DVRs or off-the-shelf hardware.

As clever installers will be aware, one of the first things you need to undertake when considering fault tolerance and disaster recovery is impact analysis. What will complete failure of a networked video surveillance system mean for an organization? What would failure of a networked access control system mean? You need to assess the results of failure and then work backwards to avert risks and establish safeguards.

Central to the disaster recovery process is sitting down with the management of an organization and spelling out to them the likely results of a series of disaster scenarios. Senior management of a big site may assume that in the event of catastrophic power, phone or Internet failure they will have the unlimited support of the networked access system, while in truth that system may be hampered by limited UPS support and guaranteed to fail open within 2 hours of site power failure.

The last thing you want as an installer (or as a security manager) is to foster unrealistic expectations of system performance in the event of network failure. Networks are touchy beasts and their multiplicity of essential elements – firmware, hardware and software – can lead to unusual and unexpected failures that may take many hours or even several days to put right.

In a recent incident, an IT solutions provider we know of had a customer experience complete server failure. Just to make things more interesting this was topped off with trouble with a hot replacement server. The full extent of the problem was not fully appreciated early on – junior techs were on the site initially and they battled the problem without reporting its seriousness for most of a day.

The upshot of all this was that a large company was denied Internet connection or email services for two-and-a-half days while the entire technical staff of the ISP tried to rebuild the failed system. In the end the IT company kept the business but it was a near-run thing and the stress on the technical team and company management was intense.

Apply this scenario to the servers supporting video surveillance at a casino, airport, metropolitan rail system or any large iconic, industrial or defense site and you have serious problems that would mean partial closure of operations at great cost to the end user and the security contractor.

Clearly it’s vitally important that you establish what the user expects from their system and apply those demands in the process of system design. If the user wants the sort of data recovery redundancy that will demand RAID-1, make sure you find that out before the system is installed, not after a hard drive failure. And security managers – know what you want and make your demands very clear from the outset.

When working on disaster-capable networked systems, security teams and the IT team will need to sit down and work together on problems of this complexity. Not only are you going to need to establish the results of security system failure, you’ll also need to work out what sort of problems are likely to lead to such failures. The bigger the system and the bigger the network, the more difficult it will be to establish the variables that will relate to system failure. Just to complicate matters there may be partial disasters as well as total disasters.

When you think disaster recovery you’re not just thinking of natural disasters, fires or terrorist attacks that might cause structural damage, or upset the utilities that support the system. You also need to think about local disasters that might cause destruction and disruption to the system itself.

The sort of disasters that are likely to impact on a networked electronic security system include lightning strike, total power failure, loss of communications between node zero and control room, failure of the controlling PC, failure of switches, routers or hubs, system intrusion (virus, hacker attack, etc), failure of HDDs and software bugs. Once you have a list of potential problems then you can start thinking about what’s required to recover from them.

A key issue with disaster recovery is putting together a plan that will bring a system back from failure and the best way to do this is to compartmentalize various tasks and assign a team leader to each area of the system. This sounds complicated but it may simply mean giving leadership in the event of a particular problem to the individual with greatest relevant expertise.

Think hard about people. In some worst-case scenarios key members of staff may be unable to perform their duties and the responsibility will need to be carried by other team members. If possible you need to duplicate chains of command in other offices. Management at an interstate office might take over, and vice versa.


Your recovery plan will be broad. There will be issues with the physical components like the cable plant and the patch panel on one hand, while there may be problems with control software or remote communications systems on another.

Just to make matters harder there are also guaranteed to be areas of potential failure where a security electronics team has no business getting involved. These areas may be handled by the systems department or, in the event of external lighting failure, responsibility may go to in-house maintenance crews or contractors.

In each and every case, the recovery plan needs to be broken down to its smallest parts. In order to get this element right you need detailed procedures, up-to-date contact lists and a guaranteed means of fast communication. There may also be areas where training is required to ensure appropriate support. Such training should be included in disaster recovery procedure manuals.

Managing the networked security system’s disaster recovery plans is a job for the security manager, or a competent and well-respected assistant manager or supervisor, not a junior. This individual will be responsible for contacting team leaders, as well as overseeing upgrades to the plan that will come as personnel and equipment change. It sounds simple enough but getting this part right will require constant monitoring.

Another key element to the disaster recovery process is establishing and maintaining test procedures. Once you’ve established a recovery plan it’s not going to be much use if that plan has never been tested in a simulated disaster.

Remote locations

Depending on the nature of the installation, recovery may mean that a control room at a remote location takes over the responsibility of managing the system using a secondary comms channel and a dedicated power source.

The system may also be designed with distributed intelligence so components can continue to function even if the control room is not operational. And in surveillance applications on large sites, a central core of cameras in the highest security areas may be double-wired to a standalone machine with its own power source while security officers patrol the rest of the site.

Something that’s easy to do on larger sites is to physically separate vital system components so that a local disaster can’t destroy an entire control and storage solution. This could be as simple as spreading servers around a number of different server rooms or keeping RAID or DAT backups in a different location from your servers/DVRs.

Almost every big security office has always kept its backup audio, video and event files behind the mantrap and in the same location as primary storage devices. Even if all you do is keep a single server and controller PC outside the control room and in another part of the facility, it will be a major enhancement to your disaster recovery capabilities.

When thinking about an alternate management and storage site you’ll need to talk to the IT department but the ability to move the security operation off site and still function effectively is vital in the event of disasters like fires, chemical spills or any threat that causes complete evacuation of a site.

Modern networked access control and video surveillance systems are admirably equipped to offer this capability. A little foresight during system planning should allow either the staff at another major office, or the control room team to use remote workstations – laptops if necessary – to manage the system externally. 

If there are plans to relocate in the event of trouble then you need to think about staffing the remote location. This means serious consideration must be given to issues like network access, training and sharing of information at the highest levels.

Important with all IT infrastructure is thinking about things like policy-based management, root cause analysis and knowledge bases to broaden the ability of teams to make the best possible decisions when working without direct guidance of senior management. The last thing you want is a comprehensive recovery plan that falls over because the only person who knows how it works is on holiday.

Something else to consider is maintenance of a full schematic diagram of security networks and all their peripheral equipment. In the event of the destruction of a site this sort of detail allows fast rebuilding.

As mentioned earlier in this article, redundancy is vital and the earlier it’s incorporated into a system design the better. Any component whose operation is essential for the system’s overall functionality must be duplicated if there’s to be any hope of maintaining operation in the event of a disaster.

Data recovery with RAID options

Disaster recovery planning needs to be systematic and thoroughly thought out. One of the most important things to think about is data management. This can be a big issue for security departments with large numbers of cameras whose storage requirements might be enormous. Many security managers are flat out getting enough storage for 15 images per second held for 7 days, let alone having video servers or DVR hard drives backed up off site.
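
To get a feel for the numbers, the arithmetic can be sketched in a few lines of Python. The camera count and the 20 KB average image size used here are assumptions chosen purely for illustration, not figures from any particular system; real image sizes vary with resolution, compression and scene activity.

```python
def storage_required_gb(cameras, images_per_second, days, avg_image_kb):
    """Estimate raw storage (decimal GB) for continuous recording."""
    seconds = days * 24 * 60 * 60
    total_kb = cameras * images_per_second * seconds * avg_image_kb
    return total_kb / 1_000_000  # KB -> GB


# Hypothetical site: 64 cameras at 15 images per second held for 7 days,
# assuming an average compressed image of 20 KB.
primary_gb = storage_required_gb(64, 15, 7, 20)
print(f"Primary storage: {primary_gb:,.0f} GB")
print(f"With a full off-site copy: {2 * primary_gb:,.0f} GB")
```

Under those assumptions the site needs roughly 11.6 TB before any backup copy is made, which shows why off-site duplication of video is such a stretch for many departments.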

One of the key elements of onsite data recovery for security teams is use of RAID storage systems and it’s worth us going into a bit of detail to give the best understanding of how these systems work. For a start the RAID acronym stands for Redundant Array of Independent Disks and central to the concept of RAID is “striping”. This is a way of combining an array of HDDs into a single storage unit. Essentially striping an array of hard drives involves partitioning each drive’s platter into storage stripes of any size from half a kilobyte to a few megabytes.

Storage stripes are interleaved across the array so that the entire storage solution is actually made up of many different stripes from all the disks woven together. Data saves or searches see the disks shuffled like a deck of cards during download and retrieval. The benefits of RAID include the fact that storage levels are exceedingly high and that in the event of disk failure certain recording modes guarantee no data will be lost. Possible RAID modes include 0, 1, 2, 3, 4, 5 and 6.
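
As a rough illustration of the interleaving described above, the sketch below deals fixed-size chunks of data round-robin across a set of disks and then reads them back in the same order. It is a toy model of RAID-0 style striping only; the function names are hypothetical, and a real controller works on disk blocks rather than Python lists.

```python
def locate_block(logical_block, num_disks):
    """Return (disk_index, stripe_row) for a logical block in a striped array."""
    return logical_block % num_disks, logical_block // num_disks


def stripe(data, num_disks, stripe_size):
    """Split data into stripe_size chunks and deal them round-robin across disks."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), stripe_size):
        chunk_index = offset // stripe_size
        disks[chunk_index % num_disks].append(data[offset:offset + stripe_size])
    return disks


def unstripe(disks):
    """Reassemble the data by reading one chunk from each disk in turn."""
    chunks = []
    rows = max(len(disk) for disk in disks)
    for row in range(rows):
        for disk in disks:
            if row < len(disk):
                chunks.append(disk[row])
    return b"".join(chunks)


frame = bytes(range(10))          # stand-in for a slice of video data
array = stripe(frame, num_disks=3, stripe_size=2)
assert unstripe(array) == frame   # the round trip returns the original data
```

Because consecutive chunks sit on different drives, several drives can work on one read or write at the same time, which is where the performance gain comes from.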

In video surveillance applications, RAID allows the use of small stripes around 512 bytes long so that images are recorded across every disk in the array with each drive storing a part of the image stream. There are 2 advantages here. Firstly, in the redundant modes loss of a hard drive doesn’t mean complete loss of data - the other disks in the array can rebuild the files lost from one failed HDD. And secondly, record accesses can be performed very quickly – that’s perfect for video applications.

Different modes are very much worth having as they give a range of performance options, depending on your requirements. RAID-0 sees data split over the array giving high performance in terms of storage at the expense of possible data loss in the event of disk failure. It’s the fastest RAID mode.

RAID-1 is perfect for performance-critical, fault-tolerant solutions. It provides redundancy by writing all data to 2 or more disks giving faster reads and slightly slower writes over single drive storage. Most importantly though, there’s full data redundancy.

RAID-3 is ideal in data heavy situations where long sequential record recalls improve data transfer. It lays down data in byte-sized stripes, storing parity on one drive. This parity configuration allows complete recovery of all information in the event one drive fails – excellent for video surveillance applications if you want to employ every disk to its full capacity. 
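
The way a parity drive allows recovery can be shown with a minimal sketch. Parity in these modes is the byte-by-byte XOR of the data drives, so XOR-ing the surviving drives with the parity drive reproduces whatever was on the failed one. The block contents below are made-up placeholders.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data drives holding placeholder content, plus one parity drive.
data_disks = [b"CAM1", b"CAM2", b"CAM3"]
parity = xor_blocks(data_disks)

# Simulate losing the second drive and rebuild it from the survivors and parity.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data_disks[1]
```

The same arithmetic explains the limit of single-parity modes: with two drives missing there is not enough information left to solve for both.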

Lastly, RAID-5 works in a similar way to RAID-4 but, unlike RAID-3 and RAID-4, it shares parity across all the disks, meaning there’s no single-disk parity bottleneck. RAID-5 allows smaller writes to be undertaken faster than RAID-4 but read performance is not as good. RAID-5 is the answer in multi-user environments where performance is not the ultimate goal.
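
When weighing up the modes it can help to put usable capacity and fault tolerance side by side. The sketch below is a simplified comparison assuming identical drives and the standard definitions of each level; the function name and the 4 x 1 TB example are illustrative assumptions.

```python
def raid_summary(level, num_disks, disk_tb):
    """Return (usable_tb, drive_failures_tolerated) for common RAID levels."""
    minimum = {0: 2, 1: 2, 3: 3, 4: 3, 5: 3, 6: 4}
    if level not in minimum:
        raise ValueError(f"RAID-{level} is not covered by this sketch")
    if num_disks < minimum[level]:
        raise ValueError(f"RAID-{level} needs at least {minimum[level]} disks")
    if level == 0:
        return num_disks * disk_tb, 0        # striping only, no redundancy
    if level == 1:
        return disk_tb, num_disks - 1        # every disk is a full copy
    if level in (3, 4, 5):
        return (num_disks - 1) * disk_tb, 1  # one disk's worth of parity
    return (num_disks - 2) * disk_tb, 2      # RAID-6: two disks' worth of parity


for level in (0, 1, 5, 6):
    usable, tolerated = raid_summary(level, num_disks=4, disk_tb=1.0)
    print(f"RAID-{level}: {usable:.0f} TB usable, survives {tolerated} drive failure(s)")
```

The trade-off is plain once it is laid out this way: every step up in protection costs at least one drive's worth of recording time.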

Addressing the practicalities of network redundancy

For electronic security teams hot-swappable servers/DVRs are central to the recovery process but you also need at least one alternate Ethernet path and back-ups for routers, switches and hubs. What you want from a networked electronic security system is fault tolerance and high availability.

Typical fault tolerant systems resist problems by duplicating power supplies, duplicating disk arrays and offering automatic changeover software. If there’s a negative with such automation it’s that you may be unaware there’s a problem because the system will do the thinking for you. A capable monitoring and reporting solution is vital here.

Another aspect of fault tolerance is building multiple connections between video servers/DVRs and network switches. Building networks this way ensures there’s backup should a NIC fail. You might also connect a NIC to a pair of switches instead of just one.

As an alternative you might opt for high availability. A high availability network design is one that ensures performance at a level that guarantees that no matter what components fail – short of complete site destruction – some operational capability will be retained.

If there’s a standout advantage of high availability vs fault tolerance it’s cost – high availability solutions are going to be much less expensive than fault tolerant ones. A typical high availability site may incorporate separate Ethernet systems with both client and server machines fitted with a pair of Ethernet cards.

Duplication of the Ethernet network may seem like an expensive business but when you think about the low cost of hardware and the ease of pulling duplicate cables, getting full network redundancy locally is almost too easy, especially when you consider the small size of most security LANs or LAN segments. It’s definitely harder work maintaining a hot-swappable server than it is to keep a simple Ethernet idling ready to go.
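
The value of a second network can be put into rough numbers. The sketch below assumes each path fails independently of the other, which is a big assumption: duplicate cables in the same conduit or switches on the same power circuit will fail together. The 99 per cent figure is illustrative only.

```python
def parallel_availability(path_availability, num_paths):
    """Chance that at least one of several independent paths is working."""
    return 1 - (1 - path_availability) ** num_paths


def downtime_hours_per_year(availability):
    """Convert an availability fraction into expected hours down per year."""
    return (1 - availability) * 365 * 24


single = 0.99  # assumed availability of one Ethernet path
dual = parallel_availability(single, 2)

print(f"One path:  {downtime_hours_per_year(single):.1f} hours down per year")
print(f"Two paths: {downtime_hours_per_year(dual):.2f} hours down per year")
```

On those assumptions a second path cuts expected downtime from about 88 hours a year to under an hour, which is why physically separating the two networks matters as much as installing them.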

Another seriously valuable addition to the security control room and its network would be a full blown intranet router designed to direct traffic around a network using smarts like network address translation, and port address translation, as well as having the ability to execute firewall rules. You can use a top end router like this to handle critical operations, too.

Having such a router solution integrated and aligned with a security network would give excellent support but you need to take into account that rebuilding an intranet router from scratch is not easy so you’d be looking at hot standby features that allow primary and secondary routers to stay in touch in real time. Any failure of primary routers would see the secondary unit take over. Using a load balancing router solution a similar effect is achieved because the overall solution has the ability to pick up the slack in the event a router fails. The central issue with these solutions, however, is dollars.
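
The hot standby idea can be sketched as a simple heartbeat check: the secondary listens for regular messages from the primary and takes over once they have been missing for longer than a set timeout. This is a toy model with hypothetical names and timings, not the protocol used by any particular router.

```python
class StandbyPair:
    """Toy model of a primary/secondary router pair using heartbeats."""

    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout  # seconds of silence allowed
        self.last_heartbeat = 0.0
        self.active = "primary"

    def heartbeat(self, now):
        """Record that the secondary has just heard from the primary."""
        self.last_heartbeat = now

    def check(self, now):
        """Promote the secondary if the primary has been silent for too long."""
        silent_for = now - self.last_heartbeat
        if self.active == "primary" and silent_for > self.heartbeat_timeout:
            self.active = "secondary"
        return self.active


pair = StandbyPair(heartbeat_timeout=3.0)
pair.heartbeat(now=0.0)
pair.heartbeat(now=1.0)
print(pair.check(now=2.0))   # primary is still being heard
print(pair.check(now=4.5))   # 3.5 seconds of silence exceeds the timeout
```

The timeout is the design decision that matters: too short and a brief glitch triggers a needless changeover, too long and the site waits while nothing is routing traffic.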

Replacement bits

Getting disaster recovery right means having the ability to rebuild a failed system on the spot and that means having replacement hardware on site where it can be plugged into the system immediately. Such equipment may include a server, a PC, a DVR, and a couple of switches or routers.

While a hotel or large industrial site may be able to get away with having the surveillance system down for half an hour, big airports or casinos will need to get failover times down to a couple of seconds.

With networked DVRs or video servers this may mean servers link to a hub through a bunch of different NICs using different switch ports. It’s straightforward stuff but it means not only will there be failure protection, the resultant “fat pipe” will pump up local bandwidth possibilities.

If failure of the cable plant, its connectors and terminations is a worry then building lots of connections between switches will help. You could also connect a DVR/video server to a couple of different switches on the same network so no single switch failure will see the system off line.

Quick tips for disaster recovery of networked security systems include:

* Planning for the worst – this way you won’t be surprised
* Checking your plan and updating it regularly in order to keep it fresh
* Incorporating many solutions, all able to be executed fast
* Documenting the plan and ensuring there are multiple copies
* Making sure your team is familiar with the plan.
