Deploying Unified Communications Manager to geographically diverse sites adds a number of challenges not experienced in single-site deployments. The primary new challenges are:
- Bandwidth usage and call quality over WAN links
- Availability during WAN outages
- Call routing to the PSTN, especially for emergency numbers
- Potential for overlapping dial plans
After the cut, we will take a brief look at each of these issues and some ways to overcome them.
Bandwidth and Call Quality
Individual voice streams are relatively small compared to many other forms of network traffic, such as FTP, printing, and file sharing. However, they are far more sensitive to both delay and jitter (variation in delay) than most other traffic types. To minimize both, QoS and proper network design need to be utilized.
Delay is introduced in a number of ways, some fixed and some variable for a given network path. The fixed delays include codec delay (introduced by the process of collecting enough data to place in a packet, encoding it, and sending it), serialization delay (the time it takes to write the 1s and 0s to the wire), propagation delay (the time it takes for a single bit to get from one end of a link to the other), and switching delay (time spent in network devices, separate from queuing). Queuing delay is the main variable delay, along with what is sometimes called “network delay,” which is introduced in a service provider network. “Network delay” is simply a composite of the other delays, but it is out of your direct control. Another source of jitter is variation in route, such as per-packet load balancing over unequal paths.
Codec delay can be influenced by changing the amount of voice placed in each packet. In most deployments, the endpoints encode 20ms of voice into each packet. To save WAN bandwidth, you can configure a longer packetization interval, most commonly 30ms, since fewer packets per second means less header overhead. Going beyond that can introduce a number of issues, including a single lost packet representing more audio than the endpoints can conceal. The other tradeoff is increased delay between a sound being produced and it being placed on the wire, since the phone or gateway needs to wait longer before encapsulating the data.
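To put rough numbers on the packetization tradeoff, here is a minimal sketch of the per-call bandwidth at layer 3. It assumes uncompressed IPv4, UDP, and RTP headers (40 bytes in total) and ignores layer 2 overhead, which varies by link type. The function name is only for illustration.

```python
# Layer 3 bandwidth of one voice stream for a given packetization interval.
# Assumption: 20 (IPv4) + 8 (UDP) + 12 (RTP) header bytes, no header
# compression, and no layer 2 overhead.
IP_UDP_RTP_HEADER_BYTES = 40


def voice_bandwidth_bps(codec_bps: int, packet_ms: int) -> float:
    """Return bits per second on the wire (layer 3) for one direction of a call."""
    payload_bytes = codec_bps / 8 * packet_ms / 1000
    packets_per_second = 1000 / packet_ms
    return (payload_bytes + IP_UDP_RTP_HEADER_BYTES) * 8 * packets_per_second


if __name__ == "__main__":
    for name, codec_bps in (("G.711", 64_000), ("G.729", 8_000)):
        for ms in (20, 30):
            kbps = voice_bandwidth_bps(codec_bps, ms) / 1000
            print(f"{name} at {ms} ms: {kbps:.1f} kbps")
```

Under these assumptions, G.711 drops from 80.0 kbps to about 74.7 kbps when moving from 20ms to 30ms, and G.729 drops from 24.0 kbps to about 18.7 kbps. The header overhead matters more for the low-bitrate codec.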
Serialization delay is determined by the speed of the link. The same data is put onto a Gigabit Ethernet segment much faster than it can be put on a T1, so changing the medium can influence this. Usually, once you get to Ethernet speeds, you will not see a difference, but it is something you can influence.
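Serialization delay is simple arithmetic: the number of bits in the frame divided by the link rate. A quick sketch, using a 1500-byte frame as an illustrative size:

```python
# Serialization delay: time to clock every bit of a frame onto the wire.
T1_BPS = 1_544_000
GIGE_BPS = 1_000_000_000


def serialization_delay_ms(frame_bytes: int, link_bps: float) -> float:
    """Return the time in milliseconds to place frame_bytes on a link."""
    return frame_bytes * 8 / link_bps * 1000


if __name__ == "__main__":
    for name, rate in (("T1", T1_BPS), ("GigE", GIGE_BPS)):
        print(f"1500 bytes on {name}: {serialization_delay_ms(1500, rate):.3f} ms")
```

The same frame takes roughly 7.8 ms on a T1 and 0.012 ms on Gigabit Ethernet, which is why this delay stops being a concern at Ethernet speeds.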
The only truly fixed delay is propagation delay. The others can be influenced by design choices, but they are not going to change during a given connection. Unlike serialization delay, faster links do not lower the propagation delay, as it is based on the speed of light (or electricity) between endpoints. Propagation delay can only be changed by having less wire between the endpoints, either by moving them closer (datacenters across town vs. across the country) or by using a more direct physical route between them, which is often out of your hands.
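For a rough feel of the numbers, propagation delay is just path length divided by signal speed. This sketch assumes a signal speed of about 200,000 km/s (roughly two-thirds of the speed of light, a common rule of thumb for fiber and copper), and the distances are illustrative only.

```python
# Propagation delay: path length divided by signal speed in the medium.
# Assumption: ~200,000 km/s, a rule-of-thumb figure, not a measured value.
SIGNAL_SPEED_KM_PER_S = 200_000


def propagation_delay_ms(path_km: float) -> float:
    """Return one-way propagation delay in milliseconds for a cable path."""
    return path_km / SIGNAL_SPEED_KM_PER_S * 1000


if __name__ == "__main__":
    print(f"Across town (20 km):       {propagation_delay_ms(20):.2f} ms")
    print(f"Across the country (4000 km): {propagation_delay_ms(4000):.2f} ms")
```

Note that the cable path is usually longer than the straight-line distance, so real figures will be somewhat higher.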
Switching delay is unlikely to be a problem, but if you are running older equipment or a less efficient switching algorithm, you may benefit from looking at it.
Queuing delay is the time that a packet waits for other packets in line to be placed on the wire ahead of it. This is most common where a faster link meets a slower one, such as at the WAN edge routers, but it can also be seen where multiple links are aggregated, like on switch uplink ports where many FastEthernet ports are connected to the core with a single GigE uplink. Most QoS (Quality of Service) strategies attempt to minimize queuing delay for priority traffic with “managed unfairness” that increases the service that real-time traffic like voice receives at the cost of slower service for less delay-sensitive traffic.
Several tools are available for this: queues with guaranteed bandwidth; priority queues that let traffic go to the head of the line but cap the total amount to avoid starving the lower-priority queues; and policers that limit the total amount of less important traffic.
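To illustrate why a priority queue helps, here is a toy model (not how any particular router implements it) of a burst of packets that all arrive at the same instant on a T1. It compares how long the voice packet waits under plain FIFO versus a scheduler that serves voice first. The packet names and sizes are made up for the example, and the cap on priority traffic is not modeled.

```python
# Toy comparison of FIFO vs. voice-first scheduling for a single burst.
# Assumption: all packets are already queued at time zero on an idle T1.
LINK_BPS = 1_544_000

# (name, traffic class, size in bytes); four data packets ahead of one voice packet
BURST = [
    ("data1", "data", 1500),
    ("data2", "data", 1500),
    ("data3", "data", 1500),
    ("data4", "data", 1500),
    ("voice1", "voice", 200),
]


def queuing_delays_ms(packets, prioritize_voice):
    """Return {name: ms spent waiting before the packet's first bit is sent}."""
    if prioritize_voice:
        # Stable sort: voice packets move to the front, arrival order is kept otherwise.
        order = sorted(packets, key=lambda p: p[1] != "voice")
    else:
        order = list(packets)
    delays = {}
    clock_ms = 0.0
    for name, _kind, size in order:
        delays[name] = clock_ms
        clock_ms += size * 8 / LINK_BPS * 1000
    return delays


if __name__ == "__main__":
    print("FIFO voice wait:     %.2f ms" % queuing_delays_ms(BURST, False)["voice1"])
    print("Priority voice wait: %.2f ms" % queuing_delays_ms(BURST, True)["voice1"])
```

In this model, the voice packet waits about 31 ms behind the data packets under FIFO and not at all when it is prioritized, at the cost of each data packet waiting about 1 ms longer.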
For more in-depth information, check out QoS books, the Cisco designzone documents, or the QoS category on this blog.
Survivability During WAN Outages
One problem with VoIP solutions is that they rely on the data network. Although great strides have been made in reliability in recent years, most of the data network is built on technologies that are relatively new compared to those supporting traditional TDM telephony, and it is therefore more prone to service outages. Remote sites need options to continue providing phone service even in the event of a WAN outage.
The primary strategies to provide continued phone service during a WAN outage are SRST, clustering over the WAN, and multiple UCM clusters. Each has its benefits and drawbacks, which we will explore below.
Survivable Remote Site Telephony (SRST) uses a subset of Unified Communications Manager Express (UCME) to provide a subset of UCM features. Newer versions also support running UCME as SRST, which provides a greater number of features at the cost of more configuration and fewer supported endpoints. SRST has a number of benefits: it is usually the least expensive option; unless you are using UCME as SRST, configuration is normally quite simple; and more sites are supported than with clustering over the WAN. The main drawback is that the user experience in fallback mode is significantly different, with many of the features that users are used to, including hunt groups, park, and ad hoc conferencing, no longer working.
Clustering over the WAN can be done to provide geographic diversity to protect against outages (network, power, etc.) at a single site, as well as to provide support for phones at any site that hosts a UCM server. The benefits include being able to continue providing service to other WAN sites in the event of a WAN or other outage at your main datacenter, and continuing to provide all services, other than maybe direct extension dialing, to users. Drawbacks include added complexity, fairly strict WAN requirements (possibly adding cost), and the need for additional hardware.
Another option is to use more than one UCM cluster. Hosting a UCM cluster at each site provides the most protection against outages, but quickly becomes very difficult to support and very expensive. Usually this option would be part of a strategy that included more efficient redundancy methods.
Some or all of these schemes can be combined in an overall design. For instance, consider a multinational company that has sites in Milwaukee, Chicago, London, and Paris, plus smaller sales offices in North America and Europe. For a company like this, I would probably design a solution that had a North American cluster with servers clustered over the WAN in Milwaukee and Chicago, providing geographic diversity, and a European cluster between London and Paris, with the smaller offices using SRST for survivability. In this design, each of the redundancy options is used to provide specific parts of a very scalable and fault-tolerant system.
PSTN Routing
In a single site solution, PSTN (Public Switched Telephone Network) routing is usually a pretty simple affair. One gateway provides a connection to the outside world, or at most there could be redundant gateways that calls overflow to or are load balanced between. A call has only one exit path, or equal cost exit paths, with no real preference between them.
In a multi-site design, PSTN routing can become much more complex, since you can have many different places a call could be routed to, and there can be significant cost, legal or even life and death implications based on how a call routes.
What difference does it make where a call hits the PSTN? If it is routed to the wrong circuit, you could display the wrong caller ID, calls could be blocked, or you might incur unnecessary long distance charges. In some places, using Tail End Hop Off (TEHO) is illegal. And in the case of emergency numbers, if a user can’t reach an emergency service, life-saving treatment could be delayed.
On the other hand, you can route calls to the gateway nearest the destination, and save long distance charges. This probably is not going to save you too much on domestic long distance, but may be worth it for international calling. Just be careful, because some countries ban this type of routing.
An in-depth discussion of PSTN routing is outside the scope of this post. Proper call routing can be configured using either a site-specific partition and CSS (Calling Search Space), or local route groups. The site-specific configuration is the most flexible option, but requires more configuration. Local route groups allow a route group to be tied to a device pool, and that route group is then used to route calls from devices in that pool.
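Conceptually, a local route group works like a placeholder in a route list that is filled in from the calling device's device pool at call time. The sketch below models that idea only; the device pool and route group names (MKE_DP, MKE_PSTN_RG, HQ_BACKUP_RG, and so on) are hypothetical, and this is not UCM configuration syntax.

```python
# Conceptual model of local route groups: one shared route list, with a
# placeholder that resolves to the route group tied to the caller's device pool.
# All device pool and route group names are hypothetical.
LOCAL_RG_PLACEHOLDER = "Standard Local Route Group"

DEVICE_POOL_LOCAL_RG = {
    "MKE_DP": "MKE_PSTN_RG",
    "CHI_DP": "CHI_PSTN_RG",
}

# A single route list shared by every site, with a central backup route group.
PSTN_ROUTE_LIST = [LOCAL_RG_PLACEHOLDER, "HQ_BACKUP_RG"]


def resolve_route_groups(route_list, device_pool):
    """Replace the placeholder with the device pool's own route group."""
    return [
        DEVICE_POOL_LOCAL_RG[device_pool] if rg == LOCAL_RG_PLACEHOLDER else rg
        for rg in route_list
    ]


if __name__ == "__main__":
    print(resolve_route_groups(PSTN_ROUTE_LIST, "MKE_DP"))
    print(resolve_route_groups(PSTN_ROUTE_LIST, "CHI_DP"))
```

The appeal is that one set of route patterns and one route list can serve every site, where the site-specific partition and CSS approach needs a copy per site.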
Overlapping Dial Plans
In most deployments with DIDs, internal extensions use the last several digits of the phone’s DID as an extension, which can cause problems if the same ranges are assigned to multiple sites. If you have a site that has the range 1-414-555-1XXX and one that is 1-262-555-1XXX, both sites would have DNs in the 1XXX range.
There are a number of ways to configure the dial plan to work around this. One of the simplest is to use a different initial digit, and apply it with a mask on an inbound translation pattern. So at the gateway of the 1-262-555-1XXX site, you would use a translation pattern with a mask of 2XXX, and assign 2XXX DNs to the phones.
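The effect of a mask like 2XXX can be sketched in a few lines. This assumes the usual right-justified mask behavior, where an X keeps the incoming digit and a literal digit replaces it. The function is an illustration only, not anything UCM exposes.

```python
# Apply a right-justified digit mask: X keeps the dialed digit, a literal
# digit in the mask overwrites it. Digits left of the mask are dropped.
def apply_mask(digits: str, mask: str) -> str:
    """Return the number that results from applying mask to digits."""
    if len(digits) < len(mask):
        raise ValueError("number is shorter than the mask")
    tail = digits[-len(mask):]
    return "".join(d if m == "X" else m for d, m in zip(tail, mask))


if __name__ == "__main__":
    # An inbound call to 262-555-1234 lands on DN 2234 at the 262 site.
    print(apply_mask("2625551234", "2XXX"))  # 2234
```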
Phones could also be assigned E.164 numbers, which has a number of benefits, but not everything is capable of dialing the + symbol used for E.164, most notably UCCX and 7940 and 7960 phones. With E.164, you could use a strategy such as the one discussed below to provide abbreviated dialing within a site.
The other option is to use site codes: a short code that identifies the site, usually starting with a specific “access code” digit. There are a few ways you can implement this: either put the DNs in separate partitions, or assign the full number (site code + DN) to the phone and create translation patterns to allow for dialing without the site code when dialing inside the site.
My preference is to include the site code in the DN. In this example, we will use site codes with 8 as a designator followed by a 2-digit code. The 414 area code numbers, with a site code of 00, will be 8001XXX, and the 262 numbers, with a site code of 01, will be 8011XXX. Each of the sites has a partition, 414_ROUTES_PT and 262_ROUTES_PT, and a Calling Search Space containing the site routes partition and the partition that contains all the internal DNs. In each of the site routes partitions, there is a translation pattern that expands 1XXX to the full DN.
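Here is a small model of how that lookup behaves from each site. The CSS names (414_CSS, 262_CSS) and the INTERNAL_DNS_PT partition name are assumptions made for the sketch, and real UCM digit analysis picks the best match across all partitions in the CSS, which this simplified version does not attempt.

```python
# Model of site codes embedded in the DN: each site's routes partition holds
# a translation that prefixes the site code onto a 4-digit dialed extension.
# Assumed names: 414_CSS, 262_CSS, INTERNAL_DNS_PT.
SITE_TRANSLATIONS = {
    "414_ROUTES_PT": ("1XXX", "800"),  # (pattern, prefix to add)
    "262_ROUTES_PT": ("1XXX", "801"),
}

CSS = {
    "414_CSS": ["414_ROUTES_PT", "INTERNAL_DNS_PT"],
    "262_CSS": ["262_ROUTES_PT", "INTERNAL_DNS_PT"],
}


def matches(pattern: str, digits: str) -> bool:
    """True if digits fit the pattern, where X matches any single digit."""
    return len(pattern) == len(digits) and all(
        p in ("X", d) for p, d in zip(pattern, digits)
    )


def resolve_site_code_dn(dialed: str, css_name: str) -> str:
    """Return the full DN that a dialed string reaches from the given CSS."""
    for partition in CSS[css_name]:
        rule = SITE_TRANSLATIONS.get(partition)
        if rule and matches(rule[0], dialed):
            return rule[1] + dialed
    return dialed  # already a full DN, found in the internal DN partition


if __name__ == "__main__":
    print(resolve_site_code_dn("1234", "414_CSS"))     # 8001234
    print(resolve_site_code_dn("1234", "262_CSS"))     # 8011234
    print(resolve_site_code_dn("8011234", "414_CSS"))  # 8011234
```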
The other option would be to configure the same DNs in unique partitions, and translation patterns in a shared partition. So you would have GLOBAL_ROUTES_PT, 414_DNS_PT, and 262_DNS_PT. The 414_PHONES_CSS would include GLOBAL_ROUTES_PT and 414_DNS_PT. In GLOBAL_ROUTES_PT, there would be translation patterns for 8001XXX and 8011XXX that strip the site codes and have calling search spaces that search the appropriate partitions.
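The partitioned variant can be modeled the same way. In this sketch, 262_PHONES_CSS is an assumed name mirroring 414_PHONES_CSS, and each translation's calling search space is reduced to the single DN partition it searches. It is a simplified model, not UCM's actual digit analysis.

```python
# Model of the same 1XXX DNs living in per-site partitions, with shared
# translations that strip the 3-digit site code and point at the right partition.
# Assumed name: 262_PHONES_CSS.
SITE_CODE_LEN = 3

# pattern in GLOBAL_ROUTES_PT -> partition searched after the site code is stripped
GLOBAL_TRANSLATIONS = {
    "8001XXX": "414_DNS_PT",
    "8011XXX": "262_DNS_PT",
}

PHONE_CSS = {
    "414_PHONES_CSS": ["GLOBAL_ROUTES_PT", "414_DNS_PT"],
    "262_PHONES_CSS": ["GLOBAL_ROUTES_PT", "262_DNS_PT"],
}


def matches(pattern: str, digits: str) -> bool:
    """True if digits fit the pattern, where X matches any single digit."""
    return len(pattern) == len(digits) and all(
        p in ("X", d) for p, d in zip(pattern, digits)
    )


def resolve_partitioned_dn(dialed: str, css_name: str):
    """Return (partition, DN) reached by a dialed string, or None if no match."""
    for partition in PHONE_CSS[css_name]:
        if partition == "GLOBAL_ROUTES_PT":
            for pattern, target in GLOBAL_TRANSLATIONS.items():
                if matches(pattern, dialed):
                    return target, dialed[SITE_CODE_LEN:]
        elif matches("1XXX", dialed):
            return partition, dialed
    return None


if __name__ == "__main__":
    print(resolve_partitioned_dn("1234", "414_PHONES_CSS"))     # ('414_DNS_PT', '1234')
    print(resolve_partitioned_dn("8011234", "414_PHONES_CSS"))  # ('262_DNS_PT', '1234')
```

Compared with the site-code-in-DN approach, the phones keep short DNs here, but every inter-site call depends on the shared translation patterns.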