3.2.4 Jitter

Jitter is the variation in delays that the receiver experiences. Jitter is a nuisance that the user does not hear directly, because the phones employ a jitter buffer to correct for any delays. Jitter can be defined in a number of ways. One way is to use the standard deviation or maximum deviation around the mean delay per packet. Another way is to use the known arrival intervals (such as 20ms), and subtract consecutive delays of packets that were not lost from the known arrival time, then take the standard deviation or the maximum deviation. Either way, the jitter, measured as a time or as a percentage of the mean, tells how variable the network is.

Jitter is introduced by variable queuing delays within network equipment. Phones and PBXs are well known for having very regular transmission intervals. However, the intervening network may have variable traffic. As the queue depths change and the network loads fluctuate, and as contention-based media such as Wi-Fi links clog with density, packets are forced to wait. Wireless networks are the biggest culprit for introducing delay into an enterprise private network. This is because wireless packets can be lost and retransmitted, and the time it takes to retransmit a packet can usually be measured in units of a millisecond.

A jitter buffer’s job is to sit on the receiver and prevent the jitter from causing an underrun of the voice decoder. An underrun is an awkward period of silence that happens when the phone has finished playing the previous packet and needs another packet to play, but one has not yet arrived. These underruns count as a form of error or loss, even if every packet does make it to the receiver, and loss concealment will work to disguise them. The problem with jitter is that an underrun must be followed by an increase in delay of the same amount, assuming no packets are lost.
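The underrun condition can be made concrete with a small model. The following is only a sketch under stated assumptions: the function name count_underruns and its parameters are hypothetical, the buffer uses a fixed playout delay, and a late packet is simply counted rather than allowed to stretch the playout clock. Real phones may behave quite differently.

```python
def count_underruns(arrival_ms, interval_ms=20.0, buffer_ms=0.0):
    """Count packets that arrive after their scheduled playout time.

    arrival_ms lists the arrival time of each packet, in sequence order,
    with no packets lost. The playout clock starts buffer_ms after the
    first packet arrives and then advances by interval_ms per packet.
    This is a hypothetical fixed jitter buffer, not any vendor's design.
    """
    start = arrival_ms[0] + buffer_ms
    underruns = 0
    for index, arrived in enumerate(arrival_ms):
        playout = start + index * interval_ms
        if arrived > playout:
            # The decoder needed this packet before it arrived.
            underruns += 1
    return underruns


# Packets sent every 20 ms; the third is held up in a queue and the
# fourth is stuck behind it (illustrative numbers only).
arrivals = [5.0, 25.0, 75.0, 75.0, 85.0]
no_buffer = count_underruns(arrivals, buffer_ms=0.0)
small_buffer = count_underruns(arrivals, buffer_ms=10.0)
large_buffer = count_underruns(arrivals, buffer_ms=30.0)
```

In this made-up trace the worst packet is 30 ms late, so a 30 ms playout delay absorbs every late arrival, while smaller delays leave one or two gaps for loss concealment to cover.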
The delay grows because the delayed packet holds up the line for the packets behind it. Here, the value of the jitter buffer can be seen. The jitter buffer lets the receiver build up a slight delay in the output. If this delay is greater than the amount of actual jitter on the network, the jitter buffer will be able to smooth things out without underrunning. In this sense, the jitter buffer converts jitter directly into delay. If the jitter becomes too large, the jitter buffer may have limited room, and start dropping earlier samples in the buffer to let the call catch up to be closer to real time. In this way, the jitter buffer can convert the jitter directly into loss.

Because jitter is always converted into delay first, then loss, it does not have a direct impact on the E-model by itself, but instead can be folded into the other measures. However, a complication arises because the user or administrator does not usually know the exact parameters of the jitter buffer. How many samples, how much delay, will the jitter buffer take before it starts to drop audio? Does the jitter buffer start off with a fixed delay? Does it build up the delay as jitter forces it to? Or does it try to proactively build in some delay, which can grow or shrink as the underruns occur? These all have an impact on the E-model call quality. As a result, a rule of thumb here is to match the jitter tolerance to the delay tolerance. The network, at least, should not introduce more than 50ms of jitter.

3.2.5 Non-IP Effects that Should Be Kept in Mind

The E-model makes plenty of room for non-IP effects on voice quality, and we would be wise to consider them here, even though the previous few sections chose to focus only on the network effects. As mentioned earlier, echo is a problem to be tackled whenever calls are being tied together in conference bridges or are traversing through multiple media gateways.
Analog lines introduce the problem of noise, as well as volume or gain control. Some analog lines may be tuned softer than others. Most of this requires reasonable end-to-end testing, however.

Then there are the intangibles. Is the network provisioned well enough that calls go through or are held predictably and reliably? Is the voice mobility network laid out well enough that users know that every point in the campus is a hot spot, or are some areas weak or dead? Cellular companies build entire marketing campaigns on the premise of the importance of coverage and dropped calls. (The number of bars on the phone or people standing behind the spokesman are both powerful examples of how important the predictability of the call quality is to callers.) This same concern needs to be applied to voice mobility networks produced within the enterprise. No amount of modeling will answer how much tolerance exists, but the general consensus is that voice mobility networks must work better than the cellular networks, when the callers are in the office. Mobility within the office does not generally count as a factor that can be used to increase the acceptance of the quality of the calls, and although mobility is a tremendous driving force to achieve higher productivity and less frustration, it is the sort of benefit that is hardly noticed until it is gone.

Keep in mind that the codec chosen can make an immediate ten-point difference in the R-value, in many cases.

3.3 How to Measure Voice Quality Yourself

The final section in this chapter is concerned with the ways in which administrators of voice mobility networks can directly ascertain the quality of the network.

3.3.1 The Expensive, Accurate Approach: End-to-End Voice Quality Testers

As mentioned in the discussion of PESQ (Section 3.1.2), existing tools can measure the quality of the voice network by directly pumping in prerecorded voice samples and comparing the output.
These tools are either expensive or home-grown, and are used to test large networks as a part of a planning or predeployment phase. This sort of testing is more of a tuning exercise, and, much like how piano tuning is a rare and complicated enough exercise that it is not performed frequently, direct end-to-end testing is not diagnostic. Telephone equipment testing companies do make the sort of equipment to perform this end-to-end inspection, and these tools can be rented.

Unfortunately, it is very difficult to know where to invest in this sort of heavily proactive effort. More likely, the voice quality is measured by having administrators walk around the network with some number of phones in question, ensuring themselves that whatever problems they may face will likely be manageable. The problem with both forms of proactive testing is that they normally occur on only lightly loaded networks, and thus are not able to measure the effect of network load on voice quality. Network load generally has the largest impact on voice quality, in fact, partly because voice mobility network managers do a good job of testing their networks for basic problems before they launch them, which they quickly correct, and partly because voice mobility networks are more likely to be robust enough out of the box for basic voice connectivity.

3.3.2 Network Specific: Packet Capture Tests

Most of the major packet capture tools, for wireline and for wireless, make modules that are able to indirectly infer the MOS values using E-model calculations. Sometimes, these work by tracing the voice setup protocols, such as SIP, and determining what RTP flows map to phone calls and the properties of the phone calls. Other times, these tools will just look directly at the RTP streams, and not try to find out what phone numbers the streams map to. In both cases, the tools then use the sequence number and timestamp fields in the RTP stream to determine values such as loss, delay, and jitter.
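To make that step concrete, here is a minimal sketch of how loss and jitter might be derived from those two RTP fields plus the capture time. It is not how any particular capture tool works. The function name rtp_stream_stats, the 8 kHz clock, and the sample packets are assumptions for illustration. The standard-deviation and maximum-deviation figures follow the jitter definitions of Section 3.2.4, and the running estimate follows the smoothing described in RFC 3550. Absolute one-way delay is left out, because it cannot be recovered without synchronized clocks.

```python
from statistics import pstdev


def rtp_stream_stats(packets, clock_hz=8000):
    """Estimate loss and jitter from received RTP packets.

    packets is a list of (sequence_number, rtp_timestamp, arrival_seconds)
    in arrival order. Sequence-number wraparound is ignored for brevity.
    """
    seqs = [seq for seq, _, _ in packets]
    expected = max(seqs) - min(seqs) + 1
    lost = expected - len(set(seqs))

    # Relative transit time: arrival time minus the sender's timestamp.
    # The unknown clock offset between sender and receiver cancels out
    # when only the variation is examined.
    transit = [arrival - ts / clock_hz for _, ts, arrival in packets]
    mean = sum(transit) / len(transit)

    # Running estimate in the style of RFC 3550: each new difference in
    # transit time moves the estimate one-sixteenth of the way toward it.
    rfc_jitter = 0.0
    for prev, cur in zip(transit, transit[1:]):
        rfc_jitter += (abs(cur - prev) - rfc_jitter) / 16.0

    return {
        "lost": lost,
        "loss_fraction": lost / expected,
        "jitter_stdev": pstdev(transit),
        "jitter_max_dev": max(abs(t - mean) for t in transit),
        "jitter_rfc3550": rfc_jitter,
    }


# Hypothetical 20 ms stream at 8 kHz: packet 3 is lost, packet 4 is 5 ms late.
sample = [(1, 0, 0.100), (2, 160, 0.120), (4, 480, 0.165), (5, 640, 0.180)]
stats = rtp_stream_stats(sample)
```

The three jitter figures will generally disagree with one another, which is one reason scores from different tools are hard to compare directly.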
Using assumed values for the jitter buffer, with the option of having the user override them, the tools then model the expected effect and produce a score.

The major issue with these tools is that they show quality only up to the point where they are inserted. An easy example of the problem is to look at wireless networks. On a Wi-Fi network, a packet capture tool may be able to directly determine what packets it sees and come up with a score. By looking at the Wi-Fi protocol, the tool may do a good job of inferring whether the mobile phone received the packet from the access point, and at what time, and may produce a reasonably close call quality number. On the other hand, the upstream flow is likely to look quite good from the point of view of the test tool, because there is only one network in between the client and the tool. The entirety of the network upstream from the client goes missing, and the upstream MOS value can be entirely misleading.

Some network infrastructure devices are able to do these inferences within themselves, as they pass the data through. This may be a reasonable thing to do, again depending on the point of insertion and how well they are able to capture information as late into the network as possible.

It is important, when using all of these tools, for you to consult with the vendor or maker of the tools to find out where the tools are measuring. For a wireless controller with voice metric capabilities, for example, make sure that the downstream metrics are measured on the access point, based on what happened over the air, and not just passing through the controller. For wireless overlay monitoring, make sure that there is an option to do a similar capture using a wired mirror port on one of the switches, for cases in which voice quality might begin to suffer and the network needs direct attention. Overall, do not rely on just one tool, and believe what the users say, no matter what the tool tells you.
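As a rough illustration of the last step such a tool performs, the sketch below converts an E-model R-value into an estimated MOS. It assumes the conversion formula published in ITU-T G.107. The function name r_to_mos is hypothetical, and a given tool may apply its own adjustments on top of this mapping.

```python
def r_to_mos(r):
    """Map an E-model R-value to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


# A ten-point drop in R, of the size a codec change might cause.
mos_at_80 = r_to_mos(80.0)
mos_at_70 = r_to_mos(70.0)
mos_drop = mos_at_80 - mos_at_70
```

With these inputs the ten-point drop in R costs a little over 0.4 of a MOS point, which is why a tool that reads half a point high or low can still be useful for spotting relative changes.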
3.3.3 The Device Itself

The most accurate and reasonable way to measure voice quality is from the endpoints themselves. Some handsets and PBXs offer the ability for the device to produce the one-way MOS value or R-value for the receive side at the device itself. These numbers are based entirely on E-model calculations, assuming best-case or known-default scenarios for the rest of the system, but are likely to be the most accurate.

Of course, it is difficult to ask a user to determine what the voice quality is of a call while on it, especially given that voice quality is not something a user wants to measure. However, for diagnosing locations that are having troubles, this tool is valuable for the administrator herself, who is able to avoid having to guess as to whether the call sounds reasonable, and may be able to detect variations in the MOS value or R-value.

In the end, keep in mind that the absolute values produced by any of the methods deserve being taken with a grain of salt. As time goes on, the administrator of a voice mobility network should be able to learn what the real quality means for any given value the tool suggests, even when the tool is placing results half a MOS point too high or too low. However, the variation of the scores, especially when the network has changed, can be a valuable tool for pointing the way towards the solution.

CHAPTER 4 Voice Over Ethernet

4.0 Introduction

This chapter introduces the technologies necessary to carry voice over wireline packet networks. The first half of the chapter is a basic review of the concepts within packet networks, including IP and Ethernet. The second half takes a look directly at voice over these networking technologies.

4.1 The IP-Based Voice Network

The previous chapters explored the basics of how calls are set up and voice is carried over packet-based IP networks. However, the details about what makes the IP network itself work have not yet been addressed.
Voice started out on analog phone lines. Each pair of copper wires was dedicated to one specific phone, and to nothing else. This notion of a dedicated circuit has its advantages. It provides complete isolation of whatever might be going on with that line from the circumstances and problems of other phones in the network. No amount of calls being placed on a neighbor’s line can make the original line itself become busy. This isolation and invariance is necessary for voice networks to function when unexpected circumstances occur, and ensures that the voice network is reliable in the face of massive fluctuations in the system. Provisioning is simple, as well, with one line per phone at the edge.

The problem with the concept of the dedicated line is that it is extremely wasteful. When the phone is not in use, the line stays empty. No other calls can be placed on that line. Even when a call is in place, the copper wire is fully occupied with carrying the voice traffic, a small bandwidth application, and a tremendous amount of excess signal capacity exists. Dedicated wires might make sense for short distances between the phone and some next-level aggregation equipment, but these dedicated lines were used as trunks between the aggregators, causing tremendous waste from both idleness and lost bandwidth.

But probably the property that caused the most complications with wireline networking was that the dedicated line is not robust. If network problems occur (the bundle of cables is cut, or some intermediate equipment fails and can’t do its job), all lines that are attached along that path are brought down with it.

©2010 Elsevier Inc. All rights reserved. doi:10.1016/B978-1-85617-508-1.00001-3.

Digital telephone networks started to eliminate some of the problems inherent to the one-line dedication of early circuit switching.
By having digital processes encode and carry the voice, more voice calls could be multiplexed onto each line, better using the bandwidth available on the copper wire. Furthermore, by allowing for hop-by-hop switching with smarter switches between trunks, failures along one trunk could be accommodated.

However, the network was still circuit-switched. A voice line could be used only for voice. Even where voice circuits were set aside for data links, the link was either fully in use or not at all. The granularity of the 64kbps audio line, the DS0, became a burden. Running applications that are not always on and have massive peak throughput but equally meek average throughput requirements meant that provisioning was always an expensive proposition: either dedicate enough lines to cover the peak requirement case, and pay for all of the unused capacity, or cap the capacity offered to the application. Furthermore, these circuits needed to be considered, managed, and monitored rather separately. The hard divisions between two circuits became a hard division between applications.

Voice networks were famous for their reliability, strict clockwork operation, and complexity. They were not for easy-to-set-up, easy-to-move operations. The wires were drawn once and carefully, and the switches and intermediate equipment were set up by a team of dedicated and expensive experts who did nothing but voice all day. If you were serious about voice, you operated your own little phone company, complete with dedicated operators. If not, your only option was to have the phone company run your phone network for you.

Along came packet-switched networks. Sending small, self-contained messages between arbitrary endpoints on a network inherently made sense for computers. The idea of sending a message quickly, without tying up lines or going through cumbersome setup and teardown operations, removed the restrictions on wasted lines.
Although it was still true that lines could remain idle when not being used, the notion of allowing these packets of information into the line as the fundamental concept, rather than requiring continuous occupation and streaming, meant that lines that carried aggregated traffic from multiple users and multiple messages could be used more efficiently. If the messages were short enough, one line might do. There were no concerns about running out of lines and having the needed, or only, path to the receiver blocked. Instead, these messages could just be queued until space was available.

Along with this whole new way of thinking about occupying the resources came a different way of thinking about addressing and connecting the resources. In the early days, a phone number used to encode the exact topological location of the extension. Each exchange, or switch with switchboard operator, had a name and number, and calls were routed from exchange to exchange based on that number first. Changes to the structure or layout of the telephone system would require changes to the numbers. Packet-switching technologies changed that. Lines themselves lost their names and numbers. Instead, those names and numbers were moved to the equipment that glued the lines together. Every device itself now had the address.

The binding of the addresses to the topology of the network remained, at some level. Devices could not be given any arbitrary address. Rather, they needed to have addresses that were similar to their neighbors. The notion of exchange-to-exchange routing was retained. This notion, though, proved to be a burden. Changes to the network were quite possible, as either more devices needed addresses, or more new “exchanges” were added to the network. Either way, the problem of figuring out how to route messages through the network remained. The original design had each router know which lines needed to be used to send the messages along their way.
The router might not know how the message should get to the final destination, but it always knew the next step, and could direct traffic along the right roads to the next intersection, where the next router took over. As the number of intersections increased, and the number of devices expanded, the complexity of maintaining these routing tables exploded. A way was needed for neighboring routers to find out about each other, and more importantly, to find out about what end devices they knew routes to. Thus, the routing protocol was born. These protocols spoke from router to router, exchanging information on a regular basis, ensuring that routers always had recent information on what destinations were valid and how to get there from here. But another thing happened. This idea of exchanging the routes had another benefit, in that it allowed the network itself to be restructured, or to fail in spots, and yet still be able to send traffic. Routers did not need to know the entire path to the destination, only the next hop. If a router knew two, different next hops for the same message, and one of the routes went down, the router could try the second one. If the router lost all of its paths to a particular set of destinations, the router before it could learn about that, and avoid using that path to get the messages through. If there was a way to get the message there, the network would find it, through the process of convergence, or agreement over time on the consistency of whether and how messages could be sent. The network became resilient, and point failures would not stop traffic from flowing. This is the story of the Internet, and of all the protocols that make it work. Clearly, the story is simplified (and perhaps romanticized to highlight the point at hand), but the fundamentals are there. Circuit switching is difficult to manage, because it is incredibly wasteful and inflexible. Packet switching is much simpler to manage, and can recover from failures. 
The Internet grew up on top of the lines offered by the circuit-switched technologies, but used a better way to dedicate the resources.

It wasn’t long before someone realized that voice itself could be put over these packet-switched lines. At first, that might sound wasteful, as using a digital line to carry a packet containing voice can never be more efficient than using that line to carry the same bits of voice directly because of the packet overhead. But packet networking technologies matured, and the throughputs offered on simple point-to-point links grew much faster than did the corresponding uses of the same copper line for digital voice, at least in the enterprise. And the advantages of using a multipurpose technology allowed these voice over IP pioneers to use the network’s flexibility and lack of dedication to one purpose to add to the voice over IP offerings quickly, without requiring retooling of physical wires. The ways in which provisioning was thought about changed, and the idea that voice and data networks can perhaps use the same resources became a compelling reason to try to save deployment and management costs.

There are a tremendous number of resources available for understanding the intricacies of how IP networks operate, including details on how to manage routing protocols and large trunk lines. Here, we will explore how voice fits into the packet-based IP network.

4.1.1 Wireline Networking Technologies and Packetization

The wireline networking technologies range from the most basic definition of how electrical signals are encoded over the copper line to the higher-level ways that computer software endpoints ensure that messages do not flood the network.

4.1.1.1 Ethernet

Nearly all wireline voice mobility networks in the enterprise start with Ethernet. Ethernet is a family of related networking technologies that establish how two machines that are physically connected can talk to each other.
Ethernet was designed to be as simple to deploy as possible, so that it can be set up as an unmanaged network, where physically connecting two endpoints together, somehow, through the network is enough to allow them to find each other and communicate. (Note that this doesn’t mean that higher-level protocols will work on this network without effort; only Ethernet itself will.)

All of the Ethernet protocols belong to the IEEE 802.3 series and are based on the idea of encoding frames. A frame is a well-defined packet message, with a source, a destination, a length, and a type. The logical format of the Ethernet frame is shown in Table 4.1.

Table 4.1: Ethernet Frame Format

Destination | Source  | Ethertype | Frame Body | FCS
6 bytes     | 6 bytes | 2 bytes   | n bytes    | 4 bytes

In Ethernet, links are anonymous. Endpoints, however (the line cards that the Ethernet cables plug into), are given addresses. These addresses are assigned at the time the device is built, and are permanently associated with the device. The Ethernet address is a 48-bit (6-byte) address, as shown in Table 4.2. The first three bytes, or 24 bits, are called the Organizationally Unique Identifier (OUI). Each manufacturer of Ethernet equipment is assigned one or more of these OUIs by the Institute of Electrical and Electronics Engineers (IEEE) Registration Authority. The manufacturer chooses the second 24 bits from a unique pool, often in order starting from 00:00:01. Together, the scheme guarantees that this address will never be accidentally taken by another device.

Ethernet also defines two special flags in the address. The L bit specifies a local address, which is dynamic and invented by a device for temporary usage. This has an application in Wi-Fi (see Chapter 5), but is otherwise not common. The G bit is for group-addressed frames, either broadcast or multicast. A group-addressed frame is meant to go out to multiple devices at once, for all of them to receive.
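The address layout just described can be sketched in a few lines. This is an illustration only: the function name parse_mac and the sample addresses are hypothetical, and the masks assume that the L and G flags are the two least significant bits of the first byte (0x02 and 0x01), which corresponds to bits 6 and 7 when counting from the most significant bit.

```python
def parse_mac(text):
    """Split a colon-separated Ethernet address into its parts."""
    octets = bytes(int(part, 16) for part in text.split(":"))
    if len(octets) != 6:
        raise ValueError("an Ethernet address has exactly 6 bytes")
    return {
        # First 24 bits: assigned to the manufacturer by the IEEE.
        "oui": ":".join(f"{b:02x}" for b in octets[:3]),
        # Last 24 bits: chosen by the manufacturer.
        "device": ":".join(f"{b:02x}" for b in octets[3:]),
        # L flag: the address was invented locally, not burned in.
        "local": bool(octets[0] & 0x02),
        # G flag: the frame is addressed to a group, not one device.
        "group": bool(octets[0] & 0x01),
    }


# Sample addresses chosen for illustration.
unicast = parse_mac("00:1a:2b:00:00:01")
multicast = parse_mac("01:00:5e:00:00:01")
temporary = parse_mac("02:00:00:00:00:01")
```

Here unicast has neither flag set, multicast has only the G flag set, and temporary has only the L flag set.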
Multicast transmissions use this mechanism. The special group address FF:FF:FF:FF:FF:FF (all 1s) is the broadcast address, and specifically requests to go to every device, whether they are in a multicast group or not.

Table 4.2: The Ethernet Address Format

Field: OUI  | Manufacturer-Defined
Bit:   0–23 | 24–47

Flag: L | G
Bit:  6 | 7

This is one way by which Ethernet guarantees that it does not require management to add or remove devices from the network.

When a device wants to transmit over a wire to another device, it has no way of knowing if that second device is there. Ethernet was intentionally designed to be as simple as possible, so senders have to transmit and hope that the other device is there. When the sender creates a frame, it places the destination Ethernet address first in the frame, followed by its own address. Then comes the type of the frame, used to figure out what network protocol is running on top of Ethernet. An arbitrary frame body follows, subject to size restrictions: the body of the frame cannot be greater than 1500 bytes, usually, and cannot be less than 46 bytes, so that the complete frame is at least 64 bytes. (Shorter bodies must be padded.) Finally, Ethernet provides a way to determine whether noise on the Ethernet line causes any bit errors, by using a frame check sequence (FCS), a mathematical checksum of the bits in the frame that will generally not match the contents of a frame if there are any errors. Ethernet uses a CRC-32 checksum.

Ethernet itself is a serial protocol, much like serial lines used to connect modems together, but operating with much more sophistication and at a faster rate. Most Ethernet types today fall into two categories: copper and fiber. The commercially available copper Ethernet technologies all use a modified version of a telephone cable, made out of copper wires. Each cable carries eight small, insulated copper wires, twisted into pairs as is done for analog telephone lines.
The plastic connectors at each end also look like telephone connectors, but have eight pins, rather than the usual six. These connectors, often referred to as RJ45, a specification in which the connectors figure prominently, snap into the corresponding sockets on all Ethernet devices. Differing numbers of the pairs within the four-pair cable may be used for different Ethernet technologies.

The first RJ45-based Ethernet is called 10BASE-T, or simply original Ethernet. Devices that support 10BASE-T run at 10Mbps, across just two of the pairs within the cable, one for reception, and one for transmission. (The other pairs are not used for data.) These Ethernet lines run a serial protocol, where the voltage on the line is flipped to signal a one or a zero in the bits used to encode the frame. However, these serial lines do not constantly transmit. Instead, the line is usually idle. But when a device wants to transmit on the line, it simply starts transmitting. The transmission itself is the frame, just described.

Before the frame itself is sent, a few bits are prepended to it. These bits, known as the preamble, are used to alert the device at the other end that the transmission is going to begin. The preamble is a 64-bit sequence of alternating ones and zeros, except for the last two bits, which are both ones. The receiving device detects that a transmission comes in, by looking for the sharp swings in voltage in the line from idle, representing the preamble’s bits. By the time the preamble is done, the receiver will have figured out the timing of the bit patterns, in case the receiver’s clock is slightly off from the sender’s. The full bits of the frame proper come in, including the checksum. At the end of the transmission, the sender and receiver have to wait for a few microseconds, and then the line becomes idle and ready to be transmitted on again.
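Putting the pieces together, the following sketch assembles a frame in the order described: destination, source, type, padded body, and checksum, with the preamble shown as the bit pattern sent ahead of it. The names build_frame, PREAMBLE_BITS, MIN_BODY, and MAX_BODY are hypothetical. The 46-byte minimum body is the value that makes the smallest complete frame 64 bytes long. The sketch also assumes that Python's zlib.crc32 implements the same CRC-32 as the Ethernet FCS and that the result is stored least significant byte first, as is common in software representations of the FCS; treat both as assumptions rather than as a statement of the standard.

```python
import struct
import zlib

# Preamble as it appears on the wire: alternating bits ending in "11".
PREAMBLE_BITS = "10" * 31 + "11"

MIN_BODY = 46    # shortest body, so the whole frame is at least 64 bytes
MAX_BODY = 1500  # usual upper limit on the body


def build_frame(dst, src, ethertype, body):
    """Return destination + source + type + padded body + FCS as bytes."""
    if len(dst) != 6 or len(src) != 6:
        raise ValueError("addresses must be 6 bytes each")
    if len(body) > MAX_BODY:
        raise ValueError("frame body too long")
    # Short bodies are padded with zero bytes up to the minimum.
    body = body.ljust(MIN_BODY, b"\x00")
    header = dst + src + struct.pack("!H", ethertype)
    # Checksum over everything that precedes the FCS field.
    fcs = struct.pack("<I", zlib.crc32(header + body))
    return header + body + fcs


# A broadcast frame carrying a tiny body; the type value marks IPv4.
frame = build_frame(b"\xff" * 6, bytes.fromhex("001a2b000001"), 0x0800, b"hello")
```

A receiver running the same checksum over the first 60 bytes of this frame and comparing against the last 4 would detect most corruption introduced on the line.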
Given that 10BASE-T is a point-to-point physical system, as there can only be one transmitter on one twisted pair, and the other transmitter on the second, there needed to be some way to interconnect multiple lines and thus multiple devices together. The solution to that is the Ethernet hub. The hub works by connecting the twisted pair that is used by a device to transmit, to every link’s twisted pair used to receive. This connection allows the transmission by one device to reach all of the others on the same segment, or other devices attached to the same hub. Hubs are purely electrical, and do not participate in the network itself.

When a device transmits on an Ethernet hub, every device on that hub hears the signal. A receiver knows that the frame is for it by looking at the destination Ethernet address. If the address matches, then the frame is kept; otherwise, it is discarded unless the operating system on that device requests to receive all frames on the line. The use of hubs, and the definitions for 10BASE-T, require that the transmissions are all half-duplex, meaning that a reception and transmission cannot occur independently.

Adding multiple devices together on an Ethernet link introduces a problem. Two or more devices are capable of transmitting at the same time. If two devices do transmit at the same time, their signals will mix on the wire, and all of the receivers will receive the garbage created by the interference. Thankfully, there is a solution to avoid this. The overall concept is known by the unwieldy phrase Carrier Sense Multiple Access with Collision Detection (CSMA/CD). Let’s break that phrase apart, starting from the end.