Table 2.19: µ-Law Encoding Table

    Input Range       Number of Intervals   Spacing of   Left Four Bits of   Right Four Bits of
                      in Range              Intervals    Compressed Code     Compressed Code
    8158 to 4063      16                    256          0x8                 number of interval
    4062 to 2015      16                    128          0x9                 number of interval
    2014 to 991       16                    64           0xa                 number of interval
    990 to 479        16                    32           0xb                 number of interval
    478 to 223        16                    16           0xc                 number of interval
    222 to 95         16                    8            0xd                 number of interval
    94 to 31          16                    4            0xe                 number of interval
    30 to 1           15                    2            0xf                 number of interval
    0                 1                     1            0xf                 0xf
    −1                1                     1            0x7                 0xf
    −2 to −31         15                    2            0x7                 number of interval
    −32 to −95        16                    4            0x6                 number of interval
    −96 to −223       16                    8            0x5                 number of interval
    −224 to −479      16                    16           0x4                 number of interval
    −480 to −991      16                    32           0x3                 number of interval
    −992 to −2015     16                    64           0x2                 number of interval
    −2016 to −4063    16                    128          0x1                 number of interval
    −4064 to −8159    16                    256          0x0                 number of interval

leading one), and to record the base-2 (binary) exponent. This is how floating-point numbers are encoded. Let’s look at the previous example. The number 360 is encoded in 16-bit binary as 0000 0001 0110 1000, with spaces placed every four digits for readability. A-law uses only the top 13 bits; truncating to those 13 bits leaves 0 0000 0010 1101, the value 45. Thus, as this number is unsigned, it can be represented in floating point as 1.01101 (binary) × 2^5. The first four significant digits (ignoring the leading 1, which must be there for us to write the number in binary scientific notation, or floating point) are “0110”, and the exponent is 5. A-law then records the number as 0001 0110, where the first bit is the sign (0), the next three are the exponent minus four, and the last four are the significant digits.

A-law is used on European telephone systems. For voice over IP, either encoding will usually work, and most devices speak both, no matter where they are sold. The distinctions are now mostly historical.

G.711 compression preserves the number of samples, and encodes each sample independently of the others. Therefore, it is easy to figure out how the samples can be packaged into packets or blocks: they can be cut arbitrarily, and a byte is a sample. This makes the codec quite flexible for voice mobility, and it should be a preferred option.

Error concealment, or packet loss concealment (PLC), is the means by which a codec can recover from packet loss, by faking the sound at the receiver until the stream catches up. G.711 has an extension for this, known as G.711 Appendix I (sometimes written G.711I). The most trivial error concealment technique is to just play silence, which does not really conceal the error. A further technique is to repeat the last valid sample set—usually, a 10ms or 20ms packet’s worth—until the stream catches up. The problem is that, should the last sample set have held a plosive—any of the consonants that have a stop to them, like a p, d, t, k, and so on—the plosive will be repeated, producing an effect reminiscent of a quickly skipping record player or a 1980s science-fiction television character.* Appendix I states that, to avoid this effect, the previous samples should be tested for the fundamental wavelength, and then blocks of those wavelengths should be cross-faded together to produce a more seamless recovery. This is a purely heuristic scheme for error recovery, and competes, to some extent, with just repeating the last segment and then going silent.

* Max Headroom played for one season on ABC during 1987 and 1988. The title character, an artificial intelligence, was famous for stuttering electronically.

In many cases, G.711 is not even mentioned when it is being used. Instead, the codec may be referred to as PCM with µ-law or A-law encoding.
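The segment-and-interval structure of Table 2.19 is exactly a tiny floating-point format, and it is compact enough to express in code. The following Python sketch encodes one 14-bit linear sample to its µ-law byte; it follows the standard G.711 convention that the assembled codeword is complemented, which is why the loudest positive range in the table carries the left nibble 0x8 and the quietest carries 0xf. This is an illustrative sketch, not a replacement for a real G.711 implementation.

    def mulaw_encode(sample: int) -> int:
        """Encode one linear sample (-8159..8158, per Table 2.19) to a mu-law byte."""
        if sample >= 0:
            sign, magnitude = 0x00, sample
        else:
            sign, magnitude = 0x80, -sample - 1
        biased = min(magnitude, 8158) + 33    # bias pushes each segment edge to a power of two
        exponent = biased.bit_length() - 6    # segment number: biased value is in [2**(e+5), 2**(e+6))
        mantissa = (biased >> (exponent + 1)) & 0x0F   # the interval within the segment
        return ~(sign | (exponent << 4) | mantissa) & 0xFF   # complemented codeword

    # Spot checks against Table 2.19: 8158 -> 0x80, 0 -> 0xff, -1 -> 0x7f, -8159 -> 0x00.
    assert [mulaw_encode(s) for s in (8158, 0, -1, -8159)] == [0x80, 0xFF, 0x7F, 0x00]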
2.3.1.2 G.729 and Perceptual Compression

ITU G.729 and the related G.729a specify a more advanced encoding scheme, which does not work sample by sample. Rather, it uses mathematical rules to try to relate neighboring samples together. The incoming sample stream is divided into 10ms blocks (with 5ms from the next block also required), and each block is then analyzed as a unit. G.729 provides a 16:1 compression ratio, as the incoming 80 samples of 16 bits each are brought down to a ten-byte encoded block.

The idea behind G.729 is to use perceptual compression: to classify the type of signal within the 10ms block by figuring out how neighboring samples relate. Surely they do relate, because they come from the same voice at the same pitch, and pitch is a concept that requires time (and thus more than one sample). G.729 uses a couple of techniques to figure out what the block must “sound like,” so it can then throw away much of the sample data and transmit only the description of the sound.

To figure out what the sample block sounds like, G.729 uses Code-Excited Linear Prediction (CELP). The idea is that the encoder and decoder share a codebook of the basics of sounds. Each entry in the codebook can be used to generate some type of sound. G.729 maintains two codebooks: one fixed, and one that adapts with the signal. The model behind CELP is that the human voice is basically created by a simple set of flat vocal cords, which excite the airways. The airways—the mouth, tongue, and so on—are then thought of as signal filters, which have a rather specific, predictable effect on the sound coming up from the throat.

The signal is first brought in, and linear prediction is used. Linear prediction tries to relate the samples in the block to the previous samples, and finds the optimal mapping. (“Optimal” does not always mean “good,” as there is almost always an optimal way to approximate a function using a fixed number of parameters, even if the approximation is dead wrong. Recall Figure 2.9.) The excitation captures the overall type of sound, a hum or a hiss, depending on the word being said. This is usually a simple sound, an “uhhh” or “ahhh.” The linear predictor figures out how that humming gets shaped, as a simple filter. What’s left over, then, is how the sound started in the first place: the excitation that makes up the more complicated, nuanced part of speech. The linear prediction’s effects are removed, and the remaining signal is the residue, which must relate to the excitations. The nuances are looked up in the codebooks, which contain some common residues and some others that are adaptive. Together, the information needed for the linear prediction and the codebook matches are packaged into the ten-byte output block, and the encoding is complete. The encoded block contains information on the pitch of the sound, the adaptive and fixed codebook entries that best match the excitation for the block, and the linear prediction match.

On the other side, the decoding process looks up the excitations in the codebooks. These excitations get filtered through the linear predictor. The hope is that the results sound like human speech. And, often, it does.
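G.729’s actual analysis is considerably more elaborate (autocorrelations and the Levinson–Durbin recursion, conversion to line spectral pairs, and a two-codebook search), but the core step of fitting a linear predictor and keeping the residue can be sketched in a few lines. The Python sketch below is illustrative only; it assumes NumPy is available, and the block length, predictor order, and test signal are chosen for the example, not taken from the standard.

    import numpy as np

    def lp_residual(block: np.ndarray, order: int = 10):
        """Fit an order-p linear predictor to one block; return (coefficients, residue).

        Each sample is modeled as a weighted sum of the `order` samples before it.
        The residue is whatever the predictor cannot explain -- the part a CELP
        coder goes on to match against its codebooks."""
        rows = np.array([block[i - order:i] for i in range(order, len(block))])
        targets = block[order:]
        coeffs, *_ = np.linalg.lstsq(rows, targets, rcond=None)
        return coeffs, targets - rows @ coeffs

    # A toy 10 ms block at 8000 samples/s (80 samples): a decaying 200 Hz hum.
    t = np.arange(80) / 8000.0
    block = np.sin(2 * np.pi * 200.0 * t) * np.exp(-30.0 * t)
    coeffs, residue = lp_residual(block)
    print("residue energy fraction:", float(np.sum(residue**2) / np.sum(block**2)))

For a signal this predictable, the residue is essentially zero, which is the point: the predictor coefficients are cheap to send, and only the unpredictable part of the sound needs codebook treatment.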
However, anyone who has used a cellphone is aware that, at times, it can render human speech into a facsimile that sounds quite like the person talking, but is made up of no recognizable syllables. That is the sound of a CELP decoder struggling with a lossy channel, where some of the information is missing and the decoder is forced to fill in the blanks.

G.729a is an annex to G.729—a modification that uses a simpler structure to encode the signal. It is compatible with G.729, and so can be thought of as interchangeable for the purposes of this discussion.

2.3.1.3 Other Codecs

There are other voice codecs that are beginning to appear in the context of voice mobility. These codecs are not as prevalent as G.711 and G.729—some are not available in more than softphones and open-source implementations—but they are worth a paragraph here. These newer coders focus on improving error concealment, tolerating delay and jitter better, or producing a richer sound. One such example is the Internet Low Bitrate Codec (iLBC), which is used in a number of consumer-based peer-to-peer voice applications such as Skype.

Because the overhead of packets on most voice mobility networks is rather high, finding the highest amount of compression should not be the aim when establishing the network. Instead, it is better to find the codecs that are supported by the equipment and provide the highest quality of voice over the expected conditions of the network. For example, G.711 is fine in many conditions, and G.729 might not be necessary. Chapter 3 goes into some of the factors that can influence this.

2.3.2 RTP

The codec defines only how the voice is compressed and packaged. The voice still needs to be placed into well-defined packets and sent over the network. The Real-time Transport Protocol (RTP), defined in RFC 3550, defines how voice is packetized on most IP-based networks. RTP is a general-purpose framework for sending real-time streaming traffic across networks, and is used for nearly all media streaming, including voice and video, where real-time delivery is essential. RTP is usually sent over UDP, on any port that the applications negotiate. The typical RTP packet has the structure given in Table 2.20.

Table 2.20: RTP Format

    Field:   Flags    Sequence Number  Timestamp  SSRC     CSRCs              Extensions  Payload
    Length:  2 bytes  2 bytes          4 bytes    4 bytes  4 bytes × number   variable    variable
                                                           of contributors

The idea behind RTP is that the sender sends the timestamp to which the first byte of data in the payload belongs. This timestamp gives a precise time that the receiver can use to reassemble incoming data. The sequence number increases monotonically, and can also establish the order of incoming data. The SSRC, for Synchronization Source, is the stream identifier of the sender, and lets devices with multiple streams coming in figure out who is sending. The CSRCs, for Contributing Sources, are other devices that may have contributed to the packet, such as when a conference call has multiple talkers at once.

The most important fields are the timestamp (see Table 2.20) and the payload type (see Table 2.21). The payload type field usually specifies the type of codec being used in the stream.

Table 2.21: The RTP Flags Field

    Field:  Version  Padding  Extension (X)  Contributor Count (CC)  Marker  Payload Type (PT)
    Bit:    0–1      2        3              4–7                     8       9–15

Table 2.22 shows the most common voice RTP types. Numbers of 96 and above are allowed, and are usually set up by the endpoints to carry some dynamic stream.

Table 2.22: Common RTP Packet Types

    Payload Type  Encoded Name  Meaning
    0             PCMU          G.711 with µ-law
    3             GSM           GSM
    8             PCMA          G.711 with A-law
    18            G729          G.729 or G.729a
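The fixed header in Tables 2.20 and 2.21 is easy to take apart with ordinary bit operations. Here is a minimal Python sketch of a parser; it assumes a well-formed packet and, for brevity, ignores the optional header extension and padding that the X and P bits would announce.

    import struct

    def parse_rtp(packet: bytes):
        """Unpack the 12-byte fixed RTP header plus any CSRC list (Table 2.20)."""
        flags, seq, timestamp, ssrc = struct.unpack("!HHII", packet[:12])
        version      = flags >> 14           # bits 0-1
        padding      = (flags >> 13) & 0x1   # bit 2
        extension    = (flags >> 12) & 0x1   # bit 3
        csrc_count   = (flags >> 8) & 0xF    # bits 4-7
        marker       = (flags >> 7) & 0x1    # bit 8
        payload_type = flags & 0x7F          # bits 9-15
        csrc_end = 12 + 4 * csrc_count
        csrcs = struct.unpack("!%dI" % csrc_count, packet[12:csrc_end])
        return {"version": version, "pt": payload_type, "seq": seq,
                "timestamp": timestamp, "ssrc": ssrc, "csrcs": csrcs,
                "marker": marker, "payload": packet[csrc_end:]}

A receiver reorders and schedules playback using the sequence number and timestamp, and sorts out concurrent streams by SSRC, exactly as described above.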
When the codec’s output is packaged into RTP, it is done so as to avoid both splitting necessary information across packets and sending too many packets per second. For G.711, an RTP packet can be created with as many samples as desired for the given packet rate. Common values are 20ms and 30ms. Decoders know to append the samples across packets as if they were in one stream. For G.729, the RTP packet must come in 10ms multiples, because G.729 only encodes 10ms blocks. An RTP packet with G.729 can have multiple blocks, and the decoder knows to treat each block separately and sequentially. G.729 phones commonly stream with RTP packets holding 20ms or more, to avoid having too many packets in the network.

2.3.2.1 Secure RTP

RTP itself has a security option, designed to allow the contents of the RTP stream to be protected while still allowing the quick reassembly of a stream and the robustness of allowing parts of the stream to be lost on the network. Secure RTP (SRTP) uses the Advanced Encryption Standard (AES) to encrypt the packets. (AES will later have a starring role in Wi-Fi encryption, as well as for use with IPsec.) The SRTP stream requires a key to be established. Each packet is then encrypted with AES running in counter mode, a mode in which intervening packets can be lost without disrupting the decryptability of subsequent packets in the sequence. Integrity of the packets is ensured by an HMAC-SHA1 keyed message digest on each packet. How the SRTP stream gets its keys is not specified by SRTP itself. However, SIPS provides a quite logical way for this to be set up. The next section will discuss how this key exchange works.
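The property that matters on lossy networks, that each packet decrypts on its own, comes from how counter mode builds its keystream. The Python sketch below shows the shape of SRTP-style protection, assuming the third-party cryptography package is installed; the counter block here is faked from the packet index alone, whereas real SRTP derives it from a session salt, the SSRC, and the index, and adds a proper key-derivation function. Treat it as a sketch of the idea, not of the SRTP bit layout.

    import hmac, hashlib, struct
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def protect(index: int, payload: bytes, enc_key: bytes, auth_key: bytes) -> bytes:
        """Encrypt one packet, SRTP-style: AES counter mode plus a 32-bit HMAC tag."""
        # Toy counter block: the packet index sits in the middle, and the low
        # bytes stay zero so AES-CTR can count through the packet's own blocks.
        counter_block = struct.pack("!4xQ4x", index)
        encryptor = Cipher(algorithms.AES(enc_key), modes.CTR(counter_block)).encryptor()
        ciphertext = encryptor.update(payload) + encryptor.finalize()
        tag = hmac.new(auth_key, ciphertext, hashlib.sha1).digest()[:4]  # the "_32" suffix
        return ciphertext + tag

    # The keystream for packet 7 depends only on 7, so it decrypts correctly even
    # if packets 5 and 6 never arrive -- the robustness the text describes.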
2.3.3 SDP and Codec Negotiations

RTP only carries the voice; there must also be some associated way to signal which codecs each end supports. This is fundamentally a property of signaling but, unlike call progress messages and advanced PBX features, it is tied specifically to the bearer channel. SIP (see Section 2.2.1) uses SDP to negotiate codecs and RTP endpoints, including transports, port numbers, and every other aspect necessary to start RTP streams flowing. SDP, defined in RFC 4566, is a text-based protocol, as SIP itself is, for setting up the various legs of media streams. Each line represents a different piece of information, in the format type=value.

Table 2.23: Example of an SDP Description

    v=0
    o=7010 1352822030 1434897705 IN IP4 192.168.0.10
    s=A_conversation
    c=IN IP4 192.168.0.10
    t=0 0
    m=audio 9000 RTP/AVP 0 8 18
    a=rtpmap:0 PCMU/8000/1
    a=rtpmap:8 PCMA/8000/1
    a=rtpmap:18 G729/8000/1
    a=ptime:20

Table 2.23 shows an example of an SDP description. This description is for a phone at IP address 192.168.0.10, which wishes to receive RTP on UDP port 9000. Let’s go through each of the fields.

•  Type “v” represents the protocol version, which is 0.

•  Type “o” holds information about the originator of this request, and the session IDs. Specifically, it is divided into the username, session ID, session version, network type, address type, and address. “7010” happens to be the dialing phone’s number. The two large numbers afterward are identifiers, to keep the SDP exchanges straight. The “IN” refers to the address being an Internet protocol address; specifically, “IP4” for IPv4, with the address “192.168.0.10”. This is where the originator is.

•  Type “s” is the session name. The value given here, “A_conversation”, is not particularly meaningful.

•  Type “c” specifies how the originator is to be reached—its connection data. This is a repetition of the IP address and type specifications for the phone.

•  Type “t” is the timing for the leg of the call. The first “0” represents the start time, and the second represents the end time. Therefore, there are no particular timing bounds for this call.

•  The “m” line specifies the media needed. In this case, as with most voice calls, there is only one voice stream from the device, so there is only one media line. The parameters that follow are the media type, port, application, and then, for RTP, the list of RTP types. This call is an “audio” call, and the phone will be listening on port 9000. This is a UDP port, because the application is “RTP/AVP”, meaning that it is plain RTP. (“AVP” means that this is standard UDP with no encryption. There is an “RTP/SAVP” option, mentioned shortly.) Finally, the RTP formats the phone can take are 0, 8, and 18, as specified in Table 2.22.

•  The next three lines describe the supported codecs in detail. The “a” field specifies an attribute. The “a=rtpmap” attribute means that the sender wants to map RTP packet types to specific codec setups. The line is formatted as packet type, then encoded name/bitrate/parameters. In the first line, RTP packet type “0” is mapped to “PCMU” at 8000 samples per second. The default mapping of “0” is already PCM (G.711) with µ-law, so the new information is the sample rate. The second line asks for A-law, mapping it to 8. The third line asks for G.729, mapping it to 18. Because the phone listed only those three types, those are the only types it supports.

•  The last line is also an attribute. “a=ptime” requests that the other party send 20ms packets. The other party is not required to honor this request, as it is only a suggestion. However, it is a pretty good sign that the sender of the SDP message will itself send at 20ms.

The setup message in Table 2.23 was originally given in a SIP INVITE message. The responding SIP OK message from the other party gave its own SDP settings. Table 2.24 shows this example response.

Table 2.24: Example of an SDP Responding Description

    v=0
    o=root 10871 10871 IN IP4 10.0.0.10
    s=session
    c=IN IP4 10.0.0.10
    t=0 0
    m=audio 11690 RTP/AVP 0 3 8 101
    a=rtpmap:0 PCMU/8000
    a=rtpmap:3 GSM/8000
    a=rtpmap:8 PCMA/8000
    a=rtpmap:101 telephone-event/8000
    a=fmtp:101 0-16
    a=silenceSupp:off - - - -
    a=ptime:20
    a=sendrecv

Here, the other party, at IP address 10.0.0.10, wants to receive on UDP port 11690 an RTP stream using any of the three codecs PCMU, GSM, and PCMA. It can also receive a format known as “telephone-event.” This corresponds to the RTP payload format for sending dialed digits while in the middle of a call (RFC 4733). Some codecs, like G.729, can’t carry a dialed digit as the usual audio beep, because the beep gets distorted by the codec. Instead, the digits have to be sent over RTP, embedded in the stream. The sender of this SDP is stating that it supports this format, and would like it sent as RTP type 101, a dynamic type that the sender was allowed to choose without restriction. Corresponding to this is the attribute “a=fmtp”, which applies to this digit type 101. “fmtp” lines don’t mean anything specific to SDP; instead, the request of “0-16” gets forwarded to the telephone-event protocol handler. It is not necessary to go into further detail here on what “0-16” means. The “a=silenceSupp” line would activate silence suppression, in which packets are not sent when the caller is not talking; here, however, silence suppression has been disabled. Finally, the “a=sendrecv” line means that the originator can both send and receive streaming packets, meaning that the caller can both talk and listen. Some calls are intentionally one-way, such as lines into a voice conference where the listeners cannot speak. In that case, the listeners may have requested a flow with “a=recvonly”.
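Because SDP is plain type=value text, extracting what matters for the bearer channel (the port, and the payload-type-to-codec map) takes only a few lines. The following Python sketch is illustrative and deliberately ignores most of SDP: multiple media lines, connection data, and attributes other than rtpmap.

    def parse_sdp_audio(sdp: str):
        """Return (rtp_port, {payload type: codec name}) from one SDP body."""
        static_names = {0: "PCMU", 3: "GSM", 8: "PCMA", 18: "G729"}  # Table 2.22
        port, codecs = None, {}
        for line in sdp.strip().splitlines():
            key, _, value = line.strip().partition("=")
            if key == "m" and value.startswith("audio"):
                fields = value.split()        # e.g. "audio 11690 RTP/AVP 0 3 8 101"
                port = int(fields[1])
                codecs = {int(pt): static_names.get(int(pt), "dynamic")
                          for pt in fields[3:]}
            elif key == "a" and value.startswith("rtpmap:"):
                pt, name = value[len("rtpmap:"):].split(None, 1)
                codecs[int(pt)] = name        # e.g. "telephone-event/8000"
        return port, codecs

    # Fed the body of Table 2.24, this returns:
    #   (11690, {0: 'PCMU/8000', 3: 'GSM/8000', 8: 'PCMA/8000',
    #            101: 'telephone-event/8000'})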
After a device gets an SDP request, it knows enough to send an RTP stream back to the requester. The receiver need only choose which media type it wishes to use. There is no requirement that both parties use the same codec; rather, if the receiver cannot handle the codec, the higher-layer signaling protocol needs to reject the setup. With SIP, the called party will not usually stream until it accepts the SIP INVITE, but there is no further handshaking necessary once the call is answered and there are packets to send.

For SRTP usage with SIPS, SDP allows the SRTP key to be specified using a special attribute:

    a=crypto:1 AES_CM_128_HMAC_SHA1_32 ⇒
      inline:c3bFaGA+Seagd117041az3g113geaG54aKgd50Gz

This specifies that the SRTP AES counter mode with HMAC-SHA1 is to be used, and gives the key, encoded in base-64, that is to be used. Both sides of the call send their own randomly generated keys, under the cover of the TLS-protected link. This forms the basis of RTP/SAVP.
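Putting the negotiation together: given the codec map parsed from the far end’s SDP, the receiver’s choice is a simple preference walk. Here is a sketch, with the preference order being purely this phone’s own policy; nothing in SDP mandates one.

    def choose_codec(offered: dict, preferences=("PCMU", "PCMA", "G729")):
        """Pick a codec from {payload type: name}, as from parse_sdp_audio above.

        Returns (payload type, name), or None, in which case the signaling
        layer (SIP, here) should reject the call setup."""
        for wanted in preferences:
            for pt, name in offered.items():
                if name.split("/")[0] == wanted:  # drop the "/8000" clock-rate suffix
                    return pt, name
        return None

    # With the offer of Table 2.24 this yields (0, 'PCMU/8000'): both ends happen
    # to land on G.711 with mu-law, though they were not obligated to agree.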
CHAPTER 3

Elements of Voice Quality

3.0 Introduction

This chapter examines the factors that go into voice quality. First, we will look at how voice quality was originally measured, and introduce the necessary measurement metrics (MOS and R-value). After that, we will move on to describing the basis for repeatable, objective metrics, through a variety of models that start out taking actual voice samples into account, but then turn into guidelines and formulas about loss and delay that can be used to predict network quality. Keep in mind that the point of the chapter is not to substitute for thousand-page telephony guidelines, but to introduce the reader to the basics of what it takes, and to argue that perhaps—with mobility—the exactitude typically expected in static voice deployments is not that useful.

3.1 What Voice Quality Really Means

Chapter 2 laid the groundwork for how voice is carried. But what makes some phone calls sound better than others? Why do some voice mobility networks sound tinny and robotic, where others sound natural and clear? In voice, there are two ways to look at voice quality: gather a number of people and survey them about the quality of the call, or try to use some sort of electronic measurement and deduce, from there, what a real person might think.

3.1.1 Mean Opinion Score and How It Sounds

The Mean Opinion Score, or MOS (sometimes redundantly called the MOS score), is one way of ranking the quality of a phone call. This score is set on a five-point scale, according to the following ranking:

    5. Excellent
    4. Good
    3. Fair
    2. Poor
    1. Bad

MOS never goes below 1, or above 5.

There is quite a science to establishing how to measure MOS based on real-world human studies, and the depth these studies go into is astounding. ITU P.800 lays out procedures for measuring MOS. Annex B of P.800 defines listening tests to determine quality in an absolute manner. The test requirements are spelled out in detail. The room to be used should be between 30 and 120 cubic meters, to ensure that the echo remains within known values. The phone under test is used to record a series of phrases. The listeners are brought in, having been selected from a group that has never heard the recorded sentence lists, in order to avoid bias. The listeners are asked to mark the quality of the played-back speech, distorted as it may be by the phone system. The listeners’ scores, on the one-to-five scale, are averaged, and this average becomes the MOS for the system. The goal of all of this is to increase the repeatability of such experiments.

Clearly, performing MOS tests is not something that one would imagine can be done for most voice mobility networks. However, the MOS scale is so well known that the 1 to 5 scale is used as the standard yardstick for all voice quality metrics. The most important rule of thumb for the MOS scale is this: a MOS of 4.0 or better is toll-quality. This is the quality that voice mobility networks have to achieve, because this is the quality that nonmobility voice networks provide every day. Users will likely offer forgiveness when a problem is well known and entirely relatable, such as for bad-quality calls when in a poor cellular coverage area. But, once inside the building, enterprise voice mobility users expect the same quality wirelessly as they do when using their desk phones. Thus, when a device reports the MOS for a call, the number you are seeing has been generated electronically, based on formulas that are thought to be reasonable facsimiles of the human experience.

3.1.2 PESQ: How to Predict MOS Using Mathematics

Therefore, we turn to how the predictions of voice quality can actually be made electronically. ITU P.862 introduces Perceptual Evaluation of Speech Quality, the PESQ metric. PESQ is designed to take into account all aspects of voice quality, from the distortion of the codecs themselves to the effects of filtering, delay variation, and dropouts or strange distortions. PESQ was verified with a number of real MOS experiments to make sure that the numbers are reasonable within the range of normal telephone voices. PESQ is measured on a 1 to 4.5 scale, aligning exactly with the 1 to 5 MOS scale, in the sense that a 1 is a 1, a 2 is a 2, and so on. (The area from 4.5 to 5 in PESQ is not addressed.) PESQ is designed to take into account many different factors that alter the perception of the quality of voice. The basic concept of PESQ is to have a piece of software or test measurement equipment compare two versions of a recording: the original one and the one distorted by the phone system.