Fundamental to TCP’s timeout and retransmission procedures is how to set the RTO based upon measurement of the RTT experienced on a given connection.
If TCP retransmits a segment before one RTT has elapsed, it may be injecting duplicate traffic into the network unnecessarily. Conversely, if it waits much longer than one RTT before retransmitting, the overall network utilization (and single-connection throughput) drops when traffic is lost. Knowing the RTT is made more complicated because it can change over time, as routes and network usage vary. TCP must track these changes and modify its timeout accordingly in order to maintain good performance.
Because TCP sends acknowledgments when it receives data, it is possible to send a byte with a particular sequence number and measure the time required to receive an acknowledgment that covers that sequence number. Each such measurement is called an RTT sample. The challenge for TCP is to establish a good estimate for the range of RTT values given a set of samples that vary over time.
The second step is how to set the RTO based on these values. Getting this “right” is very important for TCP’s performance.
The RTT is estimated for each TCP connection separately, and one retransmission timer is pending whenever any data is in flight that consumes a sequence number (including SYN and FIN segments). The proper way to set this timer has been a subject of research for years, and improvements are made on an occasional basis. In this section, we will explore some of the more important milestones in the evolution of the method used to compute the RTO. We begin with the first (“classic”) method, as detailed in [RFC0793].
14.3.1 The Classic Method
The original TCP specification [RFC0793] had TCP update a smoothed RTT estimator (called SRTT) using the following formula:
SRTT ← α(SRTT) + (1 − α)(RTTs)
Here, SRTT is updated based on both its existing value and a new sample, RTTs. The constant α is a smoothing or scale factor with a recommended value between 0.8 and 0.9. SRTT is updated every time a new measurement is made.
With the original recommended value for α, it is clear that 80% to 90% of each new estimate is from the previous estimate and 10% to 20% is from the new measurement. This type of average is also known as an exponentially weighted moving average (EWMA) or low-pass filter. It is convenient for implementation reasons because it requires only one previous value of SRTT to be stored in order to keep the running estimate.
Given the estimator SRTT, which changes as the RTT changes, [RFC0793] recommended that the RTO be set to the following:
RTO = min(ubound, max(lbound, (SRTT)β))
where β is a delay variance factor with a recommended value of 1.3 to 2.0, ubound is an upper bound (suggested to be, e.g., 1 minute), and lbound is a lower bound (suggested to be, e.g., 1s) on the RTO. We shall call this assignment procedure the classic method. It generally results in the RTO being set either to 1s, or to about twice SRTT. For relatively stable distributions of the RTT, this was adequate. However, when TCP was run over networks with highly variable RTTs (e.g., early packet radio networks), it did not perform so well.
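As an illustration only (this is not code from [RFC0793] or any particular implementation), the classic calculation can be sketched in C. The variable names, the use of floating point, and the choices of α = 0.875 and β = 2.0 are assumptions within the recommended ranges:

```c
/* Sketch of the classic [RFC0793] RTO computation (illustrative only).
 * ALPHA and BETA are example values from the recommended ranges; early
 * implementations used fixed-point arithmetic rather than doubles. */
#include <stdio.h>

#define ALPHA  0.875   /* smoothing factor, recommended 0.8 to 0.9         */
#define BETA   2.0     /* delay variance factor, recommended 1.3 to 2.0    */
#define UBOUND 60.0    /* upper bound on the RTO, e.g., 1 minute (seconds) */
#define LBOUND 1.0     /* lower bound on the RTO, e.g., 1 second           */

static double srtt_classic;   /* smoothed RTT estimator, SRTT (seconds) */

/* Update SRTT with a new sample and return the classic RTO. */
static double classic_rto(double rtt_sample)
{
    srtt_classic = ALPHA * srtt_classic + (1.0 - ALPHA) * rtt_sample;
    double rto = BETA * srtt_classic;
    if (rto < LBOUND) rto = LBOUND;
    if (rto > UBOUND) rto = UBOUND;
    return rto;
}

int main(void)
{
    double samples[] = { 0.5, 0.6, 0.4, 2.0, 0.5 };  /* example RTT samples */
    srtt_classic = samples[0];                        /* seed the estimator  */
    for (int i = 0; i < 5; i++)
        printf("sample = %.2fs  RTO = %.2fs\n", samples[i], classic_rto(samples[i]));
    return 0;
}
```

Note how the single large sample (2.0s) moves the RTO only modestly; this slow adaptation to wide RTT fluctuations is the behavior criticized in the next subsection.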
14.3.2 The Standard Method
In [J88], Jacobson detailed problems with the classic method further—basically, that the timer specified by [RFC0793] cannot keep up with wide fluctuations in the RTT (and in particular, it causes unnecessary retransmissions when the real RTT is much larger than expected). Unnecessary retransmissions add load to the network at a time when the network is already loaded, as indicated by the increasing sample RTTs.
To address this problem, the method used to assign the RTO was enhanced to accommodate a larger variability in the RTT. This is accomplished by keeping track of an estimate of the variability in the RTT measurements in addition to the estimate of its average. Setting the RTO based on both a mean and a variability estimator provides a better timeout response to wide fluctuations in the round-trip times than just calculating the RTO as a constant multiple of the mean.
Figures 5 and 6 in [J88] show a comparison of the [RFC0793] RTO values for some actual round-trip times, versus the RTO calculations we show next, which take into account the variability of the round-trip times. If we think of the RTT measurements made by TCP as samples of a statistical process, estimating both the mean and variance (or standard deviation) helps to make better predictions about the possible future values the process may take on. A good prediction for the range of possible values for the RTT helps TCP determine an RTO that is neither too large nor too small in most cases.
As described by Jacobson, the mean deviation is a good approximation to the standard deviation, but it is easier and faster to compute. Calculating the standard deviation requires executing a square root mathematical operation on the variance, which was considered to be too expensive for a fast TCP implementation.
(This is not the whole story, really. See the fascinating history of “the debate” in [G04].) We therefore need running estimates of both the average and the mean deviation. This leads to the following equations, which are applied to each RTT measurement M (called RTTs earlier):
srtt ← (1 − g)(srtt) + (g)M
rttvar ← (1 − h)(rttvar) + (h)(|M − srtt|)
RTO = srtt + 4(rttvar)
Here, the value srtt effectively replaces the earlier value of SRTT, and the value rttvar, which becomes an EWMA of the mean deviation, is used instead of β to help determine the RTO. This set of equations can also be written in a form that requires a smaller number of operations when implemented on a conventional computer:
Err = M − srtt
srtt ← srtt + g(Err)
rttvar ← rttvar + h(|Err| − rttvar)
RTO = srtt + 4(rttvar)
As suggested, srtt is the EWMA for the mean and rttvar is the EWMA for the absolute error, |Err|. Err is the difference between the measured value M and the current RTT estimator srtt. Both srtt and rttvar are used to calculate the RTO, which varies over time. The gain g is the weight given to a new RTT sample M in the average srtt and is set to 1/8. The gain h is the weight given to a new mean deviation sample (absolute difference of the new sample M from the running average srtt) for the deviation estimate rttvar and is set to 1/4. The larger gain for the deviation makes the RTO go up faster when the RTT changes. The values for g and h are chosen as (negative) powers of 2, allowing the overall set of computations to be implemented in a computer using fixed-point integer arithmetic with shift and add operations instead of multiplies and divides.
Note
[J88] specified 2 * rttvar in the calculation of RTO, but after further research, [J90] changed the value to 4 * rttvar, which is what appeared in the BSD Net/1 implementation and ultimately in the standard [RFC6298].
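As an illustration (not code from BSD, Linux, or any other implementation; the scaled-variable convention and names are assumptions), the fixed-point form of these updates can be sketched as follows, with srtt kept scaled by 8 and rttvar by 4 so that the gains become right shifts:

```c
/* Sketch of the standard-method estimator using shift-and-add integer
 * arithmetic (illustrative only).  srtt8 holds 8*srtt and rttvar4 holds
 * 4*rttvar, so the gains g = 1/8 and h = 1/4 become right shifts.
 * Times are in milliseconds. */
#include <stdio.h>

static long srtt8;    /* 8 * srtt   (ms) */
static long rttvar4;  /* 4 * rttvar (ms) */

/* Feed one RTT measurement m and return the resulting RTO (ms). */
static long update_rto(long m)
{
    long err = m - (srtt8 >> 3);      /* Err = M - srtt                       */
    srtt8 += err;                     /* srtt <- srtt + g(Err), g = 1/8       */
    if (err < 0)
        err = -err;                   /* |Err|                                */
    rttvar4 += err - (rttvar4 >> 2);  /* rttvar <- rttvar + h(|Err| - rttvar) */
    return (srtt8 >> 3) + rttvar4;    /* RTO = srtt + 4(rttvar)               */
}

int main(void)
{
    long first = 500;                 /* first RTT sample (ms), example value */
    srtt8 = first << 3;               /* srtt   <- M  (see Section 14.3.2.2)  */
    rttvar4 = (first / 2) << 2;       /* rttvar <- M/2                        */
    long samples[] = { 500, 520, 480, 900, 500 };
    for (int i = 0; i < 5; i++)
        printf("M = %ldms  RTO = %ldms\n", samples[i], update_rto(samples[i]));
    return 0;
}
```

Because rttvar reacts with the larger gain h, a sudden jump in the RTT (the 900ms sample above) raises the RTO much more quickly than the classic method would.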
Comparing the classic method with Jacobson’s, we see that the calculations of the RTT average are similar (α is 1 minus the gain g) but a different gain is used.
Also, Jacobson’s calculation of the RTO depends on both the smoothed RTT and the smoothed deviation, whereas the classic method used a simple multiple of the smoothed RTT. This is the basis for the way many TCP implementations compute their RTOs to this day, and because of its adoption as the basis for [RFC6298] we shall call it the standard method, although there are slight refinements in [RFC6298], which we shall now discuss.
14.3.2.1 Clock Granularity and RTO Bounds
TCP has a continuously running “clock” that is used when taking RTT measurements. As with initial sequence numbers, real TCP connections do not start their clocks at zero, and the clock does not have infinite precision. Rather, the TCP clock is usually the value of a variable that is updated as the system clock advances, not necessarily one-for-one. The length of the TCP clock’s “tick” is called its granularity. Traditionally, this value was relatively large (about 500ms), but more recent implementations use finer-granularity clocks (e.g., 1ms for Linux).
The granularity can affect the details of making RTT measurements and also how the RTO is set. In [RFC6298], the granularity is used to refine how updates to the RTO are made. In addition, a lower bound is placed on the RTO. The equation used is as follows:
RTO = max(srtt + max(G, 4(rttvar)), 1000)
where G is the timer granularity and 1000ms represents a lower bound on the total RTO (recommended by rule (2.4) of [RFC6298]). Consequently, the RTO is always at least 1s. An optional upper bound is also allowed, provided it has a value of at least 60s.
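As a brief illustration (a sketch only; the millisecond units, the assumed 1ms granularity, and the function name are not taken from the RFC text):

```c
/* Applying the [RFC6298] refinements: account for the clock granularity G
 * and enforce the 1s lower bound (times in ms; illustrative sketch only). */
#include <stdio.h>

#define G_MS    1       /* assumed timer granularity, e.g., 1ms for Linux */
#define RTO_MIN 1000    /* lower bound from rule (2.4) of [RFC6298]       */
#define RTO_MAX 60000   /* optional upper bound; must be at least 60s     */

static long rfc6298_rto(long srtt, long rttvar)
{
    long var = 4 * rttvar;
    long rto = srtt + (var > G_MS ? var : G_MS);  /* srtt + max(G, 4(rttvar)) */
    if (rto < RTO_MIN) rto = RTO_MIN;
    if (rto > RTO_MAX) rto = RTO_MAX;
    return rto;
}

int main(void)
{
    printf("RTO = %ldms\n", rfc6298_rto(100, 10));   /* small RTT: clamped up to 1000ms */
    printf("RTO = %ldms\n", rfc6298_rto(800, 120));  /* 800 + 480 = 1280ms              */
    return 0;
}
```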
14.3.2.2 Initial Values
We have seen how the estimators are updated as time progresses, but we also need to know how to set their initial values. Before the first SYN exchange, TCP has no good idea what value to use for setting the initial RTO. It also does not know what to use as the initial values for its estimators, unless the system has provided hints at this information (some systems cache this information in the forwarding table; see Section 14.9). According to [RFC6298], the initial setting for the RTO should be 1s, although 3s is used in the event of a timeout on the initial SYN segment. When the first RTT measurement M is received, the estimators are initialized as follows:
srtt ← M
rttvar ← M/2
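A minimal sketch of this initialization (times in milliseconds; the variable and function names are assumptions):

```c
/* Initialization of the estimators per [RFC6298] (illustrative only). */
#include <stdio.h>

static long srtt, rttvar;
static long rto = 1000;          /* before any sample: 1s (3s after a timeout on the initial SYN) */

static void first_rtt_sample(long m)
{
    srtt   = m;                  /* srtt   <- M   */
    rttvar = m / 2;              /* rttvar <- M/2 */
    rto    = srtt + 4 * rttvar;  /* later samples use the standard update and the bounds of 14.3.2.1 */
}

int main(void)
{
    first_rtt_sample(600);       /* example: first measured RTT of 600ms */
    printf("srtt = %ldms  rttvar = %ldms  RTO = %ldms\n", srtt, rttvar, rto);
    return 0;
}
```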
We now have enough detail to see how the estimators are initialized and maintained. The procedures depend on obtaining RTT samples, which would appear to be straightforward. We now look at why this might not always be the case.
14.3.2.3 Retransmission Ambiguity and Karn’s Algorithm
A problem measuring an RTT sample can occur when a packet is retransmitted.
Say a packet is transmitted, a timeout occurs, the packet is retransmitted, and an acknowledgment is received for it. Is the ACK for the first transmission or the second? This is an example of the retransmission ambiguity problem. It happens because unless the Timestamps option is being used, an ACK provides only the ACK number with no indication of which copy (e.g., first or second) of a sequence number is being ACKed.
The paper [KP87] specifies that when a timeout and retransmission occur, we cannot update the RTT estimators when the acknowledgment for the retransmitted data finally arrives. This is the “first part” of Karn’s algorithm. It eliminates the acknowledgment ambiguity problem by removing the ambiguity for purposes of computing the RTT estimate. It is a requirement in [RFC6298].
If we were to simply ignore retransmitted segments entirely when setting the RTO, however, we would be failing to take into account some useful information being provided by the network (i.e., that it is probably experiencing some form of inability to deliver packets quickly). In such cases, it would be beneficial to reduce the load on the network by decreasing the retransmission rate, at least until packets are no longer being lost. This reasoning is the basis for the exponential backoff behavior we saw in Figure 14-1.
TCP applies a backoff factor to the RTO, which doubles each time a subsequent retransmission timer expires. Doubling continues until an acknowledgment is received for a segment that was not retransmitted. At that time, the backoff factor is set back to 1 (i.e., the binary exponential backoff is canceled), and the retransmission timer returns to its normal value. Doubling the backoff factor on subsequent retransmissions is the “second part” of Karn’s algorithm. Note that when TCP times out, it also invokes congestion control procedures that alter its sending rate. (Congestion control is discussed in detail in Chapter 16.) Karn’s algorithm, then, really consists of two parts. As quoted directly from the 1987 paper [KP87]:
When an acknowledgement arrives for a packet that has been sent more than once (i.e., is retransmitted at least once), ignore any round-trip measurement based on this packet, thus avoiding the retransmission ambiguity problem. In addition, the backed-off RTO for this packet is kept for the next packet. Only when it (or a succeeding packet) is acknowledged without an intervening retransmission will the RTO be recalculated from SRTT.
This algorithm has been a required procedure in a TCP implementation for some time (since [RFC1122]). There is an exception, however, when the TCP Timestamps option is being used (see Chapter 13). In that case, the acknowledgment ambiguity problem can be avoided and the first part of Karn’s algorithm does not apply.
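Both parts of the algorithm can be sketched as follows (a simplified illustration with assumed names and state, not an actual implementation; the RTT-estimator update itself is elided):

```c
/* Simplified sketch of Karn's algorithm (illustrative only).
 * First part : do not take an RTT sample from the ACK of a segment that
 *              was retransmitted (unless Timestamps remove the ambiguity).
 * Second part: back off the RTO on each retransmission timeout, and keep
 *              the backed-off value until data that was not retransmitted
 *              is acknowledged. */
#include <stdbool.h>
#include <stdio.h>

#define RTO_MAX_MS 60000

static long rto = 1000;     /* current base RTO (ms)      */
static int  backoff = 1;    /* binary exponential backoff */

/* Called when the retransmission timer expires; returns the timer value to use. */
static long on_timeout(void)
{
    if (backoff < 64)
        backoff *= 2;                       /* double the backoff factor */
    long backed_off = rto * backoff;
    return backed_off < RTO_MAX_MS ? backed_off : RTO_MAX_MS;
}

/* Called when an ACK arrives for a segment originally sent at send_ms. */
static void on_ack(bool was_retransmitted, bool timestamps_in_use,
                   long now_ms, long send_ms)
{
    if (!was_retransmitted || timestamps_in_use) {
        long m = now_ms - send_ms;          /* a valid RTT sample */
        (void)m;                            /* ...update srtt/rttvar and recompute rto (Section 14.3.2) */
    }
    if (!was_retransmitted)
        backoff = 1;                        /* ACK of non-retransmitted data cancels the backoff */
}

int main(void)
{
    printf("timer after 1st timeout = %ldms\n", on_timeout());  /* 2000ms */
    printf("timer after 2nd timeout = %ldms\n", on_timeout());  /* 4000ms */
    on_ack(true,  false, 9000, 4000);   /* ACK of a retransmission: no sample, backoff kept */
    on_ack(false, false, 9600, 9100);   /* ACK of new data: sample taken, backoff canceled  */
    printf("backoff factor = %d\n", backoff);                   /* back to 1 */
    return 0;
}
```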
14.3.2.4 RTT Measurement (RTTM) with the Timestamps Option
The TCP Timestamps option (TSOPT), in addition to providing a basis for the PAWS algorithm we saw in Chapter 13, can be used for round-trip time measurement (RTTM) [RFC1323]. The basic format of the TSOPT was described in Chapter 13. It allows the sender to include a 32-bit number in a TCP segment that is returned in a corresponding acknowledgment.
The timestamp value (TSV) is carried in the TSOPT of the initial SYN and returned in the TSER part of the TSOPT in the SYN + ACK, which is how the initial values for srtt, rttvar, and RTO are determined. Because the initial SYN “counts” as data (i.e., it is retransmitted if lost and consumes a sequence number), its RTT is measured. TSOPTs are also carried in other segments, so the connection’s RTT can be estimated on an ongoing basis. This seems straightforward enough but is made more complex because TCP does not always provide an ACK for each segment it receives. For example, TCP often provides one ACK for every other segment (see Chapter 15) when large volumes of data are transferred. In addition, when data is lost, reordered, or successfully retransmitted, the cumulative ACK mechanism of TCP means that there is not necessarily any fixed correspondence between a segment and its ACK. To handle these challenges, TCPs that use this option (most of them today—Linux and Windows included) employ the following algorithm for taking RTT samples (a simplified sketch in code follows the list):
1. The sending TCP includes a 32-bit timestamp value in the TSV portion of the TSOPT in each TCP segment it sends. This field contains the value of the sender’s TCP “clock” when the segment is transmitted.
2. A receiving TCP keeps track of the received TSV value to send in the next ACK it generates (in a variable typically named TsRecent) and the ACK number in the last ACK that it sent (in a variable named LastACK). Recall that ACK numbers represent the next in-order sequence number the receiver (i.e., sender of the ACK) expects to see.
3. When a new segment arrives, if it contains the sequence number matching the value in LastACK (i.e., it is the next expected segment), the segment’s TSV is saved in TsRecent.
4. Whenever the receiver sends an ACK, a TSOPT is included such that the timestamp value contained in TsRecent is placed in the TSER part of the TSOPT in the ACK.
5. A sender receiving an ACK that advances its window subtracts the TSER from its current TCP clock and uses the difference as a sample value to update its RTT estimators.
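A much-simplified sketch of this bookkeeping follows (the function and variable names are assumptions; sequence-number and timestamp wraparound, delayed ACKs, and many other details of a real implementation are omitted):

```c
/* Simplified sketch of RTT sampling with the Timestamps option
 * (illustrative only; wraparound and many real-world details omitted). */
#include <stdint.h>
#include <stdio.h>

/* Receiver-side state (steps 2-4 of the list above). */
static uint32_t ts_recent;   /* TSV to echo in the next ACK (TsRecent)    */
static uint32_t last_ack;    /* ACK number in the last ACK sent (LastACK) */

/* Step 3: on arrival of a segment, save its TSV if it is the next expected one. */
static void on_segment(uint32_t seq, uint32_t tsv)
{
    if (seq == last_ack)
        ts_recent = tsv;
}

/* Step 4: when an ACK is generated, place TsRecent in the TSER field. */
static uint32_t tser_for_ack(uint32_t ack_num)
{
    last_ack = ack_num;
    return ts_recent;
}

/* Step 5 (sender side): an ACK that advances the window yields an RTT sample. */
static long rtt_sample(uint32_t clock_now, uint32_t tser)
{
    uint32_t diff = clock_now - tser;   /* sender's clock minus the echoed TSV */
    return (long)diff;
}

int main(void)
{
    last_ack = 1000;                    /* receiver expects sequence number 1000     */
    on_segment(1000, 5000);             /* in-order segment, stamped at clock 5000   */
    uint32_t tser = tser_for_ack(1500); /* ACK covering through 1499 echoes TSV 5000 */
    printf("RTT sample = %ldms\n", rtt_sample(5037, tser));  /* 37ms with a 1ms clock */
    return 0;
}
```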
Timestamps are enabled by default in FreeBSD and Linux, and later versions of Windows enable them in response to systems that use them. In Linux, the system configuration variable net.ipv4.tcp_timestamps dictates whether or not they are used
(value 0 for not used, value 1 for used). In Windows, their use is controlled by the Tcp1323Opts value in the registry area mentioned earlier. If it has the value 0, timestamps are disabled. If its value is 2, timestamps are enabled. This key has no default value (it is not in the registry by default). The default behavior is to use timestamps if a peer uses them when initiating a connection.
14.3.3 The Linux Method
The Linux RTT estimation procedure works somewhat differently from the standard method. It uses a clock granularity of 1ms, which is finer than that of many other implementations, along with the TSOPT. The combination of frequent measurements of the RTT and the fine-grain clock contributes to a more accurate estimate of the RTT but also tends to minimize the value of rttvar over time [LS00].
This happens because when a large enough number of mean deviation samples are accumulated, they tend to cancel each other out. This is one consideration for setting the RTO that differs somewhat from the standard method. Another relates to the way the standard method increases rttvar when an RTT sample is significantly below the existing RTT estimate srtt.
To understand the second issue better, recall that the RTO is usually set to the value srtt + 4(rttvar). Consequently, any large change in rttvar causes the RTO to increase, whether the latest RTT sample is greater or less than srtt. This is counterintuitive—if the actual RTT has dropped significantly, it is not desirable to have the RTO increase as a consequence. Linux deals with this issue by limiting the impact of significant downward drops in RTT sample values on the value of rttvar.
We will now look at the details for the procedure Linux uses to set its RTO; the procedure addresses both of the issues just discussed.
Linux keeps the variables srtt and rttvar, as with the standard method, but also two new ones called mdev and mdev_max. The value mdev keeps the running estimate of the mean deviation using the standard algorithm for rttvar described before. The value mdev_max holds the maximum value of mdev seen over the last measured RTT and is never allowed to be less than 50ms. In addition, rttvar is regularly updated to ensure that it is at least as large as mdev_max. Consequently, the RTO never dips below 200ms.
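The following is a much-simplified sketch of how these variables relate (it is not the Linux kernel code and omits most of the logic Linux actually applies, including the handling of downward RTT changes just discussed; the names, units, and initial values are assumptions):

```c
/* Much-simplified sketch of the Linux-style variables (NOT kernel code).
 * mdev follows the standard mean-deviation update; mdev_max tracks the
 * maximum mdev seen and is floored at 50ms; rttvar is kept at least as
 * large as mdev_max, so RTO = srtt + 4*rttvar stays at or above roughly
 * 200ms.  Times are in milliseconds.  The per-RTT reset of mdev_max and
 * the limiting of downward deviation updates are omitted. */
#include <stdio.h>

static long srtt, mdev;
static long mdev_max = 50, rttvar = 50;     /* floors of 50ms */

static long linux_style_rto(long m)
{
    long err = m - srtt;
    srtt += err / 8;                        /* gain 1/8, as in the standard method           */
    if (err < 0)
        err = -err;
    mdev += (err - mdev) / 4;               /* standard mean-deviation update, gain 1/4      */
    if (mdev > mdev_max)
        mdev_max = mdev;                    /* track the maximum over (roughly) the last RTT */
    if (mdev_max > rttvar)
        rttvar = mdev_max;                  /* rttvar is never less than mdev_max (>= 50ms)  */
    return srtt + 4 * rttvar;               /* hence the RTO never dips below about 200ms    */
}

int main(void)
{
    srtt = 40;                              /* assume an existing estimate of 40ms */
    printf("RTO = %ldms\n", linux_style_rto(45));   /* small sample; RTO still >= 200ms */
    return 0;
}
```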
Note
The minimum RTO can be changed. TCP_RTO_MIN, which is a kernel configuration constant, can be changed prior to recompiling and installing the kernel.
Some Linux versions also allow it to be changed using the ip route command.
When TCP is used in data-center networks where RTTs may be only a few microseconds, a 200ms minimum RTO can lead to severe performance degradation due to slow TCP recovery after packet loss in local switches. This is the so-called TCP “incast” problem. Various solutions exist for this problem, including modification of the TCP timer granularity and minimum RTO to be on the order of microseconds [V09]. Such small minimum RTO values are not recommended for use on the global Internet.