16.8 TCP in High-Speed Environments

In high-speed networks with large BDPs (e.g., WANs of 1Gb/s or more), conventional TCP may not perform well because its window increase algorithm (the congestion avoidance algorithm, in particular) takes a long time to grow the window large enough to saturate the network path. Said another way, TCP can fail to take advantage of fast networks even when no congestion is present. This issue arises primarily from the fixed additive increase behavior of congestion avoidance. If we consider a TCP using 1500-byte packets operating over a 10Gb/s long-distance link, some 83,000 segments are required to be outstanding in order to fully utilize the available bandwidth, assuming no packet drops or errors in five billion packets. For an RTT of 100ms, this takes about 1.5 hours to achieve. In order to address this deficiency, a number of researchers and developers have explored ways to alter TCP in order for it to perform better in such networks, while retaining a degree of fairness to standard TCP, especially for more common lower-speed environments.
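The numbers in this example can be checked with a short calculation. The sketch below assumes the simplified response function w = 1.2/sqrt(p), the form used in [RFC3649]; the function and variable names are illustrative only. It yields a window of roughly 83,000 segments, about one loss per five billion packets, and a loss-free interval on the order of the 1.5 hours quoted above.

```python
def bdp_segments(link_bps, rtt_s, segment_bytes):
    """Window, in segments, needed to keep the path full (bandwidth * delay)."""
    return link_bps * rtt_s / (segment_bytes * 8)


def loss_rate_for_window(window_segments):
    """Loss rate p implied by w = 1.2 / sqrt(p) (assumed simplified response function)."""
    return (1.2 / window_segments) ** 2


link_bps = 10e9        # 10Gb/s link
rtt_s = 0.100          # 100ms round-trip time
segment_bytes = 1500   # packet size used in the text

window = bdp_segments(link_bps, rtt_s, segment_bytes)
p = loss_rate_for_window(window)
packets_between_losses = 1 / p
packets_per_second = window / rtt_s
hours_between_losses = packets_between_losses / packets_per_second / 3600

print(f"window needed:          {window:,.0f} segments")
print(f"packets between losses: {packets_between_losses:.2e}")
print(f"hours between losses:   {hours_between_losses:.2f}")
```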

16.8.1 HighSpeed TCP (HSTCP) and Limited Slow Start

The experimental HighSpeed TCP (HSTCP) specifications [RFC3649][RFC3742] propose to alter the standard TCP behavior when the congestion window is larger than a base value Low_Window, suggested to be 38 MSS-size segments. This value corresponds to a packet drop rate of $10^{-3}$ based on the simplified TCP response function given previously. This function is linear on a log-log plot of sending rate versus packet loss rate, so it is really a power law function.

Note

Functions that form a line on a log-log plot are called power law functions. They have equations of the form $y = ax^k$, meaning $\log y = \log a + k \log x$ ($a$ and $k$ are constants). This equation forms a line with slope $k$ on a log-log plot.

To construct the type of power law function required, we select two points and create the equation that describes the line passing through them. Consider two such points $(p_1, w_1)$ and $(P_0, W_0)$, where $w_1 > W_0 > 0$ and $0 < p_1 < P_0$. On a linear plot, this would form a line with slope $(w_1 - W_0)/(p_1 - P_0)$, but on a log-log plot it forms a line with slope $S = (\log w_1 - \log W_0)/(\log p_1 - \log P_0)$. Then, based on the equation in the Note, we have $w = Cp^S$, and we require some point, say $(P_0, W_0)$, to determine $C$. After some algebra, we find that $C = P_0^{-S} W_0$, meaning $w = p^S P_0^{-S} W_0$.
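As a minimal sketch of this construction, the following computes S and C from two points and evaluates the resulting response function. The endpoints (10^-3, 38) and (10^-7, 83000) are taken from [RFC3649] and are an assumption here; they differ slightly from the point used for the plotted figure, so the slope comes out near -0.83 and not exactly at the value quoted for the figure. The function names are hypothetical.

```python
import math


def power_law_through(p0, w0, p1, w1):
    """Return (S, C) such that w = C * p**S passes through (p0, w0) and (p1, w1)."""
    s = (math.log(w1) - math.log(w0)) / (math.log(p1) - math.log(p0))
    c = p0 ** (-s) * w0          # C = P0^(-S) * W0
    return s, c


# Endpoints as suggested in RFC 3649 (assumed for this illustration).
P0, W0 = 1e-3, 38
P1, W1 = 1e-7, 83000

S, C = power_law_through(P0, W0, P1, W1)


def hstcp_window(p):
    """Window (segments) given by the power law for packet drop rate p."""
    return C * p ** S


print(f"S = {S:.3f}, C = {C:.3f}")
for p in (1e-3, 1e-5, 1e-7):
    print(f"p = {p:.0e}  ->  w = {hstcp_window(p):,.0f} segments")
```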

In Figure 16-19, we see a plot of both the conventional TCP response function and a proposed response function for HSTCP based on the point (P0, W0) = (.0015, 31) and S = -0.82. Note that for larger packet drop rates (over about .001) the response functions are the same, so these equations apply only for a certain maximum value of p. Comparing the two lines, when the packet drop rate is small enough, HSTCP is allowed to send more aggressively.

Figure 16-19 With HighSpeed TCP, the TCP response function is altered to be more aggressive for low packet drop rates and large windows, leading to higher throughputs for high bandwidth-delay-product networks. Image from presentation by Sally Floyd to IETF TSVWG, Mar. 2003.

To have TCP achieve this response function, the congestion avoidance procedure is modified to take into account the current size of the window when making changes. This takes place, as with conventional TCP, upon the arrival of a good ACK. The response for a good arriving ACK is generalized as follows:

$$cwnd_{t+1} = cwnd_t + \frac{a(cwnd_t)}{cwnd_t}$$

When responding to a congestion event (e.g., packet loss, ECN indication), it responds as follows:

$$cwnd_{t+1} = cwnd_t - b(cwnd_t) \cdot cwnd_t$$

Here, a() is the additive increase function and b() is the multiplicative decrease function. In this generalization of standard TCP, they are functions of the current window size. To achieve the desired response function, we start by generalizing from equation [3]:

$$W_0 \sqrt{P_0} = \sqrt{\frac{a(w)\,(2 - b(w))}{2\,b(w)}}$$

This gives:

$$a(w) = \frac{2 P_0 W_0^2\, b(w)}{2 - b(w)}$$

This relationship does not have a unique solution—that is, there are many combinations of a() and b() that satisfy the relationship, even though some of them may not be practical or desirable for deployment.
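To illustrate that many (a, b) pairs satisfy the relationship, the sketch below picks several decrease factors b and computes the matching increase a at the point (P0, W0) = (10^-3, 38). It also shows the generalized per-ACK and per-congestion-event updates. All names are hypothetical, and the chosen b values are arbitrary examples. Note that b = 0.5 gives a close to 1, which is the behavior of standard TCP.

```python
def additive_increase(b, p0, w0):
    """a = 2 * P0 * W0^2 * b / (2 - b) for a chosen decrease factor b."""
    return 2 * p0 * w0 ** 2 * b / (2 - b)


def on_ack(cwnd, a):
    """Generalized congestion avoidance step for one good ACK."""
    return cwnd + a / cwnd


def on_congestion(cwnd, b):
    """Generalized multiplicative decrease for one congestion event."""
    return cwnd - b * cwnd


P0, W0 = 1e-3, 38   # Low_Window point discussed in the text

for b in (0.5, 0.3, 0.1):            # arbitrary example decrease factors
    a = additive_increase(b, P0, W0)
    print(f"b = {b:.1f}  ->  a = {a:.3f}")

cwnd = 38.0
cwnd = on_ack(cwnd, additive_increase(0.5, P0, W0))
print(f"cwnd after one ACK: {cwnd:.4f} segments")
print(f"cwnd after a loss:  {on_congestion(cwnd, 0.5):.4f} segments")
```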

Additional details of the changes proposed to the congestion avoidance procedure for TCP suggested by HSTCP are available in [RFC3649]. A companion document [RFC3742] describes how slow start can be modified to help TCP obtain a working congestion window in such environments. This is called limited slow start and is designed to slow down slow start, so that a TCP operating with large windows (thousands or tens of thousands of packets) does not double its window in one RTT.

With limited slow start, a new parameter called max_ssthresh is introduced. This value is not the maximum value of ssthresh but instead a threshold for cwnd that works as follows: If cwnd <= max_ssthresh, slow start proceeds as normal. If max_ssthresh < cwnd <= ssthresh, then cwnd is increased by at most (max_ssthresh/2) SMSS per RTT. This is accomplished by modifying the management of cwnd during slow start as follows:

    if (cwnd <= max_ssthresh) {
        cwnd = cwnd + SMSS              /* regular slow start */
    } else {
        K = int(cwnd / (0.5 * max_ssthresh))
        cwnd = cwnd + int(SMSS / K)     /* limited slow start */
    }

A suggested possible initial value for max_ssthresh is 100 packets, or 100*SMSS in bytes.
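The effect of the limit can be seen by applying the per-ACK rule for one RTT. The following is a minimal sketch, assuming one ACK per segment, a window measured in bytes, an SMSS of 1460 bytes, and an ssthresh large enough that slow start continues; the helper names are hypothetical.

```python
def limited_slow_start_on_ack(cwnd, smss, max_ssthresh):
    """Apply the limited slow start update for one arriving ACK (values in bytes)."""
    if cwnd <= max_ssthresh:
        return cwnd + smss                      # regular slow start
    k = int(cwnd / (0.5 * max_ssthresh))
    return cwnd + int(smss / k)                 # limited slow start


def grow_for_one_rtt(cwnd, smss, max_ssthresh):
    """Process one window's worth of ACKs (assumes one ACK per full segment)."""
    acks = cwnd // smss
    for _ in range(acks):
        cwnd = limited_slow_start_on_ack(cwnd, smss, max_ssthresh)
    return cwnd


SMSS = 1460                    # assumed segment size in bytes
MAX_SSTHRESH = 100 * SMSS      # suggested initial value from the text

small_start = 10 * SMSS
small_end = grow_for_one_rtt(small_start, SMSS, MAX_SSTHRESH)

large_start = 1000 * SMSS
large_end = grow_for_one_rtt(large_start, SMSS, MAX_SSTHRESH)

print(f"small window: {small_start // SMSS} -> {small_end // SMSS} segments")
print(f"large window: grew by {(large_end - large_start) / SMSS:.0f} segments in one RTT")
```

With these inputs the small window doubles, as in ordinary slow start, while the large window grows by max_ssthresh/2 segments and no more.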

16.8.2 Binary Increase Congestion Control (BIC and CUBIC)

HSTCP is one of several proposals for modifying TCP to provide higher throughput for large BDP networks. While it considers throughput and fairness with respect to conventional TCPs in similar circumstances, and elects to be more aggressive than standard TCP under certain circumstances, it does not attempt to directly control what happens when HSTCP connections with differing RTTs compete with each other (called “RTT fairness”). This was studied for standard TCP some years back, revealing that TCPs with shorter RTTs obtain a larger share of the bandwidth on shared links as compared to those having larger RTTs, when using the same packet size and ACK strategy [F91]. For TCPs that increase cwnd as a function of its size (so-called bandwidth-scalable TCPs), this unfairness can be even more severe. Whether RTT fairness should be considered desirable is subject to debate. Although RTT fairness would seem attractive from first principles, connections with larger RTTs are likely to be using more network resources (e.g., passing through more routers), so it may be reasonable for them to receive somewhat less throughput. In any case, knowing just how RTT (un)fairness behaves is a driving factor behind the popular TCP variants we explore next.

16.8.2.1 BIC-TCP

In an effort to create a scalable TCP and deal with the issue of RTT fairness, BIC-TCP (formerly called BI-TCP) [XHR04] was developed and deployed in Linux kernels starting with version 2.6.8. The main goal of BIC-TCP is to provide linear RTT fairness even though congestion windows may be quite large (which is required to use high-bandwidth links). Linear RTT fairness means that connections receive a bandwidth share inversely proportional to their RTTs, rather than some more complicated or unknown function.
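As a small worked example of linear RTT fairness, the sketch below splits a shared link among flows in inverse proportion to their RTTs. The function name and the RTT values are illustrative assumptions, not measurements.

```python
def linear_rtt_fair_shares(rtts):
    """Fraction of the link each flow receives when share is proportional to 1/RTT."""
    weights = [1.0 / rtt for rtt in rtts]
    total = sum(weights)
    return [w / total for w in weights]


# Two flows sharing a link: 10ms and 40ms RTT (example values).
shares = linear_rtt_fair_shares([0.010, 0.040])
for rtt_ms, share in zip((10, 40), shares):
    print(f"RTT {rtt_ms}ms -> {share:.0%} of the link")
```

A flow with four times the RTT receives one-quarter of the bandwidth of the other flow, which is the linear relationship described above.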

The approach modifies a standard TCP sender with two algorithms: binary search increase and additive increase. These algorithms are invoked after a congestion indication (e.g., packet loss), but only one of the algorithms is in operation at any given point in time. The binary search increase algorithm operates as follows:

The current minimum window is the last point at which the connection experienced no packet loss during an entire RTT. The maximum window is the window size at which the connection last experienced loss, if known. The desired window lies somewhere between the two. Using a binary search technique, BIC-TCP selects a trial window at the midpoint of these two values and tries again recursively. If this point shows continued packet loss, it becomes the new maximum and the process repeats. If not, it becomes the new minimum and the process repeats. The process terminates when the difference between the minimum and maximum windows is less than a predefined threshold called the minimum increment, or Smin.

The algorithm tends to find the desirable window, also called the saturation point, in a logarithmic number of trials, whereas a standard TCP would require a linear number (half of the difference in window sizes, on average). Thus, this approach makes BIC-TCP more aggressive than standard TCP during certain periods of operation, but this is desired in order to take advantage of high-speed environments without unwanted delay. The protocol is unusual, relative to other proposals, because its increase function is concave at some points—that is, its increase gets smaller as it gets closer to the saturation point. Most other algorithms use large change increments nearest the saturation point.

The additive increase algorithm works as follows: When using binary search increase, the situation can arise where the distance from the current window size to the midpoint (in the sense of the binary search described previously) is large. Increasing the window to the midpoint in one RTT may be ill advised because of the potential for injecting large packet bursts into the network. This is prevented by the additive increase algorithm, which is invoked when the distance to the midpoint from the current window is more than some amount Smax. When this happens, the increment is limited to Smax per RTT, called window clamping. Once the midpoint is closer than Smax to the trial window, binary search increase takes over.

Overall, upon detection of a loss, the window is reduced by a multiplicative factor β, and its growth starts again with additive increase and switches to binary search once the desired increase amount is less than Smax. The authors call the combined algorithms binary increase, or BI.
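The combined behavior can be sketched as a per-RTT trace of the window after a single loss. This is a simplified illustration, not the deployed implementation: it assumes no further loss occurs, treats the current window as the running minimum, takes β as the fraction of the window kept after the loss, and uses made-up values for the maximum window, Smax, and Smin.

```python
def bic_growth_trace(w_max, beta, s_max, s_min):
    """Per-RTT window values after one loss, assuming no further loss occurs."""
    cwnd = beta * w_max                 # multiplicative reduction (beta = fraction kept)
    trace = [cwnd]
    while w_max - cwnd >= s_min:        # stop once min and max are within Smin
        step = (w_max - cwnd) / 2       # distance to the binary-search midpoint
        if step > s_max:
            step = s_max                # additive increase: clamp to Smax per RTT
        cwnd += step
        trace.append(cwnd)
    return trace


trace = bic_growth_trace(w_max=1000, beta=0.8, s_max=32, s_min=1)
steps = [later - earlier for earlier, later in zip(trace, trace[1:])]

for rtt_index, (window, step) in enumerate(zip(trace[1:], steps), start=1):
    print(f"RTT {rtt_index:2d}: cwnd = {window:8.3f}  (+{step:.3f})")
```

The trace begins with constant increments of Smax (additive increase) and then switches to halving steps (binary search increase), so growth slows as the window nears the previous maximum.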

When the window grows beyond the current maximum, or no maximum is yet known because no loss event has occurred, a new maximum must be established. This is accomplished by a procedure known as max probing. The purpose of max probing is to use bandwidth when it becomes available. It proceeds in a way symmetric to the additive increase and binary increase algorithms. It starts in small initial increments, followed by larger increments if no congestion is indicated. The approach shows good stability because small changes are made near the saturation point, where the network is believed to be operating near its greatest capacity.

Linux (kernels 2.6.8 through 2.6.17) includes an implementation of BIC-TCP that is enabled by default. Four sysctl parameters control its operation: net.ipv4.tcp_bic, net.ipv4.tcp_bic_beta, net.ipv4.tcp_bic_low_window, and net.ipv4.tcp_bic_fast_convergence. The first, a Boolean variable, controls whether BIC is used (as opposed to the conventional fast retransmit/recovery procedures). The next contains a scaling factor for cwnd to determine Smax (default 819). The next parameter controls the minimum size of the congestion window before the BIC-TCP control algorithms take over. Its default value is 14, meaning that for small window values standard TCP congestion control is used. The last parameter is a flag, enabled by default. When set, it affects the way the new maximum and target windows are selected when the binary increase algorithm is in a downward trend. During a window reduction, the new maximum and minimum windows are set to the current and scaled (down by a factor of beta) values of cwnd, respectively. If fast convergence is enabled and the value of the new maximum is less than its previous value before it was set to cwnd, the maximum window is further reduced to the average of itself and the minimum window. After this, whether or not fast convergence is enabled, the target window is the average of the maximum and minimum values. This helps to achieve even bandwidth sharing more quickly when multiple BIC-TCP flows are sharing the same router.
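The window selection during a reduction, as described above, can be sketched as follows. This is an illustrative model, not the kernel source: the function name is hypothetical, and beta is assumed to be expressed as a fraction (0.8 here) and not as the scaled integer stored in the sysctl variable.

```python
def bic_window_reduction(cwnd, prev_max, beta, fast_convergence=True):
    """Return (new_min, new_max, target) after a window reduction."""
    new_max = cwnd                   # maximum: window at which loss occurred
    new_min = beta * cwnd            # minimum: cwnd scaled down by beta
    if fast_convergence and new_max < prev_max:
        # Downward trend: pull the maximum toward the minimum.
        new_max = (new_max + new_min) / 2
    target = (new_max + new_min) / 2
    return new_min, new_max, target


with_fc = bic_window_reduction(cwnd=100, prev_max=120, beta=0.8)
without_fc = bic_window_reduction(cwnd=100, prev_max=120, beta=0.8,
                                  fast_convergence=False)
print("fast convergence on: ", with_fc)
print("fast convergence off:", without_fc)
```

With fast convergence enabled, a flow whose maximum is shrinking targets a lower window, which releases bandwidth to competing flows sooner.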


16.8.2.2 CUBIC

The authors of BIC-TCP revised their basic algorithms to form a new control algorithm called CUBIC [HRX08]. It has been the default congestion control algorithm used in Linux TCP since kernel version 2.6.18. It addresses concerns raised that BIC-TCP may be too aggressive under some circumstances. It also simplifies the window growth procedures. Instead of using a threshold (Smax) to decide when to invoke the binary search increase versus additive increase, an odd-degree polynomial function, in particular a cubic function, is used to control the window increase function. Cubic functions can have both convex and concave portions, meaning that they can grow more slowly in some portions (concave) and more quickly in others (convex). Until BIC and CUBIC, virtually all of the TCP literature advocated convex window growth functions. The specific window growth function, used by CUBIC to set cwnd, is as follows:

$$W(t) = C(t - K)^3 + W_{max}$$

In this equation, W(t) is the window at time t. C is a constant parameter (default 0.4), t is the elapsed time in seconds since the last window reduction, and K is the time period the function takes to increase W to Wmax when there is no further loss event. Wmax is the last window size prior to the last window adjustment. K can be calculated as follows:

$$K = \sqrt[3]{\frac{\beta\, W_{max}}{C}}$$

where β is the multiplicative decrease constant (default 0.2). An illustration of the CUBIC window growth function for K = 2.71, Wmax = 10, and C = 0.4 on the interval t = [0, 5] is shown in Figure 16-20.

This figure illustrates how the CUBIC window growth function contains both a concave portion and convex portion. When a fast retransmit occurs, Wmax is set to cwnd, and new values of cwnd and ssthresh are set to β*cwnd. CUBIC uses a default value of 0.8 for β. The value W(t + RTT) gives the next target congestion window value. When an additional ACK arrives during congestion avoidance, cwnd is increased by (W(t + RTT) - cwnd)/cwnd.
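The shape of the growth function and the per-ACK update can be explored with a few lines of code. This is a minimal sketch using the parameters quoted for Figure 16-20 (K = 2.71, Wmax = 10, C = 0.4); the function names and the 100ms RTT in the last step are assumptions made for illustration.

```python
def cubic_window(t, w_max, k, c=0.4):
    """W(t) = C * (t - K)^3 + Wmax, with t in seconds since the last reduction."""
    return c * (t - k) ** 3 + w_max


def cubic_ack_increment(cwnd, t, rtt, w_max, k, c=0.4):
    """Per-ACK increase: (W(t + RTT) - cwnd) / cwnd."""
    target = cubic_window(t + rtt, w_max, k, c)
    return (target - cwnd) / cwnd


W_MAX, K, C = 10.0, 2.71, 0.4     # values quoted for the figure

# Sample W(t) every half second on the interval [0, 5].
samples = [(i / 2, cubic_window(i / 2, W_MAX, K, C)) for i in range(11)]
for t, w in samples:
    region = "concave" if t < K else "convex"
    print(f"t = {t:3.1f}s  W(t) = {w:6.2f}  ({region})")

# One ACK arriving 1s after the reduction, with an assumed 100ms RTT.
increment = cubic_ack_increment(cwnd=8.0, t=1.0, rtt=0.1, w_max=W_MAX, k=K, c=C)
print(f"per-ACK increase at t = 1s: {increment:.4f} segments")
```

The printed values rise quickly at first, flatten near t = K where W(t) reaches Wmax, and then accelerate again, matching the concave and convex portions described above.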

It is worth noting that having t be the amount of elapsed time since the last window reduction event helps to ensure RTT fairness. Instead of changing the window by some fixed amount when ACKs arrive, the window change amount is a function of the elapsed time since the last window change. This decouples the window change operations from the particular pattern of ACK arrivals.

In addition to the cubic operating region, CUBIC also has a “TCP-friendly” region that operates when the window is small to ensure that CUBIC is not penalized relative to regular TCP. More specifically, the window size of standard TCP in terms of the elapsed time t, $W_{tcp}(t)$, is given by

$$W_{tcp}(t) = W_{max}\,\beta + \frac{3\,(1 - \beta)}{1 + \beta} \cdot \frac{t}{RTT}$$

So if cwnd is less than Wtcp(t) when an ACK arrives during congestion avoidance, CUBIC sets cwnd = Wtcp(t). This ensures TCP friendliness in common low- to moderate-speed networks, where CUBIC would otherwise be disadvantaged.
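The TCP-friendly check can be sketched as below. The estimate is assumed to take the form Wtcp(t) = Wmax·β + 3(1 − β)/(1 + β)·(t/RTT), with β the multiplicative factor applied to cwnd after a loss (0.8 here); the function names and sample values are hypothetical.

```python
def w_tcp(t, rtt, w_max, beta=0.8):
    """Estimated standard-TCP window t seconds after the last reduction (assumed form)."""
    return w_max * beta + 3 * (1 - beta) / (1 + beta) * (t / rtt)


def tcp_friendly_cwnd(cwnd, t, rtt, w_max, beta=0.8):
    """Raise cwnd to the standard-TCP estimate if CUBIC would otherwise fall below it."""
    return max(cwnd, w_tcp(t, rtt, w_max, beta))


# Three RTTs (100ms each) after a reduction from Wmax = 10 segments.
print(f"Wtcp estimate:        {w_tcp(0.3, 0.1, 10):.2f}")
print(f"cwnd 8.5 becomes:     {tcp_friendly_cwnd(8.5, 0.3, 0.1, 10):.2f}")
print(f"cwnd 12.0 is kept at: {tcp_friendly_cwnd(12.0, 0.3, 0.1, 10):.2f}")
```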

As mentioned earlier, CUBIC has been the default congestion control algorithm for Linux kernels since 2.6.18. Since kernel version 2.6.13, however, Linux supports pluggable congestion avoidance modules [P07], allowing the user to pick which algorithm to use. The variable net.ipv4.tcp_congestion_control contains the current default congestion control algorithm (default: cubic). The variable net.ipv4.tcp_available_congestion_control contains the congestion control algorithms loaded on the system (in general, additional ones can be loaded as kernel modules). The variable net.ipv4.tcp_allowed_congestion_control contains those algorithms permitted for use by applications (either selected specifically or by default). The default supports CUBIC and Reno.
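On a Linux system these variables can be inspected through the /proc/sys file tree, and an application can request a particular algorithm for one socket with the TCP_CONGESTION socket option. The helper names below are hypothetical, the sketch assumes a Linux host with a Python version that exposes socket.TCP_CONGESTION, and the per-socket call succeeds only for algorithms the system permits.

```python
import socket
from pathlib import Path


def read_sysctl(name, root="/proc/sys"):
    """Return the whitespace-separated values of a sysctl variable read via procfs."""
    path = Path(root) / name.replace(".", "/")
    return path.read_text().split()


def set_socket_congestion_control(sock, algorithm):
    """Request a congestion control algorithm for one TCP socket (Linux only)."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, algorithm.encode())


if __name__ == "__main__":
    for name in ("net.ipv4.tcp_congestion_control",
                 "net.ipv4.tcp_available_congestion_control",
                 "net.ipv4.tcp_allowed_congestion_control"):
        try:
            print(f"{name} = {' '.join(read_sysctl(name))}")
        except OSError as exc:   # not Linux, or the variable is absent
            print(f"{name}: unavailable ({exc})")
```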

[Figure 16-20 plot: Window Size versus Time (t), showing W(t) with Wmax marked; regions labeled “Steady State Behavior, W(t) < Wmax” and “Max Probing Behavior, W(t) > Wmax”.]

Figure 16-20 The CUBIC window growth function is a cubic function of t. It has a concave portion in the area where W(t) < Wmax. In this region, CUBIC searches for the saturation point by growing cwnd with decreasing aggressiveness. After Wmax is reached, the growth function becomes convex, where it searches by growing cwnd with increasing aggressiveness.

