Path MTU Discovery with TCP

In Chapter 3, we described the concept of the path MTU: the minimum MTU on any network segment currently in the path between two hosts. Knowing the path MTU can help protocols such as TCP avoid fragmentation. In Chapter 10, we looked at how path MTU discovery (PMTUD) is accomplished based on ICMP messages, but in that case UDP is usually unable to adapt its datagram size because the application, not the transport protocol, specifies the size. TCP, because it provides a byte stream abstraction, determines what segment size to use and as a result has a much greater degree of control over the size of the IP datagrams that are ultimately generated.

In this section we will examine how PMTUD is used by TCP. Our discussion will apply to both TCP/IPv4 and TCP/IPv6. More details are provided by [RFC1191] and [RFC1981], respectively. A method that avoids the use of ICMP, called Packetization Layer Path MTU Discovery (PLPMTUD), can also be used by TCP [RFC4821] or by other transport protocols. We shall use the ICMPv6 Packet Too Big (PTB) terminology to refer to either ICMPv4 Destination Unreachable (Fragmentation Required) or ICMPv6 Packet Too Big messages.

ptg999 Section 13.4 Path MTU Discovery with TCP 613

TCP’s regular PMTUD process operates as follows: When a connection is established, TCP uses the minimum of the MTU of the outgoing interface, or the MSS announced by the other end, as the basis for selecting its send maximum segment size (SMSS). PMTUD does not allow TCP to exceed the MSS announced by the other end. If the other end does not specify an MSS, the sender assumes a default of 536 bytes, but this situation is now rare. It is also possible for an implementation to save path MTU information on a per-destination basis to help in selecting its segment size. Note that the path MTU in each direction of a connection could be different.
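The selection of the initial SMSS can be sketched as follows. This is a minimal illustration rather than any particular stack's logic: the function name `initial_smss` is hypothetical, and it assumes that the interface MTU is first reduced by fixed 20-byte IP and TCP headers (no options) before being compared with the peer's MSS.

```python
DEFAULT_MSS = 536  # assumed when the peer announces no MSS option


def initial_smss(interface_mtu, peer_mss=None, ip_header=20, tcp_header=20):
    """Pick a starting SMSS from the local MTU and the peer's announced MSS.

    Assumption: header sizes are fixed and carry no options.
    """
    # Largest TCP payload that fits in one datagram on the outgoing interface.
    local_limit = interface_mtu - ip_header - tcp_header
    # Fall back to the default when the peer sent no MSS option.
    remote_limit = DEFAULT_MSS if peer_mss is None else peer_mss
    # PMTUD never lets the sender exceed the peer's announced MSS.
    return min(local_limit, remote_limit)
```

For example, `initial_smss(1492, 1460)` is limited by the local interface, while `initial_smss(1500)` falls back to the 536-byte default.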

Once the initial SMSS is chosen, all IPv4 datagrams sent by TCP on that connection have the IPv4 DF bit field set. For TCP/IPv6, this is not necessary because there is no DF bit field; all datagrams are assumed to have it set implicitly. If a PTB is received, TCP decreases the segment size and retransmits using the smaller size. If the PTB contains the suggested next-hop MTU, the segment size can be set to the next-hop MTU minus the sizes of the IPv4 (or IPv6) and TCP headers.
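The adjustment made on receipt of a PTB amounts to a subtraction, sketched below. The function name and the `tcp_options` parameter are illustrative assumptions; the base header sizes (20 bytes for IPv4, 40 for IPv6, 20 for TCP) are the standard minimums.

```python
IPV4_HEADER = 20  # minimum IPv4 header, no options
IPV6_HEADER = 40  # fixed IPv6 header, no extension headers
TCP_HEADER = 20   # minimum TCP header, no options


def segment_size_from_ptb(next_hop_mtu, ipv6=False, tcp_options=0):
    """Return the TCP payload size that fits within the reported next-hop MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    size = next_hop_mtu - ip_header - TCP_HEADER - tcp_options
    if size <= 0:
        raise ValueError("next-hop MTU too small to carry any TCP payload")
    return size
```

With a next-hop MTU of 288 bytes and, say, 12 bytes of TCP options, `segment_size_from_ptb(288, tcp_options=12)` yields a 236-byte segment.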

If the next-hop MTU value is not present (e.g., an older ICMP error was returned that lacks this information), the sender may try a variety of values (e.g., binary search for a usable value). This also affects TCP’s congestion control management (see Chapter 16). For PLPMTUD the situation is similar, except PTB messages are not used. Instead, the protocol performing PMTUD must be able to detect message discards quickly and perform its own datagram size adjustments.
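One way to picture the binary search mentioned above is the sketch below. It is only an illustration: `fits` is a hypothetical probe callback that reports whether a datagram of the given size gets through, and the search assumes the answer is monotonic (every size below a working size also works). The default lower bound of 68 bytes is the minimum IPv4 MTU.

```python
def search_path_mtu(fits, low=68, high=1500):
    """Return the largest size in [low, high] for which fits(size) is True.

    `fits` is a hypothetical probe; a real sender would infer the result
    from acknowledgments or losses rather than call a function.
    """
    if not fits(low):
        raise ValueError("even the smallest candidate size does not fit")
    while low < high:
        # Round up so the loop always makes progress when fits(mid) is True.
        mid = (low + high + 1) // 2
        if fits(mid):
            low = mid       # mid works; look for something larger
        else:
            high = mid - 1  # mid is too big; look lower
    return low
```

Each probe here stands for at least one round trip (or a timeout), which is one reason an explicit next-hop MTU in the PTB is preferable when it is available.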

Because routes can change dynamically, when some time has passed since the last decrease of the segment size, a larger value (up to the initial SMSS) can be tried. Guidance in [RFC1191] and [RFC1981] recommends that this time interval be about 10 minutes.
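The timer-driven increase can be modeled with a small amount of per-connection state. The class below is a simplified sketch, not a description of any real stack: the names are hypothetical, time is passed in explicitly as seconds, and the retry jumps straight back to the initial SMSS instead of stepping up gradually.

```python
PROBE_INTERVAL = 10 * 60  # seconds; the roughly 10-minute interval noted above


class PathMtuState:
    """Tracks the current segment size and when it was last decreased."""

    def __init__(self, initial_smss):
        self.initial_smss = initial_smss
        self.smss = initial_smss
        self.last_decrease = None  # no decrease has happened yet

    def on_decrease(self, new_smss, now):
        """Record a decrease triggered by a PTB (or a detected loss)."""
        self.smss = min(self.smss, new_smss)
        self.last_decrease = now

    def maybe_probe_larger(self, now):
        """Return the size to use; retry the initial SMSS once the timer expires."""
        if (self.last_decrease is not None
                and now - self.last_decrease >= PROBE_INTERVAL):
            self.smss = self.initial_smss  # simplification: jump to the maximum
            self.last_decrease = None
        return self.smss
```

If the larger size still does not fit, a new PTB arrives, `on_decrease` is called again, and the timer restarts.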

There are a number of problems with PMTUD when it operates in an Internet environment with firewalls that block PTB messages [RFC2923]. Of the various operational problems with PMTUD, black holes have been the most problematic, although the situation is improving (in [LS10], 80% of systems studied were able to properly process PTB messages). PMTUD black holes arise when a TCP implementation that depends on the delivery of ICMP messages to adjust its segment size never receives them. This could be for several reasons, including a firewall or NAT configuration that prohibits such ICMP messages from being forwarded. The consequence is a TCP connection that cannot proceed once it starts to use larger packets. It can be difficult to diagnose because only large packets cannot be forwarded. The smaller ones (such as SYN and SYN + ACK packets used to establish the connection) generally succeed. Some TCP implementations have “black hole detection,” which amounts to trying a smaller segment size when a segment is retransmitted several times.
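The black hole detection heuristic can be sketched as a simple policy function. The retransmission threshold of 3 and the fallback size of 536 bytes are assumptions chosen for illustration; actual implementations differ in both values and in how they step the size down.

```python
def next_segment_size(current_size, retransmissions, threshold=3, fallback=536):
    """Hypothetical black hole heuristic.

    After `threshold` retransmissions of the same segment with no PTB
    received, assume ICMP is being filtered and fall back to a smaller size.
    """
    if retransmissions >= threshold and current_size > fallback:
        return fallback
    return current_size
```

The function leaves the size alone when it is already at or below the fallback, because repeated loss of small segments more likely indicates congestion or an outage than an MTU problem.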

13.4.1 Example

We can see the correct behavior of PMTUD when an intermediate router has an MTU less than either of the endpoints’ MSS. To create this situation, we begin with a router (a Linux host with local address 10.0.0.1) that has a PPPoE interface to a DSL service provider. The PPPoE link uses an MTU of 1492 (1500 bytes for Ethernet, minus 6 bytes of PPPoE overhead, minus another 2 bytes of PPP overhead; see Chapter 3). Figure 13-7 is an illustration of the topology.

[Figure 13-7 diagram. Recoverable labels: Internet, DSL modem, PPPoE link (MTU), eth and ppp interfaces.]

Figure 13-7 The PPPoE encapsulation drops the path MTU of most TCP connections to 1492 bytes from what might otherwise have been 1500 bytes (the typical MTU for Ethernet). To demonstrate TCP’s use of PMTUD, we set the MTU even smaller (288 bytes).
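The numbers in this setup follow from two small calculations, sketched here. The helper `path_mtu` and the three-link path are illustrative assumptions; the overhead constants are the ones given in the text.

```python
ETHERNET_MTU = 1500
PPPOE_OVERHEAD = 6  # PPPoE header bytes
PPP_OVERHEAD = 2    # PPP overhead bytes

# MTU available on the PPPoE link once the encapsulation is accounted for.
pppoe_mtu = ETHERNET_MTU - PPPOE_OVERHEAD - PPP_OVERHEAD


def path_mtu(link_mtus):
    """The path MTU is the smallest MTU of any link along the path."""
    return min(link_mtus)


# Hypothetical path: client Ethernet, PPPoE link, then an Ethernet-sized link.
normal_path = path_mtu([ETHERNET_MTU, pppoe_mtu, ETHERNET_MTU])
reduced_path = path_mtu([ETHERNET_MTU, 288, ETHERNET_MTU])
```

Here `normal_path` is 1492 and `reduced_path` is 288, which is the value the experiment forces.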

In order to induce this behavior specifically, we can reduce the MTU size on the PPPoE link from 1492 to, say, 288 bytes. On the GW machine, the following command accomplishes this task:

Linux(GW)# ifconfig ppp0 mtu 288

In addition, we need to tell the client system (C) that small segments are allowed:

Linux(C)# sysctl -w net.ipv4.route.min_pmtu=68

If we did not perform this second operation, Linux would clamp its minimum path MTU at the default value of 552 bytes, which helps avoid certain small MTU attacks (see Section 13.8). The consequence of doing so in our example here is that any packets larger than 288 bytes would be fragmented. To avoid this, and to demonstrate PMTUD more effectively, we remove this minimum. We then start a file transfer from machine C (address 10.0.0.123) to the server S on the Internet (address 169.229.62.97). Listing 13-2 shows a tcpdump packet trace from this exchange. Several lines have been wrapped and extraneous fields have been removed for clarity.

Listing 13-2 The path MTU discovery mechanism finds an appropriate segment size to use when transiting the network where the middle link has a smaller MTU than the endpoints.

1 20:20:21.992721 IP (tos 0x0, ttl 45, id 43565, offset 0, flags [DF], proto 6, length: 588)

169.229.62.97.22 > 10.0.0.123.1027: P [tcp sum ok]

41:577(536) ack 23


2 20:20:21.993727 IP (tos 0x0, ttl 64, id 57659, offset 0, flags [DF], proto 6, length: 588)

10.0.0.123.1027 > 169.229.62.97.22: P [tcp sum ok]

23:559(536) ack 577

3 20:20:21.994093 IP (tos 0xc0, ttl 64, id 57547, offset 0, flags [none], proto 1, length: 576)

10.0.0.1 > 10.0.0.123: icmp 556:

169.229.62.97 unreachable - need to frag (mtu 288) for IP (tos 0x0, ttl 63, id 57659, offset 0, flags [DF], proto 6, length: 588)

10.0.0.123.1027 > 169.229.62.97.22:

P 23:559(536) ack 577

4 20:20:21.994884 IP (tos 0x0, ttl 64, id 57660, offset 0, flags [DF], proto 6, length: 288)

10.0.0.123.1027 > 169.229.62.97.22: . [tcp sum ok]

23:259(236) ack 577 ...

5 20:20:22.488856 IP (tos 0x0, ttl 45, id 6712, offset 0, flags [DF], proto 6, length: 836)

169.229.62.97.22 > 10.0.0.123.1027: P [tcp sum ok]

857:1641(784) ack 855 ...

6 20:20:29.672947 IP (tos 0x8, ttl 64, id 57679, offset 0, flags [DF], proto 6, length: 1452)

10.0.0.123.1027 > 169.229.62.97.22: . [tcp sum ok]

1431:2831(1400) ack 2105

7 20:20:29.674123 IP (tos 0xc8, ttl 64, id 57548, offset 0, flags [none], proto 1, length: 576)

10.0.0.1 > 10.0.0.123: icmp 556:

169.229.62.97 unreachable - need to frag (mtu 288) for IP (tos 0x8, ttl 63, id 57679, offset 0, flags [DF], proto 6, length: 1452)

10.0.0.123.1027 > 169.229.62.97.22: . 1431:2831(1400) ack 2105

8 20:20:29.673751 IP (tos 0x8, ttl 64, id 57680, offset 0, flags [DF], proto 6, length: 1452)

10.0.0.123.1027 > 169.229.62.97.22: . [tcp sum ok]

2831:4231(1400) ack 2105

9 20:20:29.675180 IP (tos 0xc8, ttl 64, id 57549, offset 0, flags [none], proto 1, length: 576)

10.0.0.1 > 10.0.0.123: icmp 556:

169.229.62.97 unreachable - need to frag (mtu 288) for IP (tos 0x8, ttl 63, id 57680, offset 0, flags [DF], proto 6, length: 1452)

10.0.0.123.1027 > 169.229.62.97.22: . 2831:4231(1400) ack 2105

10 20:20:29.674932 IP (tos 0x8, ttl 64, id 57681, offset 0, flags [DF], proto 6, length: 288)

10.0.0.123.1027 > 169.229.62.97.22: . [tcp sum ok]

1431:1667(236) ack 2105

11 20:20:29.675143 IP (tos 0x8, ttl 64, id 57682, offset 0, flags [DF], proto 6, length: 288)

10.0.0.123.1027 > 169.229.62.97.22: . [tcp sum ok]

1667:1903(236) ack 2105

In the tcpdump output, the connection has already been set up and MSS options have been exchanged. All packets on the connection have the DF bit field set, so both ends are performing PMTUD. The remote side’s first packet is 588 bytes long and transits the router successfully in one piece, despite our configuration of the MTU on the PPPoE link being 288 bytes. The reason for this is asymmetry in the MTU configuration. Although the local end of the PPPoE link is using an MTU of 288 bytes, the other end is using a larger MTU, presumably 1492 bytes. This leaves us in the situation where our outgoing packets need to be small (288 bytes or less), while packets traveling in the reverse direction can be larger.

When the local end attempts to send a 588-byte packet with the DF bit field set, the router (10.0.0.1) generates a PTB message indicating that the appropriate MTU for the next-hop link is 288 bytes. TCP responds by sending its next packet with a size of 288 bytes, as instructed. To send the rest of the data it attempted to send in its 588-byte packet, it sends two additional packets, of sizes 288 and 116 bytes. A similar pattern of sizes repeats during the course of the file transfer.
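The packet sizes seen after the PTB can be checked with a short calculation. In the trace, the 588-byte packet carries 536 bytes of data, so each packet has 52 bytes of IP and TCP headers (including options). The function below is an illustrative sketch with a hypothetical name.

```python
def packet_sizes(payload_bytes, path_mtu, header_bytes):
    """Split a payload into packets no larger than path_mtu, headers included."""
    max_payload = path_mtu - header_bytes
    if max_payload <= 0:
        raise ValueError("path MTU leaves no room for payload")
    sizes = []
    remaining = payload_bytes
    while remaining > 0:
        chunk = min(remaining, max_payload)
        sizes.append(chunk + header_bytes)  # on-the-wire IP datagram length
        remaining -= chunk
    return sizes


# 536 bytes of data, 288-byte path MTU, 52 bytes of headers per packet.
resent = packet_sizes(536, 288, 52)
```

The result, `[288, 288, 116]`, corresponds to two full 236-byte segments followed by a final 64-byte segment.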

The PMTU discovery process is one of the only ways TCP explicitly attempts to adapt its segment size after a connection has started, at least when large amounts of data are transferred. The size of a segment can affect the overall throughput performance, as can the window size. We discuss how these affect overall performance in Chapter 15.
