Each router in the Figure 6.13 setup forms an adjacency with every other router, effectively forming a full mesh. So far, so good. Now consider the following scenario: the ATM virtual circuit between Seattle and Los Angeles breaks for some reason, as indicated by the dotted gray line. Both Seattle and LA notice the break and each generates a new LSP (incrementing the Sequence Number and removing the adjacency between Seattle and LA). The new LSP is sent according to the flooding rule on all interfaces where there are adjacencies in the Up state. Thus, Seattle and LA each send four copies (gray arrows) of their new LSP into the network.

Next, the four other routers receive the two LSPs (white arrows). Here is where the trouble starts: because the flooding algorithm is so simple, a router does not know that all the other routers have already been updated and know that the adjacency between Seattle and LA is down. What follows is a multiplication of LSPs. Each of the four routers receives the two new LSPs and re-sends each of them on all logical interfaces except the one on which it received that LSP (gray arrows). The result is that 32 LSPs are sent for a single broken ATM VC (4 routers × 2 LSPs × 4 interfaces).

This does not sound too stressful for a modern router's control plane; however, consider a network of not six routers but 100. The problem is that the number of LSPs grows with the square of the number of routers, or in mathematical terms O(N²). Thus, a single failing VC in the network may generate up to 10,000 LSP updates, all flying around in a relatively short amount of time. This is an awful lot of stress for the control plane of a router, no matter how powerful.

Flooding 167

[FIGURE 6.13. ATM overlay networks and flooding stress. Routers shown: Seattle, Los Angeles, San Francisco, New York, Atlanta, Chicago. Legend: tx LSP, rx LSP.]

Things get even worse with another failure scenario: what if not a single VC but an entire router goes down (due to a reboot, for example)? The number of LSPs then grows as O(N³). In a full mesh of 100 routers, this means that a single failing router generates up to 1,000,000 LSPs in a short amount of time. Ironically, 99 per cent of the LSPs hold information that is already known by some other neighbour. So what can be done to mitigate the dark side of flooding? The answer is discussed in the next section.

6.4.2 Mesh-Groups

Let's go back to the basic flooding algorithm and change it a little. Now the rule is: do not send out a received LSP on all the links where we have an adjacency in the Up state. Rather, send out the LSP on only some of these links. Figure 6.14 shows a router that is not sending out an LSP on all of the possible links. Instead, some links have been pruned off the flooding topology. The result is that all routers still see LSP updates, but the excessive multiplication of LSPs is avoided. The official name for this functionality is Mesh-Groups, and it is documented in RFC 2973. Mesh-Group pruning is done based on the topology of the network and is not automatic.

There are two basic concepts behind Mesh-Groups. The first concept is blocking an interface entirely, as shown in Figure 6.14. Here, one interface or a set of interfaces is removed from the flooding list. This is very straightforward to configure in IOS and JUNOS software, as shown in the following two configuration snippets. Both vendors share the same spirit in their implementation of the Mesh-Group functionality: in both implementations, LSP flooding is an interface property. In IOS, you configure everything at the physical/logical interfaces, prepended by the keyword isis.
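To get a feel for the numbers in the failure scenario, and for what pruning links from the flooding topology buys, the flooding rule can be simulated in a few lines of Python. This is a simplified sketch, not any vendor's implementation: it assumes lock-step flooding rounds, ignores acknowledgements and retransmissions, and treats a blocked interface as one on which a router never floods an LSP. The names flood, full_mesh and blocked, and the choice of which links to block, are invented for this illustration.

```python
from itertools import combinations


def full_mesh(names):
    """Build an adjacency map in which every router neighbours every other."""
    adj = {name: set() for name in names}
    for a, b in combinations(names, 2):
        adj[a].add(b)
        adj[b].add(a)
    return adj


def flood(adj, origin, blocked=frozenset()):
    """Flood one LSP from origin; return (copies_sent, routers_reached).

    blocked holds directed (router, neighbour) pairs, i.e. interfaces on
    which the router does not flood (an assumption of this model).
    """
    have = {origin}
    pending = [(origin, n) for n in adj[origin] if (origin, n) not in blocked]
    sent = 0
    while pending:
        sent += len(pending)
        # Group the copies sent in this round by receiving router.
        arrivals = {}
        for src, dst in pending:
            arrivals.setdefault(dst, set()).add(src)
        pending = []
        for dst, senders in arrivals.items():
            if dst in have:
                continue  # duplicate copy: counted as sent, not re-flooded
            have.add(dst)
            # Re-flood on every interface except those the LSP arrived on.
            for n in adj[dst] - senders:
                if (dst, n) not in blocked:
                    pending.append((dst, n))
    return sent, have


cities = ["Seattle", "Los Angeles", "San Francisco",
          "New York", "Atlanta", "Chicago"]
mesh = full_mesh(cities)
# The Seattle to Los Angeles VC has failed.
mesh["Seattle"].discard("Los Angeles")
mesh["Los Angeles"].discard("Seattle")

plain = sum(flood(mesh, r)[0] for r in ("Seattle", "Los Angeles"))

# Hypothetical pruning: the four transit routers block flooding among themselves.
transit = cities[2:]
blocked = {(a, b) for a in transit for b in transit if a != b}
pruned = sum(flood(mesh, r, blocked)[0] for r in ("Seattle", "Los Angeles"))

print(plain, pruned)                        # 40 16
print(flood(full_mesh(range(100)), 0)[0])   # 9801
```

The 40 copies in the unpruned case are the 8 originals plus the 32 re-flooded copies counted in the text. With the hypothetical pruning, all six routers still receive both LSPs, but only 16 copies are sent. The last line shows that one LSP flooded through a full mesh of 100 routers already costs 99² = 9801 copies in this model, in line with the "up to 10,000" estimate.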
In JUNOS software, all the logical interfaces can be referenced directly under the protocols isis interface configuration branch, which is very practical, as the relevant information is then in one place.

168 6. Generating, Flooding and Ageing LSPs

[FIGURE 6.14. Mesh-Group blocks remove certain links from the flooding topology. Legend: tx LSP, rx LSP, pruned "flooding" links.]

In IOS, LSP flooding can be reduced using the isis mesh-group blocked configuration command in interface-configuration mode, as shown in the following:

IOS configuration

    London# show running-config
    […]
    interface atm 1/2.1
     ip router isis
     isis mesh-group blocked
    […]

In JUNOS software the configuration statement is very similar. The first flavour of Mesh-Groups can be enabled by use of the mesh-group blocked configuration directive under the protocols isis interface <interface-name> configuration hierarchy, as shown in the following:

JUNOS software configuration

    hannes@Frankfurt> show configuration
    […]
    protocols {
        isis {
            interface at-4/0/0.200 {
                mesh-group blocked;
            }
        }
    }
    […]

You may ask why the word Group is contained in Mesh-Group. So far we have not configured a Group number. What is the Group number related to? It is related to the refined version of Mesh-Groups, where flooding is not turned off entirely for an interface. Some LSPs are still sent.

How is this second flavour of Mesh-Groups configured? First, all the logical interfaces on an IS-IS router have to be organized in groups of interfaces. In Figure 6.15 you can see that the first three interfaces have been grouped together in Mesh-Group #11 and the second three interfaces have been grouped together in Mesh-Group #47. Once an LSP is received over a logical interface (white arrow), the IS-IS router first determines the Mesh-Group number that the receiving interface belongs to. In our example the receiving interface belongs to Mesh-Group #11.
When this LSP is then flooded onward, the router does not re-flood it on the other interfaces belonging to that same group (Mesh-Group #11). As specified in RFC 2973, the LSP is relayed only on interfaces outside the receiving group, here those in Mesh-Group #47, because the members of a fully meshed group have already received the LSP directly from its sender. This removes the multiplicative effect of basic flooding.

The second flavour of Mesh-Groups can be configured in a similar way in IOS and in JUNOS software. The only difference is that a Mesh-Group number replaces the keyword blocked. Like the mesh-group blocked command, it is configured in interface-configuration mode.

In IOS, LSP flooding can be reduced according to the second flavour of Mesh-Groups using the isis mesh-group <group-number> configuration command in interface-configuration mode, as shown in the following:

IOS configuration

    London# show running-config
    […]
    interface atm 1/2.1
     ip router isis
     isis mesh-group 11
    interface atm 1/2.2
     ip router isis
     isis mesh-group 11
    interface atm 1/2.3
     ip router isis
     isis mesh-group 11
    […]

In JUNOS software, the Mesh-Group number likewise replaces the blocked statement. The second flavour of Mesh-Groups can be enabled by use of the mesh-group <group-number> configuration directive under the protocols isis interface <interface-name> configuration hierarchy, as shown in the following:

JUNOS software configuration

    hannes@Frankfurt> show configuration
    […]
    protocols {
        isis {
            interface at-4/0/0.100 {
                mesh-group 11;
            }
            interface at-4/0/0.101 {
                mesh-group 11;
            }
            interface at-4/0/0.102 {
                mesh-group 11;
            }
        }
    }
    […]

[FIGURE 6.15. Mesh-Groups: an LSP received on an interface of one Mesh-Group is relayed only to interfaces outside that Mesh-Group. Labels: Mesh group #11, Mesh group #47. Legend: tx LSP, rx LSP.]

Mesh-Groups help to reduce the flooding explosion in densely meshed environments. However, keep in mind that flooding is a necessity to get information across the internal network. In a sense, it is "too much" flooding that causes harm. However, a "too little" flooding strategy can cause harm in a different way.
Thus, be very careful when setting up Mesh-Groups. Mesh-Groups must not be so "tight" that they result in desynchronized link-state databases. In Chapter 8 you will learn about the impact of desynchronized link-state databases and what can be done to avoid them. At the end of the chapter, a refinement of ISO 10589 is presented to make sure that routers that have been accidentally pruned off the flooding topology (due to a wrong Mesh-Group configuration, for example) still receive good information for synchronization.

Although Mesh-Groups must be hand-configured by a network administrator, it is easy to determine whether Mesh-Groups are needed by looking at the statistics that IOS and JUNOS software provide. In IOS, the relevant IS-IS statistics can be displayed using the show clns traffic command, as shown in the following:

IOS command output

    Amsterdam# show clns traffic
    […]
    IS-IS: Time since last clear: never
    IS-IS: Level-1 Hellos (sent/rcvd): 115/19
    IS-IS: Level-2 Hellos (sent/rcvd): 120/14
    IS-IS: PTP Hellos (sent/rcvd): 0/0
    IS-IS: Level-1 LSPs sourced (new/refresh): 10/0
    IS-IS: Level-2 LSPs sourced (new/refresh): 14/0
    IS-IS: Level-1 LSPs flooded (sent/rcvd): 2/2
    IS-IS: Level-2 LSPs flooded (sent/rcvd): 3/2
    IS-IS: LSP Retransmissions: 0
    IS-IS: Level-1 CSNPs (sent/rcvd): 0/2
    IS-IS: Level-2 CSNPs (sent/rcvd): 3/0
    IS-IS: Level-1 PSNPs (sent/rcvd): 0/0
    IS-IS: Level-2 PSNPs (sent/rcvd): 0/0
    IS-IS: Level-1 DR Elections: 3
    IS-IS: Level-2 DR Elections: 2
    IS-IS: Level-1 SPF Calculations: 7
    IS-IS: Level-2 SPF Calculations: 7
    IS-IS: Level-1 Partial Route Calculations: 0
    IS-IS: Level-2 Partial Route Calculations: 0
    IS-IS: LSP checksum errors received: 0
    IS-IS: Update process queue depth: 0/200
    IS-IS: Update process packets dropped: 0
    […]

In every case, a big disparity between the LSPs being sent and the LSPs being received is an indication that there is excess flooding in the network that needs to be controlled via Mesh-Groups.

In JUNOS software, you can display the global IS-IS statistics using the show isis statistics command. Watch for a disparity between LSPs being sent and received:

JUNOS software command output

    hannes@Frankfurt> show isis statistics
    IS-IS statistics for Frankfurt:
    PDU type   Received  Processed  Drops      Sent  Rexmit
    LSP          220201     220201      0    152846     381
    IIH         5640823    5640823      0   3762071       0
    CSNP        5486953    5486953      0   9893412       0
    PSNP          32766      32766      0    192857       0
    Unknown           0          0      0         0       0
    Totals     11380743   11380743      0  14001186     381

    Total packets received: 11380743 Sent: 14001567
    SNP queue length: 0 Drops: 0
    LSP queue length: 0 Drops: 0
    SPF runs: 121371
    Fragments rebuilt: 336
    LSP regenerations: 151
    Purges initiated: 0

Mesh-Groups solved a big problem in the ATM and Frame Relay overlay networks of the mid-1990s. Today, however, Mesh-Groups are of limited use because ATM and FR transport networks connecting routers have gone away for the most part. Routers are now typically interconnected by packet-over-SONET/SDH links in a sparsely meshed fashion. A typical core router these days has on average no more than four or five interfaces facing other core routers. In these environments, Mesh-Groups are a nice tuning capability, but not the necessity they were only a few years ago, when networks were melting down in the absence of a sound LSP flooding scheme.

6.5 Network-wide Purging of LSPs

The flooding of LSPs updates the network with the most accurate state information. The link-state database is therefore continually increasing as new or updated information is added to it.
If a link goes down, a new LSP is issued. When it comes back up, another new LSP is issued. So far there have been no negative LSPs that make the database shrink in size. But what if IS-IS wants to remove a router from the distributed link-state database in all of the other routers in the network? There is always the option to wait until the LSP ages out, but that can take up to 65,535 seconds (18 hours, 12 minutes). For certain events, such as router removal, IS-IS needs the capability to issue a negative LSP update. This negative LSP, or purge LSP, exists and is a "crippled" version of the original LSP. All the purge LSP contains is the LSP header without any further information. The Lifetime and the Checksum fields of the purge LSP header are set to zero to indicate that this is a purge. This negative LSP update, which is called a network-wide purge, is used for a variety of events. One of these events is DIS election.

6.5.1 DIS Election

On IS-IS broadcast links there is at least one router performing a special function. This IS-IS router is called the Designated Intermediate System (DIS). The role of the DIS was first discussed in Chapter 5. Each DIS borrows an ID that is unique across the network from the LAN on which it is the DIS. The DIS floods that LAN-ID throughout the network to tell other routers that there is connectivity to the LAN. Now, if the DIS changes (is re-elected) due to events such as a higher DIS election priority or the time-out of the old DIS, then the new DIS must generate a new LAN-ID and flood it throughout the network. The has-been DIS needs to remove the old LAN-ID from the network to ensure that it does not lead to corrupt network information. Figure 6.16 shows the chain of LSPs that are generated to accomplish this. In order to remove the stale LSP of the former DIS, the old DIS generates an LSP with the sequence number incremented by one, but with the Checksum and Lifetime set to zero.
Each router that receives this purge LSP removes the referenced LSP-ID from its link-state database.

[FIGURE 6.16. At DIS re-election the old pseudonode LSP gets purged. Labels: Local LAN, Old pseudonode, Old DIS, New pseudonode.]

6.5.2 Expiration of LSPs

Whenever a router ages out an LSP whose Lifetime has become zero, it needs to tell the other routers that the LSP has been aged out. Recall that each router has an internal clock and that those clocks are subject to drift. At the same time, all the routers in a given IS-IS level fundamentally rely on the fact that their link-state databases are synchronized with all others. So, for further robustness in the face of clock drift, the first router that detects that an LSP's Lifetime has gone to zero initiates a network-wide purge of that expired LSP.

Lifetime expiration of LSPs is common for routers that have been removed from the network for one reason or another. Recall that under normal conditions each LSP gets refreshed by the Originator before it expires, and its Lifetime field should therefore never count down to zero. This should only happen during the purge of an LSP.

If a router purges an LSP from the link-state database, the LSP is not removed immediately. Instead, the LSP is retained for a ZeroAgeLifetime of 60 seconds. Keeping the purged LSP for 60 seconds ensures that it is not re-learned, for instance through an adjacency that has been Down and is now transitioning to Up again. You can recognize a purged LSP that is still in the database by its Lifetime value, which is shown in brackets. This is similar to the accounting world, where red numbers are shown in brackets as well. The user interfaces are essentially showing you a zombie: an LSP that is already dead but is kept visible to help with troubleshooting.
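The retention behaviour described above can be sketched as a small Python model. This is an illustration under stated assumptions, not the code of any IS-IS implementation: the class and method names (Lsdb, install, purge, tick, show_lifetime) are hypothetical, time advances only when tick is called, and the bracketed value is assumed to be the remaining retention time counting down from the 60-second ZeroAgeLifetime.

```python
from dataclasses import dataclass

ZERO_AGE_LIFETIME = 60  # seconds a purged LSP is retained (value from the text)


@dataclass
class LspEntry:
    lsp_id: str
    seq: int
    lifetime: int          # remaining lifetime in seconds
    purged: bool = False
    zero_age: int = 0      # remaining retention time once purged


class Lsdb:
    """Toy link-state database that keeps purged LSPs for ZeroAgeLifetime."""

    def __init__(self):
        self.entries = {}

    def install(self, lsp_id, seq, lifetime):
        self.entries[lsp_id] = LspEntry(lsp_id, seq, lifetime)

    def purge(self, lsp_id):
        # Mark the LSP as dead but keep its header in the database.
        entry = self.entries[lsp_id]
        entry.purged = True
        entry.lifetime = 0
        entry.zero_age = ZERO_AGE_LIFETIME

    def tick(self, seconds=1):
        for lsp_id in list(self.entries):
            entry = self.entries[lsp_id]
            if entry.purged:
                entry.zero_age -= seconds
                if entry.zero_age <= 0:
                    del self.entries[lsp_id]   # retention is over
            else:
                entry.lifetime -= seconds
                if entry.lifetime <= 0:
                    self.purge(lsp_id)         # expiry starts a purge

    def show_lifetime(self, lsp_id):
        # Purged LSPs are displayed with the value in brackets.
        entry = self.entries[lsp_id]
        return f"({entry.zero_age})" if entry.purged else str(entry.lifetime)


db = Lsdb()
db.install("New-York.02-00", seq=0x2FB1, lifetime=5)
db.tick(5)                                  # Lifetime reaches zero
print(db.show_lifetime("New-York.02-00"))   # (60)
db.tick(12)
print(db.show_lifetime("New-York.02-00"))   # (48)
db.tick(48)
print("New-York.02-00" in db.entries)       # False
```

A real implementation would also flood the purge to its neighbours and compare sequence numbers before accepting one; both are left out here to keep the retention logic visible.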
IOS command output

    Amsterdam# show isis database
    […]
    IS-IS Level-1 Link State Database:
    LSPID             LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    New-York.02-00    0x00002fb1   0x6f71        (23)          1/0/0
    […]

JUNOS software command output

    hannes@New-York> show isis database
    IS-IS level 1 link-state database:
    LSP ID            Sequence  Checksum  Lifetime  Attributes
    New-York.02-00      0x2fb1    0x6f71      (48)  L1 Attached
    […]
    4 LSPs

Typically you do not see many purged LSPs in your database, as this is a very rare case (DIS routers do not change very often). However, if you see a lot of bracketed LSPs, or one LSP that always shows a bracketed Lifetime, then a harmful event such as a flood-purge storm is probably raging because of duplicate System-IDs.

6.5.3 Duplicate System-IDs

Whenever a router receives an LSP that contains its own System-ID as Originator, and the router is sure that it did not generate this LSP, the router must assume that there is another router on the network that is configured with a duplicate System-ID. All the receiving router can do is log this event and generate a purge LSP. The other router will most likely try to re-originate this LSP with a higher Sequence Number. Of course, this purge process needs to be carefully paced. Otherwise a flood-purge storm will start to rage as the two routers continually try to update and purge each other's wrong LSP. You will see in the next section how these storms can be prevented. The LSP is purged because duplicate System-IDs are also an obstacle to a clean SPF calculation. This ensures that the network itself stays clean.

6.6 Flow Control and Throttling of LSPs

In link-state routing protocols, the implementer needs to make an effort not to overwhelm neighbours with excessive LSP updates. Excessive LSPs might churn the network. In typical transport protocols such as TCP there is a built-in feedback mechanism that makes the sender slow down if the receiver feels overwhelmed.
This is called flow control. However, virtually all IGPs (including IS-IS) have no way to tell a neighbour that the router is busy and so make the neighbouring routers throttle down their LSP transmissions. Why the protocol designers did not address flow control in the IS-IS specification is beyond the scope of this book. But this lack of flow control means that an IS-IS router has to carefully pace (spread out in time) LSPs toward a neighbour. Good IS-IS implementations have a lot of built-in throttles that keep the IS-IS router well behaved, even when the network is in a transient state and several LSP updates are flying around. Additionally, there are limits on how frequently a router can originate LSP updates.

A router not only has to take care that it does not overwhelm its directly connected neighbours; it also needs to take care that it does not overwhelm the routers beyond the immediately adjacent neighbours. Recall that all routers in a given IS-IS level need to dedicate some resources (such as CPU cycles, bandwidth and so on) to processing and relaying LSPs farther across the network. So let's be nice to these routers and not overload them, as we need them to distribute reachability information of all types.

Most modern implementations of the IS-IS protocol support a variety of control knobs that make an IS-IS router slower instead of faster. Realize that going slower during transient conditions or LSP storms is the only option a router has left if it is to continue running. There are a couple of big "Must-Nots" for an implementation of IS-IS.

We must not trash our neighbours. IS-IS Hellos must always be sent. If a router does not send IS-IS Hellos in time, the adjacency times out. Losing an adjacency in transient situations contributes even more LSPs to a network that is already shaky to begin with.
We must not forget to acknowledge the LSPs of a neighbour. Even when a router is under pressure in the form of extreme packet loads, not acknowledging an LSP update means that after five seconds the LSP will be retransmitted. So it is much better to acknowledge the LSP the first time, before it gets retransmitted. A retransmission consumes the resources of the neighbouring router as well as the receiving router, because the LSP has to be retransmitted by the neighbour and re-processed on the receiving side.

So if making things slower is the only thing a router can do, exactly what kind of events need to be made slower or throttled? The important events to throttle are in the areas of:

• The transmission of LSPs on an interface
• The frequency of originating (generating) LSPs per router
• Retransmissions on an interface

Each of these is discussed in the following sections.

6.6.1 LSP-transmit-interval

The LSP transmit interval is one form of pacing that was originally mentioned in ISO 10589. The specification says that an implementation of IS-IS should make sure not to send more than 30 LSPs per second on a given broadcast link. Both IOS and JUNOS software extend this requirement so that LSPs are paced on every IS-IS interface type (broadcast and point-to-point). You can tweak this throttling timer in both JUNOS software and IOS.

In IOS, LSP throttling can be configured using the isis lsp-interval <time> configuration command in interface-configuration mode. The time is a constant expressed in milliseconds (ms). The default value is 33 ms. The following example sets the LSP pacing so as not to exceed 20 LSPs per second (a pacing interval of 50 ms means 20 LSPs per second).
IOS configuration

    London# show running-config
    […]
    interface atm 1/2.1
     ip router isis
     isis lsp-interval 50
    […]

In JUNOS software, the throttling of LSPs can be configured by use of the lsp-interval <time> configuration directive under the protocols isis interface <interface-name> configuration hierarchy. The default value is 20 ms, which generates 50 LSPs per second. This means that JUNOS software by default exceeds the original 30 LSP-per-second limit of the specification, but that limit is fairly old. Modern routers should easily handle 50 LSPs per second. This example sets the JUNOS software value to 50 ms (20 LSPs per second), which is within the specification limit.
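Separately from the vendor configuration, the relationship between a pacing interval and the resulting LSP rate can be checked with a short Python sketch. The function names lsps_per_second and send_times are invented for this illustration, and the model simply assumes that one LSP is sent per interval on an interface; it does not reproduce how either vendor schedules transmissions internally.

```python
def lsps_per_second(interval_ms):
    """Maximum LSP rate on an interface for a given pacing interval."""
    return 1000 / interval_ms


def send_times(n_lsps, interval_ms, start_ms=0):
    """Transmit times (in ms) for a burst of LSPs paced one per interval."""
    return [start_ms + i * interval_ms for i in range(n_lsps)]


# Intervals mentioned in this section: 33 ms, 50 ms and 20 ms.
for interval in (33, 50, 20):
    print(interval, "ms ->", round(lsps_per_second(interval), 1), "LSPs/s")

# A burst of 1,000 queued LSPs paced at 50 ms: time of the last transmission.
print(send_times(1000, 50)[-1] / 1000, "s")   # 49.95 s
```

The last line shows the price of pacing: at 50 ms per LSP, draining a backlog of 1,000 LSPs on one interface takes roughly 50 seconds in this model, which is the trade-off between protecting a neighbour and converging quickly.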