SR-TE BGP EPE for Unified MPLS
Load multi-domain.init.cfg. You may need to load bootflash:blank.cfg and commit replace first.
There are two separate ISIS domains. eBGP is used between R3, R4, R5 and R6. Using BGP EPE with SR PCE, achieve an end-to-end LSP between R1 and R7.
Use R10 as the PCE. Do not distribute link state on R10 under the IGP processes.
Before making any changes, IPv4 reachability exists between the PEs, but traffic does not follow an end-to-end LSP. This is because each domain is advertising its PEs’ loopbacks via BGP and redistributing this into the IGP.
BGP EPE allows for an elegant method to provide inter-domain “unified MPLS” style end-to-end LSPs. As in the previous lab, we simply need to enable egress-engineering under each neighbor:
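As a rough sketch (the neighbor address and ASNs are placeholders, not the lab’s actual addressing), the ASBR-side configuration looks something like this:

router bgp 100
 ! placeholder eBGP peer address on the inter-domain link
 neighbor 10.3.5.5
  remote-as 200
  egress-engineering
  address-family ipv4 unicast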
However, the egress-engineering command does not enable MPLS on the interface towards the eBGP peer. This is because you might be using BGP EPE simply for traffic engineering the egress PE’s forwarding decision. You may not want to actually run MPLS over the link to the eBGP peer. But when we are using BGP for inter-domain MPLS, we do need to use MPLS. With BGP IPv4/LU, MPLS is automatically enabled, as the AFI/SAFI itself requires forwarding of labeled traffic. For BGP EPE we can simply enable MPLS on the interface using mpls static. Another option is enabling the interface under mpls traffic-eng. Note that using “router bgp mpls activate” does not work, likely because we are not running any labeled AFIs with the eBGP peer.
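For illustration, either of the following enables MPLS forwarding on the ASBR-facing interface (the interface name is a placeholder):

mpls static
 interface GigabitEthernet0/0/0/2
!
! or, alternatively:
mpls traffic-eng
 interface GigabitEthernet0/0/0/2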
Next, we need to use a PCE to calculate the inter-domain paths. The PCE must receive all topology information: both ISIS domains, and all BGP EPE information. This is done using BGP-LS. The task instructs us not to use distribute link-state on the PCE itself. Instead, we can simply do this on the BGP edge routers. The IGP topology information will automatically be injected into BGP-LS locally on the router.
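On each BGP edge router, something along these lines feeds the local IGP topology into BGP-LS (the process name is a placeholder; the instance-id of 101 is illustrative, chosen to match the 0x65 we will see in the NLRI later):

router isis 1
 distribute link-state instance-id 101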
BGP-LS is another AFI/SAFI. The AFI=link-state and SAFI=link-state, so you use address-family link-state link-state. This address-family is used to carry IGP/LS topology information in the form of BGP updates.
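A minimal sketch of the BGP-LS session from an ASBR toward R10 (addresses and ASN are placeholders; whether the session is iBGP or eBGP depends on which domain the ASBR sits in):

router bgp 100
 address-family link-state link-state
 !
 ! placeholder for R10's loopback
 neighbor 10.0.0.10
  remote-as 100
  update-source Loopback0
  address-family link-state link-state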
BGP-LS has three NLRI types:
Type 1 - Node [V]
Contains the hostname, area ID, RID, SRLB, SRGB, algos supported, etc. (information about the node)
Code [V] is for vertex
In graph theory, a vertex is a node in the graph
Type 2 - Link [E]
Contains the local/remote RID, IGP metric, admin group, max BW, TE metric, Adj SID etc. (information about the link)
Code [E] is for edge
In graph theory, an edge connects exactly two vertices
Type 3 - IPv4 Prefix [T]
Contains the prefix, metric, flags, and SID index
Code [T] is likely for Topology Prefix
The IGP topology is transcoded into a common BGP-LS format. The elegant aspect of this solution is that both OSPF and ISIS data are identically encoded into BGP-LS NLRI. You will see next that the NLRI acts as a “key” for an entry in the TED. The NLRI contains the minimum information that uniquely identifies an entry. For example, a link is identified by the protocol and topology ID, the node on either end of the link, and the link identifying information, such as ifIndex or IP addresses, in case multiple links exist between the two nodes. All attributes of the entry, such as the IGP/TE metric, link affinity, SID information, etc., are carried in the BGP-LS attribute, not the NLRI.
Let’s examine a type 1 NLRI, for example, for R1 (0000.0000.0001). In the NLRI you first see a [L2] for ISIS L2, and a [I0x65] for instance ID 101. The protocol and instance ID together make an entry unique in a given IGP. This is why a unique instance ID is necessary per protocol. However, since the protocol itself is part of the entry, you can technically assign the same instance ID to two separate IGPs on the same node, e.g. OSPF 1 and ISIS 1. The XR parser allows you to do this, but will give you a commit error if you try to assign the same instance ID to two separate instances of the same protocol (i.e. ISIS 1 and ISIS 2). (Note that in our specific topology, we can use the same instance ID for both ISIS domains without any real issues, because all nodes belong to only one of the two IGPs, with the exception of R9 and R10. However, nodes R9 and R10 do not have to be in the LSP path).
Next in the NLRI we see a BGP-LS ID of 0 (XR does not use BGP-LS ID), and the SysID of the node. Below in the LS attribute, you can see MSD (max SID depth), node name, area, TE RID, etc. All of this information is translated from R1’s LSP into a BGP-LS Update. As the IGP topology changes, BGP-LS is automatically updated so that the BGP-LS feed always accurately represents the current IGP topology.
Next we’ll look at a type 2 NLRI, for example the link between R1 and R3. The BGP NLRI is extremely complex, but makes more sense when you notice that it is broken down into three parts: the source node identifier (which contains the exact information from the node’s type 1 NLRI), the remote node identifier, and a link identifier. The fact that each node’s type 1 NLRI is present in this type 2 NLRI allows the PCE to place the link between these two nodes in its condensed topology.
In the BGP-LS attribute, you see all the attributes of the link, such as TE metric and Adj SID.
Finally we’ll look at a type 3 NLRI, for example R1’s Lo1 prefix. This shows the prefix SID (index 1), flags, and metric. Notice that this NLRI again contains the node’s type 1 NLRI, allowing this prefix to be linked to the node object. The PFX-SID flags value (0x40) means the N-flag is set. The 0 in 40/0 indicates algo 0. Likewise, the “extended IGP flags 0x20” indicates that the N flag is set.
For all of these NLRI, you can use the detail keyword to get a breakdown of the NLRI fields. For example, using the detail keyword on type 2 NLRI proves very useful:
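Assuming the usual XR syntax, the NLRI string from the table view can be pasted into the show command and followed by the detail keyword, along the lines of:

show bgp link-state link-state <type-2 NLRI string> detail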
Additionally, the BGP-EPE information is automatically encoded in BGP-LS. For example, we can see the link between R3 and R5. All of the complexity is within the NLRI “prefix” itself: the ASNs, instance number (BGP uses instance 0 to represent the “global” instance), and link IPs. In the Link State attribute, we simply see the peer SID. A metric isn’t advertised for EPE links.
Note that each EPE node must use a BGP RID equal to its IGP TE-RID. This allows the PCE to collapse all topology information into a single, global topology. If the BGP RID and TE-RID are different, then the PCE will not know that the BGP node connects to that IGP.
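For example (the address is a placeholder), pinning an ASBR’s BGP RID to the same loopback address used as its TE router-id:

router bgp 100
 bgp router-id 10.0.0.3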
The PCE now has a fully populated SR-TED. The next step is to configure the PEs to request PCE path computation for the existing SR-TE policy. (The init file had pre-existing SR-TE policies on R1 and R7).
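A minimal sketch of the PE-side configuration, using placeholder policy names, colors, and addresses (10.0.0.10 stands in for R10’s loopback):

segment-routing
 traffic-eng
  pcc
   pce address ipv4 10.0.0.10
  !
  policy TO_R7
   color 100 end-point ipv4 10.0.0.7
   candidate-paths
    preference 100
     dynamic
      pcep
!
! and on R10 itself, enabling the PCE server:
pce
 address ipv4 10.0.0.10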
The PCE successfully computes a path between R1 and R7:
Note above that the accumulated metric is 20, which at first glance seems too low. The cost from R1 to R3 is 10, the cost from R3 to R5 is 0, and the cost from R5 to R7 is 10. BGP EPE links are always considered to have a metric of 0.
We now have a working end-to-end LSP that is inter-domain.
The beauty of this is that the inter-domain end-to-end LSP reachability is also fully TE-capable! This means that we can use the PCE to calculate paths that minimize latency, avoid link colors, etc. This is only possible in a very limited sense with RSVP-TE. In RSVP-TE, we can use a sort of “hack” on an IOS-XE ASBR to run the link as a passive TE interface. Then, the headend can define a loose ERO, which requests each ASBR to do part of the path calculation. You cannot use end-to-end constraints or minimize a particular metric. Additionally, each section of the LSP is only optimized for that one domain. (Each ASBR calculates the best TE metric towards the next reachable hop in the ERO, not the complete end-to-end best metric path). All of these downsides are solved by using SR-TE with a PCE.
Additionally, notice the beauty of using BGP-EPE over BGP-LU. We did not have to do complex next-hop self behavior and add state by signaling a label with PE prefixes. We can simply use basic BGP IPv4/unicast. The ASBRs allocate a BGP peer SID for each eBGP neighbor, which does not break the LSP. The only complexity is that we must manually enable MPLS on these interfaces.
Any IGP topology changes are immediately signaled to BGP-LS. For example, let’s bring down the link between R1 and R4.
Using the command show pce verification, we can see that a topology update event occurred:
The BGP NLRI for the R1-R4 link has been removed from BGP-LS, and is reflected immediately in the SR-TED. The SR-TED shows R1 only has a single link now. Note that the below show command uses show pce ipv4 top for variety. This is fed by the SR-TED, so you can also use show segment-routing traffic-eng ipv4 top.
Let’s bring the R1-R4 link back up and examine what happens when the R3-R5 link is brought down:
We get a crazy 5-deep label stack path:
This is because all of the BGP EPE “links” have a metric of 0. So this figure-8 path is preferred over simply going from R6 to R7.
One way to fix this is to use hopcount as the metric type for the SR-TE policy, since the EPE links have an IGP and TE metric of 0 but still count as hops.
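For example, using the same placeholder policy as before, the dynamic candidate path can be told to minimize hopcount instead of the IGP metric:

segment-routing
 traffic-eng
  policy TO_R7
   candidate-paths
    preference 100
     dynamic
      pcep
      metric
       type hopcount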
BGP-LS uses the same BGP mechanisms for bestpath selection and path validation. Only a single BGP-LS “route” is chosen and imported into the SR-TED among multiple identical BGP-LS NLRI from different peers. You can use standard BGP PAs to influence the bestpath decision if needed, although you shouldn’t have to, as this should just be duplicate information.
Additionally, you can run into an issue where the BGP-LS NLRI nexthop is not reachable. In that case, the NLRI will not be valid, and will not be imported into the SR-TED.
An alternative to BGP-EPE for ipv4/uni on the ASBR links is to use BGP-LU and set the label-index. This will propagate the loopbacks with their correct prefix SID index between the IGP domains. While this works to produce an end-to-end LSP, this LSP cannot be traffic engineered. The ability to use TE is the advantage of using BGP ipv4/uni with EPE instead. You produce a per-eBGP peer SID at each ASBR, allowing the PCE to create traffic engineered policies that meet given constraints/metric objectives. With simple BGP-LU and the SR label-index attribute, all you get is a plain end-to-end LSP that is inter-domain.
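For reference, a rough sketch of that BGP-LU alternative (names, ASNs, addresses, and the SID index are placeholders): the label-index is set where the PE loopback is injected into BGP, and the inter-domain sessions carry ipv4 labeled-unicast.

route-policy SID-R1
 set label-index 1
end-policy
!
router bgp 100
 address-family ipv4 unicast
  network 10.0.0.1/32 route-policy SID-R1
  allocate-label all
 !
 ! placeholder eBGP peer on the inter-domain link
 neighbor 10.1.3.3
  remote-as 200
  address-family ipv4 labeled-unicast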