Saturday, April 12, 2014

MPLS EXP-based QoS and QoS Groups

This topic is a bit of a stretch for the R&S lab, really being more oriented towards Service Provider, but I wanted to talk about it anyway.

So what does your MPLS carrier do with those QoS settings you pass them?
It's unlikely they're queuing at congestion spots in their network based on the DSCP values you set.

You've probably heard about the EXP bits in the MPLS label.  These are used "for QoS", but no one really seems to know how.  And there are only 3 bits, while DSCP uses 6, so what's the story?

Here's our topology:



We'll be setting DSCP values on H1 and manipulating them, or their MPLS equivalents, on the way to H2.

Without any special config, let's see how this works right out of the box. Of important note, I have null-routed H1's IP address on H2. This makes it easier to read the output from "debug mpls packet", because we're only seeing a one-way flow instead of a two-way flow.

H1#ping
Protocol [ip]:
Target IP address: 192.168.1.6
Repeat count [5]: 2
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface:
Type of service [0]: 184
Set DF bit in IP header? [no]:
Validate reply data? [no]:
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)

(Remember, we were not expecting responses)
So we sent this as EF traffic (TOS 184, above).  Any hypothesis on what's seen in transit?

P2#debug mpls packet
MPLS packet debugging is on
P2#
*Mar  1 09:20:04.473: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19
*Mar  1 09:20:04.473: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
P2#
*Mar  1 09:20:06.437: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19
*Mar  1 09:20:06.437: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19

CoS=5 means the EXP bits are set to 5.  The default behavior on a PE is to map the IPP (IP Precedence) onto the EXP bits, which line up nicely since both are three bits.  Refer to the ToS value above - 184.  That's the full 8-bit ToS byte; in binary it's 10111000.  Chop off the last two bits, which aren't part of DSCP, for the 6-bit DSCP value of 101110, and you have 46 (which I suspect you recognize as "EF").  Knock off everything except the first three bits - 101 - for the IPP, and you have?  Five.  Hence, EXP becomes 5 as well. This default feature is known as "ToS Reflection".
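The bit arithmetic above can be checked with a few lines of Python. This is only a sketch of the math, not anything the router runs, and the function name split_tos is made up for illustration.

```python
def split_tos(tos: int) -> dict:
    """Break an 8-bit IP ToS byte into the fields discussed above."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS must fit in one byte")
    return {
        "binary": format(tos, "08b"),
        "dscp": tos >> 2,        # top six bits
        "precedence": tos >> 5,  # top three bits; copied to EXP by default
    }


fields = split_tos(184)
print(fields)  # {'binary': '10111000', 'dscp': 46, 'precedence': 5}
```

So ToS 184 is DSCP 46 (EF) and IPP 5, which is the CoS=5 that P2 reports.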

We'll look at how this value can be used to our advantage later.

What comes in on H2?

For those following my blog for a while, you may know about 14 months ago I wrote a giant ACL that matches every possible QoS value.  I still have it on file, and I'll be using it here to see what values come in on H2.

H2#sh ip access-list | i match
    460 permit ip any any dscp ef (2 matches)
    480 permit ip any any dscp cs6 (1 match)

Ok great! We've got two EF matches, and a ... Class Selector 6?
The EF matches are the two pings arriving.  I found this odd right off the bat. I would've expected that if IOS takes the IPP bits and maps them to EXP, it would then take EXP and map it back to IPP on the way out the other PE when the final label is popped. However, it doesn't work that way - instead, it just uses the DSCP that was already in the packet - which, of course, never changed.  An MPLS label was put on top of it, but the underlying packet was left intact.

The class selector 6 packet is a BGP keepalive.  We'll be seeing more of them throughout the post.

It turns out there are terms for the different types of MPLS QoS behavior.  What we observed above would be either "Pipe Mode" or "Short Pipe Mode". Both of these behaviors include using the original ToS bits instead of replacing them based on the EXP bits.  The difference between Pipe Mode and Short Pipe Mode is that Pipe Mode egress queues based on the EXP bits, and Short Pipe Mode egress queues at the PE on the original ToS (DSCP) bits.  This post assumes the audience understands how to write a hierarchical QoS policy, so I'm not going to elaborate or examine the differences between them any further. Any additional mention of "Pipe Mode" assumes either of the above behaviors.  The third option is "Uniform Mode", which is the process of replacing the IP Packet's ToS bits (IPP/DSCP) with something derived from the EXP bits.
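As a rough way to keep the three modes straight, here is a small Python sketch of what the egress PE does in each one. It is a toy model of the description above, not IOS behavior. The function name is hypothetical, and the uniform branch assumes EXP is written back as a class selector (EXP times 8), which is only one possible derivation.

```python
def egress_pe(mode: str, original_dscp: int, exp: int) -> tuple:
    """Return (DSCP delivered to the customer, field the egress queue looks at)."""
    if mode == "pipe":
        # Customer marking untouched; queue on the provider's EXP value.
        return original_dscp, f"EXP {exp}"
    if mode == "short-pipe":
        # Customer marking untouched; queue on the customer's own DSCP.
        return original_dscp, f"DSCP {original_dscp}"
    if mode == "uniform":
        # Assumption: EXP is mapped back as a class selector (EXP * 8).
        new_dscp = exp << 3
        return new_dscp, f"DSCP {new_dscp}"
    raise ValueError(f"unknown mode: {mode}")


for mode in ("pipe", "short-pipe", "uniform"):
    print(mode, egress_pe(mode, original_dscp=46, exp=5))
```

With EF traffic carried as EXP 5, only the uniform case hands the customer something other than the original DSCP 46.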

We just saw Pipe Mode in action above, let's look at how to implement Uniform Mode.

First we need to take a quick look at QoS groups.

There's a particular challenge with ingress and egress marking on a PE. On ingress, you can't set an IPP or DSCP value because the MPLS header is still on the frame.  On the egress interface, you can't match on the EXP bits to set IPP or DSCP bits, because the MPLS label is already popped.  So how do you match on an EXP value and set a DSCP value?  Enter QoS groups.

PE2:

class-map match-all EXP5
 match mpls experimental topmost 5

policy-map uniform-ingress
 class EXP5
  set qos-group 5
 class class-default
  set qos-group 0

interface fa0/0 ! MPLS side
 service-policy input uniform-ingress

This config will match an EXP value of five in the topmost MPLS label - which, in our case, on the PE, is the only MPLS label thanks to Penultimate Hop Pop.  We'll assign a local QoS group of "5" (although this could be any number 1-99) if the EXP value is 5.  Anything else will get reset to 0.

class-map match-all GROUP5
 match qos-group 5

policy-map uniform-egress
 class GROUP5
  set ip dscp af41  
 class class-default
  set ip dscp default

interface fa0/1 ! IP/VRF side
  service-policy output uniform-egress

On egress, we'll match on that 5, and set af41.  Why af41?  Because I wanted to show the policy was doing something.

We'll ping from H1 to H2 again.  I'm omitting any non-essential bits from the extended ping for brevity.

H1#ping
Target IP address: 192.168.1.6
Repeat count [5]: 2
Extended commands [n]: y
Type of service [0]: 184
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)

Again, expected failure, this is a deliberate one-way flow.

H2#sh ip access-list | i match
    340 permit ip any any dscp af41 (2 matches)
    640 permit ip any any precedence routine (4 matches)

We see our two af41 hits, and 4 routine.  The routine hits are the IPP 6 packets, which are remarked to zero because they don't match anything else in the policy.
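The two-stage logic (EXP to QoS group on ingress, QoS group to DSCP on egress) can be mimicked in Python. This is just a sketch of the policy above, with made-up function names, to show why EXP 6 traffic ends up as DSCP 0.

```python
AF41 = 34  # decimal DSCP value for AF41 (binary 100010)


def uniform_ingress(exp: int) -> int:
    """Mimic policy-map uniform-ingress: EXP 5 -> qos-group 5, anything else -> 0."""
    return 5 if exp == 5 else 0


def uniform_egress(qos_group: int) -> int:
    """Mimic policy-map uniform-egress: qos-group 5 -> AF41, anything else -> default."""
    return AF41 if qos_group == 5 else 0


def pe2_remark(exp: int) -> int:
    """DSCP leaving PE2 for a packet that arrived with the given EXP."""
    return uniform_egress(uniform_ingress(exp))


print(pe2_remark(5))  # 34 (af41): the EF pings
print(pe2_remark(6))  # 0 (routine): the IPP 6 packets fall into class-default
```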

Now obviously this is a pretty useless policy, but it was more about showing how the function works.
Here's an adaptation for a more scalable Uniform Mode solution:

policy-map uniform-ingress
 class class-default
  set qos-group mpls experimental topmost

interface fa0/0
 service-policy input uniform-ingress

policy-map uniform-egress
 class class-default
  set precedence qos-group

interface Fa0/1
  service-policy output uniform-egress

Let's see what the outcome is.

H1#ping
Target IP address: 192.168.1.6
Repeat count [5]: 2
Extended commands [n]: y
Type of service [0]: 184
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)

H2#clear ip access-list count

H2#sh ip access-list | i match
    460 permit ip any any dscp ef (2 matches)

I was rather surprised the first time I saw this output.  We're setting a precedence value but getting back a DSCP value. I expected to see a precedence/class-selector value. The original bits were 101110 (DSCP 46, or EF), and I expected to replace them with 101000, which would be class selector 5.  This brings up an important difference in IOS's handling of class-selector vs precedence. I'd always treated them the same, but it turns out IOS is more literal - Precedence sets only the precedence bits.  So we rewrote the first three bits with 101, which ... were already set to 101.  So we ended up with 101110 (DSCP 46/EF) again.
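Here is the same observation as bit math in Python. It is a sketch of the behavior described above, with hypothetical function names: rewriting precedence touches only the top three bits, while a class selector forces the low three bits to zero.

```python
def set_precedence(dscp: int, precedence: int) -> int:
    """Overwrite only the top three bits of a 6-bit DSCP value."""
    return (precedence << 3) | (dscp & 0b000111)


def class_selector(precedence: int) -> int:
    """DSCP value of CSn: precedence bits followed by three zero bits."""
    return precedence << 3


print(format(set_precedence(0b101110, 5), "06b"))  # 101110 -> still DSCP 46 (EF)
print(format(class_selector(5), "06b"))            # 101000 -> DSCP 40 (CS5)
```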

We could do something like this:

policy-map uniform-egress
 class class-default
  set dscp qos-group 

But then we'd get literal DSCP values: if the QoS Group is 5, it would set DSCP 5.  Not DSCP CS5 (101000), but actual binary 5 - (000101).  To accomplish EF -> EXP 5 -> CS5, we'd have to use either a lengthy QoS-Group -> DSCP class-map/policy-map setup, or we could use a table map!

table-map TABMAP
 map from 1 to 8     ! Group 1 to DSCP CS1
 map from 2 to 16   ! Group 2 to DSCP CS2
 map from 3 to 24   ! ...
 map from 4 to 32
 map from 5 to 40
 map from 6 to 48
 map from 7 to 56   ! Group 7 to DSCP CS7

policy-map uniform-egress
 class class-default
  set dscp qos-group table TABMAP

H1#ping
Target IP address: 192.168.1.6
Repeat count [5]: 2
Extended commands [n]: y
Type of service [0]: 184
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)

H2#sh ip access-list | i match
    400 permit ip any any dscp cs5 (2 matches)
    480 permit ip any any dscp cs6 (18 matches)

I think the table map use is pretty obvious - take a qos group and map it to some other integer, which has some meaning when applied to a DSCP or IPP field.  Now we have the CS5 output we were looking for.
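For comparison, here is a Python sketch of the literal "set dscp qos-group" versus the table-map version. The dictionary mirrors TABMAP above. How a value with no entry is treated is an assumption in this sketch (it is copied through unchanged), so check the table-map default on your own platform.

```python
# Same entries as TABMAP: group n -> DSCP CSn (n * 8)
TABMAP = {group: group * 8 for group in range(1, 8)}


def set_dscp_literal(qos_group: int) -> int:
    """'set dscp qos-group': the group number is used as the DSCP value itself."""
    return qos_group


def set_dscp_table(qos_group: int, table: dict) -> int:
    """'set dscp qos-group table TABMAP': look the group up first.

    Assumption: a group with no entry is copied through unchanged.
    """
    return table.get(qos_group, qos_group)


print(format(set_dscp_literal(5), "06b"))        # 000101 -> DSCP 5
print(format(set_dscp_table(5, TABMAP), "06b"))  # 101000 -> DSCP 40 (CS5)
```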

Now clearly, MPLS/EXP QoS needs to be able to be modified on more than just the egress PE.  Let's take a look at the other spots we can match and adapt behavior to it.

So far we've been doing matches on the "topmost" label, so what other options have we got?  Keeping this oriented towards the R&S CCIE, I'm not going to look at anything other than a 2-tag (VRF + MPLS PE) system. When traffic is received in from the host towards the PE, the PE is going to impose a label for the VRF. It will then add the MPLS transit label on top of that, for reaching the other PE. So to reiterate, we go from zero labels to two labels on the PE.

We can set both those labels, and it's really not hard, but you have to pay attention to what label is being manipulated on which interface. IOS is picky about the order of operations in this case.

For ingress on a PE, we can only set imposition. We clearly can't set "topmost" because there are no labels on the packet yet:

PE1:
policy-map impose1
 class class-default
  set mpls experimental imposition 4

int fa0/0
 service-policy input impose1

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)

H2#sh ip access-list | i match
    320 permit ip any any dscp cs4 (1 match)
    480 permit ip any any dscp cs6 (2 matches)

And what if we set EF manually on H1?

H1#ping
Target IP address: 192.168.1.6
Repeat count [5]: 2
Extended commands [n]: y
Type of service [0]: 184
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)

H2#sh ip access-list | i match
    320 permit ip any any dscp cs4 (3 matches)
    480 permit ip any any dscp cs6 (2 matches)

Still CS4. We're remarking the EXP bits on the inner label to 4 on PE1, that value is carried down to PE2, and then the qos-group-based policy remarks the DSCP to CS4.

What about the outer label?

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)

We'd need to look at the results on P2, because PE2 never gets the outer label - the PHP process removes it before forwarding the frame.

P2#
*Mar  2 01:44:59.409: MPLS: Fa0/0: recvd: CoS=4, TTL=253, Label(s)=16/19
*Mar  2 01:44:59.409: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19

So P2 receives the outer label as 4, and the inner label as 4.  We see 4 coming in on Fa0/0 on label 16, and going out on label 19 on Fa0/1, showing both the PHP process and the fact that both EXP values are the same.  That's because the default behavior of a PE is to copy the inner label's EXP bits to the outer label.  But what if we wanted to set the outer label to something different?

There are two places we could do that: egress on the PE, or ingress on the P routers.

Let's try the PE first.

PE1:
policy-map topmost1
 class class-default
  set mpls experimental topmost 2

interface FastEthernet0/1
 service-policy output topmost1

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)

P2#
*Mar  2 01:53:51.609: MPLS: Fa0/0: recvd: CoS=2, TTL=253, Label(s)=16/19
*Mar  2 01:53:51.609: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19

Now we see EXP 2 on the topmost and EXP 4 on the inner.
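To keep track of which label carries which EXP, here is a toy Python model of the label stack. It is only a sketch. The label values 16 and 19 are taken from the debug output, and the function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Label:
    value: int
    exp: int


def impose(exp: int, transport: int = 16, vpn: int = 19) -> list:
    """PE ingress: both imposed labels get the same EXP. Index 0 is the outer label."""
    return [Label(transport, exp), Label(vpn, exp)]


def set_topmost(stack: list, exp: int) -> list:
    """PE egress: rewrite EXP on the outer label only."""
    return [Label(stack[0].value, exp)] + [Label(l.value, l.exp) for l in stack[1:]]


def php(stack: list) -> list:
    """Penultimate hop pop: drop the outer label, inner label untouched."""
    return stack[1:]


received = set_topmost(impose(4), 2)  # what P2 receives from upstream
sent = php(received)                  # what P2 sends on to PE2
print(received[0].exp, sent[0].exp)   # 2 4
```

That matches the debug: CoS=2 received on label 16, CoS=4 transmitted on label 19.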

It's of some interest that if we wanted the final PE (PE2) to see that value of 2, we'd want to disable PHP.  PHP is disabled on the PE itself, not on the router upstream from it.  This is done by having the PE advertise an explicit null label for the prefixes terminating on it:

PE2:

mpls ldp explicit-null

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)

P2#
*Mar  2 01:57:56.889: MPLS: Fa0/0: recvd: CoS=2, TTL=253, Label(s)=16/19
*Mar  2 01:57:56.889: MPLS: Fa0/1: xmit: CoS=2, TTL=252, Label(s)=0/19

H2#sh ip access-list | i match
    160 permit ip any any dscp cs2 (1 match)

We see that P2 forwarded both labels, one of which was the explicit null/0 label (reference 0/19).  The PE has to pop both labels before forwarding.  Consequently, we also see that the PE now marked CS2 based on the EXP2 in the topmost label.

Now let's see about manipulating the topmost label on a P device.
For clarity's sake on P2, I am disabling the explicit null (re-enabling PHP) on PE2:

PE2(config)#no mpls ldp explicit-null

P1:

policy-map set-topmost
 class class-default
  set mpls experimental topmost 7

interface FastEthernet0/1
 service-policy output set-topmost

Before I show the output of this, it's important to note that setting the topmost EXP on egress is the only option I could find that worked on the P routers.  The P routers aren't imposing any labels (just swapping, which is different), so imposition doesn't work, and setting topmost on ingress doesn't appear to do anything (although I am not sure why).  And now for the outcome:

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)

P2#
*Mar  2 02:22:25.641: MPLS: Fa0/0: recvd: CoS=7, TTL=253, Label(s)=16/19
*Mar  2 02:22:25.645: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19

As anticipated, EXP 7 on the outer label only.

It's also important to note how P routers treat the EXP bits.  By default, unless you manually change it with the techniques demonstrated in this post, the P router, as it swaps labels hop-by-hop, will always copy the EXP of the old outer label to the new outer label unmodified.

And now for our final topic - policing based on EXP.

P1:
class-map match-all EXP5
 match mpls experimental topmost 5

policy-map POLICER
 class EXP5
   police cir 32000
     conform-action transmit
     exceed-action set-mpls-exp-topmost-transmit 1

interface FastEthernet0/1
 service-policy output POLICER

H1#ping
Protocol [ip]:
Target IP address: 192.168.1.6
Repeat count [5]: 500
Datagram size [100]: 1000
Extended commands [n]: y
Type of service [0]: 184
Sending 500, 1000-byte ICMP Echos to 192.168.1.6, timeout is 0 seconds:
......................................................................
<output omitted>
..........
Success rate is 0 percent (0/500)

This one is tricky to validate - we want to see some MPLS packets leave P1 as EXP 5, and some leave as EXP 1.  Unfortunately my ACL doesn't work here (without turning PHP back off) because we're playing with the outer label and not the inner label, and the Uniform Mode config on PE2 won't take heed of the outer label, because P2 pops it before the packet ever reaches PE2.

Instead, we're just going to look at a sampling of "debug mpls packet" on P2:

*Mar  2 02:48:26.057: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19
*Mar  2 02:48:26.057: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
*Mar  2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19
*Mar  2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
*Mar  2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19
*Mar  2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
*Mar  2 02:48:26.101: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19

Let's decipher this a bit:

Remember, P2 is performing PHP for PE2, so what we see coming in and what we see going out will be different.  P1 is only making modifications to the topmost label.

*Mar  2 02:48:26.057: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19

We got an MPLS packet in as EXP 5.

*Mar  2 02:48:26.057: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19

We popped the upper label and sent the inner label on as EXP 5 as well.

*Mar  2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19

By this point, we've already gotten the policer to kick in, so we receive EXP 1.

*Mar  2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19

And we transmit EXP 5 based on the inner label, which was set on PE1 because of the IPP -> EXP ToS Reflection.  The policer on P1 did not modify this value.
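If you want a feel for why the first packet conforms and the following ones get remarked, here is a minimal token bucket sketch in Python. It is not how IOS implements the policer. The 1500-byte burst and the arrival times are assumptions picked for illustration, and label and Layer 2 overhead are ignored.

```python
def police(packets, cir_bps=32000, burst_bytes=1500):
    """Single-rate, two-color policer sketch.

    packets: list of (arrival_time_seconds, size_bytes)
    Returns the outer-label EXP each packet leaves with:
    5 if it conforms (transmit unchanged), 1 if it exceeds (remarked).
    """
    tokens = float(burst_bytes)  # bucket starts full
    last = None
    result = []
    for arrival, size in packets:
        if last is not None:
            # Refill at CIR (converted to bytes per second), capped at the burst size.
            tokens = min(burst_bytes, tokens + (arrival - last) * cir_bps / 8)
        last = arrival
        if size <= tokens:
            tokens -= size
            result.append(5)
        else:
            result.append(1)
    return result


# Illustrative arrival times only: 1000-byte packets sent back to back
print(police([(0.0, 1000), (0.04, 1000), (0.04, 1000), (0.044, 1000)]))  # [5, 1, 1, 1]
```

At 32 kbps the bucket refills at only 4000 bytes per second, so a burst of 1000-byte pings drains it after the first packet.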

That's MPLS QoS/QoS Groups in a nutshell.  Hope you enjoyed!

Jeff
