Previous

Content

Next 


3.0.- New Congestion Avoidance algorithm

 

In april 1990, through an e-mail, Van Jacobson refined the congestion avoidance algorithm. Next is a summary of its document:

Congestion avoidance algorithm is intended to do two things:
  1. Prevent the pipe from going empty after a loss (if the pipe doesn't go empty, you won't have to waste round-trip times refilling it).
     
  2. Correctly account for the amount of data actually in the pipe (since that's what congestion avoidance is supposed to be estimating and adapting to).
For (1), remember that a packet loss is a signal that the pipe is congested and a packet loss can be detected of one of two different ways:
  1. via a retransmit timeout.
     
  2. when some small number (3-4) of consecutive duplicate acks has been received (also called duplicate ack threshold).
In case a.-, the pipe is guarantee to be empty, so we want slow-start.
In case b.-, if the duplicate ack threshold is small compared to the bandwidth-delay product (bdp), we have detected the loss with the pipe almost full, i.e., given a dup-ack threshold of 3 packets (3 dup-ack infering a packet loss) and a bdp of 16 packets (24KB assuming 1500 bytes MTU), the pipe is 75% full when loss is detected. As soon as the bdp is higher, the pipe will be almost full when loss is detected (maintaining the same dup-ack threshold of 3 packets). Since the pipe is (almost) full, there's no need to slow-start after the fast-retransmit.
For (2), a duplicate ack means that the receiver got an out-of-order packet, been the usual case of this a missing packet. i.e., if there are W packets in transit (W=window size) and one is dropped, the receiver will get W-1 out-of-order packets and will generate W-1 duplicate acks. If the dup-ack threshold is set high enough (i.e. 3 packets), we can reasonably assume that reaching the dup-ack threshold means a dropped packet.
But there's more information in the ack: receiver can only generate an ack in response to a packet arrival, i.e., a duplicate ack means that a packet has left the network; if the sender is limitted by the congestion window, a packet can now be sent (the congestion window is a count of how many packets will fit in the pipe; the ack says a packet has left the pipe so a new one can be added to take its place). So, conceptually, the new algorithm will be:
  • When the dupack threshold is reached, set ssthresh to ˝ the current congestion window cwnd and retransmit the missing packet.
     
  • Set now cwnd to ssthresh (original *Tahoe* specification used to set cwnd to one segment to force a slow-start).
     
  • Change now the sender's output routine to use an effective window of:
     
      min(snd_wnd, cwnd + dupacks * mss)
Van Jacobson tried to explain congestion avoidance algorithm behavior using what is called some algebra in these terms:
When packet U drop is detected we have:
  • U = first un-acked packet.
     
  • W = window size (in packets).
     
  • Then packets U..U+W are in transit.
     
  • When loss is detected we send packet U and pull the window back to ˝ W.
     
  • In the round-trip time it takes the U retransmit to fill the receiver's hole and an ack to get back, W-1 dupacks will arrive (one for each packet in transit).
 

The window is effectively inflated by one packet for each of these acks, so packets [U..U+W/2+W-1] are sent.
But, we don't re-send packets when we know they've been lost, so the amount actually sent between the loss detection and the recovery ack is U+W/2+W-1 - (U+W) = W/2-1, which is exactly the amount congestion avoidance allows us to send (if we add in the retransmit of U). The recovery ack is for packet U+W so when the effective window is pulled back from W/2+W-1 to W/2 (when happens because the recovery ack is "new" and sets dupack to zero), we are allowed to send up to packet U+W+W/2, which is exactly the first packet we haven't yet sent (i.e., there is no sudden burst of packets as the "hole" is filled).
Also, when sending packets between the loss detection and the recovery ack, we do nothing for the first W/2 dup acks (because they only allow us to send packets we've already sent) and the bottleneck gateway is given W/2 packet times to clean out its backlog. Thus, when we start sending our W/2-1 new packets, the bottleneck queue is as empty as it can be.
The explanation is not easy to understand and how himself says: "I don't know if you can get the flavor of what happens from the description - it's hard to see without a picture. But I was delighted by how beautiful it worked…"
I prepared a table for trying to explain what Van Jacobson was talking about; let's comment this table (LB):

Cwnd in the first column is the congestion window. Initially it is set to 1. Outs is the number of outstanding packets traveling through the network. Initially they are 0. Allows is the number of packets that can be sent by TCP. Because the initial congestion window is 1 and there are no packets in the network, we can send one packet. The first packet sent is packet number 0; column Sent tells us the number of the packets that are sent. Initially we send the packet number 0.
Column #-sent tells us how many packets are sent for this time. Because only packet number 0 is sent we write here 1 packet. Acked is the number of the packets acknowledged by the receiver. The packet number 0 reaches successfully the receiver end and it is acknowledged as well. #-acked indicates the number of packets acknowledged; this case 1 packet has been acknowledged.
Because one packet is successfully acknowledged the congestion window is increased by the number of packets being acknowledged; the new congestion window is 2 and it is represented in the table below the second column titled Cwnd.
Ssthresh is the slow start threshold; initially it is set to 64 packets. The last column, Algo, represents the TCP algorithm being used this time. For this moment TCP is using the Slow Start Algorithm represented by the acronym SS.
The second line starts with a congestion window of 2 packets; there are no packets in the network, then, we can send two new packets. Packets 1 and 2 are sent. They reach again successfully the other end and are acknowledged back. The congestion window is increased by the 2 packets recently acknowledged and the new congestion windows is 4 packets. Ssthresh and Algo remain the same.
This process is repeated for the line number 3 but in line number 4 we begin to have problems. For this line the congestion window is 8 packets and we don't have packets traveling through the network; then we can send 8 packets and we do this. Packets 7, 8, 9, .... 14 are sent (8 packets); unfortunately packet 14 is lost in the path; some router have a lunch with it. Only packets from number 7 to 13 can be acknowledged (7 packets); the new congestion window is now 15 packets (8+7).
Now we are in line 5. Congestion window is 15 packets but there is one packet that has not been acknowledged yet (packet number 14); then, for TCP, there is one packet traveling through the network. TCP really doesn't know yet that the packet is being digested by one bad router at this moment, and it supposes that it is somewhere in the network having a great time. Because the congestion window is 15 packets and there is one packet outstanding, then we can send 14 packets and we exactly do this. Packets 15, 16, 17, ... 28 (14 packets) are sent.
When the packet number 15 reaches the receiver end, this, having received a new packet will have to send one acknowledgement. But, my god, tells it, packet 14 has not been received yet, then it has to acknowledge only its last packet successfully received, and this is packet number 13 (bad lucky number, isn't it?). For acknowledging packet number 13 for the second time it answers to the sender asking again for the packet number 14. Okay, packet 13 is again acknowledged (second time), but it is its first duplicate acknowledgment; we represent this fact under the column Acked indicating that 0 (new) packets have been acknowledged and that this is the first duplicate acknowledgement of the packet number 13 (dp=1).
Because no new packets are acknowledged the congestion window cannot be increased; it maintains its previous value of 15 packets. Ssthresh and Algo do not change either. Now we are in the line number 6. Congestion window is 15 packets but, for TCP, 15 packets are yet traveling somewhere in the network because no new packets have been acknowledged. Really the packet number 15 was already received and the pipe (network) has now 14 packets outstanding (packet 14 that is lost and packets 16 to 28). Observe that the pipe is being emptied in this circumstance.
Following with line number 6, the congestion window is 15 packets but 15 packets are yet outstanding then we can't send any new packet. But, then, packet 16 is received and the receiver end sends again a new acknowledgement for packet number 13; this is the second duplicate acknowledgement for this packet. Congestion window must remain in 15 packets and Ssthresh and Algo maintain its initial values. Now we are in line number 7; the same situation is repeated here again, but beware. Something different has occurred this time.
When packet 17 reaches the receiver end, this sends back the third duplicate acknowledgement for packet number 13. When this acknowledgement is received by the sender side, it indicates to TCP that something bad is going on. TCP takes this third duplicate as an indication that packet number 14 is lost. Immediately, TCP reduces the congestion window to half the actual value (integer value of 15 divided by 2, which is just 7). But it also inflates the congestion window for the number of packets recently acknowledged. Now it knows that the three duplicate acknowledgement correspond to three packets that have been received by the other side (packets 15, 16, and 17 for our case). Then, it inflates the value of 7 packets of the congestion window with 3 new packets, because if 3 acknowledgement have been received it is because 3 packets have left the pipe. The new congestion window is now 10 packets.
But TCP also enters in a new algorithm. Now, Fast Retransmit (FR) is activated and the lost packet is immediately retransmitted. Ssthresh is reduced to 7 packets. In line number 8, packet number 14 is sent again. Packet 18 reaches now the receiver end, and  the fourth duplicate acknowledgement is sent back. When the new algorithm (FR) is running, the congestion window will be incremented by one packet for each duplicate acknowledgement received. The new congestion window is now 11 packets. The pipe continues being emptied; only packets number 19 to 28 + the recently retransmitted packet number 14 are now running (11 packets).
In lines number 9 to13 the previous process is repeated. Packets 19, 20, 21, 22, and 23 leave the pipe. The pipe continues being emptied and now it has only 6 packets (packets number 24 to 28 + packet number 14). Duplicate acknowledgement 5, 6, 7, 8, and 9 for packet number 13 are sent back (retransmitted packet number 14 has not been received yet, of course, because it was sent just after packet number 28 was sent), and the congestion window is increased by 1 packet each time, until reaching 16 packets.
In line number 14, the congestion window is 16 packets, but, for TCP, 15 packets are yet in the pipe because it only has received the acknowledgement for packet number 13. Then, it has to suppose that packets 15, 16, 17, ..., 28, and 14 (15 packets) are somewhere enjoying their trip. But now, because congestion window is 16 packets and 15 packets are outstanding, we can send a new packet to try to refill the pipe. Packet number 29 is sent. The pipe has now 7 packets (packets number 24 to 28 + packet number 14 + packet number 29).  

When packet number 24 reaches the receiver end, the duplicate acknowledgement number 10 for packet number 13 is sent back and the congestion window is increased to 17 packets. Now we are at the beginning of the line number 15. For this line and for lines number 16, 17, and 18, the process is very similar. Congestion window is always one packet more than packets TCP thinks are going in the network. Packets 25, 26, 27, and 28 leave the pipe generating duplicate aknowledgement numbers 11, 12, 13 and 14 for packet number 13. But packets 30, 31, 32, and 33 enters the pipe at the same time to replace leaving packets. The pipe continues having 7 packets (packets number 14, 29, 20, 31, 32, and 33). It is very important because TCP doesn't allow the pipe from having less than 7 packets.
At line number 19, packet number 34 is sent but something very interesting occurs. Packet number 14 reaches the receiver end breaking the bottleneck we have (this means, packet number 13 being repeatedly acknowledged). When packet 14 is acknowledged, all packets from packet number 15 to packet number 28 are acknowledged simultaneously too, because retransmitted packet number 14 was sent exactly behind packet number 28; after having acknowledged all these packets, TCP sender adjusts the congestion window to the value of the slow start threshold (7 packets) and enters in congestion avoidance algorithm (CA). This is the moment that makes Van Jacobson to say: "But I was delighted by how beautiful it worked…". The congestion window is set to 7 packets that is exactly one packet more that the number of packets we have outstanding in the pipe (packets number 29, 30, 31, 32, 33, and 34). Then, we can send another packet (packet number 35) and, as Van Jacobson said, "there is no sudden burst as the hole is filled...".
From here, TCP applies congestion avoidance algorithm (CA). For each roundtrip time, the congestion window is increased by one packet. When packets number 29 to number 34 are acknowledged, packets number 36 to number 41 are sent and the congestion window is set to 8 packets. When packets number 35 to number 41 are acknowledged, 8 new packets can be sent and packets number 42 to number 49 are sent. Congestion avoidance is the TCP's steady state algorithm that represent TCP behavior by the TCP response function studied by Sally Floyd, Matthew Mathis, Jeffrey Semke and Jamshid Mahdavi. Have a look here for more information.
Note: these algorithms were included in what was called Reno TCP implementation.
LB.

 

 


Previous

Content

Next