Introducing Mysocket.io

In this blog, I’d like to introduce a new project I’m calling Mysocket.io. Before we dive in, a bit of background.

Loyal readers know I enjoy building global infrastructure services that need to be able to carry a significant amount of traffic and a large number of requests. Building services like these requires solving several challenges: high availability, scaling, DDoS proofing, monitoring, logging, testing, deployments, user-facing and backend APIs, policy management (user preferences) and distribution, life-cycling, and so on. All this while keeping an eye on cost and keeping complexity to a minimum (complexity really being a human, operational cost).

To experiment with these topics, it’s necessary to have a project to anchor the experiments to; something I can continuously work on and, while doing so, improve as a whole. I’ve started many projects over time, but the one I’ve worked on most recently, and wanted to share with a wider audience, is mysocket.io: a service that provides secure public endpoints for services that are otherwise not publicly reachable.


A typical use case that mysocket.io can help with is a web service running on your laptop, which you’d like to make available to a team member or client. Another is SSH access to servers behind NAT or a firewall, like a Raspberry Pi on your home network or EC2 instances behind NAT.

Make your localhost app available from anywhere
Provide SSH access to your home server behind NAT.

More details

Alright, a good way to share more details is with a quick demo! You can see a brief overview in this video, or even better, try it yourself by simply following the four easy steps below.

If you’re interested or curious, feel free to give it a spin and let me know what worked or didn’t, or even better, how it can be improved. Getting started will take you just one minute. Just follow these simple steps.
#Install client, using python's package manager (pip)
pip3 install mysocketctl

#Create account
mysocketctl account create \
    --name "your_name" \
    --email "your_email_address" \
    --password "a_secure_password" \
    --sshkey "$(cat ~/.ssh/id_rsa.pub)"
    
#login
mysocketctl login  \
    --email "your_email_address" \
    --password "a_secure_password" 
    
 
#Launch your first global socket ;)
mysocketctl connect \
    --port 8000 \
    --name "my test service"

Architecture overview

Ok, so how does it work? The process for requesting a “global socket” starts with an API call. You can do this by directly interfacing with the RESTful API, or by using the mysocketctl tool. This returns a global mysocket object, which has a name, port number(s), and some other information.

Users can then use this socket object to create tunnel objects. These tunnels are used to connect your local service to the global mysocket.io infrastructure. By stitching these two TCP sessions together, we make your local service globally available.

Creating a Socket, a Tunnel and connecting to mysocket.io
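
To make that flow a bit more concrete, below is a rough sketch of the two-step version using mysocketctl. The subcommand names and flags shown here are illustrative assumptions, not verified syntax; check mysocketctl --help and the docs for the exact invocation. The point is the order of operations: create the socket, create a tunnel for it, then connect your local port.

#1. create a socket object (hypothetical syntax; returns a socket ID and DNS name)
mysocketctl socket create --name "my test service"

#2. create a tunnel for that socket (hypothetical syntax; returns a tunnel ID)
mysocketctl tunnel create --socket_id <socket_id>

#3. connect your local service on port 8000 to that tunnel (hypothetical syntax)
mysocketctl tunnel connect --socket_id <socket_id> --tunnel_id <tunnel_id> --port 8000

The one-liner "mysocketctl connect" from the quickstart wraps these steps into a single command.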

The diagram below provides a high-level overview of the service data plane. On the left, we have the origin service. This could be your laptop, your Raspberry Pi at home, or even a set of containers in a Kubernetes cluster. The origin service can be behind a very strict firewall or even NAT. All it needs is outbound network access. We can then set up a secure encrypted tunnel to any of the mysocket.io servers around the world.

Mysocket.io dataplane

Anycast

The Mysocket.io services use AWS Global Accelerator. With this, both the tunnel servers and proxy services are anycasted. This solves some of the load balancing and high availability challenges. The mysocket tunnel and proxy servers are located in North America, Europe, and Asia.

Once the tunnel is established, the connection event is signaled to all other nodes in real-time, ensuring that all edge nodes know where the tunnel for the service is.

Documentation

One of my goals is to make Mysocket super easy to use. One way to do that is to have good documentation. I invite you to check out our readthedocs.io documentation at https://mysocket.readthedocs.io/

It’s divided into two sections:

  1. General information about mysocket.io and some of the concepts.
  2. Information and user guides for the mysocketctl command-line tool.

The documentation and the mysocketctl tool are both open source, so feel free to open pull requests or issues if you have any questions.

You may have noticed there’s a website as well. I wanted to create a quick landing page, so I decided to play with Wix.com. They make it super easy; I may have gone overboard a bit ;) All of that was clicked together in just one evening, pretty neat.

More to come

There’s a lot more to tell and plenty more geeky details to dive into. More importantly, we can continue to build on this and make it even better (ping me if you have ideas or suggestions)!
So stay tuned. That’s the plan for subsequent Blog posts soon, either in this blog or the mysocket.io blog.

Cheers,
-Andree

AWS and their Billions in IPv4 addresses

Earlier this week, I was doing some work on AWS and wanted to know what IP addresses were being used. Luckily for me, AWS publishes this all here: https://ip-ranges.amazonaws.com/ip-ranges.json. When you go through this list, you’ll quickly see that AWS holds a massive amount of IPv4 space. Even a quick count shows a lot of big prefixes.
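
If you want to poke at that JSON yourself, curl and jq will get you a long way; the file is a list of prefix objects with an ip_prefix field (plus region and service):

#count the unique IPv4 prefixes AWS publishes today
curl -s https://ip-ranges.amazonaws.com/ip-ranges.json \
  | jq -r '.prefixes[].ip_prefix' | sort -u | wc -l

#or eyeball just the really big allocations (/8 through /11)
curl -s https://ip-ranges.amazonaws.com/ip-ranges.json \
  | jq -r '.prefixes[].ip_prefix' | sort -u | grep -E '/(8|9|10|11)$'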

However, the IPv4 ranges on that list are just the ranges that are in use and allocated today by AWS. Time to dig a bit deeper.

IPv4 address acquisitions by AWS

Over the years, AWS has acquired a lot of IPv4 address space. Most of this happens without gaining too much attention, but there were a few notable acquisitions that I’ll quickly summarize below.

2017: MIT selling 8 million IPv4 addresses to AWS

In 2017 MIT sold half of its 18.0.0.0/8 allocation to AWS. This 18.128.0.0/9 range holds about 8 million IPv4 addresses.

2018: GE sells 3.0.0.0/8 to AWS

In 2018 the IPv4 prefix 3.0.0.0/8 was transferred from GE to AWS. With this, AWS became the proud owner of its first /8! That’s sixteen million new IPv4 addresses to feed us hungry AWS customers. https://news.ycombinator.com/item?id=18407173

2019: AWS buys AMPRnet 44.192.0.0/10

In 2019 AWS bought a /10 from AMPR.org, the Amateur Radio Digital Communications (ARDC). The IPv4 range 44.0.0.0/8 was an allocation made to the Amateur Radio organization in 1981 and is known as the AMPRNet. This sale caused a fair bit of discussion; check out the NANOG discussion here.

Just this month, it became public knowledge that AWS paid $108 million for this /10. That’s $25.74 per IP address.

These are just a few examples. Obviously, AWS has way more IP addresses than the three examples I listed here. The IPv4 transfer market is very active. Check out this website to get a sense of all transfers: https://account.arin.net/public/transfer-log

All AWS IPv4 addresses

Armed with the information above, it was clear that not all of the AWS-owned ranges were in the JSON that AWS publishes. For example, parts of the 3.0.0.0/8 range are missing, likely because some of it is reserved for future use.

I did a bit of digging and tried to figure out how many IPv4 addresses AWS really owns. A good start is the JSON that AWS publishes. I then combined that with all the ARIN, APNIC, and RIPE entries for Amazon I could find. A few examples include:

https://rdap.arin.net/registry/entity/AMAZON-4
https://rdap.arin.net/registry/entity/AMAZO-4
https://rdap.arin.net/registry/entity/AT-88-Z

Combining all those IPv4 prefixes, then removing duplicates and overlaps by aggregating them, results in the following list of unique IPv4 prefixes owned by AWS: https://gist.github.com/atoonk/b749305012ae5b86bacba9b01160df9f#all-prefixes

The total number of IPv4 addresses in that list is just over 100 Million (100,750,168). That’s the equivalent of just over six /8’s, not bad!

If we break this down by allocation size, we see the following:

1x /8     => 16,777,216 IPv4 addresses
1x /9     => 8,388,608 IPv4 addresses
4x /10    => 16,777,216 IPv4 addresses
5x /11    => 10,485,760 IPv4 addresses
11x /12   => 11,534,336 IPv4 addresses
13x /13   => 6,815,744 IPv4 addresses
34x /14   => 8,912,896 IPv4 addresses
53x /15   => 6,946,816 IPv4 addresses
182x /16  => 11,927,552 IPv4 addresses
<and more>

A complete breakdown can be found here: https://gist.github.com/atoonk/b749305012ae5b86bacba9b01160df9f#breakdown-by-ipv4-prefix-size
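
If you want to reproduce a breakdown like this yourself, a few lines of shell will do once you have the aggregated prefixes in a file, say prefixes.txt, with one CIDR per line (the file name is just an example):

#count how many prefixes there are of each size
cut -d/ -f2 prefixes.txt | sort -n | uniq -c

#total number of IPv4 addresses covered by the list
awk -F/ '{ total += 2^(32-$2) } END { printf "%.0f\n", total }' prefixes.txt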

Putting a valuation on AWS’ IPv4 assets

Alright.. this is just for fun…

Since AWS is (one of) the largest buyers of IPv4 addresses, they have spent a significant amount on stacking up their IPv4 resources. It’s impossible, as an outsider, to know how much AWS paid for each deal. However, we can, for fun, try to put a dollar number on AWS’ current IPv4 assets.

The average price for IPv4 addresses has gone up over the years, from ~$10 per IP a few years back to ~$25 per IP nowadays.
Note that these are market prices, so if AWS were to suddenly decide to sell its IPv4 addresses and overwhelm the market with supply, prices would drop. But that won’t happen since we’re all still addicted to IPv4 ;)

Anyway, let’s stick with $25 and do the math just for fun.

100,750,168 ipv4 addresses x $25 per IP = $2,518,754,200

Just over $2.5 billion worth of IPv4 addresses, not bad!

Peeking into the future

It’s clear AWS is working hard behind the scenes to make sure we can all continue to build more on AWS. One final question we could look at is: how much buffer does AWS have? ie. how healthy is their IPv4 reserve?

According to their published data, they have allocated roughly 53 million IPv4 addresses to existing AWS services. We found that all their IPv4 addresses combined equate to approximately 100 million. That means they still have ~47 million IPv4 addresses, or 47%, available for future allocations. That’s pretty healthy! And on top of that, I’m sure they’ll continue to source more IPv4 addresses. The IPv4 market is still hot!

100G networking in AWS, a network performance deep dive

Loyal readers of my blog will have noticed a theme, I’m interested in the continued move to virtualized network functions, and the need for faster networking options on cloud compute. In this blog, we’ll look at the network performance on the juggernaut of cloud computing, AWS.

AWS is the leader in the cloud computing world, and many companies now run parts of their services on AWS. The question we’ll try to answer in this article is: how well suited is AWS’ EC2 for high-throughput network functions?

I’ve decided to experiment with adding a short demo video to this blog. Below you will find a quick demo and summary of this article. Since these videos are new and a bit of an experiment, let me know if you like it.

100G networking

It’s already been two years since AWS announced the C5n instances, featuring 100 Gbps networking. I’m not aware of any other cloud provider offering 100G instances, so this is pretty unique. Ever since this was released I wondered exactly what, if any, the constraints were. Can I send/receive 100g line rate (144Mpps)? So, before we dig into the details, let’s just check if we can really get to 100Gbs.

100gbs testing.. We’re gonna need a bigger boat..

There you have it, I was able to get to 100Gbps between two instances! That’s exciting. But there are a few caveats. We’ll dig into all of them in this article, with the aim to understand exactly what’s possible, what the various limits are, and how to get to 100G.

Understand the limits

Network performance on Linux is typically a function of a few parameters: most notably, the number of TX/RX queues available on the NIC (network card), the number of CPU cores (ideally at least equal to the number of queues), the pps (packets per second) limit per queue, and finally, in virtual environments like AWS and GCP, potential admin limits on the instance.


Doing networking in software means that processing a packet (or a batch of them) uses a number of CPU cycles. It’s typically not relevant how many bytes are in a packet. As a result, the best metric to look at is the pps number (related to our CPU cycle budget). Unfortunately, the pps performance numbers for AWS aren’t published, so we’ll have to measure them in this blog. With that, we should have a much better understanding of the network possibilities on AWS, and hopefully, this saves someone else a lot of time (this took me several days of measuring) ;)

Network queues per instance type

The table below shows the number of NIC queues by ec2 (c5n) Instance type.

Number of ENA queues per c5n instance type

In the world of ec2, 16 vCPUs on the C5n 4xl instance means 1 Socket, 8 Cores per socket, 2 Threads per core.

On AWS, an Elastic Network Adapter (ENA) NIC has as many queues as you have vCPUs, though it caps at 32 queues, as you can see with the C5n 9xl and C5n 18xl instances.

Like many things in computing, to make things faster, work is parallelized. We see this clearly when looking at CPU capacity: we’re adding more cores, and programs are written in such a way that they can leverage the many cores in parallel (multi-threaded programs).

Scaling networking performance on our servers is done in largely the same way. It’s hard to make a single worker significantly faster, but it is easy to add more ‘workers’, especially if the performance is bound by our CPU capacity. In the world of NICs, these ‘workers’ are queues. Traffic sent and received by a host is load-balanced over the available network queues on the NIC. This load balancing is done by hashing, typically over the 5-tuple: protocol, source and destination address, and source and destination port. Something you’re likely familiar with from ECMP.

So queues on a NIC are like lanes on a highway: the more lanes, the more cars can travel the highway. The more queues, the more packets (flows) can be processed.
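
On a Linux instance you can see how many of those lanes (queues) you have, and how flows are spread over them, with ethtool. The interface name ens5 below is just an example; it’s typically what an ENA NIC shows up as, but yours may differ.

#number of combined RX/TX queues (channels) the NIC exposes
ethtool -l ens5

#the RSS indirection table and hash key used to spread flows over queues
ethtool -x ens5

#per-queue packet counters, handy for checking whether traffic is balanced
ethtool -S ens5 | grep -i queue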

Test one, ENA queue performance

As discussed, the network performance of an instance is a function of the number of available queues and CPUs. So let’s start with measuring the maximum performance of a single flow (queue) and then scale up and measure the pps performance.

In this measurement, I used two c5n.18xlarge ec2 instances in the same subnet and the same placement group. The sender is using DPDK-pktgen (igb_uio). The receiver is a stock Ubuntu 20.04 LTS instance, using the ena driver.

The table below shows the TX and RX performance between the two c5n.18xlarge ec2 instances for one and two flows.

TX and RX pps between two c5n.18xlarge instances, for one and two flows

With this, it seems the per-queue limit is about 1Mpps. Typically the per-queue limit exists because a single queue (soft IRQ) is served by a single CPU core, meaning the per-queue performance is limited by how many packets per second a single CPU core can process. So again, what you typically see in virtualized environments is that the number of network queues goes up with the number of cores in the VM. In EC2 this is the same, though it maxes out at 32 queues.

Test two, RX only pps performance

Now that we determined that the per-queue limit appears to be roughly one million packets per second, it’s natural to assume that this number scales up horizontally with the number of cores and queues. For example, the C5n 18xl comes with 32 NIC queues and 72 cores, so in theory, we could naively assume that the (RX/TX) performance (given enough flows) should be 32Mpps. Let’s go ahead and validate that.

The graph below shows the transmit (TX) performance as measured on a c5n.18xlarge. In each measurement, I gave the packet generator one more queue and vCPU to work with, starting with one TX queue and one vCPU and incrementing by one in each measurement until we reached 32 vCPUs and 32 queues (the max). The results show that the per-TX-queue performance varied between 1Mpps and 700Kpps. The maximum total TX performance I was able to get, however, was ~8.5Mpps using 12 TX queues. After that, adding more queues and vCPUs didn’t matter, or actually degraded the performance. So this indicates that the performance scales horizontally (per queue), but does max out at a certain point (which varies per instance type), in this case at 8.5Mpps.

c5n.18xlarge per TX queue performance

In this next measurement, we’ll use two packet generators and one receiver. I’m using two generators, just to make sure the limit we observed earlier isn’t caused by limitations on the packet generator. Each traffic generator is sending many thousands of flows, making sure we leverage all the available queues.

RX pps per C5N instance type

Alright, after a few minutes of reading (and many, many hours, well really days, of measurements on my end), we now have a pretty decent idea of the performance numbers. We know how many queues each of the various c5n instance types has.

We have seen that the per queue limit is roughly 1Mpps. And with the table above, we now see how many packets per second each instance is able to receive (RX).

Forwarding performance

If we want to use ec2 for virtual network functions, then just receiving traffic isn’t enough. A typical router or firewall should both receive and send traffic at the same time. So let’s take a look at that.

For this measurement, I used the following setup. Both the traffic generator and receiver were C5n-18xl instances. The Device Under Test (DUT) was a standard Ubuntu 20.04 LTS instance using the ena driver. Since the earlier observed pps numbers weren’t too high, I determined it’s safe to use the regular Linux kernel to forward packets.

test setup
pps forwarding performance

The key takeaway from this measurement is that the TX and RX numbers are similar to what we’d seen before for the instance types up to (and including) the C5n 4xl. For example, earlier we saw the C5n 4xl could receive up to ~3Mpps. This measurement shows that it can do ~3Mpps simultaneously on RX and TX.

However, if we look at the C5n 9xl, we can see it was able to process about 6.2Mpps of combined RX + TX. Interestingly, earlier we saw it was also able to receive (RX only) ~6Mpps. So it looks like we hit some kind of aggregate limit. We observed a similar limit for the C5n 18xl instance.

In Summary.

In this blog, we looked at the various performance characteristics of networking on ec2. We determined that the performance of a single queue is roughly 1Mpps. We then saw how the number of queues goes up with the higher-end instances, up to a maximum of 32 queues.

We then measured the RX performance of the various instances as well as the forwarding (RX + TX aggregate) performance. Depending on the measurement setup (RX, or TX+RX), we see that for the largest instance types, the pps performance maxes out at roughly 6.6Mpps to 8.3Mpps. With that, I think the C5n 9xl hits the sweet spot in terms of cost vs. performance.

So how about that 100G test?

Ah yes! So far, we talked about pps only. How does that translate to gigabits per second?
Let’s look at the quick table below that shows how the pps number translates to Gbps at various packet sizes.

pps needed for 10G line rate at various packet sizes

These are a few examples of what it takes to get to 10G at various packet sizes. This shows that in order to support line-rate 10G at the smallest packet size, the system will need to be able to do ~14.88Mpps. The 366-byte packet size is roughly the equivalent average of what you’ll see with an IMIX test, for which the system needs to be able to process ~3.4Mpps to get to 10G line rate.

If we look at the same table but for 100Gbps, we see that at the smallest packet size, an instance would need to be able to process over 148Mpps. But using 9k jumbo frames, you only need 1.39Mpps.

pps needed for 100G line rate at various packet sizes
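
The math behind both tables is straightforward: on the wire, every Ethernet frame carries roughly 20 extra bytes of overhead (preamble plus inter-frame gap), so throughput in bits per second is pps x (packet size + 20) x 8. A quick awk sketch to play with the numbers:

#Gbps for a given pps and packet size: 64-byte packets at 148.8Mpps is ~100G
awk -v pps=148.8e6 -v size=64 'BEGIN { printf "%.1f Gbps\n", pps*(size+20)*8/1e9 }'

#pps needed for 100G with 9000-byte jumbo frames (~1.39Mpps)
awk -v gbps=100 -v size=9000 'BEGIN { printf "%.2f Mpps\n", gbps*1e9/((size+20)*8)/1e6 }'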

And so, that’s what you need to do to get to 100G networking in ec2. Use Jumbo frames (supported in ec2, in fact, for the larger instances, this was the default). With that and a few parallel flows you’ll be able to get to 100G “easily”!

A few more limits

There is one more limitation I read about while researching but didn’t look into myself: it appears that some of the instances have time-based limits on the performance. This blog calls it Guaranteed vs. Best Effort. Basically, you’re allowed to burst for a while, but after a certain amount of time, you’ll get throttled. Finally, there is a per-flow limit of 10Gbps. So if you’re doing things like IPsec, GRE, VXLAN, etc., note that a single flow will never go any faster than 10G.

Closing thoughts

Throughout this blog, I mentioned the word ‘limits’ quite a bit, which has a bit of a negative connotation. However, it’s important to keep in mind that AWS is a multi-tenant environment, and it’s their job to make sure the user experience is as close as possible to that of a dedicated instance. So you can also think of these as ‘guarantees’. AWS won’t call them that, but in my experience, the throughput tests have been pretty reproducible with, say, a +/- 10% measurement margin.

All in all, it’s pretty cool to be able to do 100G on AWS. As long as you are aware of the various limitations, which unfortunately aren’t well documented. Hopefully, this article helps some of you with that in the future.
Finally, could you use AWS to run your virtual firewalls, proxies, VPN gateways, etc.? Sure, as long as you’re aware of the performance constraints and, with those in mind, build a horizontally scalable design according to AWS best practices. The one thing you really do need to keep an eye on is the (egress) bandwidth pricing, which, once you start doing many gigabits per second, can add up.

Cheers
- Andree


Building a global anycast service in under a minute

This weekend I decided to take another look at Stackpath and their workload edge compute features. This is a relatively new feature; in fact, I wrote about it in Feb 2019 when it was just released. I remember being quite enthusiastic about the potential but also observed some things that were lacking back then. Now, one and a half years later, it seems most of those have been resolved, so let’s take a look!

I’ve decided to experiment with adding a small demo video to these blogs.
Below you will find a quick 5min demo of the whole setup. Since these videos are new and a bit of an experiment, let me know if you like it.
Demo: Building a global anycast service in under a minute

Workloads

Stackpath supports two types of workloads (in addition to serverless): VM and container-based deployments. Both can be orchestrated using APIs and Terraform. Terraform is an “infrastructure as code” tool: you simply specify your intent with Terraform, apply it, and you’re good to go. I’m a big fan of Terraform, so we’ll use that for our test.

One of the cool things about Stackpath is that they have built-in support for anycast, for both their VM and container service. I’m going to use that feature and the container service to build this highly available, low latency web service. It’s super easy; see for yourself on my GitHub here.

Docker setup

Since I’m going to use the container service, we need to create a Docker container to work with. This is my Dockerfile

FROM python:3
WORKDIR /usr/src/app
COPY ./mywebserver.py .
EXPOSE 8000
ENV PYTHONUNBUFFERED 1
CMD [ "python", "./mywebserver.py" ]

The mywebserver.py program is a simple web service that prints the hostname environment variable. This will help us determine which node is servicing our request when we start our testing.

After I built the container, I uploaded it to my Dockerhub repo, so that Stackpath can pull it from there.

Terraform

Now it’s time to define our infrastructure using terraform. The relevant code can be found on my github here. I’ll highlight a few parts:

On line 17 we start with defining a new workload, and I’m requesting an Anycast IP for this workload. This means that Stackpath will load balance (ECMP) between all nodes in my workload (which I’m defining later).

resource "stackpath_compute_workload" "my-anycast-workload" {   
    name = "my-anycast-workload"
    slug = "my-anycast-workload"   
    annotations = {       
        # request an anycast IP       
        "anycast.platform.stackpath.net" = "true"   
    }

On line 31, we define the type of workload, in this case, a container. As part of that we’re opening the correct ports, in my case port 8000 for the python service.

container {   
    # Name that should be given to the container   
    name = "app"   
    port {      
        name = "web"      
        port = 8000      
        protocol = "TCP"      
        enable_implicit_network_policy = true   
    }

Next up we define the container we’d like to deploy (from Dockerhub)

# image to use for the container
image = "atoonk/pythonweb:latest"

In the resources section, we define the container specifications. In my case, I’m going with a small spec of one CPU core and 2G of RAM.

resources {
   requests = {
      "cpu" = "1"
      "memory" = "2Gi"
   }
}

We now get to the section where we define how many containers we’d like per datacenter and in what datacenters we’d like this service to run.

In the example below, we’re deploying three containers in each datacenter, with the possibility to grow to four as part of auto-scaling. We’re deploying this in both Seattle and Dallas.

target {
    name         = "global"
    min_replicas = 3
    max_replicas = 4
    scale_settings {
      metrics {
        metric = "cpu"
        # Scale up when CPU averages 50%.
        average_utilization = 50
      }
    }
    # Deploy these instances to Dallas and Seattle
    deployment_scope = "cityCode"
    selector {
      key      = "cityCode"
      operator = "in"
      values   = [
        "DFW", "SEA"
      ]
    }
  }

Time to bring up the service.

Now that we’ve defined our intent with terrraform, it’s time to bring this up. The proper way to do this is:

terraform init
terraform plan
terraform apply
After that, you’ll see the containers come up, and our anycasted python service will become available. Since the containers come up rather quickly, you should have all six containers in the two datacenters up and running in under a minute.

Testing the load balancing.

I’ve deployed the service in both Seattle and Dallas, and since I am based in Vancouver Canada, I expect to hit the Seattle datacenter as that is the closest datacenter for me.

$ for i in `seq 1 10`; do curl 185.85.196.41:8000 ; done

my-anycast-workload-global-sea-2
my-anycast-workload-global-sea-0
my-anycast-workload-global-sea-2
my-anycast-workload-global-sea-0
my-anycast-workload-global-sea-1
my-anycast-workload-global-sea-1
my-anycast-workload-global-sea-2
my-anycast-workload-global-sea-1
my-anycast-workload-global-sea-2
my-anycast-workload-global-sea-0

The results above show that I am indeed hitting the Seattle datacenter, and that my requests are being load balanced over the three instances in Seattle, all as expected.
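
If you want an extra sanity check that anycast is really routing you to the nearest site, curl’s timing variables are a quick way to look at the TCP connect time, which should roughly match the RTT to the closest PoP:

#TCP connect time to the anycast IP; from Vancouver this should be Seattle-ish, a few ms
for i in `seq 1 5`; do curl -s -o /dev/null -w "connect: %{time_connect}s\n" 185.85.196.41:8000 ; done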

In the portal, I can see the per container logs as well

In Summary

Compared to my test last year with Stackpath, there has been a nice amount of progress. It’s great to now be able to do all of this with just a Terraform file. It’s kind of exciting that you can bring up a fully anycasted service in under a minute with only one command! By changing the replica numbers in the Terraform file, we can also easily grow and shrink our deployment if needed.
In this article we looked at the container service only, but the same is possible with virtual machines; my GitHub repo has an example for that as well.

Finally, don’t forget to check the demo recording and let me know if you’d like to see more video content.

Building an XDP (eXpress Data Path) based BGP peering router

Over the last few years, we’ve seen an increase in projects and initiatives to speed up networking in Linux. Because the Linux kernel is slow when it comes to forwarding packets, folks have been looking at userland or kernel-bypass networking. In the last few blog posts, we’ve looked at examples of this, mostly leveraging DPDK to speed up networking. The trend here is: let’s just take networking away from the kernel and process packets in userland. Great for speed, not so great for all the kernel network stack features that now have to be re-implemented in userland.

The Linux kernel community has recently come up with an alternative to userland networking, called XDP (eXpress Data Path), which tries to strike a balance between the benefits of the kernel and faster packet processing. In this article, we’ll take a look at what it would take to build a Linux router using XDP. We will go over what XDP is, how to build an XDP packet forwarder combined with a BGP router, and of course, look at the performance.

XDP (eXpress Data Path)

XDP (eXpress Data Path) is an eBPF-based high-performance data path that has been part of the Linux kernel since version 4.8. Yes, BPF, the same Berkeley Packet Filter you’re likely familiar with from tcpdump filters, though that’s now referred to as classic BPF. Extended BPF (eBPF) has gained a lot of popularity over the last few years within the Linux community. BPF allows you to attach programs to Linux kernel hook points; each time the kernel reaches one of those hook points, it can execute an eBPF program. I’ve heard some people describe eBPF as what JavaScript was for the web: an easy way to enhance the ‘web’, or in this case, the kernel. With BPF you can execute code without having to write kernel modules.

XDP, as part of the BPF family, operates early on in the kernel network code. The idea behind XDP is to add an early hook in the RX path of the kernel and let a user-supplied eBPF program decide the fate of the packet. The hook is placed in the NIC driver just after the interrupt processing and before any memory allocation needed by the network stack itself. So all this happens before an SKB (the most fundamental data structure in the Linux networking code) is allocated. Practically, this means it is executed before things like tc and iptables.

A BPF program runs in a small virtual machine; perhaps not the typical virtual machine you’re familiar with, but a tiny, isolated RISC register machine. Since it’s running in conjunction with the kernel, there are some protective measures that limit how much code can be executed and what it can do. For example, it cannot contain unbounded loops (only bounded loops), and there is a limited set of eBPF instructions and helper functions. The maximum instruction limit per program is restricted to 4096 BPF instructions, which, by design, means that any program will terminate quickly. For kernels newer than 5.1, this limit was raised to 1 million BPF instructions.

When and Where is the XDP code executed

XDP programs can be attached at three different points. The fastest is to have the program run on the NIC itself; for that, you need a SmartNIC, and this is called offload mode. To the best of my knowledge, this is currently only supported on Netronome cards. The next attachment opportunity is essentially in the driver, before the kernel allocates an SKB. This is called “native” mode and means your driver needs to support it; luckily, most popular drivers do nowadays.

Finally, there is SKB or generic mode XDP, where the XDP hook is called from netif_receive_skb(). This is after the packet DMA and SKB allocation are completed; as a result, you lose most of the performance benefits.

Assuming you don’t have a smartnic, the best place to run your XDP program is in native mode as you’ll really benefit from the performance gain.

XDP actions

Now that we know that XDP code is an eBPF C program and we understand where it can run, let’s take a look at what you can do with it. Once the program is called, it receives the packet context, and from that point on you can read the content, update some counters, and potentially modify the packet. Then the program needs to terminate with one of five XDP actions:

XDP_DROP
This does exactly what you think it does; it drops the packet and is often used for XDP based firewalls and DDoS mitigation scenarios.
XDP_ABORTED
Similar to DROP, but indicates something went wrong when processing. This action is not something a functional program should ever use as a return code.
XDP_PASS
This will release the packet and send it up to the kernel network stack for regular processing. This could be the original packet or a modified version of it.
XDP_TX
This action results in bouncing the received packet back out the same NIC it arrived on. This is usually combined with modifying the packet contents, for example, rewriting the IP and MAC addresses, such as for a one-legged load balancer.
XDP_REDIRECT
The redirect action allows a BPF program to redirect the packet somewhere else, either a different CPU or a different NIC. We’ll use this function later to build our router. It is also used to implement AF_XDP, a new socket family that solves the high-speed packet acquisition problem often faced by virtual network functions. AF_XDP is, for example, used by IDSs and is now also supported by Open vSwitch.

Building an XDP based high performant router

Alright, now that we have a better idea of what XDP is and some of its capabilities, let’s start building! My goal is to build an XDP program that forwards packets at line rate between two 10G NICs. I also want the program to use the regular Linux routing table. This means I can add static routes using the “ip route” command, but it also means I could use an open-source BGP daemon such as Bird or FRR.

We’ll jump straight to the code. I’m using the excellent XDP tutorial code to get started. I forked it here, but it’s mostly the same code as the original. This is an example called “xdp_router” and uses the bpf_fib_lookup() function to determine the egress interface for a given packet using the Linux routing table. The program then uses the action bpf_redirect_map() to send it out to the correct egress interface. You can see code here. It’s only a hundred lines of code to do all the work.

After we compile the code (just run make in the parent directory), we load the code using the ./xdp_loader program included in the repo and use the ./xdp_prog_user program to populate and query the redirect_params maps.

#pin BPF resources (redirect map) to a persistent filesystem
mount -t bpf bpf /sys/fs/bpf/

# attach xdp_router code to eno2
./xdp_loader -d eno2 -F --progsec xdp_router

# attach xdp_router code to eno4
./xdp_loader -d eno4 -F --progsec xdp_router

# populate redirect_params maps
./xdp_prog_user -d eno2
./xdp_prog_user -d eno4

Test setup

So far, so good, we’ve built an XDP based packet forwarder! For each packet that comes in on either network interface eno2 or eno4, it does a route lookup and redirects it to the correct egress interface, all in eBPF code. All in a hundred lines of code. Pretty awesome, right?! Now let’s measure the performance to see if it’s worth it. Below is the test setup.

test setup

I’m using the same traffic generator as before to generate 14Mpps at 64Bytes for each 10G link. Below are the results:

XDP forwarding Test results

The results are amazing! A single flow in one direction can go as high as 4.6 Mpps, using one core. Earlier, we saw the Linux kernel can go as high as 1.4Mpps for one flow using one core.

14Mpps in one direction between the two NICs require four cores. Our earlier blog showed that the regular kernel would need 16 cores to do this work!

Test result — XDP forwarding using XDP_REDIRECT, 5 cores to forward 29Mpps

Finally, for the bidirectional 10,000 flow test, forwarding 28Mpps, we need five cores. All tests are significantly faster than forwarding packets using the regular kernel and all that with minor changes to the system.

Just so you know

Since all packet forwarding happens in XDP, packets redirected by XDP won’t be visible to iptables or even tcpdump. Everything happens before packets even reach that layer, and since we’re redirecting the packet, it never moves higher up the stack. So if you need features like ACLs or NAT, you will have to implement them in XDP (take a look at https://cilium.io/).

A word on measuring CPU usage.
To control and measure the number of CPU cores used by XDP, I’m changing the number of queues the NIC can use. I increase the number of queues on my XL710 Intel NIC incrementally until I get a packet-loss-free transfer between the two ports on the traffic generator. For example, to get 14Mpps in one direction from port 0 to 1 on the traffic generator through our XDP router, which was forwarding between eno2 and eno4, I used the following settings:

ethtool -L eno2 combined 4
ethtool -L eno4 combined 4

For the 28Mpps testing, I used the following

ethtool -L eno2 combined 4
ethtool -L eno4 combined 4

A word of caution
Interestingly, increasing the number of queues, and thus using more cores, appears to, in some cases, have a negative impact on efficiency. I.e., I’ve seen scenarios when using 30 queues where the unidirectional 14Mpps test with 10,000 flows appears to use almost no CPU (between 1 and 2 cores) while the same test run bidirectionally uses up all 30 cores. When restarting this test, I see some inconsistent behavior in terms of CPU usage, so I’m not sure what’s going on; I will need to spend a bit more time on this later.

XDP as a peering router

The tests above show promising results, but one major difference between a simple forwarding test and a real-life peering router is the number of routes in the forwarding table. So the question we needed to answer was how the bpf_fib_lookup function performs when there are more than just a few routes in the routing table. More concretely, could you use Linux with XDP as a full-table peering router?
To answer this question, I installed Bird as a BGP daemon on the XDP router. Bird has a peering session with an ExaBGP instance, which I loaded with a full routing table using mrt2exabgp.py and MRT files from RIPE RIS.
Just to be a real peering router, I also filtered out the RPKI invalid routes using rtrsub. The end result is a full routing table with about 800k routes in the Linux FIB.

Test result — XDP router with a full routing table. 5 cores to forward 28Mpps

After re-running the performance tests with 800k BGP routes in the FIB, I observed no noticeable decrease in performance.
This indicates that a larger FIB table has no measurable impact on the XDP helper bpf_fib_lookup(). This is exciting news for those interested in a cheap and fast peering router.

Conclusion and closing thoughts.

We started the article with a quick introduction to eBPF and XDP. We learned that XDP is a subset of the recent eBPF developments, focused specifically on hooks in the network stack. We went over the different XDP actions and introduced the redirect action, which, together with the bpf_fib_lookup helper, allows us to build the XDP router.

Looking at the performance, we see that we can speed up packet forwarding in Linux by roughly five times in terms of CPU efficiency compared to regular kernel forwarding. We observed that we needed about five cores to forward 28Mpps bidirectionally between two 10G NICs.

When we compare these results with the results from my last blog, DPDK and VPP, we see that XDP is slightly slower, i.e., 3 cores (VPP) vs. 5 cores (XDP) for the 28Mpps test. However, the nice part about working with XDP was that I was able to leverage the Linux routing table out of the box, which is a major advantage.

The exciting part is that this setup integrates natively with Netlink, which allowed us to use Bird, or really any other routing daemon, to populate the FIB. We also saw that the impact of 800K routes in the fib had no measurable impact on the performance.

The fib_lookup helper function allowed us to build a router and leverage well-known userland routing daemons. I would love to also see a similar helper function for conntrack, or perhaps some other integration with Netfilter. It would make building firewalls and perhaps even NAT a lot easier: punt the first packet to the kernel, and let XDP handle the subsequent packets.
Wrapping up, we started with the question: can we build a high-performance peering router using XDP? The answer is yes! You can build a high-performance peering router using just Linux, relying on XDP to accelerate the dataplane while leveraging the various open-source routing daemons to run your routing protocols. That’s exciting!
Cheers
-Andree

Kernel bypass networking with FD.io and VPP.

Over the last few years, I have experimented with various flavors of userland, kernel-bypass networking. In this article, we’ll take FD.IO for a spin.

We will compare the results with those from my last blog, in which we looked at how much a vanilla Linux kernel could do in terms of forwarding (routing) packets. We observed that on Linux, to achieve 14Mpps, we needed roughly 16 and 26 cores for a unidirectional and bidirectional test, respectively. In this article, we’ll look at what we need to accomplish this with FD.io.

Userland networking

The principle of userland networking is that the networking stack is no longer handled by the kernel, but instead by a userland program. The Linux kernel is incredibly feature-rich, but for fast networking, it also requires a lot of cores to deal with all the (soft) interrupts. Several of the userland networking projects rely on DPDK to achieve incredible numbers. One reason why DPDK is so fast is that it doesn’t rely on interrupts. Instead, it uses poll mode drivers, meaning it’s continuously spinning at 100% picking up packets from the NIC. A typical server nowadays comes with quite a few CPU cores, and dedicating one or more cores to picking packets off the NIC is, in some cases, entirely worth it. Especially if the server needs to process lots of network traffic.

So DPDK provides us with the ability to send and receive packets efficiently and extremely fast. But that’s also it! Since we’re not using the kernel, we now need a program that takes the packets from DPDK and does something with them, like, for example, a virtual switch or router.

FD.IO


FD.IO is an open-source software dataplane developed by Cisco. At the heart of FD.io is something called Vector Packet Processing (VPP).

The VPP platform is an extensible framework that provides switching and routing functionality. VPP is built on a ‘packet processing graph.’ This modular approach means that anyone can ‘plugin’ new graph nodes. This makes extensibility rather simple, and it means that plugins can be customized for specific purposes.

FD.io can use DPDK as the driver for the NIC and can then process packets at a very high rate on commodity CPUs. It’s important to remember that it is not a fully-featured router, i.e., it doesn’t really have a control plane; instead, it’s a forwarding engine. Think of it as a router line card, with the NIC and the DPDK drivers as the ports. VPP allows us to take a packet from one NIC to another, transform it if needed, do table lookups, and send it out again. There are APIs that allow you to manipulate the forwarding tables, or you can use the CLI to, for example, configure static routes, VLANs, VRFs, etc.
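
For example, adding a static route and checking the FIB from the shell looks roughly like this (a sketch; interface names follow VPP’s PCI-based naming, and the exact CLI syntax can vary a little between VPP releases):

#add a static route via a next-hop on a connected interface
vppctl ip route add 192.0.2.0/24 via 10.10.10.3

#inspect the resulting forwarding table
vppctl show ip fib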

Test setup

I’ll use mostly the same test setup as in my previous test. Again using two n2.xlarge.x86 servers from packet.com and our DPDK traffic generator. The set up is as below.

Test setup

I’m using the VPP code from the FD.io master branch and installed it on a vanilla Ubuntu 18.04 system following these steps.

Test results — Packet forwarding using VPP

Now that we have our test setup ready to go, it’s time to start our testing!
To start, I configured VPP with “vppctl” like this; note that I needed to set static ARP entries since the packet generator doesn’t respond to ARP.

set int ip address TenGigabitEthernet19/0/1 10.10.10.2/24
set int ip address TenGigabitEthernet19/0/3 10.10.11.2/24
set int state TenGigabitEthernet19/0/1 up
set int state TenGigabitEthernet19/0/3 up
set ip neighbor TenGigabitEthernet19/0/1 10.10.10.3 e4:43:4b:2e:b1:d1
set ip neighbor TenGigabitEthernet19/0/3 10.10.11.3 e4:43:4b:2e:b1:d3

That’s it! Pretty simple right?

Ok, time to look at the results. Just like before, we did a single flow test, both unidirectional and bidirectional, as well as a 10,000 flow test.

VPP forwarding test results

Those are some remarkable numbers! With a single flow, VPP can process and forward about 8Mpps, not bad. The perhaps more realistic test with 10,000 flows shows us that it can handle 14Mpps with just two cores. To get to a full bidirectional scenario where both NICs are sending and receiving at line rate (28Mpps per NIC), we need three cores and three receive queues on the NIC. To achieve this last scenario with Linux, we needed approximately 26 cores. Not bad, not bad at all!

Traffic generator on the left, VPP server on the right. This shows the full line-rate bidirectional test: 14Mpps per NIC, while VPP uses 3 cores.

Test results — NAT using VPP

In my previous blog, we saw that when doing SNAT on Linux with iptables, we got as high as 3Mpps per direction, needing about 29 CPUs per direction. This showed us that packet rewriting is significantly more expensive than just forwarding. Let’s take a look at how VPP does NAT.

To enable NAT on VPP, I used the following commands:

nat44 add interface address TenGigabitEthernet19/0/3
nat addr-port-assignment-alg default
set interface nat44 in TenGigabitEthernet19/0/1 out TenGigabitEthernet19/0/3 output-feature

My first test is with one flow only, in one direction. With that, I’m able to get 4.3Mpps. That’s exactly half of what we saw in the performance test without NAT. It’s no surprise this is slower due to the additional work needed. Note that with Linux iptables I was seeing about 1.1Mpps.

A single flow isn’t super representative of a real-life NAT example, where you’d be translating many sources. So for the next measurements, I’m using 255 different source IP addresses and 255 destination IP addresses, as well as different port numbers; with this setup, the NAT code is seeing about 16k sessions. I can now see the numbers go to 3.2Mpps; more flows mean more NAT work. Interestingly, this number is exactly the same as I saw with iptables. There is, however, one big difference: with iptables the system was using about 29 cores, while in this test I’m only using two cores. That’s a low number of workers, and also the reason I’m capped. To remove that cap, I added more cores and validated that the VPP code scales horizontally. Eventually, I needed 12 cores to run 14Mpps for a stable experience.

VPP forwarding with NAT test results

Below is the relevant VPP config to control the number of cores used by VPP. Also, I should note that I isolated the cores I allocated to VPP so that the kernel wouldn’t schedule anything else on them.

cpu {
    main-core 1
    # CPU placement:
    corelist-workers 3-14
    # Also added this to grub: isolcpus=3-31,34-63
}
dpdk {
   dev default {
      # RSS, number of queues
      num-rx-queues 12
      num-tx-queues 12
      num-rx-desc 2048
      num-tx-desc 2048
   }
   dev 0000:19:00.1
   dev 0000:19:00.3
}
plugins {
   plugin default { enable }
   plugin dpdk_plugin.so { enable }
}
nat {
   endpoint-dependent
   translation hash buckets 1048576
   translation hash memory 268435456
   user hash buckets 1024
   max translations per user 10000
 }

Conclusion

Photo by National Cancer Institute on Unsplash

In this blog, we looked at VPP from the FD.io project as a userland forwarding engine. VPP is one example of a kernel bypass method for processing packets. It works closely with and further augments DPDK.

We’ve seen that the VPP code is feature-rich, especially for a kernel bypass packet forwarder. Most of all, it’s crazy fast.

We need just three cores to have two NICs forward full line rate (14Mpps) in both directions. Comparing that to the Linux kernel, which needed 26 cores, we see an almost 9x increase in performance.
We noticed that the results were even better when using NAT. In Linux, I wasn’t able to get any higher than 3.2Mpps, for which I needed about 29 cores. With VPP we can do 3.2Mpps with just two cores and get to full line-rate NAT with 12 cores.

I think FD.io is an interesting and exciting project, and I’m a bit surprised it’s not more widely used. One of the reasons is likely that there’s a bit of a learning curve. But if you need high-performance packet forwarding, it’s certainly something to explore! Perhaps this is the start of your VPP project? If so, let me know!

Cheers
-Andree

Linux Kernel and Measuring network throughput.

In my last blog, I wrote about how we can use DPDK pktgen for performance testing. Today I spent some time on baseline testing to see what we can expect out of a vanilla Linux system nowadays when used as a router. Over the last two years, I’ve been playing a fair bit with kernel bypass networking and hope to write about it in the near future. The promise of kernel bypass networking is higher performance; to determine how much of a performance increase it offers over the kernel, we need to establish a baseline first, which we’ll do in this article.

Test setup

n2.xlarge.x86 CPU specs.

I’m using two n2.xlarge.x86 servers from packet.com. With its two Numa nodes, 16cores per socket, 32 cores in total, 64 with hyper-threading, this is a very beefy machine! It also comes with a quad-port Intel x710 NIC, giving us 4 x 10Gbs. Packet allows you to create custom vlans and assign network ports to a vlan. I’ve created two vlans and assigned one NIC to each vlan. The setup looks like below.

Test setup

The Device Under Test (DUT) is a vanilla Ubuntu 19.04 system running a 5.0.0-38-generic kernel. The only minor tuning I’ve done is to set the NIC RX ring to 4096, and I enabled IP forwarding (net.ipv4.ip_forward=1).
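
For reference, that tuning boils down to a couple of commands (eno2 and eno4 stand in for whichever interfaces face the traffic generator):

#grow the RX ring on both test interfaces
ethtool -G eno2 rx 4096
ethtool -G eno4 rx 4096

#enable IPv4 forwarding
sysctl -w net.ipv4.ip_forward=1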

Using the traffic generator, I’m sending as many packets as possible and observing when packets stop coming back at the same rate, which indicates packet loss. I record the point where that happens as the maximum throughput. I’m also keeping a close eye on the CPU usage, to get a sense of how many CPU cores (hyper-threads) are needed to serve the traffic.

Test 1 — packet forwarding on Linux

The first test was easy. I’m simply sending packets from 10.10.11.1 to 10.10.12.1 and vice versa, through the DUT (Device Under Test), which is routing the packets between the two interfaces eno2 and eno4.
Note that I did both a unidirectional test (10.10.11.1 > 10.10.12.1) and a bidirectional test (10.10.11.1 > 10.10.12.1 AND 10.10.12.1 > 10.10.11.1).
I also tested with just one flow, and with 10,000 flows.

Receive Side Scaling (RSS)

This is important as the NIC is doing something called Receive Side Scaling (RSS), which load balances different flows onto different NIC receive queues. Each queue is then served by a different core, meaning the system scales horizontally. But keep in mind, you may still be limited by what a single core can do, depending on your traffic patterns.

Ok, show me the results! Keep in mind that we’re talking mostly about packets per second (pps), as that is the major indicator of the performance; it’s not super relevant how much data is carried in each packet. In the world of Linux networking, it really comes down to how many interrupts per second the system can process.
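
A quick way to watch that per-core interrupt work while a test is running is to keep an eye on the softirq load per CPU, for example:

#per-CPU utilization, including %soft (softirq time), once per second (sysstat package)
mpstat -P ALL 1

#or watch the NET_RX counters climb per CPU
watch -d -n1 'grep NET_RX /proc/softirqs'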

Test results for test 1

In the results above, you can see that one flow can go as high as 1.4Mpps. At that point, the core serving that queue is maxed out (running at 100%), cannot process any more packets, and will start dropping. The single flow forwarding performance is good to know for DDoS use cases or large single-flow network streams such as ESP. For services like these, the performance is as good as what a single queue/CPU can handle.

When doing the same test with 10,000 flows, I get to 14Mpps, full 10G line rate at the smallest possible packet size (64B), yay! At this point I can see all cores doing a fair amount of work. This is expected and is due to the hashing of flows over different queues. Looking at the CPU usage, I estimate that you’d need roughly 16 cores at 100% usage to serve this amount of packets (interrupts).

14Mpps, unidirectional test.

Interestingly, I wasn’t able to get to full line rate when doing the bidirectional test. Meaning both NICs both sending and receiving simultaneously. Although I am getting reasonably close at 12Mpps (24Mpps total per NIC). When eyeballing the cpu usage and amount of idle left over, I’d expect you’d need roughly 26 cores at 100% usage to do that.

Test 2 - Introducing a simple stateful iptables rule

In this test we’re adding two simple iptables rules to the DUT to see what the impact is. The hypothesis here is that since we’re now going to ask the system to invoke conntrack and do stateful session mapping, we’re starting to execute more code, which could impact the performance and system load. This test will show us the impact of that.

The iptables rules added were:

iptables -I FORWARD -d 10.10.11.1 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
iptables -I FORWARD -d 10.10.12.1 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

Test results for test 2, impact of conntrack

The results for the single-flow test look exactly the same, which is good. The results for the 10,000-flows test look the same as well in terms of packets per second; however, we do need a fair number of extra CPUs to do the work. Good thing our test system has plenty.
So you can still achieve close to full line rate with a simple stateful iptables rule, as long as you have enough CPUs. Note that in this case, the state table had 10,000 entries. I didn't test with more iptables rules.
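If you want to verify the conntrack table size during a test like this, the kernel exposes the counters directly (a quick check; exact values will obviously differ per run):

# Number of tracked connections vs. the configured maximum
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max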

Test 3 - Introducing a NAT rule

In this test, we're starting from scratch as in test 1, and I'm adding a simple NAT rule that causes all packets going through the DUT to be rewritten to a new source IP. These are the two rules:

iptables -I POSTROUTING -t nat -d 10.10.12.1 -s 10.10.11.1 -j SNAT --to 10.10.12.2
iptables -I POSTROUTING -t nat -d 10.10.11.1 -s 10.10.12.1 -j SNAT --to 10.10.11.2
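To confirm the NAT rules are actually being hit, the per-rule packet and byte counters are a quick sanity check:

# List the POSTROUTING NAT rules with packet/byte counters
iptables -t nat -L POSTROUTING -v -n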

The results below are quite different than what we saw earlier.

Linux Kernel and Measuring network throughput.
Test results for test 3

The results show that rewriting packets is quite a bit more expensive than just allowing or dropping them. For example, if we look at the unidirectional test with 10,000 flows, we dropped from 14Mpps (test 1) to 3.2Mpps, and we needed 13 more cores to do it!

Linux Kernel and Measuring network throughput.
This is what an (unhappy) 64-core system looks like when trying to forward and NAT 5.9Mpps

For what it's worth, I did do a quick measurement using nftables instead of iptables, but saw no significant difference in NAT performance.
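For completeness, roughly equivalent nftables rules for the same SNAT test would look something like this (a sketch only; chain names and priorities may differ from your setup):

nft add table ip nat
nft 'add chain ip nat postrouting { type nat hook postrouting priority 100 ; }'
nft add rule ip nat postrouting ip saddr 10.10.11.1 ip daddr 10.10.12.1 snat to 10.10.12.2
nft add rule ip nat postrouting ip saddr 10.10.12.1 ip daddr 10.10.11.1 snat to 10.10.11.2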

Conclusion

One of the questions I had starting this experiment was: can Linux route at line rate between two network interfaces? The answer is yes: we saw 14Mpps (unidirectional), as long as there are sufficient flows and you have enough cores (~16). The bidirectional test made it to 12Mpps (24Mpps total per NIC) with 26 cores at 100%.

We also saw that with the addition of two stateful iptables rules, I was still able to get the same throughput, but needed extra CPU to do the work. So at least it scales horizontally.

Finally, we saw a rather dramatic drop in performance when adding SNAT rules to the test. With SNAT, the maximum I was able to get out of the system was 5.9Mpps; this was for 20k sessions (10k per direction).

So yes, you can build a close-to-line-rate router in Linux, as long as you have sufficient cores and don't do too much packet manipulation. All in all, an interesting test, and now we have a baseline benchmark for future (kernel bypass / userland) networking experiments on Linux!

]]>
<![CDATA[Building a high performance - Linux Based Traffic generator with DPDK]]>

Often in my now 20-year networking career, I've had to do some form of network performance testing. Use-cases varied, from troubleshooting a customer problem to testing new network hardware, and nowadays, more and more, virtual network functions and software-based 'bumps in the wire'.

I’ve always enjoyed playing with

]]>
http://toonk.io/building-a-high-performance-linux-based-traffic-generator-with-dpdk/5f53d8003680d46e73ec60d6Wed, 18 Mar 2020 18:32:00 GMTBuilding a high performance - Linux Based Traffic generator with DPDK

Often in my now 20-year networking career, I've had to do some form of network performance testing. Use-cases varied, from troubleshooting a customer problem to testing new network hardware, and nowadays, more and more, virtual network functions and software-based 'bumps in the wire'.

I've always enjoyed playing with hardware-based traffic generators. My first experience with, for example, IXIA hardware testers goes back to, I think, 2003, at the Amsterdam Internet Exchange, where we were testing brand-new Foundry 10G cards. These hardware-based testers were super powerful and a great tool to validate new gear, such as router line cards, firewalls, and IPsec gear. However, we don't always have access to hardware-based traffic generators, as they tend to be quite expensive or only available in a lab. In this blog, we will look at a software-based traffic generator that anyone can use, based on DPDK. As you're going through this, remember that the scripts and additional info can be found on my GitHub page here.


Building a high performance - Linux Based Traffic generator with DPDK

DPDK, the Data Plane Development Kit, is an open-source software project started by Intel and now managed by the Linux Foundation. It provides a set of data plane libraries and network interface controller poll-mode drivers that run in userspace. Ok, let's think about that for a sec: what does that mean? Userspace networking is something you likely hear and read about increasingly often. The main driver behind userspace networking (aka kernel bypass) has to do with the way Linux has built its networking stack; it is built as part of a generic, multi-purpose, multi-user OS. Networking in Linux is powerful and feature-rich, but it's just one of the many features of Linux, and so, as a result, the networking stack needs to play fair and share its resources with the rest of the kernel and userland programs. The end result is that getting a few (1 to 3) million packets per second through the Linux networking stack is about what you can do on a standard system. That's not enough to fill up a 10G link with 64-byte packets, which is the equivalent of 14M packets per second (pps).

This is the point where the traditional interrupt-driven (IRQ) way of networking in Linux starts to limit what is needed, and this is where DPDK comes in. With DPDK and userland networking programs, we take the NIC away from the kernel and give it to a userland DPDK program. The DPDK driver is a poll mode driver (PMD), which means that, typically, one core per NIC always runs at 100% CPU; it sits in a busy loop, continuously polling for packets. That core will run at 100% regardless of how many packets are arriving or being sent on that NIC. This is obviously a bit of a waste, but nowadays, with plenty of cores and the need for high-throughput systems, this is often a great trade-off, and best of all, it allows us to get to the 14M pps number on Linux.

Ok, high performance, so we should all move to DPDK then, right?! Well, there's just one problem. Since we're now bypassing the kernel, we don't get to benefit from rich Linux features such as Netfilter, or even what we now consider basic features like a TCP/IP stack. This means you can't just run your Nginx, MySQL, BIND, etc., socket-based applications on DPDK, as all of these rely on the Linux socket API and the kernel to work. So although DPDK gives us a lot of speed and performance by bypassing the kernel, you also lose a lot of functionality.

There are quite a few DPDK-based applications nowadays, ranging from network forwarders, such as software-based routers and switches, to TCP/IP stacks such as F-Stack.

In this blog, we're going to look at DPDK-pktgen, a DPDK-based traffic generator maintained by the DPDK team. I'm going to walk through installing DPDK, setting up SR-IOV, and running pktgen. All of the below was tested on a Packet.com server of type x1.small.x86, which has a single Intel X710 10G NIC and a 4-core Xeon E3-1578L CPU. I'm using Ubuntu 18.04.4 LTS.

Installing DPDK and Pktgen

First, we need to install the DPDK libraries, tools, and drivers. There are various ways to install DPDK and pktgen; I elected to compile the code from source. There are a few things you need to do; to make it easier, you can download the same bash script I used to help you with the installation.
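The script takes care of the usual DPDK prerequisites. The rough shape of that prep work looks like this (a sketch only; the exact paths depend on the DPDK version and where you build it):

# Reserve 2MB hugepages; DPDK allocates its packet buffers from hugepage memory
echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge

# Load a userspace I/O driver so dpdk-devbind.py can later claim the NIC
modprobe uio
insmod /opt/dpdk-20.02/build/kmod/igb_uio.ko   # build output path assumed; adjust for your build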

Solving the single NIC problem

One of the challenges with DPDK is that it takes full control of the NIC. To use DPDK, you need to release the NIC from the kernel and hand it to DPDK. Given we only have one NIC, once we give it to DPDK, I'd lose all access to the machine (remember, there's no easy way to keep using SSH, etc., since that relies on the Linux kernel). Typically folks solve this by having a management NIC (for Linux) and one or more NICs for DPDK. But I have only one NIC, so we need to be creative: we're going to use SR-IOV to achieve the same. SR-IOV allows us to make one NIC appear as multiple PCI devices, so in a way, we're virtualizing the NIC.

To use SR-IOV, we need to enable the IOMMU in the kernel (done in the DPDK install script). After that, we can set the number of Virtual Functions (the number of new PCI NICs) like this:

echo 1 > /sys/class/net/eno1/device/sriov_numvfs
ip link set eno1 vf 0 spoofchk off
ip link set eno1 vf 0 trust on

dmesg -t will show something like this:

[Tue Mar 17 19:44:37 2020] i40e 0000:02:00.0: Allocating 1 VFs.
[Tue Mar 17 19:44:37 2020] iommu: Adding device 0000:03:02.0 to group 1
[Tue Mar 17 19:44:38 2020] iavf 0000:03:02.0: Multiqueue Enabled: Queue pair count = 4
[Tue Mar 17 19:44:38 2020] iavf 0000:03:02.0: MAC address: 1a:b5:ea:3e:28:92
[Tue Mar 17 19:44:38 2020] iavf 0000:03:02.0: GRO is enabled
[Tue Mar 17 19:44:39 2020] iavf 0000:03:02.0 em1_0: renamed from eth0

We can now see the new PCI device and nic name:

root@ewr1-x1:~# lshw -businfo -class network | grep 000:03:02.0
pci@0000:03:02.0 em1_0 network Illegal Vendor ID

Next up we will unbind this NIC from the kernel and give it to DPDK to manage:

/opt/dpdk-20.02/usertools/dpdk-devbind.py -b igb_uio 0000:03:02.0

We can validate that like this (note em2 is not connected and not used):

/opt/dpdk-20.02/usertools/dpdk-devbind.py -s
Network devices using DPDK-compatible driver
============================================
0000:03:02.0 'Ethernet Virtual Function 700 Series 154c' drv=igb_uio unused=iavf,vfio-pci,uio_pci_generic
Network devices using kernel driver
===================================
0000:02:00.0 'Ethernet Controller X710 for 10GbE backplane 1581' if=eno1 drv=i40e unused=igb_uio,vfio-pci,uio_pci_generic
0000:02:00.1 'Ethernet Controller X710 for 10GbE backplane 1581' if=em2 drv=i40e unused=igb_uio,vfio-pci,uio_pci_generic

Testing setup

Now that we're ready to start testing, I should explain the simple test setup. I'm using two x1.small servers; one is the sender (running dpdk-pktgen), the other is a vanilla Ubuntu machine. What we're going to test is the ability of the receiver's kernel, sometimes referred to as the Device Under Test (DUT), to pick up the packets from the NIC. That's all; we're not processing anything, and the IP address the packets are sent to isn't even configured on the DUT, so the kernel will drop the packets as soon as it picks them up from the NIC.

Building a high performance - Linux Based Traffic generator with DPDK
test setup

Single flow traffic

Ok, time to start testing! Let’s run pktgen and generate some packets! My first experiment is to figure out how much I can send in a single flow to the target machine before it starts dropping packets.

Note that you can find the exact config in the GitHub repo for this blog. The file pktgen.pkt contains the commands to configure the test setup. Things I configured include the MAC and IP addresses, ports and protocols, and the sending rate. Note that I'm testing from 10.99.204.3 to 10.99.204.8. These are on /31 networks, so I'm setting the destination MAC address to that of the default gateway. With the config as defined in pktgen.pkt, I'm sending the same 64-byte packets (5-tuple, UDP 10.99.204.3:1234 > 10.99.204.8:81) over and over.
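To give a feel for what such a config contains, a minimal pktgen.pkt could look roughly like this (a sketch only; the gateway MAC is a placeholder, and the exact command syntax can vary a bit between pktgen versions, so use the file in the repo as the reference):

set 0 dst mac 00:11:22:33:44:55   # placeholder: MAC of the default gateway
set 0 src ip 10.99.204.3/31
set 0 dst ip 10.99.204.8
set 0 proto udp
set 0 sport 1234
set 0 dport 81
set 0 size 64
set 0 rate 100
start 0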

I’m using the following to start pktgen.

/opt/pktgen-20.02.0/app/x86_64-native-linuxapp-gcc/pktgen -- -T -P -m "2.[0]" -f pktgen.pkt

After adjusting the sending rate on the sender and monitoring with ./monitorpkts.sh on the receiver, I found that a single flow (single queue, single core) runs clean on this receiver machine up until about 120k pps. If I push the sending rate higher than that, I start to observe packets being dropped on the receiver. That's a bit lower than expected, and even though it's one flow, I can see that the CPU serving that queue has enough idle time left. There must be something else happening…

The answer has to do with the receive buffer ring on the receiver's network card: it was too small for the higher packet rates. After increasing it from 512 to 4096, I can receive up to 1.4Mpps before seeing drops, not bad for a single flow!

ethtool -G eno1 rx 4096
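The monitorpkts.sh script in the repo is what I used to watch for drops; the core idea is simply to sample the interface counters every second, something along these lines (a simplified sketch, interface name assumed; see the repo for the actual script):

#!/bin/bash
# Print per-second rx and drop rates for an interface (default eno1)
IFACE=${1:-eno1}
prev_rx=$(cat /sys/class/net/$IFACE/statistics/rx_packets)
prev_drop=$(cat /sys/class/net/$IFACE/statistics/rx_dropped)
while sleep 1; do
  rx=$(cat /sys/class/net/$IFACE/statistics/rx_packets)
  drop=$(cat /sys/class/net/$IFACE/statistics/rx_dropped)
  echo "rx: $((rx - prev_rx)) pps  dropped: $((drop - prev_drop)) pps"
  prev_rx=$rx
  prev_drop=$drop
done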

Multi flow traffic

Pktgen also comes with the ability to configure ranges. Examples of ranges include source and destination IP addresses, as well as source and destination ports. You can find an example in the pktgen-range.pkt file. For most environments, this is a more typical scenario, as your server is likely to serve many different flows from many different IP addresses. In fact, the Linux system relies on the existence of many flows to be able to deal with these higher amounts of traffic. The Linux kernel hashes and load balances these different flows onto the available receive queues on the NIC. Each queue is then served by a separate interrupt thread, allowing the kernel to parallelize the work and leverage the multiple cores on the system.

Below you'll find a screenshot from when I was running the test with many flows. The receiver terminals can be seen on the left, the sender on the right. The main thing to notice here is that on the receiving node, all available CPUs are being used; note the ksoftirqd/X processes. Since we are using a wide range of source and destination ports, we're getting proper load balancing over all cores. With this, I can now achieve 0% packet loss up to about 6Mpps. To get to 14Mpps, 10G line rate at 64-byte packets, I'd need more CPUs.

Building a high performance - Linux Based Traffic generator with DPDK

IMIX test

Finally, we’ll run a basic IMIX test, using the dpdk-pktgen pcap feature. Internet Mix or IMIX refers to typical Internet traffic. When measuring equipment performance using an IMIX of packets, the performance is assumed to resemble what can be seen in “real-world” conditions.

The imix pcap file contains 100 packets with the sizes and ratio according to the IMIX specs.

tshark -r imix.pcap -V | grep 'Frame Length'| sort | uniq -c|sort -n
9 Frame Length: 1514 bytes (12112 bits)
33 Frame Length: 590 bytes (4720 bits)
58 Frame Length: 60 bytes (480 bits)

I need to rewrite the source and destination IP and MAC addresses so that they match my current setup; this can be done like this:

tcprewrite \
  --enet-dmac=44:ec:ce:c1:a8:20 \
  --enet-smac=00:52:44:11:22:33 \
  --pnat=16.0.0.0/8:10.10.0.0/16,48.0.0.0/8:10.99.204.8/32 \
  --infile=imix.pcap \
  --outfile=output.pcap

For more details, also see my notes here: https://github.com/atoonk/dpdk_pktgen/blob/master/DPDKPktgen.md

We then start the pktgen app and give it the pcap:

/opt/pktgen-20.02.0/app/x86_64-native-linuxapp-gcc/pktgen -- -T -P -m "2.[0]" -s 0:output.pcap

I can now see that I'm sending and receiving packets at a rate of 3.2Mpps at 10Gb/s, well below the maximum we saw earlier. This means that the Device Under Test (DUT) is capable of receiving packets at 10Gb/s with an IMIX traffic pattern.

Building a high performance - Linux Based Traffic generator with DPDK
Result of IMIX test with a PCAP as the source. Receiver (DUT) on the left, sender window on the right.

Conclusion

In this article, we looked at getting DPDK up and running, talked a bit about what DPDK is, and used its pktgen traffic generator application. A typical challenge when using DPDK is that you lose the network interface, meaning that the kernel can no longer use it. In this blog, we solved this using SR-IOV, which allowed us to create a second logical interface for DPDK. Using this interface, I was able to generate 14Mpps without issues.

On the receiving side of this test traffic, we had another Linux machine (no DPDK), and we tested its ability to receive traffic from the NIC (after which the kernel dropped it straight away). We saw how the packets-per-second number is limited by the rx buffer, as well as by the ability of the CPU cores to pick up the packets (soft interrupts). We saw that a single core was able to do about 1.4Mpps. Once we started leveraging more cores by creating more flows, we started seeing dropped packets at about 6Mpps. If we had had more CPUs, we'd likely be able to do more than that.

Also note that throughout this blog, I spoke mostly of packets per second and not much in terms of bits per second. The reason for this is that every new packet on the Linux receiver (DUT) creates an interrupt. As a result, the number of interrupts the system can handle is the most critical indicator of how many bits per second the Linux system can handle.

All in all, pktgen and dpdk require a bit of work to set up, and there is undoubtedly a bit of a learning curve. I hope the scripts and examples in the GitHub repo will help with your testing and remember: with great power comes great responsibility.

Building a high performance - Linux Based Traffic generator with DPDK
]]>
<![CDATA[TCP BBR - Exploring TCP congestion control]]>

One of the oldest protocols and possibly the most used protocol on the Internet today is TCP. You likely send and receive hundreds of thousands or even over a million TCP packets (eeh segments?) a day. And it just works! Many folks believe TCP development has finished, but that’s

]]>
http://toonk.io/tcp-bbr-exploring-tcp-congestion-control/5f53dac53680d46e73ec610dSat, 15 Feb 2020 00:00:00 GMT

One of the oldest protocols and possibly the most used protocol on the Internet today is TCP. You likely send and receive hundreds of thousands or even over a million TCP packets (eeh, segments?) a day. And it just works! Many folks believe TCP development has finished, but that's incorrect. In this blog, we'll take a look at a relatively new TCP congestion control algorithm called BBR and take it for a spin.

Alright, we all know the difference between the two most popular transport protocols used on the Internet today: UDP and TCP. UDP is a send-and-forget protocol. It is stateless and has no congestion control or reliable delivery support. We often see UDP used for DNS and VPNs. TCP is UDP's sibling and does provide reliable transfer and flow control; as a result, it is quite a bit more complicated.

People often think the main difference between TCP and UDP is that TCP gives us guaranteed packet delivery. This is one of the most important features of TCP, but TCP also gives us flow control. Flow control is all about fairness and is critical for the Internet to work; without some form of flow control, the Internet would collapse.

Over the years, different congestion control algorithms have been implemented and used in the various TCP stacks. You may have heard of TCP terms such as Reno, Tahoe, Vegas, Cubic, Westwood, and, more recently, BBR. These are all different congestion control algorithms used in TCP. What these algorithms do is determine how fast the sender should send data while adapting to network changes. Without these algorithms, our Internet pipes would soon be filled with data and collapse.

BBR

Bottleneck Bandwidth and Round-trip propagation time (BBR) is a TCP congestion control algorithm developed at Google in 2016. Up until recently, the Internet has primarily used loss-based congestion control, relying only on indications of lost packets as the signal to slow down the sending rate. This worked decently well, but the networks have changed. We have much more bandwidth than ever before; the Internet is generally more reliable now, and we see new things such as bufferbloat that impact latency. BBR tackles this with a ground-up rewrite of congestion control, using latency, instead of lost packets, as the primary signal to determine the sending rate.

TCP BBR - Exploring TCP congestion control
Source: https://cloud.google.com/blog/products/gcp/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster

Why is BBR better?

There are a lot of details I've omitted, and it gets complicated pretty quickly, but the important thing to know is that with BBR, you can get significantly better throughput and reduced latency. The throughput improvements are especially noticeable on long haul paths, such as transatlantic file transfers, especially when there's minor packet loss. The improved latency is mostly seen on the last-mile path, which is often impacted by bufferbloat (4-second ping times, anyone?). Since BBR attempts not to fill the buffers, it tends to be better at avoiding bufferbloat.

TCP BBR - Exploring TCP congestion control
Photo by Zakaria Zayane on Unsplash

let’s take BBR for a spin!

BBR has been in the Linux kernel since version 4.9 and can be enabled with a simple sysctl command. In my tests, I'm using two Ubuntu machines and iperf3 to generate TCP traffic. The two servers are located in the same data center; I'm using two Packet.com servers of type t1.small, which come with a 2.5Gbps NIC.
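Before testing, it's worth confirming which congestion control algorithms the kernel has available and that the BBR module is loaded (a quick check; on older kernels you'll also see guides pair BBR with the fq qdisc for pacing):

# List the congestion control algorithms this kernel can use
sysctl net.ipv4.tcp_available_congestion_control

# Load the BBR module if it isn't already available
modprobe tcp_bbr

# Show the currently active algorithm
sysctl net.ipv4.tcp_congestion_control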

The first test is a quick one to see what we can get from a single TCP flow between the two servers. This shows 2.35Gb/s, which sounds about right and is good enough to run our experiments.
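That baseline number comes from a plain iperf3 run between the two machines, roughly like this (server address assumed; substitute the other machine's IP):

# On the receiving server
iperf3 -s

# On the sending server, run a 30-second test towards the receiver
iperf3 -c 147.75.69.253 -t 30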

The effect of latency on TCP throughput

In my day job, I deal with machines that are distributed over many dozens of locations all around the world, so I'm mostly interested in the performance between machines that have some latency between them. In this test, we are going to introduce 140ms round trip time between the two servers using Linux Traffic Control (tc). This is roughly the equivalent of the latency between San Francisco and Amsterdam. This can be done by adding 70ms per direction on both servers like this:

tc qdisc replace dev enp0s20f0 root netem latency 70ms

If we do a quick ping, we can now see the 140ms round trip time

root@compute-000:~# ping 147.75.69.253
PING 147.75.69.253 (147.75.69.253) 56(84) bytes of data.
64 bytes from 147.75.69.253: icmp_seq=1 ttl=61 time=140 ms
64 bytes from 147.75.69.253: icmp_seq=2 ttl=61 time=140 ms
64 bytes from 147.75.69.253: icmp_seq=3 ttl=61 time=140 ms

Ok, time for our first test. I'm going to start with Cubic, as that is the most common TCP congestion control algorithm used today.

sysctl -w net.ipv4.tcp_congestion_control=cubic

A 30-second iperf run shows an average transfer speed of 347Mb/s. This is the first clue of the effect of latency on TCP throughput. The only thing that changed from our initial test (2.35Gb/s) is the introduction of 140ms of round trip delay. Let's now set the congestion control algorithm to BBR and test again.

sysctl -w net.ipv4.tcp_congestion_control=bbr

The result is very similar: the 30-second average is now 340Mb/s, slightly lower than with Cubic. So far, no real changes.

The effect of packet loss on throughput

We're going to repeat the same test as above, but with the addition of a minor amount of packet loss. With the command below, I'm introducing 1.5% packet loss on the server (sender) side only.

tc qdisc replace dev enp0s20f0 root netem loss 1.5% latency 70ms

The first test with Cubic shows a dramatic drop in throughput: it falls from 347Mb/s to 1.23Mb/s. That's a ~99.5% drop and leaves this link basically unusable for today's bandwidth needs.

If we repeat the exact same test with BBR, we see a significant improvement over Cubic. With BBR, the throughput drops to 153Mb/s, a 55% drop.

The tests above show the effect of packet loss and latency on TCP throughput. The impact of just a minor amount (1.5%) of packet loss on a long-latency path is dramatic. Using anything other than BBR on these longer paths will cause significant issues when there is even a minor amount of packet loss. Only BBR maintains a decent throughput number at anything more than 1.5% loss.

The table below shows the complete set of results for the various TCP throughput tests I did using different congestion control algorithms, latency and packet loss parameters.

TCP BBR - Exploring TCP congestion control
Throughput Test results with various congestion control algorithms

Note: the congestion control algorithm used for a TCP session is only locally relevant. So, two TCP speakers can use different congestion control algorithms on each side of the TCP session. In other words: the server (sender) can enable BBR locally; there is no need for the client to be BBR-aware or support BBR.

TCP socket statistics

As you're exploring TCP performance tuning, make sure to use socket statistics, or ss, like below. This tool displays a ton of socket information, including the TCP congestion control algorithm used and the round trip time per TCP session, as well as the calculated bandwidth and actual delivery rate between the two peers.

TCP BBR - Exploring TCP congestion control
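A typical invocation looks something like this (the destination filter is assumed from the earlier ping test; any address or port filter that ss supports will work):

# Show TCP internals (congestion algorithm, rtt, cwnd, delivery_rate) for
# sessions towards the test server
ss -tin dst 147.75.69.253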

When to use BBR

Both Cubic and BBR perform well on these longer-latency links when there is no packet loss, and BBR really shines under (moderate) packet loss. Why is that important? You could argue: why would you want to design for these packet loss situations? For that, let's think about a situation where you have multiple data centers around the world, and you rely on transit to connect the various data centers (possibly using your own overlay VPN). You likely have a steady stream of data between the various data centers: think of log files, ever-changing configuration or preference files, database synchronization, backups, etc. All major transit providers at times suffer from packet loss for various reasons. If you have a few dozen of these globally distributed data centers, depending on your transit providers and the locations of your POPs, you can expect packet loss incidents between a set of data centers several times a week. In situations like this, BBR will shine and help you maintain your SLOs.

I’ve mostly focused on the benefits of BBR for long haul links. But CDNs and various application hosting environments will also see benefits. In fact, Youtube has been using BBR for a while now to speed up their already highly optimized experience. This is mostly due to the fact that BBR ramps up to the optimal sending rate aggressively, causing your video stream to load even faster.

Downsides of BBR

It sounds great, right? Just execute this one sysctl command, and you get much better throughput, resulting in a better experience for your users. Why would you not do this? Well, BBR has received some criticism due to its tendency to consume all available bandwidth and push out other TCP streams that use, say, Cubic or other congestion control algorithms. This is something to be mindful of when testing BBR in your environment. BBRv2 is supposed to resolve some of these challenges.

All in all, I was amazed by the results. It looks to me like this is certainly worth taking a closer look at. You won't be the first: in addition to Google, Dropbox and Spotify are two other examples of companies using or experimenting with BBR.

]]>
<![CDATA[Building A Smarter AWS Global Accelerator]]>http://toonk.io/building-a-smarter-aws-global-accelerator/5f53e04f3680d46e73ec613fSun, 01 Dec 2019 00:00:00 GMT

In my last blog post, we looked at Global Accelerator, a global load balancer provided by AWS. I think Global Accelerator is an excellent tool for folks building global applications in AWS, as it helps them direct traffic to the right origin locations or servers. This is great for high-volume applications, as well as for providing improved availability.

In this blog post, we’ll take a look at how we could build our own global accelerator by building on other previous blog posts (building anycast applications on packet). I’ve been thinking a lot about Global Accelerator and while it provides a powerful data plane, I think it would benefit from a smarter control plane. A control plane that provides load balancing with more intelligence, by taking into account the capacity, load and round trip time to each origin. In this blog post, we’ll evaluate and demonstrate what that could look like by implementing a Smarter Global Accelerator ourselves.

Typical Architecture

Many applications nowadays are delivered using the architecture below. Clients always hit one of the nearest edge nodes (the blue diamonds). Which edge node that is depends on the way traffic is directed to the edge, typically either DNS-based load balancing or straight anycast, which is what AWS Global Accelerator uses.

Building A Smarter AWS Global Accelerator
Typical application delivery architecture

The edge node then needs to determine what origin server to send the request to (assuming there is no caching, or a cache miss). This is how your typical CDN works, but also how, for example, Google and Facebook deliver their applications. In the case of a simple CDN, there could be one or more origin servers. In the case of Facebook, the choice is more about which of their 'core' or 'larger' datacenters to send the request to.

With AWS Global Accelerator you can configure listeners (the diamond) in a region to send a certain percentage of traffic to an origin group, ‘Endpoint Groups’ in AWS speak. This is a static configuration, which is not ideal. Additionally, if an origin (the green box in the diagram) reaches its capacity, you will need to update the configuration. This is the part we’re going to make smarter.

In an ideal world, each edge node (the diamond) routes the requests to the closest origin based on the latency between the edge node and the origin. It should also be aware of the total load the origin is under, and how much load the origin can handle. An individual edge node doesn’t know how many other edge nodes there are, and how much each of them is sending each origin. So we need a centralized brain and a feedback loop.

Building a Closed-loop system

To have the system continuously adapt to the changing environment, we need to have access to several operational metrics. We need to know how many requests each edge node (load balancer) is receiving, with that we can infer the total, global, number of incoming requests per second. We also need to know the capacity of each origin since we want to make sure we don’t send more traffic to an origin than it can handle. Finally, we need to know the health and latency between each edge node and each origin. Most of these metrics are dynamic, so we need to continuously publish (or poll) the health information and request per second information.

Now that we have all the input data, we can feed this into our software. The software is essentially a scheduler, solving a constraint-based assignment problem. The output is a list of all edge nodes, with a weight assignment per origin for each edge node. A simple example could look like this:

-Listener 192.0.2.10:443 
 - Edge node Amsterdam:
    - Origin EU DC:  90%
    - Origin US-WEST DC:  0%
    - Origin US-EAST DC:  10%
    - Origin Asia DC:  0%
 - Edge node New York:
    - Origin EU DC: 0%
    - Origin US-WEST DC:  0%
    - Origin US-EAST DC:  100%
    - Origin Asia DC:  0%

In the above example for listener 192.0.2.10:443, the Amsterdam edge node will send 90% of the requests to the EU origin, while the remaining 10% is directed to the next closest DC, US-EAST. This means the EU datacenter is at capacity and is offloading traffic to the next closest origin.

The New York edge node is sending all traffic to the US-EAST datacenter, as there is enough capacity there at this point and no need to offload traffic.

Our closed-loop system will re-calculate and publish the results every few seconds so that we can respond to changes quickly.

Let’s start building

I'm going to re-use much of what we've built earlier; in this experiment, I'm again using Packet.net and their BGP anycast support to build the edge nodes. Please see this blog for details. I'm using Linux LVS as the load balancer for this setup. Each edge node publishes the needed metrics to Prometheus, a time-series database, every 15 seconds. With this, we now have a handful of edge nodes, fully anycasted, and access to all needed metrics per edge node in a centralized system.

Building A Smarter AWS Global Accelerator
Swagger file, built using Flask-RESTPlus

The other thing that is needed is a centralized source of truth. For this, I wrote a Flask based REST API. This API allows us to create new load balancers, add origins, etc. We can also ask this same API for all load balancers, its origins and the health and operational metrics.

The next thing we need is a script that talks to the API every few seconds to retrieve the latest configuration. With that information, each edge node can update its load balancer configuration, such as creating new load balancer listeners and updating origin details like the weight per origin. We now have everything in place and can start testing.

Building A Smarter AWS Global Accelerator
JSON definition for each load balancer
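To make that concrete, translating the weight listing from earlier into an LVS configuration with ipvsadm could look something like this (a hypothetical sketch; the origin IP addresses and forwarding mode are made up for illustration):

# Weighted round robin service for listener 192.0.2.10:443
ipvsadm -A -t 192.0.2.10:443 -s wrr

# EU origin gets weight 90, US-EAST origin gets weight 10 (example IPs)
ipvsadm -a -t 192.0.2.10:443 -r 198.51.100.10:443 -m -w 90
ipvsadm -a -t 192.0.2.10:443 -r 198.51.100.30:443 -m -w 10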

Observations and tuning

I started testing by generating many GET requests to one listener that has two origins: one DigitalOcean VM in the US and one in Europe. Since all testing was performed from one location, it was hitting one edge datacenter, which has two edge nodes. Those edge nodes would send the traffic to the closest origin, which is the US origin. Now imagine this origin server was hitting its maximum capacity, and I want to protect it from being overloaded and start sending some traffic to the other origin. To do that, I set the maximum load number for the US origin to 200 requests per second (also see the JSON above).

Below you'll see an interesting visualization of this measurement. At t=0, the total traffic load for both origins is 0; no traffic is coming in at all, which also means that both origins are well below their max capacity. As a result, the US load balancers are configured to send all requests to the US origin, as it has the lowest latency and is below the max threshold.

After we start generating traffic, all requests are sent to the US origin. As the metrics start coming in, the system detects that the total number of incoming requests exceeds 200; as a result, the load balancer configuration is updated to start sending traffic to both origin servers, with the intent of not sending more than 200 requests per second to the US origin.

Building A Smarter AWS Global Accelerator

Speed vs. accuracy.

One of the things we want to prevent is big, sudden swings in traffic. To achieve that, I built in a dampening factor that limits the load-shifting (i.e., load balancer configuration) change to 2 percent per origin for each 15-second interval. Note that if you have two origins, this means a 2% change per origin, so a 4% swing per 15-second cycle. This means it will take a bit longer for the system to respond to major changes, but it gives us substantially more stability, meaning less oscillation between origins, and allows the system to stabilize. In my initial version, I had no dampening, and the system never stabilized and showed significant, unwanted sudden traffic swings.

The graph shows an interesting side effect of my testing setup. Since I'm testing from the US west coast and start offloading more and more requests to the EU origin, a single curl will, on average, take longer due to the increased round trip time. As a result, the total number of requests goes down. Which is fun, because it means the software needs to adapt constantly: every time we change the origin weights slightly, the total number of requests changes slightly. This causes some oscillation, but it's also exactly the oscillation a closed-loop system is designed for, and it works well as long as we have a dampening factor. It also shows that in some instances the total number of inbound requests increases if your website (or any app) responds faster, though I'm not sure if that's representative of a real-world scenario. Still, this was a fun side effect that put a bit of extra stress on the software.

Closing thoughts

This project was fun, as it allowed me to combine several of my interests. One of them is global traffic engineering, i.e., how do we get traffic to where we'd like it to be processed. We also got to re-use the lessons learned and experience gained from a few of the previous blog posts, specifically how to build anycasted applications with Packet and a deep dive into AWS Global Accelerator. I got to improve my Python skills by building a RESTful API in Flask and making sure it was properly documented using the OpenAPI spec.
Finally, building the actual scheduler was a fun challenge, and it took me a while to figure out how to best solve the assignment problem before I was really happy with the outcome. The result is what could be a Smarter Global Accelerator.
]]>
<![CDATA[Building a high available Anycast service using AWS Global Accelerator]]>http://toonk.io/building-a-high-available-anycast-service-using-aws-global-accelerator/5f53e1f33680d46e73ec6157Mon, 02 Sep 2019 00:00:00 GMT

Everything I'm demonstrating below can easily be replicated using the Terraform code here; with that, we will have a global anycast load balancer and origin web & DNS servers in under 5 minutes.

Global Accelerator became publicly available in late 2018, and I’ve looked at it various times since then with a few potential use-cases in mind. The main value proposition for users is the ability to get a static IPv4 address that isn’t tied to a region. With global accelerator, customers get two globally anycasted IPv4 addresses that can be used to load balance across 14 unique AWS regions. A user request will get routed to the closest AWS edge POP based on BGP routing. From there, you can load balance requests to the AWS regions where your applications are deployed. Global accelerator comes with traffic dials that allow you to control how much traffic goes to what region, as well as instances in that region. It also has built-in health checking to make sure traffic is only routed to healthy instances.

Global Accelerator comes with a few primitives; let's briefly review them:

  • Global Accelerator: think of this as your anycast endpoint. It's a global primitive and comes with two IPv4 addresses.
  • Next up, you create a Listener which defines what port and protocol (tcp or udp) to listen on. A Global Accelerator can have multiple listeners. We’ll create two for this article later on.
  • For each Listener, you then create one or more Endpoint Groups. An endpoint group allows you to group endpoints together by region. For example, EU-CENTRAL-1 can be an endpoint group. Each endpoint group has a traffic dial that controls the percentage of traffic you'd like to send to a region.
  • Finally, we have an Endpoint, this can be either an Elastic IP address, a Network Load Balancer, or an Application Load Balancer. You can have multiple endpoints in an Endpoint group. For each endpoint, you can configure a weight that controls how traffic is load-balanced over the various endpoints within an endpoint group.

These are some basic yet powerful primitives to help you build a highly available, low latency application with granular traffic control.

Let’s start building!

Alright that sounds cool, let’s dive right in. In this example, we’re going to build a highly available web service and DNS service in four AWS regions, us-east-1, us-west-2, eu-central-1, and ap-southeast-1. Each region will have two ec2 instances with a simple Go web server and a simple DNS server. We’ll create the global accelerator, listeners, endpoint groups, endpoints, as well as all the supporting infrastructure such as VPCs, subnets, security groups all through terraform.

I’ve published the terraform and other supporting code on my github page: https://github.com/atoonk/aws_global_accelerator. This allows you to try it yourself with minimum effort. Time to deploy our infrastructure:

terraform init && terraform plan && terraform apply -auto-approve
…
Plan: 63 to add, 0 to change, 0 to destroy.
...
GlobalAccelerator = [
  [
    [
      {
        "ip_addresses" = [
          "13.248.138.197",
          "76.223.8.146",
        ]
        "ip_family" = "IPv4"
      },
    ],
  ],
]


After a few minutes, all 63 components have been built. Now our globally distributed application is available over two anycast IPv4 addresses, in four regions, with two origin servers per region like the diagram below.

Building a high available Anycast service using AWS Global Accelerator

In the AWS console, you’ll now see something like this. In my example, you see two listeners, one for the webserver on TCP 80 and one for our DNS test on UDP 53.

Building a high available Anycast service using AWS Global Accelerator
Global Accelerator AWS Console

Time to start testing.

Now that we have our anycasted global load balancer up and running, it's time to start testing. First, let's do a quick HTTP check from four servers around the world: Tokyo, Amsterdam, New York, and Seattle. To check whether load balancing works as expected, I'm using the following test:

for i in {1..100}; do
  curl -s -q http://13.248.138.197
done | sort -n | uniq -c | sort -rn
Building a high available Anycast service using AWS Global Accelerator
http load balancing test

The above shows that the hundred requests from each of the servers indeed go to the region we'd expect them to go to and are nicely balanced over the two origin servers in each region.

Building a high available Anycast service using AWS Global Accelerator

Next, we'll use RIPE Atlas, a globally distributed measurement network, to check the traffic distribution on a larger, global scale. This will tell us how well the AWS anycast routing setup works.

I’m going to use the DNS listener for this. Using RIPE Atlas, I’m asking 500 probes to run the following dig command. The output will tell us what node and region are being used.

dig @13.248.138.197 id.server txt ch +short
"i-0e32d5b1a3f76cc05.us-west-2c"

The image below is a visualization of the measurement results. It shows four different colors; each color represents one of the four regions. At first glance, there's certainly a good distribution among the four geographic regions. The odd ones out are the clients in Australia and New Zealand, which AWS is sending to the us-west-2 region instead of the closer, lower-latency Singapore (ap-southeast-1) region. I could, of course, solve this myself by creating a listener and origin server in Australia as well. Other than that, there are very few outliers, so that looks good.

Building a high available Anycast service using AWS Global Accelerator
Global Accelerator Anycast Traffic distribution using RIPE atlas

Based on this measurement from 500 random RIPE Atlas probes, we can see that the region getting most of the traffic is Europe. In a real-life scenario, this region could now be running hot. To resolve that, I'm going to lower the traffic dial for the Europe endpoint group to 50%. This means 50% of the traffic that was previously going to Europe should now go to the next best region. The visualization below shows that the majority of the 50% offloaded traffic ended up in the us-east region, and a small portion spilled over into ap-southeast-1 (remember, I'm only deployed in four regions).

Building a high available Anycast service using AWS Global Accelerator
50% of the EU traffic offloaded to alternative regions

Routing and session termination

An interesting difference between Global Accelerator and regular public EC2 IP addresses is how traffic is routed to AWS. For Global Accelerator, AWS will try to get traffic onto its own network as soon as possible. Compare that to regular AWS public IP addresses: Amazon only announces those prefixes out of the region where the IP addresses are used. That means that for Global Accelerator IP addresses, your traffic is handed off to AWS at its closest Global Accelerator POP and then uses the AWS backbone to get to the origin. For regular AWS public IP addresses, AWS relies on public transit and peering to get the traffic to the region it needs to reach.

Let’s look at a traceroute to illustrate this. One of my origin servers in the ap-southeast-1 regions is 13.251.41.53, a traceroute to that ec2 instance in Singapore from a non AWS server in Sydney looks like this:

Building a high available Anycast service using AWS Global Accelerator
Traceroute from Sydney to an origin Server in Singapore, ap-southeast-1

In this case, we hand off traffic to NTT in Sydney, which sends it to Japan and then to Singapore, where it's handed off to AWS; it eventually took 180ms to get there. It's important to observe that traffic was handed off to AWS in Singapore, near the end of the traceroute.

Now, as a comparison, we’ll look at a traceroute from the same server in Sydney to the anycasted Global Accelerator IP

Building a high available Anycast service using AWS Global Accelerator
Traceroute from Sydney to anycasted Global Accelerator IP

The Global Accelerator address on hop 4 is just under a millisecond away. In this case, AWS is announcing the prefix locally via the Internet Exchange and it is handed off to AWS on the second hop. Quite a latency difference.

Keep in mind that although you certainly get better "ping" times to your service's IP address, the actual response time of the application will still depend on where the application is deployed. It's interesting to note, though, that based on the above, AWS does appear to terminate the network connection at all of its Global Accelerator sites, even if your application is not deployed there. This is also visible in the logs of our webserver: the source IP address of the client isn't the actual client's IP; instead, it's an IP address of AWS' Global Accelerator service. Not being able to see the original client IP address is, I imagine, a challenge for many use-cases.

Conclusion

In this article, we looked at how Global Accelerator works and behaves. I think it’s a powerful service. Having the ability to flexibly steer traffic to particular regions and even endpoints gives you a lot of control, which will be useful for high traffic applications. It also makes it easier to make applications highly available even on a per IP address basis (as compared to using DNS based load-balancing). A potentially useful feature for Global Accelerator would be to combine it with the Bring Your Own IP feature.

Global Accelerator is still a relatively young service, and new features will likely be added over time. Currently, I find one of the significant limitations to be the lack of client IP visibility. Interestingly, AWS just a few days ago announced client IP preservation for ALB (only) endpoints. Given that improvement, I'd imagine that client IP preservation for other types of endpoints, such as elastic IPs and NLBs, may come at some point too.

Finally, other than the flow logs, I didn’t find any logging or statistics in the AWS console for Global Accelerator. This would, in my opinion, be a valuable add-on.

All in all, a cool and useful service that I'm sure will be valuable to many AWS users. For me, it was fun to test drive Global Accelerator, check out AWS' anycast routing setup, and build it all using Terraform. Hope you enjoyed it too :)

]]>
<![CDATA[Building a high available Anycast application in 5 minutes on Packet]]>http://toonk.io/building-a-high-available-anycast-application-in-5-minutes-on-packet/5f53e3ba3680d46e73ec617bSun, 24 Feb 2019 00:00:00 GMT

In my previous blog, we took a peek at Stackpath's Workload features with Anycast. In this blog, we will take a look at one of my other favorite compute providers, Packet, and we will use Terraform to describe and deploy our infrastructure as code.

Packet

I have been using Packet for about 1.5 years now for various projects. There are a few reasons why I like Packet. It all started with their BGP and Anycast support, yup, BGP straight to your server! That, combined with their powerful bare-metal compute options, great customer success team, and excellent Terraform support, makes Packet one of my go-to suppliers for my pet projects.

What are we building?

Alright, let’s dive in! A brief description of what we’re going to build today: We have a small Golang web application that does nothing more than print the machine’s hostname. We want this application to be highly available and deployed to various locations around the world so that our users can access it with low latency. In our case, we are going to deploy this to four Packet datacenters around the globe, with two instances in each datacenter. Each Packet server will have BGP enabled, and we’ll use Anycast for load balancing and high availability.

Using the Packet Terraform provider, we can do all of this in an automated way, allowing me to do all this in less than five minutes! Sounds like fun? Alright, let's dive in.

Terraform

We could provision our servers using the web portal, but the ability to describe our infrastructure as code using Terraform is a lot easier and allows us to grow and shrink our deployment easily. If you'd like to play with this yourself and follow along, you can find the code for this demo on my GitHub page here.
Next up, we'll take a brief look at the important parts of the Terraform code.

First off, we create a project in Packet and set some BGP parameters for this project (AS number and, optionally, a password). Next up comes the cool part: we reserve what Packet calls a global IPv4 address. This IP address is the anycast address that we'll be using; it will be announced to Packet from all our servers using the Bird BGP daemon.

provider "packet" {
 auth_token = "${var.packet_api_key}"
}
# Create project
resource "packet_project" "anycast_test" {
 name = "anycast project"
 bgp_config {
 deployment_type = "local"
 #md5 = "${var.bgp_password}"
 asn = 65000
 }
}
# Create a Global IPv4 IP to be used for Anycast
# the Actual Ip is available as: packet_reserved_ip_block.anycast_ip.address
# We'll pass that along to each compute node, so they can assign it to all nodes and announce it in BGP
resource "packet_reserved_ip_block" "anycast_ip" {
 project_id = "${packet_project.anycast_test.id}"
 type = "global_ipv4"
 quantity = 1
}

Next up we define where we want to deploy our compute nodes to, and how many per datacenter.

module "compute_sjc" {
 source = "./modules/compute"
 project_id = "${packet_project.anycast_test.id}"
 anycast_ip = "${packet_reserved_ip_block.anycast_ip.address}"
 operating_system = "ubuntu_18_04"
 instance_type = "baremetal_0"
 facility = "sjc1"
 compute_count = "2"
}
module "compute_nrt" {
 source = "./modules/compute"
 project_id = "${packet_project.anycast_test.id}"
 anycast_ip = "${packet_reserved_ip_block.anycast_ip.address}"
 operating_system = "ubuntu_18_04"
 instance_type = "baremetal_0"
 facility = "nrt1"
 compute_count = "2"
}
module "compute_ams" {
 source = "./modules/compute"
 project_id = "${packet_project.anycast_test.id}"
 anycast_ip = "${packet_reserved_ip_block.anycast_ip.address}"
 operating_system = "ubuntu_18_04"
 instance_type = "baremetal_0"
 facility = "ams1"
 compute_count = "2"
}
module "compute_ewr" {
 source = "./modules/compute"
 project_id = "${packet_project.anycast_test.id}"
 anycast_ip = "${packet_reserved_ip_block.anycast_ip.address}"
 operating_system = "ubuntu_18_04"
 instance_type = "baremetal_0"
 facility = "ewr1"
 compute_count = "2"
}

I created a module to define the compute node, with that we can easily create many of them in the various datacenters. In the example above I created two instances in the following locations: San Jose (sjc1), Tokyo (nrt1), Amsterdam (ams1) and New York (ewr1).

Let’s take a quick look at the compute module:

variable facility { }
variable project_id { }
variable compute_count { }
variable operating_system { }
variable instance_type { }
variable anycast_ip { }
#variable bgp_password { }

resource "packet_device" "compute-server" {
 hostname = "${format("compute-%03d", count.index)}.${var.facility}"
 count = "${var.compute_count}"
 plan = "${var.instance_type}"
 facilities = ["${var.facility}"]
 operating_system = "${var.operating_system}"
 billing_cycle = "hourly"
 project_id = "${var.project_id}"

provisioner "local-exec" {
 command = "scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null scripts/create_bird_conf.sh root@${self.access_public_ipv4}:/root/create_bird_conf.sh"
 }

provisioner "local-exec" {
 command = "scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null scripts/web.go root@${self.access_public_ipv4}:/root/web.go"
 }

 provisioner "local-exec" {
 command = "scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null scripts/start.sh root@${self.access_public_ipv4}:/root/start.sh"
 }
 
 provisioner "local-exec" {
 command = "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null root@${self.access_public_ipv4} 'bash /root/start.sh ${var.anycast_ip} > /dev/null 2>&1 '"
 }
}

#Enable BGP for the newly creates compute node
resource "packet_bgp_session" "test" {
 count = "${var.compute_count}"
 device_id = "${packet_device.compute-server.*.id[count.index]}"
 address_family = "ipv4"
}

Alright, that's quite a bit of text; luckily, most of it is pretty descriptive. The code above creates the compute node we requested at the specified datacenter. You will also see a few local-exec statements; these instruct Terraform to execute a script locally as soon as the instance is created. In this case, I used it to secure copy (scp) some files over and run start.sh on the compute node. Once you have more tasks, it will be easier and cleaner to replace this with an Ansible playbook, but this will work just fine for now.
The start.sh script installs Bird (the BGP daemon), generates the correct Bird BGP config, restarts Bird and starts our Golang web service. Finally, we instruct Terraform to enable BGP on the Packet side for each of our compute nodes.

And we’re done! With this, we should be able to deploy our infrastructure using the commands below, and within 5 minutes we should have our application deployed globally with an Anycast IP.
So enough talking, let’s ask Terraform to build our infrastructure:

terraform plan
terraform apply

While we wait, let’s keep an eye on the Packet portal to see what global IP we’ve been allocated. In my case, it’s 147.75.40.35/32.

Building a high available Anycast application in 5 minutes on Packet

Once Terraform is done you’ll see something like this.
Apply complete! Resources: 18 added, 0 changed, 0 destroyed.

In this case, the 18 resources include: 1 Project, 1 Global IP, 8 compute nodes and 8 BGP sessions. In my case that took just 5 minutes! That’s amazing, eight new bare-metal machines, fully deployed, BGP running with an Anycast IP and my App deployed to various datacenters around the world, load balanced and all. A few years ago that was almost unthinkable.

The Packet portal also shows the eight compute instances I requested with the hostnames and unicast IP addresses.

Building a high available Anycast application in 5 minutes on Packet

Time to test

Alright, now that this is all deployed, we should be all done! Let's test our Go web application from my local machine; it should be reachable via http://147.75.40.35 … boom! There it is.

Building a high available Anycast application in 5 minutes on Packet

Testing Anycast

Cool, it works! I'm hitting the compute node in San Jose, which is as expected, since that is closest to Vancouver, Canada. Next up, let's test from a few locations around the world to make sure anycast routing works as expected. The screenshot below shows tests from a handful of servers outside the Packet network. Here I'm testing from Palo Alto, New York, Amsterdam, and New York.

Building a high available Anycast application in 5 minutes on Packet
Curl tests from around the world to test Anycast Load Balancing

Now let’s say our web application all of a sudden becomes popular in Japan and in order to keep up with demand we need to scale up our resources there. Doubling our capacity in Tokyo is easy. All we do is edit the cluster.tf Terraform file and change the compute count number from 2 to 4.

Building a high available Anycast application in 5 minutes on Packet
Scaling out our capacity in Tokyo to 4 nodes

Run terraform plan and terraform apply, wait 5 minutes, and voila. A few minutes later, I see that clients hitting the Tokyo datacenter are indeed being load balanced over four instances (compute-000.nrt1, compute-001.nrt1, compute-002.nrt1, compute-003.nrt1) instead of the previous two.

Summary

In this post we looked at deploying our little Golang App to 4 datacenters around the world with two instances per datacenter. We demonstrated how we can scale up deployments and how clients are being routed to the closest datacenter to guarantee low latency and saw requests being load balanced between the instances.

Even though we used actual bare-metal machines, the experience is mostly the same as deploying VMs. I used the smallest instance type available to keep costs down. The Tiny But Mighty t1-small that I used has a 4-core (Atom C2550) CPU, 8GB of RAM, SSD storage, and 2.5Gbps network connectivity for just $0.07 per hour, great for pet projects like these.
Packet provides many different configurations; should you need more powerful hardware, make sure to check out all the options here.

The power of Terraform really shines for use cases like this. Kudos to Packet for its excellent Terraform support; it’s pretty cool to be able to do all of the above, including requesting a global Anycast IP, with Terraform. The complete deployment takes about five minutes, no matter how many datacenters or instances are involved.

Defining and deploying applications like we did here, using infrastructure as code and Anycast, used to be reserved for large CDNs, DNS operators, and major infrastructure providers. Packet now makes it easy for everyone to build highly available, low-latency infrastructure. Working with Packet has been a great experience, and I recommend giving it a spin.

]]>
<![CDATA[Experimenting with StackPath Edge Computing and Anycast]]>

http://toonk.io/experimenting-with-stackpath-edge-computing-and-anycast/5f53e5733680d46e73ec61a9Mon, 11 Feb 2019 00:00:00 GMT

I love tinkering with new tech, and since it was a cold and snowy weekend here in Vancouver, I figured I’d explore Edge Containers from Stackpath. Earlier this week, I read Stackpath’s blog in which they describe their new service: Introducing containers and virtual machines at the edge.

Update June 21, 2020: a follow-up article is available here: Building a global anycast service in under a minute

The service offering resonated strongly with me, mostly because of the ability to deploy your containers or VMs to many of their locations in one operation, whether that’s an API call or a click in the portal. That, combined with the fact that Stackpath called out BGP Anycast support in the blog post (for those who know me: I’m a BGP geek), convinced me to give it a spin.

Brand new Feature

So instead of building a snowman, I created an account with stackpath.com. I soon found out that their new “Workload” feature was still gated, but within an hour or so Stackpath unlocked it for me, and it showed up in the Stackpath portal.

Update Feb 12 2019: Stackpath advised me that the “Workload” feature is no longer gated and is now available to all users.

Creating workloads

I’ll briefly describe how I created the workload, but also make sure to check out their documentation here.
I gave my workload the name “stackpath-anycast-fun” and used the Docker image “atoonk/stackpath-fun:latest”; more about that later. Next up, I exposed port 80 and defined the compute spec for each container. Also note that I selected “Add Anycast IP Address.”

Experimenting with StackPath Edge Computing and Anycast

Next up, I defined which locations to deploy the workload to. This is grouped by continent; in my case, I used Europe and North America. In each continent, I chose two PoPs: Amsterdam and London in Europe, and New York and Los Angeles in North America. With that, my deployment should be live in four datacenters globally, great for my Anycast testing! Finally, I made sure to bring up two instances in each datacenter so that I could test the load balancing (ECMP routing) within a datacenter.

Experimenting with StackPath Edge Computing and Anycast

Pricing

Experimenting with StackPath Edge Computing and Anycast

At the end, it nicely showed me the pricing. Since I have a particular interest in Anycast, I looked specifically at the pricing for that, which at 14 cents per hour comes to roughly $100 per month. All in all, for these eight instances plus Anycast, the price comes to about $627 per month.

Creating a Docker image

Earlier on, I defined the Docker image that I wanted to deploy. Stackpath pulls these images from Docker Hub; in my case, I defined atoonk/stackpath-fun:latest. So what does this container do? Basically, it’s a container with Apache + PHP and a little script that prints the hostname using gethostname();. This allows me to test, later on, which instance I’m hitting by requesting a webpage.
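The script itself is tiny. Recreating it would look roughly like the sketch below; the exact output formatting inside the real atoonk/stackpath-fun image is an assumption on my part, but server.php is the path we’ll curl later on.

#Recreate the hostname-printing script served by Apache (a sketch)
cat > server.php <<'EOF'
<?php
// Print the container's hostname so tests reveal which instance answered
echo gethostname() . "\n";
EOF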

Start the workload

Now that we have a Docker image on Docker Hub and the workload is created, it’s time to test it. After you create the workload in the Stackpath portal, it takes about two minutes for all instances to come online. Eventually, it will look like this:

Experimenting with StackPath Edge Computing and Anycast

Cool, all my container instances went from Starting to Running. In addition to an external IP for each container, I also have an Anycast IP: 185.85.196.10

Experimenting with StackPath Edge Computing and Anycast

Time to test Anycast

I can hit all the containers via their external IP addresses using my browser. Next up, I checked to see whether the Anycast address worked, and yes, it did! The example below shows I’m hitting one of the LAX instances, which is expected since that’s the closest location to my home in Vancouver.
I refreshed my browser a few times and indeed saw it flip between the two Los Angeles instances I created: stackpath-anycast-fun-na1-lax-0 and stackpath-anycast-fun-na1-lax-1 (note the last digit).

Experimenting with StackPath Edge Computing and Anycast

Global testing

Next up, it’s time to test from a few more locations to make sure the global Anycast setup works as expected. To do so, I ran a curl loop from some shell servers in Amsterdam, London, New York, and Seattle, like this:

time for i in `seq 1 15`; do curl -s 185.85.196.10/server.php | grep stackpath-anycast-fun; done
Experimenting with StackPath Edge Computing and Anycast
Curl tests from around the world to test Anycast Load Balancing

The screenshot above shows the full high-availability and load-balancing story in one picture.
My Amsterdam server is hitting the Amsterdam containers and being load balanced between the containers there. Seattle is routed to Los Angeles as expected, and similarly, the New York client is being served by the two containers in New York (JFK).
The one that stands out, though, is my London shell server. That client is following the Anycast route to New York. That’s odd, especially since I have clusters deployed in both Amsterdam and London. Also note the last line, which shows that completing these requests took over two seconds, while the client in Amsterdam finished its requests in about 200ms.

The London routing mystery

I did a bit of poking to understand why the London server is going to New York instead of staying local in London. Looking at the routing table in London, I see the following:

> 185.85.196.0/24
   *[BGP/170] 3d 13:19:37, MED 38020, localpref 100
      AS path: 3356 209 33438 I, validation-state: unverified
    [BGP/170] 23:46:15, MED 2011, localpref 100
      AS path: 2914 1299 33438 I, validation-state: unverified
    [BGP/170] 2w0d 09:36:24, MED 30, localpref 100
      AS path: 3257 1299 33438 I, validation-state: unverified

The above indicates that this router learns the path to 185.85.196.10 via three ISPs: Level3 (AS3356), NTT (AS2914), and GTT (AS3257). The router selected the path via Level3, and Level3 in turn uses Centurylink (AS209) to reach Stackpath (AS33438).
This tells us that Stackpath (AS33438) is buying connectivity from Centurylink in the USA. Since Level3 prefers the US path via Centurylink, all Level3 customers in Europe are routed to the US, New York in this case. Announcing Anycast routes out of the US without scoping them (i.e., making sure those announcements stay within the US) will inevitably cause the Anycast routing challenges we see here.

Update Feb 12 2019: The Anycast routing issue has since been resolved by the Stackpath network team. Traffic from my test client in London is now correctly routed to the container cluster in London.

Summary

In summary, it was pretty easy to get going with Stackpath’s container service. It’s cool to be able to deploy your workloads around the world with just a single click. The icing on the cake for me is the global load balancing with Anycast.
Anycast used to be a feature used mostly by DNS operators and CDN folks. However, over the last two years or so, we’ve seen infrastructure providers such as Packet and AWS make it available to their customers directly. With these features, folks can now deploy their highly available services globally without necessarily relying on CDNs or DNS tricks (or better, build their own DNS tricks).

I think the pricing is a bit steep, but I’m guessing that since it’s a new feature with possibly limited supply, this is one way to get to market while ramping up capacity. The other thing to keep in mind is that folks like Packet and Stackpath are focusing on the edge: they are deployed in some of the most expensive and most popular datacenters around the world. This gives their customers great low-latency connectivity, but space and power are limited in these facilities, possibly pushing up the pricing.

Going forward, I’d love to see some more features around monitoring, metrics, and logging for my containers.
Realizing this is still early days, my first experience with Stackpath was a joy: it was easy to use, the support was great (thanks Ben and Sean), and I look forward to what’s next.

]]>