<![CDATA[Andree's Musings]]>
http://toonk.io/
Ghost 3.34
Tue, 16 Feb 2021 19:06:06 GMT

<![CDATA[Introducing SSH zero trust, Identity aware TCP sockets]]>
http://toonk.io/introducing-ssh-zero-trust-identity-aware-tcp-sockets/
Tue, 16 Feb 2021 18:57:54 GMT

In this article, we’ll look at Mysocket’s zero-trust, cloud-delivered, authenticating firewall, which allows you to replace your trusted IP ranges with trusted identities.

Last month we introduced our first zero trust features by introducing the concept of Identity Aware Sockets. It’s been great to see folks giving this a spin and start using it as a remote access alternative for the traditional VPN.

Most services out there today are HTTP based, typically served over HTTPS. However, there are a few other commonly used services that are not HTTP based and, as a result, didn’t benefit from our identity-aware sockets until today. In this article, we’ll introduce zero trust support for non-HTTP based services with the introduction of identity-aware TCP sockets. Specifically, we’ll look at providing zero trust services for SSH as an example.

Determining the user’s identity, authentication, and authorization

Turning your mysocket service into an identity-aware socket is as simple as adding the --cloud_authentication flag to mysocketctl when creating the service. While doing so, you can add a list of email domains and/or a list of email addresses. Now each time a user tries to access your service, a browser will pop up asking the user to authenticate. Once authentication is finished, we know the user’s identity, and only if that identity matches the list of authorized users will the user be let through.

Creating an Identity-aware TCP socket

If you think about what is happening here, you’ll realize that what we have here is a per session, authenticating firewall. Only after the user is authenticated and authorized do we allow the network traffic through. Notice that this is much more advanced than your traditional firewall; now, every network flow has an identity. That’s powerful!

This flow of redirecting users to authenticate and then back to the service was doable because it happens in the browser and is largely built on HTTP session management. Now we’d like to extend this to non-HTTP services, so we’ll need to find an alternative for the HTTP session part.
The solution for this comes with the help of Mutual TLS (MTLS). MTLS forces the client to authenticate itself when talking to the server. This is achieved by presenting a signed client certificate to the server.
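
To make the mutual TLS requirement a bit more concrete, here is a minimal sketch of a server-side TLS context that refuses clients without a certificate. It uses only Python's standard `ssl` module. The helper name `make_mtls_server_context` is an assumption made for illustration and is not part of mysocket, and the certificate file paths in the comments are placeholders.

```python
import ssl


def make_mtls_server_context() -> ssl.SSLContext:
    """Build a server-side TLS context that requires a client certificate."""
    # Purpose.CLIENT_AUTH creates a context for the server side of a connection.
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # CERT_REQUIRED makes the handshake fail unless the client presents a
    # certificate that chains to a CA the server trusts.
    ctx.verify_mode = ssl.CERT_REQUIRED
    # A real deployment would also load its own key material, for example
    # (placeholder paths, not taken from the article):
    #   ctx.load_cert_chain("server.crt", "server.key")
    #   ctx.load_verify_locations("client-ca.pem")
    return ctx


ctx = make_mtls_server_context()
```

The point of the sketch is only the `verify_mode` setting: with it, the identity check happens during the TLS handshake, before any application data flows.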

Identity aware TCP sockets

With the introduction of identity-aware TCP sockets, the mysocket edge proxies act as an authenticating firewall. Since we are relying on client TLS certificates, all traffic is securely tunneled over a TLS connection.

As you can see in the flow below, there are a few actions to take before the user can get through. To make this a seamless experience for the users of your service, we’ve extended the mysocketctl command-line tool with the required functionality that kicks off the authentication flow. It starts the authentication process; after that, it requests a client certificate (your ticket in), and then it sets up the TLS tunnel for you. After that, users can send traffic over this authenticated and encrypted tunnel. In its simplest form, it will look something like this:

echo "hello" | mysocketctl client tls \
    --host muddy-pond-7106.edge.mysocket.io

In the example above, we’re sending the string hello, to the service served by muddy-pond-7106.edge.mysocket.io.

Traffic flow

Before the string “hello” arrives at the service protected by mysocket, the mysocketctl client takes care of the authentication flow, requests the TLS client certificate, and sends whatever comes in over stdin to the mysocket edge services.

SSH zero trust

Now that we understand the high-level flow, let’s look at a more practical example. In this example, we have a server for which I’d like to make the SSH service available to only a subset of users. The SSH server is on a private network, such as your corporate network, your home network, or a private VPC, or it is otherwise firewalled off from the Internet.

First, we’ll provision the service using mysocketctl on the server-side and set up the tunnel.

mysocketctl connect \
    --name 'remote access for my ssh server' \
    --cloudauth \
    --allowed_email_addresses 'contractor@gmail.com,john@example.com' \
    --allowed_email_domains 'mycorp.com' \
    --port 22 --host localhost \
    --type tls

In this example, I'm creating a mysocket service of type TLS, and we enable cloud authentication. This will force the user to present a valid client TLS certificate. The certificate can only ever be handed out to users that authenticate with a mycorp.com email address or using the specific email addresses contractor@gmail.com or john@example.com.
The same command will also set up a secure tunnel to the closest mysocket tunnel servers and expose the ssh service running on port 22.


The result is that this SSH service is now accessible to allowed users only, as crimson-thunder-8434.edge.mysocket.io:38676.
Only inbound traffic with a valid TLS client certificate will be let through. Valid TLS client certificates will only ever be issued to users with a mycorp.com email address or one of the two email addresses we specified.

Setting up an SSH session

OK, time to test this and connect to this SSH service. Remember that we need a valid TLS client certificate. These are issued only with a valid token, and the token is only handed out to authorized users. To make all of this easier, we’ve extended the mysocketctl tool to take care of this workflow. The example below shows how we use the ssh ProxyCommand option to do this.

ssh ubuntu@crimson-thunder-8434.edge.mysocket.io \
    -o 'ProxyCommand=mysocketctl client tls --host %h'

This will tell ssh to send all ssh traffic through this mysocketctl client command. This will start the authentication process, fetch the TLS client certificate for us, set up the TLS tunnel to the mysocket edge server, and transport the ssh traffic through this authenticated tunnel. The user can now log in to the ssh server using whatever method you’re used to.

With this, we’ve made our private SSH server accessible from the Internet, while the authenticating mysocket firewall only allows in sessions from client identities we approved beforehand. No VPN needed. Pretty cool, right?

Mysocket SSH Certificate authorities.


SSH is quite similar to TLS in terms of workflow. It too supports authenticating users using signed certificates.

So we decided to expand on this functionality. In addition to an API endpoint that is responsible for signing TLS certificates, we also created one for signing SSH keys.

If we build on the example above, the user can now, in addition to requesting a TLS client certificate, also request a signed SSH certificate. Our SSH certificate signing service will only sign the signing request if the user is authenticated and authorized, using the same logic as before.

Setting up the server

In order to use this, we’ll need to make a few minor changes to the SSH server. The configuration changes below are needed to enable authentication using CA keys.

echo "TrustedUserCAKeys /etc/ssh/ca.pub" >>/etc/ssh/sshd_config

echo "AuthorizedPrincipalsFile %h/.ssh/authorized_principals" >>/etc/ssh/sshd_config

echo "mysocket_ssh_signed" > /home/ubuntu/.ssh/authorized_principals

Finally, also make sure to get the Public key for the CA (mysocketctl socket show) and copy it into the ca.pub file (/etc/ssh/ca.pub)

Now the server is configured to work with and allow authentication based on signed SSH keys from the mysocket certificate authority. Note that all signed certificates will have two principals: the email address of the authenticated user, as well as ‘mysocket_ssh_signed’. In the example configuration above, we told the server to map users with the principal ‘mysocket_ssh_signed’ to the local user ubuntu.
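
The matching logic can be sketched roughly as follows: a certificate is accepted for a local account when at least one of its principals appears in that account's authorized_principals file. This is a simplified illustration in Python, not sshd's actual implementation. The function name `principal_allowed` is hypothetical, and the sketch ignores the per-line options that the real file format supports.

```python
def principal_allowed(cert_principals, authorized_principals_text):
    """Return True if any certificate principal is listed in the file text."""
    allowed = set()
    for line in authorized_principals_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        allowed.add(line)
    return any(principal in allowed for principal in cert_principals)


# Contents of /home/ubuntu/.ssh/authorized_principals from the setup above.
file_text = "mysocket_ssh_signed\n"

# A certificate carrying both principals, as described above.
signed_cert = ["atoonk@gmail.com", "mysocket_ssh_signed"]
accepted = principal_allowed(signed_cert, file_text)
```

A certificate that carries only an email address, without ‘mysocket_ssh_signed’, would not match this file and would be rejected for the ubuntu account.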

Now we’re ready to connect, but instead of making the ssh command even longer, I’m going to add the following to my ssh config file ~/.ssh/config

Host *.edge.mysocket.io
    ProxyCommand bash -c 'mysocketctl client ssh-keysign --host %h; ssh -tt -o IdentitiesOnly=yes -i ~/.ssh/%h %r@%h.mysocket-dummy >&2 <&1'

Host *.mysocket-dummy
    ProxyCommand mysocketctl client tls --host %h

The above will make sure that for all ssh sessions to *.edge.mysocket.io we start the authentication flow, fetch a TLS client certificate, and set up the TLS tunnel. We’ll also submit an SSH key signing request, which will result in a short-lived signed SSH certificate that will be used for authenticating the SSH user.

Now the user can just SSH like this, and the whole workflow will kick off.

ssh ubuntu@crimson-thunder-8434.edge.mysocket.io

For those interested, the ssh certificate will end up in your ~/.ssh/ directory and will look like this.

$ ssh-keygen -Lf ~/.ssh/nameless-thunder-8896.edge.mysocket.io-cert.pub
        Type: ecdsa-sha2-nistp256-cert-v01@openssh.com user certificate
        Public key: ECDSA-CERT SHA256:0u6TICEhrISMCk7fbwBi629In9VWHaDG1IfnXoxjwlg
        Signing CA: ECDSA SHA256:MEdE6L0TUS0ZZPp1EAlI6RZGzO81A429lG7+gxWOonQ (using ecdsa-sha2-nistp256)
        Key ID: "atoonk@gmail.com"
        Serial: 5248869306421956178
        Valid: from 2021-02-13T12:15:20 to 2021-02-13T12:25:20
        Critical Options: (none)

With this, users can SSH to the same server as before, but the cool thing is that the server won’t need to know any traditional known credentials for its users. Things like passwords or a public key entry in the authorized_keys file belong to the past. Instead, with the help of mysocketctl, the user will present a short-lived signed ssh certificate, which the server will trust.

With this, we achieved true Single Sign-On (SSO) for your SSH servers. Since the certificates are short-lived, valid from five minutes in the past (to allow for time drift) to five minutes in the future, we can be sure that for each log-in the authentication and authorization flow was successful.
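
The validity window described above is simple arithmetic, sketched here in Python. The function names are hypothetical, and the issue time of 12:20:20 is an assumption inferred from the “Valid:” line in the example certificate shown earlier.

```python
from datetime import datetime, timedelta

SKEW = timedelta(minutes=5)      # backdating to tolerate clock drift
LIFETIME = timedelta(minutes=5)  # how long the certificate stays valid


def validity_window(issued_at):
    """Return the (valid_from, valid_to) pair for a certificate."""
    return issued_at - SKEW, issued_at + LIFETIME


def is_valid(now, issued_at):
    """Check whether `now` falls inside the certificate's validity window."""
    start, end = validity_window(issued_at)
    return start <= now <= end


# Assumed issue time, chosen to match the example certificate.
issued = datetime(2021, 2, 13, 12, 20, 20)
start, end = validity_window(issued)
```

With these values the window runs from 12:15:20 to 12:25:20, so a log-in attempt six minutes after issuance would need a fresh certificate, and therefore a fresh pass through the authentication flow.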

Give it a try yourself using my SSH server

Curious about what this looks like for the user and want to give it a try? I have a test VM running; it has firewall rules to prevent SSH traffic from the Internet, but using mysocket, anyone with a gmail.com account can access it. I encourage you to give it a spin. It will take you less than a minute to set up; just copy-paste the one-time setup config.

One-time setup

If you’re using a Mac laptop as your client, you’ll need the mysocketctl tool, which will request the short-lived certs and set up the TLS tunnel. To install the client, just copy-paste the below (for Mac only, see download.edge.mysocket.io for other platforms).

curl -o mysocketctl https://download.edge.mysocket.io/darwin_amd64/mysocketctl 

chmod +x ./mysocketctl
sudo mv ./mysocketctl /usr/local/bin/

To make it easy to use we’ll add the following to our ssh client config file. This is a one-time setup and will make sure the ssh traffic to *.edge.mysocket.io is sent through the mysocketctl client tool.

cat <<EOF >> ~/.ssh/config
Host *.edge.mysocket.io
 ProxyCommand bash -c 'mysocketctl client ssh-keysign --host %h; ssh -tt -o IdentitiesOnly=yes -i ~/.ssh/%h %r@%h.mysocket-dummy >&2 <&1'
Host *.mysocket-dummy
 ProxyCommand mysocketctl client tls --host %h
EOF

Now you should be able to ssh to my test server using

ssh testuser@frosty-feather-1130.edge.mysocket.io

When the browser pops up, make sure to use the “log in with Google” option, as this socket has been configured to only allow identities that have a Gmail.com email address.

Wrapping up

In this post, we showed how we continued to build on our previous work with “identity-aware sockets”. We introduced support for identity-aware TCP sockets by leveraging TLS tunnels and Mutual TLS for authentication.

I like to think of this as a cloud-delivered, authenticating firewall. With this, we can make your services available to the Internet on a very granular basis, and make sure that each flow has an identity attached to it. That is, we know exactly, on a per TCP flow basis, what identity (user) is using this flow. That’s a really powerful feature when compared to a traditional firewall, where we had to allow SSH traffic from certain network ranges that were implicitly trusted. What we can now do with these identity-aware sockets is rewrite these firewall rules and replace the trusted IP ranges with trusted identities. This is incredibly powerful for those who need strict compliance and need to answer questions like: who (not an IP, but an identity) connected to what, and when?

We looked at how this can be used to provide zero trust remote access to your SSH servers. And how it can be further extended by using the new SSH key signing service.

That’s it for now, I hope you found this interesting and useful. As always, if you have any questions or feedback, feel free to reach out.

Hungry for more? Check out all our demos on YouTube here.

<![CDATA[Introducing Identity Aware Sockets: Enabling Zero Trust access for your Private services]]>
http://toonk.io/introducing-identity-aware-sockets-enabling-zero-trust-access-for-your-private-services/
Thu, 14 Jan 2021 17:19:18 GMT

In this blog post, we’ll introduce an exciting new feature that, with the help of Mysocket, allows you to deploy your own BeyondCorp setup.

What is Zero Trust

The main concept behind Zero Trust is that users shouldn’t just be trusted because they are on your network. This implicit trust problem is something we typically see with, for example, corporate VPNs. With most corporate VPNs, once a user is authenticated, the user becomes part of the corporate network and, as a result, has access to many of the resources within the corporate infrastructure. In other words, once you’re on the VPN, you’re within the walls of the castle, you’re trusted, and you have lots of lateral access.

The world is changing, and the days of trusting devices simply because they are on your network are over. A few years ago, Google started the journey to its implementation of Zero Trust, called BeyondCorp. One of the core building blocks to get to a Zero Trust model is identity-aware application proxies. These proxies can provide strict access control on a per-application granularity, taking the user’s identity and contexts such as location and device status into account.

Identity aware proxies

As of today, the Mysocket proxies have support for OpenID Connect, and with that, your sockets are identity-aware. This means that mysocket users can now enable authentication for their services and provide authorization rules.

mysocketctl connect \
   --port 3000 \
   --name "My Identity aware socket" \
   --cloudauth \
   --allowed_email_domains "mycorp.com" \
   --allowed_email_addresses "contractor1@gmail.com,john@doe.com"

The example above shows how we can enable authentication by using the “cloudauth” CLI parameter. With cloudauth enabled, users will be asked to authenticate before access to your service is granted. Currently, we support authentication using Google, Facebook, GitHub, or locally created accounts.

The authentication flow uses OpenID Connect to interact with the various identity providers (IDP); thus, we can easily add more identity providers in the future. We’re also looking at SAML as a second authentication flow. Please let us know if you have a need for more IDPs or SAML, and we’ll work with you.

Authorization rules

In addition to what the identity provider (IDP) provides, mysocket also provides two authorization rules. The --allowed_email_domains parameter lets you specify a comma-separated list of email domains. If your users authenticate using their mycorp.com email address, then adding this domain to allowed_email_domains makes sure that only users with that domain will be granted access.

Since multiple identity providers are supported, it’s easy to extend access to contractors or other third-party users. To provide access to external contractors that are not part of your mycorp.com domain, we can use the allowed_email_addresses parameter to add individual identities. This is great because now you can provide these contractors access without creating corporate accounts for them.
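
As a rough illustration of how these two rules combine, the Python sketch below grants access when the authenticated email address either belongs to an allowed domain or is listed individually. The function name `is_authorized` is hypothetical, and the case-insensitive comparison is an assumption; the actual matching on the mysocket side may differ.

```python
def is_authorized(email, allowed_domains=(), allowed_addresses=()):
    """Return True if the email matches an allowed domain or address."""
    email = email.strip().lower()
    # Reject anything that is not a simple user@domain address.
    if email.count("@") != 1:
        return False
    domain = email.split("@", 1)[1]
    domains = {d.strip().lower() for d in allowed_domains}
    addresses = {a.strip().lower() for a in allowed_addresses}
    return email in addresses or domain in domains


# The values used in the mysocketctl example above.
DOMAINS = ["mycorp.com"]
ADDRESSES = ["contractor1@gmail.com", "john@doe.com"]

employee_ok = is_authorized("alice@mycorp.com", DOMAINS, ADDRESSES)
contractor_ok = is_authorized("contractor1@gmail.com", DOMAINS, ADDRESSES)
stranger_ok = is_authorized("bob@gmail.com", DOMAINS, ADDRESSES)
```

Note that the domain comparison is an exact match on the part after the “@”, so an address at notmycorp.com does not slip through a rule for mycorp.com.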

These are just two authorization rules; we’re planning to add more types of rules in the near future. Additional authorization rules that come to mind are geo-based rules (only allow access from certain countries or regions) or time-of-day rules. If you require these types of rules or have suggestions for additional authorization rules, please let us know!

VPN replacement

One of the unique features of mysocket is that the origin server initiates the connection to the Mysocket edge. This means that the origin servers can be on a highly secure network that only allows outbound connections. Meaning the origins can be hosted behind strict firewall rules or even behind NAT, like for example, a Private AWS VPC. With this, your origin server remains private and hidden.

With the addition of authentication and authorization to Mysocket, we can now, on a very granular basis, provide access to your private services. Combining the secure outbound tunnel property and the identity-aware sockets, we can now look at this to provide an alternative to VPNs, while providing much more granular access to private or corporate resources.


Example use case

Imagine a scenario where you work with a contractor that needs access to one specific private application, say an internal wiki, ticket system, or the git server in your corporate network. With the traditional VPN setup, this means we’d need to provide the contractor with a VPN account. Typically this means the contractor is now part of the corporate network, has a corporate user account, and now has access to much more than just the application needed.

Instead, what we really want is to only provide access to the one application and be very granular in who has access. With the addition of identity-aware sockets, this is now possible.

Demo time!

Alright, let’s give this a spin, demo time! In this demo, we’re making a Grafana instance that’s on a private network and behind two layers of NAT available to our employees as well as a contractor.

We’ll start by setting up the socket and tunnel using the “mysocketctl connect” command.

This works great for demos; for more permanent setups, it’s recommended to use “mysocketctl socket create” and “mysocketctl tunnel create” so that you have a permanent DNS name for your service.

mysocketctl connect \
   --port 3000 \
   --name "My Identity aware socket" \
   --cloudauth \
   --allowed_email_domains "mycorp.com" \
   --allowed_email_addresses "andree@toonk.io,john@doe.com"

With this, we created a socket on the mysocket.io infrastructure, enabled authentication, and provided a list of authorization rules. The same command also created the secure tunnel to the closest mysocket.io tunnel server, and we’re forwarding port 3000 on localhost to the newly created socket.

Next, we launch a Grafana container. For fun, I’m passing in my AWS CloudWatch credentials, so I can create some dashboards for my AWS resources. I’ve configured Grafana for proxy authentication, meaning it will trust mysocket.io to do the authentication and authorization. Grafana will use the HTTP headers added by mysocket to determine the user information.

[auth.proxy]
enabled = true
header_name = X-Auth-Email
header_property = username
auto_sign_up = true
headers = Email:X-Auth-Email
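
To illustrate what proxy authentication means for the application behind the socket, here is a small Python sketch of an app reading the identity from the X-Auth-Email header instead of running its own login. This is not Grafana's code; the function name `user_from_headers` is hypothetical. The pattern is only safe when the application is reachable exclusively through the authenticating proxy, since anyone who can reach it directly could set the header themselves.

```python
def user_from_headers(headers, header_name="X-Auth-Email"):
    """Return the authenticated user's email from request headers, or None."""
    # HTTP header names are case-insensitive, so compare in lower case.
    wanted = header_name.lower()
    for name, value in headers.items():
        if name.lower() == wanted and value.strip():
            return value.strip()
    return None


# Headers as they might arrive from the authenticating proxy.
request_headers = {"Host": "localhost:3000", "x-auth-email": "andree@toonk.io"}
user = user_from_headers(request_headers)
```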

The complete example grafana.ini config file I used can be found here. Now we’re ready to launch Grafana. I’m doing this from my laptop, using Docker.

docker run -i -v "$(pwd)/grafana.ini:/etc/grafana/grafana.ini" \
 -e "GF_AWS_PROFILES=default" \
 -e "GF_AWS_default_REGION=us-east-1" \
 -p 3000:3000 grafana/grafana

Grafana is now listening on localhost port 3000. The mysocket connection we created earlier is relaying authenticated and authorized traffic to that local socket. With that, we should now be able to test and see if we have access to Grafana.

Wrapping up

In this article, we introduced identity-aware sockets. We saw how Mysocket users can easily enable authentication for their HTTP(S) based sockets and how OpenID Connect is used for the authentication flow to Google, Facebook, or GitHub (for now). We then looked at how authorization rules can be added by either matching the email domain or even a list of email addresses.

With this, it’s now easy to provide access to internal applications from any device, anytime, anywhere, without the need for a VPN.

<![CDATA[Global load balancing with Kubernetes and Mysocket.io]]>
http://toonk.io/global-load-balancing-with-kubernetes/
Tue, 22 Dec 2020 06:17:15 GMT

If you’re in the world of cloud infrastructure, then you’ve heard of Kubernetes. Some of you are experts already, while some of us are just learning or getting started. In this blog, we’ll introduce a mysocket controller for Kubernetes and demonstrate how easy it is to use mysocket.io as your cloud-delivered load balancer for your Kubernetes Services. If you’re a Kubernetes user already, then it should just take a minute to get this mysocket controller set up.

See this video for a demo of the Mysocket.io integration with Kubernetes

Pods, Deployments, and Services

Before we continue, let’s review some of the main Kubernetes building blocks we will be using to make this work.

A typical workload running in Kubernetes will look something like the diagram below. It typically starts with a deployment, in which you define how many replicas (pods) you’d like.

Since pods are ephemeral and can scale in and out as needed, the Pod IP addresses will be dynamic. As a result, communicating with the pods in your deployment from other Deployments would require constant service discovery, which may be challenging for certain apps. To solve this, Kubernetes has the concept of a Service. A service acts as a logical network abstraction for all the pods in a workload and is a way to expose an application running on a set of Pods as a network service.

In the diagram below, the service is reachable on port 8000. The Service will make sure traffic is distributed over all the healthy endpoints for this service.


Taking your Service Global

Now that we know how a service makes your workload available within the cluster, it’s time to take your workload global! Kubernetes has a few ways to do this, typically using an ingress service. We’re going to use the architecture as outlined in the diagram below. We’ll use Mysocket.io to create a secure tunnel between our ‘myApp’ Service and the Mysocket cloud. From there on, it will be globally available via its anycasted edge nodes.


To make this work, we’ll deploy a controller pod that runs the mysocketd container. The container image can be found on Docker Hub, and the corresponding Dockerfile on our GitHub repo here.

The mysocketd controller does two things:

1) It subscribes to the Kubernetes API and listens for events related to services. Specifically, it will watch for service events that have the annotation mysocket.io/enabled

If a service has the following annotation mysocket.io/enabled: “true” then the mysocketd app will start a thread for that service.

2) In the per service thread, using the Mysocket API, a new mysocket.io “Socket” object is created if needed. This will give the “myApp” service a public DNS name and public IP. Next, it will check with the mysocket.io API to see if a tunnel already exists; if not, then a new one is created.

Finally, the secure tunnel is established, and the “myApp” service is now globally available and benefits from the mysocket.io high-performance infrastructure.
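
The two steps above amount to a small reconcile loop: filter on the annotation, then create only what is missing. The Python sketch below illustrates that idea with plain dictionaries standing in for the Kubernetes API and the Mysocket API. It is not the actual mysocketd code, and the function names and data shapes are assumptions made for illustration.

```python
ANNOTATION = "mysocket.io/enabled"


def wants_mysocket(service):
    """Return True if the Service carries the mysocket.io/enabled annotation."""
    annotations = service.get("metadata", {}).get("annotations") or {}
    return annotations.get(ANNOTATION) == "true"


def reconcile(service, sockets, tunnels):
    """Create a socket and tunnel for the service only if they are missing."""
    if not wants_mysocket(service):
        return []
    name = service["metadata"]["name"]
    actions = []
    # Step 1: make sure a Socket object exists for this service.
    if name not in sockets:
        sockets[name] = {"name": name}
        actions.append("create_socket")
    # Step 2: make sure a tunnel exists for that socket.
    if name not in tunnels:
        tunnels[name] = {"socket": name}
        actions.append("create_tunnel")
    return actions


demo_service = {
    "metadata": {"name": "demo-service", "annotations": {ANNOTATION: "true"}}
}
sockets, tunnels = {}, {}
first_run = reconcile(demo_service, sockets, tunnels)
second_run = reconcile(demo_service, sockets, tunnels)
```

Because each step checks for existing state first, handling the same service event twice creates nothing new, which is what lets the controller restart safely.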

Demo time. How to Deploy mysocketd to your cluster

The easiest way to get started is to download the mysocketd workload yaml file from the Mysocket Kubernetes controller repository and update the following three secrets to the ones for your mysocket account (lines 14, 15, and 16).

email: <mysocket login email in base64> 
password: <mysocket password in base64> 
privatekey: <mysocket private ssh key in base64>

Then simply apply like this and you’re good to go!

kubectl apply -f mysocketd.yaml

Cool, now we have the controller running!

A simple demo workload

Next up, we’d like to make our myApp service available to the Internet by using the mysocket.io service. I’m going to use this deployment as our demo app. The deployment consists of three pods, each running a little Python web app that prints its hostname.

Next up, we’ll build a service (see definition here) that acts as an in-cluster load balancer for the demo workload, and we’ll request the Service to be enabled for Mysocket.io.

$ kubectl apply -f demo-deploy.yaml

$ kubectl apply -f demo-service.yaml

$ kubectl get all -n demo-app
NAME                            READY   STATUS    RESTARTS   AGE
pod/demo-app-6896cd4b88-5jzgn   1/1     Running   0          2m22s
pod/demo-app-6896cd4b88-78ngc   1/1     Running   0          2m22s
pod/demo-app-6896cd4b88-pzbc7   1/1     Running   0          2m22s
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/demo-service   ClusterIP   <none>        8000/TCP   64s
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/demo-app   3/3     3            3           2m22s
NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/demo-app-6896cd4b88   3         3         3       2m22s
Simply set mysocket.io/enabled to “true” and your service will be globally available.

The same can be done for existing services: to do so, just edit your existing Service definition and add the “mysocket.io/enabled” annotation, like this:

kind: Service
metadata:
  annotations:
    mysocket.io/enabled: "true"

After applying this change, the mysocketd controller will detect that the service has requested to be connected to the Mysocket global infrastructure. The controller will create the Socket and Tunnels as needed and start the tunnel.

A quick load balancing test shows that requests to our global mysocket endpoint are now being load-balanced over the three endpoints.

$ for i in {1..60}; do curl -s \
  https://blue-snowflake-1578.edge.mysocket.io/ & done | sort | uniq -c | sort -n
  19 Server: demo-app-6896cd4b88-pzbc7  
  20 Server: demo-app-6896cd4b88-5jzgn  
  21 Server: demo-app-6896cd4b88-78ngc  
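
For readers less familiar with the `sort | uniq -c | sort -n` idiom, the same tally can be sketched in Python. The function name `tally` is hypothetical, and the response lines below are reconstructed from the counts shown above rather than captured from a live run.

```python
from collections import Counter


def tally(lines):
    """Count identical response lines and return them sorted by count."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return sorted(counts.items(), key=lambda item: item[1])


# Reconstructed responses: 19, 20 and 21 replies from the three pods.
responses = (
    ["Server: demo-app-6896cd4b88-pzbc7"] * 19
    + ["Server: demo-app-6896cd4b88-5jzgn"] * 20
    + ["Server: demo-app-6896cd4b88-78ngc"] * 21
)
result = tally(responses)
total = sum(count for _, count in result)
```

A roughly even split across the three pods, adding up to the 60 requests sent, is what we would expect when traffic is spread over all healthy endpoints.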

Just like that, in a few seconds, your workload is globally available. Pretty neat, eh? Make sure to watch the video recording of this demo as well.

Wrapping up

In this article, we looked at how we can easily connect Kubernetes Services to Mysocket.io. We saw that it was easy to use. All that is needed is to: 1) start the Mysocketd controller, and: 2) add the mysocket annotation to your Services. By doing so, we are giving the Kubernetes service a publicly reachable IP address, a TLS cert, and it instantly benefits from the mysocket.io Anycast infrastructure. Pretty powerful and easy to use.

I should add that this Mysocket Controller for Kubernetes is for now just a starting point, i.e., an MVP integration with Kubernetes. There are ways to improve this, both in terms of user-friendliness and high availability (i.e., more than one pod). The code for this Kubernetes integration is open source, and we’d be happy to accept improvements. Mostly it serves as an example of what’s possible and how we can continue to build on the Mysocket services.

Last but certainly not least, I’d like to thank Brian for his code to improve the various mysocketctl client libraries. Also, a big thanks to Bas Toonk (yes, my brother) for his help with this MVP. Most of this was his work, thanks buddy!

Finally, I hope this sparked your imagination, and you’ll all give it a whirl, and let me know your experience.

<![CDATA[Easy Multi-region load balancing with Mysocket.io]]>
http://toonk.io/easy-multi-region-load-balancing-with-mysocket/
Mon, 07 Dec 2020 00:42:46 GMT

Last week AWS had a major outage in its US-EAST-1 region, lasting for most of the day, just before the big Black Friday sales! Incidents like this are a great reminder of the importance of multi-region or even multi-cloud deployments for your services.

Depending on your “cloud maturity” and your products’ complexity, you may already be there or just getting started. Either way, in today’s blog, we will take a look at how we can use mysocket’s load balancing features to make multi-region deployments easier.

A global load balancing service

In earlier blogs, we looked mostly at how the mysocket.io tunnel service can help securely connect your resources that may be behind NAT and firewalls to the Internet. In this article, we’ll look at Mysocket’s global load balancing feature.

Three types of load balancers

Mysocket today supports three different types of cloud-native load balancers.

1) Application load balancers, for your HTTP and HTTPS services.
2) Network load balancer, for your TCP services.
3) TLS Load balancer, for your TCP services, where we take care of the encryption.

Load balancer types for mysocket.io 
When deploying Mysocket with your services, you now have a new front door. It just happens to be a front door that is anycasted, and as such has many doorbells around the globe made available to you. As a result, you’re always close to your users.

Demo: Creating a multi-region service in two minutes!

Alright, time to get building! In this demo, we’ll continue using the Gif service we built in our last blog. Just like your workloads, I think this is a “critical” service, so we need it to be deployed in multiple regions. The service will need 100% uptime, which means that even if a region is down, the Gif service will need to be available to our users.

Infrastructure as code

In this example, we’re going to deploy the service on DigitalOcean VMs in two of its regions, New York and Toronto. We’re big fans of infrastructure as code and require our service to be deployed with a simple deploy script. So we’ll use Terraform to spin up the instances and use a cloud-init script to bootstrap the necessary work to install the required software, start the Gif service, and connect it to the Mysocket global infrastructure.

Step 1: terraform

We’re using terraform to define what type of VMs we’d like, which regions we want them to be deployed in, and how many per region. For this demo we’re using just two VMs in each of two regions. But I’m sure, by just looking at the terraform file, you can see how easy it is to deploy many VMs per region by just changing the ‘count’ number. It’s also easy to add additional regions. So scaling our Gif service for the future will be easy ;)

Step 2: cloud-init

The VM’s we’re launching are just vanilla Ubuntu machines. So we’ll need to modify them slightly to get our software and our ‘secrets’ on the machine. To do that, we’ll use cloud-init.
Cloud-init allows us to define a set of tasks that need to be executed when the VM is created. In our case, we’re asking it to create a user ‘mysocket’, we’re adding the Giphy API key, and mysocket credentials as secrets to the mysocket home directory. Finally, we’re telling it to download and execute a bootstrap script.

Step 3: bootstrapping and starting our services.

Cloud-init’s last step was to start the bootstrap script that it downloaded from here. Basically, all it does is install the two python packages (mysocketctl and giphy_client) and download and start two more programs. The first program is our Gif application, which will run on port 8000. The second program is a small python script that makes sure the VM registers and connects to the mysocket.io infrastructure.

The code to connect to mysocket is available on github. It’s pretty easy, and I’m sure it’s low friction to get started with, even if you have limited experience. Below are the most important parts:

import os

# These variables were made available as secrets by cloud-init.
username = os.getenv("email")
password = os.getenv("password")
socket_id = os.getenv("socket_id")

# Log in to the mysocket API and get a token. We can construct the auth header using this token.
# Note: get_token, new_tunnel and ssh_tunnel, as well as port, ssh_server and ssh_user,
# are not shown in this excerpt.
token = get_token(username, password)
authorization_header = {
    "x-access-token": token["token"],
    "accept": "application/json",
    "Content-Type": "application/json",
}

# Register the service by creating a new tunnel for this VM.
tunnel = new_tunnel(authorization_header, socket_id)

# Set up the tunnel to mysocket, and we're ready to serve traffic!
ssh_tunnel(port, tunnel["local_port"], ssh_server, ssh_user)
Make sure you take a quick peek at the code; it’s a good “getting started” example for both terraform and cloud-init.

And with that, we now have four VMs in two regions. All four VMs are registered with the Mysocket service and are now serving our ‘critical’ Gif app. You can see the demo service yourself here: https://fluffy-bunny-5697.edge.mysocket.io/


The example below shows how the mysocket load balancing service distributes traffic evenly over all four of the origin servers in New York and Toronto.

$ for i in {1..20}; do  curl -s \
 https://fluffy-bunny-5697.edge.mysocket.io/ | grep Server; 
done | sort | uniq -c

   5 Server: compute-000-nyc1
   5 Server: compute-000-tor1
   5 Server: compute-001-nyc1
   5 Server: compute-001-tor1
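
If you’d rather do the counting in Python than in a shell pipeline, here’s a minimal sketch of the same tally. The sample list is made up to mirror the output above, and count_servers is just an illustrative helper, not part of mysocketctl.

```python
from collections import Counter


def count_servers(response_lines):
    """Tally 'Server: <name>' lines, like `grep Server | sort | uniq -c`."""
    names = [
        line.split(":", 1)[1].strip()
        for line in response_lines
        if line.strip().startswith("Server:")
    ]
    return Counter(names)


# Hypothetical sample: what 20 requests could return if traffic is spread evenly.
ORIGINS = (
    "compute-000-nyc1",
    "compute-000-tor1",
    "compute-001-nyc1",
    "compute-001-tor1",
)
sample = [f"Server: {name}" for _ in range(5) for name in ORIGINS]

for name, hits in sorted(count_servers(sample).items()):
    print(f"{hits:4d} Server: {name}")
```

Feeding it the real responses (one line per curl call) instead of the sample gives the same kind of per-origin breakdown.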

Horizontally scaling the service is easy; simply change the count number in the terraform file, or add / remove regions. Since the VM’s will call the mysocket api on boot, these new VM’s will automatically become part of the load balancing pool. Pretty neat and easy, right?!

Failover scenarios

So what happens when a server or region becomes unavailable? There are a few things that make this as painless as possible for users. First, the origin service can de-register itself; this will allow for the most graceful scenario.

For less graceful scenarios, the mysocket load balancers will automatically detect, after 60 seconds, that a tunnel has gone down and take the origin out of the load balancing pool. Even during this 60-second degradation, before the tunnel is declared down, our load balancers use a 10-second connect timeout when connecting to the origin service and automatically fail over to the remaining origins. So, all in all, failures should be hidden as much as possible from your users.
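
To make the failover idea a bit more concrete, here’s a rough sketch of “try an origin with a 10-second connect timeout, and move on to the next one if that fails”. This is only an illustration under stated assumptions, not Mysocket’s actual load balancer code; connect_with_failover and the list of origins are hypothetical.

```python
import socket

CONNECT_TIMEOUT = 10  # seconds, the connect timeout mentioned above


def connect_with_failover(origins, connect=socket.create_connection,
                          timeout=CONNECT_TIMEOUT):
    """Try each (host, port) origin in order and return the first connection.

    `connect` is injectable so the logic can be exercised without a network.
    """
    last_error = None
    for origin in origins:
        try:
            return connect(origin, timeout=timeout)
        except OSError as exc:  # refused, unreachable, or timed out
            last_error = exc
    raise ConnectionError("no origin reachable") from last_error
```

A real load balancer would also remember which origins failed recently, so that it doesn’t pay the timeout on every request; that bookkeeping is left out here to keep the sketch short.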

Wrapping up

In this blog, we looked at how Mysocket can be used as a global load balancer for your multi-region deployments. In our demo, we looked at two DigitalOcean regions, but this could also be over multiple AWS regions, or even multi-cloud, with one cluster in AWS, one in DigitalOcean, and throw in some Google Cloud for good measure.

We saw how Mysocket provides users with a global anycasted ingress point and provides seamless load balancing for your services. Best of all, it only took us 90 seconds to get all of this going! I guess it’s fair to say that Mysocket makes going multi-region and even multi-cloud easier.

<![CDATA[Static DNS names for your mysocket.io services (and a new gif service)]]>In my last blog post, I announced the mysocket.io service and demonstrated how to get started quickly. It’s been great to see people signing up and giving it a spin! Your feedback has been great, motivating, and has helped make the service better already.

Most users that gave

http://toonk.io/static-dns-names-for-your-sockets-and-a-new-gif-service/5fc5be7c51a5ff44fb64d914Tue, 01 Dec 2020 04:14:11 GMT

In my last blog post, I announced the mysocket.io service and demonstrated how to get started quickly. It’s been great to see people signing up and giving it a spin! Your feedback has been great, motivating, and has helped make the service better already.

Most users that gave mysocket a try used the mysocketctl connect (aka “quick connect”) feature. This is the easiest way to get started and instantly creates a global socket, great for quick testing. However, when you’re done and exit the program, the “connect” feature cleans up the socket. It’s easy to create new ones, but each time with a different name. Not surprisingly, a few users asked about sockets with static names.

Static names for your services

Most residential ISPs give their customers a dynamic IP address (one that changes from time to time) through DHCP. If you have a service you’d like to make available but don’t have a static IP, then this post is for you!

Perhaps you have a server at home, either a fancy setup, rack-mounted with a UPS for emergency power, or perhaps just a modest Raspberry Pi. Either way, you’d like to make this available from the Internet, so you can access it when you’re not on your home network, or perhaps you are home, but your nice corporate VPN client blocks access to your local network. Or maybe you’d like to make it available to your friend.

That’s great, but you have two challenges:

1) the server sits behind NAT, so you can’t simply connect to the server from the Internet.

2) your ISP gives you dynamic IPs, so every few days/weeks the IP changes, making it hard to keep track of. Ideally, you could use a static name that remains stable no matter what your dynamic IP is.

So what you’d need is a static DNS name! Good news, that’s the default on the Socket primitive.

Mysocket primitives

Let’s take a more in-depth look at the two main primitives that make up the mysocket service: Sockets and Tunnels

Socket and Tunnel primitives

A Global Socket

The first primitive we need is a socket object. This object is the public endpoint for your service and can be created like this:

mysocketctl socket create \
	--type http \
	--name "my service at home"

This, among other things, returns a static DNS name that is yours to use. Think of it as your global public endpoint for your load balancer. In this case, mysocket runs the load balancer, and it’s made highly available through our anycast setup.

In the example above, we created an HTTP/HTTPS socket. Other options are TCP and TLS sockets. In those cases, the API will return not only a static DNS name for your service but also your own dedicated static TCP port number. A typical use case for a TCP socket is making your ssh service available.


The second primitive is a tunnel. A tunnel object represents your origin service and the secure connection between the origin and the mysocket global infrastructure. Let’s look at the example below.

mysocketctl tunnel create \
	--socket_id 1fab407c-a49d-4c5e-8287-8b138b7549c0

When creating the tunnel object, simply pass along the socket_id from the socket you’d like to be connected to.

Putting it together

In summary, we can create a globally available service by first creating a socket of type HTTP/HTTPS, TCP or TLS. This returns a static name for your service and optionally a unique port number. After that, we create a tunnel that links the origin to the socket.
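
One way to picture how the two primitives relate is as a tiny data model. The sketch below is purely illustrative: the class and field names are assumptions and not the actual mysocket API objects. The IDs and the DNS name are the ones used in this post’s examples.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Tunnel:
    """An origin service and its secure connection to the mysocket edge."""
    tunnel_id: str
    socket_id: str  # the socket this tunnel is attached to


@dataclass
class Socket:
    """The public endpoint: a static DNS name, plus a port for TCP/TLS sockets."""
    socket_id: str
    name: str
    type: str  # "http", "tcp" or "tls"
    dns_name: str
    port: Optional[int] = None  # only set for TCP and TLS sockets
    tunnels: List[Tunnel] = field(default_factory=list)

    def add_tunnel(self, tunnel_id: str) -> Tunnel:
        tunnel = Tunnel(tunnel_id=tunnel_id, socket_id=self.socket_id)
        self.tunnels.append(tunnel)
        return tunnel


gif_socket = Socket(
    socket_id="1fab407c-a49d-4c5e-8287-8b138b7549c0",
    name="my local http Gif service",
    type="http",
    dns_name="wandering-shape-7752.edge.mysocket.io",
)
gif_socket.add_tunnel("3f46a01f-ef5b-4b0c-a1ce-9a294be2be03")
print(f"{gif_socket.dns_name} has {len(gif_socket.tunnels)} tunnel(s)")
```

The point of the split is that the name belongs to the socket, not to any particular origin, so tunnels can come and go without the public name changing.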


Now that we have created a Socket and Tunnel, it’s time to connect to it, spin up the dataplane, and expose the local service port. We do that using the following command:

mysocketctl tunnel connect \
	--port 8000 \
	--socket_id 1fab407c-a49d-4c5e-8287-8b138b7549c0 \
	--tunnel_id 3f46a01f-ef5b-4b0c-a1ce-9a294be2be03

In the example above, we only create one tunnel for the socket, but nothing stops you from creating multiple tunnels (origins) per socket. In that case, mysocket will load balance over all available tunnels.

Demo time:
exposing a local Gif service to the Internet

Alright, time to look at a simple demo and get our hands dirty. For this demo, I developed a small proof of concept python web service that shows a random Gif from Giphy each time a visitor loads the webpage.

I’m running this service locally on a VM on my laptop. The goal is to make this service publicly available, overcoming the two levels of NAT and an ISP that hands out dynamic IP addresses. At the end of this demo, I’ll be able to share a static DNS name with my users, and you’ll be able to try it!

Watch the video below for a live demo.

First, let’s download the demo python code and start the Gif web service

wget https://gist.githubusercontent.com/atoonk/0bfc784feb66ffc03541462fbc945df7/raw/812d839f60ea7bfd98a530a3fc549137abe0b329/gif_service.py

#make sure to update the API key in gif_service.py

python3 ./gif_service.py

Alright, now we have this service running, but it’s only reachable from my local network. Next up, we create a socket and a tunnel.

mysocketctl socket create \
	--name "my local http Gif service" \
	--type http

Ok, that returns a socket_id and our static DNS name! In my case, the DNS name is: wandering-shape-7752.edge.mysocket.io

Next up, we’ll use the socket_id to create the tunnel object:

mysocketctl tunnel create \
	--socket_id 1fab407c-a49d-4c5e-8287-8b138b7549c0

Cool, now all we need to do is start the tunnel connection.

mysocketctl tunnel connect \
	--port 8000 \
	--socket_id 1fab407c-a49d-4c5e-8287-8b138b7549c0 \
	--tunnel_id 3f46a01f-ef5b-4b0c-a1ce-9a294be2be03

This will securely connect the local gif web service listening on port 8000, to the mysocket infrastructure and make it available as https://wandering-shape-7752.edge.mysocket.io

Wrapping up

In this blog post, we looked at the two main mysocket primitives: Sockets and Tunnels. We saw how users can create a Socket object and get a static DNS name and possibly even a dedicated TCP port. In the demo, we used the tunnel connect feature to make port 8000 on my local VM available to the Internet. With that, we made the Internet a little bit better by adding yet another Gif service, one that just happens to run on a VM hosted on my laptop.

Static DNS names for your mysocket.io services (and a new gif service) ]]>
<![CDATA[Introducing Mysocket.io]]>http://toonk.io/introducing-mysocket-io/5fbb2ca251a5ff44fb64d8b7Mon, 23 Nov 2020 03:40:48 GMTIn this blog, I’d like to introduce a new project I’m calling Mysocket.io. Before we dive in, a bit of background.Introducing Mysocket.io

Loyal readers know I enjoy building global infrastructure services that need to be able to carry a significant amount of traffic and a large number of requests. Building services like these often require us to solve several challenges. Things to consider include: high availability, scaling, DDoS proofing, monitoring, logging, testing, deployments, user-facing & backend APIs, policy management (user preferences) and distribution, life-cycling, etc. All this while keeping an eye on cost and keeping complexity to a minimum (which really is human, operational cost).

To experiment with these topics, it’s necessary to have a project to anchor these experiments to. Something I can continuously work on, and while doing so, improve the project as a whole. Now, there are many projects I started over time, but one that I've worked most on recently, and wanted to share with a wider audience, is mysocket.io. A service that provides secure public endpoints for services that are otherwise not publicly reachable.

Introducing Mysocket.io

A typical example case that mysocket.io can help with is a web service running on your laptop, which you’d like to make available to a team member or client. Or ssh access to servers behind NAT or a firewall, like a raspberry pi on your home network or ec2 instances behind NAT.

make your localhost app available from anywhere
Provide SSH access to your home server behind NAT.

More details

Alright, a good way to share more details is to do a quick demo! You can see a brief overview in this video, or even better, try it yourself by simply following the four easy steps below.

If you’re interested or curious, feel free to give it a spin and let me know what worked or didn’t, or even better, how it can be improved. Getting started will take you just one minute. Just follow these simple steps.
#Install client, using python's package manager (pip)
pip3 install mysocketctl

#Create account
mysocketctl account create \
    --name "your_name" \
    --email "your_email_address" \
    --password "a_secure_password" \
    --sshkey "$(cat ~/.ssh/id_rsa.pub)"
#Login
mysocketctl login \
    --email "your_email_address" \
    --password "a_secure_password"
#Launch your first global socket ;)
mysocketctl connect \
    --port 8000 \
    --name "my test service"

Architecture overview

Ok, so how does it work? The process for requesting a “global socket” starts with an API call. You can do this by directly interfacing with the RESTful API, or by using the mysocketctl tool. This returns a global mysocket object, which has a name, port number(s), and some other information.

Users can now use this socket object to create tunnel objects. These tunnels are then used to connect your local service to the global mysocket.io infrastructure. By stitching these two TCP sessions together, we made your local service globally available.

Creating a Socket, a Tunnel and connecting to mysocket.io

The diagram below provides a high-level overview of the service data-plane. On the left, we have the origin service. This could be your laptop, your raspberry pi at home, or even a set of containers in a Kubernetes cluster. The origin service can be behind a very strict firewall or even NAT. All it needs is outbound network access. We can then set up a secure encrypted tunnel to any of the mysocket.io servers around the world.

Mysocket.io dataplane


The Mysocket.io services use AWS’ global accelerator. With this, I’m making both the tunnel servers and proxy services anycasted. This solves some of the load balancing and high availability challenges. The mysocket tunnel and proxy servers are located in North America, Europe, and Asia.

Once the tunnel is established, the connection event is signaled to all other nodes in real-time, ensuring that all edge nodes know where the tunnel for the service is.


One of my goals is to make Mysocket super easy to use. One way to do that is to have good documentation. I invite you to check out our readthedocs.io documentation here https://mysocket.readthedocs.io/

It’s divided into two sections:

  1. General information about mysocket.io and some of the concepts.
  2. Information and user guides for the mysocketctl command-line tool.

The documentation itself and the mysocketctl tool are both open source, so feel free to open pull requests or issues if you have any questions.

You may have noticed there’s a website as well. I wanted to create a quick landing page, so I decided to play with Wix.com. They make it super easy;  I may have gone overboard a bit ;)  All that was clicked together in just one evening, pretty neat.

More to come

There’s a lot more to tell and plenty more geeky details to dive into. More importantly, we can continue to build on this and make it even better (ping me if you have ideas or suggestions)!
So stay tuned. That’s the plan for upcoming blog posts, either on this blog or the mysocket.io blog.


<![CDATA[AWS and their Billions in IPv4 addresses]]>Earlier this week, I was doing some work on AWS and wanted to know what IP addresses were being used. Luckily for me, AWS publishes this all here https://ip-ranges.amazonaws.com/ip-ranges.json. When you go through this list, you’ll quickly see that AWS has a massive asset

http://toonk.io/aws-and-their-billions-in-ipv4-addresses/5f88bf632257c9dcdf92feeeTue, 20 Oct 2020 16:01:11 GMT

Earlier this week, I was doing some work on AWS and wanted to know what IP addresses were being used. Luckily for me, AWS publishes this all here https://ip-ranges.amazonaws.com/ip-ranges.json. When you go through this list, you’ll quickly see that AWS has a massive asset of IPv4 allocations. Just counting quickly I noticed a lot of big prefixes.

However, the IPv4 ranges on that list are just the ranges that are in use and allocated today by AWS. Time to dig a bit deeper.
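
As a quick aside, here’s a minimal sketch of how one might count the IPv4 addresses in that JSON file. It assumes the document has a top-level "prefixes" list whose entries carry an "ip_prefix" field. The same range can show up under more than one service, so the sketch collapses overlaps before counting. The sample data uses documentation prefixes and is made up; count_published_ipv4 is an illustrative name.

```python
import ipaddress


def count_published_ipv4(ip_ranges):
    """Sum the IPv4 addresses in an ip-ranges.json style dict, counting overlaps once."""
    networks = [ipaddress.ip_network(entry["ip_prefix"]) for entry in ip_ranges["prefixes"]]
    collapsed = ipaddress.collapse_addresses(networks)
    return sum(net.num_addresses for net in collapsed)


# Made-up sample in the same shape as the published file.
sample = {
    "prefixes": [
        {"ip_prefix": "198.51.100.0/24", "service": "AMAZON"},
        {"ip_prefix": "198.51.100.0/25", "service": "EC2"},  # overlaps the /24 above
        {"ip_prefix": "192.0.2.0/24", "service": "AMAZON"},
    ]
}
print(f"{count_published_ipv4(sample):,} unique IPv4 addresses")

# To run it against the real file (needs network access):
#   import json, urllib.request
#   url = "https://ip-ranges.amazonaws.com/ip-ranges.json"
#   with urllib.request.urlopen(url) as resp:
#       print(count_published_ipv4(json.load(resp)))
```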

IPv4 address acquisitions by AWS

Over the years, AWS has acquired a lot of IPv4 address space. Most of this happens without gaining too much attention, but there were a few notable acquisitions that I’ll quickly summarize below.

2017: MIT selling 8 million IPv4 addresses to AWS

In 2017 MIT sold half of its allocation to AWS. This range holds about 8 million IPv4 addresses.

2018: GE sells to AWS

In 2018 an entire /8 IPv4 prefix was transferred from GE to AWS. With this, AWS became the proud owner of its first /8! That’s sixteen million new IPv4 addresses to feed us hungry AWS customers. https://news.ycombinator.com/item?id=18407173

2019: AWS buys AMPRnet

In 2019 AWS bought a /10 from AMPR.org, the Amateur Radio Digital Communications (ARDC) organization. The IPv4 range was an allocation made to the Amateur Radio organization in 1981 and known as the AMPRNet. This sale caused a fair bit of discussion; check out the NANOG discussion here.

Just this month, it became public knowledge AWS paid $108 million for this /10. That’s $25.74 per IP address.
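
For those who like to check the math: a /10 leaves 32 - 10 = 22 host bits, so it holds 2^22 addresses. The small sketch below does the division; price_per_address is just an illustrative helper. The exact quotient is about 25.749, so the per-address figure comes out at $25.74 when truncated, or $25.75 when rounded.

```python
def addresses_in_prefix(prefix_len):
    """Number of IPv4 addresses in a prefix of the given length."""
    return 2 ** (32 - prefix_len)


def price_per_address(total_price, prefix_len):
    return total_price / addresses_in_prefix(prefix_len)


total = 108_000_000  # USD, the amount quoted above for the /10
print(f"/10 holds {addresses_in_prefix(10):,} addresses")
print(f"price per address: ${price_per_address(total, 10):.3f}")
```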

These are just a few examples. Obviously, AWS has way more IP addresses than the three examples I listed here. The IPv4 transfer market is very active. Check out this website to get a sense of all transfers: https://account.arin.net/public/transfer-log

All AWS IPv4 addresses

Armed with the information above it was clear that not all of the AWS owned ranges were in the JSON that AWS published. For example, parts of the range are missing. Likely because some of it is reserved for future use.

I did a bit of digging and tried to figure out how many IPv4 addresses AWS really owns. A good start is the JSON that AWS publishes. I then combined that with all the ARIN, APNIC, and RIPE entries for Amazon I could find. A few examples include:


Combining all those IPv4 prefixes, then removing duplicates and overlaps by aggregating them, results in the following list of unique IPv4 addresses owned by AWS: https://gist.github.com/atoonk/b749305012ae5b86bacba9b01160df9f#all-prefixes

The total number of IPv4 addresses in that list is just over 100 Million (100,750,168). That’s the equivalent of just over six /8’s, not bad!

If we break this down by allocation size, we see the following:

1x /8     => 16,777,216 IPv4 addresses
1x /9     => 8,388,608 IPv4 addresses
4x /10    => 16,777,216 IPv4 addresses
5x /11    => 10,485,760 IPv4 addresses
11x /12   => 11,534,336 IPv4 addresses
13x /13   => 6,815,744 IPv4 addresses
34x /14   => 8,912,896 IPv4 addresses
53x /15   => 6,946,816 IPv4 addresses
182x /16  => 11,927,552 IPv4 addresses
<and more>

A complete breakdown can be found here: https://gist.github.com/atoonk/b749305012ae5b86bacba9b01160df9f#breakdown-by-ipv4-prefix-size
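
The per-size totals above follow directly from the prefix length: a /N holds 2^(32-N) addresses. Here’s a short sketch that recomputes the rows listed above; the counts per prefix size are copied from this post, and the smaller prefixes hidden behind “and more” are not included.

```python
def addresses_in(prefix_len, count=1):
    """Total IPv4 addresses in `count` prefixes of length `prefix_len`."""
    return count * 2 ** (32 - prefix_len)


# prefix length -> number of prefixes of that size, as listed above
breakdown = {8: 1, 9: 1, 10: 4, 11: 5, 12: 11, 13: 13, 14: 34, 15: 53, 16: 182}

for prefix_len, count in breakdown.items():
    print(f"{count}x /{prefix_len} => {addresses_in(prefix_len, count):,} IPv4 addresses")

subtotal = sum(addresses_in(p, c) for p, c in breakdown.items())
print(f"subtotal for /8 through /16: {subtotal:,}")
```

The subtotal makes it easy to see how much of the overall number comes from the large blocks versus the long tail of smaller prefixes.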

Putting a valuation on AWS’ IPv4 assets

Alright.. this is just for fun…

Since AWS is (one of) the largest buyers of IPv4 addresses, they have spent a significant amount on stacking up their IPv4 resources. It’s impossible, as an outsider, to know how much AWS paid for each deal. However, we can for fun, try to put a dollar number on AWS’ current IPv4 assets.

The average price for IPv4 addresses has gone up over the years. From ~$10 per IP a few years back to ~$25 per IP nowadays.
Note that these are market prices, so if AWS would suddenly decide to sell its IPv4 addresses and overwhelm the market with supply, prices would drop. But that won’t happen since we’re all still addicted to IPv4 ;)

Anyway, let’s stick with $25 and do the math just for fun.

100,750,168 ipv4 addresses x $25 per IP = $2,518,754,200

Just over $2.5 billion worth of IPv4 addresses, not bad!

Peeking into the future

It’s clear AWS is working hard behind the scenes to make sure we can all continue to build more on AWS. One final question we could look at is: how much buffer does AWS have? I.e., how healthy is their IPv4 reserve?

According to their published data, they have allocated roughly 53 Million IPv4 addresses to existing AWS services. We found that all their IPv4 addresses combined equates to approximately 100 Million IPv4 addresses. That means they still have ~47 Million IPv4 addresses, or 47% available for future allocations. That’s pretty healthy! And on top of that, I’m sure they’ll continue to source more IPv4 addresses. The IPv4 market is still hot!
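
The reserve estimate is simple subtraction, sketched below with the numbers from this post: the total found earlier and the roughly 53 million addresses in the published ranges. Both inputs are approximations, so treat the result as a ballpark figure.

```python
def reserve(total_owned, allocated):
    """Return (unallocated addresses, unallocated share of the total)."""
    free = total_owned - allocated
    return free, free / total_owned


total_owned = 100_750_168  # all AWS IPv4 addresses found above
allocated = 53_000_000     # roughly what the published JSON covers

free, share = reserve(total_owned, allocated)
print(f"{free:,} addresses not yet in the published ranges ({share:.0%} of the total)")
```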

<![CDATA[100G networking in AWS, a network performance deep dive]]>

Loyal readers of my blog will have noticed a theme, I’m interested in the continued move to virtualized network functions, and the need for faster networking options on cloud compute. In this blog, we’ll look at the network performance on the juggernaut of cloud computing, AWS.

AWS is

http://toonk.io/aws-network-performance-deep-dive/5f889c6f2257c9dcdf92feb0Thu, 15 Oct 2020 19:06:42 GMT100G networking in AWS, a network performance deep dive

Loyal readers of my blog will have noticed a theme, I’m interested in the continued move to virtualized network functions, and the need for faster networking options on cloud compute. In this blog, we’ll look at the network performance on the juggernaut of cloud computing, AWS.

AWS is the leader in the cloud computing world, and many companies now run parts of their services on AWS. The question we’ll try to answer in this article is: how well suited is AWS’ ec2 for high throughput network functions.

I’ve decided to experiment with adding a short demo video to this blog. Below you will find a quick demo and summary of this article. Since these videos are new and a bit of an experiment, let me know if you like it.

100G networking

It’s already been two years since AWS announced the C5n instances, featuring 100 Gbps networking. I’m not aware of any other cloud provider offering 100G instances, so this is pretty unique. Ever since this was released I wondered exactly what, if any, the constraints were. Can I send and receive at 100G line rate (~148.8Mpps at the smallest packet size)? So, before we dig into the details, let’s just check if we can really get to 100Gbps.

100gbs testing.. We’re gonna need a bigger boat..

There you have it, I was able to get to 100Gbs between 2 instances! That’s exciting. But there are a few caveats. We’ll dig into all of them in this article, with the aim to understand exactly what’s possible, what the various limits are, and how to get to 100g.

Understand the limits

Network performance on Linux is typically a function of a few parameters. Most notably, the number of TX/RX queues available on the NIC (network card). The number of CPU cores, ideally at least equal to the number of queues. The pps (packets per second) limit per queue. And finally, in virtual environments like AWS and GCP, potential admin limits on the instance.


Doing networking in software means that processing a packet (or a batch of them) uses a number of CPU cycles. It’s typically not relevant how many bytes are in a packet. As a result, the best metric to look at is the: pps number (related to our cpu cycle budget). Unfortunately, the pps performance numbers for AWS aren’t published so, we’ll have to measure them in this blog. With that, we should have a much better understanding of the network possibilities on AWS, and hopefully, this saves someone else a lot of time (this took me several days of measuring) ;)

Network queues per instance type

The table below shows the number of NIC queues by ec2 (c5n) Instance type.

100G networking in AWS, a network performance deep dive

In the world of ec2, 16 vCPUs on the C5n 4xl instance means 1 Socket, 8 Cores per socket, 2 Threads per core.

On AWS, an Elastic Network Adapter (ENA) NIC has as many queues as you have vCPUs, though it stops at 32 queues, as you can see with the C5n 9xl and C5n 18xl instances.

Like many things in computing, to make things faster, things are parallelized. We see this clearly when looking at CPU capacity, we’re adding more cores, and programs are written in such a way that can leverage the many cores in parallel (multi-threaded programs).

Scaling networking performance on our servers is done largely the same way. It’s hard to make things significantly faster, but it is easier to add more ‘workers’, especially if the performance is limited by our CPU capacity. In the world of NICs, these ‘workers’ are queues. Traffic sent and received by a host is load-balanced over the available network queues on the NIC. This load balancing is done by hashing, typically on the 5-tuple: protocol, source and destination address, and source and destination port. It’s something you’re likely familiar with from ECMP.

So queues on a NIC are like lanes on a highway, the more lanes, the more cars can travel the highway. The more queues, the more packets (flows) can be processed.
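
To illustrate the idea of spreading flows over queues, here’s a small sketch that hashes a 5-tuple and maps it to a queue number. Real NICs do this in hardware, commonly with a Toeplitz-based RSS hash; CRC32 is used here purely as a stand-in, and pick_queue is a made-up helper.

```python
import zlib


def pick_queue(protocol, src_ip, src_port, dst_ip, dst_port, num_queues):
    """Map a flow's 5-tuple to one of `num_queues` queues.

    CRC32 is only a stand-in for the NIC's real hash function.
    """
    key = f"{protocol}|{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    return zlib.crc32(key) % num_queues


# Every packet of one flow lands on the same queue...
flow = ("udp", "10.0.0.1", 40000, "10.0.0.2", 53)
print("queue for flow:", pick_queue(*flow, num_queues=32))

# ...while many different flows spread out over the available queues.
used = {pick_queue("udp", "10.0.0.1", port, "10.0.0.2", 53, 32) for port in range(40000, 41000)}
print("queues used by 1000 flows:", len(used))
```

This is also why a single flow can never use more than one queue: all of its packets hash to the same value.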

Test one, ENA queue performance

As discussed, the network performance of an instance is a function of the number of available queues and cpu’s. So let’s start with measuring the maximum performance of a single flow (queue) and then scale up and measure the pps performance.

In this measurement, I used two c5n.18xlarge ec2 instances in the same subnet and the same placement zone. The sender is using DPDK-pktgen (igb_uio). The receiver is a stock ubuntu 20.04 LTS instance, using the ena driver.

The table below shows the TX and RX performance between the two c5n.18xlarge ec2 instances for one and two flows.


With this, it seems the per queue limit is about 1Mpps. Typically the per queue limit is due to the fact that a single queue (soft IRQ) is served by a single CPU core. Meaning, the per queue performance is limited by how many packets per second a single CPU core can process. So again, what you typically see in virtualized environments is that the number of network queues goes up with the number of cores in the VM. In ec2 this is the same, though it’s maxing out at 32 queues.

Test two, RX only pps performance

Now that we determined that the per queue limit appears to be roughly one million packets per second, it’s natural to presume that this number scales up horizontally with the number of cores and queues. For example, the C5n 18xl comes with 32 NIC queues and 72 vCPUs, so in theory, we could naively presume that the (RX/TX) performance (given enough flows) should be 32Mpps. Let’s go ahead and validate that.
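
Spelled out, the naive expectation is just “usable workers times per-queue rate”, where the number of usable workers is capped by whichever is smaller, queues or CPUs. A tiny sketch, using the roughly 1Mpps per queue measured above (naive_max_pps is an illustrative name):

```python
def naive_max_pps(num_queues, num_cpus, per_queue_pps=1_000_000):
    """Upper bound if every queue scaled perfectly (which the tests below show it doesn't)."""
    return min(num_queues, num_cpus) * per_queue_pps


# C5n 18xl: 32 NIC queues, more CPUs than queues
print(f"{naive_max_pps(32, 72) / 1e6:.0f} Mpps")
```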

The graph below shows the Transmit (TX) performance as measured on a c5n.18xlarge. In each measurement, I gave the packet generator one more queue and vCPU to work with, starting with one TX queue and one vCPU and incrementing by one in each measurement until we reached 32 vCPUs and 32 queues (the maximum). The results show that the per TX queue performance varied between 700Kpps and 1Mpps. The maximum total TX performance I was able to get, however, was ~8.5Mpps using 12 TX queues. After that, adding more queues and vCPUs didn’t matter, or actually degraded the performance. So this indicates that the performance scales horizontally (per queue), but does max out at a certain point (which varies per instance type), in this case at 8.5Mpps.

c5n.18xlarge per TX queue performance

In this next measurement, we’ll use two packet generators and one receiver. I’m using two generators, just to make sure the limit we observed earlier isn’t caused by limitations on the packet generator. Each traffic generator is sending many thousands of flows, making sure we leverage all the available queues.

RX pps per C5N instance type

Alright, after a few minutes of reading (and many, many hours, well really days.. of measurements on my end), we now have a pretty decent idea of the performance numbers. We know how many queues each of the various c5n instance types have.

We have seen that the per queue limit is roughly 1Mpps. And with the table above, we now see how many packets per second each instance is able to receive (RX).

Forwarding performance

If we want to use ec2 for virtual network functions, then just receiving traffic isn’t enough. A typical router or firewall should both receive and send traffic at the same time. So let’s take a look at that.

For this measurement, I used the following setup. Both the traffic generator and receiver were C5n-18xl instances. The Device Under Test (DUT) was a standard Ubuntu 20.04 LTS instance using the ena driver. Since the earlier observed pps numbers weren’t too high, I determined it’s safe to use the regular Linux kernel to forward packets.

test setup
pps forwarding performance

The key takeaway from this measurement is that the TX and RX numbers are similar to what we saw before for the instance types up to and including the C5n 4xl. For example, earlier we saw the C5n 4xl could receive up to ~3Mpps. This measurement shows that it can do ~3Mpps simultaneously on RX and TX.

However, if we look at the C5n 9xl, we can see it was able to process about 6.2Mpps of RX + TX combined. Interestingly, earlier we saw it was also able to receive (RX only) ~6Mpps. So it looks like we hit some kind of aggregate limit. We observed a similar limit for the C5n 18xl instance.

In Summary.

In this blog, we looked at the various performance characteristics of networking on ec2. We determined that the performance of a single queue is roughly 1Mpps. We then saw how the number of queues goes up with the higher end instances up until 32 queues maximum.

We then measured the RX performance of the various instances as well as the forwarding (RX + TX aggregate) performance. Depending on the measurement setup (RX, or TX+RX) we see that for the largest instance types, the pps performance maxes out at roughly 6.6Mpps to 8.3Mpps. With that, I think that the C5n 9xl hits the sweet spot in terms of cost vs performance.

So how about that 100G test?

Ah yes! So far, we talked about pps only. How does that translate to gigabits per second?
Let’s look at the quick table below that shows how the pps number translates to Gbs at various packet sizes.

[Table: pps needed for 10G at various packet sizes]

These are a few examples to get to 10G at various packet sizes. This shows that in order to support line-rate 10G at the smallest packet size, the system will need to be able to do ~14.88Mpps. The 366 byte packet size is roughly the equivalent average of what you’ll see with an IMIX test, for which the system needs to be able to process ~3.4Mpps to get to 10G line rate.

If we look at the same table for 100Gbps, we see that at the smallest packet size, an instance would need to be able to process over 148Mpps. But using 9k jumbo frames, you only need 1.39Mpps.
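
The math behind these numbers is easy to reproduce. Below is a minimal sketch, not the script used for the tables: the function name `line_rate_mpps` and the `overhead_bytes` parameter are my own, and I’m assuming the usual 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter and inter-frame gap). That overhead is where the classic ~14.88Mpps figure for 64-byte frames at 10G comes from; for large packets it barely matters, so numbers for bigger packet sizes can differ slightly depending on whether it is counted.

```python
# Rough line-rate math: how many packets per second fit in a given bandwidth.
# Assumption: each frame costs an extra 20 bytes on the wire
# (7B preamble + 1B start-of-frame delimiter + 12B inter-frame gap).
ETHERNET_OVERHEAD_BYTES = 20


def line_rate_mpps(gbps: float, frame_bytes: int,
                   overhead_bytes: int = ETHERNET_OVERHEAD_BYTES) -> float:
    """Return the packet rate (in Mpps) needed to fill `gbps` with `frame_bytes` frames."""
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return gbps * 1e9 / bits_per_frame / 1e6


for gbps in (10, 100):
    for size in (64, 9000):
        print(f"{gbps:>3}G @ {size:>4}B: {line_rate_mpps(gbps, size):7.2f} Mpps")
```

With the default overhead, 64-byte frames come out at about 14.88Mpps for 10G and about 148.81Mpps for 100G, while 9000-byte jumbo frames need roughly 1.39Mpps for 100G.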

[Table: pps needed for 100G at various packet sizes]

And so, that’s what you need to do to get to 100G networking in ec2. Use Jumbo frames (supported in ec2, in fact, for the larger instances, this was the default). With that and a few parallel flows you’ll be able to get to 100G “easily”!

A few more limits

There is one more limitation I read about while researching but didn’t look into myself: it appears that some of the instances have time-based limits on the performance. This blog calls it Guaranteed vs. Best Effort. Basically, you’re allowed to burst for a while, but after a certain amount of time, you’ll get throttled. Finally, there is a per-flow limit of 10Gbps. So if you’re doing things like IPsec, GRE, VXLAN, etc., note that you will never go any faster than 10G.

Closing thoughts

Throughout this blog, I mentioned the word ‘limits’ quite a bit, which has a bit of a negative connotation. However, it’s important to keep in mind that AWS is a multi-tenant environment, and it’s their job to make sure your experience is, as much as possible, the same as if the instance were dedicated to you. So you can also think of them as ‘guarantees’. AWS will not call them that, but in my experience, the throughput tests have been pretty reproducible with, say, a +/- 10% measurement margin.

All in all, it’s pretty cool to be able to do 100G on AWS, as long as you are aware of the various limitations, which unfortunately aren’t well documented. Hopefully, this article helps some of you with that in the future.
Finally, could you use AWS to run your virtual firewalls, proxies, VPN gateways, etc? Sure, as long as you’re aware of the performance constraints and, with those in mind, build a horizontally scalable design according to AWS best practices. The one thing you really do need to keep an eye on is the (egress) bandwidth pricing, which, when you start doing many gigabits per second, can add up.

- Andree

<![CDATA[Building a global anycast service in under a minute]]>
http://toonk.io/building-a-global-anycast-service-in-under-a-minute/
Sun, 21 Jun 2020 05:15:00 GMT

This weekend I decided to take another look at Stackpath and their workload edge compute features. This is a relatively new feature; in fact, I wrote about it in Feb 2019 when it was just released. I remember being quite enthusiastic about the potential but also observed some things that were lacking back then. Now, one and a half years later, it seems most of those have been resolved, so let’s take a look!

I’ve decided to experiment with adding a small demo video to these blogs.
Below you will find a quick 5min demo of the whole setup. Since these videos are new and a bit of an experiment, let me know if you like it.
Demo: Building a global anycast service in under a minute


Stackpath supports two types of workloads (in addition to serverless): VM and container-based deployments. Both can be orchestrated using APIs and Terraform. Terraform is an “Infrastructure as code” tool. You simply specify your intent with Terraform, apply it, and you’re good to go. I’m a big fan of Terraform, so we’ll use that for our test.

One of the cool things about Stackpath is that they have built-in support for Anycast, for both their VM and Container service. I’m going to use that feature and the Container service to build this highly available, low latency web service. It’s super easy, see for yourself on my github here.

Docker setup

Since I’m going to use the container service, we need to create a Docker container to work with. This is my Dockerfile:

FROM python:3
WORKDIR /usr/src/app
COPY ./mywebserver.py .
CMD [ "python", "./mywebserver.py" ]

The mywebserver.py program is a simple web service that prints the hostname environment variable. This will help us determine which node is servicing our request when we start our testing.
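
To give an idea of what such a service can look like, here is a minimal sketch using only the Python standard library. This is a hypothetical stand-in, not the actual mywebserver.py from the repo: the names `render_body`, `HostnameHandler` and `serve`, as well as the response text, are assumptions made for illustration.

```python
# Hypothetical stand-in for mywebserver.py; the real file lives in the repo.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_body(environ=os.environ) -> bytes:
    """Build the response: the value of the HOSTNAME environment variable."""
    hostname = environ.get("HOSTNAME", "unknown")
    return f"Hello from {hostname}\n".encode()


class HostnameHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every GET returns the hostname, so we can tell which node answered.
        body = render_body()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Listen on all interfaces and serve requests until the process is stopped."""
    HTTPServer(("0.0.0.0", port), HostnameHandler).serve_forever()
```

Calling `serve()` at the bottom of the file is all the container’s CMD would need to start the service.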

After I built the container, I uploaded it to my Dockerhub repo, so that Stackpath can pull it from there.


Now it’s time to define our infrastructure using terraform. The relevant code can be found on my github here. I’ll highlight a few parts:

On line 17 we start with defining a new workload, and I’m requesting an Anycast IP for this workload. This means that Stackpath will load balance (ECMP) between all nodes in my workload (which I’m defining later).

resource "stackpath_compute_workload" "my-anycast-workload" {
    name = "my-anycast-workload"
    slug = "my-anycast-workload"
    annotations = {
        # request an anycast IP
        "anycast.platform.stackpath.net" = "true"
    }

On line 31, we define the type of workload, in this case, a container. As part of that we’re opening the correct ports, in my case port 8000 for the python service.

container {
    # Name that should be given to the container
    name = "app"
    port {
        name = "web"
        port = 8000
        protocol = "TCP"
        enable_implicit_network_policy = true
    }

Next up we define the container we’d like to deploy (from Dockerhub)

# image to use for the container
image = "atoonk/pythonweb:latest"

In the resources section we define the container specifications. In my case I’m going with a small spec, of one CPU core and 2G of ram.

resources {
   requests = {
      "cpu" = "1"
      "memory" = "2Gi"
   }
}

We now get to the section where we define how many containers we’d like per datacenter and in what datacenters we’d like this service to run.

In the example below, we’re deploying three containers in each datacenter, with the possibility to grow to four as part of auto-scaling. We’re deploying this in both Seattle and Dallas.

target {
    name         = "global"
    min_replicas = 3
    max_replicas = 4
    scale_settings {
      metrics {
        metric = "cpu"
        # Scale up when CPU averages 50%.
        average_utilization = 50
      }
    }
    # Deploy these instances to Dallas and Seattle
    deployment_scope = "cityCode"
    selector {
      key      = "cityCode"
      operator = "in"
      values   = [
        "DFW", "SEA"
      ]
    }
}

Time to bring up the service.

Now that we’ve defined our intent with Terraform, it’s time to bring this up. The proper way to do this is:

terraform init
terraform plan
terraform apply
After that, you’ll see the containers come up, and our anycasted python service will become available. Since the containers come up rather quickly, you should have all six containers in the two datacenters up and running in under a minute.

Testing the load balancing.

I’ve deployed the service in both Seattle and Dallas, and since I am based in Vancouver Canada, I expect to hit the Seattle datacenter as that is the closest datacenter for me.

$ for i in `seq 1 10`; do curl ; done


The results above show that I am indeed hitting the Seattle datacenter, and that my requests are being load balanced over the three instances in Seattle, all as expected.
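
If you want to quantify the distribution rather than eyeball the curl output, you can tally the hostnames that come back. The sketch below is an illustration only: `tally_nodes` is a name I made up, and the sample hostnames are invented, not taken from my test.

```python
from collections import Counter


def tally_nodes(responses):
    """Count how many requests each node answered, given one hostname per response."""
    return Counter(line.strip() for line in responses if line.strip())


# Made-up hostnames, purely for illustration.
sample = ["sea-0", "sea-2", "sea-1", "sea-0", "sea-1", "sea-2"]
for node, hits in sorted(tally_nodes(sample).items()):
    print(f"{node}: {hits} requests")
```

Feeding it the lines returned by the curl loop gives a per-node request count, which makes uneven load balancing easy to spot.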

[Figure: In the portal, I can see the per-container logs as well]

In Summary

Compared to my test last year with Stackpath, there has been a nice amount of progress. It’s great to now be able to do all of this with just a Terraform file. It’s kind of exciting that you can bring up a fully anycasted service in under a minute with only one command! By changing the replica counts in the Terraform file, we can also easily grow and shrink our deployment if needed.
In this article we looked at the container service only, but the same is possible with virtual machines; my github repo has an example for that as well.

Finally, don’t forget to check the demo recording and let me know if you’d like to see more video content.

<![CDATA[Building an XDP (eXpress Data Path) based BGP peering router]]>
http://toonk.io/building-an-xdp-express-data-path-based-bgp-peering-router/
Sun, 19 Apr 2020 00:00:00 GMT

Over the last few years, we’ve seen an increase in projects and initiatives to speed up networking in Linux. Because the Linux kernel is slow when it comes to forwarding packets, folks have been looking at userland or kernel bypass networking. In the last few blog posts, we’ve looked at examples of this, mostly leveraging DPDK to speed up networking. The trend here is: let’s just take networking away from the kernel and process the packets in userland. Great for speed, not so great for all the kernel network stack features that now have to be re-implemented in userland.

The Linux kernel community has recently come up with an alternative to userland networking called XDP (eXpress Data Path). It tries to strike a balance between the benefits of the kernel and faster packet processing. In this article, we’ll take a look at what it would take to build a Linux router using XDP. We will go over what XDP is, how to build an XDP packet forwarder combined with a BGP router, and of course, look at the performance.

XDP (eXpress Data Path)

XDP (eXpress Data Path) is an eBPF-based high-performance data path that has been part of the Linux kernel since version 4.8. Yes, BPF, the same Berkeley Packet Filter you’re likely familiar with from tcpdump filters, though that’s now referred to as Classic BPF. Extended BPF (eBPF) has gained a lot of popularity over the last few years within the Linux community. BPF allows you to connect to Linux kernel hook points; each time the kernel reaches one of those hook points, it can execute an eBPF program. I’ve heard some people describe eBPF as what JavaScript was for the web: an easy way to enhance the ‘web’, or in this case, the kernel. With BPF you can execute code without having to write kernel modules.

XDP, as part of the BPF family, operates early on in the kernel network code. The idea behind XDP is to add an early hook in the RX path of the kernel and let a user-supplied eBPF program decide the fate of the packet. The hook is placed in the NIC driver just after the interrupt processing and before any memory allocation needed by the network stack itself. So all this happens before an SKB (the most fundamental data structure in the Linux networking code) is allocated. Practically, this means it is executed before things like tc and iptables.

A BPF program runs in a small virtual machine, perhaps not the typical virtual machine you’re familiar with, but a tiny (RISC register machine) isolated environment. Since it’s running in conjunction with the kernel, there are some protective measures that limit how much code can be executed and what it can do. For example, it cannot contain unbounded loops (only bounded loops are allowed), and there is a limited number of eBPF instructions and helper functions. The maximum program size was originally restricted to 4096 BPF instructions, which, by design, means that any program will terminate quickly. For kernels newer than 5.1, this limit was lifted to 1 million BPF instructions.

When and Where is the XDP code executed

XDP programs can be attached at three different points. The fastest is to have it run on the NIC itself; for that you need a smartnic, and it is called offload mode. To the best of my knowledge, this is currently only supported on Netronome cards. The next attachment opportunity is essentially in the driver, before the kernel allocates an SKB. This is called “native” mode and means you need your driver to support it; luckily, most popular drivers do nowadays.

Finally, there is SKB or Generic Mode XDP, where the XDP hook is called from netif_receive_skb(). This is after the packet DMA and skb allocation are completed; as a result, you lose most of the performance benefits.

Assuming you don’t have a smartnic, the best place to run your XDP program is in native mode as you’ll really benefit from the performance gain.

XDP actions

Now that we know that XDP code is an eBPF C program, and we understand where it can run, let’s take a look at what you can do with it. Once the program is called, it receives the packet context, and from that point on you can read the content, update some counters, and potentially modify the packet. The program then needs to terminate with one of five XDP actions:

XDP_DROP: This does exactly what you think it does; it drops the packet and is often used for XDP based firewalls and DDoS mitigation scenarios.
XDP_ABORTED: Similar to DROP, but indicates something went wrong when processing. This action is not something a functional program should ever use as a return code.
XDP_PASS: This will release the packet and send it up to the kernel network stack for regular processing. This could be the original packet or a modified version of it.
XDP_TX: This action results in bouncing the received packet back out the same NIC it arrived on. This is usually combined with modifying the packet contents, for example rewriting the IP and MAC addresses, such as for a one-legged load balancer.
XDP_REDIRECT: The redirect action allows a BPF program to redirect the packet somewhere else, either a different CPU or a different NIC. We’ll use this function later to build our router. It is also used to implement AF_XDP, a new socket family that solves the high-speed packet acquisition problem often faced by virtual network functions. AF_XDP is, for example, used by IDSs and is now also supported by Open vSwitch.

Building an XDP based high performant router

Alright, now that we have a better idea of what XDP is and some of its capabilities, let’s start building! My goal is to build an XDP program that forwards packets at line-rate between two 10G NICs. I also want the program to use the regular Linux routing table. This means I can add static routes using the “ip route” command, but it also means I could use an opensource BGP daemon such as Bird or FRR.

We’ll jump straight to the code. I’m using the excellent XDP tutorial code to get started. I forked it here, but it’s mostly the same code as the original. This is an example called “xdp_router”, which uses the bpf_fib_lookup() helper to determine the egress interface for a given packet using the Linux routing table. The program then uses the bpf_redirect_map() helper to send the packet out of the correct egress interface. You can see the code here. It’s only a hundred lines of code to do all the work.
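
To make the control flow a bit more concrete, here is a toy model of that forwarding decision in Python. To be clear, this is not the eBPF code from the tutorial: `fib_lookup` and `route_packet` are hypothetical names, the routing table is made up, and details such as rewriting the Ethernet header are left out. It only illustrates the idea of a longest-prefix-match lookup followed by either a redirect or a pass to the kernel.

```python
# Toy model of the forwarding decision; NOT the eBPF code from the tutorial.
import ipaddress

XDP_PASS, XDP_REDIRECT = "XDP_PASS", "XDP_REDIRECT"


def fib_lookup(fib, dst):
    """Longest-prefix match: return the egress ifindex for dst, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, ifindex in fib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, ifindex)
    return None if best is None else best[1]


def route_packet(fib, dst):
    """Redirect when the FIB has a route, otherwise hand the packet to the kernel."""
    ifindex = fib_lookup(fib, dst)
    if ifindex is None:
        return XDP_PASS, None
    return XDP_REDIRECT, ifindex


# Hypothetical routing table: prefix -> egress interface index.
fib = {"10.0.0.0/8": 2, "10.1.0.0/16": 4}
print(route_packet(fib, "10.1.2.3"))   # the more specific /16 wins
print(route_packet(fib, "192.0.2.1"))  # no route, so the kernel gets the packet
```

In the real program the lookup is done by the kernel helper against the actual Linux FIB, which is why routes added with “ip route” or by a BGP daemon are picked up automatically.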

After we compile the code (just run make in the parent directory), we load the code using the ./xdp_loader program included in the repo and use the ./xdp_prog_user program to populate and query the redirect_params maps.

# pin BPF resources (redirect map) to a persistent filesystem
mount -t bpf bpf /sys/fs/bpf/

# attach xdp_router code to eno2
./xdp_loader -d eno2 -F --progsec xdp_router

# attach xdp_router code to eno4
./xdp_loader -d eno4 -F --progsec xdp_router

# populate redirect_params maps
./xdp_prog_user -d eno2
./xdp_prog_user -d eno4

Test setup

So far, so good, we’ve built an XDP based packet forwarder! For each packet that comes in on either network interface eno2 or eno4, it does a route lookup and redirects it to the correct egress interface, all in eBPF code, and all in a hundred lines. Pretty awesome, right?! Now let’s measure the performance to see if it’s worth it. Below is the test setup.

[Figure: test setup]

I’m using the same traffic generator as before to generate 14Mpps at 64Bytes for each 10G link. Below are the results:

[Figure: XDP forwarding test results]

The results are amazing! A single flow in one direction can go as high as 4.6 Mpps, using one core. Earlier, we saw the Linux kernel can go as high as 1.4Mpps for one flow using one core.

14Mpps in one direction between the two NICs requires four cores. Our earlier blog showed that the regular kernel would need 16 cores to do this work!

[Figure: Test result, XDP forwarding using XDP_REDIRECT, 5 cores to forward 29Mpps]

Finally, for the bidirectional 10,000 flow test, forwarding 28Mpps, we need five cores. All tests are significantly faster than forwarding packets using the regular kernel and all that with minor changes to the system.

Just so you know

Since all packet forwarding happens in XDP, packets redirected by XDP won’t be visible to iptables or even tcpdump. Everything happens before packets even reach that layer, and since we’re redirecting the packet, it never moves further up the stack. So if you need features like ACLs or NAT, you will have to implement that in XDP (take a look at https://cilium.io/).

A word on measuring cpu usage.
To control and measure the number of CPU cores used by XDP, I’m changing the number of queues the NIC can use. I increase the number of queues on my XL710 Intel NIC incrementally until I get a packet loss-free transfer between the two ports on the traffic generator. For example, to get 14Mpps in one direction from port 0 to 1 on the traffic generator through our XDP router, which was forwarding between eno2 and eno4, I used the following settings:

ethtool -L eno2 combined 4
ethtool -L eno4 combined 4

For the 28Mpps testing, I used the following

ethtool -L eno2 combined 4
ethtool -L eno4 combined 4
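
The procedure itself is a simple search, which could be automated along these lines. This is only a sketch of the method described above: `find_min_queues` is a made-up name, and `is_loss_free` is a hypothetical callback that would set the queue count (for example with ethtool -L), run the traffic generator, and report whether all packets came back.

```python
def find_min_queues(is_loss_free, max_queues):
    """Return the smallest queue count for which is_loss_free(n) is True, else None."""
    for queues in range(1, max_queues + 1):
        if is_loss_free(queues):
            return queues
    return None


# Stand-in for a real measurement: pretend the test passes from 4 queues upward.
print(find_min_queues(lambda queues: queues >= 4, max_queues=32))
```

Since each queue is served by its own core, the number this returns is also the number of cores needed for that traffic level.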
A word of caution
Interestingly, increasing the number of queues, and thus using more cores, appears in some cases to have a negative impact on efficiency. For example, I’ve seen scenarios with 30 queues where the unidirectional 14Mpps test with 10,000 flows appears to use almost no CPU (between 1 and 2 cores), while the same test bidirectionally uses up all 30 cores. When restarting this test, I see some inconsistent behavior in terms of CPU usage, so I’m not sure what’s going on. I will need to spend a bit more time on this later.

XDP as a peering router

The tests above show promising results, but one major difference between a simple forwarding test and a real-life peering router is the number of routes in the forwarding table. So the question we need to answer is how the bpf_fib_lookup function performs when there are more than just a few routes in the routing table. More concretely, could you use Linux with XDP as a full-route peering router?
To answer this question, I installed bird as a BGP daemon on the XDP router. Bird has a peering session with an exabgp instance, which I loaded with a full routing table using mrt2exabgp.py and MRT files from RIPE RIS.
Just to be a real peering router, I also filtered out the RPKI invalid routes using rtrsub. The end result is a full routing table with about 800k routes in the Linux FIB.

[Figure: Test result, XDP router with a full routing table, 5 cores to forward 28Mpps]
After re-running the performance tests with 800k bgp routes in the FIB, I observed no noticeable decrease in performance.
This indicates that a larger FIB table has no measurable impact on the XDP helper bpf_fib_lookup(). This is exciting news for those interested in a cheap and fast peering router.

Conclusion and closing thoughts.

We started the article with a quick introduction to eBPF and XDP. We learned that XDP is a subset of the recent eBPF developments focused specifically on the hooks in the network stack. We went over the different XDP actions and introduced the redirect action, which, together with the bpf_fib_lookup helper allows us to build the XDP router.

When looking at the performance, we see that we can speed up packet forwarding in Linux by roughly five times in terms of CPU efficiency compared to regular kernel forwarding. We observed we needed about five cores to forward 28Mpps bidirectional between two 10G NICs.

When we compare these results with the results from my last blog, DPDK and VPP, we see that XDP is slightly slower, ie. 3 cores (vpp) vs 5 cores (XDP) for the 28Mpps test. However, the nice part about working with XDP was that I was able to leverage the Linux routing table out of the box, which is a major advantage.

The exciting part is that this setup integrates natively with Netlink, which allowed us to use Bird, or really any other routing daemon, to populate the FIB. We also saw that having 800K routes in the FIB had no measurable impact on the performance.

The fib_lookup helper function allowed us to build a router and leverage well-known userland routing daemons. I would love to also see a similar helper function for conntrack, or perhaps some other integration with Netfilter. It would make building firewalls and perhaps even NAT a lot easier: the first packet is punted to the kernel, and subsequent packets are handled by XDP.
Wrapping up, we started with the question: can we build a high-performance peering router using XDP? The answer is yes! You can build a high-performance peering router using just Linux, relying on XDP to accelerate the dataplane while leveraging the various open-source routing daemons to run your routing protocols. That’s exciting!
<![CDATA[Kernel bypass networking with FD.io and VPP.]]>
http://toonk.io/kernel-bypass-networking-with-fd-io-and-vpp/
Sun, 05 Apr 2020 18:14:00 GMT

Over the last few years, I have experimented with various flavors of userland, kernel-bypass networking. In this article, we’ll take FD.IO for a spin.

We will compare the result with the results of my last blog, in which we looked at how much a vanilla Linux kernel could do in terms of forwarding (routing) packets. We observed that on Linux, to achieve 14Mpps we needed roughly 16 and 26 cores for the unidirectional and bidirectional tests, respectively. In this article, we’ll look at what we need to accomplish this with FD.io.

Userland networking

The principle of userland networking is that the networking stack is no longer handled by the kernel, but instead by a userland program. The Linux kernel is incredibly feature-rich, but for fast networking, it also requires a lot of cores to deal with all the (soft) interrupts. Several of the userland networking projects rely on DPDK to achieve incredible numbers. One reason why DPDK is so fast is that it doesn’t rely on interrupts. Instead, it uses a poll mode driver, meaning it’s continuously spinning at 100%, picking up packets from the NIC. A typical server nowadays comes with quite a few CPU cores, and dedicating one or more cores to picking packets off the NIC is, in some cases, entirely worth it, especially if the server needs to process lots of network traffic.

So DPDK provides us with the ability to send and receive packets efficiently and extremely fast. But that’s also it! Since we’re not using the kernel, we now need a program that takes the packets from DPDK and does something with them, like, for example, a virtual switch or router.



FD.IO is an open-source software dataplane developed by Cisco. At the heart of FD.io is something called Vector Packet Processing (VPP).

The VPP platform is an extensible framework that provides switching and routing functionality. VPP is built on a ‘packet processing graph.’ This modular approach means that anyone can ‘plugin’ new graph nodes. This makes extensibility rather simple, and it means that plugins can be customized for specific purposes.

FD.io can use DPDK as the driver for the NIC and can then process the packets at a high rate on commodity CPUs. It’s important to remember that it is not a fully-featured router, i.e. it doesn’t really have a control plane; instead, it’s a forwarding engine. Think of it as a router line-card, with the NIC and the DPDK drivers as the ports. VPP allows us to take a packet from one NIC to another, transform it if needed, do table lookups, and send it out again. There are APIs that allow you to manipulate the forwarding tables, or you can use the CLI to, for example, configure static routes, VLANs, VRFs, etc.

Test setup

I’ll use mostly the same test setup as in my previous test. Again using two n2.xlarge.x86 servers from packet.com and our DPDK traffic generator. The set up is as below.

[Figure: Test setup]

I’m using the VPP code from the FD.io master branch and installed it on a vanilla Ubuntu 18.04 system following these steps.

Test results — Packet forwarding using VPP

Now that we have our test setup ready to go, it’s time to start our testing!
To start, I configured VPP with “vppctl” like this. Note that I need to set static ARP entries since the packet generator doesn’t respond to ARP.

set int ip address TenGigabitEthernet19/0/1
set int ip address TenGigabitEthernet19/0/3
set int state TenGigabitEthernet19/0/1 up
set int state TenGigabitEthernet19/0/3 up
set ip neighbor TenGigabitEthernet19/0/1 e4:43:4b:2e:b1:d1
set ip neighbor TenGigabitEthernet19/0/3 e4:43:4b:2e:b1:d3

That’s it! Pretty simple right?

Ok, time to look at the results. Just like before, we did a single flow test, both unidirectional and bidirectional, as well as a 10,000 flow test.

[Figure: VPP forwarding test results]

Those are some remarkable numbers! With a single flow, VPP can process and forward about 8Mpps, not bad. The perhaps more realistic test with 10,000 flows, shows us that it can handle 14Mpps with just two cores. To get to a full bi-directional scenario where both NICs are sending and receiving at line rate (28 Mpps per NIC) we need three cores and three receive queues on the NIC. To achieve this last scenario with Linux, we needed approximately 26 cores. Not bad, not bad at all!

[Figure: Traffic generator on the left, VPP server on the right. This shows the full line-rate bidirectional test: 14Mpps per NIC, while VPP uses 3 cores.]

Test results — NAT using VPP

In my previous blog we saw that when doing SNAT on Linux with iptables, we got as high as 3Mpps per direction needing about 29 CPUs per direction. This showed us that packet rewriting is significantly more expensive than just forwarding. Let’s take a look at how VPP does nat.

To enable nat on VPP, I used the following commands:

nat44 add interface address TenGigabitEthernet19/0/3
nat addr-port-assignment-alg default
set interface nat44 in TenGigabitEthernet19/0/1 out TenGigabitEthernet19/0/3 output-feature

My first test is with one flow only in one direction. With that, I’m able to get 4.3Mpps. That’s roughly half of what we saw in the performance test without nat. It’s no surprise this is slower due to the additional work needed. Note that with Linux iptables I was seeing about 1.1Mpps.

A single flow for nat isn’t super representative of a real-life nat example where you’d be translating many sources. So for the next measurements, I’m using 255 different source IP addresses and 255 destination IP addresses as well as different port numbers; with this setup, the nat code is seeing about 16k sessions. I can now see the numbers go to 3.2Mpps; more flows mean more nat work. Interestingly, this number is exactly the same as I saw with iptables. There is however one big difference, with iptables the system was using about 29 cores. In this test, I’m only using two cores. That’s a low number of workers, and also the reason I’m capped. To remove that cap, I added more cores and validated that the VPP code scales horizontally. Eventually, I need 12 cores to run 14Mpps for a stable experience.

[Figure: VPP forwarding with NAT test results]

Below is the relevant VPP config to control the number of cores used by VPP. Also, I should note that I isolated the cores I allocated to VPP so that the kernel wouldn’t schedule anything else on it.

cpu {
    main-core 1
    # CPU placement:
    corelist-workers 3-14
    # Also added this to grub: isolcpus=3-31,34-63
}
dpdk {
   dev default {
      # RSS, number of queues
      num-rx-queues 12
      num-tx-queues 12
      num-rx-desc 2048
      num-tx-desc 2048
   }
   dev 0000:19:00.1
   dev 0000:19:00.3
}
plugins {
   plugin default { enable }
   plugin dpdk_plugin.so { enable }
}
nat {
   translation hash buckets 1048576
   translation hash memory 268435456
   user hash buckets 1024
   max translations per user 10000
}



In this blog, we looked at VPP from the FD.io project as a userland forwarding engine. VPP is one example of a kernel bypass method for processing packets. It works closely with and further augments DPDK.

We’ve seen that the VPP code is feature-rich, especially for a kernel bypass packet forwarder. Most of all, it’s crazy fast.

We need just three cores to have two NICs forward full line-rate (14Mpps) in both directions. Comparing that to the Linux kernel, which needed 26 cores, we see an almost 9x increase in performance.
We noticed that the results were even better when using nat. In Linux, I wasn’t able to get any higher than 3.2Mpps for which I needed about 29 cores. With VPP we can do 3.2Mpps with just two cores and get to full line rate nat with 12 cores.
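
As a quick sanity check on those ratios, here is the arithmetic written out. It is only a sketch: `mpps_per_core` is a helper name I made up, and the inputs are the core counts and packet rates reported in these tests.

```python
def mpps_per_core(mpps: float, cores: int) -> float:
    """Packets per second (in millions) that each core handles."""
    return mpps / cores


# Bidirectional forwarding: 14Mpps in each direction, so 28Mpps in total.
kernel = mpps_per_core(28, 26)
vpp = mpps_per_core(28, 3)
print(f"kernel: {kernel:.2f} Mpps/core, VPP: {vpp:.2f} Mpps/core, ratio: {vpp / kernel:.1f}x")

# NAT at 3.2Mpps: about 29 cores with iptables vs two cores with VPP.
print(f"NAT ratio: {mpps_per_core(3.2, 2) / mpps_per_core(3.2, 29):.1f}x")
```

The forwarding ratio works out to about 8.7x, which is the “almost 9x” mentioned above, and for NAT at 3.2Mpps the per-core difference is about 14.5x.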

I think FD.io is an interesting and exciting project, and I’m a bit surprised it’s not more widely used. One of the reasons is likely that there’s a bit of a learning curve. But if you need high-performance packet forwarding, it’s certainly something to explore! Perhaps this is the start of your VPP project? If so, let me know!


<![CDATA[Linux Kernel and Measuring network throughput.]]>
http://toonk.io/linux-kernel-and-measuring-network-throughput/
Sun, 29 Mar 2020 18:21:00 GMT

In my last blog, I wrote about how we can use dpdk pktgen for performance testing. Today I spent some time on some baseline testing to see what we can expect out of a vanilla Linux system nowadays when used as a router. Over the last two years I’ve been playing a fair bit with kernel bypass networking and hope to write about it in the near future. The promise of kernel bypass networking is higher performance. To determine how much of a performance increase it offers over the kernel, we need to establish a baseline first; we’ll do that in this article.

Test setup

[Figure: n2.large.x86 CPU specs.]

I’m using two n2.xlarge.x86 servers from packet.com. With its two NUMA nodes, 16 cores per socket, 32 cores in total, 64 with hyper-threading, this is a very beefy machine! It also comes with a quad-port Intel x710 NIC, giving us 4 x 10Gbps. Packet allows you to create custom vlans and assign network ports to a vlan. I’ve created two vlans and assigned one NIC to each vlan. The setup looks like below.

[Figure: Test setup]

The Device Under Test (DUT) is a vanilla Ubuntu 19.04 system running a 5.0.0-38-generic kernel. The only minor tuning I’ve done is to set the NIC rx ring to 4096 and to enable IP forwarding (net.ipv4.ip_forward=1).

Using the traffic generator, I’m sending as many packets as possible and observing when packets stop coming back at the same rate, which indicates packet loss. I record the point at which that happens as the maximum throughput. I’m also keeping a close eye on the CPU usage, to get a sense of how many CPU cores (hyper threads) are needed to serve the traffic.

Test 1 — packet forwarding on Linux

The first test was easy. I’m simply sending packets from one side of the DUT (Device Under Test) to the other and vice versa; the DUT routes the packets between its two interfaces, eno2 and eno4.
Note that I did both a unidirectional test (one side to the other) and a bidirectional test (both directions at the same time).
I also tested with just one flow, and with 10,000 flows.

Receive Side Scaling (RSS)

The number of flows is important because the NIC is doing something called Receive Side Scaling (RSS), which load-balances different flows onto different NIC receive queues. Each queue is then served by a different core, meaning the system scales horizontally. But keep in mind that, depending on your traffic patterns, you may still be limited by what a single core can do.
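To make the flow-to-queue idea concrete, here is a toy sketch of RSS-style hashing. It is not how the X710 does it: real NICs typically use a Toeplitz hash with an indirection table, whereas this sketch uses SHA-256 purely as a stand-in hash. The IP addresses, ports, and the queue count of 8 are made-up values for illustration.

```python
import hashlib
from collections import Counter

NUM_QUEUES = 8  # hypothetical number of NIC receive queues


def rx_queue(src_ip, dst_ip, src_port, dst_port, proto, num_queues=NUM_QUEUES):
    """Map a flow's 5-tuple to a receive queue index.

    Stand-in hash only: SHA-256 replaces the NIC's real hash function.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_queues


# One flow: every packet carries the same 5-tuple, so it always lands on
# the same queue (and therefore on the same core).
single_flow = {rx_queue("10.0.0.1", "10.0.1.1", 1234, 5678, 17) for _ in range(1000)}

# 10,000 flows: varying the source port spreads the packets over all queues.
many_flows = Counter(
    rx_queue("10.0.0.1", "10.0.1.1", 1024 + i, 5678, 17) for i in range(10_000)
)

print("queues used by the single flow:", len(single_flow))
print("flows per queue with 10,000 flows:", sorted(many_flows.items()))
```

The single flow never leaves its queue, which is why its throughput is capped by one core, while the 10,000 flows give every core something to do.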

Ok, show me the results! Keep in mind that we’re talking mostly about packets per second (PPS), as that is the major indicator of performance; it’s not super relevant how much data is carried in each packet. In the world of Linux networking, it really comes down to how many interrupts per second the system can process.

Test results for test 1

In the results above, you can see that one flow can go as high as 1.4Mpps. At that point, the core serving that queue is maxed out (running at 100%), cannot process any more packets, and will start dropping them. The single-flow forwarding performance is good to know for DDoS use cases or large single-flow network streams such as ESP. For traffic like this, the performance is only as good as what a single queue / CPU can handle.

When doing the same test with 10,000 flows, I get to 14Mpps, full 10G line rate at the smallest possible packet size (64B), yay! At this point I can see all cores doing a fair amount of work. This is expected and is due to the hashing of flows over different queues. Looking at the CPU usage, I estimate that you’d need roughly 16 cores at 100% usage to serve this number of packets (interrupts).
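For reference, this is where the roughly 14Mpps line-rate figure comes from. The sketch assumes standard Ethernet framing, where every frame costs an extra 20 bytes on the wire (preamble, start-of-frame delimiter, and inter-frame gap) on top of the 64-byte frame itself. The function name is mine.

```python
# Per-frame overhead on the wire that is not part of the frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def line_rate_pps(link_bps, frame_bytes):
    """Maximum frames per second on an Ethernet link for a given frame size."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


for size in (64, 512, 1518):
    print(f"{size:>5} B frames: {line_rate_pps(10e9, size) / 1e6:6.2f} Mpps")
```

For 64-byte frames on a 10G link this works out to about 14.88Mpps, which is the number usually rounded to 14M in this post.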

14M pps, unidirectional test.

Interestingly, I wasn’t able to get to full line rate when doing the bidirectional test, meaning both NICs sending and receiving simultaneously, although I am getting reasonably close at 12Mpps per direction (24Mpps total per NIC). Eyeballing the CPU usage and the amount of idle time left over, I’d expect you’d need roughly 26 cores at 100% usage to do that.

Test 2 - Introducing a simple stateful iptables rule

In this test we’re adding two simple iptables rules to the DUT to see what the impact is. The hypothesis here is that since we’re now going to ask the system to invoke conntrack and do stateful session mapping, we’re starting to execute more code, which could impact the performance and system load. This test will show us the impact of that.

The iptables rules added were:

iptables -I FORWARD -d -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
iptables -I FORWARD -d -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

Test results for test 2, impact of conntrack

The results for the single-flow performance test look exactly the same; that’s good. The results for the 10,000 flows test look the same as well when it comes to packets per second. However, we do need a fair number of extra CPUs to do the work. Good thing our test system has plenty.
So you can still achieve (close to) full line rate with a simple stateful iptables rule, as long as you have enough CPUs. Note that in this case, the state table had 10,000 state entries. I didn’t test with more iptables rules.

Test 3 - Introducing a NAT rule

In this test, we’re starting from scratch as we did in test 1, and I’m adding a simple NAT rule that causes all packets going through the DUT to be rewritten to a new source IP. These are the two rules:

iptables -I POSTROUTING -t nat -d -s -j SNAT --to
iptables -I POSTROUTING -t nat -d -s -j SNAT --to

The results below are quite different than what we saw earlier.

Test results for test 3

The results show that rewriting the packets is quite a bit more expensive than just allowing or dropping a packet. For example, if we look at the unidirectional test with 10,000 flows, we see that we dropped from 14Mpps (test 1) to 3.2Mpps, and we needed 13 more cores to do this!

This is what an (unhappy) 64-core system looks like when trying to forward and NAT 5.9Mpps

For what it’s worth, I did a quick measurement using nftables instead of iptables, but saw no significant changes in NAT performance.


One of the questions I had starting this experiment was: can Linux route at line rate between two network interfaces? The answer is yes, we saw 14Mpps (unidirectional), as long as there are sufficient flows and you have enough cores (~16). The bidirectional test made it to 12Mpps per direction (24Mpps total per NIC) with 26 cores at 100%.

We also saw that with the addition of two stateful iptables rules, I was still able to get the same throughput, but needed extra CPU to do the work. So at least it scales horizontally.

Finally, we saw the rather dramatic drop in performance when adding SNAT rules to the test. With SNAT the maximum I was able to get out of the system was 5.9Mpps; this was for 20k sessions (10k per direction).

So yes, you can build a close to line rate router in Linux, as long as you have sufficient cores and don’t do too much packet manipulation. All in all, an interesting test, and now we have a starting benchmark for future (kernel bypass / userland) networking experiments on Linux!

<![CDATA[Building a high performance - Linux Based Traffic generator with DPDK]]>

http://toonk.io/building-a-high-performance-linux-based-traffic-generator-with-dpdk/5f53d8003680d46e73ec60d6Wed, 18 Mar 2020 18:32:00 GMTBuilding a high performance - Linux Based Traffic generator with DPDK

Often in my now 20-year networking career, I have had to do some form of network performance testing. Use cases varied, from troubleshooting a customer problem to testing new network hardware, and nowadays more and more virtual network functions and software-based ‘bumps in the wire’.

I’ve always enjoyed playing with hardware-based traffic generators. My first experience with one, an IXIA hardware tester, goes back to, I think, 2003, at the Amsterdam Internet Exchange, where we were testing brand new Foundry 10G cards. These hardware-based testers were super powerful and a great tool to validate new gear, such as router line cards, firewalls, and IPsec gear. However, we don’t always have access to these hardware-based traffic generators, as they tend to be quite expensive or only available in a lab. In this blog, we will look at a software-based traffic generator that anyone can use, based on DPDK. As you’re going through this, remember that the scripts and additional info can be found on my GitHub page here.


DPDK, the Data Plane Development Kit, is an open source software project started by Intel and now managed by the Linux Foundation. It provides a set of data plane libraries and network interface controller poll mode drivers that run in userspace. Ok, let’s think about that for a sec: what does that mean? Userspace networking is something you likely hear and read about more and more. The main driver behind userspace networking (aka kernel bypass) has to do with the way Linux has built its networking stack; it is built as part of a generic, multi-purpose, multi-user OS. Networking in Linux is powerful and feature-rich, but it’s just one of the many features of Linux, and as a result, the networking stack needs to play fair and share its resources with the rest of the kernel and userland programs. The end result is that getting a few (1 to 3) million packets per second through the Linux networking stack is about what you can do on a standard system. That’s not enough to fill up a 10G link with 64-byte packets, which is the equivalent of 14M packets per second (pps). This is the point where the traditional interrupt-driven (IRQ) way of networking in Linux becomes the limiting factor, and this is where DPDK comes in.

With DPDK and userland networking programs, we take the NIC away from the kernel and give it to a userland DPDK program. The DPDK driver is a poll mode driver (PMD), which means that, typically, one core per NIC sits in a busy loop, constantly polling for packets. You will see that core running at 100%, regardless of how many packets are arriving or being sent on that NIC. This is obviously a bit of a waste, but nowadays, with plenty of cores and the need for high-throughput systems, this is often a great trade-off, and best of all, it allows us to get to the 14Mpps number on Linux.
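To get a feel for why per-packet overhead matters so much at these rates, here is a small back-of-the-envelope sketch of the time budget a single core has per packet. The 2.0GHz clock speed is an assumed value for illustration, not a measured property of the test machines, and the function name is mine.

```python
def per_packet_budget(pps, cpu_hz):
    """Return (nanoseconds, CPU cycles) available per packet on one core."""
    ns = 1e9 / pps
    cycles = cpu_hz / pps
    return ns, cycles


CPU_HZ = 2.0e9  # assumed clock speed, for illustration only

for label, pps in (("1M pps", 1e6), ("10G line rate, 64B (14.88M pps)", 14.88e6)):
    ns, cycles = per_packet_budget(pps, CPU_HZ)
    print(f"{label}: {ns:.1f} ns, about {cycles:.0f} cycles per packet")
```

At 10G line rate with 64-byte packets, a single core would have only about 67 nanoseconds per packet, which leaves very little room for interrupts and context switches. That is the gap a busy-polling driver is meant to close.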

Ok, high performance, we should all move to DPDK then, right?! Well, there’s just one problem… Since we’re now bypassing the kernel, we don’t get to benefit from the rich Linux features such as Netfilter, and not even from some of what we now think of as basic features, like a TCP/IP stack. This means you can’t just run your Nginx, MySQL, BIND, and other socket-based applications with DPDK, as all of these rely on the Linux socket API and the kernel to work. So although DPDK gives us a lot of speed and performance by bypassing the kernel, we also lose a lot of functionality.

There are quite a few DPDK-based applications nowadays, varying from network forwarders such as software-based routers and switches to TCP/IP stacks such as F-Stack.

In this blog, we’re going to look at DPDK-pktgen, a DPDK-based traffic generator maintained by the DPDK team. I’m going to walk through installing DPDK, setting up SR-IOV, and running pktgen; all of the below was tested on a Packet.com server of type x1.small.x86, which has a single Intel X710 10G NIC and a 4-core E3-1578L Xeon CPU. I’m using Ubuntu 18.04.4 LTS.

Installing DPDK and Pktgen

First, we need to install the DPDK libraries, tools, and drivers. There are various ways to install DPDK and pktgen; I elected to compile the code from source. There are a few things you need to do; to make it easier, you can download the same bash script I used to help you with the installation.

Solving the single NIC problem

One of the challenges with DPDK is that it will take full control of the NIC. To use DPDK, you will need to release the NIC from the kernel and give it to DPDK. Given that I only have one NIC, once I give it to DPDK, I’d lose all access (remember, there’s no easy way to keep on using SSH, etc., since it relies on the Linux kernel). Typically folks solve this by having a management NIC (for Linux) and one or more NICs for DPDK. But I have only one NIC, so we need to be creative: we’re going to use SR-IOV to achieve the same. SR-IOV allows us to make one NIC appear as multiple PCI devices, so in a way, we’re virtualizing the NIC.

To use SR-IOV, we need to enable the IOMMU in the kernel (done in the DPDK install script). After that, we can set the number of Virtual Functions (the number of new PCI NICs) like this:

echo 1 > /sys/class/net/eno1/device/sriov_numvfs
ip link set eno1 vf 0 spoofchk off
ip link set eno1 vf 0 trust on

dmesg -T will show something like this:

[Tue Mar 17 19:44:37 2020] i40e 0000:02:00.0: Allocating 1 VFs.
[Tue Mar 17 19:44:37 2020] iommu: Adding device 0000:03:02.0 to group 1
[Tue Mar 17 19:44:38 2020] iavf 0000:03:02.0: Multiqueue Enabled: Queue pair count = 4
[Tue Mar 17 19:44:38 2020] iavf 0000:03:02.0: MAC address: 1a:b5:ea:3e:28:92
[Tue Mar 17 19:44:38 2020] iavf 0000:03:02.0: GRO is enabled
[Tue Mar 17 19:44:39 2020] iavf 0000:03:02.0 em1_0: renamed from eth0

We can now see the new PCI device and nic name:

root@ewr1-x1:~# lshw -businfo -class network | grep 000:03:02.0
pci@0000:03:02.0 em1_0 network Illegal Vendor ID

Next up we will unbind this NIC from the kernel and give it to DPDK to manage:

/opt/dpdk-20.02/usertools/dpdk-devbind.py -b igb_uio 0000:03:02.0

We can validate that like this (note em2 is not connected and not used):

/opt/dpdk-20.02/usertools/dpdk-devbind.py -s
Network devices using DPDK-compatible driver
0000:03:02.0 'Ethernet Virtual Function 700 Series 154c' drv=igb_uio unused=iavf,vfio-pci,uio_pci_generic
Network devices using kernel driver
0000:02:00.0 'Ethernet Controller X710 for 10GbE backplane 1581' if=eno1 drv=i40e unused=igb_uio,vfio-pci,uio_pci_generic
0000:02:00.1 'Ethernet Controller X710 for 10GbE backplane 1581' if=em2 drv=i40e unused=igb_uio,vfio-pci,uio_pci_generic

Testing setup

Now that we’re ready to start testing, I should explain our simple test setup. I’m using two x1.small.x86 servers; one is the sender (running dpdk-pktgen), the other is a vanilla Ubuntu machine. What we’re going to test is the ability of the receiver’s kernel, sometimes referred to as the Device Under Test (DUT), to pick up the packets from the NIC. That’s all; we’re not processing anything. The IP address the packets are sent to isn’t even configured on the DUT, so the kernel will drop the packets as soon as it has picked them up from the NIC.

test setup

Single flow traffic

Ok, time to start testing! Let’s run pktgen and generate some packets! My first experiment is to figure out how much I can send in a single flow to the target machine before it starts dropping packets.

Note that you can find the exact config in the GitHub repo for this blog. The file pktgen.pkt contains the commands to configure the test setup. Things that I configured were the MAC and IP addresses, ports and protocols, and the sending rate. The source and destination addresses are on /31 networks, so I’m setting the destination MAC address to that of the default gateway. With the config as defined in pktgen.pkt, I’m sending the same 64-byte packet (a single UDP 5-tuple) over and over.

I’m using the following to start pktgen.

/opt/pktgen-20.02.0/app/x86_64-native-linuxapp-gcc/pktgen -- -T -P -m "2.[0]" -f pktgen.pkt

After adjusting the sending rate properties on the sender and monitoring with ./monitorpkts.sh on the receiver, I find that a single flow (single queue, single core) will run clean on this receiver machine up until about 120k pps. If I push the sending rate higher than that, I start to observe packets being dropped on the receiver. That’s a bit lower than expected, and even though it’s one flow, I can see that the CPU serving that queue has enough idle time left. There must be something else happening…

The answer has to do with the receive buffer ring on the receiver’s network card. It was too small for the higher packet rates. After I increased it from 512 to 4096, I can now receive up to 1.4Mpps before seeing drops. Not bad for a single flow!

ethtool -G eno1 rx 4096
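A simplified way to think about the ring size: if the core stops draining the ring for a moment, the ring overflows after entries / pps seconds. The sketch below applies that model to the numbers from this test. It is my own simplification (it ignores batching, interrupt coalescing, and the fact that the core keeps draining in the meantime), not something measured here.

```python
def ring_fill_time_ms(ring_entries, pps):
    """Milliseconds until an undrained rx ring of this size overflows."""
    return ring_entries / pps * 1000


print(f" 512 entries @ 120k pps: {ring_fill_time_ms(512, 120_000):.2f} ms")
print(f" 512 entries @ 1.4M pps: {ring_fill_time_ms(512, 1_400_000):.2f} ms")
print(f"4096 entries @ 1.4M pps: {ring_fill_time_ms(4096, 1_400_000):.2f} ms")
```

Under this model, a 512-entry ring at 1.4Mpps tolerates well under a millisecond of stall, while the 4096-entry ring gives the core a few milliseconds of slack, which would be consistent with the drops disappearing after the ethtool change.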

Multi flow traffic

Pktgen also comes with the ability to configure ranges. Examples of ranges include source and destination IP addresses as well as source and destination ports. You can find an example in the pktgen-range.pkt file. For most environments, this is a more typical scenario, as your server is likely to serve many different flows from many different IP addresses. In fact, the Linux system relies on the existence of many flows to be able to deal with these higher amounts of traffic. The different flows are hashed and load-balanced over the available receive queues on the NIC. Each queue is then served by a separate interrupt thread, allowing the kernel to parallelize the work and leverage the multiple cores on the system.

Below you’ll find a screenshot from when I was running the test with many flows. The receiver terminals can be seen on the left, the sender on the right. The main thing to notice here is that on the receiving node, all available CPUs are being used; note the ksoftirqd/X processes. Since we are using a wide range of source and destination ports, we’re getting proper load balancing over all cores. With this, I can now achieve 0% lost packets up to about 6Mpps. To get to 14Mpps, 10G line rate with 64-byte packets, I’d need more CPUs.


IMIX test

Finally, we’ll run a basic IMIX test, using the dpdk-pktgen pcap feature. Internet Mix or IMIX refers to typical Internet traffic. When measuring equipment performance using an IMIX of packets, the performance is assumed to resemble what can be seen in “real-world” conditions.

The imix pcap file contains 100 packets with the sizes and ratio according to the IMIX specs.

tshark -r imix.pcap -V | grep 'Frame Length'| sort | uniq -c|sort -n
9 Frame Length: 1514 bytes (12112 bits)
33 Frame Length: 590 bytes (4720 bits)
58 Frame Length: 60 bytes (480 bits)

I need to rewrite the source and destination IP and MAC addresses so that they match my current setup, this can be done like this:

tcprewrite \
 --enet-dmac=44:ec:ce:c1:a8:20 \
 --enet-smac=00:52:44:11:22:33 \
 --pnat=, \
 --infile=imix.pcap \
 --outfile=output.pcap

For more details, also see my notes here: https://github.com/atoonk/dpdk_pktgen/blob/master/DPDKPktgen.md

We then start the pktgen app and give it the pcap:

/opt/pktgen-20.02.0/app/x86_64-native-linuxapp-gcc/pktgen -- -T -P -m "2.[0]" -s 0:output.pcap

I can now see I’m sending and receiving packets at a rate of 3.2Mpps at 10Gbps, a packet rate well below the maximum we saw earlier. This means that the Device Under Test (DUT) is capable of receiving packets at 10Gbps using an IMIX traffic pattern.
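As a sanity check, the 3.2Mpps figure can be derived from the packet mix in the tshark output above. The sketch assumes that the frame lengths in the pcap do not include the 4-byte Ethernet FCS (so a captured 60-byte frame is 64 bytes on the wire) and that each frame costs another 20 bytes of preamble, start-of-frame delimiter, and inter-frame gap. Those two constants are my assumptions, not values taken from the test.

```python
# Captured frame length in bytes -> number of packets, from the tshark output.
IMIX = {60: 58, 590: 33, 1514: 9}

FCS_BYTES = 4       # assumed: FCS is not included in the captured length
WIRE_OVERHEAD = 20  # preamble + start-of-frame delimiter + inter-frame gap


def imix_pps(link_bps, mix):
    """Return (packets per second at line rate, average captured frame size)."""
    packets = sum(mix.values())
    avg_captured = sum(size * count for size, count in mix.items()) / packets
    avg_on_wire = avg_captured + FCS_BYTES + WIRE_OVERHEAD
    return link_bps / (avg_on_wire * 8), avg_captured


pps, avg = imix_pps(10e9, IMIX)
print(f"average frame: {avg:.2f} bytes, line rate: {pps / 1e6:.2f} Mpps")
```

With an average captured frame of about 366 bytes, a 10G link tops out at roughly 3.2Mpps, which matches what pktgen reports.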

Result of IMIX test with a PCAP as the source. Receiver (DUT) on the left, sender window on the right.


In this article, we looked at getting DPDK up and running, talked a bit about what DPDK is, and used its pktgen traffic generator application. A typical challenge when using DPDK is that you lose the network interface, meaning that the kernel can no longer use it. In this blog, we solved this using SR-IOV, which allowed us to create a second logical interface for DPDK. Using this interface, I was able to generate 14Mpps without issues.

On the receiving side of this test traffic, we had another Linux machine (no DPDK), and we tested its ability to receive traffic from the NIC (after which the kernel dropped it straight away). We saw how the packets-per-second number is limited by the rx buffer, as well as by the ability of the CPU cores to pick up the packets (soft interrupts). We saw that a single core was able to do about 1.4Mpps. Once we started leveraging more cores, by creating more flows, we started seeing dropped packets at about 6Mpps. If we had had more CPUs, we would likely have been able to do more than that.

Also note that throughout this blog, I spoke mostly of packets per second and not much in terms of bits per second. The reason for this is that every new packet on the Linux receiver (DUT) creates an interrupt. As a result, the number of interrupts the system can handle is the most critical indicator of how many bits per second the Linux system can handle.

All in all, pktgen and dpdk require a bit of work to set up, and there is undoubtedly a bit of a learning curve. I hope the scripts and examples in the GitHub repo will help with your testing and remember: with great power comes great responsibility.

<![CDATA[TCP BBR - Exploring TCP congestion control]]>

http://toonk.io/tcp-bbr-exploring-tcp-congestion-control/5f53dac53680d46e73ec610dSat, 15 Feb 2020 00:00:00 GMT

One of the oldest protocols and possibly the most used protocol on the Internet today is TCP. You likely send and receive hundreds of thousands or even over a million TCP packets (eeh segments?) a day. And it just works! Many folks believe TCP development has finished, but that’s incorrect. In this blog, we will take a look at a relatively new TCP congestion control algorithm called BBR and take it for a spin.

Alright, we all know the difference between the two most popular transport protocols used on the Internet today: UDP and TCP. UDP is a send-and-forget protocol. It is stateless and has no congestion control or reliable delivery support. We often see UDP used for DNS and VPNs. TCP is UDP’s sibling and does provide reliable transfer, flow control, and congestion control; as a result, it is quite a bit more complicated.

People often think the main difference between TCP and UDP is that TCP gives us guaranteed packet delivery. This is one of the most important features of TCP, but TCP also gives us congestion control. Congestion control is all about fairness and is critical for the Internet to work; without some form of congestion control, the Internet would collapse.

Over the years, different congestion control algorithms have been implemented and used in the various TCP stacks. You may have heard of TCP terms such as Reno, Tahoe, Vegas, Cubic, Westwood, and, more recently, BBR. These are all different congestion control algorithms used in TCP. What these algorithms do is determine how fast the sender should send data while adapting to network changes. Without these algorithms, our Internet pipes would soon be filled with data and collapse.


Bottleneck Bandwidth and Round-trip propagation time (BBR) is a TCP congestion control algorithm developed at Google in 2016. Up until recently, the Internet has primarily used loss-based congestion control, relying only on indications of lost packets as the signal to slow down the sending rate. This worked decently well, but the networks have changed. We have much more bandwidth than ever before, the Internet is generally more reliable now, and we see new things such as bufferbloat that impact latency. BBR tackles this with a ground-up rewrite of congestion control, and it uses latency, instead of lost packets, as a primary factor to determine the sending rate.

Source: https://cloud.google.com/blog/products/gcp/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster

Why is BBR better?

There are a lot of details I’ve omitted, and it gets complicated pretty quickly, but the important thing to know is that with BBR, you can get significantly better throughput and reduced latency. The throughput improvements are especially noticeable on long-haul paths such as transatlantic file transfers, especially when there’s minor packet loss. The improved latency is mostly seen on the last-mile path, which is often impacted by bufferbloat (4-second ping times, anyone?). Since BBR attempts not to fill the buffers, it tends to be better at avoiding bufferbloat.

Photo by Zakaria Zayane on Unsplash

Let’s take BBR for a spin!

BBR has been in the Linux kernel since version 4.9 and can be enabled with a simple sysctl command. In my tests, I’m using two Ubuntu machines and iperf3 to generate TCP traffic. The two servers are located in the same data center; I’m using two Packet.com servers of type t1.small, which come with a 2.5Gbps NIC.

The first test is a quick test to see what we can get from a single TCP flow between the two servers. This shows 2.35Gb/s, which sounds about right, good enough to run our experiments.

The effect of latency on TCP throughput
In my day job, I deal with machines that are distributed over many dozens of locations all around the world, so I’m mostly interested in the performance between machines that have some latency between them. In this test, we are going to introduce 140ms round trip time between the two servers using Linux Traffic Control (tc). This is roughly the equivalent of the latency between San Francisco and Amsterdam. This can be done by adding 70ms per direction on both servers like this:

tc qdisc replace dev enp0s20f0 root netem latency 70ms

If we do a quick ping, we can now see the 140ms round trip time

root@compute-000:~# ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=61 time=140 ms
64 bytes from icmp_seq=2 ttl=61 time=140 ms
64 bytes from icmp_seq=3 ttl=61 time=140 ms

Ok, time for our first tests. I’m going to use Cubic to start, as that is the most common TCP congestion control algorithm used today.

sysctl -w net.ipv4.tcp_congestion_control=cubic

A 30-second iperf run shows an average transfer speed of 347Mb/s. This is the first clue of the effect of latency on TCP throughput. The only thing that changed from our initial test (2.35Gb/s) is the introduction of the 140ms round trip delay. Let’s now set the congestion control algorithm to bbr and test again.

sysctl -w net.ipv4.tcp_congestion_control=bbr

The result is very similar; the 30-second average is now 340Mb/s, slightly lower than with Cubic. So far no real changes.
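One way to read these numbers is through the bandwidth-delay product: to keep a path full, a sender needs roughly rate x RTT bytes in flight. The sketch below computes that for the figures above; the function name is mine. The interpretation that a window or buffer limit is what caps both Cubic and BBR here is my assumption, not something I verified in these tests.

```python
def bytes_in_flight(rate_bps, rtt_s):
    """Bytes that must be unacknowledged at once to sustain rate_bps over rtt_s."""
    return rate_bps * rtt_s / 8


RTT = 0.140  # seconds, the round trip time configured with tc

needed = bytes_in_flight(2.35e9, RTT)
observed = bytes_in_flight(347e6, RTT)

print(f"to fill 2.35Gb/s at 140ms:  {needed / 1e6:.1f} MB in flight")
print(f"implied by 347Mb/s at 140ms: {observed / 1e6:.1f} MB in flight")
```

Filling the 2.35Gb/s path at 140ms would need about 41MB in flight, while the measured 347Mb/s corresponds to roughly 6MB. That both algorithms land on nearly the same number suggests the limit sits outside the congestion control algorithm.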

The effect of packet loss on throughput

We’re going to repeat the same test as above, but with the addition of a minor amount of packet loss. With the command below, I’m introducing 1.5% packet loss on the server (sender) side only.

tc qdisc replace dev enp0s20f0 root netem loss 1.5% latency 70ms

The first test with Cubic shows a dramatic drop in throughput; the throughput drops from 347Mb/s to 1.23Mb/s. That’s a ~99.6% drop and results in this link basically being unusable for today’s bandwidth needs.

If we repeat the exact same test with BBR, we see a significant improvement over Cubic. With BBR the throughput drops to 153Mb/s, which is a 55% drop.

The tests above show the effect of packet loss and latency on TCP throughput. The impact of just a minor amount (1.5%) of packet loss on a long latency path is dramatic. Using anything other than BBR on these longer paths will cause significant issues when there is even a minor amount of packet loss. Only BBR maintains a decent throughput number at anything more than 1.5% loss.
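The collapse of loss-based congestion control under these conditions can be roughly predicted with the well-known Mathis approximation for Reno-style TCP: throughput is about (MSS / RTT) x (C / sqrt(p)), with C = sqrt(3/2). This formula is not part of my test setup, it models Reno rather than Cubic, and the 1460-byte MSS is an assumed value, so treat the result as an order-of-magnitude estimate only.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate steady-state throughput of loss-based (Reno-style) TCP."""
    c = math.sqrt(3 / 2)
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


# Assumed MSS of 1460 bytes; RTT and loss rate as configured with tc.
estimate = mathis_throughput_bps(mss_bytes=1460, rtt_s=0.140, loss_rate=0.015)
print(f"estimated throughput: {estimate / 1e6:.2f} Mb/s")
```

The estimate comes out at roughly 0.8Mb/s, the same order of magnitude as the 1.23Mb/s measured with Cubic. BBR does not use loss as its primary signal, so this model does not apply to it.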

The table below shows the complete set of results for the various TCP throughput tests I did using different congestion control algorithms, latency and packet loss parameters.

Throughput Test results with various congestion control algorithms
Note: the congestion control algorithm used for a TCP session is only locally relevant. So, two TCP speakers can use different congestion control algorithms on each side of the TCP session. In other words: the server (sender), can enable BBR locally; there is no need for the client to be BBR aware or support BBR.

TCP socket statistics

As you’re exploring tuning TCP performance, make sure to use socket statistics, or ss, like below. This tool displays a ton of socket information, including the TCP congestion control algorithm used, the round trip time per TCP session, and the calculated bandwidth and actual delivery rate between the two peers.


When to use BBR

Both Cubic and BBR perform well on these longer latency links when there is no packet loss, and BBR really shines under (moderate) packet loss. Why is that important? You could ask why you would want to design for these packet loss situations. For that, let’s think about a situation where you have multiple data centers around the world, and you rely on transit to connect them (possibly using your own overlay VPN). You likely have a steady stream of data between the various data centers: think of log files, ever-changing configuration or preference files, database synchronization, backups, etc. All major transit providers suffer from packet loss at times, for various reasons. If you have a few dozen of these globally distributed data centers, then depending on your transit providers and the locations of your POPs, you can expect packet loss incidents between a set of data centers several times a week. In situations like this, BBR will shine and help you maintain your SLOs.

I’ve mostly focused on the benefits of BBR for long-haul links. But CDNs and various application hosting environments will also see benefits. In fact, YouTube has been using BBR for a while now to speed up their already highly optimized experience. This is mostly due to the fact that BBR ramps up to the optimal sending rate aggressively, causing your video stream to load even faster.

Downsides of BBR

It sounds great, right? Just execute this one sysctl command, and you get much better throughput, so your users get a better experience. Why would you not do this? Well, BBR has received some criticism due to its tendency to consume all available bandwidth and push out other TCP streams that use, say, Cubic or other congestion control algorithms. This is something to be mindful of when testing BBR in your environment. BBRv2 is supposed to resolve some of these challenges.

All in all, I was amazed by the results. It looks to me like this is certainly worth taking a closer look at. You won’t be the first: in addition to Google, Dropbox and Spotify are two other examples where BBR is being used or experimented with.

<![CDATA[Building A Smarter AWS Global Accelerator]]>http://toonk.io/building-a-smarter-aws-global-accelerator/5f53e04f3680d46e73ec613fSun, 01 Dec 2019 00:00:00 GMT

In my last blog post, we looked at Global Accelerator, a global load balancer provided by AWS. I think Global Accelerator is an excellent tool for folks building global applications in AWS, as it will help them direct traffic to the right origin locations or servers. This is great for high-volume applications, and it also improves availability.

In this blog post, we’ll take a look at how we could build our own global accelerator by building on other previous blog posts (building anycast applications on packet). I’ve been thinking a lot about Global Accelerator and while it provides a powerful data plane, I think it would benefit from a smarter control plane. A control plane that provides load balancing with more intelligence, by taking into account the capacity, load and round trip time to each origin. In this blog post, we’ll evaluate and demonstrate what that could look like by implementing a Smarter Global Accelerator ourselves.

Typical Architecture

Many applications nowadays are delivered using the architecture below. Clients always hit one of the nearest edge nodes (the blue diamonds). Which edge node that is depends on the way traffic is directed to the edge, typically using either DNS-based load balancing or straight anycast; the latter is what AWS Global Accelerator uses.

Typical application delivery architecture

The edge node then needs to determine what origin server to send the request to (assuming there is no caching or cache misses). This is how your typical CDN works, but also how for example, Google and Facebook deliver their applications. In the case of a simple CDN there could be one or more origin servers. In the case of Facebook, the choice is more which of their ‘core’ or ‘larger’ datacenters to send the request to.

With AWS Global Accelerator you can configure listeners (the diamond) in a region to send a certain percentage of traffic to an origin group, ‘Endpoint Groups’ in AWS speak. This is a static configuration, which is not ideal. Additionally, if an origin (the green box in the diagram) reaches its capacity, you will need to update the configuration. This is the part we’re going to make smarter.

In an ideal world, each edge node (the diamond) routes requests to the closest origin based on the latency between the edge node and the origin. It should also be aware of the total load the origin is under, and how much load the origin can handle. An individual edge node doesn’t know how many other edge nodes there are, or how much each of them is sending to each origin. So we need a centralized brain and a feedback loop.

Building a Closed-loop system

To have the system continuously adapt to the changing environment, we need access to several operational metrics. We need to know how many requests each edge node (load balancer) is receiving; from that we can infer the total, global number of incoming requests per second. We also need to know the capacity of each origin, since we want to make sure we don’t send more traffic to an origin than it can handle. Finally, we need to know the health of each origin and the latency between each edge node and each origin. Most of these metrics are dynamic, so we need to continuously publish (or poll) the health and requests-per-second information.

Now that we have all the input data, we can feed it into our software. The software is essentially a scheduler, solving a constraint-based assignment problem. The output is a list of all edge nodes, with a weight assignment per edge node for each origin. A simple example could look like this:

 - Edge node Amsterdam:
    - Origin EU DC:  90%
    - Origin US-WEST DC:  0%
    - Origin US-EAST DC:  10%
    - Origin Asia DC:  0%
 - Edge node New York:
    - Origin EU DC: 0%
    - Origin US-WEST DC:  0%
    - Origin US-EAST DC:  100%
    - Origin Asia DC:  0%

In the above example, the Amsterdam edge node sends 90% of the requests to the EU origin, while the remaining 10% is directed to the next closest DC, US-EAST. This means that the EU datacenter is at capacity and is offloading traffic to the next closest origin.

The New York edge node is sending all traffic to the US-EAST datacenter, as there is enough capacity at this point and no need to offload traffic.

Our closed-loop system will re-calculate and publish the results every few seconds so that we can respond to changes quickly.
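
To make the weight assignment concrete, here is a minimal sketch of one way such a scheduler could work. It is not the actual scheduler from this project: it uses a simple greedy rule (fill the lowest-latency origin first, spill the rest to the next closest), and the function name, input shapes, request rates, capacities and latencies are all made up for illustration. They are chosen so the output matches the Amsterdam and New York example above.

```python
def assign_weights(edge_rps, origin_capacity, latency_ms):
    """Greedy sketch: each edge fills its nearest origin first, then spills over.

    edge_rps:        {edge: requests per second arriving at that edge}
    origin_capacity: {origin: max requests per second the origin can take}
    latency_ms:      {edge: {origin: round trip time}}, healthy origins only
    Returns {edge: {origin: percentage of that edge's traffic}}.
    """
    remaining = dict(origin_capacity)  # capacity still unclaimed, shared by all edges
    weights = {}
    for edge, rps in edge_rps.items():
        by_latency = sorted(latency_ms[edge], key=latency_ms[edge].get)
        share = {origin: 0.0 for origin in by_latency}
        if rps <= 0:
            # No traffic yet: point everything at the closest origin.
            share[by_latency[0]] = 100.0
        else:
            unplaced = rps
            for origin in by_latency:
                take = min(unplaced, remaining[origin])
                share[origin] = 100.0 * take / rps
                remaining[origin] -= take
                unplaced -= take
            # If every origin is full, the leftover stays on the closest one.
            share[by_latency[0]] += 100.0 * unplaced / rps
        weights[edge] = {origin: round(pct, 1) for origin, pct in share.items()}
    return weights


# Hypothetical inputs, picked to reproduce the example above.
edge_rps = {"Amsterdam": 1000, "New York": 500}
origin_capacity = {"EU": 900, "US-WEST": 2000, "US-EAST": 2000, "Asia": 2000}
latency_ms = {
    "Amsterdam": {"EU": 5, "US-EAST": 80, "US-WEST": 150, "Asia": 250},
    "New York": {"EU": 80, "US-EAST": 5, "US-WEST": 70, "Asia": 220},
}
weights = assign_weights(edge_rps, origin_capacity, latency_ms)
print(weights)
```

A greedy pass like this depends on the order in which edge nodes are processed, so it is only a starting point; a real solver for the assignment problem would weigh all edges against each other.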

Let’s start building

I’m going to re-use much of what we’ve built earlier. In this experiment I’m again using Packet.net and their BGP anycast support to build the edge nodes. Please see this blog for details. I’m using Linux LVS as a load balancer for this setup. Each edge node publishes the needed metrics to Prometheus, a time-series database, every 15 seconds. With this, we now have a handful of edge nodes, fully anycasted, and access to all needed metrics per edge node in a centralized system.

Swagger file, built using Flask-RESTPlus

The other thing that is needed is a centralized source of truth. For this, I wrote a Flask-based REST API. This API allows us to create new load balancers, add origins, etc. We can also ask this same API for all load balancers, their origins, and the health and operational metrics.

The next thing we need is a script that talks to the API every few seconds to retrieve the latest configuration. With that information, each edge node can update its load balancer configuration, such as creating new load balancer listeners and updating origin details like the weight per origin. We now have everything in place and can start testing.
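
As a rough sketch of what the edge-side sync step could look like, the snippet below turns one listener definition into `ipvsadm` commands that update the weight of each origin. The dictionary layout, field names and addresses are hypothetical (they are not the API's real schema), and the NAT forwarding flag `-m` is an assumption about the LVS setup. The commands are only built here, not executed.

```python
def weight_update_commands(listener):
    """Build ipvsadm commands that set the weight of each origin (real server).

    `listener` is a hypothetical, already-parsed API response, e.g.
    {"vip": "192.0.2.10", "port": 80,
     "origins": [{"address": "198.51.100.10", "weight": 90}, ...]}
    """
    service = f"{listener['vip']}:{listener['port']}"
    commands = []
    for origin in listener["origins"]:
        real_server = f"{origin['address']}:{listener['port']}"
        # LVS weights are integers; a weight of 0 stops new connections.
        weight = int(round(origin["weight"]))
        commands.append([
            "ipvsadm", "-e",      # edit an existing real server
            "-t", service,        # TCP virtual service
            "-r", real_server,
            "-m",                 # masquerading (NAT); an assumption
            "-w", str(weight),
        ])
    return commands


# Hypothetical listener with two origins.
listener = {
    "vip": "192.0.2.10",
    "port": 80,
    "origins": [
        {"address": "198.51.100.10", "weight": 90},
        {"address": "203.0.113.20", "weight": 10},
    ],
}
commands = weight_update_commands(listener)
for command in commands:
    print(" ".join(command))
```

On a real edge node, each command list could be handed to `subprocess.run(command, check=True)` once the listener and its origins have been created.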

JSON definition for each load balancer

Observations and tuning

I started testing by generating many GET requests to one listener that has two origins: one DigitalOcean VM in the US and one in Europe. Since all testing was performed from one location, it was hitting one edge datacenter, which has two edge nodes. Those edge nodes would send the traffic to the closest origin, which is the US origin. Now imagine this origin server is hitting its maximum capacity, and I want to protect it from being overloaded by sending some traffic to the other origin. To do that I set the maximum load number for the US origin to 200 requests per second (also see JSON above).

Below you’ll see an interesting visualization of this measurement. At t=0 the total traffic load for both origins is 0: no traffic is coming in at all, which also means that both origins are well below their max capacity. The US load balancers are therefore configured to send all requests to the US origin, as it has the lowest latency and is below its max threshold.

After we generate the traffic, all requests are initially sent to the US origin. As the metrics start coming in, the system detects that the total number of incoming requests exceeds 200 per second. As a result, the load balancer configuration is updated to start sending traffic to both origin servers, with the intent of not sending more than 200 requests per second to the US origin.


Speed vs. accuracy

One of the things we want to prevent is big sudden swings in traffic. To achieve that I built in a dampening factor that limits the load shifting (i.e. the load balancer configuration change) to 2 percent per origin for each 15-second interval. Note that if you have two origins, this means a 2% change per origin, so a 4% swing per 15-second cycle. This means it will take a bit longer for the system to respond to major changes, but it gives us substantially more stability: less oscillation between origins, and time for the system to settle. In my initial version I had no dampening, and the system never stabilized and showed significant unwanted sudden traffic swings.
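
The dampening step can be sketched roughly as follows. This is an illustration rather than the project's actual code: it assumes the 2 percent limit means two percentage points per origin per cycle, and the function name and the 66.7/33.3 target split are made up (the split is what you would get if, say, 300 requests per second came in against a 200 requests-per-second cap on the US origin).

```python
def dampen(current, target, max_step=2.0):
    """Move each origin's weight toward its target by at most max_step points."""
    new_weights = {}
    for origin, weight in current.items():
        delta = target.get(origin, 0.0) - weight
        # Clamp the change to the range [-max_step, +max_step].
        delta = max(-max_step, min(max_step, delta))
        new_weights[origin] = weight + delta
    return new_weights


# Hypothetical starting point and target for one edge node.
weights = {"US": 100.0, "EU": 0.0}
target = {"US": 66.7, "EU": 33.3}

first_step = dampen(weights, target)

# Count how many 15-second cycles it takes to reach the target.
cycles = 0
while any(abs(target[o] - weights[o]) > 1e-6 for o in weights):
    weights = dampen(weights, target)
    cycles += 1
print(first_step, cycles)
```

With these numbers the first cycle only moves the split to 98/2, and it takes 17 cycles (a bit over four minutes at 15 seconds each) to reach the target. That is the trade-off between speed and stability described above. With more than two origins the clamped weights may not add up to exactly 100, which is fine as long as the load balancer treats them as relative weights.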

The graph shows an interesting side effect of my testing setup. Since I’m testing from the US west coast and offloading more and more requests to the EU origin, a single curl will on average take longer due to the increased round trip time. As a result, the total number of requests goes down. Which is fun, because it means the software needs to adapt constantly. Every time we change the origin weights slightly, the total number of requests changes slightly. This causes some oscillation, but it’s also exactly the oscillation a closed-loop system is designed for, and it works well as long as we have a dampening factor. It also shows that in some instances the total number of inbound requests increases if your website (or any app) is responding faster, though I’m not sure if that’s representative of a real-world scenario. Still, this was a fun side effect that put a bit of extra stress on the software.

Closing thoughts

This project was fun as it allowed me to combine several of my interests. One of them is global traffic engineering, i.e. how do we get traffic to where we’d like it to be processed. We also got to re-use the lessons learned and experience gained from a few of the previous blog posts, specifically how to build anycasted applications with Packet and a deep-dive into AWS Global Accelerator. I got to improve my Python skills by building a RESTful API in Flask and making sure it was properly documented using the OpenAPI spec.
Finally, building the actual scheduler was a fun challenge, and it took me a while to figure out how to best solve the assignment problem before I was really happy with the outcome. The result is what could be a Smarter Global Accelerator.