Introducing Mysocket.io

In this blog, I’d like to introduce a new project I’m calling Mysocket.io. Before we dive in, a bit of background.

Loyal readers know I enjoy building global infrastructure services that need to be able to carry a significant amount of traffic and a large number of requests. Building services like these requires solving several challenges: high availability, scaling, DDoS proofing, monitoring, logging, testing, deployments, user-facing and backend APIs, policy management (user preferences) and distribution, life-cycling, and so on. All this while keeping an eye on cost and keeping complexity to a minimum (complexity really being a human, operational cost).

To experiment with these topics, it’s necessary to have a project to anchor the experiments to; something I can continuously work on and, while doing so, improve as a whole. I’ve started many projects over time, but the one I’ve worked on most recently, and wanted to share with a wider audience, is mysocket.io: a service that provides secure public endpoints for services that are otherwise not publicly reachable.


A typical use case that mysocket.io can help with is a web service running on your laptop, which you’d like to make available to a team member or client. Another is SSH access to servers behind NAT or a firewall, like a Raspberry Pi on your home network or EC2 instances behind NAT.

Make your localhost app available from anywhere
Provide SSH access to your home server behind NAT.

More details

Alright, a good way to share more details is with a quick demo! You can see a brief overview in this video, or even better, try it yourself by simply following the four easy steps below.

If you’re interested or curious, feel free to give it a spin and let me know what worked or didn’t, or even better, how it can be improved. Getting started will take you just one minute. Just follow these simple steps.
#Install client, using python's package manager (pip)
pip3 install mysocketctl

#Create account
mysocketctl account create \
    --name "your_name" \
    --email "your_email_address" \
    --password "a_secure_password" \
    --sshkey "$(cat ~/.ssh/id_rsa.pub)"
    
#login
mysocketctl login  \
    --email "your_email_address" \
    --password "a_secure_password" 
    
 
#Launch your first global socket ;)
mysocketctl connect \
    --port 8000 \
    --name "my test service"

Architecture overview

Ok, so how does it work? The process for requesting a “global socket” starts with an API call. You can do this by directly interfacing with the RESTful API, or by using the mysocketctl tool. This returns a global mysocket object, which has a name, port number(s), and some other information.

Users can then use this socket object to create tunnel objects. These tunnels are used to connect your local service to the global mysocket.io infrastructure. By stitching these two TCP sessions together, we make your local service globally available.

Creating a Socket, a Tunnel and connecting to mysocket.io
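
To make that flow a bit more concrete, below is a rough sketch of the two-step version using mysocketctl. The subcommand names and flags shown here are illustrative assumptions, not verified syntax; check mysocketctl --help and the docs for the exact invocation. The point is the order of operations: create the socket, create a tunnel for it, then connect your local port.

#1. create a socket object (hypothetical syntax; returns a socket ID and DNS name)
mysocketctl socket create --name "my test service"

#2. create a tunnel for that socket (hypothetical syntax; returns a tunnel ID)
mysocketctl tunnel create --socket_id <socket_id>

#3. connect your local service on port 8000 to that tunnel (hypothetical syntax)
mysocketctl tunnel connect --socket_id <socket_id> --tunnel_id <tunnel_id> --port 8000

The one-liner "mysocketctl connect" from the quickstart wraps these steps into a single command.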

The diagram below provides a high-level overview of the service data plane. On the left, we have the origin service. This could be your laptop, your Raspberry Pi at home, or even a set of containers in a Kubernetes cluster. The origin service can be behind a very strict firewall or even NAT. All it needs is outbound network access. We can then set up a secure encrypted tunnel to any of the mysocket.io servers around the world.

Mysocket.io dataplane

Anycast

The Mysocket.io services use AWS Global Accelerator. With this, both the tunnel servers and proxy services are anycasted. This solves some of the load balancing and high availability challenges. The mysocket tunnel and proxy servers are located in North America, Europe, and Asia.

Once the tunnel is established, the connection event is signaled to all other nodes in real-time, ensuring that all edge nodes know where the tunnel for the service is.

Documentation

One of my goals is to make Mysocket super easy to use. One way to do that is to have good documentation. I invite you to check out our readthedocs.io documentation at https://mysocket.readthedocs.io/

It’s divided into two sections:

  1. General information about mysocket.io and some of the concepts.
  2. Information and user guides for the mysocketctl command-line tool.

The documentation and the mysocketctl tool are both open source, so feel free to open pull requests or issues if you have any questions.

You may have noticed there’s a website as well. I wanted to create a quick landing page, so I decided to play with Wix.com. They make it super easy; I may have gone overboard a bit ;) All of that was clicked together in just one evening, pretty neat.

More to come

There’s a lot more to tell and plenty more geeky details to dive into. More importantly, we can continue to build on this and make it even better (ping me if you have ideas or suggestions)!
So stay tuned. That’s the plan for subsequent Blog posts soon, either in this blog or the mysocket.io blog.

Cheers,
-Andree

AWS and their Billions in IPv4 addresses

Earlier this week, I was doing some work on AWS and wanted to know what IP addresses were being used. Luckily for me, AWS publishes this all here: https://ip-ranges.amazonaws.com/ip-ranges.json. When you go through this list, you’ll quickly see that AWS holds a massive amount of IPv4 space. Even a quick count shows a lot of big prefixes.
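
If you want to poke at that JSON yourself, curl and jq will get you a long way; the file is a list of prefix objects with an ip_prefix field (plus region and service):

#count the unique IPv4 prefixes AWS publishes today
curl -s https://ip-ranges.amazonaws.com/ip-ranges.json \
  | jq -r '.prefixes[].ip_prefix' | sort -u | wc -l

#or eyeball just the really big allocations (/8 through /11)
curl -s https://ip-ranges.amazonaws.com/ip-ranges.json \
  | jq -r '.prefixes[].ip_prefix' | sort -u | grep -E '/(8|9|10|11)$'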

However, the IPv4 ranges on that list are just the ranges that are in use and allocated today by AWS. Time to dig a bit deeper.

IPv4 address acquisitions by AWS

Over the years, AWS has acquired a lot of IPv4 address space. Most of this happens without gaining too much attention, but there were a few notable acquisitions that I’ll quickly summarize below.

2017: MIT selling 8 million IPv4 addresses to AWS

In 2017 MIT sold half of its 18.0.0.0/8 allocation to AWS. This 18.128.0.0/9 range holds about 8 million IPv4 addresses.

2018: GE sells 3.0.0.0/8 to AWS

In 2018 the IPv4 prefix 3.0.0.0/8 was transferred from GE to AWS. With this, AWS became the proud owner of its first /8! That’s sixteen million new IPv4 addresses to feed us hungry AWS customers. https://news.ycombinator.com/item?id=18407173

2019: AWS buys AMPRnet 44.192.0.0/10

In 2019 AWS bought a /10 from AMPR.org, the Amateur Radio Digital Communications (ARDC). The IPv4 range 44.0.0.0/8 was an allocation made to the Amateur Radio organization in 1981 and is known as the AMPRNet. This sale caused a fair bit of discussion; check out the NANOG discussion here.

Just this month, it became public knowledge that AWS paid $108 million for this /10. That’s $25.74 per IP address.

These are just a few examples. Obviously, AWS has way more IP addresses than the three examples I listed here. The IPv4 transfer market is very active. Check out this website to get a sense of all transfers: https://account.arin.net/public/transfer-log

All AWS IPv4 addresses

Armed with the information above, it was clear that not all of the AWS-owned ranges were in the JSON that AWS publishes. For example, parts of the 3.0.0.0/8 range are missing, likely because some of it is reserved for future use.

I did a bit of digging and tried to figure out how many IPv4 addresses AWS really owns. A good start is the JSON that AWS publishes. I then combined that with all the ARIN, APNIC, and RIPE entries for Amazon I could find. A few examples include:

https://rdap.arin.net/registry/entity/AMAZON-4
https://rdap.arin.net/registry/entity/AMAZO-4
https://rdap.arin.net/registry/entity/AT-88-Z

Combining all those IPv4 prefixes, then removing duplicates and overlaps by aggregating them, results in the following list of unique IPv4 prefixes owned by AWS: https://gist.github.com/atoonk/b749305012ae5b86bacba9b01160df9f#all-prefixes

The total number of IPv4 addresses in that list is just over 100 Million (100,750,168). That’s the equivalent of just over six /8’s, not bad!

If we break this down by allocation size, we see the following:

1x /8     => 16,777,216 IPv4 addresses
1x /9     => 8,388,608 IPv4 addresses
4x /10    => 16,777,216 IPv4 addresses
5x /11    => 10,485,760 IPv4 addresses
11x /12   => 11,534,336 IPv4 addresses
13x /13   => 6,815,744 IPv4 addresses
34x /14   => 8,912,896 IPv4 addresses
53x /15   => 6,946,816 IPv4 addresses
182x /16  => 11,927,552 IPv4 addresses
<and more>

A complete breakdown can be found here: https://gist.github.com/atoonk/b749305012ae5b86bacba9b01160df9f#breakdown-by-ipv4-prefix-size
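
If you want to reproduce a breakdown like this yourself, a few lines of shell will do once you have the aggregated prefixes in a file, say prefixes.txt, with one CIDR per line (the file name is just an example):

#count how many prefixes there are of each size
cut -d/ -f2 prefixes.txt | sort -n | uniq -c

#total number of IPv4 addresses covered by the list
awk -F/ '{ total += 2^(32-$2) } END { printf "%.0f\n", total }' prefixes.txt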

Putting a valuation on AWS’ IPv4 assets

Alright.. this is just for fun…

Since AWS is (one of) the largest buyers of IPv4 addresses, they have spent a significant amount on stacking up their IPv4 resources. It’s impossible, as an outsider, to know how much AWS paid for each deal. However, we can, for fun, try to put a dollar number on AWS’ current IPv4 assets.

The average price for IPv4 addresses has gone up over the years, from ~$10 per IP a few years back to ~$25 per IP nowadays.
Note that these are market prices, so if AWS were to suddenly decide to sell its IPv4 addresses and overwhelm the market with supply, prices would drop. But that won’t happen since we’re all still addicted to IPv4 ;)

Anyway, let’s stick with $25 and do the math just for fun.

100,750,168 ipv4 addresses x $25 per IP = $2,518,754,200

Just over $2.5 billion worth of IPv4 addresses, not bad!

Peeking into the future

It’s clear AWS is working hard behind the scenes to make sure we can all continue to build more on AWS. One final question we could look at is: how much buffer does AWS have? ie. how healthy is their IPv4 reserve?

According to their published data, they have allocated roughly 53 million IPv4 addresses to existing AWS services. We found that all their IPv4 addresses combined equate to approximately 100 million. That means they still have ~47 million IPv4 addresses, or 47%, available for future allocations. That’s pretty healthy! And on top of that, I’m sure they’ll continue to source more IPv4 addresses. The IPv4 market is still hot!

100G networking in AWS, a network performance deep dive

Loyal readers of my blog will have noticed a theme, I’m interested in the continued move to virtualized network functions, and the need for faster networking options on cloud compute. In this blog, we’ll look at the network performance on the juggernaut of cloud computing, AWS.

AWS is the leader in the cloud computing world, and many companies now run parts of their services on AWS. The question we’ll try to answer in this article is: how well suited is AWS’ EC2 for high-throughput network functions?

I’ve decided to experiment with adding a short demo video to this blog. Below you will find a quick demo and summary of this article. Since these videos are new and a bit of an experiment, let me know if you like it.

100G networking

It’s already been two years since AWS announced the C5n instances, featuring 100 Gbps networking. I’m not aware of any other cloud provider offering 100G instances, so this is pretty unique. Ever since this was released I wondered exactly what, if any, the constraints were. Can I send/receive 100g line rate (144Mpps)? So, before we dig into the details, let’s just check if we can really get to 100Gbs.

100gbs testing.. We’re gonna need a bigger boat..

There you have it, I was able to get to 100Gbps between two instances! That’s exciting. But there are a few caveats. We’ll dig into all of them in this article, with the aim to understand exactly what’s possible, what the various limits are, and how to get to 100G.

Understand the limits

Network performance on Linux is typically a function of a few parameters: most notably, the number of TX/RX queues available on the NIC (network card), the number of CPU cores (ideally at least equal to the number of queues), the pps (packets per second) limit per queue, and finally, in virtual environments like AWS and GCP, potential admin limits on the instance.


Doing networking in software means that processing a packet (or a batch of them) uses a number of CPU cycles. It’s typically not relevant how many bytes are in a packet. As a result, the best metric to look at is the pps number (related to our CPU cycle budget). Unfortunately, the pps performance numbers for AWS aren’t published, so we’ll have to measure them in this blog. With that, we should have a much better understanding of the network possibilities on AWS, and hopefully, this saves someone else a lot of time (this took me several days of measuring) ;)

Network queues per instance type

The table below shows the number of NIC queues by ec2 (c5n) Instance type.

Number of ENA queues per c5n instance type

In the world of ec2, 16 vCPUs on the C5n 4xl instance means 1 Socket, 8 Cores per socket, 2 Threads per core.

On AWS, an Elastic Network Adapter (ENA) NIC has as many queues as you have vCPUs, though it caps at 32 queues, as you can see with the C5n 9xl and C5n 18xl instances.

Like many things in computing, to make things faster, work is parallelized. We see this clearly when looking at CPU capacity: we’re adding more cores, and programs are written in such a way that they can leverage the many cores in parallel (multi-threaded programs).

Scaling networking performance on our servers is done in largely the same way. It’s hard to make a single worker significantly faster, but it is easy to add more ‘workers’, especially if the performance is bound by our CPU capacity. In the world of NICs, these ‘workers’ are queues. Traffic sent and received by a host is load-balanced over the available network queues on the NIC. This load balancing is done by hashing, typically over the 5-tuple: protocol, source and destination address, and source and destination port. Something you’re likely familiar with from ECMP.

So queues on a NIC are like lanes on a highway: the more lanes, the more cars can travel the highway. The more queues, the more packets (flows) can be processed.
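
On a Linux instance you can see how many of those lanes (queues) you have, and how flows are spread over them, with ethtool. The interface name ens5 below is just an example; it’s typically what an ENA NIC shows up as, but yours may differ.

#number of combined RX/TX queues (channels) the NIC exposes
ethtool -l ens5

#the RSS indirection table and hash key used to spread flows over queues
ethtool -x ens5

#per-queue packet counters, handy for checking whether traffic is balanced
ethtool -S ens5 | grep -i queue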

Test one, ENA queue performance

As discussed, the network performance of an instance is a function of the number of available queues and CPUs. So let’s start with measuring the maximum performance of a single flow (queue) and then scale up and measure the pps performance.

In this measurement, I used two c5n.18xlarge ec2 instances in the same subnet and the same placement group. The sender is using DPDK-pktgen (igb_uio). The receiver is a stock Ubuntu 20.04 LTS instance, using the ena driver.

The table below shows the TX and RX performance between the two c5n.18xlarge ec2 instances for one and two flows.

TX and RX pps between two c5n.18xlarge instances, for one and two flows

With this, it seems the per-queue limit is about 1Mpps. Typically the per-queue limit exists because a single queue (soft IRQ) is served by a single CPU core, meaning the per-queue performance is limited by how many packets per second a single CPU core can process. So again, what you typically see in virtualized environments is that the number of network queues goes up with the number of cores in the VM. In EC2 this is the same, though it maxes out at 32 queues.

Test two, RX only pps performance

Now that we determined that the per-queue limit appears to be roughly one million packets per second, it’s natural to assume that this number scales up horizontally with the number of cores and queues. For example, the C5n 18xl comes with 32 NIC queues and 72 cores, so in theory, we could naively assume that the (RX/TX) performance (given enough flows) should be 32Mpps. Let’s go ahead and validate that.

The graph below shows the transmit (TX) performance as measured on a c5n.18xlarge. In each measurement, I gave the packet generator one more queue and vCPU to work with, starting with one TX queue and one vCPU and incrementing by one in each measurement until we reached 32 vCPUs and 32 queues (the max). The results show that the per-TX-queue performance varied between 1Mpps and 700Kpps. The maximum total TX performance I was able to get, however, was ~8.5Mpps using 12 TX queues. After that, adding more queues and vCPUs didn’t matter, or actually degraded the performance. So this indicates that the performance scales horizontally (per queue), but does max out at a certain point (which varies per instance type), in this case at 8.5Mpps.

c5n.18xlarge per TX queue performance

In this next measurement, we’ll use two packet generators and one receiver. I’m using two generators, just to make sure the limit we observed earlier isn’t caused by limitations on the packet generator. Each traffic generator is sending many thousands of flows, making sure we leverage all the available queues.

RX pps per C5N instance type

Alright, after a few minutes of reading (and many, many hours, well really days, of measurements on my end), we now have a pretty decent idea of the performance numbers. We know how many queues each of the various c5n instance types has.

We have seen that the per queue limit is roughly 1Mpps. And with the table above, we now see how many packets per second each instance is able to receive (RX).

Forwarding performance

If we want to use ec2 for virtual network functions, then just receiving traffic isn’t enough. A typical router or firewall should both receive and send traffic at the same time. So let’s take a look at that.

For this measurement, I used the following setup. Both the traffic generator and receiver were C5n-18xl instances. The Device Under Test (DUT) was a standard Ubuntu 20.04 LTS instance using the ena driver. Since the earlier observed pps numbers weren’t too high, I determined it’s safe to use the regular Linux kernel to forward packets.

test setup
pps forwarding performance

The key takeaway from this measurement is that the TX and RX numbers are similar to what we’d seen before for the instance types up to (and including) the C5n 4xl. For example, earlier we saw the C5n 4xl could receive up to ~3Mpps. This measurement shows that it can do ~3Mpps simultaneously on RX and TX.

However, if we look at the C5n 9xl, we can see it was able to process about 6.2Mpps of combined RX + TX. Interestingly, earlier we saw it was also able to receive (RX only) ~6Mpps. So it looks like we hit some kind of aggregate limit. We observed a similar limit for the C5n 18xl instance.

In Summary.

In this blog, we looked at the various performance characteristics of networking on ec2. We determined that the performance of a single queue is roughly 1Mpps. We then saw how the number of queues goes up with the higher-end instances, up to a maximum of 32 queues.

We then measured the RX performance of the various instances as well as the forwarding (RX + TX aggregate) performance. Depending on the measurement setup (RX, or TX+RX), we see that for the largest instance types, the pps performance maxes out at roughly 6.6Mpps to 8.3Mpps. With that, I think the C5n 9xl hits the sweet spot in terms of cost vs. performance.

So how about that 100G test?

Ah yes! So far, we talked about pps only. How does that translate to gigabits per second?
Let’s look at the quick table below that shows how the pps number translates to Gbps at various packet sizes.

pps needed for 10G line rate at various packet sizes

These are a few examples of what it takes to get to 10G at various packet sizes. This shows that in order to support line-rate 10G at the smallest packet size, the system will need to be able to do ~14.88Mpps. The 366-byte packet size is roughly the equivalent average of what you’ll see with an IMIX test, for which the system needs to be able to process ~3.4Mpps to get to 10G line rate.

If we look at the same table but for 100Gbps, we see that at the smallest packet size, an instance would need to be able to process over 148Mpps. But using 9k jumbo frames, you only need 1.39Mpps.

pps needed for 100G line rate at various packet sizes
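
The math behind both tables is straightforward: on the wire, every Ethernet frame carries roughly 20 extra bytes of overhead (preamble plus inter-frame gap), so throughput in bits per second is pps x (packet size + 20) x 8. A quick awk sketch to play with the numbers:

#Gbps for a given pps and packet size: 64-byte packets at 148.8Mpps is ~100G
awk -v pps=148.8e6 -v size=64 'BEGIN { printf "%.1f Gbps\n", pps*(size+20)*8/1e9 }'

#pps needed for 100G with 9000-byte jumbo frames (~1.39Mpps)
awk -v gbps=100 -v size=9000 'BEGIN { printf "%.2f Mpps\n", gbps*1e9/((size+20)*8)/1e6 }'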

And so, that’s what you need to do to get to 100G networking in ec2. Use Jumbo frames (supported in ec2, in fact, for the larger instances, this was the default). With that and a few parallel flows you’ll be able to get to 100G “easily”!

A few more limits

There is one more limitation I read about while researching but didn’t look into myself: it appears that some of the instances have time-based limits on the performance. This blog calls it Guaranteed vs. Best Effort. Basically, you’re allowed to burst for a while, but after a certain amount of time, you’ll get throttled. Finally, there is a per-flow limit of 10Gbps. So if you’re doing things like IPsec, GRE, VXLAN, etc., note that a single flow will never go any faster than 10G.

Closing thoughts

Throughout this blog, I mentioned the word ‘limits’ quite a bit, which has a bit of a negative connotation. However, it’s important to keep in mind that AWS is a multi-tenant environment, and it’s their job to make sure the user experience is as close as possible to that of a dedicated instance. So you can also think of these as ‘guarantees’. AWS won’t call them that, but in my experience, the throughput tests have been pretty reproducible with, say, a +/- 10% measurement margin.

All in all, it’s pretty cool to be able to do 100G on AWS. As long as you are aware of the various limitations, which unfortunately aren’t well documented. Hopefully, this article helps some of you with that in the future.
Finally, could you use AWS to run your virtual firewalls, proxies, VPN gateways, etc.? Sure, as long as you’re aware of the performance constraints and, with those in mind, build a horizontally scalable design according to AWS best practices. The one thing you really do need to keep an eye on is the (egress) bandwidth pricing, which, once you start doing many gigabits per second, can add up.

Cheers
- Andree


Building a global anycast service in under a minute

This weekend I decided to take another look at Stackpath and their workload edge compute features. This is a relatively new feature; in fact, I wrote about it in Feb 2019 when it was just released. I remember being quite enthusiastic about the potential but also observed some things that were lacking back then. Now, one and a half years later, it seems most of those have been resolved, so let’s take a look!

I’ve decided to experiment with adding a small demo video to these blogs.
Below you will find a quick 5min demo of the whole setup. Since these videos are new and a bit of an experiment, let me know if you like it.
Demo: Building a global anycast service in under a minute

Workloads

Stackpath supports two types of workloads (in addition to serverless): VM and container-based deployments. Both can be orchestrated using APIs and Terraform. Terraform is an “infrastructure as code” tool: you simply specify your intent with Terraform, apply it, and you’re good to go. I’m a big fan of Terraform, so we’ll use that for our test.

One of the cool things about Stackpath is that they have built-in support for anycast, for both their VM and container service. I’m going to use that feature and the container service to build this highly available, low latency web service. It’s super easy; see for yourself on my GitHub here.

Docker setup

Since I’m going to use the container service, we need to create a Docker container to work with. This is my Dockerfile

FROM python:3
WORKDIR /usr/src/app
COPY ./mywebserver.py .
EXPOSE 8000
ENV PYTHONUNBUFFERED 1
CMD [ "python", "./mywebserver.py" ]

The mywebserver.py program is a simple web service that prints the hostname environment variable. This will help us determine which node is servicing our request when we start our testing.

After I built the container, I uploaded it to my Dockerhub repo, so that Stackpath can pull it from there.

Terraform

Now it’s time to define our infrastructure using terraform. The relevant code can be found on my github here. I’ll highlight a few parts:

On line 17 we start with defining a new workload, and I’m requesting an Anycast IP for this workload. This means that Stackpath will load balance (ECMP) between all nodes in my workload (which I’m defining later).

resource "stackpath_compute_workload" "my-anycast-workload" {   
    name = "my-anycast-workload"
    slug = "my-anycast-workload"   
    annotations = {       
        # request an anycast IP       
        "anycast.platform.stackpath.net" = "true"   
    }

On line 31, we define the type of workload, in this case, a container. As part of that we’re opening the correct ports, in my case port 8000 for the python service.

container {   
    # Name that should be given to the container   
    name = "app"   
    port {      
        name = "web"      
        port = 8000      
        protocol = "TCP"      
        enable_implicit_network_policy = true   
    }

Next up we define the container we’d like to deploy (from Dockerhub)

# image to use for the container
image = "atoonk/pythonweb:latest"

In the resources section, we define the container specifications. In my case, I’m going with a small spec of one CPU core and 2G of RAM.

resources {
   requests = {
      "cpu" = "1"
      "memory" = "2Gi"
   }
}

We now get to the section where we define how many containers we’d like per datacenter and in what datacenters we’d like this service to run.

In the example below, we’re deploying three containers in each datacenter, with the possibility to grow to four as part of auto-scaling. We’re deploying this in both Seattle and Dallas.

target {
    name         = "global"
    min_replicas = 3
    max_replicas = 4
    scale_settings {
      metrics {
        metric = "cpu"
        # Scale up when CPU averages 50%.
        average_utilization = 50
      }
    }
    # Deploy these instances to Dallas and Seattle
    deployment_scope = "cityCode"
    selector {
      key      = "cityCode"
      operator = "in"
      values   = [
        "DFW", "SEA"
      ]
    }
  }

Time to bring up the service.

Now that we’ve defined our intent with terrraform, it’s time to bring this up. The proper way to do this is:

terraform init
terraform plan
terraform apply
After that, you’ll see the containers come up, and our anycasted python service will become available. Since the containers come up rather quickly, you should have all six containers in the two datacenters up and running in under a minute.

Testing the load balancing.

I’ve deployed the service in both Seattle and Dallas, and since I am based in Vancouver Canada, I expect to hit the Seattle datacenter as that is the closest datacenter for me.

$ for i in `seq 1 10`; do curl 185.85.196.41:8000 ; done

my-anycast-workload-global-sea-2
my-anycast-workload-global-sea-0
my-anycast-workload-global-sea-2
my-anycast-workload-global-sea-0
my-anycast-workload-global-sea-1
my-anycast-workload-global-sea-1
my-anycast-workload-global-sea-2
my-anycast-workload-global-sea-1
my-anycast-workload-global-sea-2
my-anycast-workload-global-sea-0

The results above show that I am indeed hitting the Seattle datacenter, and that my requests are being load balanced over the three instances in Seattle, all as expected.
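
If you want an extra sanity check that anycast is really routing you to the nearest site, curl’s timing variables are a quick way to look at the TCP connect time, which should roughly match the RTT to the closest PoP:

#TCP connect time to the anycast IP; from Vancouver this should be Seattle-ish, a few ms
for i in `seq 1 5`; do curl -s -o /dev/null -w "connect: %{time_connect}s\n" 185.85.196.41:8000 ; done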

In the portal, I can see the per container logs as well

In Summary

Compared to my test last year with Stackpath, there has been a nice amount of progress. It’s great to now be able to do all of this with just a Terraform file. It’s kind of exciting that you can bring up a fully anycasted service in under a minute with only one command! By changing the replica numbers in the Terraform file, we can also easily grow and shrink our deployment if needed.
In this article we looked at the container service only, but the same is possible with virtual machines; my GitHub repo has an example for that as well.

Finally, don’t forget to check the demo recording and let me know if you’d like to see more video content.

Building an XDP (eXpress Data Path) based BGP peering router

Over the last few years, we’ve seen an increase in projects and initiatives to speed up networking in Linux. Because the Linux kernel is slow when it comes to forwarding packets, folks have been looking at userland or kernel-bypass networking. In the last few blog posts, we’ve looked at examples of this, mostly leveraging DPDK to speed up networking. The trend here is: let’s just take networking away from the kernel and process packets in userland. Great for speed, not so great for all the kernel network stack features that now have to be re-implemented in userland.

The Linux kernel community has recently come up with an alternative to userland networking, called XDP (eXpress Data Path), which tries to strike a balance between the benefits of the kernel and faster packet processing. In this article, we’ll take a look at what it would take to build a Linux router using XDP. We will go over what XDP is, how to build an XDP packet forwarder combined with a BGP router, and of course, look at the performance.

XDP (eXpress Data Path)

XDP (eXpress Data Path) is an eBPF-based high-performance data path that has been part of the Linux kernel since version 4.8. Yes, BPF, the same Berkeley Packet Filter you’re likely familiar with from tcpdump filters, though that’s now referred to as classic BPF. Extended BPF (eBPF) has gained a lot of popularity over the last few years within the Linux community. BPF allows you to attach programs to Linux kernel hook points; each time the kernel reaches one of those hook points, it can execute an eBPF program. I’ve heard some people describe eBPF as what JavaScript was for the web: an easy way to enhance the ‘web’, or in this case, the kernel. With BPF you can execute code without having to write kernel modules.

XDP, as part of the BPF family, operates early on in the kernel network code. The idea behind XDP is to add an early hook in the RX path of the kernel and let a user-supplied eBPF program decide the fate of the packet. The hook is placed in the NIC driver just after the interrupt processing and before any memory allocation needed by the network stack itself. So all this happens before an SKB (the most fundamental data structure in the Linux networking code) is allocated. Practically, this means it is executed before things like tc and iptables.

A BPF program runs in a small virtual machine; perhaps not the typical virtual machine you’re familiar with, but a tiny, isolated RISC register machine. Since it’s running in conjunction with the kernel, there are some protective measures that limit how much code can be executed and what it can do. For example, it cannot contain unbounded loops (only bounded loops), and there is a limited set of eBPF instructions and helper functions. The maximum instruction limit per program is restricted to 4096 BPF instructions, which, by design, means that any program will terminate quickly. For kernels newer than 5.1, this limit was raised to 1 million BPF instructions.

When and Where is the XDP code executed

XDP programs can be attached at three different points. The fastest is to have the program run on the NIC itself; for that, you need a SmartNIC, and this is called offload mode. To the best of my knowledge, this is currently only supported on Netronome cards. The next attachment opportunity is essentially in the driver, before the kernel allocates an SKB. This is called “native” mode and means your driver needs to support it; luckily, most popular drivers do nowadays.

Finally, there is SKB or generic mode XDP, where the XDP hook is called from netif_receive_skb(). This is after the packet DMA and SKB allocation are completed; as a result, you lose most of the performance benefits.

Assuming you don’t have a smartnic, the best place to run your XDP program is in native mode as you’ll really benefit from the performance gain.

XDP actions

Now that we know that XDP code is an eBPF C program and we understand where it can run, let’s take a look at what you can do with it. Once the program is called, it receives the packet context, and from that point on you can read the content, update some counters, and potentially modify the packet. Then the program needs to terminate with one of five XDP actions:

XDP_DROP
This does exactly what you think it does; it drops the packet and is often used for XDP based firewalls and DDoS mitigation scenarios.
XDP_ABORTED
Similar to DROP, but indicates something went wrong when processing. This action is not something a functional program should ever use as a return code.
XDP_PASS
This will release the packet and send it up to the kernel network stack for regular processing. This could be the original packet or a modified version of it.
XDP_TX
This action results in bouncing the received packet back out the same NIC it arrived on. This is usually combined with modifying the packet contents, for example, rewriting the IP and MAC addresses, such as for a one-legged load balancer.
XDP_REDIRECT
The redirect action allows a BPF program to redirect the packet somewhere else, either a different CPU or a different NIC. We’ll use this function later to build our router. It is also used to implement AF_XDP, a new socket family that solves the high-speed packet acquisition problem often faced by virtual network functions. AF_XDP is, for example, used by IDSs and is now also supported by Open vSwitch.

Building an XDP based high performant router

Alright, now that we have a better idea of what XDP is and some of its capabilities, let’s start building! My goal is to build an XDP program that forwards packets at line rate between two 10G NICs. I also want the program to use the regular Linux routing table. This means I can add static routes using the “ip route” command, but it also means I could use an open-source BGP daemon such as Bird or FRR.

We’ll jump straight to the code. I’m using the excellent XDP tutorial code to get started. I forked it here, but it’s mostly the same code as the original. This is an example called “xdp_router” and uses the bpf_fib_lookup() function to determine the egress interface for a given packet using the Linux routing table. The program then uses the action bpf_redirect_map() to send it out to the correct egress interface. You can see code here. It’s only a hundred lines of code to do all the work.

After we compile the code (just run make in the parent directory), we load the code using the ./xdp_loader program included in the repo and use the ./xdp_prog_user program to populate and query the redirect_params maps.

#pin BPF resources (redirect map) to a persistent filesystem
mount -t bpf bpf /sys/fs/bpf/

# attach xdp_router code to eno2
./xdp_loader -d eno2 -F --progsec xdp_router

# attach xdp_router code to eno4
./xdp_loader -d eno4 -F --progsec xdp_router

# populate redirect_params maps
./xdp_prog_user -d eno2
./xdp_prog_user -d eno4

Test setup

So far, so good, we’ve built an XDP based packet forwarder! For each packet that comes in on either network interface eno2 or eno4, it does a route lookup and redirects it to the correct egress interface, all in eBPF code. All in a hundred lines of code. Pretty awesome, right?! Now let’s measure the performance to see if it’s worth it. Below is the test setup.

test setup

I’m using the same traffic generator as before to generate 14Mpps at 64Bytes for each 10G link. Below are the results:

XDP forwarding Test results

The results are amazing! A single flow in one direction can go as high as 4.6 Mpps, using one core. Earlier, we saw the Linux kernel can go as high as 1.4Mpps for one flow using one core.

14Mpps in one direction between the two NICs require four cores. Our earlier blog showed that the regular kernel would need 16 cores to do this work!

Test result — XDP forwarding using XDP_REDIRECT, 5 cores to forward 29Mpps

Finally, for the bidirectional 10,000 flow test, forwarding 28Mpps, we need five cores. All tests are significantly faster than forwarding packets using the regular kernel and all that with minor changes to the system.

Just so you know

Since all packet forwarding happens in XDP, packets redirected by XDP won’t be visible to iptables or even tcpdump. Everything happens before packets even reach that layer, and since we’re redirecting the packet, it never moves higher up the stack. So if you need features like ACLs or NAT, you will have to implement them in XDP (take a look at https://cilium.io/).

A word on measuring CPU usage.
To control and measure the number of CPU cores used by XDP, I’m changing the number of queues the NIC can use. I increase the number of queues on my XL710 Intel NIC incrementally until I get a packet-loss-free transfer between the two ports on the traffic generator. For example, to get 14Mpps in one direction from port 0 to 1 on the traffic generator through our XDP router, which was forwarding between eno2 and eno4, I used the following settings:

ethtool -L eno2 combined 4
ethtool -L eno4 combined 4

For the 28Mpps testing, I used the following

ethtool -L eno2 combined 4
ethtool -L eno4 combined 4

A word of caution
Interestingly, increasing the number of queues, and thus using more cores, appears to, in some cases, have a negative impact on efficiency. I.e., I’ve seen scenarios when using 30 queues where the unidirectional 14Mpps test with 10,000 flows appears to use almost no CPU (between 1 and 2 cores) while the same test run bidirectionally uses up all 30 cores. When restarting this test, I see some inconsistent behavior in terms of CPU usage, so I’m not sure what’s going on; I will need to spend a bit more time on this later.

XDP as a peering router

The tests above show promising results, but one major difference between a simple forwarding test and a real-life peering router is the number of routes in the forwarding table. So the question we needed to answer was how the bpf_fib_lookup function performs when there are more than just a few routes in the routing table. More concretely, could you use Linux with XDP as a full-table peering router?
To answer this question, I installed Bird as a BGP daemon on the XDP router. Bird has a peering session with an ExaBGP instance, which I loaded with a full routing table using mrt2exabgp.py and MRT files from RIPE RIS.
Just to be a real peering router, I also filtered out the RPKI invalid routes using rtrsub. The end result is a full routing table with about 800k routes in the Linux FIB.

Test result — XDP router with a full routing table. 5 cores to forward 28Mpps

After re-running the performance tests with 800k BGP routes in the FIB, I observed no noticeable decrease in performance.
This indicates that a larger FIB table has no measurable impact on the XDP helper bpf_fib_lookup(). This is exciting news for those interested in a cheap and fast peering router.

Conclusion and closing thoughts.

We started the article with a quick introduction to eBPF and XDP. We learned that XDP is a subset of the recent eBPF developments, focused specifically on hooks in the network stack. We went over the different XDP actions and introduced the redirect action, which, together with the bpf_fib_lookup helper, allows us to build the XDP router.

Looking at the performance, we see that we can speed up packet forwarding in Linux by roughly five times in terms of CPU efficiency compared to regular kernel forwarding. We observed that we needed about five cores to forward 28Mpps bidirectionally between two 10G NICs.

When we compare these results with the results from my last blog, DPDK and VPP, we see that XDP is slightly slower, i.e., 3 cores (VPP) vs. 5 cores (XDP) for the 28Mpps test. However, the nice part about working with XDP was that I was able to leverage the Linux routing table out of the box, which is a major advantage.

The exciting part is that this setup integrates natively with Netlink, which allowed us to use Bird, or really any other routing daemon, to populate the FIB. We also saw that the impact of 800K routes in the fib had no measurable impact on the performance.

The fib_lookup helper function allowed us to build a router and leverage well-known userland routing daemons. I would love to also see a similar helper function for conntrack, or perhaps some other integration with Netfilter. It would make building firewalls and perhaps even NAT a lot easier: punt the first packet to the kernel, and let XDP handle the subsequent packets.
Wrapping up, we started with the question: can we build a high-performance peering router using XDP? The answer is yes! You can build a high-performance peering router using just Linux, relying on XDP to accelerate the dataplane while leveraging the various open-source routing daemons to run your routing protocols. That’s exciting!
Cheers
-Andree

Kernel bypass networking with FD.io and VPP.

Over the last few years, I have experimented with various flavors of userland, kernel-bypass networking. In this article, we’ll take FD.IO for a spin.

We will compare the results with those from my last blog, in which we looked at how much a vanilla Linux kernel could do in terms of forwarding (routing) packets. We observed that on Linux, to achieve 14Mpps, we needed roughly 16 and 26 cores for a unidirectional and bidirectional test, respectively. In this article, we’ll look at what we need to accomplish this with FD.io.

Userland networking

The principle of userland networking is that the networking stack is no longer handled by the kernel, but instead by a userland program. The Linux kernel is incredibly feature-rich, but for fast networking, it also requires a lot of cores to deal with all the (soft) interrupts. Several of the userland networking projects rely on DPDK to achieve incredible numbers. One reason why DPDK is so fast is that it doesn’t rely on interrupts. Instead, it uses poll mode drivers, meaning it’s continuously spinning at 100% picking up packets from the NIC. A typical server nowadays comes with quite a few CPU cores, and dedicating one or more cores to picking packets off the NIC is, in some cases, entirely worth it. Especially if the server needs to process lots of network traffic.

So DPDK provides us with the ability to send and receive packets efficiently and extremely fast. But that’s also it! Since we’re not using the kernel, we now need a program that takes the packets from DPDK and does something with them, like, for example, a virtual switch or router.

FD.IO


FD.IO is an open-source software dataplane developed by Cisco. At the heart of FD.io is something called Vector Packet Processing (VPP).

The VPP platform is an extensible framework that provides switching and routing functionality. VPP is built on a ‘packet processing graph.’ This modular approach means that anyone can ‘plugin’ new graph nodes. This makes extensibility rather simple, and it means that plugins can be customized for specific purposes.

FD.io can use DPDK as the driver for the NIC and can then process packets at a very high rate on commodity CPUs. It’s important to remember that it is not a fully-featured router, i.e., it doesn’t really have a control plane; instead, it’s a forwarding engine. Think of it as a router line card, with the NIC and the DPDK drivers as the ports. VPP allows us to take a packet from one NIC to another, transform it if needed, do table lookups, and send it out again. There are APIs that allow you to manipulate the forwarding tables, or you can use the CLI to, for example, configure static routes, VLANs, VRFs, etc.
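
For example, adding a static route and checking the FIB from the shell looks roughly like this (a sketch; interface names follow VPP’s PCI-based naming, and the exact CLI syntax can vary a little between VPP releases):

#add a static route via a next-hop on a connected interface
vppctl ip route add 192.0.2.0/24 via 10.10.10.3

#inspect the resulting forwarding table
vppctl show ip fib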

Test setup

I’ll use mostly the same test setup as in my previous test. Again using two n2.xlarge.x86 servers from packet.com and our DPDK traffic generator. The set up is as below.

Test setup

I’m using the VPP code from the FD.io master branch and installed it on a vanilla Ubuntu 18.04 system following these steps.

Test results — Packet forwarding using VPP

Now that we have our test setup ready to go, it’s time to start our testing!
To start, I configured VPP with “vppctl” like this; note that I needed to set static ARP entries since the packet generator doesn’t respond to ARP.

set int ip address TenGigabitEthernet19/0/1 10.10.10.2/24
set int ip address TenGigabitEthernet19/0/3 10.10.11.2/24
set int state TenGigabitEthernet19/0/1 up
set int state TenGigabitEthernet19/0/3 up
set ip neighbor TenGigabitEthernet19/0/1 10.10.10.3 e4:43:4b:2e:b1:d1
set ip neighbor TenGigabitEthernet19/0/3 10.10.11.3 e4:43:4b:2e:b1:d3

That’s it! Pretty simple right?

Ok, time to look at the results. Just like before, we did a single flow test, both unidirectional and bidirectional, as well as a 10,000 flow test.

VPP forwarding test results

Those are some remarkable numbers! With a single flow, VPP can process and forward about 8Mpps, not bad. The perhaps more realistic test with 10,000 flows shows us that it can handle 14Mpps with just two cores. To get to a full bidirectional scenario where both NICs are sending and receiving at line rate (28Mpps per NIC), we need three cores and three receive queues on the NIC. To achieve this last scenario with Linux, we needed approximately 26 cores. Not bad, not bad at all!

Traffic generator on the left, VPP server on the right. This shows the full line-rate bidirectional test: 14Mpps per NIC, while VPP uses 3 cores.

Test results — NAT using VPP

In my previous blog, we saw that when doing SNAT on Linux with iptables, we got as high as 3Mpps per direction, needing about 29 CPUs per direction. This showed us that packet rewriting is significantly more expensive than just forwarding. Let’s take a look at how VPP does NAT.

To enable NAT on VPP, I used the following commands:

nat44 add interface address TenGigabitEthernet19/0/3
nat addr-port-assignment-alg default
set interface nat44 in TenGigabitEthernet19/0/1 out TenGigabitEthernet19/0/3 output-feature

My first test is with one flow only, in one direction. With that, I’m able to get 4.3Mpps. That’s exactly half of what we saw in the performance test without NAT. It’s no surprise this is slower due to the additional work needed. Note that with Linux iptables I was seeing about 1.1Mpps.

A single flow isn’t super representative of a real-life NAT example, where you’d be translating many sources. So for the next measurements, I’m using 255 different source IP addresses and 255 destination IP addresses, as well as different port numbers; with this setup, the NAT code is seeing about 16k sessions. I can now see the numbers go to 3.2Mpps; more flows mean more NAT work. Interestingly, this number is exactly the same as I saw with iptables. There is, however, one big difference: with iptables the system was using about 29 cores, while in this test I’m only using two cores. That’s a low number of workers, and also the reason I’m capped. To remove that cap, I added more cores and validated that the VPP code scales horizontally. Eventually, I needed 12 cores to run 14Mpps for a stable experience.

VPP forwarding with NAT test results

Below is the relevant VPP config to control the number of cores used by VPP. Also, I should note that I isolated the cores I allocated to VPP so that the kernel wouldn’t schedule anything else on them.

cpu {
    main-core 1
    # CPU placement:
    corelist-workers 3-14
    # Also added this to grub: isolcpus=3-31,34-63
}
dpdk {
   dev default {
      # RSS, number of queues
      num-rx-queues 12
      num-tx-queues 12
      num-rx-desc 2048
      num-tx-desc 2048
   }
   dev 0000:19:00.1
   dev 0000:19:00.3
}
plugins {
   plugin default { enable }
   plugin dpdk_plugin.so { enable }
}
nat {
   endpoint-dependent
   translation hash buckets 1048576
   translation hash memory 268435456
   user hash buckets 1024
   max translations per user 10000
 }

Conclusion

Photo by National Cancer Institute on Unsplash

In this blog, we looked at VPP from the FD.io project as a userland forwarding engine. VPP is one example of a kernel bypass method for processing packets. It works closely with and further augments DPDK.

We’ve seen that the VPP code is feature-rich, especially for a kernel bypass packet forwarder. Most of all, it’s crazy fast.

We need just three cores to have two NICs forward full line rate (14Mpps) in both directions. Comparing that to the Linux kernel, which needed 26 cores, we see an almost 9x increase in performance.
We noticed that the results were even better when using NAT. In Linux, I wasn’t able to get any higher than 3.2Mpps, for which I needed about 29 cores. With VPP we can do 3.2Mpps with just two cores and get to full line-rate NAT with 12 cores.

I think FD.io is an interesting and exciting project, and I’m a bit surprised it’s not more widely used. One of the reasons is likely that there’s a bit of a learning curve. But if you need high-performance packet forwarding, it’s certainly something to explore! Perhaps this is the start of your VPP project? If so, let me know!

Cheers
-Andree

Linux Kernel and Measuring network throughput.

In my last blog, I wrote about how we can use DPDK pktgen for performance testing. Today I spent some time on baseline testing to see what we can expect out of a vanilla Linux system nowadays when used as a router. Over the last two years, I’ve been playing a fair bit with kernel bypass networking and hope to write about it in the near future. The promise of kernel bypass networking is higher performance; to determine how much of a performance increase it offers over the kernel, we need to establish a baseline first, which we’ll do in this article.

Test setup

n2.xlarge.x86 CPU specs.

I’m using two n2.xlarge.x86 servers from packet.com. With its two Numa nodes, 16cores per socket, 32 cores in total, 64 with hyper-threading, this is a very beefy machine! It also comes with a quad-port Intel x710 NIC, giving us 4 x 10Gbs. Packet allows you to create custom vlans and assign network ports to a vlan. I’ve created two vlans and assigned one NIC to each vlan. The setup looks like below.

Test setup

The Device Under Test (DUT) is a vanilla Ubuntu 19.04 system running a 5.0.0-38-generic kernel. The only minor tuning I’ve done is to set the NIC RX ring to 4096, and I enabled IP forwarding (net.ipv4.ip_forward=1).
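
For reference, that tuning boils down to a couple of commands (eno2 and eno4 stand in for whichever interfaces face the traffic generator):

#grow the RX ring on both test interfaces
ethtool -G eno2 rx 4096
ethtool -G eno4 rx 4096

#enable IPv4 forwarding
sysctl -w net.ipv4.ip_forward=1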

Using the traffic generator, I’m sending as many packets as possible and observing when packets stop coming back at the same rate, which indicates packet loss. I record the point where that happens as the maximum throughput. I’m also keeping a close eye on the CPU usage, to get a sense of how many CPU cores (hyper-threads) are needed to serve the traffic.

Test 1 — packet forwarding on Linux

The first test was easy. I’m simply sending packets from 10.10.11.1 to 10.10.12.1 and vice versa, through the DUT (Device Under Test), which is routing the packets between the two interfaces eno2 and eno4.
Note that I did both a unidirectional test (10.10.11.1 > 10.10.12.1) and a bidirectional test (10.10.11.1 > 10.10.12.1 AND 10.10.12.1 > 10.10.11.1).
I also tested with just one flow, and with 10,000 flows.

Receive Side Scaling (RSS)

This is important as the NIC is doing something called Receive Side Scaling (RSS), which load balances different flows onto different NIC receive queues. Each queue is then served by a different core, meaning the system scales horizontally. But keep in mind, you may still be limited by what a single core can do, depending on your traffic patterns.

Ok, show me the results! Keep in mind that we’re talking mostly about packets per second (pps), as that is the major indicator of the performance; it’s not super relevant how much data is carried in each packet. In the world of Linux networking, it really comes down to how many interrupts per second the system can process.
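
A quick way to watch that per-core interrupt work while a test is running is to keep an eye on the softirq load per CPU, for example:

#per-CPU utilization, including %soft (softirq time), once per second (sysstat package)
mpstat -P ALL 1

#or watch the NET_RX counters climb per CPU
watch -d -n1 'grep NET_RX /proc/softirqs'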

Test results for test 1

In the results above, you can see that one flow can go as high as 1.4Mpps. At that point, the core serving that queue is maxed out (running at 100%), cannot process any more packets, and will start dropping. The single flow forwarding performance is good to know for DDoS use cases or large single-flow network streams such as ESP. For services like these, the performance is as good as what a single queue/CPU can handle.

When doing the same test with 10,000 flows, I get to 14Mpps, full 10G line rate at the smallest possible packet size (64B), yay! At this point I can see all cores doing a fair amount of work. This is expected and is due to the hashing of flows over different queues. Looking at the CPU usage, I estimate that you’d need roughly 16 cores at 100% usage to serve this amount of packets (interrupts).

14Mpps, unidirectional test.

Interestingly, I wasn’t able to get to full line rate when doing the bidirectional test. Meaning both NICs both sending and receiving simultaneously. Although I am getting reasonably close at 12Mpps (24Mpps total per NIC). When eyeballing the cpu usage and amount of idle left over, I’d expect you’d need roughly 26 cores at 100% usage to do that.

Test 2 - Introducing a simple stateful iptables rule

In this test we’re adding two simple iptables rules to the DUT to see what the impact is. The hypothesis here is that since we’re now going to ask the system to invoke conntrack and do stateful session mapping, we’re starting to execute more code, which could impact the performance and system load. This test will show us the impact of that.

The iptables rules added were:

iptables -I FORWARD -d 10.10.11.1 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
iptables -I FORWARD -d 10.10.12.1 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

Test results for test 2, impact of conntrack

The results for the single-flow test look exactly the same, which is good. The results for the 10,000-flows test look the same as well in terms of packets per second; however, we do need a fair number of extra CPUs to do the work. Good thing our test system has plenty.
So you can still achieve close to full line rate with a simple stateful iptables rule, as long as you have enough CPUs. Note that in this case, the state table had 10,000 entries. I didn't test with more iptables rules.
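If you want to verify the conntrack table size during a test like this, the kernel exposes the counters directly (a quick check; exact values will obviously differ per run):

# Number of tracked connections vs. the configured maximum
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max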

Test 3 - Introducing a NAT rule

In this test, we're starting from scratch as in test 1, and I'm adding a simple NAT rule that causes all packets going through the DUT to be rewritten to a new source IP. These are the two rules:

iptables -I POSTROUTING -t nat -d 10.10.12.1 -s 10.10.11.1 -j SNAT --to 10.10.12.2
iptables -I POSTROUTING -t nat -d 10.10.11.1 -s 10.10.12.1 -j SNAT --to 10.10.11.2
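To confirm the NAT rules are actually being hit, the per-rule packet and byte counters are a quick sanity check:

# List the POSTROUTING NAT rules with packet/byte counters
iptables -t nat -L POSTROUTING -v -n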

The results below are quite different than what we saw earlier.

Linux Kernel and Measuring network throughput.
Test results for test 3

The results show that rewriting packets is quite a bit more expensive than just allowing or dropping them. For example, if we look at the unidirectional test with 10,000 flows, we dropped from 14Mpps (test 1) to 3.2Mpps, and we needed 13 more cores to do it!

Linux Kernel and Measuring network throughput.
This is what an (unhappy) 64-core system looks like when trying to forward and NAT 5.9Mpps

For what it's worth, I did do a quick measurement using nftables instead of iptables, but saw no significant difference in NAT performance.
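For completeness, roughly equivalent nftables rules for the same SNAT test would look something like this (a sketch only; chain names and priorities may differ from your setup):

nft add table ip nat
nft 'add chain ip nat postrouting { type nat hook postrouting priority 100 ; }'
nft add rule ip nat postrouting ip saddr 10.10.11.1 ip daddr 10.10.12.1 snat to 10.10.12.2
nft add rule ip nat postrouting ip saddr 10.10.12.1 ip daddr 10.10.11.1 snat to 10.10.11.2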

Conclusion

One of the questions I had starting this experiment was: can Linux route at line rate between two network interfaces? The answer is yes: we saw 14Mpps (unidirectional), as long as there are sufficient flows and you have enough cores (~16). The bidirectional test made it to 12Mpps (24Mpps total per NIC) with 26 cores at 100%.

We also saw that with the addition of two stateful iptables rules, I was still able to get the same throughput, but needed extra CPU to do the work. So at least it scales horizontally.

Finally, we saw a rather dramatic drop in performance when adding SNAT rules to the test. With SNAT, the maximum I was able to get out of the system was 5.9Mpps; this was for 20k sessions (10k per direction).

So yes, you can build a close-to-line-rate router in Linux, as long as you have sufficient cores and don't do too much packet manipulation. All in all, an interesting test, and now we have a baseline benchmark for future (kernel bypass / userland) networking experiments on Linux!

]]>
<![CDATA[Building a high performance - Linux Based Traffic generator with DPDK]]>

Often in my now 20-year networking career, I've had to do some form of network performance testing. Use-cases varied, from troubleshooting a customer problem to testing new network hardware, and nowadays, more and more, virtual network functions and software-based 'bumps in the wire'.

I’ve always enjoyed playing with

]]>
http://toonk.io/building-a-high-performance-linux-based-traffic-generator-with-dpdk/5f53d8003680d46e73ec60d6Wed, 18 Mar 2020 18:32:00 GMTBuilding a high performance - Linux Based Traffic generator with DPDK

Often in my now 20-year networking career, I've had to do some form of network performance testing. Use-cases varied, from troubleshooting a customer problem to testing new network hardware, and nowadays, more and more, virtual network functions and software-based 'bumps in the wire'.

I've always enjoyed playing with hardware-based traffic generators. My first experience with, for example, IXIA hardware testers goes back to, I think, 2003, at the Amsterdam Internet Exchange, where we were testing brand-new Foundry 10G cards. These hardware-based testers were super powerful and a great tool to validate new gear, such as router line cards, firewalls, and IPsec gear. However, we don't always have access to hardware-based traffic generators, as they tend to be quite expensive or only available in a lab. In this blog, we will look at a software-based traffic generator that anyone can use, based on DPDK. As you're going through this, remember that the scripts and additional info can be found on my GitHub page here.


Building a high performance - Linux Based Traffic generator with DPDK

DPDK, the Data Plane Development Kit, is an open-source software project started by Intel and now managed by the Linux Foundation. It provides a set of data plane libraries and network interface controller poll-mode drivers that run in userspace. Ok, let's think about that for a sec: what does that mean? Userspace networking is something you likely hear and read about increasingly often. The main driver behind userspace networking (aka kernel bypass) has to do with the way Linux has built its networking stack; it is built as part of a generic, multi-purpose, multi-user OS. Networking in Linux is powerful and feature-rich, but it's just one of the many features of Linux, and so, as a result, the networking stack needs to play fair and share its resources with the rest of the kernel and userland programs. The end result is that getting a few (1 to 3) million packets per second through the Linux networking stack is about what you can do on a standard system. That's not enough to fill up a 10G link with 64-byte packets, which is the equivalent of 14M packets per second (pps).

This is the point where the traditional interrupt-driven (IRQ) way of networking in Linux starts to limit what is needed, and this is where DPDK comes in. With DPDK and userland networking programs, we take the NIC away from the kernel and give it to a userland DPDK program. The DPDK driver is a poll mode driver (PMD), which means that, typically, one core per NIC always runs at 100% CPU; it sits in a busy loop, continuously polling for packets. That core will run at 100% regardless of how many packets are arriving or being sent on that NIC. This is obviously a bit of a waste, but nowadays, with plenty of cores and the need for high-throughput systems, this is often a great trade-off, and best of all, it allows us to get to the 14M pps number on Linux.

Ok, high performance, so we should all move to DPDK then, right?! Well, there's just one problem. Since we're now bypassing the kernel, we don't get to benefit from rich Linux features such as Netfilter, or even what we now consider basic features like a TCP/IP stack. This means you can't just run your Nginx, MySQL, BIND, etc., socket-based applications on DPDK, as all of these rely on the Linux socket API and the kernel to work. So although DPDK gives us a lot of speed and performance by bypassing the kernel, you also lose a lot of functionality.

There are quite a few DPDK-based applications nowadays, ranging from network forwarders, such as software-based routers and switches, to TCP/IP stacks such as F-Stack.

In this blog, we're going to look at DPDK-pktgen, a DPDK-based traffic generator maintained by the DPDK team. I'm going to walk through installing DPDK, setting up SR-IOV, and running pktgen. All of the below was tested on a Packet.com server of type x1.small.x86, which has a single Intel X710 10G NIC and a 4-core Xeon E3-1578L CPU. I'm using Ubuntu 18.04.4 LTS.

Installing DPDK and Pktgen

First, we need to install the DPDK libraries, tools, and drivers. There are various ways to install DPDK and pktgen; I elected to compile the code from source. There are a few things you need to do; to make it easier, you can download the same bash script I used to help you with the installation.
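The script takes care of the usual DPDK prerequisites. The rough shape of that prep work looks like this (a sketch only; the exact paths depend on the DPDK version and where you build it):

# Reserve 2MB hugepages; DPDK allocates its packet buffers from hugepage memory
echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge

# Load a userspace I/O driver so dpdk-devbind.py can later claim the NIC
modprobe uio
insmod /opt/dpdk-20.02/build/kmod/igb_uio.ko   # build output path assumed; adjust for your build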

Solving the single NIC problem

One of the challenges with DPDK is that it takes full control of the NIC. To use DPDK, you need to release the NIC from the kernel and hand it to DPDK. Given we only have one NIC, once we give it to DPDK, I'd lose all access to the machine (remember, there's no easy way to keep using SSH, etc., since that relies on the Linux kernel). Typically folks solve this by having a management NIC (for Linux) and one or more NICs for DPDK. But I have only one NIC, so we need to be creative: we're going to use SR-IOV to achieve the same. SR-IOV allows us to make one NIC appear as multiple PCI devices, so in a way, we're virtualizing the NIC.

To use SR-IOV, we need to enable the IOMMU in the kernel (done in the DPDK install script). After that, we can set the number of Virtual Functions (the number of new PCI NICs) like this:

echo 1 > /sys/class/net/eno1/device/sriov_numvfs
ip link set eno1 vf 0 spoofchk off
ip link set eno1 vf 0 trust on

dmesg -t will show something like this:

[Tue Mar 17 19:44:37 2020] i40e 0000:02:00.0: Allocating 1 VFs.
[Tue Mar 17 19:44:37 2020] iommu: Adding device 0000:03:02.0 to group 1
[Tue Mar 17 19:44:38 2020] iavf 0000:03:02.0: Multiqueue Enabled: Queue pair count = 4
[Tue Mar 17 19:44:38 2020] iavf 0000:03:02.0: MAC address: 1a:b5:ea:3e:28:92
[Tue Mar 17 19:44:38 2020] iavf 0000:03:02.0: GRO is enabled
[Tue Mar 17 19:44:39 2020] iavf 0000:03:02.0 em1_0: renamed from eth0

We can now see the new PCI device and nic name:

root@ewr1-x1:~# lshw -businfo -class network | grep 000:03:02.0
pci@0000:03:02.0 em1_0 network Illegal Vendor ID

Next up we will unbind this NIC from the kernel and give it to DPDK to manage:

/opt/dpdk-20.02/usertools/dpdk-devbind.py -b igb_uio 0000:03:02.0

We can validate that like this (note em2 is not connected and not used):

/opt/dpdk-20.02/usertools/dpdk-devbind.py -s
Network devices using DPDK-compatible driver
============================================
0000:03:02.0 'Ethernet Virtual Function 700 Series 154c' drv=igb_uio unused=iavf,vfio-pci,uio_pci_generic
Network devices using kernel driver
===================================
0000:02:00.0 'Ethernet Controller X710 for 10GbE backplane 1581' if=eno1 drv=i40e unused=igb_uio,vfio-pci,uio_pci_generic
0000:02:00.1 'Ethernet Controller X710 for 10GbE backplane 1581' if=em2 drv=i40e unused=igb_uio,vfio-pci,uio_pci_generic

Testing setup

Now that we're ready to start testing, I should explain the simple test setup. I'm using two x1.small servers; one is the sender (running dpdk-pktgen), the other is a vanilla Ubuntu machine. What we're going to test is the ability of the receiver's kernel, sometimes referred to as the Device Under Test (DUT), to pick up the packets from the NIC. That's all; we're not processing anything, and the IP address the packets are sent to isn't even configured on the DUT, so the kernel will drop the packets as soon as it picks them up from the NIC.

Building a high performance - Linux Based Traffic generator with DPDK
test setup

Single flow traffic

Ok, time to start testing! Let’s run pktgen and generate some packets! My first experiment is to figure out how much I can send in a single flow to the target machine before it starts dropping packets.

Note that you can find the exact config in the GitHub repo for this blog. The file pktgen.pkt contains the commands to configure the test setup. Things I configured include the MAC and IP addresses, ports and protocols, and the sending rate. Note that I'm testing from 10.99.204.3 to 10.99.204.8. These are on /31 networks, so I'm setting the destination MAC address to that of the default gateway. With the config as defined in pktgen.pkt, I'm sending the same 64-byte packets (5-tuple, UDP 10.99.204.3:1234 > 10.99.204.8:81) over and over.
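To give a feel for what such a config contains, a minimal pktgen.pkt could look roughly like this (a sketch only; the gateway MAC is a placeholder, and the exact command syntax can vary a bit between pktgen versions, so use the file in the repo as the reference):

set 0 dst mac 00:11:22:33:44:55   # placeholder: MAC of the default gateway
set 0 src ip 10.99.204.3/31
set 0 dst ip 10.99.204.8
set 0 proto udp
set 0 sport 1234
set 0 dport 81
set 0 size 64
set 0 rate 100
start 0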

I’m using the following to start pktgen.

/opt/pktgen-20.02.0/app/x86_64-native-linuxapp-gcc/pktgen -- -T -P -m "2.[0]" -f pktgen.pkt

After adjusting the sending rate on the sender and monitoring with ./monitorpkts.sh on the receiver, I found that a single flow (single queue, single core) runs clean on this receiver machine up until about 120k pps. If I push the sending rate higher than that, I start to observe packets being dropped on the receiver. That's a bit lower than expected, and even though it's one flow, I can see that the CPU serving that queue has enough idle time left. There must be something else happening…

The answer has to do with the receive buffer ring on the receiver's network card: it was too small for the higher packet rates. After increasing it from 512 to 4096, I can receive up to 1.4Mpps before seeing drops, not bad for a single flow!

ethtool -G eno1 rx 4096
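The monitorpkts.sh script in the repo is what I used to watch for drops; the core idea is simply to sample the interface counters every second, something along these lines (a simplified sketch, interface name assumed; see the repo for the actual script):

#!/bin/bash
# Print per-second rx and drop rates for an interface (default eno1)
IFACE=${1:-eno1}
prev_rx=$(cat /sys/class/net/$IFACE/statistics/rx_packets)
prev_drop=$(cat /sys/class/net/$IFACE/statistics/rx_dropped)
while sleep 1; do
  rx=$(cat /sys/class/net/$IFACE/statistics/rx_packets)
  drop=$(cat /sys/class/net/$IFACE/statistics/rx_dropped)
  echo "rx: $((rx - prev_rx)) pps  dropped: $((drop - prev_drop)) pps"
  prev_rx=$rx
  prev_drop=$drop
done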

Multi flow traffic

Pktgen also comes with the ability to configure ranges. Examples of ranges include source and destination IP addresses, as well as source and destination ports. You can find an example in the pktgen-range.pkt file. For most environments, this is a more typical scenario, as your server is likely to serve many different flows from many different IP addresses. In fact, the Linux system relies on the existence of many flows to be able to deal with these higher amounts of traffic. The Linux kernel hashes and load balances these different flows onto the available receive queues on the NIC. Each queue is then served by a separate interrupt thread, allowing the kernel to parallelize the work and leverage the multiple cores on the system.

Below you'll find a screenshot from when I was running the test with many flows. The receiver terminals can be seen on the left, the sender on the right. The main thing to notice here is that on the receiving node, all available CPUs are being used; note the ksoftirqd/X processes. Since we are using a wide range of source and destination ports, we're getting proper load balancing over all cores. With this, I can now achieve 0% packet loss up to about 6Mpps. To get to 14Mpps, 10G line rate at 64-byte packets, I'd need more CPUs.

Building a high performance - Linux Based Traffic generator with DPDK

IMIX test

Finally, we’ll run a basic IMIX test, using the dpdk-pktgen pcap feature. Internet Mix or IMIX refers to typical Internet traffic. When measuring equipment performance using an IMIX of packets, the performance is assumed to resemble what can be seen in “real-world” conditions.

The imix pcap file contains 100 packets with the sizes and ratio according to the IMIX specs.

tshark -r imix.pcap -V | grep 'Frame Length'| sort | uniq -c|sort -n
9 Frame Length: 1514 bytes (12112 bits)
33 Frame Length: 590 bytes (4720 bits)
58 Frame Length: 60 bytes (480 bits)

I need to rewrite the source and destination IP and MAC addresses so that they match my current setup; this can be done like this:

tcprewrite \
  --enet-dmac=44:ec:ce:c1:a8:20 \
  --enet-smac=00:52:44:11:22:33 \
  --pnat=16.0.0.0/8:10.10.0.0/16,48.0.0.0/8:10.99.204.8/32 \
  --infile=imix.pcap \
  --outfile=output.pcap

For more details, also see my notes here: https://github.com/atoonk/dpdk_pktgen/blob/master/DPDKPktgen.md

We then start the pktgen app and give it the pcap:

/opt/pktgen-20.02.0/app/x86_64-native-linuxapp-gcc/pktgen -- -T -P -m "2.[0]" -s 0:output.pcap

I can now see that I'm sending and receiving packets at a rate of 3.2Mpps at 10Gb/s, well below the maximum we saw earlier. This means that the Device Under Test (DUT) is capable of receiving packets at 10Gb/s with an IMIX traffic pattern.

Building a high performance - Linux Based Traffic generator with DPDK
Result of IMIX test with a PCAP as the source. Receiver (DUT) on the left, sender window on the right.

Conclusion

In this article, we looked at getting DPDK up and running, talked a bit about what DPDK is, and used its pktgen traffic generator application. A typical challenge when using DPDK is that you lose the network interface, meaning that the kernel can no longer use it. In this blog, we solved this using SR-IOV, which allowed us to create a second logical interface for DPDK. Using this interface, I was able to generate 14Mpps without issues.

On the receiving side of this test traffic, we had another Linux machine (no DPDK), and we tested its ability to receive traffic from the NIC (after which the kernel dropped it straight away). We saw how the packets-per-second number is limited by the rx buffer, as well as by the ability of the CPU cores to pick up the packets (soft interrupts). We saw that a single core was able to do about 1.4Mpps. Once we started leveraging more cores by creating more flows, we started seeing dropped packets at about 6Mpps. If we had had more CPUs, we'd likely be able to do more than that.

Also note that throughout this blog, I spoke mostly of packets per second and not much in terms of bits per second. The reason for this is that every new packet on the Linux receiver (DUT) creates an interrupt. As a result, the number of interrupts the system can handle is the most critical indicator of how many bits per second the Linux system can handle.

All in all, pktgen and dpdk require a bit of work to set up, and there is undoubtedly a bit of a learning curve. I hope the scripts and examples in the GitHub repo will help with your testing and remember: with great power comes great responsibility.

Building a high performance - Linux Based Traffic generator with DPDK
]]>
<![CDATA[TCP BBR - Exploring TCP congestion control]]>

One of the oldest protocols and possibly the most used protocol on the Internet today is TCP. You likely send and receive hundreds of thousands or even over a million TCP packets (eeh segments?) a day. And it just works! Many folks believe TCP development has finished, but that’s

]]>
http://toonk.io/tcp-bbr-exploring-tcp-congestion-control/5f53dac53680d46e73ec610dSat, 15 Feb 2020 00:00:00 GMT

One of the oldest protocols and possibly the most used protocol on the Internet today is TCP. You likely send and receive hundreds of thousands or even over a million TCP packets (eeh, segments?) a day. And it just works! Many folks believe TCP development has finished, but that's incorrect. In this blog, we'll take a look at a relatively new TCP congestion control algorithm called BBR and take it for a spin.

Alright, we all know the difference between the two most popular transport protocols used on the Internet today: UDP and TCP. UDP is a send-and-forget protocol. It is stateless and has no congestion control or reliable delivery support. We often see UDP used for DNS and VPNs. TCP is UDP's sibling and does provide reliable transfer and flow control; as a result, it is quite a bit more complicated.

People often think the main difference between TCP and UDP is that TCP gives us guaranteed packet delivery. This is one of the most important features of TCP, but TCP also gives us flow control. Flow control is all about fairness and is critical for the Internet to work; without some form of flow control, the Internet would collapse.

Over the years, different congestion control algorithms have been implemented and used in the various TCP stacks. You may have heard of TCP terms such as Reno, Tahoe, Vegas, Cubic, Westwood, and, more recently, BBR. These are all different congestion control algorithms used in TCP. What these algorithms do is determine how fast the sender should send data while adapting to network changes. Without these algorithms, our Internet pipes would soon be filled with data and collapse.

BBR

Bottleneck Bandwidth and Round-trip propagation time (BBR) is a TCP congestion control algorithm developed at Google in 2016. Up until recently, the Internet has primarily used loss-based congestion control, relying only on indications of lost packets as the signal to slow down the sending rate. This worked decently well, but the networks have changed. We have much more bandwidth than ever before; the Internet is generally more reliable now, and we see new things such as bufferbloat that impact latency. BBR tackles this with a ground-up rewrite of congestion control, using latency, instead of lost packets, as the primary signal to determine the sending rate.

TCP BBR - Exploring TCP congestion control
Source: https://cloud.google.com/blog/products/gcp/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster

Why is BBR better?

There are a lot of details I've omitted, and it gets complicated pretty quickly, but the important thing to know is that with BBR, you can get significantly better throughput and reduced latency. The throughput improvements are especially noticeable on long haul paths, such as transatlantic file transfers, especially when there's minor packet loss. The improved latency is mostly seen on the last-mile path, which is often impacted by bufferbloat (4-second ping times, anyone?). Since BBR attempts not to fill the buffers, it tends to be better at avoiding bufferbloat.

TCP BBR - Exploring TCP congestion control
Photo by Zakaria Zayane on Unsplash

let’s take BBR for a spin!

BBR has been in the Linux kernel since version 4.9 and can be enabled with a simple sysctl command. In my tests, I'm using two Ubuntu machines and iperf3 to generate TCP traffic. The two servers are located in the same data center; I'm using two Packet.com servers of type t1.small, which come with a 2.5Gbps NIC.
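Before testing, it's worth confirming which congestion control algorithms the kernel has available and that the BBR module is loaded (a quick check; on older kernels you'll also see guides pair BBR with the fq qdisc for pacing):

# List the congestion control algorithms this kernel can use
sysctl net.ipv4.tcp_available_congestion_control

# Load the BBR module if it isn't already available
modprobe tcp_bbr

# Show the currently active algorithm
sysctl net.ipv4.tcp_congestion_control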

The first test is a quick one to see what we can get from a single TCP flow between the two servers. This shows 2.35Gb/s, which sounds about right and is good enough to run our experiments.
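That baseline number comes from a plain iperf3 run between the two machines, roughly like this (server address assumed; substitute the other machine's IP):

# On the receiving server
iperf3 -s

# On the sending server, run a 30-second test towards the receiver
iperf3 -c 147.75.69.253 -t 30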

The effect of latency on TCP throughput

In my day job, I deal with machines that are distributed over many dozens of locations all around the world, so I'm mostly interested in the performance between machines that have some latency between them. In this test, we are going to introduce 140ms round trip time between the two servers using Linux Traffic Control (tc). This is roughly the equivalent of the latency between San Francisco and Amsterdam. This can be done by adding 70ms per direction on both servers like this:

tc qdisc replace dev enp0s20f0 root netem latency 70ms

If we do a quick ping, we can now see the 140ms round trip time

root@compute-000:~# ping 147.75.69.253
PING 147.75.69.253 (147.75.69.253) 56(84) bytes of data.
64 bytes from 147.75.69.253: icmp_seq=1 ttl=61 time=140 ms
64 bytes from 147.75.69.253: icmp_seq=2 ttl=61 time=140 ms
64 bytes from 147.75.69.253: icmp_seq=3 ttl=61 time=140 ms

Ok, time for our first test. I'm going to start with Cubic, as that is the most common TCP congestion control algorithm used today.

sysctl -w net.ipv4.tcp_congestion_control=cubic

A 30-second iperf run shows an average transfer speed of 347Mb/s. This is the first clue of the effect of latency on TCP throughput. The only thing that changed from our initial test (2.35Gb/s) is the introduction of 140ms of round trip delay. Let's now set the congestion control algorithm to BBR and test again.

sysctl -w net.ipv4.tcp_congestion_control=bbr

The result is very similar: the 30-second average is now 340Mb/s, slightly lower than with Cubic. So far, no real changes.

The effect of packet loss on throughput

We're going to repeat the same test as above, but with the addition of a minor amount of packet loss. With the command below, I'm introducing 1.5% packet loss on the server (sender) side only.

tc qdisc replace dev enp0s20f0 root netem loss 1.5% latency 70ms

The first test with Cubic shows a dramatic drop in throughput: it falls from 347Mb/s to 1.23Mb/s. That's a ~99.5% drop and leaves this link basically unusable for today's bandwidth needs.

If we repeat the exact same test with BBR, we see a significant improvement over Cubic. With BBR, the throughput drops to 153Mb/s, a 55% drop.

The tests above show the effect of packet loss and latency on TCP throughput. The impact of just a minor amount (1.5%) of packet loss on a long-latency path is dramatic. Using anything other than BBR on these longer paths will cause significant issues when there is even a minor amount of packet loss. Only BBR maintains a decent throughput number at anything more than 1.5% loss.

The table below shows the complete set of results for the various TCP throughput tests I did using different congestion control algorithms, latency and packet loss parameters.

TCP BBR - Exploring TCP congestion control
Throughput Test results with various congestion control algorithms

Note: the congestion control algorithm used for a TCP session is only locally relevant. So, two TCP speakers can use different congestion control algorithms on each side of the TCP session. In other words: the server (sender) can enable BBR locally; there is no need for the client to be BBR-aware or support BBR.

TCP socket statistics

As you're exploring TCP performance tuning, make sure to use socket statistics, or ss, like below. This tool displays a ton of socket information, including the TCP congestion control algorithm used and the round trip time per TCP session, as well as the calculated bandwidth and actual delivery rate between the two peers.

TCP BBR - Exploring TCP congestion control
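A typical invocation looks something like this (the destination filter is assumed from the earlier ping test; any address or port filter that ss supports will work):

# Show TCP internals (congestion algorithm, rtt, cwnd, delivery_rate) for
# sessions towards the test server
ss -tin dst 147.75.69.253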

When to use BBR

Both Cubic and BBR perform well on these longer-latency links when there is no packet loss, and BBR really shines under (moderate) packet loss. Why is that important? You could argue: why would you want to design for these packet loss situations? For that, let's think about a situation where you have multiple data centers around the world, and you rely on transit to connect the various data centers (possibly using your own overlay VPN). You likely have a steady stream of data between the various data centers: think of log files, ever-changing configuration or preference files, database synchronization, backups, etc. All major transit providers at times suffer from packet loss for various reasons. If you have a few dozen of these globally distributed data centers, depending on your transit providers and the locations of your POPs, you can expect packet loss incidents between a set of data centers several times a week. In situations like this, BBR will shine and help you maintain your SLOs.

I’ve mostly focused on the benefits of BBR for long haul links. But CDNs and various application hosting environments will also see benefits. In fact, Youtube has been using BBR for a while now to speed up their already highly optimized experience. This is mostly due to the fact that BBR ramps up to the optimal sending rate aggressively, causing your video stream to load even faster.

Downsides of BBR

It sounds great, right? Just execute this one sysctl command, and you get much better throughput, resulting in a better experience for your users. Why would you not do this? Well, BBR has received some criticism due to its tendency to consume all available bandwidth and push out other TCP streams that use, say, Cubic or other congestion control algorithms. This is something to be mindful of when testing BBR in your environment. BBRv2 is supposed to resolve some of these challenges.

All in all, I was amazed by the results. It looks to me like this is certainly worth taking a closer look at. You won't be the first: in addition to Google, Dropbox and Spotify are two other examples of companies using or experimenting with BBR.

]]>
<![CDATA[Building A Smarter AWS Global Accelerator]]>http://toonk.io/building-a-smarter-aws-global-accelerator/5f53e04f3680d46e73ec613fSun, 01 Dec 2019 00:00:00 GMT

In my last blog post, we looked at Global Accelerator, a global load balancer provided by AWS. I think Global Accelerator is an excellent tool for folks building global applications in AWS, as it helps them direct traffic to the right origin locations or servers. This is great for high-volume applications, as well as for providing improved availability.

In this blog post, we’ll take a look at how we could build our own global accelerator by building on other previous blog posts (building anycast applications on packet). I’ve been thinking a lot about Global Accelerator and while it provides a powerful data plane, I think it would benefit from a smarter control plane. A control plane that provides load balancing with more intelligence, by taking into account the capacity, load and round trip time to each origin. In this blog post, we’ll evaluate and demonstrate what that could look like by implementing a Smarter Global Accelerator ourselves.

Typical Architecture

Many applications nowadays are delivered using the architecture below. Clients always hit one of the nearest edge nodes (the blue diamonds). Which edge node that is depends on the way traffic is directed to the edge, typically either DNS-based load balancing or straight anycast, which is what AWS Global Accelerator uses.

Building A Smarter AWS Global Accelerator
Typical application delivery architecture

The edge node then needs to determine what origin server to send the request to (assuming there is no caching, or a cache miss). This is how your typical CDN works, but also how, for example, Google and Facebook deliver their applications. In the case of a simple CDN, there could be one or more origin servers. In the case of Facebook, the choice is more about which of their 'core' or 'larger' datacenters to send the request to.

With AWS Global Accelerator you can configure listeners (the diamond) in a region to send a certain percentage of traffic to an origin group, ‘Endpoint Groups’ in AWS speak. This is a static configuration, which is not ideal. Additionally, if an origin (the green box in the diagram) reaches its capacity, you will need to update the configuration. This is the part we’re going to make smarter.

In an ideal world, each edge node (the diamond) routes the requests to the closest origin based on the latency between the edge node and the origin. It should also be aware of the total load the origin is under, and how much load the origin can handle. An individual edge node doesn’t know how many other edge nodes there are, and how much each of them is sending each origin. So we need a centralized brain and a feedback loop.

Building a Closed-loop system

To have the system continuously adapt to the changing environment, we need to have access to several operational metrics. We need to know how many requests each edge node (load balancer) is receiving, with that we can infer the total, global, number of incoming requests per second. We also need to know the capacity of each origin since we want to make sure we don’t send more traffic to an origin than it can handle. Finally, we need to know the health and latency between each edge node and each origin. Most of these metrics are dynamic, so we need to continuously publish (or poll) the health information and request per second information.

Now that we have all the input data, we can feed this into our software. The software is essentially a scheduler, solving a constraint-based assignment problem. The output is a list of all edge nodes, with a weight assignment per origin for each edge node. A simple example could look like this:

-Listener 192.0.2.10:443 
 - Edge node Amsterdam:
    - Origin EU DC:  90%
    - Origin US-WEST DC:  0%
    - Origin US-EAST DC:  10%
    - Origin Asia DC:  0%
 - Edge node New York:
    - Origin EU DC: 0%
    - Origin US-WEST DC:  0%
    - Origin US-EAST DC:  100%
    - Origin Asia DC:  0%

In the above example for listener 192.0.2.10:443, the Amsterdam edge node will send 90% of the requests to the EU origin, while the remaining 10% is directed to the next closest DC, US-EAST. This means the EU datacenter is at capacity and is offloading traffic to the next closest origin.

The New York edge node is sending all traffic to the US-EAST datacenter, as there is enough capacity there at this point and no need to offload traffic.

Our closed-loop system will re-calculate and publish the results every few seconds so that we can respond to changes quickly.

Let’s start building

I'm going to re-use much of what we've built earlier; in this experiment, I'm again using Packet.net and their BGP anycast support to build the edge nodes. Please see this blog for details. I'm using Linux LVS as the load balancer for this setup. Each edge node publishes the needed metrics to Prometheus, a time-series database, every 15 seconds. With this, we now have a handful of edge nodes, fully anycasted, and access to all needed metrics per edge node in a centralized system.

Building A Smarter AWS Global Accelerator
Swagger file, built using Flask-RESTPlus

The other thing that is needed is a centralized source of truth. For this, I wrote a Flask based REST API. This API allows us to create new load balancers, add origins, etc. We can also ask this same API for all load balancers, its origins and the health and operational metrics.

The next thing we need is a script that talks to the API every few seconds to retrieve the latest configuration. With that information, each edge node can update its load balancer configuration, such as creating new load balancer listeners and updating origin details like the weight per origin. We now have everything in place and can start testing.

Building A Smarter AWS Global Accelerator
JSON definition for each load balancer
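To make that concrete, translating the weight listing from earlier into an LVS configuration with ipvsadm could look something like this (a hypothetical sketch; the origin IP addresses and forwarding mode are made up for illustration):

# Weighted round robin service for listener 192.0.2.10:443
ipvsadm -A -t 192.0.2.10:443 -s wrr

# EU origin gets weight 90, US-EAST origin gets weight 10 (example IPs)
ipvsadm -a -t 192.0.2.10:443 -r 198.51.100.10:443 -m -w 90
ipvsadm -a -t 192.0.2.10:443 -r 198.51.100.30:443 -m -w 10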

Observations and tuning

I started testing by generating many GET requests to one listener that has two origins: one DigitalOcean VM in the US and one in Europe. Since all testing was performed from one location, it was hitting one edge datacenter, which has two edge nodes. Those edge nodes would send the traffic to the closest origin, which is the US origin. Now imagine this origin server was hitting its maximum capacity, and I want to protect it from being overloaded and start sending some traffic to the other origin. To do that, I set the maximum load number for the US origin to 200 requests per second (also see the JSON above).

Below you'll see an interesting visualization of this measurement. At t=0, the total traffic load for both origins is 0; no traffic is coming in at all, which also means that both origins are well below their max capacity. As a result, the US load balancers are configured to send all requests to the US origin, as it has the lowest latency and is below the max threshold.

After we start generating traffic, all requests are sent to the US origin. As the metrics start coming in, the system detects that the total number of incoming requests exceeds 200; as a result, the load balancer configuration is updated to start sending traffic to both origin servers, with the intent of not sending more than 200 requests per second to the US origin.

Building A Smarter AWS Global Accelerator

Speed vs. accuracy.

One of the things we want to prevent is big, sudden swings in traffic. To achieve that, I built in a dampening factor that limits the load-shifting (i.e., load balancer configuration) change to 2 percent per origin for each 15-second interval. Note that if you have two origins, this means a 2% change per origin, so a 4% swing per 15-second cycle. This means it will take a bit longer for the system to respond to major changes, but it gives us substantially more stability, meaning less oscillation between origins, and allows the system to stabilize. In my initial version, I had no dampening, and the system never stabilized and showed significant, unwanted sudden traffic swings.

The graph shows an interesting side effect of my testing setup. Since I'm testing from the US west coast and start offloading more and more requests to the EU origin, a single curl will, on average, take longer due to the increased round trip time. As a result, the total number of requests goes down. Which is fun, because it means the software needs to adapt constantly: every time we change the origin weights slightly, the total number of requests changes slightly. This causes some oscillation, but it's also exactly the oscillation a closed-loop system is designed for, and it works well as long as we have a dampening factor. It also shows that in some instances the total number of inbound requests increases if your website (or any app) responds faster, though I'm not sure if that's representative of a real-world scenario. Still, this was a fun side effect that put a bit of extra stress on the software.

Closing thoughts

This project was fun, as it allowed me to combine several of my interests. One of them is global traffic engineering, i.e., how do we get traffic to where we'd like it to be processed. We also got to re-use the lessons learned and experience gained from a few of the previous blog posts, specifically how to build anycasted applications with Packet and a deep dive into AWS Global Accelerator. I got to improve my Python skills by building a RESTful API in Flask and making sure it was properly documented using the OpenAPI spec.
Finally, building the actual scheduler was a fun challenge, and it took me a while to figure out how to best solve the assignment problem before I was really happy with the outcome. The result is what could be a Smarter Global Accelerator.
]]>
<![CDATA[Building a high available Anycast service using AWS Global Accelerator]]>http://toonk.io/building-a-high-available-anycast-service-using-aws-global-accelerator/5f53e1f33680d46e73ec6157Mon, 02 Sep 2019 00:00:00 GMT

Everything I'm demonstrating below can easily be replicated using the Terraform code here; with that, we will have a global anycast load balancer and origin web & DNS servers in under 5 minutes.

Global Accelerator became publicly available in late 2018, and I’ve looked at it various times since then with a few potential use-cases in mind. The main value proposition for users is the ability to get a static IPv4 address that isn’t tied to a region. With global accelerator, customers get two globally anycasted IPv4 addresses that can be used to load balance across 14 unique AWS regions. A user request will get routed to the closest AWS edge POP based on BGP routing. From there, you can load balance requests to the AWS regions where your applications are deployed. Global accelerator comes with traffic dials that allow you to control how much traffic goes to what region, as well as instances in that region. It also has built-in health checking to make sure traffic is only routed to healthy instances.

Global Accelerator comes with a few primitives; let's briefly review them:

  • Global Accelerator: think of this as your anycast endpoint. It's a global primitive and comes with two IPv4 addresses.
  • Next up, you create a Listener which defines what port and protocol (tcp or udp) to listen on. A Global Accelerator can have multiple listeners. We’ll create two for this article later on.
  • For each Listener, you then create one or more Endpoint Groups. An endpoint group allows you to group endpoints together by region. For example, EU-CENTRAL-1 can be an endpoint group. Each endpoint group has a traffic dial that controls the percentage of traffic you'd like to send to a region.
  • Finally, we have an Endpoint, this can be either an Elastic IP address, a Network Load Balancer, or an Application Load Balancer. You can have multiple endpoints in an Endpoint group. For each endpoint, you can configure a weight that controls how traffic is load-balanced over the various endpoints within an endpoint group.

These are some basic yet powerful primitives to help you build a highly available, low latency application with granular traffic control.

Let’s start building!

Alright that sounds cool, let’s dive right in. In this example, we’re going to build a highly available web service and DNS service in four AWS regions, us-east-1, us-west-2, eu-central-1, and ap-southeast-1. Each region will have two ec2 instances with a simple Go web server and a simple DNS server. We’ll create the global accelerator, listeners, endpoint groups, endpoints, as well as all the supporting infrastructure such as VPCs, subnets, security groups all through terraform.

I’ve published the terraform and other supporting code on my github page: https://github.com/atoonk/aws_global_accelerator. This allows you to try it yourself with minimum effort. Time to deploy our infrastructure:

terraform init && terraform plan && terraform apply -auto-approve
…
Plan: 63 to add, 0 to change, 0 to destroy.
...
GlobalAccelerator = [
  [
    [
      {
        "ip_addresses" = [
          "13.248.138.197",
          "76.223.8.146",
        ]
        "ip_family" = "IPv4"
      },
    ],
  ],
]


After a few minutes, all 63 components have been built. Now our globally distributed application is available over two anycast IPv4 addresses, in four regions, with two origin servers per region like the diagram below.

Building a high available Anycast service using AWS Global Accelerator

In the AWS console, you’ll now see something like this. In my example, you see two listeners, one for the webserver on TCP 80 and one for our DNS test on UDP 53.

Building a high available Anycast service using AWS Global Accelerator
Global Accelerator AWS Console

Time to start testing.

Now that we have our anycasted global load balancer up and running, it's time to start testing. First, let's do a quick HTTP check from four servers around the world: Tokyo, Amsterdam, New York, and Seattle. To check whether load balancing works as expected, I'm using the following test:

for i in {1..100}; do
  curl -s -q http://13.248.138.197
done | sort -n | uniq -c | sort -rn
Building a high available Anycast service using AWS Global Accelerator
http load balancing test

The above shows that the hundred requests from each of the servers indeed go to the region we'd expect them to go to and are nicely balanced over the two origin servers in each region.

Building a high available Anycast service using AWS Global Accelerator

Next, we'll use RIPE Atlas, a globally distributed measurement network, to check the traffic distribution on a larger, global scale. This will tell us how well the AWS anycast routing setup works.

I’m going to use the DNS listener for this. Using RIPE Atlas, I’m asking 500 probes to run the following dig command. The output will tell us what node and region are being used.

dig @13.248.138.197 id.server txt ch +short
"i-0e32d5b1a3f76cc05.us-west-2c"

The image below is a visualization of the measurement results. It shows four different colors; each color represents one of the four regions. At first glance, there's certainly a good distribution among the four geographic regions. The odd ones out are the clients in Australia and New Zealand, which AWS is sending to the us-west-2 region instead of the closer, lower-latency Singapore (ap-southeast-1) region. I could, of course, solve this myself by creating a listener and origin server in Australia as well. Other than that, there are very few outliers, so that looks good.

Building a high available Anycast service using AWS Global Accelerator
Global Accelerator Anycast Traffic distribution using RIPE atlas

Based on this measurement from 500 random RIPE Atlas probes, we can see that the region getting most of the traffic is Europe. In a real-life scenario, this region could now be running hot. To resolve that, I'm going to lower the traffic dial for the Europe endpoint group to 50%. This means 50% of the traffic that was previously going to Europe should now go to the next best region. The visualization below shows that the majority of the 50% offloaded traffic ended up in the us-east region, and a small portion spilled over into ap-southeast-1 (remember, I'm only deployed in four regions).

Building a high available Anycast service using AWS Global Accelerator
50% of the EU traffic offloaded to alternative regions

Routing and session termination

An interesting difference between Global Accelerator and regular public EC2 IP addresses is how traffic is routed to AWS. For Global Accelerator, AWS will try to get traffic onto its own network as soon as possible. Compare that to regular AWS public IP addresses: Amazon only announces those prefixes out of the region where the IP addresses are used. That means that for Global Accelerator IP addresses, your traffic is handed off to AWS at its closest Global Accelerator POP and then uses the AWS backbone to get to the origin. For regular AWS public IP addresses, AWS relies on public transit and peering to get the traffic to the region it needs to reach.

Let’s look at a traceroute to illustrate this. One of my origin servers in the ap-southeast-1 regions is 13.251.41.53, a traceroute to that ec2 instance in Singapore from a non AWS server in Sydney looks like this:

Building a high available Anycast service using AWS Global Accelerator
Traceroute from Sydney to an origin Server in Singapore, ap-southeast-1

In this case, we hand off traffic to NTT in Sydney, which sends it to Japan and then to Singapore, where it's handed off to AWS; it eventually took 180ms to get there. It's important to observe that traffic was handed off to AWS in Singapore, near the end of the traceroute.

Now, as a comparison, we’ll look at a traceroute from the same server in Sydney to the anycasted Global Accelerator IP

Building a high available Anycast service using AWS Global Accelerator
Traceroute from Sydney to anycasted Global Accelerator IP

The Global Accelerator address on hop 4 is just under a millisecond away. In this case, AWS is announcing the prefix locally via the Internet Exchange and it is handed off to AWS on the second hop. Quite a latency difference.

Keep in mind that although you certainly get better "ping" times to your service's IP address, the actual response time of the application will still depend on where the application is deployed. It's interesting to note, though, that based on the above, AWS does appear to terminate the network connection at all of its Global Accelerator sites, even if your application is not deployed there. This is also visible in the logs of our webserver: the source IP address of the client isn't the actual client's IP; instead, it's an IP address of AWS' Global Accelerator service. Not being able to see the original client IP address is, I imagine, a challenge for many use-cases.

Conclusion

In this article, we looked at how Global Accelerator works and behaves. I think it’s a powerful service. Having the ability to flexibly steer traffic to particular regions and even endpoints gives you a lot of control, which will be useful for high traffic applications. It also makes it easier to make applications highly available even on a per IP address basis (as compared to using DNS based load-balancing). A potentially useful feature for Global Accelerator would be to combine it with the Bring Your Own IP feature.

Global Accelerator is still a relatively young service, and new features will likely be added over time. Currently, I find one of the significant limitations to be the lack of client IP visibility. Interestingly, AWS just a few days ago announced client IP preservation for ALB (only) endpoints. Given that improvement, I'd imagine that client IP preservation for other types of endpoints, such as elastic IPs and NLBs, may come at some point too.

Finally, other than the flow logs, I didn’t find any logging or statistics in the AWS console for Global Accelerator. This would, in my opinion, be a valuable add-on.

All in all, a cool and useful service that I'm sure will be valuable to many AWS users. For me, it was fun to test drive Global Accelerator, check out AWS' anycast routing setup, and build it all using Terraform. Hope you enjoyed it too :)

]]>
<![CDATA[Building a high available Anycast application in 5 minutes on Packet]]>http://toonk.io/building-a-high-available-anycast-application-in-5-minutes-on-packet/5f53e3ba3680d46e73ec617bSun, 24 Feb 2019 00:00:00 GMT

In my previous blog, we took a peek at Stackpath's Workload features with Anycast. In this blog, we will take a look at one of my other favorite compute providers, Packet, and we will use Terraform to describe and deploy our infrastructure as code.

Packet

I have been using Packet for about 1.5 years now for various projects. There are a few reasons why I like Packet. It all started with their BGP and Anycast support, yup, BGP straight to your server! That, combined with their powerful bare-metal compute options, great customer success team, and excellent Terraform support, makes Packet one of my go-to suppliers for my pet projects.

What are we building?

Alright, let’s dive in! A brief description of what we’re going to build today: We have a small Golang web application that does nothing more than print the machine’s hostname. We want this application to be highly available and deployed to various locations around the world so that our users can access it with low latency. In our case, we are going to deploy this to four Packet datacenters around the globe, with two instances in each datacenter. Each Packet server will have BGP enabled, and we’ll use Anycast for load balancing and high availability.

Using the Packet Terraform provider, we can do all of this in an automated way, allowing me to do all this in less than five minutes! Sounds like fun? Alright, let's dive in.

Terraform

We could provision our servers using the web portal, but the ability to describe our infrastructure as code using Terraform is a lot easier and allows us to grow and shrink our deployment easily. If you'd like to play with this yourself and follow along, you can find the code for this demo on my GitHub page here.
Next up, we'll take a brief look at the important parts of the Terraform code.

First off, we create a project in Packet and set some BGP parameters for this project (AS number and, optionally, a password). Next up comes the cool part: we reserve what Packet calls a global IPv4 address. This IP address is the anycast address that we'll be using; it will be announced to Packet from all our servers using the Bird BGP daemon.

provider "packet" {
 auth_token = "${var.packet_api_key}"
}
# Create project
resource "packet_project" "anycast_test" {
 name = "anycast project"
 bgp_config {
 deployment_type = "local"
 #md5 = "${var.bgp_password}"
 asn = 65000
 }
}
# Create a Global IPv4 IP to be used for Anycast
# the Actual Ip is available as: packet_reserved_ip_block.anycast_ip.address
# We'll pass that along to each compute node, so they can assign it to all nodes and announce it in BGP
resource "packet_reserved_ip_block" "anycast_ip" {
 project_id = "${packet_project.anycast_test.id}"
 type = "global_ipv4"
 quantity = 1
}

Next up we define where we want to deploy our compute nodes to, and how many per datacenter.

module "compute_sjc" {
 source = "./modules/compute"
 project_id = "${packet_project.anycast_test.id}"
 anycast_ip = "${packet_reserved_ip_block.anycast_ip.address}"
 operating_system = "ubuntu_18_04"
 instance_type = "baremetal_0"
 facility = "sjc1"
 compute_count = "2"
}
module "compute_nrt" {
 source = "./modules/compute"
 project_id = "${packet_project.anycast_test.id}"
 anycast_ip = "${packet_reserved_ip_block.anycast_ip.address}"
 operating_system = "ubuntu_18_04"
 instance_type = "baremetal_0"
 facility = "nrt1"
 compute_count = "2"
}
module "compute_ams" {
 source = "./modules/compute"
 project_id = "${packet_project.anycast_test.id}"
 anycast_ip = "${packet_reserved_ip_block.anycast_ip.address}"
 operating_system = "ubuntu_18_04"
 instance_type = "baremetal_0"
 facility = "ams1"
 compute_count = "2"
}
module "compute_ewr" {
 source = "./modules/compute"
 project_id = "${packet_project.anycast_test.id}"
 anycast_ip = "${packet_reserved_ip_block.anycast_ip.address}"
 operating_system = "ubuntu_18_04"
 instance_type = "baremetal_0"
 facility = "ewr1"
 compute_count = "2"
}

I created a module to define the compute node, with that we can easily create many of them in the various datacenters. In the example above I created two instances in the following locations: San Jose (sjc1), Tokyo (nrt1), Amsterdam (ams1) and New York (ewr1).

Let’s take a quick look at the compute module:

variable facility { }
variable project_id { }
variable compute_count { }
variable operating_system { }
variable instance_type { }
variable anycast_ip { }
#variable bgp_password { }

resource "packet_device" "compute-server" {
 hostname = "${format("compute-%03d", count.index)}.${var.facility}"
 count = "${var.compute_count}"
 plan = "${var.instance_type}"
 facilities = ["${var.facility}"]
 operating_system = "${var.operating_system}"
 billing_cycle = "hourly"
 project_id = "${var.project_id}"

provisioner "local-exec" {
 command = "scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null scripts/create_bird_conf.sh root@${self.access_public_ipv4}:/root/create_bird_conf.sh"
 }

provisioner "local-exec" {
 command = "scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null scripts/web.go root@${self.access_public_ipv4}:/root/web.go"
 }

 provisioner "local-exec" {
 command = "scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null scripts/start.sh root@${self.access_public_ipv4}:/root/start.sh"
 }
 
 provisioner "local-exec" {
 command = "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null root@${self.access_public_ipv4} 'bash /root/start.sh ${var.anycast_ip} > /dev/null 2>&1 '"
 }
}

#Enable BGP for the newly creates compute node
resource "packet_bgp_session" "test" {
 count = "${var.compute_count}"
 device_id = "${packet_device.compute-server.*.id[count.index]}"
 address_family = "ipv4"
}

Alright, that's quite a bit of text; luckily, most of it is pretty descriptive. The code above creates the compute node we requested at the specified datacenter. You will also see a few local-exec statements; these instruct Terraform to execute a script locally as soon as the instance is created. In this case, I used it to secure copy (scp) some files over and run start.sh on the compute node. Once you have more tasks, it will be easier and cleaner to replace this with an Ansible playbook, but this will work just fine for now.
The start.sh script installs Bird (the BGP daemon), generates the correct Bird BGP config, restarts Bird and starts our Golang web service. Finally, we instruct Terraform to enable BGP on the Packet side for each of our compute nodes.

And we’re done! With this, we should be able to deploy our infrastructure using the commands below, and within 5 minutes we should have our application deployed globally with an Anycast IP.
So enough talking, let’s ask Terraform to build our infrastructure:

terraform plan
terraform apply

While we wait, let’s keep an eye on the Packet portal to see what global IP we’ve been allocated. In my case, it’s 147.75.40.35/32.

Building a high available Anycast application in 5 minutes on Packet

Once Terraform is done you’ll see something like this.
Apply complete! Resources: 18 added, 0 changed, 0 destroyed.

In this case, the 18 resources include: 1 Project, 1 Global IP, 8 compute nodes and 8 BGP sessions. In my case that took just 5 minutes! That’s amazing, eight new bare-metal machines, fully deployed, BGP running with an Anycast IP and my App deployed to various datacenters around the world, load balanced and all. A few years ago that was almost unthinkable.

The Packet portal also shows the eight compute instances I requested with the hostnames and unicast IP addresses.

Building a high available Anycast application in 5 minutes on Packet

Time to test

Alright, now that this is all deployed, we should be all done! Let's test our Go web application from my local machine; it should be reachable via http://147.75.40.35 … boom! There it is.

Building a high available Anycast application in 5 minutes on Packet

Testing Anycast

Cool, it works! I'm hitting the compute node in San Jose, which is as expected, since that is closest to Vancouver, Canada. Next up, let's test from a few locations around the world to make sure anycast routing works as expected. The screenshot below shows tests from a handful of servers outside the Packet network. Here I'm testing from Palo Alto, New York, Amsterdam, and New York.

Building a high available Anycast application in 5 minutes on Packet
Curl tests from around the world to test Anycast Load Balancing

Now let’s say our web application all of a sudden becomes popular in Japan and in order to keep up with demand we need to scale up our resources there. Doubling our capacity in Tokyo is easy. All we do is edit the cluster.tf Terraform file and change the compute count number from 2 to 4.

Building a high available Anycast application in 5 minutes on Packet
Scaling out our capacity in Tokyo to 4 nodes

Run terraform plan and terraform apply, wait 5 minutes, and voila. A few minutes later, I see that clients hitting the Tokyo datacenter are indeed being load balanced over four instances (compute-000.nrt1, compute-001.nrt1, compute-002.nrt1, compute-003.nrt1) instead of the previous two.

Summary

In this post we looked at deploying our little Golang App to 4 datacenters around the world with two instances per datacenter. We demonstrated how we can scale up deployments and how clients are being routed to the closest datacenter to guarantee low latency and saw requests being load balanced between the instances.

Even though we used actual bare-metal machines, the experience is mostly the same as deploying VMs. I used the smallest instance type available to keep costs down. The Tiny But Mighty t1-small that I used has a 4-core (Atom C2550) CPU, 8GB of RAM, SSD storage, and 2.5Gbps network connectivity for just $0.07 per hour, great for pet projects like these.
Packet provides many different configurations; should you need more powerful hardware, make sure to check out all the options here.

The power of Terraform really shines for use cases like this. Kudos to Packet for its excellent Terraform support; it’s pretty cool to be able to do all of the above, including requesting a global Anycast IP, with Terraform. The complete deployment takes about five minutes, no matter how many datacenters or instances are involved.

Defining and deploying applications like we did here, using infrastructure as code and Anycast, used to be reserved for large CDNs, DNS operators, and major infrastructure providers. Packet now makes it easy for everyone to build highly available, low-latency infrastructure. Working with Packet has been a great experience, and I recommend giving it a spin.

]]>
<![CDATA[Experimenting with StackPath Edge Computing and Anycast]]>

http://toonk.io/experimenting-with-stackpath-edge-computing-and-anycast/5f53e5733680d46e73ec61a9Mon, 11 Feb 2019 00:00:00 GMT

I love tinkering with new tech, and since it was a cold and snowy weekend here in Vancouver, I figured I’d explore Edge Containers from Stackpath. Earlier this week, I read Stackpath’s blog in which they describe their new service: Introducing containers and virtual machines at the edge.

Update June 21, 2020: a follow-up article is available here: Building a global anycast service in under a minute

The service offering resonated strongly with me, mostly because of the ability to deploy your containers or VMs to many of their locations in one operation, whether that’s an API call or a click in the portal. That, combined with the fact that Stackpath called out BGP Anycast support in the blog post (for those who know me: I’m a BGP geek), convinced me to give it a spin.

Brand new Feature

So instead of building a snowman, I created an account with stackpath.com. I soon found out that their new “Workload” feature was still gated, but within an hour or so Stackpath unlocked it for me, and it showed up in the Stackpath portal.

Update Feb 12 2019: Stackpath advised me that the “Workload” feature is no longer gated and is now available to all users.

Creating workloads

I’ll briefly describe how I created the workload, but also make sure to check out their documentation here.
I gave my workload the name “stackpath-anycast-fun” and used the Docker image “atoonk/stackpath-fun:latest”; more about that later. Next up, I exposed port 80 and defined the compute spec for each container. Also note that I selected “Add Anycast IP Address.”

Experimenting with StackPath Edge Computing and Anycast

Next up, I defined which locations to deploy the workload to. This is grouped by continent; in my case, I used Europe and North America. In each continent, I chose two PoPs: Amsterdam and London in Europe, and New York and Los Angeles in North America. With that, my deployment should be live in four datacenters globally, great for my Anycast testing! Finally, I made sure to bring up two instances in each datacenter so that I could test the load balancing (ECMP routing) within a datacenter.

Experimenting with StackPath Edge Computing and Anycast

Pricing

Experimenting with StackPath Edge Computing and Anycast

At the end, it nicely showed me the pricing. Since I have a particular interest in Anycast, I looked specifically at the pricing for that, which at 14 cents per hour comes to roughly $100 per month. All in all, for these eight instances plus Anycast, the price comes to about $627 per month.

Creating a Docker image

Earlier on, I defined the Docker image that I wanted to deploy. Stackpath pulls these images from Docker Hub; in my case, I defined atoonk/stackpath-fun:latest. So what does this container do? Basically, it’s a container with Apache + PHP and a little script that prints the hostname using gethostname();. This allows me to test, later on, which instance I’m hitting by requesting a webpage.
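The script itself is tiny. Recreating it would look roughly like the sketch below; the exact output formatting inside the real atoonk/stackpath-fun image is an assumption on my part, but server.php is the path we’ll curl later on.

#Recreate the hostname-printing script served by Apache (a sketch)
cat > server.php <<'EOF'
<?php
// Print the container's hostname so tests reveal which instance answered
echo gethostname() . "\n";
EOF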

Start the workload

Now that we have a Docker image on Docker Hub and the workload is created, it’s time to test it. After you create the workload in the Stackpath portal, it takes about two minutes for all instances to come online. Eventually, it will look like this:

Experimenting with StackPath Edge Computing and Anycast

Cool, all my container instances went from Starting to Running. In addition to an external IP for each container, I also have an Anycast IP: 185.85.196.10

Experimenting with StackPath Edge Computing and Anycast

Time to test Anycast

I can hit all the containers via their external IP addresses using my browser. Next up, I checked to see whether the Anycast address worked, and yes, it did! The example below shows I’m hitting one of the LAX instances, which is expected since that’s the closest location to my home in Vancouver.
I refreshed my browser a few times and indeed saw it flip between the two Los Angeles instances I created: stackpath-anycast-fun-na1-lax-0 and stackpath-anycast-fun-na1-lax-1 (note the last digit).

Experimenting with StackPath Edge Computing and Anycast

Global testing

Next up, it’s time to test from a few more locations to make sure the global Anycast setup works as expected. To do so, I ran a curl loop from some shell servers in Amsterdam, London, New York, and Seattle, like this:

time for i in `seq 1 15`; do curl -s 185.85.196.10/server.php | grep stackpath-anycast-fun; done
Experimenting with StackPath Edge Computing and Anycast
Curl tests from around the world to test Anycast Load Balancing

The screenshot above shows the full high-availability and load-balancing story in one picture.
My Amsterdam server is hitting the Amsterdam containers and being load balanced between the containers there. Seattle is routed to Los Angeles as expected, and similarly, the New York client is being served by the two containers in New York (JFK).
The one that stands out, though, is my London shell server. That client is following the Anycast route to New York. That’s odd, especially since I have clusters deployed in both Amsterdam and London. Also note the last line, which shows that completing these requests took over two seconds, while the client in Amsterdam finished its requests in about 200ms.

The London routing mystery

I did a bit of poking to understand why the London server is going to New York instead of staying local in London. Looking at the routing table in London, I see the following:

> 185.85.196.0/24
   *[BGP/170] 3d 13:19:37, MED 38020, localpref 100
      AS path: 3356 209 33438 I, validation-state: unverified
    [BGP/170] 23:46:15, MED 2011, localpref 100
      AS path: 2914 1299 33438 I, validation-state: unverified
    [BGP/170] 2w0d 09:36:24, MED 30, localpref 100
      AS path: 3257 1299 33438 I, validation-state: unverified

The above indicates that this router learns the path to 185.85.196.10 via three ISPs: Level3 (AS3356), NTT (AS2914), and GTT (AS3257). The router selected the path via Level3, and Level3 in turn uses Centurylink (AS209) to reach Stackpath (AS33438).
This tells us that Stackpath (AS33438) is buying connectivity from Centurylink in the USA. Since Level3 prefers the US path via Centurylink, all Level3 customers in Europe are routed to the US, New York in this case. Announcing Anycast routes out of the US without scoping them (i.e., making sure those announcements stay within the US) will inevitably cause the Anycast routing challenges we see here.

Update Feb 12 2019: The Anycast routing issue has since been resolved by the Stackpath network team. Traffic from my test client in London is now correctly routed to the container cluster in London.

Summary

In summary, it was pretty easy to get going with Stackpath’s container service. It’s cool to be able to deploy your workloads around the world with just a single click. The icing on the cake for me is the global load balancing with Anycast.
Anycast used to be a feature used mostly by DNS operators and CDN folks. However, over the last two years or so, we’ve seen infrastructure providers such as Packet and AWS make it available to their customers directly. With these features, folks can now deploy their highly available services globally without necessarily relying on CDNs or DNS tricks (or better, build their own DNS tricks).

I think the pricing is a bit steep, but I’m guessing that since it’s a new feature with possibly limited supply, this is one way to get to market while ramping up capacity. The other thing to keep in mind is that folks like Packet and Stackpath are focusing on the edge: they are deployed in some of the most expensive and most popular datacenters around the world. This gives their customers great low-latency connectivity, but space and power are limited in these facilities, possibly pushing up the pricing.

Going forward, I’d love to see some more features around monitoring, metrics, and logging for my containers.
Realizing this is still early days, my first experience with Stackpath was a joy: it was easy to use, the support was great (thanks Ben and Sean), and I look forward to what’s next.

]]>