It's an age-old debate: routing versus switching.
It seems the debate has been settled, at least within enterprise networking. Switching has won. But is enterprise networking the better for it? I don't think so.
Enterprises are on a big virtualization kick. VMWare, most commonly, brings great benefits to the enterprise infrastructure. Greater utilization of hardware represents real cost savings; this can't be argued. In a well engineered virtualization solution, it can make more efficient use of CPU, memory, network bandwidth, and storage (bandwidth and capacity). Unfortunately, this comes at a cost.
VMWare, for enterprise use, includes the notion of VMotion, where a virtual machine or guest can be moved, live, between different VMWare servers or hosts. (Other virtualization products have similar features. I'll be referring to VMWare almost exclusively, but the issues are the same with the others.) This move can be made with the guests fully up and running, and they will only experience a very brief hiccup in operation. This is a great boon to enterprises as it allows the guests to be evacuated from a host server so that the server can be shut down and have physical maintenance work done on it.
So what is the cost to enterprise networking to enable this capability? Well, if you think about it, how do you arrange it so that a guest can move from one physical server to another and continue running without a hiccup? The environment that the guest sees when it reaches its destination has to be the same as at the source location, or at the very least, that state must be quickly recoverable. For networking, with current practices, this means that the network connectivity at the destination has to be the same as at the source. That, then, means that you have to have the same layer 2 ethernet network at the destination as at the source.
This is the predominant driver for building large layer 2 ethernet networks within the enterprise. I argue that, by making this the dominating design concept, enterprises are building what are otherwise very poor network designs to accommodate VMWare.
First, though, some history of ethernet. Ethernet was originally designed as a CSMA/CD (Carrier Sense Multiple Access with Collision Detection) technology. Even state-of-the-art ethernet based network technologies of today behave, in some ways, as if they are still on a CSMA/CD physical network. The most important property of the original CSMA/CD design is that every frame sent on an ethernet network was seen by every network card connected to that network. Network Interface Cards (NICs) have built-in circuitry to pass along only frames whose destination MAC address the NIC is servicing, so that the operating system running on the server with that NIC only sees packets destined for that system. Ethernet switching was designed to take advantage of the fact that the switch could learn where specific MAC addresses are connected to the network. Because the switch forwards frames only to the port that connects to the NIC with the destination address, the rest of the NICs don't have to filter out those extra frames and, more importantly, can be sending or receiving data at the same time, thereby increasing the overall throughput that the network can sustain. This is the foundational capability for the modern enterprise network.
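The learn-then-forward behavior described above can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the class name, the port numbers, and the shortened MAC strings are made up for illustration, and it ignores VLANs, broadcast addresses, and table aging.

```python
class LearningSwitch:
    """Toy model of a transparent learning switch (no VLANs, no aging)."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port where it was last seen

    def receive(self, in_port, src_mac, dst_mac):
        """Return the set of ports the frame is sent out of."""
        # Learn: the source MAC is reachable through the ingress port.
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood everywhere except the ingress port.
            return self.ports - {in_port}
        if out_port == in_port:
            # Destination lives on the segment the frame came from: filter it.
            return set()
        return {out_port}


switch = LearningSwitch(ports=[1, 2, 3, 4])
first = switch.receive(1, "aa:aa", "bb:bb")  # bb:bb not learned yet
reply = switch.receive(2, "bb:bb", "aa:aa")  # aa:aa was learned on port 1
again = switch.receive(1, "aa:aa", "bb:bb")  # bb:bb is now known on port 2
print(first, reply, again)
```

Note that the very first frame goes out of every other port. That flooding case is the part of the old shared-medium behavior that never went away.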
So, modern ethernet network technologies have to be backward compatible with CSMA/CD in some way. If you think about it, while it would be crazy to do, you could still connect a port on your current, modern ethernet switch to an old-school 10 or 100 megabit ethernet hub (not switch!), or even, beyond that, to 10BASE2 or 10BASE5 coaxial wiring, which physically embodies CSMA/CD.
What does this mean for modern and next-generation ethernet technologies? One important behavior from CSMA/CD still exists. If a frame is sent, and the switches haven't yet learned where the destination MAC address connects, then the technologies have to forward those frames in a way that those frames arrive at every device in the ethernet network. Traditionally, Spanning-Tree Protocol (STP) of some variety was used to create a loop free topology that reached every device on the network (a tree, that spans to every device, thus the name), and frames with unknown destination MAC addresses were forwarded along this tree. Modern technologies that are being developed will be more intelligent about how this tree (or multiple trees) is built, but fundamentally, they still are forwarding frames along a tree structure, with all of the replication of the frames that is implied by that.
Why the history lesson? Because it has important implications for the efficiency and reliability of layer 2 ethernet networks. Whatever protocol or technology is used to build the tree that frames are forwarded on, it requires all of the switches to come to a coherent, consistent view of the topology that they are going to use. Why? Because to not do so risks creating loops in the network, and for ethernet, loops are catastrophic. Ethernet frames don't have a time-to-live field, so a loop in an ethernet network will result in ethernet frames circling the network indefinitely - until the loop is broken. The consumed bandwidth is only the first order effect. Beyond that, as more frames get inserted into the loop, eventually one or more of the links reaches its capacity, the switches start buffering, latency increases, and eventually frames start getting dropped. Most likely, before you get to that point, the switches get brought to their knees by the frames caught up in the loop that the switch CPU has to process. That brings down not just the connectivity that passes through the links involved in the loop, but all connectivity that passes through the switches involved in the loop, even over links that aren't part of it. This all means that if anything goes wrong in your loop prevention protocol, if the switches involved, for any reason, come to different ideas of what the topology of the network should be, your whole layer 2 ethernet topology is at risk of crashing down.
But what about layer 3 routing? Routing is susceptible to all of this as well, right? Well, sort of. First, IP routing doesn't fall back to forwarding along a tree. If the router doesn't know the location of the destination, the packet just gets dropped. Even in the presence of default routes, the packets only get forwarded along a single path. Loops can be created between routers as well, but they aren't catastrophic at layer 3. Sub-optimal, absolutely, but not catastrophic. IP packets do have a time-to-live field that is decremented at each hop, so packets do eventually get flushed out of the network and don't circle forever, meaning you also don't generally see the follow-on effects of bandwidth and CPU exhaustion. Also, loops are much less likely in routing, because IP routing doesn't depend on a coherent view of the whole topology to get packets to their destination in the way that ethernet switching does.
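To make the difference concrete, here is a deliberately tiny simulation of a single frame or packet stuck in a forwarding loop. The function name and the `max_hops` cap are my own inventions; the cap exists only so the simulation itself terminates, since a real ethernet loop has no such limit.

```python
def hops_in_loop(ttl=None, max_hops=1000):
    """Count how many times one frame or packet is forwarded inside a loop.

    ttl=None models an ethernet frame, which carries no hop limit.
    An integer models an IP packet whose TTL is decremented at every router.
    max_hops is an artificial cap that stops the simulation.
    """
    hops = 0
    while hops < max_hops:
        if ttl is not None:
            ttl -= 1
            if ttl <= 0:
                break  # the router discards the packet instead of forwarding
        hops += 1
    return hops


ip_hops = hops_in_loop(ttl=64)   # bounded by the initial TTL
frame_hops = hops_in_loop()      # bounded only by the simulation cap
print(ip_hops, frame_hops)
```

The IP packet is gone after a number of hops bounded by its starting TTL, while the ethernet frame keeps going for as long as the simulation lets it. Multiply that by every flooded or broadcast frame entering the loop and you get the meltdown described above.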
So the takeaway from all of this is that ethernet switching can never be as robust and reliable as IP routing. Yes, this is comparing apples and oranges to some degree, but given the age-old debate between switching and routing and how they influence network design, it's a comparison that needs to be made.
So, what about VMWare?
Again, while I'm picking on VMWare here, it really is just a stand-in for the bigger picture problem of, "How can I move a server and not have to re-IP it, and deal with all of the fall-out which that entails?"
The biggest problem with IP networking (particularly on ethernet) is that there are three classifications of IP addresses that the stack has to deal with: localhost, on-link, and remote. Packets addressed to the localhost (not necessarily just the IP address 127.0.0.1) are easy, just process them in software. But when a server sends an IP packet onto a network, it takes its own source address and the destination address, does a quick bitwise AND of each against its netmask, and if the resulting values are the same, decides that the destination is on-link, at which point it ARPs for the IP address (unless it's in the ARP cache already) and sends the frame directly to the destination system via its MAC address. On the other hand, if the results of those bitwise ANDs end up being different, then the destination address is not on-link, and the packet gets sent via a router by putting it on the network addressed to the router's MAC address, to be forwarded on. Yes, this is an oversimplification, and packet forwarding on systems can be more complex than that, but it is the general case for broadcast type networks.
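That on-link test is simple enough to write out. The following is a sketch using Python's standard `ipaddress` module; the function name `next_hop` and the documentation-range addresses are my own choices, and a real stack consults a full routing table rather than a single interface and gateway.

```python
import ipaddress


def next_hop(local_interface, destination, gateway):
    """Decide whether a destination is on-link or must go via the router.

    local_interface is the host's own address with prefix, e.g. "192.0.2.10/24".
    Returns a (classification, address to ARP for) tuple.
    """
    iface = ipaddress.ip_interface(local_interface)
    dst = ipaddress.ip_address(destination)
    mask = int(iface.netmask)
    # Bitwise AND of source and destination against the netmask.
    if int(iface.ip) & mask == int(dst) & mask:
        return ("on-link", str(dst))   # ARP for the destination itself
    return ("remote", gateway)         # ARP for the router instead


same_subnet = next_hop("192.0.2.10/24", "192.0.2.77", "192.0.2.1")
other_subnet = next_hop("192.0.2.10/24", "198.51.100.5", "192.0.2.1")
print(same_subnet, other_subnet)
```

The first branch is the one that ties a server's IP address to a particular layer 2 network, and it is the one the rest of this post is trying to get rid of.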
OK, so this post is too long already, what's the solution?
Simple, eliminate the need to send data to anything "on-link". You did notice the title of the overall blog, right? This definitely isn't mainstream thinking, but I think there's a good case to be made for it. With today's technologies (read: with today's VMWare limitations) the solution is to switch from running IP on an ethernet interface to running IP on a PPP interface, most likely using PPPoE. This can be implemented using technology that already exists in virtually every network device vendor's stable, although not, perhaps, in common enterprise switches and routers. Also note that this isn't specific to VMWare guests, as it can be run exactly the same way with physical hardware servers. There are two caveats here. First, VMWare will expect its VMotion control traffic to be on the local network. Best practice typically has that traffic using a different NIC than the main networking for the guests, which means that it can easily be tunneled, so VMotion sees a local network, even across a WAN, without requiring the same for the guests' main network connectivity. Second, if a guest moves to a different network, the PPPoE link will drop and have to be re-established. This could happen very quickly, however, on par with the amount of loss associated with VMotioning a guest in general. Once the PPPoE session is re-established, all of the previous network connectivity is restored, even to the point of in-progress TCP connections continuing unharmed.
Those are the downsides, small though they may be, but what about other benefits to this concept? This is where this concept really shines as far as I'm concerned.
First, complete system mobility. Because the PPPoE session allows the individual IP address of the system to be signaled to the network, and the network can then adjust its routing to match, any system can move to any point on the network, bring up its PPPoE session, and get its IP address, whether that is somewhere else in the current blade chassis, across the data center, across the WAN, or even, conceivably, at a DR site.
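One way to picture "the network adjusts its routing to match" is as a host route that follows the system around. The sketch below is an assumption on my part about how that would look, not a description of any particular product: the `RoutingTable` class, the prefixes, and the concentrator names are all hypothetical, and in practice the host route would be injected and withdrawn by a routing protocol.

```python
import ipaddress


class RoutingTable:
    """Minimal longest-prefix-match table (illustrative only)."""

    def __init__(self):
        self.routes = {}  # ip_network -> next hop name

    def add(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def remove(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the packet is dropped
        # The most specific (longest) matching prefix wins.
        return self.routes[max(matches, key=lambda net: net.prefixlen)]


table = RoutingTable()
table.add("10.0.0.0/8", "core")
table.add("10.1.2.3/32", "concentrator-a")  # session is up at site A
before = table.lookup("10.1.2.3")

# The system moves: the session drops at A and comes back up at B.
table.remove("10.1.2.3/32")
table.add("10.1.2.3/32", "concentrator-b")
after = table.lookup("10.1.2.3")
print(before, after)
```

The system keeps its address, and only the /32 moves. Nothing about its neighbors or its subnet has to come along with it.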
Second, real, system-specific network auto-provisioning. With the use of 802.1X and DHCP, some semblance of network auto-provisioning can be accomplished in typical enterprise network designs, but there are always limitations: limitations of IP allocation based on the on-link vs remote behavior described above, limitations of static configuration of IP information, or dependence on DHCP reservations keyed on MAC addresses that might change with hardware replacements and the like. PPPoE allows for full, dynamic, reliable network auto-provisioning for systems. Set up the PPP session to authenticate with a username specific to that system, and do all of your network endpoint provisioning in a backend system such as a RADIUS server. You can even provision the endpoint to handle multiple IP addresses (Framed-Route in the RADIUS response) with this setup.
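As a rough illustration of what that backend provisioning amounts to, here is a sketch of the per-system data and the reply attributes it would turn into. This is not a RADIUS server, just the data model. The attribute names (Service-Type, Framed-Protocol, Framed-IP-Address, Framed-Route) are the standard ones from RFC 2865, while the `PROVISIONING` table, the usernames, and the addresses are made up for the example.

```python
# Hypothetical provisioning records, keyed on the PPP username of each system.
PROVISIONING = {
    "web01.example.com": {
        "address": "192.0.2.10",
        "extra_routes": ["192.0.2.64/29"],  # additional addresses behind it
    },
}


def access_accept_attributes(username):
    """Build the reply attributes for a system, or None if it is unknown."""
    record = PROVISIONING.get(username)
    if record is None:
        return None  # a real server would answer with an Access-Reject
    attrs = [
        ("Service-Type", "Framed"),
        ("Framed-Protocol", "PPP"),
        ("Framed-IP-Address", record["address"]),
    ]
    for prefix in record["extra_routes"]:
        # Framed-Route text is "prefix gateway metric".
        attrs.append(("Framed-Route", f"{prefix} {record['address']} 1"))
    return attrs


reply = access_accept_attributes("web01.example.com")
for name, value in reply:
    print(name, "=", value)
```

Everything that identifies the system on the network lives in that one record, keyed on something that doesn't change when a NIC or a chassis gets swapped.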
Third, robust support toolsets. While not commonly in use within enterprises today, the toolsets to support this sort of network design are well established and very robust. Internet Service Providers have been using these tools for decades to provide Internet access to their customers, even as far back as when dial-up networking was the predominant technology for getting on the Internet. The auditing and accounting technologies are already mature and reliable. In short, the full ecosystem of technologies to support this sort of network design already exists and is in extensive use...just not broadly in the enterprise.
Doesn't this sound a lot like Service Provider networking? Yes, yes it does, very much. And why not? Essentially, enterprise networking departments function as an ISP to the rest of the company, so why shouldn't they take advantage of some of the same tools to make the network work better? ISPs have figured out how to scale networks to a degree that enterprises can only seem to dream about. Enterprises need to jump on board and use those tools to accomplish the very same goals.
So what should be improved going forward?
There are a number of possibilities. VMWare (and other virtualization technologies) could implement the PPP connectivity from the guest to a vrouter within the VMWare core. This would eliminate the need for the PPPoE shim in the middle, reducing overhead slightly. The interface could appear on the guest as a traditional serial port, but one that's capable of running much faster than the traditional 115,200 baud. Such a VMWare vrouter would have to participate in the network routing protocols, like OSPF, but that's certainly achievable. Conceivably, this support could even eliminate the need to re-establish the PPP link after a server migration by migrating the state of the network connectivity along with the guest during VMotion.
PPPoE Access Concentrator capability would be beneficial in a lot more networking gear than currently supports it. I'm thinking particularly along the lines of typical layer 3 capable 1RU switches (Juniper EX4200, ProCurve 3500yl, Cisco 3750, Brocade/Foundry FESX, etc.), but other enterprise focused equipment as well.
While there are other improvements that I'm sure could be made, those are the two that go a long way to mitigating the few downsides of this solution as it exists.
In summary, I feel that the current rush to design huge layer 2 networks for enterprise networking is analogous to a bunch of lemmings heading off a cliff. We know how to scale a network up to amazing degrees. It's called "routing", and it's the technology that has let us build a network that has scaled to worldwide levels in the Internet, including system mobility, multihoming, and redundancy. Let's learn from those who have come before us on how to do things well.