VSPHERE 7 CORES PER SOCKET AND VIRTUAL NUMA
Regularly I meet with customers to discuss NUMA technology, and one of the topics that is always on the list is the Cores per Socket setting and its potential impact. In vSphere 6.5, we made some significant adjustments to the scheduler that allowed us to decouple NUMA client creation from the Cores per Socket setting. Before 6.5, if the vAdmin configured the VM with a non-default Cores per Socket setting, the NUMA scheduler automatically aligned the NUMA client configuration to that Cores per Socket setting, regardless of whether this configuration was optimal for performance in its physical surroundings.
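To make the decoupling concrete, here is a minimal sketch (illustrative only, not vSphere code) of the difference: before 6.5, NUMA client sizing followed the Cores per Socket setting; from 6.5 onward, the scheduler sizes clients to the physical NUMA node. The function names and the simplified sizing rule are assumptions for illustration.

```python
def numa_clients_pre_65(vcpus, cores_per_socket, pcores_per_node):
    """Pre-6.5 behavior: NUMA client size follows the Cores per Socket
    setting, even when that fragments the VM across the hardware."""
    client_size = cores_per_socket
    return [client_size] * (vcpus // client_size)

def numa_clients_65(vcpus, cores_per_socket, pcores_per_node):
    """6.5+ behavior: the scheduler sizes NUMA clients to the physical
    NUMA node, regardless of the Cores per Socket setting."""
    client_size = min(vcpus, pcores_per_node)
    clients, remaining = [], vcpus
    while remaining > 0:
        clients.append(min(client_size, remaining))
        remaining -= clients[-1]
    return clients

# A 16-vCPU VM configured as 16 sockets x 1 core, on 10-core NUMA nodes:
print(numa_clients_pre_65(16, 1, 10))  # sixteen single-vCPU clients
print(numa_clients_65(16, 1, 10))      # two clients sized to the hardware
```

The point of the sketch: with the old coupling, a cosmetic licensing-driven Cores per Socket choice could dictate the NUMA layout; after the decoupling, the hardware does.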
DRS THRESHOLD 1 DOES NOT INITIATE LOAD BALANCING VMOTIONS
vSphere 7.0 introduces DRS 2.0 and its new load-balancing algorithm. In essence, the new DRS is completely focused on taking care of the needs of the VMs, and it does this at a more aggressive pace than the old DRS. As a result, DRS resorts to vMotioning a virtual machine sooner than the previous DRS did, and this is something a lot of customers are noticing. In highly consolidated clusters, you might see a lot of vMotions occur. I perceive this as an infrastructural service. However, some customers might see this as a turbulent or nervous environment and would rather see fewer vMotions. As a result, these customers like to dial down the DRS threshold, which is a valid thing to do. But please be aware: if you still want DRS to provide load-balancing functionality, do not slide the threshold all the way to the left.
PROJECT MONTEREY AND THE NEED FOR NETWORK CYCLES OFFLOAD FOR ML WORKLOADS
VMworld has started, and that means a lot of new announcements. One of the most significant projects VMware is working on is Project Monterey. Project Monterey allows the use of SmartNICs, also known as Data Processing Units, from various VMware partners within the vSphere platform. Today we use the CPU inside the ESXi host to run workloads and to process network operations. With the shift towards distributed applications, the CPUs inside the ESXi hosts need to spend more time processing network IO instead of application operations. This extra utilization impacts data center economics like consolidation ratios and availability calculations. On top of this shift from monolithic applications to distributed applications comes the advent of machine-learning-supported services in the enterprise data center.
MACHINE LEARNING INFRASTRUCTURE FROM A VSPHERE INFRASTRUCTURE PERSPECTIVE
For the last 18 months, I’ve been focusing on machine learning, especially on how customers can successfully deploy machine learning infrastructure on a vSphere infrastructure. This space is exciting as it has so many great angles to explore. Besides the model training, a lot happens with the data. Data is transformed, data is moved. Data sets are often hundreds of gigabytes in size. Although that doesn’t sound like much compared to modern databases, these data sets are transformed and versioned. Where massive databases nest on an array, these data sets travel through pipelines that connect multiple systems with different architectures, from data lakes to in-memory key-value stores. As a data center architect, you need to think about the various components involved: where is the compute horsepower needed? How do you deal with an explosion of data? Where do you place particular storage platforms? What kind of bandwidth is needed? And do you always need extreme low-latency systems in your ML infrastructure landscape?
CPU PINNING IS NOT AN EXCLUSIVE RIGHT TO A CPU CORE!
https://twitter.com/MrsBrookfield/status/1402955235497287685 Katarina tweeted a very expressive tweet about her love/hate (mostly hate) relationship with CPU pinning, and lately I have been in conversations with customers contemplating whether they should use CPU pinning. The analogy I typically use to describe CPU pinning is the story of the favorite parking space in your office parking lot. CPU pinning limits the compliant CPU “slots” that vCPU can be scheduled on. So think of that CPU slot as the parking spot closest to the entrance of your office. You have decided that you only want to park in that spot. Every day of the year, that’s your spot and nowhere else. The problem is, this is not a company-wide directive. Anyone can park in that spot; you have just limited yourself to that spot only. So it can happen that Bob arrives at the office first and, lazy as he is, parks as close to the office entrance as he can. Right in your spot. Now the problem with your self-imposed rule is that you cannot and will not park anywhere else. So when you show up (late to the party), you notice that Bob’s car is in YOUR parking spot, and the only thing you can do is drive circles in some holding pattern until Bob leaves the office again. The stupidest thing: it’s Sunday, and you and Bob are the only ones doing some work. You’re out there in the parking lot, driving circles waiting until Bob leaves, while Bob is inside the empty building waiting on you to get started.
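The parking-lot analogy can be captured in a toy placement function (a sketch, not the ESXi scheduler): a vCPU pinned to core 0 must wait whenever core 0 is busy, even while every other core sits idle, whereas an unpinned vCPU simply takes any free core.

```python
def place(vcpu_affinity, busy_cores, total_cores=4):
    """Return the core this vCPU runs on, or None if it must wait.

    vcpu_affinity: set of allowed cores (the pinning), or None for
    no pinning (any core is a compliant "parking spot")."""
    allowed = vcpu_affinity if vcpu_affinity is not None else set(range(total_cores))
    free = [c for c in sorted(allowed) if c not in busy_cores]
    return free[0] if free else None

# Bob's workload is already running on core 0; cores 1-3 are idle.
busy = {0}
print(place({0}, busy))   # pinned vCPU: None -> waits (drives circles)
print(place(None, busy))  # unpinned vCPU: runs on core 1
```

Pinning does not reserve the core for you; it only removes every other option for yourself.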
VM SERVICE - HELP DEVELOPERS BY ALIGNING THEIR KUBERNETES NODES TO THE PHYSICAL INFRASTRUCTURE
The vSphere 7.0 U2a update released on April 27th introduces the new VM service and VM operator. Hidden away in what seems to be a trivial update is a collection of all-new functionalities. Myles Gray has written an extensive article about the new features. I want to highlight the administrative controls of VM classes of the VM service.
VM Classes
What are VM classes, and how are they used? With Tanzu Kubernetes Grid Service running in the Supervisor cluster, developers can deploy Kubernetes clusters without the help of the InfraOps team. Using their native tooling, they specify the size of the cluster control plane and worker nodes by using a specific VM class. The VM class configuration acts as a template used to define CPU and memory resources, and possibly reservations for these resources. These templates allow the InfraOps team to set guardrails for the consumption of cluster resources by these TKG clusters.
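The guardrail idea can be sketched as follows. Note that the class names, fields, and the lookup function below are hypothetical illustrations of the concept; they are not the actual VirtualMachineClass API.

```python
# Hypothetical catalog of VM classes curated by the InfraOps team.
# Each class: (vCPUs, memory in GiB, CPU reservation fraction).
VM_CLASSES = {
    "best-effort-small": (2, 4, 0.0),
    "guaranteed-medium": (4, 16, 1.0),
}

def node_spec(vm_class_name):
    """Resolve a class name chosen by a developer into a concrete node
    spec; only classes offered by InfraOps are accepted (the guardrail)."""
    if vm_class_name not in VM_CLASSES:
        raise ValueError(f"class {vm_class_name!r} not offered by InfraOps")
    cpus, mem_gib, reservation = VM_CLASSES[vm_class_name]
    return {"cpus": cpus, "memoryGiB": mem_gib, "cpuReservation": reservation}

# A developer sizes TKG worker nodes by naming a class, not raw resources:
print(node_spec("guaranteed-medium"))
```

The developer picks a name from a curated menu; the InfraOps team controls what each name means and which names exist.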
VSPHERE WITH TANZU VCENTER SERVER NETWORK CONFIGURATION OVERVIEW
I noticed quite a few network-related queries about the installation of vSphere with Tanzu together with vCenter Server Networks (Distributed Switch and HA-Proxy). The whole install process can be a little overwhelming when it’s your first time dealing with Kubernetes and load-balancing services. Before installing Workload Management (the user interface designation for running Kubernetes inside vSphere), you have to set up HA-Proxy, a virtual appliance for provisioning load balancers. Cormac did a tremendous job describing the configuration steps of all components. Still, when installing vSphere with Tanzu, I found myself mapping out the network topology to make sense of it all, since there seems to be a mismatch of terminology between the HA-Proxy and Workload Management configuration workflows.
WHAT'S YOUR FAVORITE TECH NOVEL?
Today I was discussing with Duncan some great books to read. Disconnecting fully from work is difficult for me, so typically the books I read are tech-related. I have read some brilliant books that I want to share with you, but mostly I want to hear your recommendations for other excellent tech novels.
Cyberwarfare
Cyberwarfare intrigues me, so any book covering Operation Olympic Games or Stuxnet interests me. One of the best books on this topic is “Countdown to Zero Day” by Kim Zetter. The book is filled with footnotes and references to corresponding research.
NEW WHITEPAPER AVAILABLE ON VSPHERE 7 DRS LOAD BALANCING
vSphere 7 contains the new DRS algorithm that differs tremendously from the old one. The performance team has put the new algorithm to the test and published a whitepaper presenting their findings. Read the white paper: Load Balancing Performance of DRS in vSphere 7.0.
VSPHERE 7 DRS SCALABLE SHARES DEEP DIVE
You are one tickbox away from completely overhauling the way you look at resource pools. Yes, you can still use them as folders (sigh), but with the newly introduced Scalable Shares option in vSphere 7, you can turn resource pools into something akin to Quality of Service classes. Sounds interesting, right? Let’s first take a look at the traditional workings of a resource pool, the challenges it introduced, and how this new method of resource distribution works. To understand that, we first have to look at the basics of how DRS distributes unreserved resources.
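The classic resource pool pitfall, and what scaling the shares changes, can be shown with a few lines of arithmetic. This is an illustrative sketch of the share-dilution problem, not the DRS implementation: with static shares, a pool keeps its configured share value no matter how many VMs it contains, so each VM's slice shrinks as the pool fills up; with scaled shares, the pool's effective shares grow with its VM count.

```python
def per_vm_fraction(pools, scalable=False):
    """pools: {name: (configured_shares, vm_count)}.
    Returns the fraction of contended cluster resources each individual
    VM in a pool is entitled to, assuming equal VMs within a pool."""
    effective = {name: shares * n if scalable else shares
                 for name, (shares, n) in pools.items()}
    total = sum(effective.values())
    return {name: effective[name] / total / n
            for name, (shares, n) in pools.items()}

# A "high" pool holding 8 VMs next to a "normal" pool holding 1 VM:
pools = {"high": (8000, 8), "normal": (4000, 1)}
static = per_vm_fraction(pools, scalable=False)
scaled = per_vm_fraction(pools, scalable=True)
print(static)  # the lone "normal" VM outranks each "high" VM (priority inversion)
print(scaled)  # each "high" VM is back to twice a "normal" VM's entitlement
```

With static shares, the single VM in the lower-priority pool ends up with a larger entitlement than any VM in the crowded high-priority pool; scaling the pool shares by VM count restores the intended 2:1 per-VM ratio.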