Frank Denneman

MEMORY DEEP DIVE: MEMORY SUBSYSTEM ORGANISATION

This is part 2 of the memory deep dive, a series of articles that I wrote to share what I learned while documenting memory internals for large memory server configurations. This topic, amongst others, will be covered in the upcoming FVP book.

The memory deep dive series:
Part 1: Memory Deep Dive Intro
Part 2: Memory Subsystem Organisation
Part 3: Memory Subsystem Bandwidth
Part 4: Optimizing for Performance
Part 5: DDR4 Memory
Part 6: NUMA Architecture and Data Locality
Part 7: Memory Deep Dive Summary

Today’s CPU micro-architectures contain integrated memory controllers. The memory controller connects through a channel to the DIMMs. DIMM stands for Dual Inline Memory Module and contains the memory chips (DRAM chips) that each provide 4 or 8 bits of data. Dual Inline refers to pins on both sides of the module. Chips on the DIMM are arranged in groups called ranks that can be accessed simultaneously by the memory controller. Within a single memory cycle 64 bits of data will be accessed. These 64 bits may come from 8 or 16 DRAM chips (8 chips providing 8 bits each, or 16 chips providing 4 bits each), depending on how the DIMM is organized.

An Overview of Server DIMM types

There are different types of DIMMs, registered and unregistered. The unregistered DIMM (UDIMM) is targeted towards the consumer market and systems that don’t require support for very large amounts of memory. A UDIMM allows the memory controller to address each memory chip individually and in parallel. Each memory chip places a certain amount of capacitance on the memory channel and weakens the signal. As a result, only a limited number of memory chips can be used while maintaining stable and consistent performance.

Servers running virtualized enterprise applications require a high concentration of memory. However, with these high concentrations, the connection between the memory controller and the DRAM chips can overload, causing errors and delays in the flow of data. CPU speeds increase and therefore memory speeds have to increase as well.
Consequently, higher memory bus speeds lead to data flooding the channel faster, resulting in more errors. To increase scale and robustness, a register is placed between the DRAM chips and the memory controller. This register, sometimes referred to as a buffer, isolates the control lines between the memory controller and each DRAM chip. This reduces the electrical load, allowing the memory controller to address more DRAM chips while maintaining stability. Registered DIMMs are referred to as RDIMMs.

Load Reduced DIMMs (LRDIMMs) were introduced in the third generation of DDR memory (DDR3) and buffer both the control and data lines from the DRAM chips. This decreases the electrical load on the memory controller, allowing for denser memory configurations. The increased memory capacity leads to increased power consumption; however, by implementing the buffer structure differently, LRDIMMs provide substantially higher operating data rates than RDIMMs in the same configuration.

The key to the increased capacity and performance of LRDIMMs is the abstraction of the DRAM chips, and especially the rank count, by the buffer. The register on an RDIMM only buffers the command and address lines while leaving the more important data bus unbuffered. This leaves the groups of DRAM chips (ranks) exposed to the memory controller. A memory controller accesses the grouped DRAM chips simultaneously. A quad-rank DIMM configuration presents four separate electrical loads on the data bus per DIMM. The memory controller can handle only a certain amount of load, and therefore the number of exposed ranks is limited. LRDIMMs scale to higher speeds by using rank multiplication, where multiple ranks appear to the memory controller as a logical rank of a larger size.

DIMM Ranking

DIMMs come in three rank configurations: single-rank, dual-rank or quad-rank. Ranks are denoted as (xR). Together the DRAM chips grouped into a rank provide 64 bits of data.
If a DIMM contains DRAM chips on just one side of the printed circuit board (PCB), containing a single 64-bit chunk of data, it is referred to as a single-rank (1R) module. A dual-rank (2R) module contains at least two 64-bit chunks of data, one chunk on each side of the PCB. Quad-rank DIMMs (4R) contain four 64-bit chunks, two chunks on each side.

To increase capacity, combine the ranks with the largest DRAM chips. A quad-ranked DIMM with 4Gb chips equals a 32GB DIMM (4Gb x 8bits x 4 ranks). As server boards have a finite number of DIMM slots, quad-ranked DIMMs are the most effective way to achieve the highest memory capacity. As mentioned before, there are some limitations when it comes to the number of ranks used in a system. Memory controllers use channels to communicate with DIMM slots and each channel supports a limited number of ranks due to maximal capacitance.

Memory Channel

Modern CPU microarchitectures support triple or quadruple memory channels. These multiple independent channels increase data transfer rates due to concurrent access of multiple DIMMs. When operating in triple-channel or quad-channel mode, latency is reduced due to interleaving. The memory controller distributes the data amongst the DIMMs in an alternating pattern, allowing the memory controller to access each DIMM for smaller pieces of data instead of accessing a single DIMM for the entire chunk of data. This provides the memory controller with more bandwidth for accessing the same amount of data across channels, instead of traversing a single channel when all data is stored in one DIMM.

If the CPU supports triple-channel mode, it is enabled when three identical memory modules are installed in the separate channel DIMM slots. If two of the three channel slots are populated with identical DIMMs, then the CPU activates dual-channel mode. Quad-channel mode is activated when four identical DIMMs are put in quad-channel slots.
When three matched DIMMs are used in quad-channel CPU architectures, triple-channel mode is activated; when two identical DIMMs are used, the system will operate in dual-channel mode.

LRDIMM rank aware controllers

With the introduction of LRDIMMs, memory controllers have been enhanced to improve the utilization of the LRDIMM memory capacity. Rank multiplication is one of these enhancements and improved latency and bandwidth tremendously. Generally, memory controllers of systems prior to 2012 were “rank unaware” when operating in rank multiplication mode. Due to the onboard register on the DIMM, the controller was unaware whether a rank was on the same DIMM, so it had to account for the time to switch between DRAMs on the same bus. This resulted in lower back-to-back read transaction performance, sometimes up to a 25% performance penalty. Many tests have been done between RDIMMs and LRDIMMs operating at the same speed. In systems with rank-unaware memory controllers you can see a performance loss of 30% when comparing LRDIMMs and RDIMMs.

Systems after 2012 are referred to as generation 2 DDR3 platforms and contain controllers that are aware of the physical ranks behind the data buffer. This allows the memory controller to adjust the timings and provide better back-to-back reads and writes. Gen 2 DDR3 systems reduce the latency gap between RDIMMs and LRDIMMs, but most importantly they reduce the bandwidth gap. Please be aware of this difference when reading memory reviews posted on the net by independent hardware review sites. Verify the date of the publication to understand whether they tested a rank-aware or a rank-unaware system.

DDR4 LRDIMMs improve latencies even further due to the use of distributed data buffers. DDR4 memory is covered in part 5 of this series.

Pairing DIMMs per Memory Channel

Depending on the DIMM slot configuration of the server board, multiple DIMMs can be used per channel.
If one DIMM is used per channel, this configuration is commonly referred to as 1 DIMM Per Channel (1 DPC). If two DIMMs are used per channel, it is referred to as 2 DPC, and if three DIMMs are used per channel, it is referred to as 3 DPC.

Figure 1: DPC configurations and channels

The diagram illustrates different DPC configurations; please note that balanced DIMM population (same number and type of DIMMs in each channel) is generally recommended for the best overall memory performance. The configuration displayed above is non-functional; do not try to repeat it.

However, there are some limitations to channels and ranking. To achieve more memory density, higher capacity DIMMs are required. As you move up in gigabytes of memory per DIMM, you are forced to move up in the ranks of memory. For example, single-rank and dual-rank RDIMMs have a maximum capacity per DIMM of 16GB. DDR3 32GB RDIMMs are available in quad rank (QR). Recently 64GB DIMMs have been made available, but only in LRDIMM format.

Memory rank impacts the number of DIMMs supported per channel. Modern CPUs can support up to 8 physical ranks per channel. This means that if a large amount of capacity is required, quad-ranked RDIMMs or LRDIMMs should be used. When using quad-ranked RDIMMs, only 2 DPC configurations are possible, as 3 DPC equals 12 ranks, which exceeds the 8 ranks per channel limit of current systems.

Maximum RDIMM configuration (256 GB per CPU)

When comparing 32GB LRDIMMs and 32GB quad-rank RDIMMs, it becomes apparent that LRDIMMs allow for higher capacity while retaining the bandwidth. For example, a Gen 12 Dell R720 contains two Intel Xeon E5 2600 CPUs, allowing up to 1.5TB of RAM. The system contains 24 DIMM slots and allows up to 64GB DDR3 DIMMs at up to 1866 MHz. Dell’s memory configuration samples only contain configurations up to 1600 MHz.
Table 1: Total capacity configuration of RDIMMs and LRDIMMs
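
The capacity and rank arithmetic in this article can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names are made up for this example, ECC chips are ignored, and the 4 channels with 3 slots per channel are assumed because that matches a two-socket, 24-slot board. The 64-bit rank width and the 8 ranks per channel limit come from the text above. Note that the 32GB figure for a quad-rank DIMM works out when each rank is built from sixteen 4Gb chips that provide 4 bits each.

```python
from dataclasses import dataclass

DATA_BUS_BITS = 64          # bits accessed per memory cycle (from the article)
MAX_RANKS_PER_CHANNEL = 8   # physical ranks per channel quoted in the article


@dataclass(frozen=True)
class Dimm:
    """Hypothetical description of an RDIMM; ECC chips are ignored."""
    chip_density_gbit: int  # capacity of one DRAM chip in gigabits
    chip_width_bits: int    # data bits provided per chip: 4 or 8
    ranks: int              # 1, 2 or 4

    @property
    def chips_per_rank(self) -> int:
        # A rank must supply 64 bits, so 16 chips of 4 bits or 8 chips of 8 bits.
        return DATA_BUS_BITS // self.chip_width_bits

    @property
    def capacity_gb(self) -> int:
        total_gbit = self.chip_density_gbit * self.chips_per_rank * self.ranks
        return total_gbit // 8  # gigabits to gigabytes


def max_dimms_per_channel(dimm: Dimm, slots_per_channel: int = 3) -> int:
    """DIMMs that fit in one channel without exceeding the rank limit."""
    return min(slots_per_channel, MAX_RANKS_PER_CHANNEL // dimm.ranks)


def max_capacity_per_cpu_gb(dimm: Dimm, channels: int = 4,
                            slots_per_channel: int = 3) -> int:
    """Largest capacity per CPU when every channel is populated identically."""
    return channels * max_dimms_per_channel(dimm, slots_per_channel) * dimm.capacity_gb


quad_rank = Dimm(chip_density_gbit=4, chip_width_bits=4, ranks=4)
dual_rank = Dimm(chip_density_gbit=4, chip_width_bits=4, ranks=2)

print(quad_rank.capacity_gb, max_dimms_per_channel(quad_rank),
      max_capacity_per_cpu_gb(quad_rank))
print(dual_rank.capacity_gb, max_dimms_per_channel(dual_rank),
      max_capacity_per_cpu_gb(dual_rank))
```

With these assumptions the quad-rank RDIMM comes out at 32GB, is limited to 2 DPC, and gives 256GB per CPU, which matches the maximum RDIMM configuration mentioned above. The sketch does not model LRDIMM rank multiplication.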

Wed, Feb 18, 2015 memory

NEW TPS MANAGEMENT CAPABILITIES

Recently VMware decided that it’s best to change Transparent Page Sharing (TPS) behavior. In KB 2080735 they state the following: Although VMware believes the risk of TPS being used to gather sensitive information is low, we strive to ensure that products ship with default settings that are as secure as possible. For this reason new TPS management options are being introduced and inter-Virtual Machine TPS will no longer be enabled by default in ESXi 5.5, 5.1, 5.0 Updates and the next major ESXi release. Administrators may revert to the previous behavior if they so wish.

Mon, Feb 2, 2015 vmware

KB 2104983 EXPLAINED: DEFAULT BEHAVIOR OF DRS HAS BEEN CHANGED TO MAKE THE FEATURE LESS AGGRESSIVE

Yesterday a couple of tweets in my timeline discussed the DRS behavior mentioned in KB article 2104983. The article is terse at best, therefore I thought, let’s discuss this a little more in depth. During normal operation DRS uses an upper limit of 100% utilization in its load-balancing algorithm. It will never migrate a virtual machine to a host if that migration results in a host utilization of 100% or more. However, this behavior can prolong the time to upgrade all the hosts in the cluster when using the cluster maintenance mode feature in vCenter Update Manager (parallel remediation). To reduce the overall remediation time, vSphere 5.5 contains an increased limit for cluster maintenance mode and uses a default setting of 150%. This can impact the performance of the virtual machines during the cluster upgrade. vCenter Server 5.5 Update 2d includes a fix that allows users to override the default and specify a value in the range between 40% and 200%. If no change is made to the setting, the default of 150% is used during cluster maintenance mode. Please note that normal load-balancing behavior in vSphere 5.5 still uses a 100% upper limit for the utilization calculation.
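
To make the limits concrete, here is a minimal Python sketch of the rule described above. It is not the DRS algorithm and it does not use any vSphere API; the function names and the way utilization is computed are hypothetical and only illustrate the 100% load-balancing limit, the 150% maintenance-mode default, and the 40% to 200% override range. Rejecting an out-of-range override with an error is an assumption of this sketch; the KB article does not describe how such values are handled.

```python
from typing import Optional

LOAD_BALANCING_LIMIT = 100        # percent, normal DRS load balancing
MAINTENANCE_DEFAULT_LIMIT = 150   # percent, cluster maintenance mode default
OVERRIDE_MIN, OVERRIDE_MAX = 40, 200


def effective_limit(maintenance_mode: bool, override: Optional[int] = None) -> int:
    """Return the utilization limit (percent) that applies in this situation."""
    if not maintenance_mode:
        # Normal load balancing keeps the 100% limit regardless of the override.
        return LOAD_BALANCING_LIMIT
    if override is None:
        return MAINTENANCE_DEFAULT_LIMIT
    if not OVERRIDE_MIN <= override <= OVERRIDE_MAX:
        # Assumption: out-of-range values are rejected in this sketch.
        raise ValueError(f"override must be between {OVERRIDE_MIN} and {OVERRIDE_MAX}")
    return override


def migration_allowed(host_used: float, vm_demand: float,
                      host_capacity: float, limit_percent: int) -> bool:
    """A migration is refused if it pushes the host to the limit or beyond."""
    utilization_after = 100 * (host_used + vm_demand) / host_capacity
    return utilization_after < limit_percent


# A host at 60% receiving a virtual machine that adds 40% would reach 100%.
print(migration_allowed(60, 40, 100, effective_limit(maintenance_mode=False)))
print(migration_allowed(60, 40, 100, effective_limit(maintenance_mode=True)))
```

The same migration is refused during normal load balancing and accepted during cluster maintenance mode, which is why remediation finishes faster but virtual machine performance can suffer.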

Wed, Jan 28, 2015 drs

MY WISH FOR 2015 - BETTER TOOLING TO PROVIDE BETTER INSIGHT

I’ve seen this image pop up quite a bit in my Twitter timeline this week and it’s a very recognizable situation. Most of us have been in such a conversation; I know I was when I was an enterprise architect. Most tweets wish for this situation to change in 2015, and I totally agree. However, in my opinion it’s not the “app owner” who gives the wrong answer; it’s the wrong question to begin with. When I order bread at the bakery, the baker asks what kind of bread I want, not how he needs to operate and fine-tune his machinery in order to give me the product I want. Why do we think our industry, our service offering, is different?

In reality, can you expect app owners to truly understand the I/O characteristics of their application? Maybe they read all the documentation of the vendor, maybe they followed a couple of courses on how to configure and operate the application, or maybe they even got a few certifications under their belt. But in reality there are no classes or courses on the dynamics of the workload you are running. The application stack is merely a framework to deliver a service to the business you are servicing.

The dynamics of a workload are very complex, especially in a virtualized datacenter. Typically, enterprise applications do not generate a consistent workload pattern. These patterns are different when servicing users or when interacting with infrastructure services. During their life cycle, applications are updated and code and queries improve, impacting application behaviour. Pete Koehler wrote down his experience in his article “Using a new tool to discover old problems”.

Besides generating a variety of different workload patterns, applications are subject to change during their lifecycle. Changes in interaction and demand impact the underlying infrastructure differently over time. Typically an application experiences a lot of interaction during the test/dev/acceptance process before going into production.
After the introduction period, demand is low but increasing. During the maturity of the application, demand will peak. At some point the application will be replaced and phased out. During this phase workload demand will taper off, but the application still requires a particular level of service. During all these phases, the infrastructure needs to provide the service the organization demands.

And this is just an isolated case of one particular application. Typically the virtual datacenter infrastructure is shared. A virtual machine containing an application lands on a datastore on a storage array, and that datastore typically stores multiple virtual machines. The datastore is backed by a LUN, which is backed by multiple physical devices. Access to the devices is done via shared controllers, and the list continues all the way up the stack to the hypervisor. Maybe the application owner understands what type of I/O the application is generally producing, but the underlying stack will impact the performance. Can you ensure the application gets the performance it requires? Do you know if the infrastructure is capable of delivering the service the application requires? And what about the impact of the application on the infrastructure? How will introducing this application impact the currently active applications? Will it impact their service levels?

Therefore I believe that two things need to change: behavior and tooling. IT needs to switch from asking technical questions to asking functional questions. It’s better to understand the role and place of the application in the business process. Typically this provides insight into the availability, concurrency and response requirements of the application. The second thing that needs to change, and this is what I hope 2015 will bring, is better tooling that provides insight into workload characteristics: tooling that provides better analysis of application demand and its impact on infrastructure. At this stage, most tooling is ineffective in providing proper information.
Virtualized datacenters need tooling that provides a better view into the theatre of consumers and producers: tooling that provides a more holistic view of the application workload characteristics while being able to monitor the resource usage. Having such tools allows IT departments to operationalize and manage their environments much better, ensuring proper service levels while being able to understand the capability of the environment.

Looking at the current developments in the IT industry, it is incredibly difficult to predict what type of workloads (and especially in what form or on what platform) will hit the enterprise IT landscape in the next two years. Understanding what your environment truly delivers is a necessity when discussing future workloads. I think this is a necessary step for datacenter advancements. Once you know what’s going on, once you have proper tooling to provide better insights, you can feed this data into advanced algorithms that distribute the load across the infrastructure to provide the performance each application requires while optimizing resource utilization, all with the correct priority, aligned with business needs. This goes beyond today’s offerings such as DRS, Storage DRS and SIOC in vSphere datacenters and Mesos in container landscapes.

Tue, Dec 30, 2014 miscellaneous

INTERESTING IT RELATED DOCUMENTARIES

The holidays are upon us and for most it’s time to wind down. Maybe time for some nice thought-provoking documentaries before the food coma sets in. ;) Most of the documentaries listed here are created by Tegenlicht (Backlight). Backlight provides some of the best documentaries on Dutch television and luckily they made most of them available in English. The following list is a set of documentaries that impressed me. If you found some awesome documentaries yourself, please leave a comment.

Tech revolution on Wall Street

Backlight created a trilogy on the tech revolution on Wall Street over a period of three years. The most famous one is the second one, “Money and Speed, inside the black box”. It received multiple awards and although it’s the second documentary in the series of three, I recommend starting with that one. If you are intrigued by the impact of these algorithms, continue with the other two episodes, “Quants. The alchemists of Wall Street” and “Wall Street Code”. They almost make you feel like you are watching a thriller, highly recommended!

08.02.2010: Quants, The alchemists of Wall Street. English | Dutch
20.03.2013: Money & Speed: Inside the Black Box. English | Dutch
04.11.2013: Wall Street Code. English | Dutch

Extra video

08.02.2010: Quants, George Dyson. English
01.07.2011: Kevin Slavin: How algorithms shape our world. English (Ted Talk)

Unfortunately these two documentaries are Dutch only.

21.10.2013: Big Data, the Shell Search. Dutch
Tegenlicht investigates how big data can be used to penetrate closed strongholds. What do these enormous information flows reveal about a multinational like Shell?

1.10.2014: Zero Days veiligheidslekken te koop. Dutch
Tegenlicht takes you into the trade in ‘zero days’, unknown vulnerabilities in software or on the internet. A battle between ‘white hat’ and ‘black hat’ hackers determines our online security.

Although the voice-over is Dutch, most of this documentary is in English, so you might want to give it a try. It focuses on the legal trade in unknown security vulnerabilities, so-called zero days. Yes, your government is also acquiring these from hackers, all perfectly legal!

Enjoy! And of course I wish you all happy holidays!

Wed, Dec 24, 2014 miscellaneous

PLAYING TONIGHT: DRS AND THE IO CONTROLLERS

Ever wondered why the band is always mentioned second? Is the band replaceable? Is the sound of the instruments so ambiguous that you can swap out any musician for another? Apparently the front man is the headliner of the show, and if he does his job well he will never be forgotten. The people who truly recognize talent are the ones that care about the musicians. They understand that the artists backing the singer create the true sound of the song. And I think this is also the case when it comes to DRS and its supporting act, the I/O controllers, namely SIOC and NETIOC. If you do it right, the combination creates the music in your virtual datacenter, well, at least from a resource management perspective. ;)

Mon, Dec 1, 2014 drs ha

VCDX- YOU CANNOT ABSTRACT YOUR WAY OUT OF THINGS INDEFINITELY

The amount of abstraction in IT is amazing. Every level in the software and hardware stack attempts to abstract operations and details. And the industry is craving more. Look at the impact “All Things Software Defined” has on today’s datacenter. It touches almost every aspect, from design to operations. The user provides the bare minimum of inputs and the underlying structure automagically tunes itself to a working solution. Brilliant!

However, sometimes I get the feeling that this level of abstraction becomes an excuse to not understand the underlying technology. As an architect you need to do your due diligence. You need to understand the wheels and cogs that are turning when dialing a specific knob at the abstracted layer. But sometimes it seems that the abstraction level becomes the right to refuse to answer questions. This was always an interesting discussion during a VCDX defense session, when candidates argued that they weren’t aware of the details because other groups were responsible for that design. I tend to disagree.

What level of abstraction is sufficient?

I am in the lucky position to work with PernixData R&D engineers and, before that, VMware R&D engineers. They tend to go deep, right down to the core of things, discussing every little step of a process. Is this the level of understanding of the applied technology and solutions that an architect needs? I don’t think so. It’s interesting to know, but on a day-to-day basis you don’t have to understand the function of ceiling when DRS calculates priority levels of recommendations. What is interesting is to understand what happens if you place a virtual machine at the same hierarchical level as a resource pool filled with virtual machines. What is the impact on the service levels of these various entities?

Something in the middle might be the NFS series of Josh Odgers. Josh goes in-depth about the technology involved in using NFS datastores.
Virtual SCSI hard drives are presented to virtual machines, even when ESXi is connected to an NFS datastore. How does this impact the integrity of I/Os? How does the SCSI protocol emulation process affect the write ordering of I/Os of business-critical applications? You as the virtual datacenter architect should be able to discuss the impact of using this technology with application owners. You should understand the potential impact a selected technology has on the various levels throughout the stack and what impact it has on the service it provides.

Recently I published a series on databases and the impact their workload characteristics have on storage architecture design. Understanding the position of a solution in the business process allows an architect to design a suitable solution. Let’s use the OLTP example. Typically OLTP databases are at the front of the process, in the customer-facing part; dramatically put, they are in the line of fire. When the OLTP database is performing slowly or is unavailable, it will typically impact revenue-generating processes. This means that latency is a priority, but so are concurrency and availability. You can then tailor your design to provide the best services to this application. This is just a simplified example, but it shows that you have to understand multiple aspects of the technology, not just the behavior of a single component. The idea is to get a holistic view and then design your environment to cater to the needs of the business, because that’s why we get hired.

Circling back to abstraction and the power of software defined, I thought the post from Bart Heungens was interesting. Bart argues that Software Defined Storage is not the panacea for all storage-related challenges, which is true. Bart illustrates an architecture that is composed of heterogeneous components. In his example, he illustrates what happens when you combine two HP DL380 servers from different generations.
The different generations are primarily noticeable from a storage controller perspective, and especially in the way the software behaves. This is interesting on so many levels, and it would be a very interesting discussion if this were a VCDX defense session. SDS abstracts many things, but it still relies on the underlying structure to provide the services. From a VCDX defense perspective, Bart has a constraint: the already available hardware and the requirement to use this hardware of different generations in his design. VCDX is not about providing the ideal design, but about showing how you deal with constraints and requirements, and demonstrating your expertise in the technology and how it impacts the requested solution. He didn’t solve the problem entirely, but by digging in deeper he managed to squeeze out performance to provide a better architecture to service the customer’s applications. He states the following:

Tue, Nov 11, 2014 vcdx

MS WORD STYLE FORMATTING SHORTCUT KEYS FOR MAC

Recently I started to spend a lot of time in MS Word again, and as a stickler for details I dislike a mishmash of font types throughout my document. I spend a lot of time on configuring the styles of the document, yet when I paste something from other documents, MS Word tends to ignore these. Correcting the format burns a lot of time and it simply annoys the crap out of me. To avoid this in the future, I started to dig around to find some font and style related shortcut keys. Yesterday I tweeted the shortcut key to apply the normal style, and by the looks of the retweets many of you are facing the same challenge. Below is a short list of shortcut keys that I use. There are many more, so share the common ones you like to use. As I use a Mac, I listed the Mac shortcut combinations. Replace CMD with CTRL if you are using MS Word on a Windows machine.

Select text:
Select all: CTRL+A
Select sentence: CMD + click
Select word: Double click
Select paragraph: Triple click

Formatting:
Clear formatting: CTRL+spacebar
Apply Normal Style: Shift+CMD+N
Header 1: CMD+ALT+1
Header 2: CMD+ALT+2
Header 3: CMD+ALT+3
Change Case: CMD+Option+C (repeat combination to cycle through options)
Indent paragraph: CTRL+Shift+M
Remove indent: CMD+Shift+M
Find and replace: F5

Mon, Oct 27, 2014 miscellaneous

99 CENTS PROMO TO CELEBRATE A MAJOR MILESTONE OF THE VSPHERE CLUSTERING DEEPDIVE SERIES

This week Duncan was looking at the sales numbers of the vSphere Clustering Deep Dive series and he noticed that we hit a major milestone in September. In September 2014 we passed 45000 distributed copies of the vSphere Clustering Deep Dive. Duncan and I never expected this, or even dared to dream of hitting this milestone. When we first started writing the 4.1 book we had discussions around what to expect from a sales point of view and we placed a bet: I would be happy if we sold 100 books, Duncan was more ambitious with 400 books. Needless to say, we reset our expectations many times since then… We didn’t really follow it closely in the last 12-18 months, and as we were discussing a potential update of the book today, we figured it was time to look at the numbers again just to get an idea. 45000 copies distributed (ebook + printed) is just remarkable. We’ve noticed that the ebook is still very popular, and decided to do a promo. As of Monday the 13th of October the 5.1 e-book will be available for only $0.99 for 72 hours; after those 72 hours the price will go up to $3.99, and after another 72 hours it will be back to the normal price. So make sure to get it while it is low priced! Pick it up here on Amazon.com! The only other Kindle store we could open the promotion up for was amazon.co.uk, so that is also an option!

Thu, Oct 9, 2014 vmware

DATABASE WORKLOAD CHARACTERISTICS AND THEIR IMPACT ON STORAGE ARCHITECTURE DESIGN – PART 3 - ANCILLARY STRUCTURES FOR TUNING DATABASES

Welcome to part 3 of the Database workload characteristics series. Databases are considered to be one of the biggest I/O consumers in the virtual infrastructure. Database operations and database design are a study unto themselves, but I thought it might be interesting to take a small peek underneath the surface of database design land. I turned to our resident database expert Bala Narasimhan, PernixData’s director of products, to provide some insights about database designs and their I/O preferences. Previous instalments of the series:

Mon, Sep 29, 2014 miscellaneous

…