VMware TPS in shared VM environments

As many of you may know, VMware changed the default Transparent Page Sharing (TPS from now on) setting in the latest versions/updates of ESXi, specifically the behaviour of Inter-VM TPS. But what about Intra-VM TPS?

What is TPS?

TPS is a technique which, when enabled, lets the ESXi host reclaim used memory by searching for duplicate small pages (4 KB) and eliminating the copies, which potentially allows a higher VM density. It is an asynchronous (so not in-line/realtime) process that runs in the VMkernel and deduplicates memory within each NUMA node. There is also a second process, which only kicks in when the physical host is under memory pressure due to overcommitment or fragmentation. This second process breaks large pages (2 MB) up into small pages (4 KB) so that they can be shared. If TPS does not reclaim enough memory, or is disabled, ballooning kicks in, followed by compression and eventually swapping to disk.
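To make the deduplication idea a bit more concrete, here is a minimal toy sketch in Python. It is not how the VMkernel is implemented; it only assumes the general approach of hashing each small page to find candidates and then confirming a match with a full comparison. The function name `scan_pages` and the choice of SHA-1 are my own, purely for illustration.

```python
import hashlib
from collections import defaultdict

PAGE_SIZE = 4096  # small page size in bytes (4 KB)


def scan_pages(pages):
    """Toy page-sharing scan (hypothetical helper, not a VMware API).

    `pages` is a list of bytes objects, each PAGE_SIZE long.
    Returns (unique_pages_kept, pages_reclaimed).
    """
    buckets = defaultdict(list)  # content hash -> distinct pages with that hash
    reclaimed = 0
    for page in pages:
        digest = hashlib.sha1(page).digest()
        candidates = buckets[digest]
        # A hash match is only a hint: confirm with a full comparison
        # before treating two pages as identical.
        if any(page == existing for existing in candidates):
            reclaimed += 1  # duplicate: could be backed by one shared copy
        else:
            candidates.append(page)
    unique = sum(len(bucket) for bucket in buckets.values())
    return unique, reclaimed


# Six guest pages, only three distinct contents.
zero = bytes(PAGE_SIZE)
page_a = b"A" * PAGE_SIZE
page_b = b"B" * PAGE_SIZE
result = scan_pages([zero, zero, page_a, zero, page_a, page_b])
```

In this example three of the six pages are duplicates, so a scan like this could back them with the three remaining unique pages.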

Inter-VM TPS disabled

As mentioned before, VMware disabled Inter-VM TPS by default as of the latest updates of ESXi, primarily in response to security research, and in ESXi 6.0 it is disabled out of the box. Check this VMware KB to see in which versions the default and the additional TPS management features changed for your environment.

Why did VMware change this?
Well, VMware says:

The Risk:

Published academic papers have demonstrated that by forcing a flush and reload of cache memory, it is possible to measure memory timings to try and determine an AES encryption key in use on another virtual machine running on the same physical processor of the host server if Transparent Page Sharing is enabled between the two virtual machines. This technique works only in a highly controlled system configured in a non-standard way that VMware believes would not be recreated in a production environment.

Mitigation:

Even though VMware believes information being disclosed in real world conditions is unrealistic, out of an abundance of caution upcoming ESXi Update releases will no longer enable TPS between Virtual Machines by default (TPS will still be utilized within individual VMs).

Default or not, Inter-VM TPS can still be very useful in memory-overcommitted scenarios, especially in on-premises, single-organization server and VDI environments. How useful depends...

How to enable Inter-VM TPS?
By setting the host value Mem.ShareForceSalting to "0". Or, for groups of VMs (for example per cloud customer), by setting the host value Mem.ShareForceSalting to "1" and giving the per-VM setting Sched.Mem.Pshare.Salt the same salt value on every VM that belongs to the same customer.
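A rough way to think about how these two settings interact is sketched below in Python. This is my reading of the VMware KB, not an official algorithm: pages are only shared between VMs that end up with the same effective salt. The handling of the value 2 and the fallback to the VM's vc.uuid (or a random value) are assumptions that go beyond what is described in this post, and the function names are hypothetical.

```python
import uuid


def effective_salt(share_force_salting, vm_salt=None, vc_uuid=None):
    """Return the salt a VM would use for page sharing (illustrative model).

    share_force_salting: host value Mem.ShareForceSalting (0, 1 or 2).
    vm_salt: per-VM Sched.Mem.Pshare.Salt, or None when it is not set.
    vc_uuid: the VM's vc.uuid, or None when it is missing (assumption).
    """
    if share_force_salting == 0:
        return "0"  # salting ignored: all VMs on the host share one salt
    if vm_salt is not None:
        return vm_salt  # only VMs with the same configured salt can share
    if share_force_salting == 1:
        return "0"  # no per-VM salt set: falls back to host-wide sharing
    # Assumed behaviour for value 2 without a per-VM salt: a per-VM unique
    # value, so no Inter-VM sharing takes place.
    return vc_uuid if vc_uuid is not None else uuid.uuid4().hex


def can_share(salt_a, salt_b):
    """Two VMs can share pages with each other only if their salts match."""
    return salt_a == salt_b


# Two VMs of customer A and one VM of customer B, host value set to 1.
a1 = effective_salt(1, vm_salt="customer-a")
a2 = effective_salt(1, vm_salt="customer-a")
b1 = effective_salt(1, vm_salt="customer-b")
```

With these values, `can_share(a1, a2)` is true and `can_share(a1, b1)` is false. In every case a VM still has the same salt as itself, which is why Intra-VM sharing is unaffected by these settings.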

The table below is probably the clearest overview of how to achieve the required behavior.




What about Intra-VM TPS?

So while Inter-VM TPS is disabled by default, Intra-VM TPS is still enabled. Intra-VM means within a single VM: a running VM can still share identical pages with itself, as long as they are allocated within the same NUMA node. This keeps some of the benefits TPS gives you when overcommitting memory, just not ESXi host wide (actually NUMA node wide) anymore.

Now look at cloud/shared hosting environments where a single VM is shared between customers, for example:

  • Shared Docker/container VMs
  • Shared Apache/nginx/webserver VMs
  • Shared Database server VMs

Should you disable Intra-VM TPS in these scenarios? VMware states the risks are low, and it may be harder, perhaps even impossible, to exploit TPS in an Intra-VM scenario. But shouldn't you disable TPS anyway, just to be safe, so that a different customer or a hacker (who could simply rent a web virtual host or Docker container) cannot possibly exploit it?

I guess it depends on the environment and the type of customer. If you run on-premises, there are probably few companies with security guidelines that demand disabling TPS. But if you're a cloud container operator or web hoster, it's my opinion that you should at least inform the customer of these risks and/or give them the option to run without TPS (and without the memory sharing benefits). Or just disable TPS altogether, because it could also be difficult to explain the risks to a (potential) client.

How to disable Intra-VM TPS?
Well, looking at the table above showing the configuration options for TPS, there doesn't appear to be an option to disable Intra-VM TPS, which means it's always on! So that's tricky, because there may not be an official way to do this. Is there? No? Maybe there should be? :) But it is possible to disable TPS altogether, though the steps are a bit tedious!




So, IMHO, when to (not) use TPS?

For most workloads and organizations I think the table below shows when to use TPS; call it a best practice. But some organizations, like banks, federal government and healthcare, may not want to take any risk at all. They can simply disable TPS altogether (and buy sufficient RAM).

| Environment | Inter-VM TPS | Intra-VM TPS |
| --- | --- | --- |
| On-Premise Server/VDI | YES | YES |
| On-Premise Container/Web | YES | YES |
| (Public) Cloud/Shared Server/VDI | NO, or only for VMs from a single customer (VMs with same salt) | YES |
| (Public) Cloud/Shared Container/Web | NO, or only if entire VM is used for the same customer and only for VMs from a single customer (VMs with same salt) | NO, or only if entire VM is used for the same customer |

More info on TPS and the changes VMware made:

  1. http://blogs.vmware.com/apps/2014/10/disabling-tps-vsphere-impact-critical-applications.html
  2. https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2080735
  3. http://frankdenneman.nl/2015/02/02/new-tps-management-capabilities/
  4. http://www.yellow-bricks.com/2014/10/27/tps-disabled-default/
  5. https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1021095
  6. http://blogs.vmware.com/virtualreality/2011/02/hypervisor-memory-management-done-right.html
  7. https://www.vmware.com/files/pdf/mem_mgmt_perf_vsphere5.pdf pages 7-9