Virtualization
OP-TEE has experimental virtualization support: a single OP-TEE instance can run TAs from multiple virtual machines. OP-TEE isolates all VM-related state, so one VM cannot affect another in any way.
With virtualization support enabled, OP-TEE relies on a hypervisor, because only the hypervisor knows which VM is calling OP-TEE. The hypervisor must also inform OP-TEE about the creation and destruction of VMs. In addition, in almost all cases the hypervisor enables two-stage MMU translation, so VMs do not see the real physical addresses of memory; instead they work with intermediate physical addresses (IPAs). OP-TEE cannot translate an IPA to a PA, so this translation is the hypervisor's responsibility. The hypervisor should therefore include a component that knows the OP-TEE protocol internals and can perform this translation. We call this component the “TEE mediator”; at present only the Xen hypervisor has an OP-TEE mediator.
Configuration
Virtualization support is enabled with the CFG_VIRTUALIZATION configuration option. When this option is enabled, OP-TEE will not work without a compatible hypervisor, because the hypervisor must send the OPTEE_SMC_VM_CREATED SMC with a VM ID before any standard SMC can be received from a client.
CFG_VIRT_GUEST_COUNT controls the maximum number of supported VMs. As OP-TEE has a limited amount of memory available, increasing this count decreases the amount of memory available to each VM. Because we want VMs to be independent, OP-TEE splits the available memory into equal portions, one per VM, so that one VM cannot consume all memory and cause a denial of service (DoS) for the other VMs.
Requirements for hypervisor
As noted earlier, the hypervisor should be aware of OP-TEE and of SMCs from virtual guests to OP-TEE. A compatible hypervisor should do the following:
- When a new OP-TEE-capable VM is created, the hypervisor should inform OP-TEE about it with the OPTEE_SMC_VM_CREATED SMC. The a1 parameter should contain the VM ID. ID 0 is defined as HYP_CLNT_ID and is reserved for the hypervisor itself.
- When an OP-TEE-capable VM is being destroyed, the hypervisor should stop all of its VCPUs (this ensures that OP-TEE has no active threads for that VM) and send the OPTEE_SMC_VM_DESTROYED SMC with the same parameters as for OPTEE_SMC_VM_CREATED.
- Any SMC to OP-TEE should carry the VM ID in the a7 parameter. This is either HYP_CLNT_ID if the call originates from the hypervisor, or the VM ID that was passed in the OPTEE_SMC_VM_CREATED call.
- The hypervisor should perform IPA<->PA address translation for all SMCs. This covers both the arguments in the a1-a6 registers and the in-memory command buffers.
- The hypervisor should pin the memory pages that a VM shares with OP-TEE. That is, the hypervisor should ensure that a pinned page stays at its original PA for as long as it is shared with OP-TEE, and that it still belongs to the VM that shared it. For example, the hypervisor should not swap the page out, transfer its ownership to another VM, unmap it from the VM address space, and so on.
Naturally, the hypervisor should correctly handle the OP-TEE protocol, so for any VM it should look like it is working with OP-TEE directly.
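To make the register conventions from the list above more concrete, here is a minimal sketch of what a mediator might do with the SMC arguments. It is not the Xen mediator: the `SKETCH_SMC_*` values, the `smc_args` structure, the `addr_mask` parameter and the `demo_ipa_to_pa()` lookup are all assumptions made for illustration. Only the use of a1 and a7 for the VM ID, and the reserved ID 0, come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ID 0 is reserved for the hypervisor itself. */
#define HYP_CLNT_ID 0

/* Placeholder function IDs; the real values live in the OP-TEE SMC headers. */
#define SKETCH_SMC_VM_CREATED   0x1001u
#define SKETCH_SMC_VM_DESTROYED 0x1002u

/* The eight SMC argument registers a0..a7 (a0 holds the function ID). */
struct smc_args {
	uint64_t a[8];
};

/* Hypothetical stage-2 lookup; a real hypervisor walks its page tables. */
typedef bool (*ipa_to_pa_fn)(uint32_t vm_id, uint64_t ipa, uint64_t *pa);

/* Build a VM created/destroyed notification: VM ID in a1, caller ID in a7. */
struct smc_args mediator_vm_event(uint64_t func_id, uint32_t vm_id)
{
	struct smc_args args = { .a = { 0 } };

	assert(vm_id != HYP_CLNT_ID); /* guests never use the reserved ID */
	args.a[0] = func_id;
	args.a[1] = vm_id;
	args.a[7] = HYP_CLNT_ID;      /* the call originates from the hypervisor */
	return args;
}

/*
 * Forward a guest SMC. Bit i of addr_mask (an assumption of this sketch)
 * marks a[i] as holding an IPA that must be translated.
 */
bool mediator_forward(uint32_t vm_id, const struct smc_args *guest,
		      unsigned int addr_mask, ipa_to_pa_fn xlate,
		      struct smc_args *out)
{
	*out = *guest;
	for (int i = 1; i <= 6; i++) {
		if (!(addr_mask & (1u << i)))
			continue;
		if (!xlate(vm_id, guest->a[i], &out->a[i]))
			return false; /* reject calls with unmapped addresses */
	}
	out->a[7] = vm_id; /* overwrite whatever the guest put in a7 */
	return true;
}

/* Toy translation for demonstration: each VM gets a fixed PA window. */
bool demo_ipa_to_pa(uint32_t vm_id, uint64_t ipa, uint64_t *pa)
{
	if (vm_id == HYP_CLNT_ID)
		return false;
	*pa = 0x80000000ULL + (uint64_t)vm_id * 0x10000000ULL + ipa;
	return true;
}
```

Note that the sketch always overwrites a7 rather than trusting the guest's value, since the hypervisor is the only party that knows which VM issued the call. Translating in-memory command buffers and pinning pages are not shown.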
Limitations
Virtualization support is experimental and has some limitations that users should be aware of.
Platform support
Only the Armv8 architecture is supported. There is no hard restriction, but the Armv7-specific code (such as MMU or thread manipulation) currently knows nothing about virtualization. Only one platform has been tested so far: QEMU-V8 (QEMU emulating an Arm Versatile Express with the Armv8 architecture). Support for R-Car Gen3 should be added soon.
Static VM guest count and memory allocation
Currently, the user must configure the maximum number of guests. OP-TEE splits memory into equal chunks, so every VM gets the same amount of memory. For example, if you have 6MB for your TAs, you can set CFG_VIRT_GUEST_COUNT to 3 and every VM will be able to use at most 2MB, even if no other VMs are running. This is acceptable for embedded setups where the exact number and roles of the VMs are known, but can be inconvenient for server applications. It is also impossible to configure the amount of memory available to a particular VM: every VM instance has exactly the same amount of memory.
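The arithmetic behind the 6MB example can be written down as a small helper. This is only a sketch of the equal split described above; `per_vm_ta_memory()` is a hypothetical name, and the calculation ignores alignment and any per-VM bookkeeping overhead that a real build may have.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: size of the TA memory slice given to each VM. */
size_t per_vm_ta_memory(size_t total_bytes, unsigned int guest_count)
{
	/* guest_count stands in for CFG_VIRT_GUEST_COUNT, assumed >= 1. */
	assert(guest_count > 0);

	/* Every guest gets the same share, whether or not the others run. */
	return total_bytes / guest_count;
}
```

With 6MB in total and a guest count of 3, the helper yields 2MB per VM. Raising the guest count to 6 would halve that to 1MB, which is the trade-off to keep in mind when choosing the value.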
No compatibility with “normal” mode
OP-TEE built with CFG_VIRTUALIZATION=y will not work without a hypervisor, because OPTEE_SMC_VM_CREATED must be called before any standard SMC is executed. This can be inconvenient if one wants to switch between virtualized and non-virtualized environments frequently. On the other hand, it is not a big deal in a production environment. A simple workaround could be implemented for this: if OP-TEE receives a standard SMC prior to OPTEE_SMC_VM_CREATED, it implicitly creates a VM context and uses it for all subsequent calls.
Implementation details
OP-TEE as a whole can be split into two entities. Let us call them “nexus” and TEE. The nexus is the core part of OP-TEE that takes care of low-level things: SMC handling, memory management, thread creation and so on. The TEE is the part that does the actual job: it handles requests, loads TAs, executes them, and so on. It is therefore natural to have one nexus instance and multiple TEE instances, one per registered VM. This can be done either explicitly or implicitly.
The explicit way is to move the TEE state into some sort of structure and make all code access the fields of this structure, much like struct task_struct and current in the Linux kernel. It is then easy to allocate such a structure for every VM instance. However, this approach essentially requires rewriting all of the OP-TEE code.
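For illustration, the explicit approach might look roughly like the sketch below. The `tee_state` structure, its fields and the `current_tee` pointer are invented for this example and do not exist in OP-TEE; the point is only that every access to former global state has to be rewritten to go through the pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for what is global state in the TEE code today. */
struct tee_state {
	uint32_t vm_id;
	unsigned int open_sessions;
};

/* Plays the role of "current" in the Linux kernel. */
static struct tee_state *current_tee;

/* Called by the core when it starts serving a particular VM. */
void tee_set_current(struct tee_state *state)
{
	current_tee = state;
}

/* Former global accesses now have to be routed through current_tee. */
unsigned int tee_open_session(void)
{
	assert(current_tee != NULL);
	return ++current_tee->open_sessions;
}
```

Each VM would own one such structure, so switching VMs is a single pointer update. The cost is that functions like `tee_open_session()` have to be touched everywhere in the code base, which is why the implicit approach is attractive.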
The implicit way is to have banked memory sections for the TEE/VM instances, so the memory layout can look something like this:
+-------------------------------------------------+
| Nexus: .nex_bss, .nex_data, ...                 |
+-------------------------------------------------+
| TEE states                                      |
|                                                 |
| VM1 TEE state | VM 2 TEE state | VM 3 TEE state |
| .bss, .data   | .bss, .data    | .bss, .data    |
+-------------------------------------------------+
This approach requires no changes to the TEE code and only some changes to the nexus code. The idea is that the nexus state resides in separate sections (.nex_data, .nex_bss, .nex_nozi, .nex_heap and others) and is always mapped.
The TEE state resides in the standard sections (such as .data, .bss, .heap and so on). There is a separate set of these sections for every registered VM, and the nexus maps them only when it receives a call from the corresponding VM.
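As a rough illustration of how a variable ends up in a nexus section rather than a standard one, the sketch below uses the GCC `section` attribute. The `SKETCH_NEX_*` macros and the variable names are made up for this example, and the code assumes a GCC-compatible compiler targeting ELF. The linker script that actually groups and banks these sections is not shown.

```c
#include <assert.h>

/* Hypothetical annotation macros; GCC-compatible compiler and ELF assumed. */
#define SKETCH_NEX_BSS  __attribute__((section(".nex_bss")))
#define SKETCH_NEX_DATA __attribute__((section(".nex_data")))

/* Nexus state: placed in the always-mapped sections, shared by all VMs. */
unsigned int nexus_guest_count SKETCH_NEX_BSS = 0;
unsigned int nexus_max_guests SKETCH_NEX_DATA = 3;

/* TEE state: ordinary .bss, so each VM would see its own banked copy. */
unsigned int tee_open_sessions;

/* Nexus code only touches nexus state, which is mapped at all times. */
unsigned int nexus_register_guest(void)
{
	assert(nexus_guest_count < nexus_max_guests);
	return ++nexus_guest_count;
}
```

The TEE variable needs no annotation at all, which is what "no changes to the TEE code" amounts to in practice: only state that must survive across VM switches has to be marked.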
As the nexus and the TEE have separate heaps, the bget allocator was extended to work with multiple “contexts”. malloc(), free() and friends work with one context. nex_malloc() (and other nex_ functions) were added; they use a different context, so the nexus can use a separate heap that is always mapped into the OP-TEE address space. When virtualization support is disabled, all nex_ functions are defined to point to their standard malloc() counterparts.
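The idea of allocator contexts can be sketched with a toy bump allocator. This is not the bget API: `heap_ctx`, `ctx_alloc()`, the `sketch_*` functions and the `SKETCH_VIRTUALIZATION` switch are all stand-ins invented for this example, and nothing here frees memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stands in for CFG_VIRTUALIZATION=y; remove to see the aliasing below. */
#define SKETCH_VIRTUALIZATION 1

/* Hypothetical allocator context: one per heap. */
struct heap_ctx {
	uint8_t *base;
	size_t size;
	size_t used;
};

/* Toy bump allocation from one context; sizes are rounded up to 8 bytes. */
void *ctx_alloc(struct heap_ctx *ctx, size_t len)
{
	size_t rounded = (len + 7) & ~(size_t)7;
	void *p;

	assert(ctx->used <= ctx->size);
	if (rounded < len || rounded > ctx->size - ctx->used)
		return NULL; /* overflow or out of memory in this context */
	p = ctx->base + ctx->used;
	ctx->used += rounded;
	return p;
}

static uint8_t tee_pool[256]; /* per-VM heap in the real layout */
static struct heap_ctx tee_heap = { tee_pool, sizeof(tee_pool), 0 };

/* The "standard" allocation path uses the TEE context. */
void *sketch_malloc(size_t len)
{
	return ctx_alloc(&tee_heap, len);
}

#ifdef SKETCH_VIRTUALIZATION
static uint8_t nexus_pool[256]; /* always-mapped heap */
static struct heap_ctx nexus_heap = { nexus_pool, sizeof(nexus_pool), 0 };

/* The nex_ variant uses its own context. */
void *sketch_nex_malloc(size_t len)
{
	return ctx_alloc(&nexus_heap, len);
}
#else
/* Without virtualization the nex_ variant is just the standard one. */
#define sketch_nex_malloc(len) sketch_malloc(len)
#endif
```

Because the two paths draw from different pools, exhausting the TEE heap of one VM leaves nexus allocations unaffected.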
To change memory mappings at run time, a new entity named “partition” was added to the MMU code. It is defined by struct mmu_partition and holds information about all page tables, so the whole MMU mapping can be switched with a single write to the TTBR register.
The default partition holds the MMU state when no VM context is active, so no TEE state is mapped. When OP-TEE receives an OPTEE_SMC_VM_CREATED call, it copies the default partition into a new one and then maps the sections containing TEE data. This is done by the prepare_memory_map() function in virtualization.c.
When OP-TEE receives an STD call, it checks that the supplied VM ID is valid and then activates the corresponding MMU partition, so the TEE code can access its own data. This is basically how virtualization support works.
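The flow described above (copy the default partition on VM creation, validate the VM ID on an STD call, then switch partitions) can be modelled in a few lines. This is a simulation, not OP-TEE code: the `partition` structure is a drastically reduced stand-in for struct mmu_partition, the TTBR is an ordinary variable, and all function names and table addresses are invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_GUESTS 3 /* stands in for CFG_VIRT_GUEST_COUNT */

/* Reduced, hypothetical stand-in for struct mmu_partition. */
struct partition {
	uint64_t ttbr;   /* base of this partition's page tables */
	bool tee_mapped; /* true once the VM's TEE sections are mapped */
};

struct guest {
	bool in_use;
	uint32_t vm_id;
	struct partition part;
};

static const struct partition default_part = { 0x1000, false };
static struct guest guests[MAX_GUESTS];
static uint64_t fake_ttbr = 0x1000; /* simulated TTBR register */

uint64_t current_ttbr(void)
{
	return fake_ttbr;
}

struct guest *find_guest(uint32_t vm_id)
{
	for (size_t i = 0; i < MAX_GUESTS; i++)
		if (guests[i].in_use && guests[i].vm_id == vm_id)
			return &guests[i];
	return NULL;
}

/* VM created: clone the default partition, then "map" the TEE sections. */
bool vm_created(uint32_t vm_id)
{
	if (vm_id == 0 || find_guest(vm_id))
		return false; /* ID 0 is reserved; duplicates are rejected */
	for (size_t i = 0; i < MAX_GUESTS; i++) {
		if (guests[i].in_use)
			continue;
		guests[i].in_use = true;
		guests[i].vm_id = vm_id;
		guests[i].part = default_part;
		/* Made-up table address; a real port allocates page tables. */
		guests[i].part.ttbr = 0x2000 + 0x1000 * (uint64_t)i;
		guests[i].part.tee_mapped = true;
		return true;
	}
	return false; /* guest table is full */
}

/* STD call entry: validate the VM ID, then switch the whole mapping. */
bool std_call_enter(uint32_t vm_id)
{
	struct guest *g = find_guest(vm_id);

	if (!g)
		return false;
	assert(g->part.tee_mapped);
	fake_ttbr = g->part.ttbr; /* the single "TTBR write" */
	return true;
}
```

A call with an unknown VM ID is rejected before any mapping changes, so the TEE state of another VM is never exposed to it.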