Discussion:
Standardizing an MSR or other hypercall to get an RNG seed?
Andy Lutomirski
2014-09-18 02:50:42 UTC
Hi all-

I would like to standardize on a very simple protocol by which a guest
OS can obtain an RNG seed early in boot.

The main design requirements are:

- The interface should be very easy to use. Linux, at least, will
want to use it extremely early in boot as part of kernel ASLR. This
means that PCI and ACPI will not work.

- It should be synchronous. We don't want to delay boot while
waiting for a slow host RNG. (On Linux, at least, we have a separate
interface for that: virtio-rng. I think that Windows has some support
for virtio-rng as well.)

- Random numbers obtained through this interface should be
best-effort. We want the best quality randomness that the host can
provide immediately.

It seems to me that the best interface for the actual request for a
random number is rdmsr. This is supported on all hypervisors and all
virtualization technologies. It can return a 64 bit random number,
and it is easy to rdmsr the same register more than once to get a
larger random number.
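
A minimal sketch of the "rdmsr more than once" idea, for illustration only. The MSR index is a placeholder (no index has been agreed on), and the rdmsr instruction is replaced by a callback so the sketch can run outside ring 0; seed_from_msr, rdmsr_fn and fake_rdmsr are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder index inside 0x40000000-0x4fffffff; not a proposal. */
#define HYPOTHETICAL_MSR_RNG_SEED UINT32_C(0x4fffff00)

/* Callback standing in for the privileged rdmsr instruction. */
typedef uint64_t (*rdmsr_fn)(uint32_t msr_index);

/* Fill buf with len bytes, reading the MSR once per 8 bytes.
 * Returns the number of reads that were needed. */
static size_t seed_from_msr(rdmsr_fn rdmsr, uint32_t msr_index,
                            uint8_t *buf, size_t len)
{
    size_t reads = 0;

    assert(rdmsr != NULL && (buf != NULL || len == 0));
    while (len > 0) {
        uint64_t value = rdmsr(msr_index);
        size_t chunk = len < sizeof(value) ? len : sizeof(value);

        memcpy(buf, &value, chunk);
        buf += chunk;
        len -= chunk;
        reads++;
    }
    return reads;
}

/* Deterministic stand-in for a host RNG, for demonstration only. */
static uint64_t fake_rdmsr(uint32_t msr_index)
{
    static uint64_t counter;

    (void)msr_index;
    return ++counter * UINT64_C(0x0101010101010101);
}
```

In a real guest the callback would wrap the rdmsr instruction, and the caller would first have to establish that the MSR exists, which is the enumeration question below.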

The main questions are what MSR index to use and how to detect the
presence of the MSR. I've played with two approaches:

1. Use CPUID to detect the presence of this feature. This is very
easy for KVM to implement by using a KVM-specific CPUID feature. The
problem is that this will necessarily be KVM-specific, as the guest
must first probe for KVM and then probe for the KVM feature. I doubt
that Hyper-V, for example, wants to claim to be KVM. If we could
standardize a non-hypervisor-specific CPUID feature, then this problem
would go away.

2. Detect the existence of the MSR by trying to read it and handling
the #GP(0) that will occur if the MSR is not present. Linux, at
least, is okay with doing this, and I have code to enable an IDT and
an rdmsr fixup early enough in boot to use it for ASLR. I don't know
whether other operating systems can do this, though.

The major questions, then, are what enumeration mechanism should be
used and what MSR index should be used.

For the MSR index, we could use an MSR from the Intel range if Intel
were to give explicit approval, thus guaranteeing that nothing would
conflict. Or we could try to agree on an MSR index in the
0x40000000-0x4fffffff range that is unlikely to conflict with
anything.

For enumeration, we could just probe the MSR if all relevant guests
are okay with this or we could standardize on a CPUID-based mechanism.
If we do the latter, I don't know what that mechanism would be.

NB: This thread will be cc'd to Microsoft and possibly Hyper-V people
shortly. I very much appreciate Jun Nakajima's help with this!

Thanks,
Andy
--
Andy Lutomirski
AMA Capital Management, LLC
H. Peter Anvin
2014-09-18 14:43:09 UTC
Post by Andy Lutomirski
The main questions are what MSR index to use and how to detect the
1. Use CPUID to detect the presence of this feature. This is very easy for
KVM to implement by using a KVM-specific CPUID feature. The problem is
that this will necessarily be KVM-specific, as the guest must first probe for
KVM and then probe for the KVM feature. I doubt that Hyper-V, for
example, wants to claim to be KVM. If we could standardize a non-
hypervisor-specific CPUID feature, then this problem would go away.
We would prefer a CPUID feature bit to detect this feature.
I guess if we're introducing the concept of pan-OS MSRs we could also
have pan-OS CPUID. The real issue is to get a single non-conflicting
standard.

-hpa
Andy Lutomirski
2014-09-18 15:38:02 UTC
Post by H. Peter Anvin
Post by Andy Lutomirski
The main questions are what MSR index to use and how to detect the
1. Use CPUID to detect the presence of this feature. This is very easy for
KVM to implement by using a KVM-specific CPUID feature. The problem is
that this will necessarily be KVM-specific, as the guest must first probe for
KVM and then probe for the KVM feature. I doubt that Hyper-V, for
example, wants to claim to be KVM. If we could standardize a non-
hypervisor-specific CPUID feature, then this problem would go away.
We would prefer a CPUID feature bit to detect this feature.
I guess if we're introducing the concept of pan-OS MSRs we could also
have pan-OS CPUID. The real issue is to get a single non-conflicting
standard.
Agreed.

KVM currently puts 0 in 0x40000000.EAX, meaning that a feature bit in
Microsoft's leaf 0x40000003 would probably not work well for KVM. I
don't expect that Microsoft wants to start claiming to be KVM for the
purpose of using a KVM-style feature bit, so, if we went the CPUID
route, we would probably need something new.

--Andy
Post by H. Peter Anvin
-hpa
--
Andy Lutomirski
AMA Capital Management, LLC
Andy Lutomirski
2014-09-18 15:44:29 UTC
Post by Andy Lutomirski
Post by H. Peter Anvin
Post by Andy Lutomirski
The main questions are what MSR index to use and how to detect the
1. Use CPUID to detect the presence of this feature. This is very easy for
KVM to implement by using a KVM-specific CPUID feature. The problem is
that this will necessarily be KVM-specific, as the guest must first probe for
KVM and then probe for the KVM feature. I doubt that Hyper-V, for
example, wants to claim to be KVM. If we could standardize a non-
hypervisor-specific CPUID feature, then this problem would go away.
We would prefer a CPUID feature bit to detect this feature.
I guess if we're introducing the concept of pan-OS MSRs we could also
have pan-OS CPUID. The real issue is to get a single non-conflicting
standard.
Agreed.
KVM currently puts 0 in 0x40000000.EAX, meaning that a feature bit in
Microsoft's leaf 0x40000003 would probably not work well for KVM. I
don't expect that Microsoft wants to start claiming to be KVM for the
purpose of using a KVM-style feature bit, so, if we went the CPUID
route, we would probably need something new.
Slight correction: QEMU/KVM has optional support for Hyper-V feature
enumeration. Ideally the RNG seed mechanism would be enabled by
default, but I don't know whether the QEMU maintainers would be okay
with enabling the Hyper-V cpuid mechanism in a default configuration.

--Andy
Post by Andy Lutomirski
--Andy
Post by H. Peter Anvin
-hpa
--
Andy Lutomirski
AMA Capital Management, LLC
--
Andy Lutomirski
AMA Capital Management, LLC
Paolo Bonzini
2014-09-18 15:58:25 UTC
Post by Andy Lutomirski
Slight correction: QEMU/KVM has optional support for Hyper-V feature
enumeration. Ideally the RNG seed mechanism would be enabled by
default, but I don't know whether the QEMU maintainers would be okay
with enabling the Hyper-V cpuid mechanism in a default configuration.
Some guests cannot find the KVM leaves at 0x40000100, so it wouldn't be
great. And I also don't know what VMware folks would think, but I think
they would be even less thrilled than me.

Note that even if there is no well-defined CPUID leaf, and the main
detection mechanism is #GP, each hypervisor is free to define a CPUID
bit of its own.

However, if it's going to be an architectural (Intel-defined) MSR, I
think the right place for a feature bit is in the low leaves (like
EAX=7, ECX=0).

Paolo
KY Srinivasan
2014-09-18 16:36:45 UTC
-----Original Message-----
From: Andy Lutomirski [mailto:luto at amacapital.net]
Sent: Thursday, September 18, 2014 8:38 AM
To: H. Peter Anvin
Cc: KY Srinivasan; Linux Virtualization; kvm list; Gleb Natapov; Paolo Bonzini;
Theodore Ts'o
Subject: Re: Standardizing an MSR or other hypercall to get an RNG seed?
Post by H. Peter Anvin
Post by Andy Lutomirski
The main questions are what MSR index to use and how to detect the
1. Use CPUID to detect the presence of this feature. This is very
easy for KVM to implement by using a KVM-specific CPUID feature.
The problem is that this will necessarily be KVM-specific, as the
guest must first probe for KVM and then probe for the KVM feature.
I doubt that Hyper-V, for example, wants to claim to be KVM. If we
could standardize a non- hypervisor-specific CPUID feature, then this
problem would go away.
Post by H. Peter Anvin
We would prefer a CPUID feature bit to detect this feature.
I guess if we're introducing the concept of pan-OS MSRs we could also
have pan-OS CPUID. The real issue is to get a single non-conflicting
standard.
Agreed.
KVM currently puts 0 in 0x40000000.EAX, meaning that a feature bit in
Microsoft's leaf 0x40000003 would probably not work well for KVM. I don't
expect that Microsoft wants to start claiming to be KVM for the purpose of
using a KVM-style feature bit, so, if we went the CPUID route, we would
probably need something new.
--Andy
I am copying other Hyper-V engineers to this discussion.

Regards,

K. Y
Post by H. Peter Anvin
-hpa
--
Andy Lutomirski
AMA Capital Management, LLC
Nakajima, Jun
2014-09-18 17:13:05 UTC
Post by KY Srinivasan
I am copying other Hyper-V engineers to this discussion.
Thanks, K.Y.

In terms of the address for the MSR, I suggest that you choose one
from the range between 40000000H - 400000FFH. The SDM (35.1
ARCHITECTURAL MSRS) says "All existing and
future processors will not implement any features using any MSR in
this range." Hyper-V already defines many synthetic MSRs in this
range, and I think it would be reasonable for you to pick one for this
to avoid a conflict?
--
Jun
Intel Open Source Technology Center
Paolo Bonzini
2014-09-18 17:17:44 UTC
Post by Nakajima, Jun
In terms of the address for the MSR, I suggest that you choose one
from the range between 40000000H - 400000FFH. The SDM (35.1
ARCHITECTURAL MSRS) says "All existing and
future processors will not implement any features using any MSR in
this range." Hyper-V already defines many synthetic MSRs in this
range, and I think it would be reasonable for you to pick one for this
to avoid a conflict?
KVM is not using any MSR in that range.

However, I think it would be better to have the MSR (and perhaps CPUID)
outside the hypervisor-reserved ranges, so that it becomes
architecturally defined. In some sense it is similar to the HYPERVISOR
CPUID feature.

Paolo
KY Srinivasan
2014-09-18 17:20:44 UTC
-----Original Message-----
From: Paolo Bonzini [mailto:paolo.bonzini at gmail.com] On Behalf Of Paolo
Bonzini
Sent: Thursday, September 18, 2014 10:18 AM
To: Nakajima, Jun; KY Srinivasan
Cc: Mathew John; Theodore Ts'o; John Starks; kvm list; Gleb Natapov; Niels
Ferguson; Andy Lutomirski; David Hepkin; H. Peter Anvin; Jake Oshins; Linux
Virtualization
Subject: Re: Standardizing an MSR or other hypercall to get an RNG seed?
Post by Nakajima, Jun
In terms of the address for the MSR, I suggest that you choose one
from the range between 40000000H - 400000FFH. The SDM (35.1
ARCHITECTURAL MSRS) says "All existing and future processors will not
implement any features using any MSR in this range." Hyper-V already
defines many synthetic MSRs in this range, and I think it would be
reasonable for you to pick one for this to avoid a conflict?
KVM is not using any MSR in that range.
However, I think it would be better to have the MSR (and perhaps CPUID)
outside the hypervisor-reserved ranges, so that it becomes architecturally
defined. In some sense it is similar to the HYPERVISOR CPUID feature.
Yes, given that we want this to be hypervisor agnostic.

K. Y
Nakajima, Jun
2014-09-18 17:42:09 UTC
Post by KY Srinivasan
-----Original Message-----
From: Paolo Bonzini [mailto:paolo.bonzini at gmail.com] On Behalf Of Paolo
Bonzini
Sent: Thursday, September 18, 2014 10:18 AM
To: Nakajima, Jun; KY Srinivasan
Cc: Mathew John; Theodore Ts'o; John Starks; kvm list; Gleb Natapov; Niels
Ferguson; Andy Lutomirski; David Hepkin; H. Peter Anvin; Jake Oshins; Linux
Virtualization
Subject: Re: Standardizing an MSR or other hypercall to get an RNG seed?
Post by Nakajima, Jun
In terms of the address for the MSR, I suggest that you choose one
from the range between 40000000H - 400000FFH. The SDM (35.1
ARCHITECTURAL MSRS) says "All existing and future processors will not
implement any features using any MSR in this range." Hyper-V already
defines many synthetic MSRs in this range, and I think it would be
reasonable for you to pick one for this to avoid a conflict?
KVM is not using any MSR in that range.
However, I think it would be better to have the MSR (and perhaps CPUID)
outside the hypervisor-reserved ranges, so that it becomes architecturally
defined. In some sense it is similar to the HYPERVISOR CPUID feature.
Yes, given that we want this to be hypervisor agnostic.
Actually, that MSR address range has been reserved for that purpose, along with:
- CPUID.EAX=1 -> ECX bit 31 (always returns 0 on bare metal)
- CPUID.EAX=4000_00xxH leaves (i.e. HYPERVISOR CPUID)
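
As a rough illustration of the first item, a guest could test the hypervisor-present flag like this. It is only a sketch: the caller is assumed to have already executed CPUID with EAX=1 and to pass in the resulting ECX, and the names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID.EAX=1 reports the hypervisor-present flag in ECX bit 31. */
#define CPUID_1_ECX_HYPERVISOR (UINT32_C(1) << 31)

/* First leaf of the hypervisor CPUID range (CPUID.EAX=4000_00xxH). */
#define HYPERVISOR_CPUID_BASE UINT32_C(0x40000000)

static_assert(CPUID_1_ECX_HYPERVISOR == UINT32_C(0x80000000),
              "bit 31 of a 32-bit register");

/* cpuid_1_ecx is the ECX value returned by CPUID with EAX=1. */
static bool hypervisor_present(uint32_t cpuid_1_ecx)
{
    return (cpuid_1_ecx & CPUID_1_ECX_HYPERVISOR) != 0;
}
```

Only when this returns true is it meaningful to go on and query leaves starting at HYPERVISOR_CPUID_BASE.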
--
Jun
Intel Open Source Technology Center
Andy Lutomirski
2014-09-18 18:35:39 UTC
Post by Nakajima, Jun
Post by KY Srinivasan
-----Original Message-----
From: Paolo Bonzini [mailto:paolo.bonzini at gmail.com] On Behalf Of Paolo
Bonzini
Sent: Thursday, September 18, 2014 10:18 AM
To: Nakajima, Jun; KY Srinivasan
Cc: Mathew John; Theodore Ts'o; John Starks; kvm list; Gleb Natapov; Niels
Ferguson; Andy Lutomirski; David Hepkin; H. Peter Anvin; Jake Oshins; Linux
Virtualization
Subject: Re: Standardizing an MSR or other hypercall to get an RNG seed?
Post by Nakajima, Jun
In terms of the address for the MSR, I suggest that you choose one
from the range between 40000000H - 400000FFH. The SDM (35.1
ARCHITECTURAL MSRS) says "All existing and future processors will not
implement any features using any MSR in this range." Hyper-V already
defines many synthetic MSRs in this range, and I think it would be
reasonable for you to pick one for this to avoid a conflict?
KVM is not using any MSR in that range.
However, I think it would be better to have the MSR (and perhaps CPUID)
outside the hypervisor-reserved ranges, so that it becomes architecturally
defined. In some sense it is similar to the HYPERVISOR CPUID feature.
Yes, given that we want this to be hypervisor agnostic.
- CPUID.EAX=1 -> ECX bit 31 (always returns 0 on bare metal)
- CPUID.EAX=4000_00xxH leaves (i.e. HYPERVISOR CPUID)
I don't know whether this is documented anywhere, but Linux tries to
detect a hypervisor by searching CPUID leaves 0x400xyz00 for
"KVMKVMKVM\0\0\0", so at least Linux can handle the KVM leaves being
in a somewhat variable location.

Do we consider this mechanism to work across all hypervisors and
guests? That is, could we put something like "CrossHVPara\0"
somewhere in that range, where each hypervisor would be free to decide
exactly where it ends up?
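
To make the question concrete, here is a sketch of such a signature scan. It assumes the "0x400xyz00" notation means bases 0x40000000 through 0x400fff00 in steps of 0x100, and that the 12-byte signature comes back in EBX, ECX, EDX. The cpuid instruction is replaced by a callback so the example can run anywhere; find_hypervisor_base and fake_cpuid are illustrative names, not existing kernel functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct cpuid_regs { uint32_t eax, ebx, ecx, edx; };

/* Callback standing in for the cpuid instruction. */
typedef struct cpuid_regs (*cpuid_fn)(uint32_t leaf);

#define SCAN_FIRST  UINT32_C(0x40000000)
#define SCAN_LAST   UINT32_C(0x400fff00)   /* assumed upper bound */
#define SCAN_STRIDE UINT32_C(0x100)

/* Return the first base leaf whose EBX:ECX:EDX match sig, or 0 if none. */
static uint32_t find_hypervisor_base(cpuid_fn cpuid, const char sig[12])
{
    uint32_t base;

    assert(cpuid != NULL && sig != NULL);
    for (base = SCAN_FIRST; base <= SCAN_LAST; base += SCAN_STRIDE) {
        struct cpuid_regs r = cpuid(base);
        uint32_t found[3] = { r.ebx, r.ecx, r.edx };

        if (memcmp(found, sig, 12) == 0)
            return base;
    }
    return 0;
}

/* Fake CPUID for demonstration: a KVM signature at leaf 0x40000100. */
static struct cpuid_regs fake_cpuid(uint32_t leaf)
{
    struct cpuid_regs r = { 0, 0, 0, 0 };
    uint32_t words[3];

    if (leaf == UINT32_C(0x40000100)) {
        memcpy(words, "KVMKVMKVM\0\0\0", 12);
        r.ebx = words[0];
        r.ecx = words[1];
        r.edx = words[2];
    }
    return r;
}
```

A cross-hypervisor signature would be looked up the same way, with only the 12-byte string changed.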

--Andy
H. Peter Anvin
2014-09-18 18:39:11 UTC
Quite frankly it might make more sense to define a cross-VM *cpuid* range. The cpuid leaf can just point to the MSR. The big question is who will be willing to be the registrar.
On Thu, Sep 18, 2014 at 10:42 AM, Nakajima, Jun
On Thu, Sep 18, 2014 at 10:20 AM, KY Srinivasan <kys at microsoft.com>
Post by KY Srinivasan
-----Original Message-----
From: Paolo Bonzini [mailto:paolo.bonzini at gmail.com] On Behalf Of Paolo
Bonzini
Sent: Thursday, September 18, 2014 10:18 AM
To: Nakajima, Jun; KY Srinivasan
Cc: Mathew John; Theodore Ts'o; John Starks; kvm list; Gleb Natapov; Niels
Ferguson; Andy Lutomirski; David Hepkin; H. Peter Anvin; Jake Oshins; Linux
Virtualization
Subject: Re: Standardizing an MSR or other hypercall to get an RNG seed?
Post by Nakajima, Jun
In terms of the address for the MSR, I suggest that you choose one
from the range between 40000000H - 400000FFH. The SDM (35.1
ARCHITECTURAL MSRS) says "All existing and future processors will not
implement any features using any MSR in this range." Hyper-V already
defines many synthetic MSRs in this range, and I think it would be
reasonable for you to pick one for this to avoid a conflict?
KVM is not using any MSR in that range.
However, I think it would be better to have the MSR (and perhaps CPUID)
outside the hypervisor-reserved ranges, so that it becomes architecturally
defined. In some sense it is similar to the HYPERVISOR CPUID feature.
Yes, given that we want this to be hypervisor agnostic.
Actually, that MSR address range has been reserved for that purpose,
- CPUID.EAX=1 -> ECX bit 31 (always returns 0 on bare metal)
- CPUID.EAX=4000_00xxH leaves (i.e. HYPERVISOR CPUID)
I don't know whether this is documented anywhere, but Linux tries to
detect a hypervisor by searching CPUID leaves 0x400xyz00 for
"KVMKVMKVM\0\0\0", so at least Linux can handle the KVM leaves being
in a somewhat variable location.
Do we consider this mechanism to work across all hypervisors and
guests? That is, could we put something like "CrossHVPara\0"
somewhere in that range, where each hypervisor would be free to decide
exactly where it ends up?
--Andy
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
Andy Lutomirski
2014-09-18 19:03:42 UTC
Post by Niels Ferguson
Defining a standard way of transferring random numbers between the host and the guest is an excellent idea.
It should be possible to detect this feature through CPUID or similar mechanism. That allows the code that uses this feature to be written without needing the ability to catch CPU exceptions. I could be wrong, but as far as I know there is no support for exception handling in the Windows OS loader where we gather our initial random state.
Linux is like this, too, except that I have experimental code to
create an IDT in that code, so we can handle it. I agree, though,
that using CPUID in early boot is easier.
Post by Niels Ferguson
Is there a way we can transfer more bytes per interaction? With a single 64-bit MSR we always need multiple reads to get a seed, and each of them results in a context switch to the host, which is expensive. This is even worse for 32-bit guests. Windows would typically need to fetch 64 bytes of random data at boot and at regular intervals. It is not a show-stopper, but better efficiency would be nice.
I thought about this for a while and didn't come up with anything that
wouldn't be messy. We could fudge the MSR rax/rdx high bits to get 128
bits, but that's nonportable and awful to implement. We could return
a random number directly from CPUID, but that's weird.

In very informal benchmarking, rdmsr wasn't that bad. On the other
hand, I wasn't immediately planning on using the msr on an ongoing
basis on Linux guests except after suspend/resume.
Post by Niels Ferguson
Can we also define a way to have random values flow from the guest to the host? Guests are also gathering entropy from their own sources, and if we allow the guests to send random data to the host, then the host can treat it as an entropy source and all the VMs on a single host can share their entropy. (This is not a security problem; any reasonable host RNG cannot be hurt even by maliciously chosen entropy inputs.)
wrmsr on the same MSR?
Post by Niels Ferguson
I don't know much about how hypervisors work on the inside, but maybe we can define a mechanism for standardized hypervisor calls that work on all hypervisors that support this feature. Then we could define a function to do an entropy exchange: the guest provides N bytes of random data to the host, and the host replies with N bytes of random data. The data exchange can now be done through memory.
A standardized hypervisor-call mechanism also seems generally useful for future features, whereas the MSR solution is very limited in what it can do. We might end up with standardized hypervisor-calls in the future for some other reason, and then the MSR solution looks very odd.
I think there'll be resistance to a standardized hypercall mechanism,
just because the implementations tend to be complex. Hyper-V uses a
special page in guest physical memory that contains a trampoline.

We could use wrmsr to a register where the payload is a pointer to a
buffer to receive random bytes, but that loses some of the simplicity
of just calling rdmsr a few times.

--Andy
Paolo Bonzini
2014-09-19 06:04:44 UTC
Post by David Hepkin
The chief advantage I see to using a hypercall based mechanism is
that it would work across more architectures. MSR's and CPUID's are
specific to X86. If we ever wanted this same mechanism to be
available on an architecture that doesn't support MSR's, a hypercall
based approach would allow for a more consistent mechanism across the
architectures.
I agree, though, that converging on a common hypercall interface that
would be implemented by all of the hypervisors would likely be much
harder to achieve.
There are differences between architectures at the hypercall level,
starting with the calling convention. So I don't think it makes much
sense to use a hypercall.

Paolo
David Hepkin
2014-09-18 21:54:14 UTC
The chief advantage I see to using a hypercall based mechanism is that it would work across more architectures. MSR's and CPUID's are specific to X86. If we ever wanted this same mechanism to be available on an architecture that doesn't support MSR's, a hypercall based approach would allow for a more consistent mechanism across the architectures.

I agree, though, that converging on a common hypercall interface that would be implemented by all of the hypervisors would likely be much harder to achieve.

Thanks...

David

-----Original Message-----
From: Andy Lutomirski [mailto:luto at amacapital.net]
Sent: Thursday, September 18, 2014 12:04 PM
To: Niels Ferguson
Cc: H. Peter Anvin; Nakajima, Jun; KY Srinivasan; Paolo Bonzini; Mathew John; Theodore Ts'o; John Starks; kvm list; Gleb Natapov; David Hepkin; Jake Oshins; Linux Virtualization
Subject: Re: Standardizing an MSR or other hypercall to get an RNG seed?
Defining a standard way of transferring random numbers between the host and the guest is an excellent idea.
It should be possible to detect this feature through CPUID or similar mechanism. That allows the code that uses this feature to be written without needing the ability to catch CPU exceptions. I could be wrong, but as far as I know there is no support for exception handling in the Windows OS loader where we gather our initial random state.
Linux is like this, too, except that I have experimental code to create an IDT in that code, so we can handle it. I agree, though, that using CPUID in early boot is easier.
Is there a way we can transfer more bytes per interaction? With a single 64-bit MSR we always need multiple reads to get a seed, and each of them results in a context switch to the host, which is expensive. This is even worse for 32-bit guests. Windows would typically need to fetch 64 bytes of random data at boot and at regular intervals. It is not a show-stopper, but better efficiency would be nice.
I thought about this for a while and didn't come up with anything that wouldn't be messy. We could fudge the MSR rax/rdx high bits to get 128 bits, but that's nonportable and awful to implement. We could return a random number directly from CPUID, but that's weird.

In very informal benchmarking, rdmsr wasn't that bad. On the other hand, I wasn't immediately planning on using the msr on an ongoing basis on Linux guests except after suspend/resume.
Can we also define a way to have random values flow from the guest to
the host? Guests are also gathering entropy from their own sources,
and if we allow the guests to send random data to the host, then the
host can treat it as an entropy source and all the VMs on a single
host can share their entropy. (This is not a security problem; any
reasonable host RNG cannot be hurt even by maliciously chosen entropy
inputs.)
wrmsr on the same MSR?
I don't know much about how hypervisors work on the inside, but maybe we can define a mechanism for standardized hypervisor calls that work on all hypervisors that support this feature. Then we could define a function to do an entropy exchange: the guest provides N bytes of random data to the host, and the host replies with N bytes of random data. The data exchange can now be done through memory.
A standardized hypervisor-call mechanism also seems generally useful for future features, whereas the MSR solution is very limited in what it can do. We might end up with standardized hypervisor-calls in the future for some other reason, and then the MSR solution looks very odd.
I think there'll be resistance to a standardized hypercall mechanism, just because the implementations tend to be complex. Hyper-V uses a special page in guest physical memory that contains a trampoline.

We could use wrmsr to a register where the payload is a pointer to a buffer to receive random bytes, but that loses some of the simplicity of just calling rdmsr a few times.

--Andy
Niels Ferguson
2014-09-18 18:54:22 UTC
Defining a standard way of transferring random numbers between the host and the guest is an excellent idea.

As the person who writes the RNG code in Windows, I have a few comments:

DETECTION:
It should be possible to detect this feature through CPUID or similar mechanism. That allows the code that uses this feature to be written without needing the ability to catch CPU exceptions. I could be wrong, but as far as I know there is no support for exception handling in the Windows OS loader where we gather our initial random state.

EFFICIENCY:
Is there a way we can transfer more bytes per interaction? With a single 64-bit MSR we always need multiple reads to get a seed, and each of them results in a context switch to the host, which is expensive. This is even worse for 32-bit guests. Windows would typically need to fetch 64 bytes of random data at boot and at regular intervals. It is not a show-stopper, but better efficiency would be nice.
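
To put a number on this, a small sketch of the read count, assuming each rdmsr delivers 8 bytes and each read costs one exit to the host. reads_needed is an illustrative helper, not part of any proposed interface.

```c
#include <assert.h>
#include <stddef.h>

/* Number of reads needed to collect seed_bytes at bytes_per_read each,
 * rounding up so that a partial final read is still counted. */
static size_t reads_needed(size_t seed_bytes, size_t bytes_per_read)
{
    assert(bytes_per_read > 0);
    return (seed_bytes + bytes_per_read - 1) / bytes_per_read;
}
```

For the 64-byte seed mentioned above, reads_needed(64, 8) gives 8 reads; any mechanism that moved more bytes per interaction would reduce that count proportionally.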

GUEST-TO-HOST:
Can we also define a way to have random values flow from the guest to the host? Guests are also gathering entropy from their own sources, and if we allow the guests to send random data to the host, then the host can treat it as an entropy source and all the VMs on a single host can share their entropy. (This is not a security problem; any reasonable host RNG cannot be hurt even by maliciously chosen entropy inputs.)


I don't know much about how hypervisors work on the inside, but maybe we can define a mechanism for standardized hypervisor calls that work on all hypervisors that support this feature. Then we could define a function to do an entropy exchange: the guest provides N bytes of random data to the host, and the host replies with N bytes of random data. The data exchange can now be done through memory.

A standardized hypervisor-call mechanism also seems generally useful for future features, whereas the MSR solution is very limited in what it can do. We might end up with standardized hypervisor-calls in the future for some other reason, and then the MSR solution looks very odd.

Niels


-----Original Message-----
From: H. Peter Anvin [mailto:hpa at zytor.com]
Sent: Thursday, September 18, 2014 11:39 AM
To: Andy Lutomirski; Nakajima, Jun
Cc: KY Srinivasan; Paolo Bonzini; Mathew John; Theodore Ts'o; John Starks; kvm list; Gleb Natapov; Niels Ferguson; David Hepkin; Jake Oshins; Linux Virtualization
Subject: Re: Standardizing an MSR or other hypercall to get an RNG seed?

Quite frankly it might make more sense to define a cross-VM *cpuid* range. The cpuid leaf can just point to the MSR. The big question is who will be willing to be the registrar.
On Thu, Sep 18, 2014 at 10:42 AM, Nakajima, Jun
On Thu, Sep 18, 2014 at 10:20 AM, KY Srinivasan <kys at microsoft.com>
Post by KY Srinivasan
-----Original Message-----
From: Paolo Bonzini [mailto:paolo.bonzini at gmail.com] On Behalf Of Paolo
Bonzini
Sent: Thursday, September 18, 2014 10:18 AM
To: Nakajima, Jun; KY Srinivasan
Cc: Mathew John; Theodore Ts'o; John Starks; kvm list; Gleb Natapov; Niels
Ferguson; Andy Lutomirski; David Hepkin; H. Peter Anvin; Jake Oshins; Linux
Virtualization
Subject: Re: Standardizing an MSR or other hypercall to get an RNG seed?
Post by Nakajima, Jun
In terms of the address for the MSR, I suggest that you choose one
from the range between 40000000H - 400000FFH. The SDM (35.1
ARCHITECTURAL MSRS) says "All existing and future processors will not
implement any features using any MSR in this range." Hyper-V already
defines many synthetic MSRs in this range, and I think it would be
reasonable for you to pick one for this to avoid a conflict?
KVM is not using any MSR in that range.
However, I think it would be better to have the MSR (and perhaps CPUID)
outside the hypervisor-reserved ranges, so that it becomes architecturally
defined. In some sense it is similar to the HYPERVISOR CPUID feature.
Yes, given that we want this to be hypervisor agnostic.
Actually, that MSR address range has been reserved for that purpose,
- CPUID.EAX=1 -> ECX bit 31 (always returns 0 on bare metal)
- CPUID.EAX=4000_00xxH leaves (i.e. HYPERVISOR CPUID)
I don't know whether this is documented anywhere, but Linux tries to
detect a hypervisor by searching CPUID leaves 0x400xyz00 for
"KVMKVMKVM\0\0\0", so at least Linux can handle the KVM leaves being in
a somewhat variable location.
Do we consider this mechanism to work across all hypervisors and
guests? That is, could we put something like "CrossHVPara\0"
somewhere in that range, where each hypervisor would be free to decide
exactly where it ends up?
--Andy
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
Paolo Bonzini
2014-09-18 18:58:35 UTC
Post by Andy Lutomirski
Post by Nakajima, Jun
- CPUID.EAX=1 -> ECX bit 31 (always returns 0 on bare metal)
- CPUID.EAX=4000_00xxH leaves (i.e. HYPERVISOR CPUID)
I don't know whether this is documented anywhere, but Linux tries to
detect a hypervisor by searching CPUID leaves 0x400xyz00 for
"KVMKVMKVM\0\0\0", so at least Linux can handle the KVM leaves being
in a somewhat variable location.
Do we consider this mechanism to work across all hypervisors and
guests? That is, could we put something like "CrossHVPara\0"
somewhere in that range, where each hypervisor would be free to decide
exactly where it ends up?
That's also possible, but extending the hypervisor CPUID range
beyond 400000FFH is not officially sanctioned by Intel.

Xen started doing that in order to expose both Hyper-V and Xen
CPUID leaves, and KVM followed the practice.

Paolo
Andy Lutomirski
2014-09-18 19:07:14 UTC
Post by Paolo Bonzini
Post by Andy Lutomirski
Post by Nakajima, Jun
- CPUID.EAX=1 -> ECX bit 31 (always returns 0 on bare metal)
- CPUID.EAX=4000_00xxH leaves (i.e. HYPERVISOR CPUID)
I don't know whether this is documented anywhere, but Linux tries to
detect a hypervisor by searching CPUID leaves 0x400xyz00 for
"KVMKVMKVM\0\0\0", so at least Linux can handle the KVM leaves being
in a somewhat variable location.
Do we consider this mechanism to work across all hypervisors and
guests? That is, could we put something like "CrossHVPara\0"
somewhere in that range, where each hypervisor would be free to decide
exactly where it ends up?
That's also possible, but extending the hypervisor CPUID range
beyond 400000FFH is not officially sanctioned by Intel.
Xen started doing that in order to expose both Hyper-V and Xen
CPUID leaves, and KVM followed the practice.
Whoops.

Might Intel be willing to extend that range to 0x40000000 -
0x400fffff? And would Microsoft be okay with using this mechanism for
discovery?

Do we have anyone from VMware in this thread? I don't have any VMware contacts.

--Andy
Nakajima, Jun
2014-09-18 21:21:39 UTC
Permalink
Post by Andy Lutomirski
Might Intel be willing to extend that range to 0x40000000 -
0x400fffff? And would Microsoft be okay with using this mechanism for
discovery?
So, for CPUID, the SDM (Table 3-17. Information Returned by CPUID) says today:
"No existing or future CPU will return processor identification or
feature information if the initial EAX value is in the range 40000000H
to 4FFFFFFFH."

We can define a cross-VM CPUID range from there. The CPUID can return
the index of the MSR if needed.
--
Jun
Intel Open Source Technology Center
Andy Lutomirski
2014-09-18 21:35:28 UTC
Permalink
Post by Nakajima, Jun
Post by Andy Lutomirski
Might Intel be willing to extend that range to 0x40000000 -
0x400fffff? And would Microsoft be okay with using this mechanism for
discovery?
"No existing or future CPU will return processor identification or
feature information if the initial EAX value is in the range 40000000H
to 4FFFFFFFH."
We can define a cross-VM CPUID range from there. The CPUID can return
the index of the MSR if needed.
Right, sorry. I was looking at this sentence in SDM Volume 3 Section 35.1:

MSR address range between 40000000H - 400000FFH is marked as a
specially reserved range. All existing and
future processors will not implement any features using any MSR in this range.

That's not really a large enough range for us to reserve an MSR for
this. However, KVM is already using MSRs outside that range: it uses
0x4b564d00-0x4b564d04 or so. I wonder whether KVM got confused by the
differing ranges for cpuid leaves and MSR indices.
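
To make the mismatch concrete, here is a minimal C sketch that checks an MSR index against the range quoted above. The helper names are made up for illustration; the only facts used are the SDM range (40000000H - 400000FFH) and the KVM MSR base 0x4b564d00 mentioned in this message.

```c
#include <stdbool.h>
#include <stdint.h>

/* MSR range the SDM sets aside, as quoted above. */
#define SDM_RESERVED_MSR_FIRST 0x40000000u
#define SDM_RESERVED_MSR_LAST  0x400000FFu

/* Illustrative helper (not a kernel API): is this index inside the range? */
static bool msr_in_sdm_reserved_range(uint32_t msr)
{
    return msr >= SDM_RESERVED_MSR_FIRST && msr <= SDM_RESERVED_MSR_LAST;
}

/* Number of MSR indices the range provides in total. */
static uint32_t sdm_reserved_msr_count(void)
{
    return SDM_RESERVED_MSR_LAST - SDM_RESERVED_MSR_FIRST + 1u;
}
```

With these definitions, `msr_in_sdm_reserved_range(0x4b564d00u)` is false and `sdm_reserved_msr_count()` is 256, which is the whole budget that every hypervisor would have to share.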

Any chance that Intel could reserve a larger range to include the KVM
MSRs? It would also be easier if the MSR indices for cross-HV
features were constants.

Thanks,
Andy
H. Peter Anvin
2014-09-18 21:57:39 UTC
Permalink
I'm not sure what you mean by "this mechanism?" Are you suggesting that each hypervisor put "CrossHVPara\0" somewhere in the 0x40000000 - 0x400fffff CPUID range, and an OS has to do a full scan of this CPUID range on boot to find it? That seems pretty inefficient. An OS will take 1000's of hypervisor intercepts on every boot just to search this CPUID range.
I suggest we come to consensus on a specific CPUID leaf where an OS needs to look to determine if a hypervisor supports this capability. We could define a new CPUID leaf range at a well-defined location, or we could just use one of the existing CPUID leaf ranges implemented by an existing hypervisor. I'm not familiar with the KVM CPUID leaf range, but in the case of Hyper-V, the Hyper-V CPUID leaf range was architected to allow for other hypervisors to implement it and just show through specific capabilities supported by the hypervisor. So, we could define a bit in the Hyper-V CPUID leaf range (since Xen and KVM also implement this range), but that would require Linux to look in that range on boot to discover this capability.
Yes, I would agree that if anything we should define a new range unique
to this cross-VM interface, e.g. 0x48000000.

-hpa
Andy Lutomirski
2014-09-18 22:07:45 UTC
Permalink
Post by H. Peter Anvin
I'm not sure what you mean by "this mechanism?" Are you suggesting that each hypervisor put "CrossHVPara\0" somewhere in the 0x40000000 - 0x400fffff CPUID range, and an OS has to do a full scan of this CPUID range on boot to find it? That seems pretty inefficient. An OS will take 1000's of hypervisor intercepts on every boot just to search this CPUID range.
I suggest we come to consensus on a specific CPUID leaf where an OS needs to look to determine if a hypervisor supports this capability. We could define a new CPUID leaf range at a well-defined location, or we could just use one of the existing CPUID leaf ranges implemented by an existing hypervisor. I'm not familiar with the KVM CPUID leaf range, but in the case of Hyper-V, the Hyper-V CPUID leaf range was architected to allow for other hypervisors to implement it and just show through specific capabilities supported by the hypervisor. So, we could define a bit in the Hyper-V CPUID leaf range (since Xen and KVM also implement this range), but that would require Linux to look in that range on boot to discover this capability.
Yes, I would agree that if anything we should define a new range unique
to this cross-VM interface, e.g. 0x48000000.
So, as a concrete straw-man:

CPUID leaf 0x48000000 would return a maximum leaf number in EAX (e.g.
0x48000001) along with a signature value (e.g. "CrossHVPara\0") in
EBX, ECX, and EDX.

CPUID 0x48000001.EAX would contain an MSR number to read to get a
random number if supported and zero if not supported.
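
The guest side of this straw-man could look roughly like the C sketch below. Everything in it is hypothetical: the leaf numbers and signature come from the proposal above, while the function names, the `cpuid_fn` callback, and the MSR index 0x48000010 returned by the simulated hypervisor are invented for illustration. The CPUID primitive is passed in as a callback so the logic can be exercised without a hypervisor that implements the leaf.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical values taken from the straw-man; nothing here is standardized. */
#define CROSSHV_BASE      0x48000000u
#define CROSSHV_LEAF_RNG  0x48000001u
#define CROSSHV_SIGNATURE "CrossHVPara\0" /* 12 bytes across EBX, ECX, EDX */

struct cpuid_regs { uint32_t eax, ebx, ecx, edx; };

/* Caller-supplied CPUID primitive (real instruction or a simulation). */
typedef struct cpuid_regs (*cpuid_fn)(uint32_t leaf);

/* Returns the MSR index advertised for the RNG seed, or 0 if unavailable. */
static uint32_t crosshv_rng_msr(cpuid_fn cpuid)
{
    struct cpuid_regs r = cpuid(CROSSHV_BASE);
    char sig[12];

    memcpy(sig + 0, &r.ebx, 4);
    memcpy(sig + 4, &r.ecx, 4);
    memcpy(sig + 8, &r.edx, 4);

    if (memcmp(sig, CROSSHV_SIGNATURE, 12) != 0)
        return 0;                       /* no cross-HV leaf here */
    if (r.eax < CROSSHV_LEAF_RNG)
        return 0;                       /* leaf 0x48000001 not implemented */

    return cpuid(CROSSHV_LEAF_RNG).eax; /* zero means "not supported" */
}

/* Simulated hypervisor for demonstration; the MSR index is made up. */
static struct cpuid_regs demo_cpuid(uint32_t leaf)
{
    struct cpuid_regs r = {0, 0, 0, 0};

    if (leaf == CROSSHV_BASE) {
        r.eax = CROSSHV_LEAF_RNG;
        memcpy(&r.ebx, "Cros", 4);
        memcpy(&r.ecx, "sHVP", 4);
        memcpy(&r.edx, "ara\0", 4);
    } else if (leaf == CROSSHV_LEAF_RNG) {
        r.eax = 0x48000010u;
    }
    return r;
}

/* Simulated environment in which the leaves return all zeros. */
static struct cpuid_regs empty_cpuid(uint32_t leaf)
{
    struct cpuid_regs r = {0, 0, 0, 0};

    (void)leaf;
    return r;
}
```

Calling `crosshv_rng_msr(demo_cpuid)` yields the made-up index 0x48000010, and `crosshv_rng_msr(empty_cpuid)` yields 0, in which case the guest would fall back to whatever entropy it already has.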

Questions:

1. Can we use a fixed MSR number? This would be a little bit simpler,
but it would depend on getting a wider MSR range from Intel.

2. Who would host and maintain such a spec? I could do it on github,
but this seems a bit silly. Other options would include Intel,
Microsoft, or perhaps the Linux Foundation. I don't know whether
Intel or LF would want to do this, and MS isn't exactly
vendor-neutral. (Even L-F isn't entirely neutral, since they sort of
represent two hypervisors.) Or we could do something temporary and
then try to work with a group like OASIS, but that might end up being
a lot of work.

--Andy
Nakajima, Jun
2014-09-19 00:49:44 UTC
Permalink
Post by Andy Lutomirski
CPUID leaf 0x48000000 would return a maximum leaf number in EAX (e.g.
0x48000001) along with a signature value (e.g. "CrossHVPara\0") in
EBX, ECX, and EDX.
CPUID 0x48000001.EAX would contain an MSR number to read to get a
random number if supported and zero if not supported.
1. Can we use a fixed MSR number? This would be a little bit simpler,
but it would depend on getting a wider MSR range from Intel.
Why do you need a wider MSR range if you always detect the feature by
CPUID.0x48000001?
Or are you still trying to avoid the detection by CPUID?
--
Jun
Intel Open Source Technology Center
Andy Lutomirski
2014-09-19 01:03:46 UTC
Permalink
Post by Nakajima, Jun
Post by Andy Lutomirski
CPUID leaf 0x48000000 would return a maximum leaf number in EAX (e.g.
0x48000001) along with a signature value (e.g. "CrossHVPara\0") in
EBX, ECX, and EDX.
CPUID 0x48000001.EAX would contain an MSR number to read to get a
random number if supported and zero if not supported.
1. Can we use a fixed MSR number? This would be a little bit simpler,
but it would depend on getting a wider MSR range from Intel.
Why do you need a wider MSR range if you always detect the feature by
CPUID.0x48000001?
Or are you still trying to avoid the detection by CPUID?
Detecting the feature is one thing, but figuring out the MSR index is
another. We could shove the index into the cpuid leaf, but that seems
unnecessarily indirect. I'd much rather just say that CPUID leaves
*and* MSR indexes 0x48000000-0x4800ffff or so are reserved for the
cross-HV mechanism, but we can't do that without either knowingly
violating the SDM assignments or asking Intel to consider allocating
more MSR indexes.

Also, KVM is already conflicting with the SDM right now in its MSR
choice :( I *think* that KVM could be changed to fix that, but 256
MSRs is rather confining given that KVM currently implements its own
MSR index *and* part of the Hyper-V index.

--Andy
Andy Lutomirski
2014-09-19 01:28:08 UTC
Permalink
Post by Andy Lutomirski
Post by Nakajima, Jun
Post by Andy Lutomirski
CPUID leaf 0x48000000 would return a maximum leaf number in EAX (e.g.
0x48000001) along with a signature value (e.g. "CrossHVPara\0") in
EBX, ECX, and EDX.
CPUID 0x48000001.EAX would contain an MSR number to read to get a
random number if supported and zero if not supported.
1. Can we use a fixed MSR number? This would be a little bit simpler,
but it would depend on getting a wider MSR range from Intel.
Why do you need a wider MSR range if you always detect the feature by
CPUID.0x48000001?
Or are you still trying to avoid the detection by CPUID?
Detecting the feature is one thing, but figuring out the MSR index is
another. We could shove the index into the cpuid leaf, but that seems
unnecessarily indirect. I'd much rather just say that CPUID leaves
*and* MSR indexes 0x48000000-0x4800ffff or so are reserved for the
cross-HV mechanism, but we can't do that without either knowingly
violating the SDM assignments or asking Intel to consider allocating
more MSR indexes.
Also, KVM is already conflicting with the SDM right now in its MSR
choice :( I *think* that KVM could be changed to fix that, but 256
MSRs is rather confining given that KVM currently implements its own
MSR index *and* part of the Hyper-V index.
Correction and update:

KVM currently implements its own MSRs and, optionally, some of the
Hyper-V MSRs. By my count, Linux knows about 68 Hyper-V MSRs (in a
header file), and there are currently 7 KVM MSRs, so over 1/4 of the
available MSR indices are taken (and even more would be taken if KVM
were to move its MSRs into the correct range).

--Andy
Nakajima, Jun
2014-09-19 16:14:56 UTC
Permalink
Post by Andy Lutomirski
Post by Andy Lutomirski
Post by Nakajima, Jun
Post by Andy Lutomirski
CPUID leaf 0x48000000 would return a maximum leaf number in EAX (e.g.
0x48000001) along with a signature value (e.g. "CrossHVPara\0") in
EBX, ECX, and EDX.
CPUID 0x48000001.EAX would contain an MSR number to read to get a
random number if supported and zero if not supported.
1. Can we use a fixed MSR number? This would be a little bit simpler,
but it would depend on getting a wider MSR range from Intel.
Why do you need a wider MSR range if you always detect the feature by
CPUID.0x48000001?
Or are you still trying to avoid the detection by CPUID?
Detecting the feature is one thing, but figuring out the MSR index is
another. We could shove the index into the cpuid leaf, but that seems
unnecessarily indirect. I'd much rather just say that CPUID leaves
*and* MSR indexes 0x48000000-0x4800ffff or so are reserved for the
cross-HV mechanism, but we can't do that without either knowingly
violating the SDM assignments or asking Intel to consider allocating
more MSR indexes.
Also, KVM is already conflicting with the SDM right now in its MSR
choice :( I *think* that KVM could be changed to fix that, but 256
MSRs is rather confining given that KVM currently implements its own
MSR index *and* part of the Hyper-V index.
KVM currently implements its own MSRs and, optionally, some of the
Hyper-V MSRs. By my count, Linux knows about 68 Hyper-V MSRs (in a
header file), and there are currently 7 KVM MSRs, so over 1/4 of the
available MSR indices are taken (and even more would be taken if KVM
were to move its MSRs into the correct range).
I slept on it, and I think using the CPUID instruction alone would be
simple and efficient:
- We have a huge space for CPUID leaves
- CPUID also works for user-level
- It can take an additional 32-bit parameter (ECX), and returns 4
32-bit values (EAX, EBX, ECX, and EDX). RDMSR, for example, returns a
64-bit value.

Basically we can use it to implement a hypercall (rather than VMCALL).

For example,
- CPUID 0x48000001.EAX would return the feature presence (e.g. in
EBX), and the result in EDX:EAX (if present) at the same time, or
- CPUID 0x48000001.EAX would return the feature presence only, and
CPUID 0x48000002.EAX (acts like a hypercall) returns up to 4 32-bit
values.
--
Jun
Intel Open Source Technology Center
Paolo Bonzini
2014-09-19 16:22:24 UTC
Permalink
Post by Nakajima, Jun
For example,
- CPUID 0x48000001.EAX would return the feature presence (e.g. in
EBX), and the result in EDX:EAX (if present) at the same time, or
- CPUID 0x48000001.EAX would return the feature presence only, and
CPUID 0x48000002.EAX (acts like a hypercall) returns up to 4 32-bit
values.
The latter is much better, because an "unknown" CPUID will return the
value of the highest leaf below 0x80000000, and conflicts can happen easily.
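
A hedged C sketch of why the two-leaf variant is easier to use safely: the guest can validate a base leaf before it trusts anything that a "hypercall" leaf returns. All leaf numbers, the feature bit, and the function names are hypothetical, and the signature check borrows the 0x48000000 base leaf from Andy's straw-man rather than anything agreed in this thread. The `stale_cpuid` function simulates a CPU that answers every unknown leaf with the same unrelated data.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* All leaf numbers and the feature bit are hypothetical. */
#define CROSSHV_BASE      0x48000000u
#define CROSSHV_LEAF_FEAT 0x48000001u
#define CROSSHV_LEAF_SEED 0x48000002u
#define CROSSHV_FEAT_SEED (1u << 0)

struct cpuid_regs { uint32_t eax, ebx, ecx, edx; };
typedef struct cpuid_regs (*cpuid_fn)(uint32_t leaf);

/* Fills seed[0..3] and returns true only if every check passes. */
static bool crosshv_cpuid_seed(cpuid_fn cpuid, uint32_t seed[4])
{
    struct cpuid_regs r = cpuid(CROSSHV_BASE);
    uint32_t sig[3] = { r.ebx, r.ecx, r.edx };

    /* An unimplemented leaf may echo unrelated data, so require the
     * signature and a plausible maximum leaf before trusting anything. */
    if (memcmp(sig, "CrossHVPara\0", 12) != 0)
        return false;
    if (r.eax < CROSSHV_LEAF_SEED || r.eax > CROSSHV_BASE + 0xFFFFu)
        return false;

    if (!(cpuid(CROSSHV_LEAF_FEAT).eax & CROSSHV_FEAT_SEED))
        return false;

    r = cpuid(CROSSHV_LEAF_SEED);
    seed[0] = r.eax;
    seed[1] = r.ebx;
    seed[2] = r.ecx;
    seed[3] = r.edx;
    return true;
}

/* Simulated cooperating hypervisor with a fixed "seed", for demonstration. */
static struct cpuid_regs demo_hv_cpuid(uint32_t leaf)
{
    struct cpuid_regs r = {0, 0, 0, 0};

    switch (leaf) {
    case CROSSHV_BASE:
        r.eax = CROSSHV_LEAF_SEED;
        memcpy(&r.ebx, "Cros", 4);
        memcpy(&r.ecx, "sHVP", 4);
        memcpy(&r.edx, "ara\0", 4);
        break;
    case CROSSHV_LEAF_FEAT:
        r.eax = CROSSHV_FEAT_SEED;
        break;
    case CROSSHV_LEAF_SEED:
        r.eax = 0x11111111u;
        r.ebx = 0x22222222u;
        r.ecx = 0x33333333u;
        r.edx = 0x44444444u;
        break;
    }
    return r;
}

/* Simulated CPU that answers every unknown leaf with the same stale data. */
static struct cpuid_regs stale_cpuid(uint32_t leaf)
{
    struct cpuid_regs r = {0x00000007u, 0x12345678u, 0x9abcdef0u, 0x0u};

    (void)leaf;
    return r;
}
```

With `demo_hv_cpuid` the function succeeds and fills the four words; with `stale_cpuid` it refuses, because the stale data does not carry the signature. A single combined leaf would give the guest no such chance to reject junk before consuming it as randomness.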

Paolo
H. Peter Anvin
2014-09-19 16:40:42 UTC
Permalink
Post by Nakajima, Jun
I slept on it, and I think using the CPUID instruction alone would be
- We have a huge space for CPUID leaves
- CPUID also works for user-level
- It can take an additional 32-bit parameter (ECX), and returns 4
32-bit values (EAX, EBX, ECX, and EDX). RDMSR, for example, returns a
64-bit value.
Basically we can use it to implement a hypercall (rather than VMCALL).
For example,
- CPUID 0x48000001.EAX would return the feature presence (e.g. in
EBX), and the result in EDX:EAX (if present) at the same time, or
- CPUID 0x48000001.EAX would return the feature presence only, and
CPUID 0x48000002.EAX (acts like a hypercall) returns up to 4 32-bit
values.
There is a huge disadvantage to the fact that CPUID is a user space
instruction, though.

-hpa
Andy Lutomirski
2014-09-19 17:21:36 UTC
Permalink
Post by H. Peter Anvin
Post by Nakajima, Jun
I slept on it, and I think using the CPUID instruction alone would be
- We have a huge space for CPUID leaves
- CPUID also works for user-level
- It can take an additional 32-bit parameter (ECX), and returns 4
32-bit values (EAX, EBX, ECX, and EDX). RDMSR, for example, returns a
64-bit value.
Basically we can use it to implement a hypercall (rather than VMCALL).
For example,
- CPUID 0x48000001.EAX would return the feature presence (e.g. in
EBX), and the result in EDX:EAX (if present) at the same time, or
- CPUID 0x48000001.EAX would return the feature presence only, and
CPUID 0x48000002.EAX (acts like a hypercall) returns up to 4 32-bit
values.
There is a huge disadvantage to the fact that CPUID is a user space
instruction, though.
We can always make cpuid on the leaf in question return all zeros if CPL > 0.
Post by H. Peter Anvin
-hpa
H. Peter Anvin
2014-09-19 17:36:28 UTC
Permalink
Post by Andy Lutomirski
Post by H. Peter Anvin
There is a huge disadvantage to the fact that CPUID is a user space
instruction, though.
We can always make cpuid on the leaf in question return all zeros if CPL > 0.
Not sure that is better...

-hpa
Andy Lutomirski
2014-09-19 17:39:58 UTC
Permalink
Post by H. Peter Anvin
Post by Andy Lutomirski
Post by H. Peter Anvin
There is a huge disadvantage to the fact that CPUID is a user space
instruction, though.
We can always make cpuid on the leaf in question return all zeros if CPL > 0.
Not sure that is better...
It's better than #GP...

This is why I prefer rdmsr: the privilege semantics are already
appropriate. Also, I wouldn't be surprised if shoehorning
non-constant results into cpuid implementations might be awkward for
some hypervisors.

--Andy
Theodore Ts'o
2014-09-19 22:05:37 UTC
Permalink
Post by H. Peter Anvin
There is a huge disadvantage to the fact that CPUID is a user space
instruction, though.
But if the goal is to provide something like getrandom(2) direct from
the Host OS, it's not necessarily harmful to allow the Guest ring 3
code to be able to fetch randomness in that way. The hypervisor can
implement rate limiting to protect against the guest using this too
frequently, but this is something that you should be doing for guest
ring 0 code anyway, since from the POV of the hypervisor Guest ring 0
is not necessarily any more trusted than Guest ring 3.
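
As an illustration of the rate limiting mentioned above, a host could keep a small token bucket per guest. This is only a sketch under stated assumptions: the structure, its field names, and the policy of one token per fixed interval are hypothetical, and it assumes a monotonic clock (`now_ns` never goes backwards) and `tokens <= capacity` on entry.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-guest token bucket; names and numbers are illustrative. */
struct seed_bucket {
    uint64_t tokens;      /* requests that may be served right now */
    uint64_t capacity;    /* maximum burst */
    uint64_t interval_ns; /* one token is earned per interval; must be > 0 */
    uint64_t last_ns;     /* time up to which tokens have been credited */
};

/* Returns true if a seed request arriving at now_ns may be served. */
static bool seed_request_allowed(struct seed_bucket *b, uint64_t now_ns)
{
    uint64_t earned = (now_ns - b->last_ns) / b->interval_ns;

    if (earned > 0) {
        uint64_t room = b->capacity - b->tokens;

        b->tokens += earned < room ? earned : room;
        b->last_ns += earned * b->interval_ns; /* never exceeds now_ns */
    }
    if (b->tokens == 0)
        return false;
    b->tokens--;
    return true;
}
```

The same check applies whether the request comes from guest ring 0 or guest ring 3, which is the point being made: the host cannot tell a trustworthy caller from an untrustworthy one, so it limits both.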

- Ted
Andy Lutomirski
2014-09-19 22:06:55 UTC
Permalink
Post by Theodore Ts'o
Post by H. Peter Anvin
There is a huge disadvantage to the fact that CPUID is a user space
instruction, though.
But if the goal is to provide something like getrandom(2) direct from
the Host OS, it's not necessarily harmful to allow the Guest ring 3
code to be able to fetch randomness in that way. The hypervisor can
implement rate limiting to protect against the guest using this too
frequently, but this is something that you should be doing for guest
ring 0 code anyway, since from the POV of the hypervisor Guest ring 0
is not necessarily any more trusted than Guest ring 3.
On the other hand, the guest kernel might not want the guest ring 3 to
be able to get random numbers.

--Andy
Nakajima, Jun
2014-09-19 22:57:05 UTC
Permalink
Post by Andy Lutomirski
Post by Theodore Ts'o
Post by H. Peter Anvin
There is a huge disadvantage to the fact that CPUID is a user space
instruction, though.
But if the goal is to provide something like getrandom(2) direct from
the Host OS, it's not necessarily harmful to allow the Guest ring 3
code to be able to fetch randomness in that way. The hypervisor can
implement rate limiting to protect against the guest using this too
frequently, but this is something that you should be doing for guest
ring 0 code anyway, since from the POV of the hypervisor Guest ring 0
is not necessarily any more trusted than Guest ring 3.
On the other hand, the guest kernel might not want the guest ring 3 to
be able to get random numbers.
But the RDSEED instruction, for example, is available in user-level.
And I'm not sure that the kernel can do anything about that.
--
Jun
Intel Open Source Technology Center
Theodore Ts'o
2014-09-19 22:57:27 UTC
Permalink
Post by Andy Lutomirski
Post by Theodore Ts'o
Post by H. Peter Anvin
There is a huge disadvantage to the fact that CPUID is a user space
instruction, though.
But if the goal is to provide something like getrandom(2) direct from
the Host OS, it's not necessarily harmful to allow the Guest ring 3
code to be able to fetch randomness in that way. The hypervisor can
implement rate limiting to protect against the guest using this too
frequently, but this is something that you should be doing for guest
ring 0 code anyway, since from the POV of the hypervisor Guest ring 0
is not necessarily any more trusted than Guest ring 3.
On the other hand, the guest kernel might not want the guest ring 3 to
be able to get random numbers.
Um, why?

We're talking about using this to seed the RNG, and not something that
the guest kernel would be using continuously. So what's the problem
with letting guest ring 3 get random numbers from the host?

- Ted
Andy Lutomirski
2014-09-19 23:12:23 UTC
Permalink
H. Peter Anvin
2014-09-19 23:29:53 UTC
Permalink
Post by Andy Lutomirski
To force deterministic execution.
I incorrectly thought that the kernel could switch RDRAND on and off.
It turns out that a hypervisor can do this, but not the kernel. Also,
determinism is lost anyway because of TSX, which *also* can't be
turned on and off.
Actually, a much bigger reason is because it lets rogue guest *user
space*, even with a well-behaved guest OS, do something potentially
harmful to the host.

-hpa
Theodore Ts'o
2014-09-19 23:35:02 UTC
Permalink
Post by H. Peter Anvin
Actually, a much bigger reason is because it lets rogue guest *user
space*, even with a well-behaved guest OS, do something potentially
harmful to the host.
Right, but if the host kernel is dependent on the guest OS for
security, the game is over. The Guest Kernel must NEVER be able to
do anything harmful to the host. If it can, it is a severe security
bug in KVM that must be fixed ASAP.

- Ted
Andy Lutomirski
2014-09-19 23:41:59 UTC
Permalink
Post by Theodore Ts'o
Post by H. Peter Anvin
Actually, a much bigger reason is because it lets rogue guest *user
space*, even with a well-behaved guest OS, do something potentially
harmful to the host.
Right, but if the host kernel is dependent on the guest OS for
security, the game is over. The Guest Kernel must NEVER be able to
do anything harmful to the host. If it can, it is a severe security
bug in KVM that must be fixed ASAP.
Nonetheless, I suspect that some OS kernel author, somewhere, will
object to having a hypervisor that exposes new capabilities to guest
CPL 3 without requiring the guest to opt in, if for no other reason
than that it slightly increases the attack surface.

I certainly object on these grounds.

--Andy
H. Peter Anvin
2014-09-20 00:06:59 UTC
Permalink
Post by Theodore Ts'o
Post by H. Peter Anvin
Actually, a much bigger reason is because it lets rogue guest *user
space*, even with a well-behaved guest OS, do something potentially
harmful to the host.
Right, but if the host kernel is dependent on the guest OS for
security, the game is over. The Guest Kernel must NEVER be able to
do anything harmful to the host. If it can, it is a severe security
bug in KVM that must be fixed ASAP.
"Security" and "resource well-behaved" are two different things.

-hpa
Andy Lutomirski
2014-09-18 22:00:05 UTC
Permalink
I'm not sure what you mean by "this mechanism?" Are you suggesting that each hypervisor put "CrossHVPara\0" somewhere in the 0x40000000 - 0x400fffff CPUID range, and an OS has to do a full scan of this CPUID range on boot to find it? That seems pretty inefficient. An OS will take 1000's of hypervisor intercepts on every boot just to search this CPUID range.
Linux already does this, which is arguably unfortunate. But it's not
quite that bad; the KVM and Xen code is only scanning at increments of
0x100.
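
For reference, a scan at increments of 0x100 can be sketched in C as follows. The bounds (0x40000000 up to, but not including, 0x40010000) are my assumption about what Linux does, and the function names and the `cpuid_fn` callback are invented for illustration. The simulated host stacks Hyper-V leaves at 0x40000000 and KVM leaves 0x100 later, which is the layout discussed earlier in the thread.

```c
#include <stdint.h>
#include <string.h>

struct cpuid_regs { uint32_t eax, ebx, ecx, edx; };
typedef struct cpuid_regs (*cpuid_fn)(uint32_t leaf);

/* Returns the base leaf whose EBX:ECX:EDX match sig, or 0 if none is found.
 * The scan bounds are an assumption, not a documented guarantee. */
static uint32_t find_hypervisor_base(cpuid_fn cpuid, const char sig[12])
{
    uint32_t base;

    for (base = 0x40000000u; base < 0x40010000u; base += 0x100u) {
        struct cpuid_regs r = cpuid(base);
        uint32_t got[3] = { r.ebx, r.ecx, r.edx };

        if (memcmp(got, sig, 12) == 0)
            return base;
    }
    return 0;
}

/* Simulated host exposing Hyper-V leaves first and KVM leaves 0x100 later. */
static struct cpuid_regs demo_stacked_cpuid(uint32_t leaf)
{
    struct cpuid_regs r = {0, 0, 0, 0};
    const char *sig = NULL;

    if (leaf == 0x40000000u)
        sig = "Microsoft Hv";
    else if (leaf == 0x40000100u)
        sig = "KVMKVMKVM\0\0\0";

    if (sig) {
        memcpy(&r.ebx, sig + 0, 4);
        memcpy(&r.ecx, sig + 4, 4);
        memcpy(&r.edx, sig + 8, 4);
    }
    return r;
}
```

With these bounds a failed search costs at most 256 CPUID exits rather than thousands, and a guest that finds its signature early stops well before that.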

I think that Linux as a guest would have no problem with checking the
Hyper-V range or some new range. I don't think that Linux would want
to have to set a guest OS identity, and it's not entirely clear to me
whether this would be necessary to use the Hyper-V mechanism.
I suggest we come to consensus on a specific CPUID leaf where an OS needs to look to determine if a hypervisor supports this capability. We could define a new CPUID leaf range at a well-defined location, or we could just use one of the existing CPUID leaf ranges implemented by an existing hypervisor. I'm not familiar with the KVM CPUID leaf range, but in the case of Hyper-V, the Hyper-V CPUID leaf range was architected to allow for other hypervisors to implement it and just show through specific capabilities supported by the hypervisor. So, we could define a bit in the Hyper-V CPUID leaf range (since Xen and KVM also implement this range), but that would require Linux to look in that range on boot to discover this capability.
I also don't know whether QEMU and KVM would be okay with implementing
the host side of the Hyper-V mechanism by default. They would have to
implement at least leaves 0x40000001 and 0x40000002, plus correctly
reporting zeros through whatever leaf is used for this new feature.
Gleb? Paolo?

--Andy
H. Peter Anvin
2014-09-18 22:03:57 UTC
Permalink
Post by Andy Lutomirski
I'm not sure what you mean by "this mechanism?" Are you suggesting that each hypervisor put "CrossHVPara\0" somewhere in the 0x40000000 - 0x400fffff CPUID range, and an OS has to do a full scan of this CPUID range on boot to find it? That seems pretty inefficient. An OS will take 1000's of hypervisor intercepts on every boot just to search this CPUID range.
Linux already does this, which is arguably unfortunate. But it's not
quite that bad; the KVM and Xen code is only scanning at increments of
0x100.
I think that Linux as a guest would have no problem with checking the
Hyper-V range or some new range. I don't think that Linux would want
to have to set a guest OS identity, and it's not entirely clear to me
whether this would be necessary to use the Hyper-V mechanism.
We really don't want to have to do this in early code, though.
Post by Andy Lutomirski
I suggest we come to consensus on a specific CPUID leaf where an OS needs to look to determine if a hypervisor supports this capability. We could define a new CPUID leaf range at a well-defined location, or we could just use one of the existing CPUID leaf ranges implemented by an existing hypervisor. I'm not familiar with the KVM CPUID leaf range, but in the case of Hyper-V, the Hyper-V CPUID leaf range was architected to allow for other hypervisors to implement it and just show through specific capabilities supported by the hypervisor. So, we could define a bit in the Hyper-V CPUID leaf range (since Xen and KVM also implement this range), but that would require Linux to look in that range on boot to discover this capability.
I also don't know whether QEMU and KVM would be okay with implementing
the host side of the Hyper-V mechanism by default. They would have to
implement at least leaves 0x40000001 and 0x40000002, plus correctly
reporting zeros through whatever leaf is used for this new feature.
Gleb? Paolo?
The problem is what happens with a noncooperating hypervisor. I guess
we could put a magic number in one of the leaf registers, but still...

-hpa
Gleb Natapov
2014-09-19 16:37:50 UTC
Permalink
Post by Andy Lutomirski
I suggest we come to consensus on a specific CPUID leaf where an OS needs to look to determine if a hypervisor supports this capability. We could define a new CPUID leaf range at a well-defined location, or we could just use one of the existing CPUID leaf ranges implemented by an existing hypervisor. I'm not familiar with the KVM CPUID leaf range, but in the case of Hyper-V, the Hyper-V CPUID leaf range was architected to allow for other hypervisors to implement it and just show through specific capabilities supported by the hypervisor. So, we could define a bit in the Hyper-V CPUID leaf range (since Xen and KVM also implement this range), but that would require Linux to look in that range on boot to discover this capability.
I also don't know whether QEMU and KVM would be okay with implementing
the host side of the Hyper-V mechanism by default. They would have to
implement at least leaves 0x40000001 and 0x40000002, plus correctly
reporting zeros through whatever leaf is used for this new feature.
Gleb? Paolo?
KVM and every other hypervisor out there already implement a capability
detection mechanism in the 0x40000000 range, and of course all of them do
it differently. Linux detects what hypervisor it runs on very early and
switches to the corresponding code to handle each hypervisor. Quite frankly,
I do not see what problem you are trying to fix by standardizing an MSR
to get RND and a detection mechanism for this MSR. The RND MSR is in no way
unique here. There are other mechanisms that are virtually identical
between hypervisors but have different guest/hypervisor interfaces and
are detected differently on different hypervisors. Examples are pvclock
and pveoi; there may be others.

--
Gleb.
H. Peter Anvin
2014-09-19 16:40:07 UTC
Permalink
Post by Gleb Natapov
Linux detects what hypervisor it runs on very early
Not anywhere close to early enough. We're talking for uses like kASLR.

-hpa
Gleb Natapov
2014-09-19 16:53:49 UTC
Permalink
Post by H. Peter Anvin
Post by Gleb Natapov
Linux detects what hypervisor it runs on very early
Not anywhere close to early enough. We're talking for uses like kASLR.
Still too early to do:

h = cpuid(HYPERVISOR_SIGNATURE)
if (h == KVMKVMKVM) {
    if (cpuid(kvm_features) & kvm_rnd)
        rdmsr(kvm_rnd)
} else if (h == HyperV) {
    if (cpuid(hv_features) & hv_rnd)
        rdmsr(hv_rnd)
} else if (h == XenXenXen) {
    if (cpuid(xen_features) & xen_rnd)
        rdmsr(xen_rnd)
}

?

--
Gleb.
H. Peter Anvin
2014-09-19 17:08:20 UTC
Permalink
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Linux detects what hypervisor it runs on very early
Not anywhere close to early enough. We're talking for uses like kASLR.
h = cpuid(HYPERVISOR_SIGNATURE)
if (h == KVMKVMKVM) {
    if (cpuid(kvm_features) & kvm_rnd)
        rdmsr(kvm_rnd)
} else if (h == HyperV) {
    if (cpuid(hv_features) & hv_rnd)
        rdmsr(hv_rnd)
} else if (h == XenXenXen) {
    if (cpuid(xen_features) & xen_rnd)
        rdmsr(xen_rnd)
}
If we need to do chase loops, especially not so...

-hpa
Gleb Natapov
2014-09-19 17:15:45 UTC
Permalink
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Linux detects what hypervisor it runs on very early
Not anywhere close to early enough. We're talking for uses like kASLR.
h = cpuid(HYPERVISOR_SIGNATURE)
if (h == KVMKVMKVM) {
    if (cpuid(kvm_features) & kvm_rnd)
        rdmsr(kvm_rnd)
} else if (h == HyperV) {
    if (cpuid(hv_features) & hv_rnd)
        rdmsr(hv_rnd)
} else if (h == XenXenXen) {
    if (cpuid(xen_features) & xen_rnd)
        rdmsr(xen_rnd)
}
If we need to do chase loops, especially not so...
What loops exactly? As a non native English speaker I fail to understand
if your answer is "yes" or "no" ;)

--
Gleb.
H. Peter Anvin
2014-09-19 17:18:37 UTC
Permalink
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Linux detects what hypervisor it runs on very early
Not anywhere close to early enough. We're talking for uses like kASLR.
h = cpuid(HYPERVISOR_SIGNATURE)
if (h == KVMKVMKVM) {
    if (cpuid(kvm_features) & kvm_rnd)
        rdmsr(kvm_rnd)
} else if (h == HyperV) {
    if (cpuid(hv_features) & hv_rnd)
        rdmsr(hv_rnd)
} else if (h == XenXenXen) {
    if (cpuid(xen_features) & xen_rnd)
        rdmsr(xen_rnd)
}
If we need to do chase loops, especially not so...
What loops exactly? As a non native English speaker I fail to understand
if your answer is "yes" or "no" ;)
The above isn't actually the full algorithm used.

-hpa
Gleb Natapov
2014-09-19 17:49:42 UTC
Permalink
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Linux detects what hypervisor it runs on very early
Not anywhere close to early enough. We're talking for uses like kASLR.
h = cpuid(HYPERVISOR_SIGNATURE)
if (h == KVMKVMKVM) {
    if (cpuid(kvm_features) & kvm_rnd)
        rdmsr(kvm_rnd)
} else if (h == HyperV) {
    if (cpuid(hv_features) & hv_rnd)
        rdmsr(hv_rnd)
} else if (h == XenXenXen) {
    if (cpuid(xen_features) & xen_rnd)
        rdmsr(xen_rnd)
}
If we need to do chase loops, especially not so...
What loops exactly? As a non native English speaker I fail to understand
if your answer is "yes" or "no" ;)
The above isn't actually the full algorithm used.
What part of the actual algorithm cannot be implemented? The loop that searches
for the KVM leaf in case KVM pretends to be HyperV (is this what you called
"chase loops"?)? First of all, there is no need to implement it: if KVM
pretends to be HyperV, use HyperV's way to obtain RNG. But what is the
problem with the loop?

--
Gleb.
Andy Lutomirski
2014-09-19 18:02:38 UTC
Permalink
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Linux detects what hypervisor it runs on very early
Not anywhere close to early enough. We're talking for uses like kASLR.
h = cpuid(HYPERVISOR_SIGNATURE)
if (h == KVMKVMKVM) {
    if (cpuid(kvm_features) & kvm_rnd)
        rdmsr(kvm_rnd)
} else if (h == HyperV) {
    if (cpuid(hv_features) & hv_rnd)
        rdmsr(hv_rnd)
} else if (h == XenXenXen) {
    if (cpuid(xen_features) & xen_rnd)
        rdmsr(xen_rnd)
}
If we need to do chase loops, especially not so...
What loops exactly? As a non native English speaker I fail to understand
if your answer is "yes" or "no" ;)
The above isn't actually the full algorithm used.
What part of the actual algorithm cannot be implemented? The loop that searches
for the KVM leaf in case KVM pretends to be HyperV (is this what you called
"chase loops"?)? First of all, there is no need to implement it: if KVM
pretends to be HyperV, use HyperV's way to obtain RNG. But what is the
problem with the loop?
It can be implemented, and I've done it. But it's a mess. Almost the
very first thing we do in boot (even before decompressing the kernel)
will be to scan a bunch of cpuid leaves looking for a hypervisor with
an rng source that we can use for kASLR. And we'll have to update
that code and make it bigger every time another hypervisor adds
exactly the same feature.

And then we have another copy of almost exactly the same code in the
normal post-boot part of the kernel.

We can certainly do this, but I'd much rather solve the problem once
and let all of the hypervisors and guests opt in and immediately be
compatible with each other.
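
A rough sketch of what the guest side could shrink to if a common interface existed, assuming the read-and-handle-#GP probing approach. Everything named here is hypothetical: HYPOTHETICAL_RNG_MSR is a placeholder index (no index has been agreed on), rdmsr_try_fn models a fault-tolerant MSR read that returns nonzero where real hardware would raise #GP(0), and mock_rdmsr returns predictable values purely so the example runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder MSR index for illustration only; nothing has been agreed. */
#define HYPOTHETICAL_RNG_MSR UINT32_C(0x400000C0)

/* Hypothetical fault-tolerant MSR read: 0 on success, nonzero if the
 * read would have faulted with #GP(0). */
typedef int (*rdmsr_try_fn)(uint32_t msr, uint64_t *val);

/* Read the same MSR n times to build an n * 64-bit seed.
 * Returns 0 on success, -1 if the MSR is not present. */
static int fill_seed(rdmsr_try_fn rdmsr_try, uint32_t msr,
                     uint64_t *out, size_t n)
{
    assert(rdmsr_try != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (rdmsr_try(msr, &out[i]) != 0)
            return -1;
    }
    return 0;
}

/* Mock host: only the placeholder MSR exists. The values are a simple
 * counter sequence, NOT random; they only make the example testable. */
static int mock_rdmsr(uint32_t msr, uint64_t *val)
{
    static uint64_t counter;

    if (msr != HYPOTHETICAL_RNG_MSR)
        return -1;
    *val = UINT64_C(0x9e3779b97f4a7c15) * ++counter;
    return 0;
}
```

The point of the sketch is that the guest needs no per-hypervisor branches: one probe, then repeated reads for as many bits as it wants.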
Post by Gleb Natapov
I "forgot" VMware because I do not see VMware people to be CCed. They may
be even less excited about them being told _how_ this feature need to be
implemented (e.g implement HyperV leafs for the feature detection). I
do not want to and cannot speak for VMware, but my guess is that for
them it would be much easier to add an else clause for VMware in above
"if" then to coordinate with all hypervisor developers about MSR/cpuid
details. And since this is security feature implementing it for Linux
is in their best interest.
Do you know any of them who should be cc'd?

--Andy
Gleb Natapov
2014-09-19 18:12:49 UTC
Permalink
Post by Andy Lutomirski
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Linux detects what hypervisor it runs on very early
Not anywhere close to early enough. We're talking about uses like kASLR.
h = cpuid(HYPERVISOR_SIGNATURE)
if (h == KVMKVMKVM) {
if (cpuid(kvm_features) & kvm_rnd)
rdmsr(kvm_rnd)
else (h == HyperV) {
if (cpuid(hv_features) & hv_rnd)
rdmsr(hv_rnd)
else (h == XenXenXen) {
if (cpuid(xen_features) & xen_rnd)
rdmsr(xen_rnd)
}
If we need to do chase loops, especially not so...
What loops exactly? As a non native English speaker I fail to understand
if your answer is "yes" or "no" ;)
The above isn't actually the full algorithm used.
What part of actually algorithm cannot be implemented? Loop that searches
for KVM leaf in case KVM pretend to be HyperV (is this what you called
"chase loops"?)? First of all there is no need to implement it, if KVM
pretends to be HyperV use HyperV's way to obtain RNG, but what is the
problem with the loop?
It can be implemented, and I've done it. But it's a mess. Almost the
very first thing we do in boot (even before decompressing the kernel)
will be to scan a bunch of cpuid leaves looking for a hypervisor with
an rng source that we can use for kASLR. And we'll have to update
that code and make it bigger every time another hypervisor adds
exactly the same feature.
IMO implementing this feature is in the hypervisor's best interest, so the
task of updating the code will scale by virtue of hypervisor developers each
adding it for the hypervisor he cares about.
Post by Andy Lutomirski
And then we have another copy of almost exactly the same code in the
normal post-boot part of the kernel.
We can certainly do this, but I'd much rather solve the problem once
and let all of the hypervisors and guests opt in and immediately be
compatible with each other.
Post by Gleb Natapov
I "forgot" VMware because I do not see VMware people to be CCed. They may
be even less excited about them being told _how_ this feature need to be
implemented (e.g implement HyperV leafs for the feature detection). I
do not want to and cannot speak for VMware, but my guess is that for
them it would be much easier to add an else clause for VMware in above
"if" then to coordinate with all hypervisor developers about MSR/cpuid
details. And since this is security feature implementing it for Linux
is in their best interest.
Do you know any of them who should be cc'd?
No, not anyone in particular. git log arch/x86/kernel/cpu/vmware.c may help.

But VMware is an elephant in the room here. There are other hypervisors out there.
VirtualBox, bhyve...

--
Gleb.
Andy Lutomirski
2014-09-19 18:20:49 UTC
Permalink
[cc: Alok Kataria at VMware]
Post by Gleb Natapov
Post by Andy Lutomirski
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Linux detects what hypervisor it runs on very early
Not anywhere close to early enough. We're talking about uses like kASLR.
h = cpuid(HYPERVISOR_SIGNATURE)
if (h == KVMKVMKVM) {
if (cpuid(kvm_features) & kvm_rnd)
rdmsr(kvm_rnd)
else (h == HyperV) {
if (cpuid(hv_features) & hv_rnd)
rdmsr(hv_rnd)
else (h == XenXenXen) {
if (cpuid(xen_features) & xen_rnd)
rdmsr(xen_rnd)
}
If we need to do chase loops, especially not so...
What loops exactly? As a non native English speaker I fail to understand
if your answer is "yes" or "no" ;)
The above isn't actually the full algorithm used.
What part of actually algorithm cannot be implemented? Loop that searches
for KVM leaf in case KVM pretend to be HyperV (is this what you called
"chase loops"?)? First of all there is no need to implement it, if KVM
pretends to be HyperV use HyperV's way to obtain RNG, but what is the
problem with the loop?
It can be implemented, and I've done it. But it's a mess. Almost the
very first thing we do in boot (even before decompressing the kernel)
will be to scan a bunch of cpuid leaves looking for a hypervisor with
an rng source that we can use for kASLR. And we'll have to update
that code and make it bigger every time another hypervisor adds
exactly the same feature.
IMO implementing this feature is in hypervisor's best interest, so the task
of updating the code will scale by virtue of hypervisor's developers each
adding it for hypervisor he cares about.
I assume that you mean guest, not hypervisor.
Post by Gleb Natapov
Post by Andy Lutomirski
And then we have another copy of almost exactly the same code in the
normal post-boot part of the kernel.
We can certainly do this, but I'd much rather solve the problem once
and let all of the hypervisors and guests opt in and immediately be
compatible with each other.
Post by Gleb Natapov
I "forgot" VMware because I do not see VMware people to be CCed. They may
be even less excited about them being told _how_ this feature need to be
implemented (e.g implement HyperV leafs for the feature detection). I
do not want to and cannot speak for VMware, but my guess is that for
them it would be much easier to add an else clause for VMware in above
"if" then to coordinate with all hypervisor developers about MSR/cpuid
details. And since this is security feature implementing it for Linux
is in their best interest.
Do you know any of them who should be cc'd?
No, not anyone in particular. git log arch/x86/kernel/cpu/vmware.c may help.
But VMware is an elephant in the room here. There are other hypervisors out there.
VirtualBox, bhyve...
Exactly. The amount of effort to get everything to be compatible with
everything scales quadratically in the number of hypervisors, and the
probability that some combination is broken also increases.

If we can get everyone to back something common here then this problem
goes away.

--Andy
Post by Gleb Natapov
--
Gleb.
--
Andy Lutomirski
AMA Capital Management, LLC
Gleb Natapov
2014-09-19 20:53:42 UTC
Permalink
Post by Andy Lutomirski
[cc: Alok Kataria at VMware]
Post by Gleb Natapov
Post by Andy Lutomirski
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Linux detects what hypervisor it runs on very early
Not anywhere close to early enough. We're talking about uses like kASLR.
h = cpuid(HYPERVISOR_SIGNATURE)
if (h == KVMKVMKVM) {
if (cpuid(kvm_features) & kvm_rnd)
rdmsr(kvm_rnd)
else (h == HyperV) {
if (cpuid(hv_features) & hv_rnd)
rdmsr(hv_rnd)
else (h == XenXenXen) {
if (cpuid(xen_features) & xen_rnd)
rdmsr(xen_rnd)
}
If we need to do chase loops, especially not so...
What loops exactly? As a non native English speaker I fail to understand
if your answer is "yes" or "no" ;)
The above isn't actually the full algorithm used.
What part of actually algorithm cannot be implemented? Loop that searches
for KVM leaf in case KVM pretend to be HyperV (is this what you called
"chase loops"?)? First of all there is no need to implement it, if KVM
pretends to be HyperV use HyperV's way to obtain RNG, but what is the
problem with the loop?
It can be implemented, and I've done it. But it's a mess. Almost the
very first thing we do in boot (even before decompressing the kernel)
will be to scan a bunch of cpuid leaves looking for a hypervisor with
an rng source that we can use for kASLR. And we'll have to update
that code and make it bigger every time another hypervisor adds
exactly the same feature.
IMO implementing this feature is in hypervisor's best interest, so the task
of updating the code will scale by virtue of hypervisor's developers each
adding it for hypervisor he cares about.
I assume that you mean guest, not hypervisor.
Yes, I mean guest support for the hypervisor he cares about.
Post by Andy Lutomirski
Post by Gleb Natapov
Post by Andy Lutomirski
And then we have another copy of almost exactly the same code in the
normal post-boot part of the kernel.
We can certainly do this, but I'd much rather solve the problem once
and let all of the hypervisors and guests opt in and immediately be
compatible with each other.
Post by Gleb Natapov
I "forgot" VMware because I do not see VMware people to be CCed. They may
be even less excited about them being told _how_ this feature need to be
implemented (e.g implement HyperV leafs for the feature detection). I
do not want to and cannot speak for VMware, but my guess is that for
them it would be much easier to add an else clause for VMware in above
"if" then to coordinate with all hypervisor developers about MSR/cpuid
details. And since this is security feature implementing it for Linux
is in their best interest.
Do you know any of them who should be cc'd?
No, not anyone in particular. git log arch/x86/kernel/cpu/vmware.c may help.
But VMware is an elephant in the room here. There are other hypervisors out there.
VirtualBox, bhyve...
Exactly. The amount of effort to get everything to be compatible with
everything scales quadratically in the number of hypervisors, and the
probability that some combination is broken also increases.
The effort is distributed equally among hypervisor developers. If they
want Linux to be more secure on their hypervisor, they contribute guest
code. They need to write the hypervisor part anyway. On CPUs with the RDRAND
instruction this MSR is not even needed, and some hypervisors may decide
that support for old CPUs is not worth the effort. A unified interface
does not help if a hypervisor does not implement it.

--
Gleb.
Alok Kataria
2014-09-22 04:11:07 UTC
Permalink
Hi Andy,
Post by Andy Lutomirski
[cc: Alok Kataria at VMware]
Post by Gleb Natapov
Post by Andy Lutomirski
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Linux detects what hypervisor it runs on very early
Not anywhere close to early enough. We're talking about uses like kASLR.
h = cpuid(HYPERVISOR_SIGNATURE)
if (h == KVMKVMKVM) {
if (cpuid(kvm_features) & kvm_rnd)
rdmsr(kvm_rnd)
else (h == HyperV) {
if (cpuid(hv_features) & hv_rnd)
rdmsr(hv_rnd)
else (h == XenXenXen) {
if (cpuid(xen_features) & xen_rnd)
rdmsr(xen_rnd)
}
If we need to do chase loops, especially not so...
What loops exactly? As a non native English speaker I fail to understand
if your answer is "yes" or "no" ;)
The above isn't actually the full algorithm used.
What part of actually algorithm cannot be implemented? Loop that searches
for KVM leaf in case KVM pretend to be HyperV (is this what you called
"chase loops"?)? First of all there is no need to implement it, if KVM
pretends to be HyperV use HyperV's way to obtain RNG, but what is the
problem with the loop?
It can be implemented, and I've done it. But it's a mess. Almost the
very first thing we do in boot (even before decompressing the kernel)
will be to scan a bunch of cpuid leaves looking for a hypervisor with
an rng source that we can use for kASLR. And we'll have to update
that code and make it bigger every time another hypervisor adds
exactly the same feature.
IMO implementing this feature is in hypervisor's best interest, so the task
of updating the code will scale by virtue of hypervisor's developers each
adding it for hypervisor he cares about.
I assume that you mean guest, not hypervisor.
Post by Gleb Natapov
Post by Andy Lutomirski
And then we have another copy of almost exactly the same code in the
normal post-boot part of the kernel.
We can certainly do this, but I'd much rather solve the problem once
and let all of the hypervisors and guests opt in and immediately be
compatible with each other.
Post by Gleb Natapov
I "forgot" VMware because I do not see VMware people to be CCed. They may
be even less excited about them being told _how_ this feature need to be
implemented (e.g implement HyperV leafs for the feature detection). I
do not want to and cannot speak for VMware, but my guess is that for
them it would be much easier to add an else clause for VMware in above
"if" then to coordinate with all hypervisor developers about MSR/cpuid
details. And since this is security feature implementing it for Linux
is in their best interest.
Do you know any of them who should be cc'd?
No, not anyone in particular. git log arch/x86/kernel/cpu/vmware.c may help.
But VMware is an elephant in the room here. There are other hypervisors out there.
VirtualBox, bhyve...
Exactly. The amount of effort to get everything to be compatible with
everything scales quadratically in the number of hypervisors, and the
probability that some combination is broken also increases.
If we can get everyone to back something common here then this problem
goes away.
There was a similar attempt a few years back [1] to standardize on the
hypervisor cpuid space. Though a few of them were interested, getting
all hypervisor vendors to agree (actually even discuss this) turned out
to be a futile exercise. Don't mean to discourage you, but what I
learned from that attempt was that it's very difficult to standardize
unless the hardware vendors are proposing it.

In any case, can you point me to a mail which discusses the specifics of
the interface you are proposing?

Alok

[1] - http://thread.gmane.org/gmane.comp.emulators.kvm.devel/22643
https://lkml.org/lkml/2008/9/26/351
Andy Lutomirski
2014-09-19 17:21:27 UTC
Permalink
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Linux detects what hypervisor it runs on very early
Not anywhere close to early enough. We're talking about uses like kASLR.
h = cpuid(HYPERVISOR_SIGNATURE)
if (h == KVMKVMKVM) {
if (cpuid(kvm_features) & kvm_rnd)
rdmsr(kvm_rnd)
else (h == HyperV) {
if (cpuid(hv_features) & hv_rnd)
rdmsr(hv_rnd)
else (h == XenXenXen) {
if (cpuid(xen_features) & xen_rnd)
rdmsr(xen_rnd)
}
I think that there's a lot of value in having each guest
implementation be automatically compatible with all hypervisors. For
example, you forgot VMware, and VMware might be less excited about
implementing this feature if all the guests won't immediately start
using it.
Post by Gleb Natapov
?
--
Gleb.
Gleb Natapov
2014-09-19 17:59:44 UTC
Permalink
Post by Andy Lutomirski
Post by Gleb Natapov
Post by H. Peter Anvin
Post by Gleb Natapov
Linux detects what hypervisor it runs on very early
Not anywhere close to early enough. We're talking about uses like kASLR.
h = cpuid(HYPERVISOR_SIGNATURE)
if (h == KVMKVMKVM) {
if (cpuid(kvm_features) & kvm_rnd)
rdmsr(kvm_rnd)
else (h == HyperV) {
if (cpuid(hv_features) & hv_rnd)
rdmsr(hv_rnd)
else (h == XenXenXen) {
if (cpuid(xen_features) & xen_rnd)
rdmsr(xen_rnd)
}
I think that there's a lot of value in having each guest
implementation be automatically compatible with all hypervisors. For
example, you forgot VMware, and VMware might be less excited about
implementing this feature if all the guests won't immediately start
using it.
I "forgot" VMware because I do not see any VMware people CCed. They may
be even less excited about being told _how_ this feature needs to be
implemented (e.g. implement Hyper-V leaves for the feature detection). I
do not want to and cannot speak for VMware, but my guess is that for
them it would be much easier to add an else clause for VMware in the above
"if" than to coordinate with all hypervisor developers about MSR/CPUID
details. And since this is a security feature, implementing it for Linux
is in their best interest.

--
Gleb.
David Hepkin
2014-09-18 21:46:05 UTC
Permalink
I'm not sure what you mean by "this mechanism." Are you suggesting that each hypervisor put "CrossHVPara\0" somewhere in the 0x40000000 - 0x400fffff CPUID range, and an OS has to do a full scan of this CPUID range on boot to find it? That seems pretty inefficient. An OS will take thousands of hypervisor intercepts on every boot just to search this CPUID range.

I suggest we come to consensus on a specific CPUID leaf where an OS needs to look to determine if a hypervisor supports this capability. We could define a new CPUID leaf range at a well-defined location, or we could just use one of the existing CPUID leaf ranges implemented by an existing hypervisor. I'm not familiar with the KVM CPUID leaf range, but in the case of Hyper-V, the Hyper-V CPUID leaf range was architected to allow for other hypervisors to implement it and just show through specific capabilities supported by the hypervisor. So, we could define a bit in the Hyper-V CPUID leaf range (since Xen and KVM also implement this range), but that would require Linux to look in that range on boot to discover this capability.
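
To put a rough number on the scan cost: assuming signatures may sit at any 0x100-aligned leaf (the 0x400xyz00 pattern Andy describes) and that every CPUID executed in the guest is intercepted by the hypervisor, the worst-case probe count is simple arithmetic. count_probes is a hypothetical helper written only for this illustration, and the stride is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Number of CPUID probes needed to visit every stride-aligned leaf in
 * the inclusive range [first, last]. */
static uint32_t count_probes(uint32_t first, uint32_t last, uint32_t stride)
{
    assert(stride != 0 && last >= first);
    return (last - first) / stride + 1;
}
```

For the full 0x40000000 - 0x400fffff range at a 0x100 stride this comes to 4096 probes in the worst case, against a single probe when the leaf location is fixed in advance.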

Thanks...

David

-----Original Message-----
From: Andy Lutomirski [mailto:luto at amacapital.net]
Sent: Thursday, September 18, 2014 12:07 PM
To: Paolo Bonzini
Cc: Jun Nakajima; KY Srinivasan; Mathew John; Theodore Ts'o; John Starks; kvm list; Gleb Natapov; Niels Ferguson; David Hepkin; H. Peter Anvin; Jake Oshins; Linux Virtualization
Subject: Re: Standardizing an MSR or other hypercall to get an RNG seed?
Post by Andy Lutomirski
Post by Nakajima, Jun
Actually, that MSR address range has been reserved for that
purpose, along with:
- CPUID.EAX=1 -> ECX bit 31 (always returns 0 on bare metal)
- CPUID.EAX=4000_00xxH leaves (i.e. HYPERVISOR CPUID)
I don't know whether this is documented anywhere, but Linux tries to
detect a hypervisor by searching CPUID leaves 0x400xyz00 for
"KVMKVMKVM\0\0\0", so at least Linux can handle the KVM leaves being
in a somewhat variable location.
Do we consider this mechanism to work across all hypervisors and
guests? That is, could we put something like "CrossHVPara\0"
somewhere in that range, where each hypervisor would be free to
decide exactly where it ends up?
That's also possible, but extending the hypervisor CPUID range beyond
400000FFH is not officially sanctioned by Intel.
Xen started doing that in order to expose both Hyper-V and Xen CPUID
leaves, and KVM followed the practice.
Whoops.

Might Intel be willing to extend that range to 0x40000000 - 0x400fffff? And would Microsoft be okay with using this mechanism for discovery?

Do we have anyone from VMware in this thread? I don't have any VMware contacts.

--Andy
Paolo Bonzini
2014-09-18 18:56:01 UTC
Permalink
Post by Nakajima, Jun
Post by KY Srinivasan
Post by Paolo Bonzini
However, I think it would be better to have the MSR (and perhaps CPUID)
outside the hypervisor-reserved ranges, so that it becomes architecturally
defined. In some sense it is similar to the HYPERVISOR CPUID feature.
Yes, given that we want this to be hypervisor agnostic.
- CPUID.EAX=1 -> ECX bit 31 (always returns 0 on bare metal)
- CPUID.EAX=4000_00xxH leaves (i.e. HYPERVISOR CPUID)
No, that has been reserved for hypervisor-specific information (same for the MSR).
Here we want a feature that is standardized across all hypervisors.

Of course we could just agree to have a common 4000_00C0H to 4000_00FFH range
agreed upon by KVM/Xen/Hyper-V/VMware for both MSRs and CPUID. But it would
be nice for Intel to act as the registrar, also because this particular
feature in principle can be implemented by processors too (not that it makes
much sense since you could use RDRAND, but it _could_).

Paolo
Jake Oshins
2014-09-18 17:20:28 UTC
Permalink
That certainly sounds reasonable to me. How do you see discovery of that working?

Thanks,
Jake Oshins


-----Original Message-----
From: Paolo Bonzini [mailto:paolo.bonzini at gmail.com] On Behalf Of Paolo Bonzini
Sent: Thursday, September 18, 2014 10:18 AM
To: Nakajima, Jun; KY Srinivasan
Cc: Mathew John; Theodore Ts'o; John Starks; kvm list; Gleb Natapov; Niels Ferguson; Andy Lutomirski; David Hepkin; H. Peter Anvin; Jake Oshins; Linux Virtualization
Subject: Re: Standardizing an MSR or other hypercall to get an RNG seed?
Post by Nakajima, Jun
In terms of the address for the MSR, I suggest that you choose one
from the range between 40000000H - 400000FFH. The SDM (35.1
ARCHITECTURAL MSRS) says "All existing and
future processors will not implement any features using any MSR in
this range." Hyper-V already defines many synthetic MSRs in this
range, and I think it would be reasonable for you to pick one for this
to avoid a conflict?
KVM is not using any MSR in that range.

However, I think it would be better to have the MSR (and perhaps CPUID)
outside the hypervisor-reserved ranges, so that it becomes
architecturally defined. In some sense it is similar to the HYPERVISOR
CPUID feature.

Paolo
KY Srinivasan
2014-09-18 14:40:44 UTC
Permalink
-----Original Message-----
From: virtualization-bounces at lists.linux-foundation.org
[mailto:virtualization-bounces at lists.linux-foundation.org] On Behalf Of Andy
Lutomirski
Sent: Wednesday, September 17, 2014 7:51 PM
To: Linux Virtualization; kvm list
Cc: Gleb Natapov; Paolo Bonzini; Theodore Ts'o; H. Peter Anvin
Subject: Standardizing an MSR or other hypercall to get an RNG seed?
Hi all-
I would like to standardize on a very simple protocol by which a guest OS can
obtain an RNG seed early in boot.
- The interface should be very easy to use. Linux, at least, will want to use it
extremely early in boot as part of kernel ASLR. This means that PCI and ACPI
will not work.
- It should be synchronous. We don't want to delay boot while waiting for a
slow host RNG. (On Linux, at least, we have a separate interface for that:
virtio-rng. I think that Windows has some support for virtio-rng as well.)
- Random numbers obtained through this interface should be best-effort.
We want the best quality randomness that the host can provide
immediately.
It seems to me that the best interface for the actual request for a random
number is rdmsr. This is supported on all hypervisors and all virtualization
technologies. It can return a 64 bit random number, and it is easy to rdmsr
the same register more than once to get a larger random number.
The main questions are what MSR index to use and how to detect the
presence of the MSR. I've played with two approaches:
1. Use CPUID to detect the presence of this feature. This is very easy for
KVM to implement by using a KVM-specific CPUID feature. The problem is
that this will necessarily be KVM-specific, as the guest must first probe for
KVM and then probe for the KVM feature. I doubt that Hyper-V, for
example, wants to claim to be KVM. If we could standardize a non-
hypervisor-specific CPUID feature, then this problem would go away.
We would prefer a CPUID feature bit to detect this feature.
2. Detect the existence of the MSR by trying to read it and handling the
#GP(0) that will occur if the MSR is not present. Linux, at least, is okay with
doing this, and I have code to enable an IDT and an rdmsr fixup early enough
in boot to use it for ASLR. I don't know whether other operating systems can
do this, though.
The major questions, then, are what enumeration mechanism should be
used and what MSR index should be used.
For the MSR index, we could use an MSR from the Intel range if Intel were to
give explicit approval, thus guaranteeing that nothing would conflict. Or we
could try to agree on an MSR index in the 0x40000000-0x4fffffff range that is
unlikely to conflict with anything.
For enumeration, we could just probe the MSR if all relevant guests are okay
with this or we could standardize on a CPUID-based mechanism.
If we do the latter, I don't know what that mechanism would be.
NB: This thread will be cc'd to Microsoft and possibly Hyper-V people shortly.
I very much appreciate Jun Nakajima's help with this!
Thanks,
Andy
Regards,

K. Y
--
Andy Lutomirski
AMA Capital Management, LLC
Christopher Covington
2014-09-19 18:30:51 UTC
Permalink
Post by Andy Lutomirski
Hi all-
I would like to standardize on a very simple protocol by which a guest
OS can obtain an RNG seed early in boot.
- The interface should be very easy to use. Linux, at least, will
want to use it extremely early in boot as part of kernel ASLR. This
means that PCI and ACPI will not work.
How do non-virtual systems get entropy this early? RDRAND/Padlock? Truerand?
Could hypervisors and simulators simply make sure these work?

Christopher
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation.
Andy Lutomirski
2014-09-19 18:42:02 UTC
Permalink
On Fri, Sep 19, 2014 at 11:30 AM, Christopher Covington
Post by Christopher Covington
Post by Andy Lutomirski
Hi all-
I would like to standardize on a very simple protocol by which a guest
OS can obtain an RNG seed early in boot.
- The interface should be very easy to use. Linux, at least, will
want to use it extremely early in boot as part of kernel ASLR. This
means that PCI and ACPI will not work.
How do non-virtual systems get entropy this early? RDRAND/Padlock? Truerand?
Could hypervisors and simulators simply make sure these work?
If RDRAND is available, then Linux, at least, will use it. The rest
are too complicated for early use. Linux on x86 plays some vaguely
clever games with rdtsc and poking at the i8254 port.

I think that these tricks are even less useful as a guest than they
are on metal, and we can use paravirt mechanisms to make guest early
boot rngs much stronger.
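
As a sketch of the fallback order described here (RDRAND first, then a paravirtual source, then timing tricks), a guest could pick its early seed source along these lines. The two CPUID.01H:ECX bits are architectural (bit 30 reports RDRAND, bit 31 reports a hypervisor); the enum, pick_seed_source and the pv_rng_advertised flag are hypothetical and stand in for whatever enumeration mechanism ends up being agreed on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID.01H:ECX feature bits. */
#define CPUID_1_ECX_RDRAND     (UINT32_C(1) << 30)
#define CPUID_1_ECX_HYPERVISOR (UINT32_C(1) << 31)

/* Hypothetical labels for the early-boot seed sources discussed above. */
enum seed_source {
    SEED_RDRAND,   /* hardware RDRAND instruction */
    SEED_PARAVIRT, /* hypervisor-provided seed, e.g. via an MSR */
    SEED_TIMING    /* rdtsc / i8254 style fallback */
};

/* Pick the best available source. pv_rng_advertised is an assumption:
 * it represents the not-yet-standardized enumeration result. */
static enum seed_source pick_seed_source(uint32_t cpuid1_ecx,
                                         bool pv_rng_advertised)
{
    if (cpuid1_ecx & CPUID_1_ECX_RDRAND)
        return SEED_RDRAND;
    if ((cpuid1_ecx & CPUID_1_ECX_HYPERVISOR) && pv_rng_advertised)
        return SEED_PARAVIRT;
    return SEED_TIMING;
}
```

A real implementation might well mix several sources together rather than choosing one; the sketch only shows the order of preference.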

--Andy
Nadav Amit
2014-09-19 20:21:15 UTC
Permalink
Post by Andy Lutomirski
On Fri, Sep 19, 2014 at 11:30 AM, Christopher Covington
Post by Christopher Covington
Post by Andy Lutomirski
Hi all-
I would like to standardize on a very simple protocol by which a guest
OS can obtain an RNG seed early in boot.
- The interface should be very easy to use. Linux, at least, will
want to use it extremely early in boot as part of kernel ASLR. This
means that PCI and ACPI will not work.
How do non-virtual systems get entropy this early? RDRAND/Padlock? Truerand?
Could hypervisors and simulators simply make sure these work?
If RDRAND is available, then Linux, at least, will use it. The rest
are too complicated for early use. Linux on x86 plays some vaguely
clever games with rdtsc and poking at the i8254 port.
I think that these tricks are even less useful as a guest than they
are on metal, and we can use paravirt mechanisms to make guest early
boot rngs much stronger.
Sorry for interrupting, as I understand the discussion tries to be generic.

However, it sounds to me that at least for KVM, it is very easy just to emulate the RDRAND instruction. The hypervisor would report to the guest that RDRAND is supported in CPUID and then emulate the instruction when the guest executes it. KVM already traps guest #UD (which would occur if RDRAND is executed while it is not supported), so this scheme wouldn't introduce additional overhead over RDMSR.

Nadav
Andy Lutomirski
2014-09-19 20:46:33 UTC
Permalink
Post by Nadav Amit
Post by Andy Lutomirski
On Fri, Sep 19, 2014 at 11:30 AM, Christopher Covington
Post by Christopher Covington
Post by Andy Lutomirski
Hi all-
I would like to standardize on a very simple protocol by which a guest
OS can obtain an RNG seed early in boot.
- The interface should be very easy to use. Linux, at least, will
want to use it extremely early in boot as part of kernel ASLR. This
means that PCI and ACPI will not work.
How do non-virtual systems get entropy this early? RDRAND/Padlock? Truerand?
Could hypervisors and simulators simply make sure these work?
If RDRAND is available, then Linux, at least, will use it. The rest
are too complicated for early use. Linux on x86 plays some vaguely
clever games with rdtsc and poking at the i8254 port.
I think that these tricks are even less useful as a guest than they
are on metal, and we can use paravirt mechanisms to make guest early
boot rngs much stronger.
Sorry for interrupting, as I understand the discussion tries to be generic.
However, it sounds to me that at least for KVM, it is very easy just to emulate the RDRAND instruction. The hypervisor would report to the guest that RDRAND is supported in CPUID and then emulate the instruction when the guest executes it. KVM already traps guest #UD (which would occur if RDRAND is executed while it is not supported), so this scheme wouldn't introduce additional overhead over RDMSR.
Because then guest user code will think that rdrand is there and will
try to use it, resulting in abysmal performance.

--Andy
Post by Nadav Amit
Nadav
--
Andy Lutomirski
AMA Capital Management, LLC
H. Peter Anvin
2014-09-19 21:46:40 UTC
Permalink
Post by Andy Lutomirski
Post by Nadav Amit
However, it sounds to me that at least for KVM, it is very easy just to emulate the RDRAND instruction. The hypervisor would report to the guest that RDRAND is supported in CPUID and then emulate the instruction when the guest executes it. KVM already traps guest #UD (which would occur if RDRAND is executed while it is not supported), so this scheme wouldn't introduce additional overhead over RDMSR.
Because then guest user code will think that rdrand is there and will
try to use it, resulting in abysmal performance.
Yes, the presence of RDRAND implies a cheap and inexhaustible entropy
source.

-hpa
Christopher Covington
2014-09-22 13:31:55 UTC
Permalink
Post by H. Peter Anvin
Post by Andy Lutomirski
Post by Nadav Amit
However, it sounds to me that at least for KVM, it is very easy just to emulate the RDRAND instruction. The hypervisor would report to the guest that RDRAND is supported in CPUID and then emulate the instruction when the guest executes it. KVM already traps guest #UD (which would occur if RDRAND is executed while it is not supported), so this scheme wouldn't introduce additional overhead over RDMSR.
Because then guest user code will think that rdrand is there and will
try to use it, resulting in abysmal performance.
Yes, the presence of RDRAND implies a cheap and inexhaustible entropy
source.
A guest kernel couldn't make it look like RDRAND is not present to guest
userspace?

Christopher
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation.
H. Peter Anvin
2014-09-22 14:17:11 UTC
Permalink
Post by Christopher Covington
Post by H. Peter Anvin
Post by Andy Lutomirski
Post by Nadav Amit
However, it sounds to me that at least for KVM, it is very easy just to emulate the RDRAND instruction. The hypervisor would report to the guest that RDRAND is supported in CPUID and then emulate the instruction when the guest executes it. KVM already traps guest #UD (which would occur if RDRAND is executed while it is not supported), so this scheme wouldn't introduce additional overhead over RDMSR.
Because then guest user code will think that rdrand is there and will
try to use it, resulting in abysmal performance.
Yes, the presence of RDRAND implies a cheap and inexhaustible entropy
source.
A guest kernel couldn't make it look like RDRAND is not present to guest
userspace?
It could, but how would you enumerate that? A new "RDRAND-CPL-0" CPUID
bit pretty much would be required.

-hpa
H. Peter Anvin
2014-09-22 14:18:20 UTC
Permalink
Post by H. Peter Anvin
It could, but how would you enumerate that? A new "RDRAND-CPL-0" CPUID
bit pretty much would be required.
Note that there are two things that differ: the CPL 0-ness and the
performance/exhaustibility attributes.

-hpa
H. Peter Anvin
2014-09-22 23:01:28 UTC
Permalink
Not really, no.

Sent from my tablet, pardon any formatting problems.
Post by Christopher Covington
Post by H. Peter Anvin
Post by Andy Lutomirski
Post by Nadav Amit
However, it sounds to me that at least for KVM, it is very easy just to emulate the RDRAND instruction. The hypervisor would report to the guest that RDRAND is supported in CPUID and then emulate the instruction when the guest executes it. KVM already traps guest #UD (which would occur if RDRAND is executed while it is not supported) - so this scheme wouldn't introduce additional overhead over RDMSR.
Because then guest user code will think that rdrand is there and will
try to use it, resulting in abysmal performance.
Yes, the presence of RDRAND implies a cheap and inexhaustible entropy
source.
A guest kernel couldn't make it look like RDRAND is not present to guest
userspace?
Christopher
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation.
Paolo Bonzini
2014-09-21 12:39:01 UTC
Permalink
Post by Andy Lutomirski
Post by Nadav Amit
However, it sounds to me that at least for KVM, it is very easy just to emulate the RDRAND instruction. The hypervisor would report to the guest that RDRAND is supported in CPUID and then emulate the instruction when the guest executes it. KVM already traps guest #UD (which would occur if RDRAND is executed while it is not supported) - so this scheme wouldn't introduce additional overhead over RDMSR.
Because then guest user code will think that rdrand is there and will
try to use it, resulting in abysmal performance.
KVM could expose a CPUID leaf that says "RDRAND is not there, but if you
execute it the hypervisor will try to do something slow but sane".

Paolo
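
The idea of a separate "slow but sane" enumeration can be sketched as a guest-side policy. This is only an illustration under stated assumptions: the architectural RDRAND bit (CPUID.01H:ECX bit 30) is real, but `HV_FEATURE_SLOW_RDRAND`, its bit position, and the `rng_policy` names are hypothetical and not defined by KVM or any other hypervisor. The function takes register values as parameters instead of executing CPUID, so the decision logic can be read and tested on any machine.

```c
#include <stdint.h>

/* Architectural: CPUID.01H:ECX bit 30 advertises native RDRAND. */
#define CPUID_1_ECX_RDRAND (1u << 30)

/* HYPOTHETICAL: a hypervisor feature bit meaning "RDRAND is not advertised,
 * but executing it traps and is emulated slowly". No hypervisor defines this;
 * the name and bit position are made up for this sketch. */
#define HV_FEATURE_SLOW_RDRAND (1u << 0)

enum rng_policy {
    RNG_NONE,         /* no RDRAND path; fall back to other sources */
    RNG_EARLY_SEED,   /* emulated: use sparingly, e.g. once for a boot seed */
    RNG_UNRESTRICTED  /* native: cheap enough for general use */
};

/* Decide how the guest kernel should treat RDRAND, given the feature words. */
static enum rng_policy pick_rng_policy(uint32_t cpuid_1_ecx,
                                       uint32_t hv_features)
{
    /* Native RDRAND wins: user code may rely on it being cheap. */
    if (cpuid_1_ecx & CPUID_1_ECX_RDRAND)
        return RNG_UNRESTRICTED;

    /* Emulated RDRAND: usable by the kernel, but never advertised as cheap. */
    if (hv_features & HV_FEATURE_SLOW_RDRAND)
        return RNG_EARLY_SEED;

    return RNG_NONE;
}
```

With these assumptions, `pick_rng_policy(0, HV_FEATURE_SLOW_RDRAND)` yields `RNG_EARLY_SEED`. Because the ordinary RDRAND bit stays clear in that case, guest user code that checks CPUID would not start calling the emulated instruction in a hot path.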
Christopher Covington
2014-09-22 13:33:41 UTC
Permalink
Post by Andy Lutomirski
On Fri, Sep 19, 2014 at 11:30 AM, Christopher Covington
Post by Christopher Covington
Post by Andy Lutomirski
Hi all-
I would like to standardize on a very simple protocol by which a guest
OS can obtain an RNG seed early in boot.
- The interface should be very easy to use. Linux, at least, will
want to use it extremely early in boot as part of kernel ASLR. This
means that PCI and ACPI will not work.
How do non-virtual systems get entropy this early? RDRAND/Padlock? Truerand?
Could hypervisors and simulators simply make sure these work?
If RDRAND is available, then Linux, at least, will use it. The rest
are too complicated for early use. Linux on x86 plays some vaguely
clever games with rdtsc and poking at the i8254 port.
I just wanted to check that it couldn't be as simple as giving one or both of
the timers random initial values.

Christopher
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation.
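
The role of the timers in early seeding can be illustrated with a small sketch. It is not Linux's actual code: the function names, the rotation amount, and the idea of passing samples in an array are assumptions chosen for illustration. It folds timer readings (for example, a TSC value and an i8254 counter value) into a single word, so any randomness a hypervisor put into the timers' initial values would carry into the result.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a 64-bit word left by r bits (requires 0 < r < 64). */
static uint64_t rotl64(uint64_t x, unsigned r)
{
    return (x << r) | (x >> (64 - r));
}

/* Illustrative only: fold n timer samples into one seed word.
 * Each step rotates the accumulator, then XORs in the next sample,
 * so samples land on different bit positions. */
static uint64_t mix_timer_samples(const uint64_t *samples, size_t n)
{
    uint64_t seed = 0;

    for (size_t i = 0; i < n; i++) {
        seed = rotl64(seed, 7);
        seed ^= samples[i];
    }
    return seed;
}
```

Mixing of this kind only spreads bits around and adds no entropy of its own. The seed is only as unpredictable as the timer values themselves, which is why the question of randomizing their initial values matters.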