Sunday, May 24, 2015

The Trouble with Timestamps


It's obviously tempting to read the timestamp counter (TSC) in a very tight loop, feeding each read into a pseudorandom number generator (PRNG), in order to produce what passes for the output of an ideal true random number generator (TRNG). Such an approach suffers from many problems, which we will investigate in detail below.
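To make the temptation concrete, here is a minimal sketch of that naive approach, assuming a GCC or Clang compiler on x86/x64, where __rdtsc() is available via x86intrin.h; the xorshift mixer is a hypothetical stand-in for whatever PRNG one might choose, and has nothing to do with Enranda's internals:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>  /* __rdtsc() on GCC and Clang */

int main(void) {
  uint64_t state = 1;  /* arbitrary nonzero PRNG seed */
  for (int i = 0; i < 1000; i++) {
    state ^= __rdtsc();    /* fold the raw TSC read into the state */
    state ^= state << 13;  /* xorshift64 scramble */
    state ^= state >> 7;
    state ^= state << 17;
  }
  printf("0x%016llx\n", (unsigned long long)state);
  return 0;
}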

First of all, we need to introduce the concept of a "timedelta". A timedelta is the difference between two successive TSC reads. Usually, timedeltas are small because successive reads occur far closer together in time than the interval required to traverse the full TSC span (usually 64, but sometimes 32, bits). Nonetheless, some rare events can result in massive timedelta observations: (1) performing the pair of reads across a protracted environmental interrupt, such as a system suspend event; (2) core relocation due to OS task rescheduling, occasionally resulting in quasinegative timedeltas; or (3) bad OS or firmware code which improperly saves and restores the TSC, for example, across a CPU clock halt event required for power savings.
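In code, a timedelta is just an unsigned subtraction, which has the convenient property of absorbing TSC wraparound; a quasinegative timedelta from case (2) shows up as an implausibly huge unsigned value. A sketch, under the same __rdtsc() assumption as above:

#include <stdint.h>
#include <x86intrin.h>

/* Return the difference between two successive TSC reads. Unsigned
   arithmetic wraps correctly across a TSC rollover, while a
   quasinegative timedelta appears as a value just below (2^64). */
static uint64_t timedelta_read(void) {
  uint64_t t0 = __rdtsc();
  uint64_t t1 = __rdtsc();
  return t1 - t0;
}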

The timedelta stream nonetheless serves as Enranda's exclusive source of "protoentropy": the highly biased but slightly unpredictable "soup" which gets blended into bona fide entropy. At least this saves us the embarrassment of claiming that the initial TSC state represents any particular entropy content; Enranda assumes it to be completely predictable, which is part of its coldboot attack immune system.

Now, the timedelta stream obtained in the manner described above is overwhelmingly biased. After all, a CPU is a moderately predictable beast, executing tight loops in identical time, almost every time. At first glance, this bias appears to be an asset: we could simply ignore all the "usual" timedeltas, retaining the oddballs as a source of entropy. For example, we might have a timedelta stream that looks like:

{5, 5, 5, 15}

in which case, we would ignore the 5s and keep the 15. But what, exactly, is that 15 worth in terms of bits? It occurred once out of 4 samples, so logically, it must be worth 2 bits (because (-log2(1/4)=2)). But there are some problems with that naive assumption. For example, the timedelta stream might continue like this:

{5, 5, 5, 15, 15, 15, 15, 15, 15}

in which case, the most reasonable conclusion would be that CPU frequency throttling by a factor of 3 simply expanded the usual timedelta by the same factor. But surely this cannot be true, because the CPU should be able to execute each TSC read loop in the same number of clock cycles, regardless of the wall clock equivalent thereof. Yet, in fact, this counterargument turns out to be false on x86 and x64 CPUs (and probably others) due to a phenomenon called "TSC invariance".

TSC invariance is the scaling process by which the TSC tick rate is made constant relative to wall clock time, regardless of the current CPU clock frequency. This is easily accomplished by adding N to the TSC (instead of one) on each incrementation after the clock frequency has been divided by N. (Noninteger scaling factors actually induce periodic cycles of addends which differ by one, for example {2, 2, 2, 3}, making things look even noisier despite maintaining absolute predictability.) I personally consider TSC invariance to be a design flaw, because the TSC was never intended to measure wall clock time to begin with, but my contention is obviously moot.
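To see the addend cycle in action, here is a toy model of my own (not vendor microcode) of an invariant TSC ticking at a notional noninteger scaling factor of 9/4, which yields exactly the periodic addend cycle {2, 2, 2, 3} mentioned above:

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint64_t tsc = 0;
  uint32_t accum = 0;
  const uint32_t num = 9, den = 4;  /* scaling factor 9/4 = 2.25 */
  for (int clock = 0; clock < 8; clock++) {
    accum += num;
    uint32_t addend = accum / den;  /* whole ticks owed this clock */
    accum %= den;                   /* carry the fractional remainder */
    tsc += addend;                  /* addends run 2, 2, 2, 3, 2, 2, 2, 3 */
    printf("clock %d: addend %u, tsc %llu\n", clock, addend,
      (unsigned long long)tsc);
  }
  return 0;
}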

Nevertheless, we should be able to recognize scaled usual timedeltas in order to discard them as well. For instance, if we know that this particular CPU can throttle by a factor of 3, then we can just ignore the 15s in addition to the 5s. Unfortunately, even if we resort to longterm statistical analysis, there is still no concrete means by which to discern CPU throttling factors, not least because continuous scaling across hundreds of different frequencies is likely to become the norm in the future, as it maximizes power and thermal efficiency, not unlike the replacement of gearwise transmission in cars with continuously variable transmission. Moreover, different parts of the CPU operate in different clock domains. So, for instance, despite the throttling of the CPU frequency by a factor of 2, the memory bus might be kept at the same speed because it consumes so little power relative to the CPU. In that case, we might find that a loop increases from 5 to 7 clocks. The 7 would look unique and remarkable from a naive integer ratio throttling perspective, but it would in fact be mundanely predictable and in no way representative of the notional scaling factor of 2.

It gets worse. We have what I call "autopseudorandomness", which is the tendency of software TRNGs to observe unique timedeltas due not to truly unpredictable environmental interrupts, but rather to the pseudorandom effects of the read loop itself on CPU execution timing. For example, there might be memory accesses inside the loop which may or may not hit some particular level of the CPU cache, resulting in read stalls. The stalls might occur, or not, based on the pseudorandomness of cache contents. Instruction alignment might also play a role: because the CPU is constantly prefetching memory into the L1 (most local) cache, it might encounter delays due to speculatively prefetching memory addresses implied by instructions which exist beyond the end of the loop. Or perhaps the top of the loop is aligned on an odd byte, resulting in similar prefetching phenomena. Or worse, perhaps we are unknowingly executing inside a virtual machine, in which case the complicated but perhaps predictable behavior of the hypervisor (virtual machine manager) would fan the flames of autopseudorandomness.
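As an illustration of autopseudorandomness (a toy of my own, not a claim about any particular CPU), the loop below puts a data-dependent memory access on the timing path; whether each access hits the cache depends on the history of the walk itself, so the timedeltas vary in a complicated but entirely deterministic way:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

#define SIZE (1U << 22)  /* 4MiB buffer, larger than most L2 caches */

int main(void) {
  static uint8_t buf[SIZE];
  memset(buf, 1, sizeof(buf));
  uint32_t index = 0;
  uint64_t prev = __rdtsc();
  for (int i = 0; i < 16; i++) {
    /* The next index depends on the byte just read, so the cache
       hit/miss pattern is self-referential: pseudorandom, not random. */
    index = ((index * 2654435761U) + buf[index]) & (SIZE - 1);
    uint64_t now = __rdtsc();
    printf("timedelta %llu\n", (unsigned long long)(now - prev));
    prev = now;
  }
  return 0;
}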

Yet, when we read a sufficiently large number of timedeltas, even hypervisor autopseudorandomness can be filtered out to the extent that it is truly predictable; any residue thereof would indeed contain entropy. (But what about the first loop? Is that not cold and predictable, yet speciously entropic? Yes, it is, which is why Enranda ignores most of the first (2^16) TSC reads, as explained later.) Ultimately, like all PRNG outputs, autopseudorandom sequences must by definition be periodic. (This is not theoretical. I have seen period 4 myself, for example.) While we cannot filter all periodic signals, Enranda has a robust filter for all periodic signals of realistic length in uncompromised machines, virtualized or not. We will describe this filter later.
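The warmup discard itself is trivial; here is a minimal sketch of the idea (hypothetical, not Enranda's actual code, which we will come to later):

#include <stdint.h>
#include <x86intrin.h>

#define WARMUP_READS (1U << 16)

/* Burn through the first (2^16) TSC reads so that cold, predictable
   startup timing never reaches the protoentropy pool. Summing the
   reads forces the compiler to keep every one of them. */
static uint64_t warmup(void) {
  uint64_t sum = 0;
  for (uint32_t i = 0; i < WARMUP_READS; i++) {
    sum += __rdtsc();
  }
  return sum;
}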

For now, suppose for example that we have the following periodic timedelta stream:

{5, 5, 7, 8, 7, 7, 5, 5, 7, 8, 7, 7...}

So above we have period 6. Although 8 happens to occur only once per period, 5 and 7 occur more often, making it very difficult to discern the period without doing computationally expensive transforms (such as treewise string compression or a Fourier transform). Worse, what if the period is 20, just because some hypervisor resets something every 20 TSC reads? How would we detect such repetition efficiently? And moreover, how would we extract the residual entropy from quasiperiodic timedelta streams, which by definition are only mostly periodic but are peppered with rare unexpected values?
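Even the bruteforce approach, sketched below as a hypothetical baseline, costs O(N*P) comparisons over N timedeltas for candidate periods up to P, which is painful at the rates a TRNG must sustain:

#include <stddef.h>
#include <stdint.h>

/* Return the smallest period p <= pmax such that delta[i] == delta[i+p]
   for all applicable i, or 0 if no such period exists. O(n*pmax). */
static size_t find_period(const uint64_t *delta, size_t n, size_t pmax) {
  for (size_t p = 1; (p <= pmax) && (p < n); p++) {
    size_t i;
    for (i = 0; (i + p) < n; i++) {
      if (delta[i] != delta[i + p]) break;
    }
    if ((i + p) == n) return p;  /* every comparison matched */
  }
  return 0;
}

Fed the stream above, find_period() returns 6; but note that a quasiperiodic stream containing even a single oddball defeats it outright, which hints at why the real filter must be cleverer.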

Notwithstanding these questions, let us assume that we somehow manage to delete all the periodic and predictable timedeltas from the stream, leaving a residual stream which is aperiodic and presumably unpredictable. When we analyze this stream, we notice that timedelta 29 occurs only about once every 1000 filtered reads. We conclude that it must be worth about 10 bits because ((2^10)=1024). We need only hash the exact timestamp upon reading it, resulting in 10 white (uniformly unbiased) bits -- and voila! -- we have 10 bits of bona fide entropy.
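Concretely, the naive accounting is a one-line surprisal computation (a sketch of the reasoning just described, not of Enranda):

#include <math.h>
#include <stdio.h>

int main(void) {
  double probability = 1.0 / 1000.0;  /* timedelta 29: one per ~1000 reads */
  double bits = -log2(probability);   /* surprisal, in bits */
  printf("naive worth: %.1f bits\n", bits);  /* prints about 10.0 */
  return 0;
}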

Only, we don't! What we failed to realize is that the timedelta of 29 is due to some periodic hardware event, such as the audio device reading another line of data from memory once per millisecond. While it does not occur exactly every 1000 TSC reads, it occurs approximately every 1000, say every 980 to 1020. So in fact, the jitter in the timedelta 29 observation is only worth about 5 bits, not 10, because the jitter window spans only about 40 possible offsets, and (log2(40)=5.3) or so. The only way to see this clearly would be to observe millions if not billions of timedeltas, gathering statistics all the while.

So I tried this. But it doesn't work, either, because the world is unpredictable to us (yet potentially predictable to attackers) in ways which defeat all the best statistical engines -- never mind the appalling latency that we might induce. For example, it might be that, a millisecond after we gather these timedelta statistics, a hard drive suddenly comes online and starts causing predictable yet unforeseen timing disruption. We mistake such disruptions for novelty, resulting in the output of predictable coldboot entropy. Game over.

Now perhaps you can appreciate why Enranda has been under development, on and off, for the past 3 years!

After discarding so many flawed designs, I experienced a rather obtuse epiphany: we must regard the entire timedelta stream as pseudorandom, as though it were entirely predictable to the hypervisor which pulls the strings controlling the TSC puppet. This is an obnoxiously pessimistic assumption, because, after all, noise happens, but it is nonetheless the safest assumption. How, though, does it leave any room to produce a TRNG at all? Technically, it doesn't. But there is such a thing as a PRNG of such complexity as to defy analysis by anyone not in direct control of the machine. Enranda regards the timedelta stream as a biased version of exactly this sort of byzantine PRNG, and uses it as its exclusive source of protoentropy.
