Plausible periodic timedelta sequences must not be allowed to enable an attacker to predict any part of the output entropy. For safety's sake, we should endeavor to ignore them entirely. Let me say up front that, while Enranda cannot reject every periodic signal, it does so successfully up to period 255 (with caveats as explained below). Period 255 is utterly implausible in the most quiescent systems, let alone those interesting enough to warrant attack, unless they have been rooted in order to construct such a protracted period via malicious virtualization (a blue pill attack), in which case, the TRNG output is irrelevant anyway. So in practice, Enranda does indeed ignore all periodic signals. But without performing a computationally expensive operation such as treewise string compression, how can we accomplish this?
Enter the "sequence hash". Here's how it works. Suppose we have a period-5 timedelta stream:
{8, 9, 8, 3, 3, 8, 9, 8, 3, 3, 8, 9, 8, 3, 3, 8, 9, 8, 3, 3...}
We then identify, one at a time, each unique sequence in the order of discovery. In the periodic example above, this process would produce the following unique sequences separated by semicolons:
{8; 9; 8, 3; 3; 8, 9; 8, 3, 3; 8, 9, 8; 3, 3; 8, 9, 8, 3; 3...}
We don't actually search backwards, which would be computationally expensive. Instead Enranda implements the sequence hash ("sequence_hash" in the code) as a 16-bit "corkscrew hash" of successive 16-bit timedeltas until said hash is found to be "unique" (meaning that it has not occurred in the past (2^16) such hashes, as ensured by a ring buffer of sequence_hashes at sequence_hash_list_base, and their populations at sequence_hash_count_list_base). Just like its metal counterpart, a corkscrew hash rotates on each iteration in order to make it as sensitive as possible to the order of particular timedeltas, yet symmetric with respect to each iteration in that all bits of all timedeltas influence the corkscrew hash, regardless of order. The simplest corkscrew hashes meeting these criteria are rotate-and-add and rotate-and-xor; we use the former for the sake of maximum collision avoidance. (See enranda_entropy_accrue() for an example.) The reason for the rotation (which is coprime to 16 (bits)), is to achieve maximum dispersion despite most timedeltas being of similar small magnitude.
As soon as a unique sequence_hash is discovered, it's stored to sequence_hash_list_base and its population at sequence_hash_count_list_base is set to one, whereupon sequence_hash itself is reset to zero. But if a previously encountered hash is seen again (as evidenced by its population being nonzero), its population is incremented (such that wrapping back to zero is prohibited). The only way in which population can be decremented is by virtue of a sequence_hash expiring from sequence_hash_list_base, which contains (2^16) items.
Critically, the populations of all (2^16) possible (sequence_hash)es at sequence_hash_count_list_base are initially set to one by enranda_rewind(). This almost gaurantees that the first (2^16) timedelta observations will be ignored. A few such observations pass through this filter, but it doesn't matter because of the way that protoentropy is hashed into entropy, as explained later. And of course, there are hash collisions from time to time, meaning that we fail to detect a small fraction of unique sequences. This is a minor tax on efficiency, which is well worth the speed advantages of keeping the ring buffer and population list small enough to fit in most CPU caches.
So how many hashes, in the worst case, would be required to detect a sequence with period 255? In the worst case, if all the timedeltas were unique, then we would have 255 sequences of period one. We would also have 255 sequences of periods 2, 3... 255. If every such sequence had a unique hash, then we would need (255*(1+2+... 255)), or 32,640 (because sum(n)=(n(n+1)/2)), items in the ring buffer, which is slightly less than half of the (2^16) items actually present. The reason for the factor of (1/2) will become clear in the next lesson, but essentially, we need to ensure that at least half of the unique (sequence_hash)es were generated in a truly unpredictable manner. So in other words, we are making the conservative assumption that the first 32,640 hashes are all different but actually predictable due to a period-255 timedelta sequence. In reality, tiny periods seldom exceeding 4, which are much harder to forget (or to regard as unique in the first place) on account of requiring quadratically fewer unique (sequence_hash)es to circumscribe, will dominate.
So conservatively, a unique sequence_hash is worth at least 16 bits, on average, because by definition it occurs less than once in (2^16) such hashes. But we cannot directly use them as protoentropy, let alone entropy! The reason is that, while unique (sequence_hash)es are mostly unpredictable, we can at least say with confidence that popular (sequence_hash)es will not be among them. This constitutes a weak but potentially exploitable bias. What to do? See the next lesson.
No comments:
Post a Comment