“Pick a random number between 0 and 10.”
It’s a fairly basic task, often used as a project in introductory programming classes and extended language examples. The most common and straightforward solution is modulus. (Indeed, most CS classes use this exercise purely to introduce the concept of modulus.)
int number = rand() % range;
However, there is a problem with this method — it skews the distribution of numbers produced.
C’s rand() returns an integer in the range [0, RAND_MAX].
For illustration, let’s:
Assume RAND_MAX = 8
Assume rand is a perfect random distribution. (Equal chance of returning any integer in range.)
Let range = 5.
The possible values for number are then:
[0 % 5, 1 % 5, 2 % 5, 3 % 5, 4 % 5, 5 % 5, 6 % 5, 7 % 5, 8 % 5] => [0, 1, 2, 3, 4, 0, 1, 2, 3]
You’ll notice that the [0, 3] range appears *twice*, but 4 only once. Therefore, there is only a 1/9 chance that rand() % 5 will generate a 4, but all other values have a probability of 2/9. That said, there *are* combinations of RAND_MAX and range that will work: for example, RAND_MAX = 7, range = 4:
[0, 1, 2, 3, 4, 5, 6, 7] % 4 => [0, 1, 2, 3, 0, 1, 2, 3]
Of course, on a real system, RAND_MAX is never so low. To analyse the actual ranges, we’ll need to rely on more advanced statistics.
The Goodness of Fit (GOF) test is a method of testing whether the distribution of a given sample agrees with the theoretical distribution of the entire population. This test relies on the test statistic (read ‘chi-squared’). We derive our by:
That is to say, we take a sample of data — in our case, iterations of rand(). This data is (must be) categorical, separable into discrete ‘buckets’ of values. In this case, the buckets are the integer values mod range. We then compare the expected and observed counts in each bucket, and sum the result for all buckets. (Important caveat: for this to provide a valid result, the expected count *must* be at least 5 for all buckets.)
With in hand, we can calculate the p-value of our data. The p-value is the probability that the theoretical distribution is correct, based on our sample. To calculate the p-value, we use the function. The cdf, or Cumulative Distribution Function, is the area underneath the distribution function for a given distribution. The value we are interested in is the area to the right of our statistic, so our calculation looks like:
Because this is a probability distribution, the total area under the curve is 1. Subtracting the area up to our , we’re left with the right tail, which is our p-value. The higher our p-value, the closer we are to the theoretical distribution.
You’ll note one other variable in that equation: . This is the ‘degree of freedom’ in the distribution. The distribution function changes with sample size — the degree of freedom is how this is represented. For a Goodness of Fit test, the degree of freedom is equal to the number of buckets, minus one.
Now for the actual test.
Since we expect rand() to approximate a perfect random distribution (it doesn’t, on most systems, but that is a separate issue), our theoretical, expected count in each bucket is equal to the number of trials divided by the number of buckets (our range.) For the Goodness of Fit test, we need a minimum of five expected counts in each bucket. Therefore, to test a given range we need to generate at least trials.
I wrote a short program in C++ to calculate for all ranges of rand(). Note that this does not directly tell us our p-values. Instead, it generates a Scilab script to calculate and plot them. (Actually evaluating requires Calculus beyond what I’m familiar with.)
Below ~4000, many ranges produce a reasonable distribution. However, above that point, the vast majority of ranges lead to *highly* skewed distributions (p = 0).
So, what’s the takeaway? How bad *is* rand() % range?
Pretty bad! The more important question, however, is whether you care.
Cryptography has very strict standards on the distribution of its random numbers. Not only do you not want to use modulus to remap ranges in crypto, you don’t want to be using rand() in the first place! Other applications are far less demanding. The slight skew is likely irrevelant in a game, for example. (Doom rather famously uses a single static table for its random number generator, foregoing rand() altogether.)
Finally: This post is *almost certainly* littered with errors. Most glaringly, assuming rand() to have a proper random distribution is incorrect. This is intended more as an exploration of the concept than a rigorous proof. However, corrections are welcome!