Solution:
For K=1, the solution is posted in my previous blog post Picking an Element Randomly from Stream.
Case 1: N is known
In this case problem can be solved in O(N) time using O(K) space. The idea is that pick the first element with probability K/N. If the element is picked then pick the next element with probability (K-1)/(N-1) otherwise pick it with probability K/(N-1). To see that the second element is still picked with probability K/N. There are two cases.
- First element is picked. So second element is picked with probability = K/N * (K-1)/(N-1).
- First element is not picked. So second element is picked with probability = (1-K/N) * K/(N-1).
for i = 1 to N if rand <= K/(N+1-i) print A[i] K -= 1 if K == 0 break end end end if K != 0 print A[N-K:N] endCase 2: N is unknown.
If we do not know N beforehand, and yet we want to guarantee that all elements are selected with probability K/N, we can use Reservoir Sampling algorithm.
for i = 1:K S[i] = A[i] end for i = K+1:length(A) j = random(1,i) if j <= K S[j] = A[i] end endThe first K elements of the stream are automatically selected. So for N = K this solves the problem. Now consider N = K+1. For i <= N-1 all elements are selected with probability 1 so far. The last element should be selected with probability K/N and so should others be. The index j is less than or equal to K with probability K/(K+1), so last element is rejected with probability 1/(K+1). For the elements in the array at index j they can be rejected if last element is accepted and its index is j. The chances of that is K/(K+1) * 1/K = 1/(K+1). So all elements are rejected with 1/(K+1) probability. Now consider N = K+2. Until index i=K+1 all elements are rejected with probability 1/(K+1). Using same logic as we applied above, we can see that an element gets rejected with probability 2/(K+2).
The algorithm you give for the second part is false, it should be
ReplyDeletefor i = K+1:length(A)
so that case were length(A) <= K still work.
Thanks for pointing it out. Its corrected
ReplyDelete