January 12, 2009

Picking k Elements Randomly from Stream

Consider the problem of picking K elements randomly from a stream of N elements. Solve the problem for known and unknown (but finite N).

Solution:
For K=1, the solution is posted in my previous blog post Picking an Element Randomly from Stream.

Case 1: N is known
In this case problem can be solved in O(N) time using O(K) space. The idea is that pick the first element with probability K/N. If the element is picked then pick the next element with probability (K-1)/(N-1) otherwise pick it with probability K/(N-1). To see that the second element is still picked with probability K/N. There are two cases.
  1. First element is picked. So second element is picked with probability = K/N * (K-1)/(N-1).
  2. First element is not picked. So second element is picked with probability = (1-K/N) * K/(N-1).
So the selection probability of second element is = K/N*(K-1)/(N-1) + (1-K/N)*K/(N-1) = K/N. Similar logic extends further. The following code implements this approach.
for i = 1 to N
   if rand <= K/(N+1-i)
      print A[i]
      K -= 1
      if K == 0
         break
      end
   end
end
if K != 0
  print A[N-K:N]
end
Case 2: N is unknown.
If we do not know N beforehand, and yet we want to guarantee that all elements are selected with probability K/N, we can use Reservoir Sampling algorithm.
for i = 1:K
   S[i] = A[i]
end

for i = K+1:length(A)
   j = random(1,i)
   if j <= K
      S[j] = A[i]
   end 
end
The first K elements of the stream are automatically selected. So for N = K this solves the problem. Now consider N = K+1. For i <= N-1 all elements are selected with probability 1 so far. The last element should be selected with probability K/N and so should others be. The index j is less than or equal to K with probability K/(K+1), so last element is rejected with probability 1/(K+1). For the elements in the array at index j they can be rejected if last element is accepted and its index is j. The chances of that is K/(K+1) * 1/K = 1/(K+1). So all elements are rejected with 1/(K+1) probability. Now consider N = K+2. Until index i=K+1 all elements are rejected with probability 1/(K+1). Using same logic as we applied above, we can see that an element gets rejected with probability 2/(K+2).

2 comments:

  1. The algorithm you give for the second part is false, it should be

    for i = K+1:length(A)

    so that case were length(A) <= K still work.

    ReplyDelete
  2. Thanks for pointing it out. Its corrected

    ReplyDelete