Problem: Consider a file containing N numbers. For example, {2, 3, 4, 3, 1, 78, 78, 3} is a file containing N = 8 numbers. Assume that N is small enough for all the numbers to fit in the computer's memory M, i.e., N << M. We need to find the top K most frequent numbers in the file. For the above example, the output should be {3, 78} for K = 2.
Running Time: The time complexity should be O(N), O(N log K), or O(N + K log N).
Solution:
Finding the top K frequent elements is a classical database and data-streaming problem, and there are several solutions to it, ranging from deterministic to probabilistic. I'll illustrate the three obvious ones here.
The first step is to count how many times each number appears in the file. If the file is pre-sorted, then a single scan over the file suffices:
Function COUNTS:
    A = load_file_in_array
    last = A[1]
    ctr = 1
    for i = 2:N
        if last == A[i]
            ctr += 1
        else
            CountMap{last} = ctr
            last = A[i]
            ctr = 1
        end
    end
    // required for the last element of the array
    CountMap{last} = ctr
end

Note that CountMap is a hashmap that stores the counts of all the elements. The procedure COUNTS is quite efficient if the file is pre-sorted. If the file is not pre-sorted, then sorting it first increases the time complexity to O(N log N). In that case, we can do better by using the hashmap directly to maintain the current count of each number, without sorting the file, as follows:
Function EFFICIENT_COUNTS:
    A = load_file_in_array
    for i = 1:N
        CountMap{A[i]} += 1
    end
end

The above procedure obtains the count of each element in a single scan of the file, and hence runs in O(N) time. We now have all the numbers along with their frequencies (they can be extracted by enumerating the keys of CountMap, or by another scan of the file). For the example in the problem statement, we get the following tuples: {(2,1), (3,3), (4,1), (1,1), (78,2)}.
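For concreteness, here is a minimal Python sketch of the same one-pass counting; the function name efficient_counts and the use of collections.Counter are my choices for illustration, not part of the original pseudocode:

    from collections import Counter

    def efficient_counts(numbers):
        # Single scan over the data: O(N) time overall.
        count_map = Counter()
        for x in numbers:
            count_map[x] += 1
        return count_map

    # Example from the problem statement.
    counts = efficient_counts([2, 3, 4, 3, 1, 78, 78, 3])
    print(sorted(counts.items()))   # [(1, 1), (2, 1), (3, 3), (4, 1), (78, 2)]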
The problem that remains is to find the K most frequent numbers from this array of tuples. The naive approach is to sort the numbers by their frequencies and pick the top K; a quick sketch of this baseline is shown below. Sorting takes O(U log U) time, where U is the number of unique elements in the array (U = 5 in our example). If we consider U = O(N), then the time complexity of this sorting is O(N log N). The three approaches that follow do better.
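As a point of reference, a minimal Python sketch of this O(U log U) baseline (the helper name top_k_by_sorting is mine) could be:

    def top_k_by_sorting(counts, k):
        # Sort the (number, frequency) pairs by frequency, descending: O(U log U).
        pairs = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        return [num for num, freq in pairs[:k]]

    print(top_k_by_sorting({2: 1, 3: 3, 4: 1, 1: 1, 78: 2}, 2))   # [3, 78]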
Approach 1: O(N) time
Use a selection algorithm (e.g. quickselect) to find the Kth most frequent number, selecting on the second element (the frequency) of each tuple, in O(U) time. The Kth most frequent element partitions the array into two parts: the first part contains the top K most frequent elements and the second part contains the remaining U - K less frequent elements. So we get the top K most frequent elements, in no particular order, in O(N) time (assuming U = O(N)); they can be sorted in O(K log K) time if needed. Note that although this approach runs in O(N) time, the constants hidden in the O-notation can be large, so in practice it can be slower than the two approaches described below.
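Here is a rough Python sketch of this idea, assuming a randomized quickselect with a Lomuto-style partition on the frequency field; the helper partition_top_k is illustrative only, and a library routine such as numpy.partition could serve the same purpose:

    import random

    def partition_top_k(pairs, k):
        # Rearrange (number, frequency) pairs in place so that the K most
        # frequent ones occupy positions 0..K-1 (in no particular order).
        # Expected O(U) time via randomized quickselect.
        lo, hi = 0, len(pairs) - 1
        while lo < hi:
            # Choose a random pivot and partition by descending frequency.
            p = random.randint(lo, hi)
            pairs[p], pairs[hi] = pairs[hi], pairs[p]
            pivot_freq = pairs[hi][1]
            store = lo
            for i in range(lo, hi):
                if pairs[i][1] > pivot_freq:
                    pairs[i], pairs[store] = pairs[store], pairs[i]
                    store += 1
            pairs[store], pairs[hi] = pairs[hi], pairs[store]
            if store == k - 1:       # pivot landed exactly at the Kth position
                break
            elif store < k - 1:      # Kth position lies to the right of the pivot
                lo = store + 1
            else:                    # Kth position lies to the left of the pivot
                hi = store - 1
        return [num for num, freq in pairs[:k]]

    pairs = [(2, 1), (3, 3), (4, 1), (1, 1), (78, 2)]
    print(sorted(partition_top_k(pairs, 2)))   # [3, 78]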
Approach 2: O(N log K) time
Pick the first K tuples and put them in a MIN-HEAP, where a tuple (x, y) is less than a tuple (a, b) if y is less than b, i.e., the heap is ordered by frequency. Building a min-heap of size K takes O(K) time.
Then process the remaining U - K tuples one by one. If the picked element's frequency is less than the minimum on the heap, discard it; otherwise, remove the minimum element from the heap and insert the picked element. This keeps the heap at exactly K elements. Each such delete-and-insert operation costs O(log K).
Once all the elements have been processed, the elements that remain in the min-heap are the top K frequent items, which can be popped in O(K log K) time. The overall cost of this approach is O(K + (U-K) log K + K log K) = O(K + U log K). Since K < U and U = O(N), we get a time complexity of O(N log K).
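A minimal Python sketch of this approach, using the standard heapq module (the function name top_k_min_heap is my own):

    import heapq

    def top_k_min_heap(counts, k):
        # Keep a size-K min-heap of (frequency, number) pairs: O(U log K) overall.
        items = list(counts.items())                      # (number, frequency) tuples
        heap = [(freq, num) for num, freq in items[:k]]   # first K tuples
        heapq.heapify(heap)                               # O(K)
        for num, freq in items[k:]:
            if freq > heap[0][0]:                         # larger than the current minimum?
                heapq.heapreplace(heap, (freq, num))      # delete-min + insert, O(log K)
        return [num for freq, num in heap]                # top K, in no particular order

    print(sorted(top_k_min_heap({2: 1, 3: 3, 4: 1, 1: 1, 78: 2}, 2)))   # [3, 78]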
Approach 3: O(N + K log N) time
This approach is similar to approach 2, but the main difference is that we build a MAX-HEAP of all U tuples. So the first step is to build the max-heap of all the elements in O(U) time. Then we remove the maximum element from the heap K times, in O(K log U) time; the K removed elements are the desired most frequent elements. The time complexity of this method is O(U + K log U), and by setting U = O(N) we get O(N + K log N).
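A minimal Python sketch of this approach: since heapq only provides a min-heap, frequencies are negated to simulate a max-heap (the function name top_k_max_heap is mine):

    import heapq

    def top_k_max_heap(counts, k):
        # Build a max-heap over all U elements, then pop K times: O(U + K log U).
        heap = [(-freq, num) for num, freq in counts.items()]
        heapq.heapify(heap)                                  # O(U)
        return [heapq.heappop(heap)[1] for _ in range(k)]    # K pops, O(K log U)

    print(sorted(top_k_max_heap({2: 1, 3: 3, 4: 1, 1: 1, 78: 2}, 2)))   # [3, 78]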
Let us pause for a moment and contrast approach 2 with approach 3. For simplicity, let T2 = K + N log K be the time complexity of approach 2 and T3 = N + K log N be the time complexity of the third approach. The figure below plots the ratio T2/T3 for N = 100 and for different values of K. We observe that approach 3 is considerably better for small values of K, whereas approach 2 is better for large values of K. Although the actual difference depends on the constants involved, we can still see the merit of one approach over the other.