July 9, 2012

Finding Top K Elements in Large Dataset

Problem: Consider the problem of finding top K frequent numbers from a file with N numbers. Now consider that N is very large such that it cannot fit in the memory M available to the program. How do we find the top K frequent elements now with the assumption that K < M < N.

Deterministic solution: show

Single pass probabilistic solution: show


  1. for the first method if assume we have 10 chunks, with 1st chunk we form k min heap and use the same heap for all chunks completed, each of which updates heap based on local count of numbers...

    but it may happen such that a number present across chunks but may not have resulted to final heap obtained with this sol ?

  2. @Anon, All numbers are guaranteed to be in only one of the chunks due to hashing based partitioning.

  3. hm..makes sense thanks for the response.

    i didn't get Alternate method!! can you give any reference to that or give small example for better understanding...

  4. Thanks Srinivaas for pointing that I missed a "r". Updated.

  5. I didn't understand why you remove the min element from the min heap if it is greater than f in the probabilistic solution. Shouldn't the min element be removed if it is less than f?
    I checked a paper about the subject (https://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf) and in the 5th page, it is as I think. It says approximately this: "Add x to the heap if the estimate is greater than the smallest estimated count in the heap". It is on page 5.

  6. Thanks sevgiinsani for the due dilligence. Most likely the bug occured due to HTML goof-up of < and >. Corrected it now.