Bulls and Cows is a two-player game. The first player thinks of an N-digit secret number and the second player tries to guess it. The second player makes a guess and the first player responds with the number of bulls and cows in the guess. If a digit of the guess matches a digit of the secret and is in the right position, it is a bull; if it matches but is in a different position, it is a cow. For example, let the secret be 0123 and the guess be 0012. The guess has 1 bull and 2 cows. The game continues until the second player gets a response of N bulls.
The problem is to write a computer solver for Bulls and Cows that tries to guess the secret in only a small number of attempts.
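To make the scoring rule concrete, here is a minimal Python sketch of the bulls-and-cows count described above. This is only the response computation, not the solver; the function name and the use of digit strings are my own illustrative choices.

    from collections import Counter

    def score(secret, guess):
        # A digit matching in the same position is a bull.
        bulls = sum(s == g for s, g in zip(secret, guess))
        # Digits common to both numbers (counted with multiplicity)
        # that are not bulls are cows.
        common = sum((Counter(secret) & Counter(guess)).values())
        return bulls, common - bulls

    print(score("0123", "0012"))   # (1, 2) -- 1 bull, 2 cows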
August 12, 2012
July 21, 2012
Number of Paths in a Rectangular Grid
Problem 1: Consider a rectangular grid of n x m rectangular cells. The problem is to find the number of shortest (or, in this case, monotonic) paths along the edges of the cells that start at (0,0) and end at (n,m). How do we print all such paths?
A monotonic path consists of moving up or right but not down or left. Figure 1 illustrates such a path for a grid of size 5 x 3.
Solution:
Figure 1: a monotonic path on a 5 x 3 grid.
To go from (0,0) to (n,m), we need to make n moves to the right and m moves up. These moves can be performed in any order and we still reach (n,m). So the total number of paths is simply the number of ways n red balls and m green balls can be arranged in a straight line, which is C(n+m, m).
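As a quick check, math.comb (available in Python 3.8+) gives this count directly for the 5 x 3 grid of Figure 1:

    import math

    n, m = 5, 3
    print(math.comb(n + m, m))   # 56 monotonic paths on a 5 x 3 grid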
We can print all such paths using recursion. The following code does this.
Function printPath(n, m, path)
    if n == 0 and m == 0
        print path
        return
    end
    if n > 0
        printPath(n-1, m, path + "right")
    end
    if m > 0
        printPath(n, m-1, path + "up")
    end
end
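A runnable Python version of the same recursion is sketched below; it generates the paths as tuples of moves and checks the count against C(n+m, m). The generator form is an illustrative choice of mine, not the blog's grid_paths.py.

    import math

    def monotonic_paths(n, m, path=()):
        # Yield every sequence of "right"/"up" moves from (0,0) to (n,m).
        if n == 0 and m == 0:
            yield path
            return
        if n > 0:
            yield from monotonic_paths(n - 1, m, path + ("right",))
        if m > 0:
            yield from monotonic_paths(n, m - 1, path + ("up",))

    paths = list(monotonic_paths(2, 2))
    print(len(paths), math.comb(4, 2))   # 6 6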
Problem 2: Now we need to print all monotonic paths that do not go above the diagonal y = x. Figure 2 shows such a path. Note that the path in Figure 1 goes above the diagonal and hence is not desirable in this case. How many such paths exist for an n x m grid (n >= m)?
Solution:

Figure 2: a monotonic path that does not go above the diagonal y = x.
Let us first discuss how to print all monotonic paths that do not go above the diagonal. For a path to be valid, every point (x,y) on it must satisfy y <= x. We can simply enforce this constraint while printing paths; the code appears after the counting argument below.
Now, in order to obtain the number of such paths, we subtract the number of paths that cross the diagonal from the total number of paths, C(n+m, m).
Consider a path that crosses the diagonal, and suppose it crosses the diagonal for the first time at the point (x,y). Since this is the first point after crossing the diagonal, y must equal x+1. Up to this point the path has taken x right and y up moves, and it still has to take n-x right and m-y up moves to reach (n,m). If we flip the remaining moves, turning the n-x right moves into up moves and the m-y up moves into right moves, then starting from (x,y) we end up at the point (x+m-y, y+n-x). Setting y = x+1, the resulting end point is (m-1, n+1). Since m <= n, this end point lies above the diagonal. The surgery is reversible, so the paths that cross the diagonal are in one-to-one correspondence with the monotonic paths from (0,0) to (m-1, n+1), of which there are C(n+m, m-1).
So the desired number is C(n+m, m) - C(n+m, m-1) = [(n-m+1)/(n+1)] * C(n+m, m). This is a proof based on André's reflection method.
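The closed form can be sanity-checked against a brute-force count of the constrained paths for small grids; a rough sketch:

    import math

    def count_at_most_diagonal(n, m, x=0, y=0):
        # Count monotonic paths from (x,y) to (n,m) that never go above y = x.
        if (x, y) == (n, m):
            return 1
        total = 0
        if x < n:
            total += count_at_most_diagonal(n, m, x + 1, y)
        if y < m and y < x:
            total += count_at_most_diagonal(n, m, x, y + 1)
        return total

    for n in range(1, 7):
        for m in range(1, n + 1):
            closed = math.comb(n + m, m) - math.comb(n + m, m - 1)
            assert count_at_most_diagonal(n, m) == closed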
Alternately:
André's reflection method was originally used to solve a very interesting problem called the ballot problem. The ballot problem states that in a two-candidate election, candidate A polls p votes and candidate B polls q votes (p > q). What is the probability that candidate A stays strictly ahead of candidate B throughout the counting?
Well the probability is simply (p-q)/(p+q).
We can use the ballot theorem to count the monotonic paths that stay strictly below the diagonal (paths that do not even touch it). Treat a right move as a vote for candidate A and an up move as a vote for candidate B, with the two candidates polling n and m votes respectively. The probability that the number of right moves stays strictly ahead of the number of up moves throughout is (n-m)/(n+m). So the number of monotonic paths that stay strictly below the diagonal is [(n-m)/(n+m)] * C(n+m, m).
Now, since in our problem the paths are allowed to touch the diagonal, consider instead paths ending at (n+1,m). A path to (n+1,m) that stays strictly below the diagonal must begin with a right move; deleting that first right move gives a path from (0,0) to (n,m) that may touch but never goes above the diagonal, and vice versa. So the total number of ways is [(n+1-m)/(n+1+m)] * C(n+m+1, m) = [(n-m+1)/(n+1)] * C(n+m, m).
Function printPath(x, y, n, m, path)
    if x == n and y == m
        print path
        return
    end
    if x < n
        printPath(x+1, y, n, m, path + "right")
    end
    if y < m and y < x    // an up move is allowed only if it keeps the path at or below the diagonal
        printPath(x, y+1, n, m, path + "up")
    end
end

The total number of such paths is [(n-m+1)/(n+1)] * C(n+m, m). For n = m, this simplifies to [1/(n+1)] * C(2n, n), which is the Catalan number. The Wikipedia page on Catalan numbers also lists other interesting problems that have the Catalan numbers as their solution.
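For n = m, a one-liner confirms that the formula produces the familiar Catalan sequence:

    import math

    print([math.comb(2 * n, n) // (n + 1) for n in range(6)])
    # [1, 1, 2, 5, 14, 42]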
Problem 3: Now let us consider that, apart from up and right moves, we can also take up-right (diagonal) moves. Figure 3 shows such a path. How many such paths exist for an n x m grid?
Solution:

Figure 3: a monotonic path that includes diagonal (up-right) moves.
This problem is similar to problem 1, except that paths can contain diagonal moves. One diagonal move consumes 1 right and 1 up move.
So let us assume that we make r diagonal moves, where r <= min(n,m). In this case, apart from the r diagonal moves, we need to make n-r right moves and m-r up moves to reach (n,m). The number of ways we can do this is (n+m-r)! / [(n-r)! * (m-r)! * r!]. If we set r = 0, we recover the solution to problem 1. But here r can go from 0 to min(m,n), so the total number of ways is:
Sum over r in [0, min(m,n)] of (n+m-r)! / [(n-r)! * (m-r)! * r!]

Alternately, we can derive a recurrence relation for the total number of paths. Let D(n,m) be the total number of paths from (0,0) to (n,m) with up, right and diagonal moves. We can reach (n,m) from (n-1,m) by a right move, from (n,m-1) by an up move, and from (n-1,m-1) by a diagonal move. These are the only three atomic steps, so D(n,m) = D(n-1,m) + D(n,m-1) + D(n-1,m-1) for n,m > 0, with D(n,0) = D(0,m) = 1. D(n,m) is known as a Delannoy number.
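The summation and the recurrence can be cross-checked in a few lines of Python (the function names are mine):

    import math
    from functools import lru_cache

    def delannoy_sum(n, m):
        # Sum over the number r of diagonal moves.
        return sum(math.factorial(n + m - r)
                   // (math.factorial(n - r) * math.factorial(m - r) * math.factorial(r))
                   for r in range(min(n, m) + 1))

    @lru_cache(maxsize=None)
    def delannoy_rec(n, m):
        # D(n,m) = D(n-1,m) + D(n,m-1) + D(n-1,m-1), with D(n,0) = D(0,m) = 1.
        if n == 0 or m == 0:
            return 1
        return delannoy_rec(n - 1, m) + delannoy_rec(n, m - 1) + delannoy_rec(n - 1, m - 1)

    assert all(delannoy_sum(n, m) == delannoy_rec(n, m) for n in range(7) for m in range(7))
    print(delannoy_rec(3, 3))   # 63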
The code to print all paths has a small addition to the code for problem 1.
Function printPath(n, m, path)
    if n == 0 and m == 0
        print path
        return
    end
    if n > 0
        printPath(n-1, m, path + "right")
    end
    if m > 0
        printPath(n, m-1, path + "up")
    end
    if n > 0 and m > 0
        printPath(n-1, m-1, path + "diagonal")
    end
end

We can extend this problem further by asking for the number of such paths that do not go above the diagonal (the same constraint as in problem 2). The number of paths in that case is called the Schröder number.
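A brute-force count over small square grids, keeping the paths on or below the diagonal as in problem 2, reproduces the first few (large) Schröder numbers; a rough sketch:

    def count_schroder(n, x=0, y=0):
        # Paths from (0,0) to (n,n) with right, up and diagonal moves, never above y = x.
        if (x, y) == (n, n):
            return 1
        total = 0
        if x < n:
            total += count_schroder(n, x + 1, y)
        if y < n and y < x:
            total += count_schroder(n, x, y + 1)
        if x < n and y < n:   # a diagonal move keeps y <= x
            total += count_schroder(n, x + 1, y + 1)
        return total

    print([count_schroder(n) for n in range(5)])   # [1, 2, 6, 22, 90]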
Source Code: grid_paths.py
July 9, 2012
Finding Top K Elements in Large Dataset
Problem: Consider the problem of finding the top K frequent numbers in a file of N numbers. Now suppose that N is so large that the numbers cannot fit in the memory M available to the program. How do we find the top K frequent elements now, under the assumption that K < M < N?
Deterministic solution:
The idea is similar to solving the problem for small N (N < M) (see here). For large N, divide the problem into chunks of size <= M, then solve it.
In order to divide the problem, consider a uniform hashing function H that takes a key and returns an integer from the set {1,2,....,ceil(N/M)}. So we get N/M chunks with roughly M unique keys per chunk. Since the number of unique numbers (U) is less than N, we expect each chunk to have less than M unique numbers.
Now for each chunk, we compute the frequency of the numbers contained in that chunk. This frequency computation can be done in memory, and we can maintain a MIN-HEAP of size K which is updated directly (following the steps presented here). As a result, only two reads of the dataset and one disk write are required to get the top K frequent items. The complexity of the algorithm is O(N log K).
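A rough Python sketch of this two-pass approach, assuming the numbers arrive as an iterable and using temporary files as the chunks (the function and parameter names are my own, not from the post):

    import heapq
    import os
    import tempfile
    from collections import Counter

    def top_k_external(stream, k, num_chunks):
        # Pass 1: route every number to a chunk file based on its hash,
        # so all occurrences of the same number land in the same chunk.
        tmpdir = tempfile.mkdtemp()
        chunks = [open(os.path.join(tmpdir, f"chunk{i}"), "w") for i in range(num_chunks)]
        for x in stream:
            chunks[hash(x) % num_chunks].write(f"{x}\n")
        for f in chunks:
            f.close()
        # Pass 2: count each chunk in memory and keep a global min-heap of size k.
        heap = []   # (frequency, number)
        for i in range(num_chunks):
            with open(os.path.join(tmpdir, f"chunk{i}")) as f:
                counts = Counter(line.strip() for line in f)
            for x, c in counts.items():
                if len(heap) < k:
                    heapq.heappush(heap, (c, x))
                elif c > heap[0][0]:
                    heapq.heapreplace(heap, (c, x))
        return [x for c, x in sorted(heap, reverse=True)]

    print(top_k_external(iter(["2", "3", "4", "3", "1", "78", "78", "3"]), 2, 3))
    # ['3', '78']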
Probably for a small K (K < M^2/N), we can compute the top K per chunk in O(K log M) and combine the top K for all chunks N*K/M (<M) to get the overall top K. The total complexity of the algorithm is O(N + M log M).
Alternate method:
We assume that the disk has a huge capacity, which allows us to build a disk-based hash table. Imagine that we create a very large file with holes and divide it into B blocks, each with the capacity to hold more than N/B numbers and their integer counts. Using a uniform hash function H which takes an arbitrary number x and returns H(x), the block number from the set {0, 1, ..., B-1}, we can store the numbers and their frequencies in this disk map. H gives us the offset to seek to within the file, and we write each number (and its frequency) sequentially within the correct block (or place it within the block via another hash function H1).
Now we proceed as usual: we start counting the numbers in memory using a hash table (as in the former approach). When we encounter a new number that cannot be put in memory, we purge some entries from the hash table. The purged entries are written to the disk map. To decide what to purge, we maintain an array which counts the frequency of the keys that lie within each block. Keys that belong to the most infrequent blocks can be purged first (or blocks that are least recently used, or that lead to the least disk access, etc.).
The following code gives a very basic outline of this approach:
BlockFreq = array(B)         // frequency of keys per block
NumberFreq = hashtable(M)    // in-memory counts
diskwrite = 0
for i = 1:N
    x = A[i]
    BlockFreq[H[x]] += 1
    if NumberFreq.haskey(x)
        NumberFreq[x] += 1
        continue
    end
    if NumberFreq.hasspace()
        NumberFreq[x] = 1
        continue
    end
    // no space in memory: update the disk map directly
    if DiskMap.haskey(x)
        DiskMap[x] += 1
    else
        DiskMap[x] = 1
    end
    if diskwrite == 10
        purge(NumberFreq, BlockFreq)
        diskwrite = 0
    else
        diskwrite += 1
    end
end

Here purge is a procedure that evicts some set of keys from NumberFreq based on BlockFreq. Note that this code omits several key details of the process, so the idea presented here is quite crude.
Single pass probabilistic solution:
Solution 1 is quite efficient as it requires only two disk reads of the dataset, but the bottleneck can be the disk writes during the initial chunk formation. We can reduce that bottleneck by considering a data-structure called Bloom filters.
So consider that we have B uniform hash functions H1, H2, ..., HB, where each hash function maps a key into the range {1, 2, ..., R}. Now imagine an array C of size B x R (< M) that counts how many times each key is seen. For each number x that we read from the dataset, we compute Hi(x) for every i and increment C[i, Hi(x)] by 1. So we maintain B counts of x, each in a different row of R buckets. The true count of x is at most min(C[1, H1(x)], ..., C[B, HB(x)]).
Now if the query is to get all the elements with frequency greater than some threshold, then we can use the bloom filters to get all such numbers (with some false positives, though, which can be filtered out using another pass over the dataset). This can save a complete write of the data to the disk (see the paper: Computing Iceberg Queries Efficiently).
But in our case, we are interested in finding the top K frequent numbers. The following modification can be used to estimate the frequency of each number:
MH = MIN-HEAP(K)
for i = 1:N
    x = A[i]
    for b = 1:B
        C[b, Hb(x)] += Sb(x)
    end
    if contains(MH, x)
        increment count of x in the heap
    else
        f = median(C[b, Hb(x)] * Sb(x), over all b)   // estimated frequency of x
        if f > min(MH)
            remove-min(MH)
            insert(MH, (x, f))
        end
    end
end

Each Sb is a {-1,+1} hash function, and this data structure is called CountSketch. More details of the method are available in the paper: Finding Frequent Items in Data Streams.
Note that the above algorithm takes a single pass over the dataset (and requires no disk writes), but it is not guaranteed to give the top K frequent items; it can make mistakes for some of the less frequent items. In practice, the choice of the hash functions can be critical for the performance of the algorithm.
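As a concrete illustration of the B x R counting array and the min-estimate described above, here is a minimal count-min style sketch in Python. It is a simplified assumption of mine (salted built-in hashes instead of proper pairwise-independent hash functions), and unlike the CountSketch pseudocode above it does not use the {-1,+1} signs or the median:

    import random

    class CountMinSketch:
        def __init__(self, width, depth, seed=0):
            rnd = random.Random(seed)
            # One salted hash per row; a real implementation would use
            # proper pairwise-independent hash functions.
            self.salts = [rnd.getrandbits(64) for _ in range(depth)]
            self.width = width
            self.table = [[0] * width for _ in range(depth)]

        def _bucket(self, row, x):
            return hash((self.salts[row], x)) % self.width

        def add(self, x):
            for r in range(len(self.table)):
                self.table[r][self._bucket(r, x)] += 1

        def estimate(self, x):
            # The true count of x is at most this value.
            return min(row[self._bucket(r, x)] for r, row in enumerate(self.table))

    cms = CountMinSketch(width=64, depth=4)
    for x in [2, 3, 4, 3, 1, 78, 78, 3]:
        cms.add(x)
    print(cms.estimate(3))   # 3, unless there are collisions in every row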
July 3, 2012
Finding Top K Frequent Items
Problem: Consider a file containing N numbers. For example, {2,3,4,3,1,78,78,3} is a file containing N = 8 numbers. Assume that N is small enough to fit in the computer's memory M, so in this case N << M. Now we need to find the top K frequent numbers in the file. For the above example, the output should be {3,78} for K = 2.
Running Time: The time complexity should be one of O(N), O(N log K), or O(N + K log N).
Solution:
Finding top K frequent elements is a classical database and data streaming problem and there are several solutions to it ranging from deterministic to probabilistic. I'll illustrate the three obvious ones here.
The first step is to count how many times each number appears in the file. If the file is pre-sorted then we need a single scan over the file.
Function COUNTS:
    A = load_file_in_array
    last = A[1]
    ctr = 1
    for i = 2:N
        if last == A[i]
            ctr += 1
        else
            CountMap{last} = ctr
            last = A[i]
            ctr = 1
        end
    end
    // required for the last element of the array
    CountMap{last} = ctr
end

Note that CountMap is a hashmap that stores the counts of all the elements. The procedure COUNTS is quite efficient if the file is pre-sorted. If the file is not pre-sorted, sorting it first increases the time complexity to O(N log N). In that case, we can do better by directly using the hashmap to maintain the current count of each number, without sorting the file, as follows:
Function EFFICIENT_COUNTS:
    A = load_file_in_array
    for i = 1:N
        CountMap{A[i]} += 1
    end
end

The above procedure obtains the count of each element in a single scan of the file, hence it runs in O(N) time. So now we have all the numbers along with their frequencies in an array (which can be extracted by enumerating all keys of the CountMap or by another scan of the file). For the example in the problem statement, we get the following tuples: {(2,1), (3,3), (4,1), (1,1), (78,2)}.
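For reference, the same single-scan counting can be done in Python with collections.Counter; a minimal illustration using the example from the problem statement:

    from collections import Counter

    A = [2, 3, 4, 3, 1, 78, 78, 3]
    CountMap = Counter(A)
    print(list(CountMap.items()))   # [(2, 1), (3, 3), (4, 1), (1, 1), (78, 2)]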
The problem that remains is to find the K most frequent numbers from the array. The naive approach would be to sort the numbers by their frequencies and pick the top K. This would take O(U log U) time, where U (= 5 in our example) is the number of unique elements in the array. If we consider U = O(N), then the time complexity of this sorting is O(N log N). We can do better than that, as follows:
Approach 1: O(N) time
Use the selection algorithm to find the Kth most frequent number (selecting on the second element of each tuple) in O(U) time. The Kth most frequent element partitions the array into two parts: the first part containing the top K most frequent elements and the second part containing the remaining U-K less frequent elements. So we get the top K most frequent elements, in no particular order, in O(N) time (assuming U = O(N)). They can be sorted in O(K log K) time if needed. Note that although this approach runs in O(N) time, the constants hidden in the O-notation can be large, so in practice this approach can be slower than the two approaches described below.
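A rough Python sketch of this selection-based approach, using a randomized quickselect on the frequency values (expected O(U) time; this simple version copies lists for clarity rather than partitioning in place, and the function names are mine):

    import random

    def kth_largest(vals, k):
        # Expected linear-time selection of the k-th largest value (1-indexed).
        pivot = random.choice(vals)
        greater = [v for v in vals if v > pivot]
        equal = [v for v in vals if v == pivot]
        if k <= len(greater):
            return kth_largest(greater, k)
        if k <= len(greater) + len(equal):
            return pivot
        return kth_largest([v for v in vals if v < pivot], k - len(greater) - len(equal))

    def top_k_by_selection(count_map, k):
        threshold = kth_largest(list(count_map.values()), k)
        top = [x for x, c in count_map.items() if c > threshold]
        ties = [x for x, c in count_map.items() if c == threshold]
        return top + ties[: k - len(top)]   # unordered top K

    print(top_k_by_selection({2: 1, 3: 3, 4: 1, 1: 1, 78: 2}, 2))   # [3, 78]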
Approach 2: O(N log K) time
Pick the first K tuples and put them in a MIN-HEAP, where a tuple (x,y) is less than a tuple (a,b) if y is less than b. The time complexity of building this min-heap of size K is O(K).
Then pick the remaining U - K elements one by one. If the picked element is less than the minimum on the heap, discard it. Otherwise, remove the minimum element from the heap and insert the picked element. This keeps exactly K elements on the heap. This delete-insert operation costs O(log K) per element.
Once we are done picking all the elements, the elements that finally remain in the min-heap are the top K frequent items which can be popped in O(K log K) time. The overall cost of this approach is O(K + (U-K) log K + K log K) = O(K + U log K). Since K < U and U = O(N), we get the time complexity of O(N log K).
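A compact Python sketch of this approach using the heapq module (the function name is mine):

    import heapq

    def top_k_min_heap(count_map, k):
        heap = []   # min-heap of (frequency, number), never more than k entries
        for x, c in count_map.items():
            if len(heap) < k:
                heapq.heappush(heap, (c, x))
            elif c > heap[0][0]:
                heapq.heapreplace(heap, (c, x))   # O(log k) delete-insert
        return [x for c, x in sorted(heap, reverse=True)]

    print(top_k_min_heap({2: 1, 3: 3, 4: 1, 1: 1, 78: 2}, 2))   # [3, 78]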
Approach 3: O(N + K log N) time
This approach is similar to approach 2 but the main difference is that we make a MAX-HEAP of all the U elements. So the first step is to make the max heap of all the elements in O(U). Then remove the maximum element from the heap K times in O(K log U) time. The K removed elements are the desired most frequent elements. The time complexity of this method is O(U + K log U) and by setting U = O(N) we get O(N + K log N).
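A matching sketch of approach 3, simulating a max-heap by negating the frequencies (Python's heapq is a min-heap):

    import heapq

    def top_k_max_heap(count_map, k):
        heap = [(-c, x) for x, c in count_map.items()]
        heapq.heapify(heap)                                  # O(U)
        return [heapq.heappop(heap)[1] for _ in range(k)]    # K pops, O(K log U)

    print(top_k_max_heap({2: 1, 3: 3, 4: 1, 1: 1, 78: 2}, 2))   # [3, 78]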
Let us stop for a moment and contrast approach 2 with approach 3. For simplicity, let T2 = K + N log K be the time complexity of approach 2 and T3 = N + K log N be the time complexity of approach 3. The figure below plots the ratio T2/T3 for N = 100 and different values of K. We observe that approach 3 is considerably better for small values of K, whereas approach 2 is better for large values of K. Though the actual difference depends on the constants involved, we can still see the merit of one approach over the other.