Bulls and Cows is a two-player game. The first player thinks of an N-digit secret number and the second player tries to guess it. The second player makes a guess and the first player responds with the number of bulls and cows in the guess. If a digit of the guess matches a digit of the secret and is in the right position, it is a bull; if it matches but is in a different position, it is a cow. For example, let the secret be 0123 and the guess be 0012. The guess has 1 bull and 2 cows. The game continues until the second player gets a response of N bulls.
The problem is to write a computer solver for Bulls and Cows that tries to guess the secret in only a small number of attempts.
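To make the scoring rule concrete, here is a minimal Python sketch of the bulls-and-cows count described above. This is only the response computation, not the solver; the function name and the use of digit strings are my own illustrative choices.

    from collections import Counter

    def score(secret, guess):
        # A digit matching in the same position is a bull.
        bulls = sum(s == g for s, g in zip(secret, guess))
        # Digits common to both numbers (counted with multiplicity)
        # that are not bulls are cows.
        common = sum((Counter(secret) & Counter(guess)).values())
        return bulls, common - bulls

    print(score("0123", "0012"))   # (1, 2) -- 1 bull, 2 cows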
August 12, 2012
July 21, 2012
Number of Paths in a Rectangular Grid
Problem 1: Consider a rectangular grid of n x m rectangular cells. The problem is to find the number of shortest (or, in this case, monotonic) paths along the edges of the cells that start at (0,0) and end at (n,m). How do we print all such paths?
A monotonic path consists of moving up or right but not down or left. Figure 1 illustrates such a path for a grid of size 5 x 3.
Solution:
Figure 1: a monotonic path on a 5 x 3 grid.
To go from (0,0) to (n,m), we need to make n moves to the right and m moves up. These moves can be performed in any order and we still reach (n,m). So the total number of paths is simply the number of ways n red balls and m green balls can be arranged in a straight line, which is C(n+m, m).
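As a quick check, math.comb (available in Python 3.8+) gives this count directly for the 5 x 3 grid of Figure 1:

    import math

    n, m = 5, 3
    print(math.comb(n + m, m))   # 56 monotonic paths on a 5 x 3 grid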
We can print all such paths using recursion. The following code does this.
Function printPath(n, m, path)
    if n == 0 and m == 0
        print path
        return
    end
    if n > 0
        printPath(n-1, m, path + "right")
    end
    if m > 0
        printPath(n, m-1, path + "up")
    end
end
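A runnable Python version of the same recursion is sketched below; it generates the paths as tuples of moves and checks the count against C(n+m, m). The generator form is an illustrative choice of mine, not the blog's grid_paths.py.

    import math

    def monotonic_paths(n, m, path=()):
        # Yield every sequence of "right"/"up" moves from (0,0) to (n,m).
        if n == 0 and m == 0:
            yield path
            return
        if n > 0:
            yield from monotonic_paths(n - 1, m, path + ("right",))
        if m > 0:
            yield from monotonic_paths(n, m - 1, path + ("up",))

    paths = list(monotonic_paths(2, 2))
    print(len(paths), math.comb(4, 2))   # 6 6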
Problem 2: Now we need to print all monotonic paths that do not go above the diagonal y = x. Figure 2 shows such a path. Note that the path in Figure 1 goes above the diagonal and hence is not desirable in this case. How many such paths exist for an n x m grid (n >= m)?
Solution:

Figure 2: a monotonic path that does not go above the diagonal y = x.
Let us first discuss how to print all monotonic paths that do not go above the diagonal. For a path to be valid, every point (x,y) on it must satisfy y <= x. We can simply enforce this constraint while printing paths; the code appears after the counting argument below.
Now, in order to obtain the number of such paths, we subtract the number of paths that cross the diagonal from the total number of paths, C(n+m, m).
Consider a path that crosses the diagonal, and suppose it crosses the diagonal for the first time at the point (x,y). Since this is the first point after crossing the diagonal, y must equal x+1. Up to this point the path has taken x right and y up moves, and it still has to take n-x right and m-y up moves to reach (n,m). If we flip the remaining moves, turning the n-x right moves into up moves and the m-y up moves into right moves, then starting from (x,y) we end up at the point (x+m-y, y+n-x). Setting y = x+1, the resulting end point is (m-1, n+1). Since m <= n, this end point lies above the diagonal. The surgery is reversible, so the paths that cross the diagonal are in one-to-one correspondence with the monotonic paths from (0,0) to (m-1, n+1), of which there are C(n+m, m-1).
So the desired number is C(n+m, m) - C(n+m, m-1) = [(n-m+1)/(n+1)] * C(n+m, m). This is a proof based on André's reflection method.
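The closed form can be sanity-checked against a brute-force count of the constrained paths for small grids; a rough sketch:

    import math

    def count_at_most_diagonal(n, m, x=0, y=0):
        # Count monotonic paths from (x,y) to (n,m) that never go above y = x.
        if (x, y) == (n, m):
            return 1
        total = 0
        if x < n:
            total += count_at_most_diagonal(n, m, x + 1, y)
        if y < m and y < x:
            total += count_at_most_diagonal(n, m, x, y + 1)
        return total

    for n in range(1, 7):
        for m in range(1, n + 1):
            closed = math.comb(n + m, m) - math.comb(n + m, m - 1)
            assert count_at_most_diagonal(n, m) == closed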
Alternately:
André's reflection method was originally used to solve a very interesting problem called the ballot problem. The ballot problem states that in a two-candidate election, candidate A polls p votes and candidate B polls q votes (p > q). What is the probability that candidate A stays strictly ahead of candidate B throughout the counting?
Well the probability is simply (p-q)/(p+q).
We can use the ballot theorem to count the monotonic paths that stay strictly below the diagonal (paths that do not even touch it). Treat a right move as a vote for candidate A and an up move as a vote for candidate B, with the two candidates polling n and m votes respectively. The probability that the number of right moves stays strictly ahead of the number of up moves throughout is (n-m)/(n+m). So the number of monotonic paths that stay strictly below the diagonal is [(n-m)/(n+m)] * C(n+m, m).
Now, since in our problem the paths are allowed to touch the diagonal, consider instead paths ending at (n+1,m). A path to (n+1,m) that stays strictly below the diagonal must begin with a right move; deleting that first right move gives a path from (0,0) to (n,m) that may touch but never goes above the diagonal, and vice versa. So the total number of ways is [(n+1-m)/(n+1+m)] * C(n+m+1, m) = [(n-m+1)/(n+1)] * C(n+m, m).
Function printPath(x, y, n, m, path)
    if x == n and y == m
        print path
        return
    end
    if x < n
        printPath(x+1, y, n, m, path + "right")
    end
    if y < m and y < x    // an up move is allowed only if it keeps the path at or below the diagonal
        printPath(x, y+1, n, m, path + "up")
    end
end

The total number of such paths is [(n-m+1)/(n+1)] * C(n+m, m). For n = m, this simplifies to [1/(n+1)] * C(2n, n), which is the Catalan number. The Wikipedia page on Catalan numbers also lists other interesting problems that have the Catalan numbers as their solution.
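For n = m, a one-liner confirms that the formula produces the familiar Catalan sequence:

    import math

    print([math.comb(2 * n, n) // (n + 1) for n in range(6)])
    # [1, 1, 2, 5, 14, 42]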
Problem 3: Now let us consider that, apart from up and right moves, we can also take up-right (diagonal) moves. Figure 3 shows such a path. How many such paths exist for an n x m grid?
Solution:

Figure 3: a monotonic path that includes diagonal (up-right) moves.
This problem is similar to problem 1, except that paths can contain diagonal moves. One diagonal move consumes 1 right and 1 up move.
So let us assume that we make r diagonal moves, where r <= min(n,m). In this case, apart from the r diagonal moves, we need to make n-r right moves and m-r up moves to reach (n,m). The number of ways we can do this is (n+m-r)! / [(n-r)! * (m-r)! * r!]. If we set r = 0, we recover the solution to problem 1. But here r can go from 0 to min(m,n), so the total number of ways is:
Sum over r in [0, min(m,n)] of (n+m-r)! / [(n-r)! * (m-r)! * r!]

Alternately, we can derive a recurrence relation for the total number of paths. Let D(n,m) be the total number of paths from (0,0) to (n,m) with up, right and diagonal moves. We can reach (n,m) from (n-1,m) by a right move, from (n,m-1) by an up move, and from (n-1,m-1) by a diagonal move. These are the only three atomic steps, so D(n,m) = D(n-1,m) + D(n,m-1) + D(n-1,m-1) for n,m > 0, with D(n,0) = D(0,m) = 1. D(n,m) is known as a Delannoy number.
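The summation and the recurrence can be cross-checked in a few lines of Python (the function names are mine):

    import math
    from functools import lru_cache

    def delannoy_sum(n, m):
        # Sum over the number r of diagonal moves.
        return sum(math.factorial(n + m - r)
                   // (math.factorial(n - r) * math.factorial(m - r) * math.factorial(r))
                   for r in range(min(n, m) + 1))

    @lru_cache(maxsize=None)
    def delannoy_rec(n, m):
        # D(n,m) = D(n-1,m) + D(n,m-1) + D(n-1,m-1), with D(n,0) = D(0,m) = 1.
        if n == 0 or m == 0:
            return 1
        return delannoy_rec(n - 1, m) + delannoy_rec(n, m - 1) + delannoy_rec(n - 1, m - 1)

    assert all(delannoy_sum(n, m) == delannoy_rec(n, m) for n in range(7) for m in range(7))
    print(delannoy_rec(3, 3))   # 63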
The code to print all paths has a small addition to the code for problem 1.
Function printPath(n, m, path)
    if n == 0 and m == 0
        print path
        return
    end
    if n > 0
        printPath(n-1, m, path + "right")
    end
    if m > 0
        printPath(n, m-1, path + "up")
    end
    if n > 0 and m > 0
        printPath(n-1, m-1, path + "diagonal")
    end
end

We can extend this problem further by asking for the number of such paths that do not go above the diagonal (the same constraint as in problem 2). The number of paths in that case is called the Schröder number.
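A brute-force count over small square grids, keeping the paths on or below the diagonal as in problem 2, reproduces the first few (large) Schröder numbers; a rough sketch:

    def count_schroder(n, x=0, y=0):
        # Paths from (0,0) to (n,n) with right, up and diagonal moves, never above y = x.
        if (x, y) == (n, n):
            return 1
        total = 0
        if x < n:
            total += count_schroder(n, x + 1, y)
        if y < n and y < x:
            total += count_schroder(n, x, y + 1)
        if x < n and y < n:   # a diagonal move keeps y <= x
            total += count_schroder(n, x + 1, y + 1)
        return total

    print([count_schroder(n) for n in range(5)])   # [1, 2, 6, 22, 90]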
Source Code: grid_paths.py
July 9, 2012
Finding Top K Elements in Large Dataset
Problem: Consider the problem of finding the top K frequent numbers in a file of N numbers. Now suppose that N is so large that the numbers cannot fit in the memory M available to the program. How do we find the top K frequent elements now, under the assumption that K < M < N?
Deterministic solution:
The idea is similar to solving the problem for small N (N < M) (see here). For large N, divide the problem into chunks of size <= M, then solve it.
In order to divide the problem, consider a uniform hashing function H that takes a key and returns an integer from the set {1,2,....,ceil(N/M)}. So we get N/M chunks with roughly M unique keys per chunk. Since the number of unique numbers (U) is less than N, we expect each chunk to have less than M unique numbers.
Now for each chunk, we compute the frequency of the numbers contained in that chunk. This frequency computation can be done in memory, and we can maintain a MIN-HEAP of size K which is updated directly (following the steps presented here). As a result, only two reads of the dataset and one disk write are required to get the top K frequent items. The complexity of the algorithm is O(N log K).
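A rough Python sketch of this two-pass approach, assuming the numbers arrive as an iterable and using temporary files as the chunks (the function and parameter names are my own, not from the post):

    import heapq
    import os
    import tempfile
    from collections import Counter

    def top_k_external(stream, k, num_chunks):
        # Pass 1: route every number to a chunk file based on its hash,
        # so all occurrences of the same number land in the same chunk.
        tmpdir = tempfile.mkdtemp()
        chunks = [open(os.path.join(tmpdir, f"chunk{i}"), "w") for i in range(num_chunks)]
        for x in stream:
            chunks[hash(x) % num_chunks].write(f"{x}\n")
        for f in chunks:
            f.close()
        # Pass 2: count each chunk in memory and keep a global min-heap of size k.
        heap = []   # (frequency, number)
        for i in range(num_chunks):
            with open(os.path.join(tmpdir, f"chunk{i}")) as f:
                counts = Counter(line.strip() for line in f)
            for x, c in counts.items():
                if len(heap) < k:
                    heapq.heappush(heap, (c, x))
                elif c > heap[0][0]:
                    heapq.heapreplace(heap, (c, x))
        return [x for c, x in sorted(heap, reverse=True)]

    print(top_k_external(iter(["2", "3", "4", "3", "1", "78", "78", "3"]), 2, 3))
    # ['3', '78']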
Probably for a small K (K < M^2/N), we can compute the top K per chunk in O(K log M) and combine the top K for all chunks N*K/M (<M) to get the overall top K. The total complexity of the algorithm is O(N + M log M).
Alternate method:
We assume that the disk has a huge capacity, which allows us to build a disk-based hash table. Imagine that we create a very large file with holes and divide it into B blocks, each with the capacity to hold more than N/B numbers and their integer counts. Using a uniform hash function H which takes an arbitrary number x and returns H(x), the block number from the set {0, 1, ..., B-1}, we can store the numbers and their frequencies in this disk map. H gives us the offset to seek to within the file, and we write each number (and its frequency) sequentially within the correct block (or place it within the block via another hash function H1).
Now we proceed as usual: we start counting the numbers in memory using a hash table (as in the former approach). When we encounter a new number that cannot be put in memory, we purge some entries from the hash table. The purged entries are written to the disk map. To decide what to purge, we maintain an array which counts the frequency of the keys that lie within each block. Keys that belong to the most infrequent blocks can be purged first (or blocks that are least recently used, or that lead to the least disk access, etc.).
The following code gives a very basic outline of this approach:
BlockFreq = array(B)         // frequency of keys per block
NumberFreq = hashtable(M)    // in-memory counts
diskwrite = 0
for i = 1:N
    x = A[i]
    BlockFreq[H[x]] += 1
    if NumberFreq.haskey(x)
        NumberFreq[x] += 1
        continue
    end
    if NumberFreq.hasspace()
        NumberFreq[x] = 1
        continue
    end
    // no space in memory: update the disk map directly
    if DiskMap.haskey(x)
        DiskMap[x] += 1
    else
        DiskMap[x] = 1
    end
    if diskwrite == 10
        purge(NumberFreq, BlockFreq)
        diskwrite = 0
    else
        diskwrite += 1
    end
end

Here purge is a procedure that evicts some set of keys from NumberFreq based on BlockFreq. Note that this code omits several key details of the process, so the idea presented here is quite crude.
Single pass probabilistic solution:
Solution 1 is quite efficient as it requires only two disk reads of the dataset, but the bottleneck can be the disk writes during the initial chunk formation. We can reduce that bottleneck by considering a data-structure called Bloom filters.
So consider that we have B uniform hash functions H1, H2, ..., HB, where each hash function maps a key into the range {1, 2, ..., R}. Now imagine an array C of size B x R (< M) that counts how many times each key is seen. For each number x that we read from the dataset, we compute Hi(x) for every i and increment C[i, Hi(x)] by 1. So we maintain B counts of x, each in a different row of R buckets. The true count of x is at most min(C[1, H1(x)], ..., C[B, HB(x)]).
Now if the query is to get all the elements with frequency greater than some threshold, then we can use the bloom filters to get all such numbers (with some false positives, though, which can be filtered out using another pass over the dataset). This can save a complete write of the data to the disk (see the paper: Computing Iceberg Queries Efficiently).
But in our case, we are interested in finding the top K frequent numbers. The following modification can be used to estimate the frequency of each number:
MH = MIN-HEAP(K)
for i = 1:N
    x = A[i]
    for b = 1:B
        C[b, Hb(x)] += Sb(x)
    end
    if contains(MH, x)
        increment count of x in the heap
    else
        f = median(C[b, Hb(x)] * Sb(x), over all b)   // estimated frequency of x
        if f > min(MH)
            remove-min(MH)
            insert(MH, (x, f))
        end
    end
end

Each Sb is a {-1,+1} hash function, and this data structure is called CountSketch. More details of the method are available in the paper: Finding Frequent Items in Data Streams.
Note that the above algorithm takes a single pass over the dataset (and requires no disk writes), but it is not guaranteed to give the top K frequent items; it can make mistakes for some of the less frequent items. In practice, the choice of the hash functions can be critical for the performance of the algorithm.
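As a concrete illustration of the B x R counting array and the min-estimate described above, here is a minimal count-min style sketch in Python. It is a simplified assumption of mine (salted built-in hashes instead of proper pairwise-independent hash functions), and unlike the CountSketch pseudocode above it does not use the {-1,+1} signs or the median:

    import random

    class CountMinSketch:
        def __init__(self, width, depth, seed=0):
            rnd = random.Random(seed)
            # One salted hash per row; a real implementation would use
            # proper pairwise-independent hash functions.
            self.salts = [rnd.getrandbits(64) for _ in range(depth)]
            self.width = width
            self.table = [[0] * width for _ in range(depth)]

        def _bucket(self, row, x):
            return hash((self.salts[row], x)) % self.width

        def add(self, x):
            for r in range(len(self.table)):
                self.table[r][self._bucket(r, x)] += 1

        def estimate(self, x):
            # The true count of x is at most this value.
            return min(row[self._bucket(r, x)] for r, row in enumerate(self.table))

    cms = CountMinSketch(width=64, depth=4)
    for x in [2, 3, 4, 3, 1, 78, 78, 3]:
        cms.add(x)
    print(cms.estimate(3))   # 3, unless there are collisions in every row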
July 3, 2012
Finding Top K Frequent Items
Problem: Consider a file containing N numbers. For example, {2,3,4,3,1,78,78,3} is a file containing N = 8 numbers. Assume that N is small enough to fit in the computer's memory M, so in this case N << M. Now we need to find the top K frequent numbers in the file. For the above example, the output should be {3,78} for K = 2.
Running Time: The time complexity should be one of O(N), O(N log K), or O(N + K log N).
Solution:
Finding top K frequent elements is a classical database and data streaming problem and there are several solutions to it ranging from deterministic to probabilistic. I'll illustrate the three obvious ones here.
The first step is to count how many times each number appears in the file. If the file is pre-sorted then we need a single scan over the file.
Function COUNTS:
    A = load_file_in_array
    last = A[1]
    ctr = 1
    for i = 2:N
        if last == A[i]
            ctr += 1
        else
            CountMap{last} = ctr
            last = A[i]
            ctr = 1
        end
    end
    // required for the last element of the array
    CountMap{last} = ctr
end

Note that CountMap is a hashmap that stores the counts of all the elements. The procedure COUNTS is quite efficient if the file is pre-sorted. If the file is not pre-sorted, sorting it first increases the time complexity to O(N log N). In that case, we can do better by directly using the hashmap to maintain the current count of each number, without sorting the file, as follows:
Function EFFICIENT_COUNTS:
    A = load_file_in_array
    for i = 1:N
        CountMap{A[i]} += 1
    end
end

The above procedure obtains the count of each element in a single scan of the file, hence it runs in O(N) time. So now we have all the numbers along with their frequencies in an array (which can be extracted by enumerating all keys of the CountMap or by another scan of the file). For the example in the problem statement, we get the following tuples: {(2,1), (3,3), (4,1), (1,1), (78,2)}.
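For reference, the same single-scan counting can be done in Python with collections.Counter; a minimal illustration using the example from the problem statement:

    from collections import Counter

    A = [2, 3, 4, 3, 1, 78, 78, 3]
    CountMap = Counter(A)
    print(list(CountMap.items()))   # [(2, 1), (3, 3), (4, 1), (1, 1), (78, 2)]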
The problem that remains is to find the K most frequent numbers from the array. The naive approach would be to sort the numbers by their frequencies and pick the top K. This would take O(U log U) time, where U (= 5 in our example) is the number of unique elements in the array. If we consider U = O(N), then the time complexity of this sorting is O(N log N). We can do better than that, as follows:
Approach 1: O(N) time
Use the selection algorithm to find the Kth most frequent number (selecting on the second element of each tuple) in O(U) time. The Kth most frequent element partitions the array into two parts: the first part containing the top K most frequent elements and the second part containing the remaining U-K less frequent elements. So we get the top K most frequent elements, in no particular order, in O(N) time (assuming U = O(N)). They can be sorted in O(K log K) time if needed. Note that although this approach runs in O(N) time, the constants hidden in the O-notation can be large, so in practice this approach can be slower than the two approaches described below.
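A rough Python sketch of this selection-based approach, using a randomized quickselect on the frequency values (expected O(U) time; this simple version copies lists for clarity rather than partitioning in place, and the function names are mine):

    import random

    def kth_largest(vals, k):
        # Expected linear-time selection of the k-th largest value (1-indexed).
        pivot = random.choice(vals)
        greater = [v for v in vals if v > pivot]
        equal = [v for v in vals if v == pivot]
        if k <= len(greater):
            return kth_largest(greater, k)
        if k <= len(greater) + len(equal):
            return pivot
        return kth_largest([v for v in vals if v < pivot], k - len(greater) - len(equal))

    def top_k_by_selection(count_map, k):
        threshold = kth_largest(list(count_map.values()), k)
        top = [x for x, c in count_map.items() if c > threshold]
        ties = [x for x, c in count_map.items() if c == threshold]
        return top + ties[: k - len(top)]   # unordered top K

    print(top_k_by_selection({2: 1, 3: 3, 4: 1, 1: 1, 78: 2}, 2))   # [3, 78]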
Approach 2: O(N log K) time
Pick the first K tuples and put them in a MIN-HEAP, where a tuple (x,y) is less than a tuple (a,b) if y is less than b. The time complexity of building this min-heap of size K is O(K).
Then pick the remaining U - K elements one by one. If the picked element is less than the minimum on the heap, discard it. Otherwise, remove the minimum element from the heap and insert the picked element. This keeps exactly K elements on the heap. This delete-insert operation costs O(log K) per element.
Once we are done picking all the elements, the elements that finally remain in the min-heap are the top K frequent items which can be popped in O(K log K) time. The overall cost of this approach is O(K + (U-K) log K + K log K) = O(K + U log K). Since K < U and U = O(N), we get the time complexity of O(N log K).
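A compact Python sketch of this approach using the heapq module (the function name is mine):

    import heapq

    def top_k_min_heap(count_map, k):
        heap = []   # min-heap of (frequency, number), never more than k entries
        for x, c in count_map.items():
            if len(heap) < k:
                heapq.heappush(heap, (c, x))
            elif c > heap[0][0]:
                heapq.heapreplace(heap, (c, x))   # O(log k) delete-insert
        return [x for c, x in sorted(heap, reverse=True)]

    print(top_k_min_heap({2: 1, 3: 3, 4: 1, 1: 1, 78: 2}, 2))   # [3, 78]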
Approach 3: O(N + K log N) time
This approach is similar to approach 2 but the main difference is that we make a MAX-HEAP of all the U elements. So the first step is to make the max heap of all the elements in O(U). Then remove the maximum element from the heap K times in O(K log U) time. The K removed elements are the desired most frequent elements. The time complexity of this method is O(U + K log U) and by setting U = O(N) we get O(N + K log N).
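A matching sketch of approach 3, simulating a max-heap by negating the frequencies (Python's heapq is a min-heap):

    import heapq

    def top_k_max_heap(count_map, k):
        heap = [(-c, x) for x, c in count_map.items()]
        heapq.heapify(heap)                                  # O(U)
        return [heapq.heappop(heap)[1] for _ in range(k)]    # K pops, O(K log U)

    print(top_k_max_heap({2: 1, 3: 3, 4: 1, 1: 1, 78: 2}, 2))   # [3, 78]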
Let us stop for a moment and contrast approach 2 with approach 3. For simplicity, let T2 = K + N log K be the time complexity of approach 2 and T3 = N + K log N be the time complexity of approach 3. The figure below plots the ratio T2/T3 for N = 100 and different values of K. We observe that approach 3 is considerably better for small values of K, whereas approach 2 is better for large values of K. Though the actual difference depends on the constants involved, we can still see the merit of one approach over the other.