Puzzles, Maths and Algorithms
February 16, 2015
Problem 1: There are three light bulbs in a room. Outside the room are three switches which operate those bulbs. You know that the bulbs are initially turned off, and you can toggle the switches as much as you want. While you toggle the switches, you can't peek inside the room. Finally, when you are done with the switches, you enter the room once and have to tell which switch is associated with which bulb.
Problem 2: Same problem as 1, but now we don't know whether the bulbs are initially on or off. So, in a way, we do not know whether flipping a switch up turns its bulb on or off. All we know is that all the bulbs start in the same state, i.e., all on or all off. Again, we may toggle the switches, enter the room once, and tell which switch operates which bulb.
July 2, 2014
Solving Sudoku Puzzles
Recently, I got hooked on Sudoku (thanks to a wonderful iPhone app). I found it to be a slightly more constructive time sink and plan to teach it to my daughter some day (she is one year old as of now). For this, I thought of writing assistive software to teach kids how to simplify a Sudoku. This motivated me to first build a Sudoku solver, and this blog post is about it.
A Sudoku is a matrix of cells, typically 9x9, but it can be any size. In this post, we consider N x N Sudokus, where N is a perfect square (1x1, 4x4, 9x9, 16x16, ...). For such a Sudoku, we first construct groups of cells called blocks. We get N row blocks, N column blocks, and N sub-grid blocks, each containing N cells. A cell can have a value from 1 to N such that, if a cell is assigned a value, then no other cell in the same block can be assigned that value. One check that a Sudoku is solved is that every block sums to N(N+1)/2. A stricter check is that the uniqueness constraint is satisfied in every block and each cell is assigned a value from 1 to N.
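As a quick illustration of the stricter check, here is a minimal Python sketch. It is my own, not the blog's Java program; the grid representation (a list of N lists of ints) is an assumption for illustration.

```python
def blocks(n):
    """Yield the 3n blocks (rows, columns, sub-grids) of an n x n grid
    as lists of (row, col) coordinates; n must be a perfect square."""
    s = int(n ** 0.5)
    for i in range(n):
        yield [(i, j) for j in range(n)]                            # row block
        yield [(j, i) for j in range(n)]                            # column block
        r, c = (i // s) * s, (i % s) * s
        yield [(r + a, c + b) for a in range(s) for b in range(s)]  # sub-grid block

def is_solved(grid):
    """Every block must contain exactly the values 1..n."""
    n = len(grid)
    return all(sorted(grid[r][c] for r, c in blk) == list(range(1, n + 1))
               for blk in blocks(n))
```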
To simplify a Sudoku by hand, I typically use the following techniques:
1. A number is assigned to a cell if:
   - (1a) the cell can hold only that number, or
   - (1b) that number can be held by only that cell in at least one of its blocks.
2. Two numbers are assigned to two cells if those two cells share at least one block (note: two cells can share at most two blocks) and:
   - (2a) the two numbers appear as hints only in these two cells in a given block; in this case, the other hints of these two cells are pruned, or
   - (2b) the two cells hint at those two numbers only; in this case, the two numbers are removed from the hints of the other cells in their shared block.
3. To finally solve it, simplify the Sudoku by applying (1) and (2) until no further simplification is possible. Then pick a cell with the fewest hints, assign it one of those hints, and repeat this brute-force method recursively (see the sketch after this list). If that fails, try another hint; if all hints fail, the Sudoku is not solvable.
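Here is a minimal sketch of the recursive step (3) in Python. It is my own illustration, not the blog's Java program: the hints mapping and the simplify() routine (assumed to apply rules (1) and (2) to a fixed point and return False on a contradiction) are hypothetical.

```python
def solve(hints):
    """Brute-force step (3) on top of simplification.

    hints: dict mapping each cell (row, col) to its set of candidate values.
    simplify() is an assumed routine applying rules (1) and (2) until no
    further change, returning False when some cell runs out of candidates."""
    if not simplify(hints):
        return None                                     # fail fast: contradiction
    unsolved = [c for c in hints if len(hints[c]) > 1]
    if not unsolved:
        return hints                                    # every cell fixed: solved
    cell = min(unsolved, key=lambda c: len(hints[c]))   # fewest hints first
    for value in sorted(hints[cell]):
        trial = {c: set(v) for c, v in hints.items()}   # copy before guessing
        trial[cell] = {value}
        result = solve(trial)
        if result is not None:
            return result
    return None                                         # all guesses failed
```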
The program is event-driven: a cell is asked to remove a hint based on the logic of (1), (2), and (3). If, in this process, a cell has only one hint remaining, it triggers a fix event. When the fix event fires, the cell is assigned that value, and the value is removed from the hints of all cells that share a block with it. If at any time an inconsistency is found, the events return false, indicating that the Sudoku is not solvable. This fail-fast technique quickly determines whether a Sudoku is solvable; otherwise, we would wait for all cells to get filled and only then realize that it failed (wasting precious time). A simple brute-force strategy is very easy to code, but the number of possibilities to check is huge; e.g., for a 9x9 grid we might end up checking O(9^81) possibilities, which is prohibitive. The simplification algorithms (1) and (2) come to the rescue by drastically simplifying the Sudoku. For most easy problems, the recursive step (3) is not even triggered.
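The fix event cascades: fixing one cell can reduce a neighbor to a single hint, which triggers another fix. A minimal sketch of this propagation (my own Python illustration; the peers map, from each cell to the cells sharing a block with it, is assumed):

```python
def fix(hints, cell, value, peers):
    """Assign value to cell and remove it from the hints of all peers.
    Returns False as soon as an inconsistency is found (fail fast)."""
    hints[cell] = {value}
    for p in peers[cell]:
        if value in hints[p]:
            hints[p].remove(value)
            if len(hints[p]) == 0:
                return False              # a peer has no candidate left
            if len(hints[p]) == 1:        # one hint left: trigger a fix event
                if not fix(hints, p, next(iter(hints[p])), peers):
                    return False
    return True
```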
I searched online for Sudoku problems and found that Peter Norvig has a blog post about it. He used algorithms (1) and (3) but not (2), and has shared sets of easy, hard, and hardest Sudoku puzzles. It amazes me that for most hard problems I take at least 8-10 minutes (and sometimes a hint), while a computer can solve them on the order of 0.1 ms. If you are interested in the mathematics of Sudoku (e.g., the number of possible solutions), please check out the Wikipedia page on the maths of Sudoku.
In the first analysis below, we see what fraction of the cells different algorithms are able to solve.
| | Input | (1) | (2a) | (2b) | (2) | (1+2a) | (1+2b) | (*) |
|---|---|---|---|---|---|---|---|---|
| Easy | 35% | 89% | 54% | 68% | 70% | 96% | 94% | 96% |
| Hard | 25% | 31% | 26% | 26% | 26% | 41% | 31% | 41% |
| Hardest | 30% | 45% | 35% | 35% | 36% | 50% | 45% | 50% |
Here we see that algorithm (1) by itself can solve 89% of the cells for easy problems. The joint application (1+2), however, solves 96% of the easy cells and 41% of the hard cells (roughly a 10-15% improvement). This is a great simplification, especially for easy Sudoku problems. Between (2a) and (2b), I find (2a) the better strategy to run in conjunction with (1), as (1+2a) gives the same result as (*).
In the next analysis, we see how many hints are left after applying the different algorithms.
| | Input | (1) | (2a) | (2b) | (2) | (1+2a) | (1+2b) | (*) |
|---|---|---|---|---|---|---|---|---|
| Easy | 474 | 29 | 126 | 89 | 82 | 10 | 16 | 10 |
| Hard | 544 | 230 | 252 | 260 | 250 | 184 | 228 | 181 |
| Hardest | 508 | 161 | 191 | 193 | 189 | 141 | 158 | 141 |
A similar analysis as above: (2a) appears better than (2b), but (*) is better than (1) or (2) alone, so combining these strategies is much more effective. We see that (1+2a) is nearly as good as (*), so we could simply drop algorithm (2b). To see how much (if at all) of a penalty running the (2b) algorithm incurs, we turn to the time taken.
For measuring time, I use Java's System.nanoTime() function, which gives very precise timings. Experiments are run on my MacBook Pro with a 2.6 GHz Intel i7 processor. To get the timings, I solve a puzzle 500 times and take the average solving time. We get the following timings.
| | Brute Force | (1) | (1+2a) | (1+2b) | (*) |
|---|---|---|---|---|---|
| Easy | 0.14 ms | 0.04 ms | 0.03 ms | 0.03 ms | 0.03 ms |
| Hard | 21 ms | 1.16 ms | 0.44 ms | 0.55 ms | 0.42 ms |
| Hardest | 0.61 ms | 0.17 ms | 0.20 ms | 0.19 ms | 0.22 ms |
Clearly, brute force without simplification is a bad idea. (*) improves the timing over (1), at least for hard problems; for the other problem sets the numbers are too close to draw conclusions. My experience with other random problems suggests that (*) leads to the fastest simplification (i.e., fewer iterations) and is hence better.
I also tried the problem that Peter mentions as taking the most time and noted its timing in my Java implementation: 6134 ms for algorithm (1), but only 10 ms for (*). This is clear evidence of the joint algorithms' usefulness. The improved timing comes from the fact that the Sudoku gets drastically simplified before entering the brute-force stage, and each brute-force choice further simplifies it by eliminating hints.
The Sudoku program is generic in the sense that it can solve 1x1, 4x4, 9x9, 16x16, … problems. A quick run of the program is as follows:
Solving a Sudoku puzzle of size 1 ...
===
| * |
===
[status] solved=0 hints=1
===
| 1 |
===
[status] SOLVED
Time taken=0.022 ms
Solving a Sudoku puzzle of size 4 ...
===== =====
| 4 * | 2 * |
| * * | 4 1 |
===== =====
| 1 * | * * |
| * * | 1 4 |
===== =====
[status] solved=7 hints=36
===== =====
| 4 1 | 2 3 |
| 2 3 | 4 1 |
===== =====
| 1 4 | 3 2 |
| 3 2 | 1 4 |
===== =====
[status] SOLVED
Time taken=0.132 ms
Solving a Sudoku puzzle of size 9 ...
======= ======= =======
| * * * | 7 * * | 6 * 3 |
| 7 * 3 | 9 * * | * 1 2 |
| * * * | 5 3 * | * * * |
======= ======= =======
| 1 * * | * * * | * * 7 |
| 9 * * | * * * | * * * |
| * 4 5 | * * * | * 9 6 |
======= ======= =======
| 3 * * | * * * | 9 * * |
| * * * | * * 1 | * 6 5 |
| * 1 * | 3 6 * | * 7 4 |
======= ======= =======
[status] solved=27 hints=486
======= ======= =======
| 5 9 4 | 7 1 2 | 6 8 3 |
| 7 8 3 | 9 4 6 | 5 1 2 |
| 6 2 1 | 5 3 8 | 7 4 9 |
======= ======= =======
| 1 6 2 | 8 5 9 | 4 3 7 |
| 9 3 7 | 6 2 4 | 1 5 8 |
| 8 4 5 | 1 7 3 | 2 9 6 |
======= ======= =======
| 3 5 6 | 4 8 7 | 9 2 1 |
| 4 7 8 | 2 9 1 | 3 6 5 |
| 2 1 9 | 3 6 5 | 8 7 4 |
======= ======= =======
[status] SOLVED
Time taken=2.198 ms
Solving a Sudoku puzzle of size 16 ...
============= ============= ============= =============
| 8 7 * * | * * * * | * 3 * * | 13 * 4 * |
| * 5 14 * | * * 3 10 | 15 9 1 * | * 6 * * |
| 16 * * * | 5 8 7 * | * 14 * * | 9 * 11 12 |
| * * 4 * | * 14 6 13 | * 11 10 12 | * 7 * 3 |
============= ============= ============= =============
| 14 * * 8 | * * 1 * | * * * 3 | 7 4 12 * |
| 9 * * * | * 6 15 12 | * * 13 14 | * 3 1 * |
| 11 * 10 3 | * * 13 * | * 8 * 1 | * * 6 * |
| 6 * * 1 | 14 * 4 * | * 5 * 9 | 11 * * 13 |
============= ============= ============= =============
| * * * * | 15 * * * | * * 9 * | 5 * 2 10 |
| 10 1 * * | 6 * 5 * | 13 15 7 16 | * * * * |
| * * 16 11 | * 4 * 8 | 2 * * * | * 13 * 7 |
| * 9 * 7 | 1 3 * 2 | 6 * 8 10 | 16 15 14 4 |
============= ============= ============= =============
| 7 * 13 * | 9 16 * 5 | * * 14 4 | 3 8 * 2 |
| * * 3 * | 10 * * * | * * * * | * 16 15 * |
| 1 * 9 * | * * 14 4 | * * * * | * * 7 * |
| * 6 8 * | 3 * * * | 10 7 * * | * * * * |
============= ============= ============= =============
[status] solved=116 hints=2240
============= ============= ============= =============
| 8 7 11 10 | 2 12 9 1 | 5 3 16 6 | 13 14 4 15 |
| 12 5 14 13 | 4 11 3 10 | 15 9 1 7 | 2 6 16 8 |
| 16 3 1 6 | 5 8 7 15 | 4 14 2 13 | 9 10 11 12 |
| 2 15 4 9 | 16 14 6 13 | 8 11 10 12 | 1 7 5 3 |
============= ============= ============= =============
| 14 13 15 8 | 11 2 1 9 | 16 10 6 3 | 7 4 12 5 |
| 9 4 7 5 | 8 6 15 12 | 11 2 13 14 | 10 3 1 16 |
| 11 2 10 3 | 7 5 13 16 | 12 8 4 1 | 15 9 6 14 |
| 6 16 12 1 | 14 10 4 3 | 7 5 15 9 | 11 2 8 13 |
============= ============= ============= =============
| 3 8 6 12 | 15 13 16 7 | 14 4 9 11 | 5 1 2 10 |
| 10 1 2 4 | 6 9 5 14 | 13 15 7 16 | 8 12 3 11 |
| 15 14 16 11 | 12 4 10 8 | 2 1 3 5 | 6 13 9 7 |
| 13 9 5 7 | 1 3 11 2 | 6 12 8 10 | 16 15 14 4 |
============= ============= ============= =============
| 7 11 13 15 | 9 16 12 5 | 1 6 14 4 | 3 8 10 2 |
| 5 12 3 14 | 10 7 8 6 | 9 13 11 2 | 4 16 15 1 |
| 1 10 9 2 | 13 15 14 4 | 3 16 5 8 | 12 11 7 6 |
| 4 6 8 16 | 3 1 2 11 | 10 7 12 15 | 14 5 13 9 |
============= ============= ============= =============
[status] SOLVED
Time taken=1.346 ms
Solving all puzzles from: ./src/sudoku/hardest.txt
Solved : 11/11
Solved Cells : 270(input) 891(finally) 891(expected)
Cell Hints : 5589(input) 0(finally)
Time Taken : 2.59014144 milli-seconds
Avg time : 0.23546740363636365 milli-seconds
Solving all puzzles from: ./src/sudoku/hard.txt
Solved : 95/95
Solved Cells : 1953(input) 7695(finally) 7695(expected)
Cell Hints : 51678(input) 0(finally)
Time Taken : 41.772783616 milli-seconds
Avg time : 0.4397135117473684 milli-seconds
Solving all puzzles from: ./src/sudoku/easy.txt
Solved : 50/50
Solved Cells : 1418(input) 4050(finally) 4050(expected)
Cell Hints : 23688(input) 0(finally)
Time Taken : 1.868110848 milli-seconds
Avg time : 0.03736221696 milli-seconds
A word of advice to readers: while building the Sudoku solver (or similar programs), you might be tempted to optimize it. I suggest it is more important for the code to be readable, debuggable, and concise. I tried to optimize my code but realized it is not worth it: using some tricks I got a speedup of about 1 ms for a specific algorithm, but that meant a highly cryptic data structure, which increased the timing of another algorithm and was a nightmare to debug. Just do the basic optimizations. As a side effect of my attempt, I learnt a few useful things to speed up Java code:
- byte is more expensive than int, so use int.
- An int check (e.g., i == 0) is very slightly faster than a boolean check, but don't use it if it compromises readability.
- Exception handling (even if the exception is never thrown) is expensive.
- Java doesn't do tail-call optimization, so don't worry too much about reformatting your recursive routines.
- The final keyword has no effect on performance in Java; if you think your one-liner function gets inlined because of it, it does not.
- Avoid ArrayList in favor of fixed arrays. Even if the ArrayList has size k and the array size n with k < n, the array is faster, unless k is much smaller than n.
- HashMap and HashSet can easily hog performance; they can easily double the running time. If you can avoid them, do.
- A 1-d array is cheaper than a 2-d array.
- Using objects is extremely cheap; I found next to no penalty in using them. So use them to simplify code.
- Caching can be useful. For example, instead of referring to A[i] several times, it is faster to set x = A[i] and refer to x.
- private, public, and other qualifiers only increase the size of the bytecode but have no performance impact, so use them for good design principles, not for performance.
May 2, 2014
Stick Breaking
Problem Set 1: Given a stick of length 1 meter, it is cut into two parts randomly.
- Problem 1a: What is the expected length of the smaller stick?
- Problem 1b: What is the expected length of the bigger stick?
Problem Set 2: Given a stick of length 1 meter, it is cut into three parts randomly.
- Problem 2a: What is the expected length of the smallest stick?
- Problem 2b: What is the expected length of the largest stick?
- Problem 2c: What is the expected length of the stick with second smallest (or second largest) length?
- Problem 2d: What is the expected length of the middle stick (stick that is between the two cuts)?
- Problem 2e: What is the probability that the three sticks would form a triangle?
- Problem 2f: What is the probability that the three sticks would form a right angle triangle?
- Problem 2g: What is the probability that the three sticks would form an equilateral triangle?
Problem Set 3: Given a stick of length 1 meter, it is cut into two parts randomly. Then the bigger part is picked and again cut into two parts randomly.
- Problem 3a: What is the expected length of the smallest stick?
- Problem 3b: What is the expected length of the largest stick?
- Problem 3c: What is the expected length of the stick with second smallest (or second largest) length?
- Problem 3d: What is the expected length of the middle stick (stick that is between the two cuts)?
- Problem 3e: What is the probability that the three sticks would form a triangle?
- Problem 3f: What is the probability that the three sticks would form a right angle triangle?
- Problem 3g: What is the probability that the three sticks would form an equilateral triangle?
Generalized Cut Problem: Find the solutions to problems 2a-2c and 3a-3c for n-1 cuts.
Solution:
Solution for problem 1: 1/4 (1a), 3/4 (1b)
Solution for problem 2: 1/9 (2a), 11/18 (2b), 5/18 (2c), 1/3 (2d), 1/4 (2e), 0 (2f), 0 (2g)
Solution for problem 3: 15/16 - 2log(3/2) (3a), …
Let us work out the solution for 1a. The smaller stick has length x, which is uniformly distributed in the range [0, 0.5]. The probability density of x is p(x) = 1/(1/2 - 0) = 2. Indeed it is a pdf: ∫_0^0.5 p(x) dx = 1. The expected length of the smaller stick is E[x] = ∫ x p(x) dx = ∫_0^0.5 2x dx = 1/4.
2a) Let us work out the solution for 2a. Let the two cuts be at distances x and y from one end, and let x < y. So the three sticks have lengths x, y-x, and 1-y. Let z = min(x, y-x, 1-y) be the length of the smallest stick, and let Pr(z > a) be the probability that the smallest stick is longer than a.
=> Pr(z > a) = 2 Pr(x > a, y-x > a, 1-y > a) = 2 Pr(x > a, a+x < y < 1-a)
The multiplication by 2 covers the other case (x > y). Since a+x < y < 1-a, we get x < 1-2a. Also, setting x = a, we get a < 1/3 (which is obvious).
=> Pr(z > a) = 2 ∫_a^(1-2a) ∫_(a+x)^(1-a) dy dx = (1-3a)^2
So Pr(z <= a) = 1 - Pr(z > a), and the density of z is the derivative of Pr(z <= a), which is 6(1-3a).
=> E[z] = ∫_0^(1/3) 6a(1-3a) da = 1/9
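A quick symbolic check of this last integral (my own sketch using sympy):

```python
from sympy import symbols, integrate, Rational

a = symbols('a')
print(integrate(6 * a * (1 - 3 * a), (a, 0, Rational(1, 3))))  # prints 1/9
```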
2b) We can do a similar derivation for z = max(x, y-x, 1-y) to get the expected maximum length. It is slightly more complicated; unlike above, we compute Pr(z < a) here.
=> Pr(z < a) = Pr(x < a, y-x < a, 1-y < a) = Pr(1-2a < x < a, 1-a < y < x+a)
We need to be careful here to respect constraints like x > 0, x < y, y < 1. To ensure they are respected, Pr(z < a) is broken into three cases. Case 1: a < 1/2. Case 2: a >= 1/2, 0 < x < 1-a. Case 3: a >= 1/2, 1-a < x < a. Then we differentiate the Pr(z < a) obtained in each case to get the density (don't mix terms from the three cases).
Integrate the first case over [1/3, 1/2] and the second and third cases over [1/2, 1]. We get the desired answer, 1/2 + 1/9 = 11/18. (Don't forget to multiply by 2 to cover the x > y case.)
2c) This is now easy to obtain: 1 - (2a) - (2b) = 1 - 1/9 - 11/18 = 5/18.
2d) By symmetry, the three pieces x, y-x, and 1-y have the same expected length, so the expected length of the middle stick (the one between the two cuts) is 1/3.
2e) To form a triangle, three inequalities must be satisfied (apart from the implicit x < y):
x + (y-x) > 1-y
x + (1-y) > y-x
(y-x) + (1-y) > x
=> 1/2 < y < x + 1/2 and x < 1/2
So x follows the pdf p(x) = 2 on [0, 1/2], and y satisfies its range (an interval of length x) with probability x. So the probability of triangle formation is ∫_0^(1/2) 2x dx = 1/4.
2f) There are discrete stick-length triples, e.g., (3/12, 4/12, 5/12), that satisfy this property. They are countably infinite, yet the area they occupy is 0, so the probability is 0 (the same argument gives 0 for 2g).
Problem set 3 is slightly more complicated. Let us work out 3a).
Let the first cut be at distance x from the shorter side (x < 1/2). Now, from the stick of length 1-x, we make a random cut; let y be the length of its shorter piece, so y < (1-x)/2. If x is the shortest stick then x < y, which implies x < 1/3.
Case 1: 1/3 < x < 1/2 and 0 < y < (1-x)/2 => y is the shortest stick.
Case 2: 0 < x < 1/3.
Case 2a: 0 < y < x => y is the shortest stick.
Case 2b: x < y < (1-x)/2 => x is the shortest stick.
We also note that p(x) = 2 (uniform on [0, 1/2]) and p(y) = 2/(1-x) (uniform on [0, (1-x)/2]).
So the expected minimum length is
∫_(1/3)^(1/2) ∫_0^((1-x)/2) 4y/(1-x) dy dx + ∫_0^(1/3) ∫_0^x 4y/(1-x) dy dx + ∫_0^(1/3) ∫_x^((1-x)/2) 4x/(1-x) dy dx = 15/16 - 2 log(3/2)
For the generalized problem, see David and Nagaraja's Order Statistics (pp. 133-135 and p. 153).
The solutions can also be verified through Monte Carlo simulation. The following Python code solves problem sets 2 and 3.
    from random import random

    def prob2():
        niter = 100000
        n, ntri, nrtri = 0.0, 0, 0
        lsmall, llarge, lmid, lcenter = 0, 0, 0, 0
        while n < niter:
            n += 1
            p = random()
            q = random()
            x = min(p, q)
            y = max(p, q)
            l1 = x
            l2 = y - x
            l3 = 1 - y
            mn = min(l1, l2, l3)
            mx = max(l1, l2, l3)
            md = 1 - mn - mx
            lsmall += mn
            llarge += mx
            lmid += md
            lcenter += l2  # middle stick (between the two cuts), l2 = y - x
            if l1 + l2 > l3 and l1 + l3 > l2 and l2 + l3 > l1:
                ntri += 1
            if mn**2 + md**2 == mx**2:  # exact float equality: effectively never
                nrtri += 1
        print 'min len:', lsmall/n
        print 'max len:', llarge/n
        print 'mid len:', lmid/n
        print 'center stick len', lcenter/n
        print 'triangle prob:', ntri/float(n)
        print 'right triangle prob:', nrtri/float(n)

    def prob3():
        niter = 100000
        n, ntri, nrtri = 0.0, 0, 0
        lsmall, llarge, lmid, lcenter = 0, 0, 0, 0
        while n < niter:
            n += 1
            p = random()
            q = random()
            l1 = 0.5 * (1 - p)
            l2 = 0.25 * (1 + p) * (1 - q)
            l3 = 0.25 * (1 + p) * (1 + q)
            mn = min(l1, l2, l3)
            mx = max(l1, l2, l3)
            md = 1 - mn - mx
            lsmall += mn
            llarge += mx
            lmid += md
            lcenter += (l2 + l3) / 2
            if l1 + l2 > l3 and l1 + l3 > l2 and l2 + l3 > l1:
                ntri += 1
            if mn**2 + md**2 == mx**2:
                nrtri += 1
        print 'min len:', lsmall/n
        print 'max len:', llarge/n
        print 'mid len:', lmid/n
        print 'center stick len', lcenter/n
        print 'triangle prob:', ntri/float(n)
        print 'right triangle prob:', nrtri/float(n)
May 1, 2014
Defective Balls
For problems 1-3, consider a weighing scale (with two sides) that only tells which side is heavier (it doesn't tell by how much).
Problem 1: You have 9 balls of the same weight, except that 1 ball is defective and is known to be lighter. Find the defective ball in two weighings.
Problem 2: You have 9 balls of the same weight, except that 1 ball is defective (it can be heavier or lighter; we don't know). Find the defective ball in three weighings.
Problem 3: You have 9 balls of the same weight, except that 2 balls are defective and are known to be lighter. Find the defective balls in four weighings.
Problem 4: There are 9 machines that produce balls. You get 9 balls from each machine during a production cycle. You are given a weighing scale that gives the exact weight. You are told that one machine produced defective balls in a given production cycle. Find the defective machine through a single weight measurement. Assume that a perfect ball weighs 1 lb and a defective ball weighs 1.1 lbs.
Problem 5: Consider problem 4, but now assume that the weight of a defective ball is unknown. Find the defective machine in two weighings.
Problem 6: There are two jars A and B. A contains 50 perfect balls and B contains 50 defective balls. Arrange the 100 balls in the two jars in a way that maximizes the probability of picking a perfect ball (by first picking a jar at random, then a ball from it).
Solution:
For problems 1-3, divide the 9 balls into 3 groups of 3 balls each.
Problem 1: Weigh group 1 against group 2. If they weigh the same, the defective ball is in group 3. Pick two balls from group 3 and weigh them. If they weigh the same, then the ball left out (from group 3) is defective. All other cases are also easy to see.
Problem 2: Weigh group 1 against group 2, then group 2 against group 3. After these two weighings you know which group contains the defective ball as well as whether that ball is heavy or light. Then simply pick two balls from the identified group and weigh them.
Problem 3: This problem has two cases. Case 1: one group contains both defective balls. Case 2: two groups contain one defective ball each. Weigh group 1 against group 2, then group 2 against group 3. After these two weighings you know which case holds and which groups contain the defective balls. Case 1 is now easy to solve. Consider case 2, say group 1 and group 2 each contain one defective ball. We can find the defective ball in each group INDEPENDENTLY in one weighing each (hence four weighings in total).
Problem 4: Let the defective machine be x. Pick 1 ball from machine 1, 2 from machine 2, and so on. The total weight would be n(n+1)/2 + 0.1x, where n = 9. Solve for x.
Problem 5: Form two equations. Let a defective ball weigh y. First pick 1 ball from machine 1, 2 from machine 2, and so on. We get the equation:
n(n+1)/2 - x(1-y) = A
Next pick 1 ball from machine n, 2 from machine n-1, and so on (machine x contributes n-x+1 balls). We get the equation:
n(n+1)/2 - (n-x+1)(1-y) = B
We have two equations and two unknowns. We get x = a(n+1)/(a+b), where a = n(n+1)/2 - A and b = n(n+1)/2 - B.
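A quick numeric check of problem 5 (my own sketch; the machine index and defective weight below are made up for illustration):

```python
def recover_defective_machine(n, true_x, y):
    """Simulate the two weighings for machine true_x with defective weight y
    (a perfect ball weighs 1), then recover x from A and B as derived above."""
    A = n * (n + 1) / 2 - true_x * (1 - y)            # 1, 2, ..., n balls
    B = n * (n + 1) / 2 - (n - true_x + 1) * (1 - y)  # n, n-1, ..., 1 balls
    a = n * (n + 1) / 2 - A
    b = n * (n + 1) / 2 - B
    return a * (n + 1) / (a + b)

print(recover_defective_machine(9, true_x=4, y=1.1))  # 4.0
```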
Problem 6: Leave one perfect ball in jar A and put the rest in jar B. The probability of picking a perfect ball is now (1/2)(1) + (1/2)(49/99) = 74/99, roughly 0.75. This is clearly the best we can do.
December 9, 2013
Palindrome Numbers
Problem 1: Given a number n, check whether it is a palindrome. For example, n = 303 is a palindrome, whereas n = 302 is not. Give an O(log n) algorithm.
Problem 2: Given a number n, find the largest palindrome that is less than or equal to n. For example, for n = 36, the answer is 33. Give an O(log n) algorithm.
Problem 3: Given a number n, find all the palindromes less than or equal to n. Give an efficient O(k) algorithm, where k is the number of such palindromes.
Problem 4: Prove that every palindrome with an even number of digits is divisible by 11. For example, 6996 is divisible by 11.
Problem 5: Consider a palindrome number generator as follows.
Step 1: n = n + reverse(n)
Step 2: If !is_palindrome(n): Goto Step 1
Given a palindrome number m, find the smallest n which can generate m using this generator. Is it possible that no such n exists?
For example, for m = 121 we get n = 19. To see this: n1 = 19 + 91 = 110, and in the next step n2 = 110 + 011 = 121.
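No solutions were posted for this one, but here is a minimal Python sketch of the Problem 5 generator, using the O(log n) digit-reversal check from Problem 1. The step cap is my own safeguard, since it is not obvious that the iteration always terminates.

```python
def reverse_num(n):
    r = 0
    while n > 0:
        r, n = r * 10 + n % 10, n // 10
    return r

def is_palindrome(n):
    return n == reverse_num(n)   # O(log n): one pass over the digits

def generated_palindrome(n, max_steps=1000):
    """Run the generator from n and return the first palindrome reached."""
    for _ in range(max_steps):
        n += reverse_num(n)
        if is_palindrome(n):
            return n
    return None  # did not converge within the step budget

def smallest_generator(m):
    """Smallest n whose generator run ends exactly at the palindrome m."""
    for n in range(1, m):
        if generated_palindrome(n) == m:
            return n
    return None

print(smallest_generator(121))  # 19, via 19 -> 110 -> 121
```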
December 1, 2013
Finding Relevant Keywords from Large Corpus of Text
Problem 1: Given a large paragraph containing n words (separated by spaces) and k keywords, find the smallest distance between these keywords in the paragraph.
To illustrate the problem, consider the paragraph A B A C E D O A B, and say we are interested in the two keywords A and D. The closest they appear is at positions 8 and 6 (D O A), and the minimum distance is 1 (i.e., the number of non-keyword words between them).
Build an efficient algorithm with O(k) space and O(n log k) running cost.
Solution:
To solve this problem, we register the last seen position of each of the k keywords as we read the paragraph word by word. If the next word matches a keyword, we update that keyword's last seen position and compute the minimum over all registered positions; that minimum gives the current span and helps us track the smallest distance.
To implement this efficiently, we need a hash map and a min-heap (a tree-based implementation, not array-based). Following is the code to solve this problem.
Let K[i] indicate the ith keyword and W[i] the ith word in the paragraph.

    // init map and heap
    map = {}
    heap = []
    for i in 1 to k
        // init heap with -Inf as position of the keyword
        node = heap_node()
        node.val = -Inf
        // map stores the keyword and a pointer to its node in the heap
        heap.put(node)
        map.put(K[i], node)
    end

    idx_start = 0
    idx_end = 0
    min_dist = Inf
    for i in 1 to n
        w = W[i]
        // check if w is one of the keywords; if not, ignore
        if !map.contains_key(w)
            continue
        end
        // get the heap node corresponding to the keyword and update its position
        node = map.get(w)
        node.val = i
        // re-heapify the min heap
        heap.heapify()
        // min of all positions
        idx = heap.min()
        // if idx == -Inf then some keyword has not been seen yet
        if idx == -Inf
            continue
        end
        // compute distance between keywords
        dist = i - idx - k + 1
        if dist < min_dist
            idx_start = idx
            idx_end = i
            min_dist = dist
        end
    end

The above algorithm requires O(k) memory for the map and the heap. For each word, it requires O(log k) to re-heapify the min heap.
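For reference, here is a simple Python version of the same idea (my own sketch). Instead of a tree-based heap it recomputes the minimum by scanning the k last positions, so it runs in O(nk) rather than O(n log k); that is enough to validate the design for small k.

```python
def min_keyword_distance(words, keywords):
    """Return (distance, start index, end index) of the closest span
    containing all keywords, or None if some keyword never appears."""
    last = {kw: None for kw in keywords}   # last seen position per keyword
    best = None
    for i, w in enumerate(words):
        if w not in last:
            continue
        last[w] = i
        if any(p is None for p in last.values()):
            continue                       # some keyword not seen yet
        idx = min(last.values())           # O(k) scan instead of a heap
        dist = i - idx - len(keywords) + 1
        if best is None or dist < best[0]:
            best = (dist, idx, i)
    return best

print(min_keyword_distance("A B A C E D O A B".split(), ["A", "D"]))
# (1, 5, 7) with 0-indexed positions
```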
Problem 2: Given a large paragraph of words (separated by spaces), consider N tuples of names and interests. If a person and their interest appear within K words of each other, it is a match; otherwise it is not. Find all matches in the paragraph.
To illustrate the problem, consider the paragraph "John likes swimming and Jack likes Tennis" and the interest map <john swimming>, <john tennis>, <jack swimming>, <jack tennis>. For K = 3, 4, 5, 6, the matches found in the paragraph are <john swimming>, <jack swimming>, <jack tennis>.
Propose an efficient algorithm for two cases: K is small, and K is very large.
August 12, 2012
Bulls and Cows
Bulls and Cows is a two-player game. The first player thinks of an N-digit secret number and the second player guesses it: the second player makes a guess, and the first player responds with the number of bulls and cows in that guess. A digit of the guess that matches the secret in the right position is a bull; a digit that matches but is in a different position is a cow. For example, let the secret be 0123 and the guess be 0012; the guess has 1 bull and 2 cows. The game continues until the second player gets a response of N bulls.
The problem is to write a computer solver for bulls and cows that would try to guess the secret making only a small number of attempts.
Solution:
Let D be the number of distinct digits that can appear in the secret, and let N be the length of the secret number. The total number of possible secrets is D^N.
A naive solution is to try all D^N guesses until N bulls is the response. This approach makes D^N attempts in the worst case. Not good.
We can do better. Let us make a random first guess and get a response r. We create a pruned set by selecting all numbers (out of the D^N possible) whose response against our guess is r. The secret must be in the pruned set. We pick a number from this set as our next guess and repeat the logic until we hear N bulls.
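A minimal Python sketch of the response computation and the pruning step (my own illustration; the names are hypothetical):

```python
from itertools import permutations

def response(secret, guess):
    """Return (bulls, cows) for a guess against a secret, both as strings."""
    bulls = sum(s == g for s, g in zip(secret, guess))
    matched = sum(min(secret.count(d), guess.count(d)) for d in set(guess))
    return bulls, matched - bulls

def prune(candidates, guess, resp):
    """Keep only the candidates consistent with the observed response."""
    return [c for c in candidates if response(c, guess) == resp]

# all N = 4, D = 10 secrets with distinct digits: 5040 candidates
candidates = [''.join(p) for p in permutations('0123456789', 4)]
print(response('0123', '0012'))  # (1, 2): 1 bull, 2 cows
```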
There are three ways to pick the next number:
a) Pick the first number from the pruned set.
b) Pick a number randomly from the pruned set.
c) Use a heuristic to pick a number (note: this number can be outside the pruned set).
Our experiments show that b) is better than a). For N = 4, D = 10 with distinct digits in the secret (5040 possible secrets) and an initial guess of 0123, the pick-first strategy makes 6.0051 attempts on average and 9 attempts in the worst case. Randomly picking from the pruned set makes 5.9136 attempts on average and 9 attempts in the worst case.
Note that the optimal strategy for N = 4, D = 10 with distinct digits in the secret takes 5.2131 attempts on average, with 7 attempts in the worst case (technical paper). It is also known that 6 attempts is a lower bound on the worst case for any algorithm. Since 5.2131 is the theoretical best, the 5.9136 achieved by random picking is not bad.
Let us now explore option c). There are (N+1)*(N+2)/2 - 1 response classes, ranging from (0 bulls, 0 cows) to (N bulls, 0 cows); the -1 is there because a response of N-1 bulls cannot come with 1 cow. The response to a guess partitions the D^N numbers (as shown in Fig 1).
Fig 1. Response partition of 10^4 numbers for the secret 0123
Now we know that the secret lies in the pruned set. We need to pick our next guess so that it partitions the pruned set in the best possible manner. One way to judge the goodness of a partition is information gain, sum_i n_i * log(n_i), where n_i is the size of the ith response class. The issue is that to compute the information gain, we need to compute responses for D^N * length(pruned set) pairs.
Note that the best guess can lie outside the pruned set; that is why we need to compute D^N * length(pruned set) responses and not length(pruned set)^2. To see this, consider the pruned set [089, 189, 289, 389, 489, 589, 689, 789]. If our secret is 789, guessing from the pruned set in order may need 8 guesses. But if we try 012 as guess 1, we get the pruned set [389, 489, 589, 689, 789]; then we try 345 as guess 2 to get the pruned set [689, 789], and two more guesses finish the game. So we solved it in 4 attempts instead of 8.
But the bigger issue is that this technique can only be applied when the pruned set is small; otherwise it is computationally prohibitive and memory intensive. Our strategy is as follows:
Step 1: Pick distinct digits to make the first guess.
Step 2: Select a random guess from the pruned set.
Step 3: Perform step 2 three or four times.
Step 4: Pick a guess using the information gain measure.
Step 5: Perform step 4 until the game is solved.

For N = 4, D = 10 (5040 possible secrets), this strategy makes 5.7201 attempts on average with 8 attempts in the worst case. A summary of the algorithms is as follows:
D=10, N=4, 5040 secrets

| algo | average #attempts | worst #attempts |
|---|---|---|
| pick first | 6.0051 | 9 |
| pick random | 5.9136 | 9 |
| heuristic pick | 5.7201 | 8 |
| optimal | 5.2131 | 7 |
D=10, N=4, 10000 secrets

| algo | average #attempts | worst #attempts |
|---|---|---|
| pick first | 8.0291 | 13 |
| pick random | 6.6323 | 11 |
| heuristic pick | 6.2001 | 8 |
Source code: bullscows.py
Technical resources: optimal strategy, bullscows.pdf
July 21, 2012
Number of Paths in a Rectangular Grid
Problem 1: Consider a rectangular grid of n x m cells. The problem is to find the number of shortest (monotonic, in this case) paths along the edges of the cells that start at (0,0) and end at (n,m). How do we print all such paths?
A monotonic path consists of moving up or right but not down or left. Figure 1 illustrates such a path for a grid of size 5 x 3.
Solution:
Figure 1
To go from (0,0) to (n,m), we need to make n moves to the right and m moves up. These moves can be performed in any order and we still reach (n,m). So the total number of paths is simply the number of ways n red balls and m green balls can be arranged in a line, which is C(n+m, m).
We can print all such paths using recursion. The following code does this.
    Function printPath(n, m, path)
        if n == 0 and m == 0
            print path
            return
        end
        if n > 0
            printPath(n-1, m, path + "right")
        end
        if m > 0
            printPath(n, m-1, path + "up")
        end
    end
Problem 2: Now we need to print all monotonic paths that do not go above the diagonal y = x. Figure 2 shows such a path. Note that the path in figure 1 goes above the diagonal, hence not desirable in this case. How many such paths exist for n x m grid (n >= m)?
Figure 2
Let us first discuss how to print all monotonic paths that do not go above the diagonal. For a path to be valid, at every point (x,y) on the path we must have y <= x. We can simply enforce this constraint while printing paths; the code appears after the counting argument below.
Now, in order to obtain the number of paths, we subtract the number of paths that cross the diagonal from the total number of paths, C(n+m, m).
Consider a path that crosses the diagonal, and let it cross the diagonal for the first time at point (x,y). Since this is the first point after crossing, y = x+1. The path has taken x right and y up moves so far, and must still take n-x right and m-y up moves to reach (n,m). If we flip the remaining moves, turning the n-x rights into ups and the m-y ups into rights, then starting from (x,y) we reach the point (x+m-y, y+n-x). Setting y = x+1, the resulting end point is (m-1, n+1). Since m <= n, this end point is above the diagonal. We can perform this surgery on every path that crosses the diagonal, putting them in one-to-one correspondence with the paths from (0,0) to (m-1, n+1). So the number of paths that cross the diagonal is C(n+m, m-1).
So the desired number is C(n+m, m) - C(n+m, m-1) = [(n-m+1)/(n+1)] * C(n+m, m). This is a proof based on André's reflection method.
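A quick numeric check of this identity (my own sketch):

```python
from math import comb

n, m = 5, 3
total = comb(n + m, m)                  # all monotonic paths: C(8,3) = 56
crossing = comb(n + m, m - 1)           # paths crossing the diagonal: C(8,2) = 28
print(total - crossing)                 # 28 paths on or below the diagonal
print((n - m + 1) * total // (n + 1))   # 28, matching [(n-m+1)/(n+1)] * C(n+m,m)
```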
Alternately:
André's reflection method was originally used to solve a very interesting problem called the ballot problem. The ballot problem states that in a two-candidate election, candidate A polled p votes and candidate B polled q votes (p > q). What is the probability that candidate A stayed ahead of candidate B throughout the counting?
Well the probability is simply (p-q)/(p+q).
We can use the ballot theorem to count the monotonic paths strictly below the diagonal (those that do not even touch it). Treat a right move as a vote for candidate A and an up move as a vote for candidate B, with n and m votes respectively. The probability that right moves stay ahead of up moves throughout is (n-m)/(n+m), so the number of monotonic paths strictly below the diagonal is [(n-m)/(n+m)] * C(n+m, m).
Now, since we are allowed to touch the diagonal, count the strictly-below paths to (n+1, m) instead. For such a path the first two moves must be right moves (otherwise the path would touch the diagonal); eliminating the first right move yields a path to (n, m) that may touch the diagonal but never cross it. So the total number of ways is [(n+1-m)/(n+1+m)] * C(n+m+1, m) = [(n-m+1)/(n+1)] * C(n+m, m).
    Function printPath(x, y, n, m, path)
        if x == n and y == m
            print path
            return
        end
        if x < n
            printPath(x+1, y, n, m, path + "right")
        end
        if y < m and y < x
            printPath(x, y+1, n, m, path + "up")
        end
    end

The total number of such paths is [(n-m+1)/(n+1)] * C(n+m, m). For n = m, this simplifies to [1/(n+1)] * C(2n, n), which is known as the Catalan number. The wiki page also contains other interesting problems that have the Catalan number as their solution.
Now, In order to obtain the number of paths, we need to subtract the number of paths that cross the diagonal from the total number of paths n+mCm.
Consider a path that crosses the diagonal. Let that path cross the diagonal for the first time at point (x,y). Since this is the first point after crossing the diagonal, so y should be equal to x+1. So the path has taken x right and y up moves. It is yet to take n-x right and m-y up moves to reach (n,m). If we flip the n-x right to n-x up and m-y up to right, then starting from point (x,y), we will reach the point (x+m-y,y+n-x). Setting y = x+1, we get that the resulting end point is (m-1,n+1). Since m <= n, the resulting end point is above the diagonal. So we can do surgery on all paths that cross the diagonal to make them reach (m-1,n+1). So the total number of paths that cross the diagonal must be the total number of paths from (0,0) to (m-1,n+1). This is n+mCm-1.
So the desired number is = n+mCm - n+mCm-1 = [(n-m+1)/(n+1)] * n+mCm. This is a proof based on Andre's reflection method.
Alternatively:
André's reflection method was originally used to solve a very interesting problem called the Ballot problem: in a two-candidate election, candidate A polled p votes and candidate B polled q votes (p > q). What is the probability that candidate A was leading candidate B throughout the counting?
The probability is simply (p-q)/(p+q).
We can use the Ballot theorem to count the monotonic paths that stay strictly below the diagonal (never even touching it). Treat a right move as a vote for candidate A and an up move as a vote for candidate B, with n and m votes polled respectively. The probability that the count of right moves stays strictly ahead of the count of up moves throughout is (n-m)/(n+m), so the number of monotonic paths strictly below the diagonal is [(n-m)/(n+m)] * C(n+m, m).
Since we are allowed to touch the diagonal, consider instead paths to (n+1, m). A strictly-below path to (n+1, m) must start with two right moves (an earlier up move would touch the diagonal); eliminating the first right move gives exactly a path to (n, m) that may touch, but never cross, the diagonal. So the total number of ways is [(n+1-m)/(n+1+m)] * C(n+m+1, m) = [(n-m+1)/(n+1)] * C(n+m, m).
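As a sanity check on this closed form, here is a small brute-force script, not from the original post but written in the spirit of the attached Python sources: it enumerates all monotonic paths that never rise above the diagonal and compares the count with [(n-m+1)/(n+1)] * C(n+m, m) for small n and m.

from itertools import combinations
from math import comb

def count_paths_at_or_below_diagonal(n, m):
    # A path is a choice of which of the n+m steps are "up" moves; keep
    # the paths that never rise above the diagonal (y <= x throughout).
    total = 0
    for ups in combinations(range(n + m), m):
        x = y = 0
        valid = True
        for step in range(n + m):
            if step in ups:
                y += 1
            else:
                x += 1
            if y > x:
                valid = False
                break
        if valid:
            total += 1
    return total

for n in range(1, 6):
    for m in range(n + 1):
        # (n-m+1) * C(n+m, m) is always divisible by (n+1)
        assert count_paths_at_or_below_diagonal(n, m) == (n - m + 1) * comb(n + m, m) // (n + 1)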
Problem 3: Now let us consider that apart from up and right moves we can also take up-right (diagonal) moves. Figure 3 shows such a path. How many such paths exist for an n x m grid?
Figure 3
This problem is similar to problem 1, except that paths can contain diagonal moves. One diagonal move consumes 1 right and 1 up move.
So let us assume that we make r diagonal moves, where r <= min(n,m). In this case, we need to make n-r right moves and m-r up moves apart from the r diagonal moves to reach (n,m). The number of ways to do this is (n+m-r)! / [(n-r)! * (m-r)! * r!]. Setting r = 0 recovers the solution to problem 1. Since r can range from 0 to min(m,n), the total number of ways is:
Sum over r in [0, min(m,n)] of (n+m-r)! / [(n-r)! * (m-r)! * r!]

Alternately, we can derive a recurrence relation for the total number of paths. Let D(n,m) be the total number of paths from (0,0) to (n,m) with up, right and diagonal moves. Clearly, we can reach (n,m) from (n-1,m) by a right move, from (n,m-1) by an up move, and from (n-1,m-1) by a diagonal move. These are the only three atomic steps. So D(n,m) = D(n-1,m) + D(n,m-1) + D(n-1,m-1) for n,m > 0. D is known as the Delannoy number.
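The sum and the recurrence can be cross-checked against each other with a few lines of Python (an illustrative sketch; the function names are mine):

from functools import lru_cache
from math import factorial

def delannoy_sum(n, m):
    # Closed form: sum over r, the number of diagonal moves.
    return sum(factorial(n + m - r) // (factorial(n - r) * factorial(m - r) * factorial(r))
               for r in range(min(n, m) + 1))

@lru_cache(maxsize=None)
def delannoy_rec(n, m):
    # D(n,m) = D(n-1,m) + D(n,m-1) + D(n-1,m-1), with a single path along the axes.
    if n == 0 or m == 0:
        return 1
    return delannoy_rec(n - 1, m) + delannoy_rec(n, m - 1) + delannoy_rec(n - 1, m - 1)

assert all(delannoy_sum(n, m) == delannoy_rec(n, m) for n in range(8) for m in range(8))
print(delannoy_sum(3, 3))  # 63, the central Delannoy number D(3,3)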
The code to print all paths has a small addition to the code for problem 1.
Function printPath(n, m, path)
    if n == 0 and m == 0
        print path
        return
    end
    if n > 0
        printPath(n-1, m, path + "right")
    end
    if m > 0
        printPath(n, m-1, path + "up")
    end
    if n > 0 and m > 0
        printPath(n-1, m-1, path + "diagonal")
    end
end

We can extend this problem further by asking for the number of paths that do not cross the diagonal (the same constraint as in problem 2). The number of paths in this case is called the Schroder number.
Source Code: grid_paths.py
July 9, 2012
Finding Top K Elements in Large Dataset
Problem: Consider the problem of finding top K frequent numbers from a file with N numbers. Now consider that N is very large such that it cannot fit in the memory M available to the program. How do we find the top K frequent elements now with the assumption that K < M < N.
Deterministic solution: show
Single pass probabilistic solution: show
The idea is similar to solving the problem for small N (N < M) (see here). For large N, divide the problem into chunks of size <= M, then solve it.
In order to divide the problem, consider a uniform hashing function H that takes a key and returns an integer from the set {1, 2, ..., ceil(N/M)}. This gives about N/M chunks with roughly M numbers per chunk; since the number of unique numbers (U) is less than N, we expect each chunk to contain fewer than M unique numbers.
Now, for each chunk, we compute the frequency of the numbers contained in that chunk. This frequency computation can be done in memory, and we can maintain a MIN-HEAP of size K which can be updated directly (following the steps presented here). As a result, only two reads of the dataset and one disk write are required to get the top K frequent items. The complexity of the algorithm is O(N log K).
For a small K (K < M^2/N), we can instead compute the top K per chunk in O(K log M) and combine the top K lists of all the chunks (N*K/M < M numbers in total) to get the overall top K. The total complexity of this variant is O(N + M log M).
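The chunking scheme might be sketched as follows in Python. This is a rough illustration under the assumption that the input is a plain text file with one number per line; the chunk file names and all identifiers are mine, not from the original post.

import heapq
from collections import Counter

def top_k_large(filename, K, num_chunks):
    # Pass 1: partition the numbers into chunk files by a hash of the key,
    # so all occurrences of a number land in the same chunk.
    chunks = [open("chunk_%d.tmp" % i, "w") for i in range(num_chunks)]
    with open(filename) as f:
        for line in f:
            x = line.strip()
            chunks[hash(x) % num_chunks].write(x + "\n")
    for c in chunks:
        c.close()

    # Pass 2: count each chunk in memory and fold all counts into one
    # global min-heap of size K, exactly as in the small-N version.
    heap = []  # entries are (count, number)
    for i in range(num_chunks):
        with open("chunk_%d.tmp" % i) as c:
            counts = Counter(line.strip() for line in c)
        for x, cnt in counts.items():
            if len(heap) < K:
                heapq.heappush(heap, (cnt, x))
            elif cnt > heap[0][0]:
                heapq.heapreplace(heap, (cnt, x))
    return sorted(heap, reverse=True)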
Alternate method:
We assume that the disk has a huge capacity, which allows us to build a disk-based hash table. Imagine that we create a very large file with holes and divide it into B blocks, each with the capacity to hold more than N/B numbers and their integer counts. Using a uniform hash function H, which takes an arbitrary number x and returns H(x), the block number from the set {0, 1, ..., B-1}, we can store the numbers and their frequencies in this disk map. H gives us the offset to seek to within the file, and we write each number (and its frequency) sequentially within the correct block (or place it via another hash function H1).
Now we proceed as usual: start counting the numbers in memory using a hash table (as in the former approach). When we encounter a new number that cannot be put in memory, we purge some entries from the hash table, writing the purged entries to the disk map. To decide what to purge, we maintain an array that counts the frequency of the keys that lie within each block. Keys that belong to the most infrequent blocks can be purged first (or blocks that are least recently used, or that lead to the least disk access, etc.).
The following code gives a very rough sketch of this approach:
BlockFreq = array(B)
NumberFreq = hashtable(M)
diskwrite = 0
for i = 1:N
    x = A[i]
    BlockFreq[H[x]] += 1
    if NumberFreq.haskey(x)
        NumberFreq[x] += 1
        continue
    end
    if NumberFreq.hasspace()
        NumberFreq[x] = 1
        continue
    end
    if DiskMap.haskey(x)
        DiskMap[x] += 1
    else
        DiskMap[x] = 1
    end
    if diskwrite == 10
        purge(NumberFreq, BlockFreq)
        diskwrite = 0
    else
        diskwrite += 1
    end
end

Here purge is a procedure to purge some set of keys from NumberFreq based on the BlockFreq. Note that this code omits several key details of this process, so the idea presented here is quite crude.
Single pass probabilistic solution: show
Solution 1 is quite efficient as it requires only two disk reads of the dataset, but the bottleneck can be the disk writes during the initial chunk formation. We can reduce that bottleneck by considering a data structure called the Bloom filter.
So consider that we have B uniform hash functions H1, H2, ..., HB, where each hash function maps a key into the range {1, 2, ..., R}. Now imagine an array C of size B x R (< M) that represents how many times each key has been seen. For each number x read from the dataset, compute Hi[x] for every i and increment C[i, Hi[x]] by 1, so that we maintain B counts of x in different R-sized rows of buckets. The true count of x is then at most min(C[1, H1[x]], ..., C[B, HB[x]]).
Now if the query is to get all the elements with frequency greater than some threshold then we can use bloom filters to get all such numbers (with some false positives though, which can be filtered using another pass on the dataset). This can save a complete write of the data to the disk. (see the paper: Computing Iceberg Queries Efficiently).
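The B x R array of counters described above is essentially what is now called a count-min sketch. A minimal sketch of it in Python (all names are mine; seeded hashing stands in for the B uniform hash functions):

import random

class CountMin:
    # B rows of R counters; the estimate is the minimum over the rows,
    # which can over-count because of collisions but never under-counts.
    def __init__(self, B, R):
        self.R = R
        self.seeds = [random.randrange(1 << 30) for _ in range(B)]
        self.C = [[0] * R for _ in range(B)]

    def _bucket(self, seed, x):
        return hash((seed, x)) % self.R

    def add(self, x):
        for i, s in enumerate(self.seeds):
            self.C[i][self._bucket(s, x)] += 1

    def estimate(self, x):
        return min(self.C[i][self._bucket(s, x)] for i, s in enumerate(self.seeds))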
But in our case, we are interested in finding the top K frequent numbers. The following modification can be used to estimate the frequency of each number:
MH = MIN-HEAP(K)
for i = 1:N
    x = A[i]
    for b = 1:B
        C[b,Hb(x)] += Sb(x)
    end
    if contains(MH, x)
        increment count of x in the heap
    else
        f = median(C[i,Hi(x)] * Si(x), for all i)
        if f > min(MH)
            remove-min(MH)
            insert(MH, (x,f))
        end
    end
end

The Sb functions are {-1,+1} hash functions, and this data structure is called a CountSketch. More details of the method are available in the paper: Finding Frequent Items in Data Streams.
Note that the above algorithm takes a single pass over the dataset (and needs no disk writes), but it is not guaranteed to give the top K frequent items; it can make mistakes for some of the less frequent items. In practice, the choice of the hash functions is critical to the performance of the algorithm.
July 3, 2012
Finding Top K Frequent Items
Problem: Consider a file containing N numbers. For ex. {2,3,4,3,1,78,78,3} is a file containing N=8 numbers. Assume that N is small enough to fit in the computer's memory M, so N << M. Now we need to find the top K frequent numbers in the file. For the above example, the output should be {3,78} for K = 2.
Running Time: The time complexity should be O(N), O(N log K), or O(N + K log N).
Solution: show
Finding top K frequent elements is a classical database and data streaming problem and there are several solutions to it ranging from deterministic to probabilistic. I'll illustrate the three obvious ones here.
The first step is to count how many times each number appears in the file. If the file is pre-sorted then we need a single scan over the file.
Function COUNTS:
    A = load_file_in_array
    last = A[1]
    ctr = 1
    for i = 2:N
        if last == A[i]
            ctr += 1
        else
            CountMap{last} = ctr
            last = A[i]
            ctr = 1
        end
    end
    // required for the last element of the array
    CountMap{last} = ctr
end

Note that CountMap is a hashmap that stores the counts of all the elements. The procedure COUNTS is quite efficient if the file is pre-sorted. If the file is not pre-sorted, then sorting increases the time complexity to O(N log N). In that case, we can do better by directly using a hashmap to maintain the current count of each number, without sorting the file, as follows:
Function EFFICIENT_COUNTS:
    A = load_file_in_array
    for i = 1:N
        CountMap{A[i]} += 1
    end
end

The above procedure obtains the counts of all elements in a single scan of the file, hence it runs in O(N) time. So now we have all the numbers along with their frequencies in an array (extracted by enumerating all keys of the CountMap, or by another scan of the file!!). For the example in the problem statement, we get the tuples: {(2,1), (3,3), (4,1), (1,1), (78,2)}.
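For reference, Python's collections.Counter performs the same single-scan counting as EFFICIENT_COUNTS:

from collections import Counter

A = [2, 3, 4, 3, 1, 78, 78, 3]
CountMap = Counter(A)          # one O(N) scan, no sorting required
print(list(CountMap.items()))  # [(2, 1), (3, 3), (4, 1), (1, 1), (78, 2)]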
The problem that remains is to find the most frequent K numbers from the array. The naive approach would be to sort numbers on their frequencies and pick the top K. This would take O(U log U) time where U (=5) is the number of unique elements in the array. If we consider U = O(N) then the time complexity of this sorting is O(N log N). We can do better than that as follows:
Approach 1: O(N) time
Use selection algorithm to find the Kth most frequent number (on the second element of the tuple) using the Selection Algorithm in O(U) time. The Kth most frequent element partitions the array in two parts: first part containing top K most frequent elements and second part containing bottom U-K-1 frequent elements. So we get the top K most frequent elements in no particular order in O(N) time (assuming U = O(N)). They can be sorted in O(K log K) if needed. Note that although this approach runs in O(N) time, the constants hidden in the O-notation can be large. So in practice this approach can be slower than the two approaches described below.
Approach 2: O(N log K) time
Pick first K tuples and put them on MIN-HEAP, where a tuple (x,y) is less than a tuple (a,b) if y is less than b. The time complexity to make the min heap of size K is O(K).
Then pick the remaining U - K elements one by one. If the picked element is smaller than the minimum on the heap, discard it. Otherwise, remove the min element from the heap and insert the picked element. This keeps the heap at exactly K elements; the delete-insert operation is O(log K) per element.
Once we are done picking all the elements, the elements that finally remain in the min-heap are the top K frequent items which can be popped in O(K log K) time. The overall cost of this approach is O(K + (U-K) log K + K log K) = O(K + U log K). Since K < U and U = O(N), we get the time complexity of O(N log K).
Approach 3: O(N + K log N) time
This approach is similar to approach 2 but the main difference is that we make a MAX-HEAP of all the U elements. So the first step is to make the max heap of all the elements in O(U). Then remove the maximum element from the heap K times in O(K log U) time. The K removed elements are the desired most frequent elements. The time complexity of this method is O(U + K log U) and by setting U = O(N) we get O(N + K log N).
Let us stop for a moment and contrast approach 2 with approach 3. For simplicity, let T2 = K + N log K be the time complexity of approach 2 and T3 = N + K log N the time complexity of approach 3. The figure below plots the ratio T2/T3 for N=100 and different values of K. We observe that approach 3 is considerably better for small values of K, whereas approach 2 is better for large values of K. Though the actual difference depends on the constants involved, we can still see the merit of one approach over the other.
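To make approaches 2 and 3 concrete, here is a small Python sketch (my own illustration; heapq.nlargest internally uses essentially the size-K min-heap strategy of approach 2):

import heapq
from collections import Counter

def top_k_minheap(A, K):
    # Approach 2: a size-K min-heap over (number, frequency) tuples.
    return heapq.nlargest(K, Counter(A).items(), key=lambda t: t[1])

def top_k_maxheap(A, K):
    # Approach 3: heapify all U tuples (negated counts simulate a max-heap)
    # in O(U), then pop the maximum K times in O(K log U).
    heap = [(-cnt, x) for x, cnt in Counter(A).items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(K)]

A = [2, 3, 4, 3, 1, 78, 78, 3]
print(top_k_minheap(A, 2))  # [(3, 3), (78, 2)]
print(top_k_maxheap(A, 2))  # [3, 78]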
March 26, 2010
Problems solvable using Hashtable
Hashtables are an extremely useful data structure, as they provide storage and retrieval in O(1) amortized time. Several algorithmic problems that would otherwise be quite expensive can be solved very efficiently using hashtables. In this post, we consider some of these problems:
Problem 1: Remove duplicate elements from an unsorted array of size N.
Solution: show
Let's first see the solution without a hashtable. The naive solution is to compare each pair of elements; if they are the same, drop one of them. This takes O(N^2) time in the average case and O(N) time in the best case, when all the elements are the same. The following code does that:
for i = 1 to N
    if a[i] == NaN: continue
    print a[i]
    for j = i+1 to N
        if a[i] == a[j]: a[j] = NaN

The above code prints all non-duplicates from the array. The naive solution is O(N^2) in the average case. In practice, if you believe that the array consists of a few numbers duplicated many times, say K distinct numbers each duplicated N/K times, then the running time of the above algorithm is O(KN).
We can do better than worst-case O(N^2) time: if we sort the elements in O(N log N) time and then compare adjacent elements in O(N), we solve the above problem in O(N log N) time.
sort(a)   // O(N log N) time, O(1) space using Heap Sort
k = a[1]
for i = 2 to N
    if a[i] == k: a[i] = NaN
    else: k = a[i]

Using heap sort, the above algorithm runs in O(N log N) time in the worst case and takes O(1) extra space (apart from the N for the array). Using a hashtable, we can solve this problem in O(N) amortized time and O(N) space (for the hash table). The algorithm looks like the following:
for i = 1 to N
    if hash.has_key(a[i]) == false:
        print a[i]
        hash.put(a[i])

The above algorithm stores each unseen element in the hashtable and ignores elements already seen. The hashtable provides storage and retrieval in O(1) time, so checking whether an element has been seen before is constant time.
Problem 2: Find intersection of K unsorted array of N elements each. Intersection consists of elements that appear in all the K arrays.
Problem 3: How to make a linked list support operations in O(1) time. The operations on linked list can be insertion after any arbitrary valued node, deletion of any arbitrary valued node.
Problem 4: Find all unique pairs of elements in an array that sum to S. For ex. if array = {2,4,6,4,6} and S = 8 then the answer is {(2,6), (4,4)} (a hashtable sketch for this one is given after the problem list).
Problem 5: Consider an array containing unique elements. Find a triplet of elements in the array that sum to S (extension of problem 4). Can hash-tables improve the running time of your algorithm.
Problem 6: Consider two strings of size M, N. Perform string matching in size O(M+N).
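For Problem 4, a one-pass hashtable solution might look like the following Python sketch (my own illustration, not part of the original post):

def pairs_with_sum(A, S):
    # One pass with a hash set of values seen so far; a second set
    # de-duplicates the reported pairs.
    seen, result = set(), set()
    for x in A:
        if S - x in seen:
            result.add((min(x, S - x), max(x, S - x)))
        seen.add(x)
    return result

print(pairs_with_sum([2, 4, 6, 4, 6], 8))  # {(2, 6), (4, 4)}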
March 19, 2010
Dragon and Knight
Problem: A dragon and knight live on an island. This island has seven poisoned wells, numbered 1 to 7. If you drink from a well, you can only save yourself by drinking from a higher numbered well. Well 7 is located at the top of a high mountain, so only the dragon can reach it.
One day they decide that the island isn't big enough for the two of them, and they have a duel. Each of them brings a glass of water to the duel, they exchange glasses, and drink. After the duel, the knight lives and the dragon dies.
Why did the knight live? Why did the dragon die?
Solution: show
The dragon knows that the knight can't reach well 7. So he reasons that if he gives the knight water from well 6 or well 7, the knight will die for sure, and he fetches the knight's glass from well 6 or 7.
The knight knew that no matter which well his glass came from, the dragon would go and drink from well 7 afterwards and live. So he brought plain water for the dragon and, before the duel, drank from well 1 himself.
When they exchanged glasses and drank, the dragon rushed to well 7 and drank from it, thinking it would cure the poison he had just been given; but he had been given plain water, so the poison of well 7 killed him. The knight, on the other hand, was already poisoned by well 1, so what the dragon gave him effectively cured him. So he lived.
Problem: Now consider that the dragon and the knight are equally intelligent. Who is expected to die?
Answer: Both survive the battle.
Solution: show
After the battle, the dragon drinks from well 1 to guarantee that he is poisoned, and then drinks from well 7 to cure himself. Whether the knight gave him plain water or poison from wells 1-6, the dragon is now cured and survives.
As for the knight, he drinks from well 1 before the duel. If the dragon's glass contains poison, it cures him. But if the dragon's glass was plain water, he must still make sure to cure himself with poison from a well higher than 1. So after the battle, the knight drinks from well 1 again, which guarantees that he is poisoned by well 1, and then drinks from well 2 to cure himself.
So both the dragon and the knight survive the battle.
March 18, 2010
Number and Age of David's Kids
Problem: The product of the ages of David's children is the square of the sum of their ages. David has fewer than eight children. No two of his children have the same age. None of his children is more than 14 years old, and all of his children are at least two years old. How many children does David have, and what are their ages?
Answer: (12,6,4,2)
Solution: show
One way to analyze this problem mathematically is to write the general form of the equations and maximize with respect to the largest age; that leads to a Fibonacci-like number series. But after that, complex heuristics have to be employed to arrive at the solution (and we cannot guarantee that it is unique).
This is a constraint satisfaction problem, and the best way to solve it is to write code. The following code solves the problem in a general way:
Function find_ages(kidAges)
    s = sum(kidAges)
    p = product(kidAges)
    if s*s == p
        print kidAges
    l = length(kidAges)
    if l >= 7
        return
    else if l == 0
        x = 14
    else
        x = kidAges[l]
    end
    for i = 2 to x-1
        kidAges1 = clone(kidAges)
        kidAges1.add(i)
        find_ages(kidAges1)
    end
end

// call
find_ages([])
After running the code, we see the following solution
[12, 6, 4, 2]

Note that in the above pseudo code we cloned the kidAges array. If we don't clone, the code won't work properly; the downside of cloning is that it uses a huge amount of memory. How do we write the code without cloning? Check the python script attached below.
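One possible no-clone version (a sketch of the idea only, not the attached script itself) reuses a single list and backtracks with append/pop:

def find_ages(max_kids=7, max_age=14, min_age=2):
    ages = []

    def recurse(bound):
        s, p = sum(ages), 1
        for a in ages:
            p *= a
        if ages and s * s == p:
            print(ages)
        if len(ages) >= max_kids:
            return
        for a in range(min_age, bound):
            ages.append(a)
            recurse(a)   # ages stay strictly decreasing, avoiding permutations
            ages.pop()   # undo, so the same list is reused across calls

    recurse(max_age + 1)

find_ages()  # prints [12, 6, 4, 2]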
Source Code: david_kids_ages.py
March 11, 2010
Largest Sum of Consecutive Numbers
Problem: Given an array of N integers (both positive and negative), find the contiguous sub-sequence with the largest sum.
For ex: Let A = {1 2 -5 4 5 -1 2 -11} then largest sum is 10 (start = 4, end = 7)
Solution: show
The naive solution to this problem can be coded in O(N^3) time: consider every possible pair of start and end indices, calculate the sum of the elements between them, and keep track of the max sum. The following code does that:
max = -Inf
imax = jmax = -1
for i = 1 to N
    for j = i to N
        // get sum of elements enclosed in index (i,j)
        s = 0
        for k = i to j
            s += A[k]
        if s > max
            max = s
            imax = i
            jmax = j

Clearly the above algorithm is O(N^3). We can definitely improve its running time. Observing carefully, the operation that runs many times (O(N^2) times) is computing the sum of the elements between indices i and j. Can we do it quickly? Indeed we can.
If we save the cumulative sum of elements in an array CSUM, then CSUM[i] indicates the sum of elements enclosed in indices (1,i), CSUM[j]-CSUM[i-1] would then indicate the sum of elements enclosed in indices (i,j). The following algorithm captures this intuition.
csum[0] = 0
for i = 1 to N
    csum[i] = csum[i-1] + A[i]

max = -Inf
imax = jmax = -1
for i = 1 to N
    for j = i to N
        s = csum[j] - csum[i-1]
        if s > max
            max = s
            imax = i
            jmax = j

This O(N^2) algorithm is the best we can do with this approach. The problem can, however, be solved in O(N) time using a clever observation. Consider a running sub-sequence with sum s > 0, and let the next element n be such that s + n < 0. This prefix cannot be part of the largest sub-sequence containing s and n, since removing the prefix (whose sum is negative) from any such sub-sequence yields a larger sum; this is easy to prove by contradiction. So whenever the running sum drops below zero, we can abort the current sub-sequence and start afresh. This idea is captured in the following O(N) algorithm:
max = -Inf
imax = jmax = -1
i_temp = 1
s_temp = 0
for i = 1 to N
    s_temp += A[i]
    if s_temp > max       // keep track of max so far
        max = s_temp
        imax = i_temp
        jmax = i
    if s_temp < 0         // abort this sub-sequence
        s_temp = 0
        i_temp = i + 1
Problem: Given an array of N integers (both positive and negative), find the sub-sequence with largest absolute sum. This problem differs from the one at the top as we want the absolute sum (taking mod)
For ex: Let A = {1 2 -5 4 5 -1 2 -11} then largest absolute sum is 11.
Solution: show
Simply keep track of both the max and the min (the most negative sum) at the same time in the above O(N) solution.
If we must use the above algorithm without modification, we can take these steps to get the max absolute sum in O(N) time:
1. Run the O(N) algorithm on array A to get the max sum
2. Create array B by multiplying all numbers in A by -1
3. Run the O(N) algorithm on array B to get the max sum
4. Pick the larger of the two outputs
Instead of trying the above two methods, we can try a very nice and simple approach
// build csum as before
csum[0] = 0
for i = 1 to N
    csum[i] = csum[i-1] + A[i]

answer = max(csum) - min(csum)

The answer is the maximum element of csum minus the minimum element of csum.
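Both methods can be checked against the example in a few lines of Python (illustrative only):

A = [1, 2, -5, 4, 5, -1, 2, -11]

def max_abs_sum_csum(A):
    # max absolute sum = max(csum) - min(csum), with the empty prefix 0 included
    csum, s = [0], 0
    for x in A:
        s += x
        csum.append(s)
    return max(csum) - min(csum)

def max_abs_sum_two_runs(A):
    # run the O(N) max-sum scan on A and on -A, then take the larger
    def max_sum(B):
        best = s = 0
        for x in B:
            s = max(0, s + x)
            best = max(best, s)
        return best
    return max(max_sum(A), max_sum([-x for x in A]))

print(max_abs_sum_csum(A), max_abs_sum_two_runs(A))  # 11 11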
February 3, 2010
Polynomial Evaluation
Problem: Given a polynomial with degree bound of n=m^r. How do we evaluate the polynomial at n different points such that the polynomial can be recovered from these n points (i.e. the coefficients can be calculated back using the points).
Note: We do not actually want to recover the coefficients; the task is to choose the evaluation points cleverly, as that choice impacts the running time of the algorithm.
Solution: show
Let the polynomial be A(x) = a0 + a1 x + a2 x^2 + ... + a(n-1) x^(n-1).
A(x) has a degree bound of n. Its actual degree can be lower than n-1 if a(n-1) equals zero but that doesn't matter here.
Using Horner's rule, we can compute a polynomial in O(n) time by arranging the terms as follows:
A(x) = a0 + x (a1 + x (a2 + ... + x (a(n-2) + x a(n-1))))
So it would take O(n^2) time to compute the polynomial at n different points. But we can use divide and conquer to do better, i.e., O(n lg n). We split A into m different polynomials such that every term of the ith polynomial has degree mod m = i.
=> A(x) = [ a0 + a(m) x^m + a(2m) x^(2m) + ... ] +
          [ a1 x + a(m+1) x^(m+1) + a(2m+1) x^(2m+1) + ... ] +
          ...
          [ a(m-1) x^(m-1) + a(2m-1) x^(2m-1) + ... ]
=> A(x) = A[0](y) + x A[1](y) + ... + x^i A[i](y) + ... + x^(m-1) A[m-1](y)
where y = x^m, and
A[i](x) = a(i) + a(m+i) . x + ... + a(n-m+i) . x^(n/m-1)
Still, the recurrence for evaluating the polynomial at a single point is T(n) = m T(n/m) + O(m), which is O(n). So evaluating the polynomial at n points still leads to O(n^2), which is not great.
Now we can choose the n points cleverly. Let w be a principal nth root of unity.
=> w^n = 1
=> w = e^(2 . pi . i / n)
= cos (2 . pi /n) + i . sin(2 . pi / n)
The two interesting properties of w are
1. w^(n/2) = -1
2. [w^(kn/m + p)]^m = w^(kn + mp) = w^(mp)
So the n points at which the polynomial is evaluated are w^0, w^1, ..., w^(n-1). The lower-order polynomials of size n/m then need to be evaluated at only the first n/m powers of w (raised to the mth power), because after that the values cycle and we can reuse the computation. This comes from the observation that
A(w^p) = A[0](w^(mp)) + w^p A[1](w^(mp)) + ... + w^(p(m-1)) A[m-1](w^(mp))
Also,
A(w^(kn/m+p)) = A[0](w^(mp)) + w^(kn/m+p) A[1](w^(mp)) + ... + [w^(kn/m+p)]^(m-1) A[m-1](w^(mp))
So A[0], ..., A[m-1] are evaluated at the points w^(mp) for p = {0, 1, ..., n/m - 1}, and that computation is reused for all the other powers of w.
The general recurrence of this algorithm is:
T(n) = m T(n/m) + O(n)
=> T(n) = O(n lg n)
m = 2 works well in practice and is employed by the Fast Fourier Transform in order to interpolate the coefficients of a polynomial. This is generally used to multiply two polynomials into a higher-order polynomial: first evaluate the two lower-order polynomials at n points, then multiply the point values, and finally use the products to generate the coefficients of the higher-order polynomial.
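For m = 2 this scheme is exactly the recursive FFT. A minimal Python sketch (my own illustration, assuming the number of coefficients is a power of two):

import cmath

def fft(a):
    # Evaluate the polynomial with coefficient list a at the n-th roots of
    # unity, using the m = 2 split described above.
    n = len(a)
    if n == 1:
        return a
    even = fft(a[0::2])   # A[0](y): terms with degree mod 2 == 0
    odd = fft(a[1::2])    # A[1](y): terms with degree mod 2 == 1
    out = [0] * n
    for p in range(n // 2):
        w = cmath.exp(2j * cmath.pi * p / n)
        out[p] = even[p] + w * odd[p]             # A(w^p)
        out[p + n // 2] = even[p] - w * odd[p]    # uses w^(n/2) = -1
    return out

# A(x) = 1 + 2x + 3x^2, padded to degree bound 4, evaluated at 1, i, -1, -i:
print(fft([1, 2, 3, 0]))  # [(6+0j), (-2+2j), (2+0j), (-2-2j)]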
February 15, 2009
Longest Monotone Sequence and Palindromes
Problem: Given an array of integers, find the longest subsequence of elements which monotonically increases. for ex. array = {1 4 8 2 5 7 3 4 6}, the longest subsequence = {1 2 3 4 6}
Solution: show
Let us construct the solution using the LCS problem. LCS is the longest common subsequence: given two strings X and Y, find the longest subsequence common to both. For ex. if X = {a b c a b c} and Y = {b a c b a c}, then a possible LCS(X, Y) = {b c b c}. LCS can be solved using dynamic programming in O(n^2) time.
We can solve the longest monotonically increasing sequence problem using the solution to LCS. Let X = original string. Let Y = sort X (increasing order). LCS(X, Y) gives the desired answer (easy to observe).
There is a faster O(n log n) algorithm (for details check here)
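For reference, one way to code the O(n log n) algorithm in Python using bisect (the patience-sorting idea; my own sketch, not the linked write-up):

from bisect import bisect_left

def longest_increasing_subsequence(A):
    # tails[k] is the smallest possible tail of an increasing subsequence
    # of length k+1; parent links let us rebuild one longest subsequence.
    if not A:
        return []
    tails, tails_idx, parent = [], [], [None] * len(A)
    for i, x in enumerate(A):
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
            tails_idx.append(i)
        else:
            tails[k] = x
            tails_idx[k] = i
        parent[i] = tails_idx[k - 1] if k > 0 else None
    out, i = [], tails_idx[-1]
    while i is not None:
        out.append(A[i])
        i = parent[i]
    return out[::-1]

print(longest_increasing_subsequence([1, 4, 8, 2, 5, 7, 3, 4, 6]))  # [1, 2, 3, 4, 6]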
Problem: Given a string, find the longest size palindrome in the string. for ex. string = "aaabbccaccbaaa", solution = "bccaccb"
Solution: show
The trivial solution is to check whether a palindrome of each possible size starts at each index: for an index i in the array, the possible end positions to test are i < j <= n, and each check is linear, giving O(n^3) total time. Using the Longest Common Substring (of the string and its reverse) with suffix trees, this problem can be solved in O(n) time.
January 31, 2009
Loops in Linked List
Problems with linked lists mainly occur while building, or when using, a bad memory manager. Let us discuss two well-known linked list problems.
Problem: Find if two linked lists short at some node. Find the node at which the two lists short.
Solution: show
Subtract the length of the smaller list from the larger one. Move a pointer ahead on the larger list by the difference in lengths, and put another pointer at the head of the smaller list. Now move the two pointers together; the node at which they meet is the node where the lists short. The following code gives the idea:
d = length(head1) - length(head2)
l1, l2 = head1, head2
if d < 0
    d *= -1
    l1, l2 = head2, head1
ptr1 = l1
for i <- 1 to d
    ptr1 = ptr1->next
ptr2 = l2
while ptr1 != null
    if ptr2 == ptr1
        return ptr1
    ptr1 = ptr1->next
    ptr2 = ptr2->next

We can use this idea of finding the shorted node in the solution to the next problem.
Problem: A linked list might contain a loop. How do we detect existence of the loop and find the node from which loop starts. Propose an O(n) time algorithm that doesn't take extra space and doesn't modify the linked list.
Solution: show
To find if the linked list contains a loop, run two pointers over the linked list: one that moves one node at a time and another that moves two nodes at a time. The two pointers meet if the linked list contains a loop; otherwise the pointer moving at twice the speed reaches the end while the slower pointer is midway.
fastptr = head
slowptr = head
while fastptr != null and fastptr->next != null
    fastptr = fastptr->next->next
    slowptr = slowptr->next
    if fastptr == slowptr
        return "loop exists"
return "loop doesnt exist"

If a loop exists, the fast pointer meets the slow pointer before the slow pointer completes one rotation of the loop (this fact can be easily proved). Once the pointers meet, we can count the number of nodes in the loop:
loopNodeCount = 0
do
    loopNodeCount += 1
    slowptr = slowptr->next
while slowptr != fastptr
return loopNodeCount

Now, using the trick from the shorted-lists problem above, we start two pointers, one from the head and the other advanced "loopNodeCount" nodes ahead. The node at which both pointers meet is the node where the loop starts.
advptr = head
for i <- 1 to loopNodeCount
    advptr = advptr->next
normalptr = head
while normalptr != advptr
    normalptr = normalptr->next
    advptr = advptr->next
return advptr
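Putting the three fragments together, a runnable Python version might look like this (my own sketch; the Node class is illustrative):

class Node:
    def __init__(self, val):
        self.val = val
        self.next = None

def find_loop_start(head):
    # Phase 1: fast/slow pointers detect the loop.
    fast = slow = head
    while fast is not None and fast.next is not None:
        fast = fast.next.next
        slow = slow.next
        if fast is slow:
            break
    else:
        return None  # no loop
    # Phase 2: count the nodes in the loop.
    count, slow = 1, slow.next
    while slow is not fast:
        count += 1
        slow = slow.next
    # Phase 3: advance one pointer `count` nodes, then move both together.
    adv = head
    for _ in range(count):
        adv = adv.next
    cur = head
    while cur is not adv:
        cur = cur.next
        adv = adv.next
    return adv

# Build 1 -> 2 -> 3 -> 4 -> 5 -> back to 3
nodes = [Node(i) for i in range(1, 6)]
for a, b in zip(nodes, nodes[1:]):
    a.next = b
nodes[-1].next = nodes[2]
print(find_loop_start(nodes[0]).val)  # 3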
Monty Hall Problem
This is a very famous probability problem. This problem illustrates that probability is sometimes more convincing than "what we perceive as obvious". I'll post another problem that defeats "obvious reasoning".
Monty Hall problem: Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?
Solution: show
For a mathematician stuck on the game show, I would suggest tossing an unbiased coin. The reason: the game show host might be clever and, seeing that the mathematician picked the door with the prize, might offer the switch precisely to entice him away from it. The mathematician, well aware of this trick, might refuse to switch. But the host anticipated that too... errr. This recursive logic goes on to infinity.
Under the assumption that the game show host has no hidden motive and always performs this step, the best choice is to switch: it raises the chance of winning from 1/3 to 2/3. There is a very simple explanation. The prize could be behind any of the doors, and you picked door 1. Suppose the host instead offered: either keep door 1, or take both doors 2 and 3. You would take the two doors, doubling your winning chances to 2/3. That is exactly what switching amounts to: the host opens door 3 for you, and you open door 2 when you switch.
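A quick simulation (my own sketch) confirms the 1/3 versus 2/3 split:

import random

def monty_hall(trials=100000):
    stay_wins = switch_wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        pick = random.randrange(3)
        # The host opens a goat door that is neither the pick nor the prize.
        opened = next(d for d in range(3) if d != pick and d != prize)
        switched = next(d for d in range(3) if d != pick and d != opened)
        stay_wins += (pick == prize)
        switch_wins += (switched == prize)
    print("stay: %.3f  switch: %.3f" % (stay_wins / trials, switch_wins / trials))

monty_hall()  # roughly: stay 0.333, switch 0.667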
Red Green Cards: A box has three cards. First card has both sides green, second card has both sides red, third card has one side green and one side red. A card is picked at random and its one side is observed to be green. What is the probability that the other side of the card is also green.
Solution: show
It seems the answer should be 1/2, but it is 2/3. For the explanation, label the first card g1|g2, the second card r1|r2, and the third card g3|r3. Since one side of the picked card is green, the observed side is one of g1, g2, or g3, each equally likely. Two of these three events point to the first card, whose other side is also green, so the probability is 2/3.
January 30, 2009
Finding Second Smallest Element
Jargon: The order statistics problem is to find the kth smallest element in an unsorted array A[1..n]. This problem can be solved in O(n) average time using the randomized select algorithm, or in O(n) worst-case time using an algorithm that chooses the pivot element carefully.
Problem: How do we find the second smallest element in an array A[1..n] of size n? We would like to do it in much less than 2n comparisons.
Answer: Time Complexity = n + ceil(log n) - 2
Solution: show
Finding the smallest element is trivial. The following code would find the smallest element using n-1 comparisons.
min = A[1]
for i = 2 to n
    if A[i] < min
        min = A[i]
return min

The trivial algorithm for the second minimum is to keep another variable (smin) along with min; whenever an element knocks out smin, check whether it also knocks out min and do the book-keeping accordingly. This takes 2n-3 comparisons and constant space. A key observation is that the second smallest element is knocked out by the smallest one at some stage. If we preserve which element was knocked out by whom, then finding the second smallest element only requires finding the smallest among the elements knocked out by the minimum. But to build that record, we need to run the algorithm as a pair-wise tournament: each round plays matches between pairs, knocking out half the elements with n/2 comparisons. The recurrence relation looks like this:
T(n) = T(n/2) + n/2

Solving this recurrence gives T(n) = n - 1. The height of the recurrence tree is lg n, so the minimum plays at most ceil(lg n) matches; checking just the elements knocked out by the root (the min) gives the second minimum. For example:
1   5   2   6
 \ /     \ /
  1       2
   \     /
    \   /
     \ /
      1

Once we have built the tree using n-1 comparisons, we find the smallest among the numbers knocked out by 1, i.e., {5, 2}; so the second minimum is 2. The total number of comparisons is n + ceil(log n) - 2, where log is taken to base 2. We implement this algorithm by maintaining, for each element, a list of the indices it has knocked out. The python code is towards the end of this post.
Another algorithm is very easy to implement but sadly takes 2n - 2 lg n - 1 array-element comparisons: simply build a min-heap (rooted at index 1) and return the smaller of the numbers at indices 2 and 3. The cost of building the min-heap is 2 * [n/4 + 2n/8 + ... + (lg n - 1) * n/2^(lg n)]; this arithmetic-geometric series sums to 2n - 2 lg n - 2, and comparing the elements at indices 2 and 3 adds one more comparison.
Code for n + log n - 2 algorithm: second_smallest.py
Alternate implementation using linked list: second_smallest_linkedlist.py
January 23, 2009
String Permutations
Printing all permutations of a string is a very common interview question. We'll discuss this problem and some interesting variations of it. The original problem says: "print all permutations of a string". As an example, if the string is "abc" there are 6 permutations {abc, acb, bac, bca, cab, cba}. Assume for now that the string contains no duplicate characters.
Solution: show
There are many ways in which this problem can be solved. The easiest is to code it using the simple yet powerful technique of recursion:
function permute(str, d)
    if d == length(str)
        print str
    else
        for i <- d to length(str)
            swap(str[d] <-> str[i])   // swap character for permutation
            permute(str, d + 1)
            swap(str[d] <-> str[i])   // undo swap for parent call

The above code performs the swapping operation twice: first to generate the permutations with each candidate character at position d, and a second time to restore the original string. It is important to restore the string (as it was passed to this call), otherwise the algorithm ends up printing redundant permutations. Additionally, this code assumes that there are no duplicates in the string; it fails otherwise.
There are other ways to print string permutations that use circular shifting, but it can't get any simpler than the above.
Source Code: permute_unique_chars.py
Let us now assume that the string can contain duplicates. How do we print all non-redundant permutations of the string? For ex. if the string is "abab", the permutations are {baab, abba, abab, baba, bbaa, aabb}.
Solution: show
Clearly the solution to the first problem fails here. This is due to the fact that we generate all permutations of the string after the swapping step, so the same character can get swapped to the dth spot again, leading to redundant permutations. A very simple trick solves the game: pre-sort the string alphabetically. Now, while swapping to generate the permutations for a particular character, we skip any character that matches the last one swapped in. Effectively, we are doing a cyclic shift, and before the end of the call we reset the cyclic shift. The following code implements this:
function permute(str, d)
    if d == length(str)
        print str
    else
        lastSwap = Nil
        for i <- d to length(str)
            if lastSwap == str[i]
                continue
            else
                lastSwap = str[i]
            swap(str[d] <-> str[i])   // swap character for permutation
            permute(str, d + 1)
        for i <- d to length(str)-1
            str[i] = str[i+1]
        str[length(str)] = lastSwap

If the string is pre-sorted alphabetically, the above algorithm works magically. Note that in this case we do not undo the swap after permute is called. This makes the sequence of swaps a cyclic shift, so the intermediate str from d onwards also stays sorted; after the for loop, we reverse the cyclic shift. Sorting ensures that if there are duplicates, the second occurrence of a duplicate element is picked immediately after we finish generating permutations for its first occurrence. If sorting is not allowed, you need to maintain a local hashtable (local to each function call) that records the characters already swapped to the dth spot and discards duplicate candidates. Pre-sort and make life simple.
Source Code: permute_chars.py
Some other constraints could be added to make the problem more interesting. One such constraint is a partial ordering of characters: for ex. 'a' should appear before 'c', and 'b' should appear before 'c' in all permutations. There can also exist orderings that admit no permutation of the string, such as 'a' precedes 'b', 'b' precedes 'c', and 'c' precedes 'a'. Implementing the permutation algorithm in these cases requires checking for ordering violations either at each level of the recursion or only at the leaves, depending on whether checking for violations is more costly than generating all permutations.
An interesting problem, that talks of special kind of ordering and can be very efficiently computed is as follows: Assume that the string is made up of equal number of '{' and '}'. You need to print all permutations of this string which could satisfy a C compiler i.e. an ending bracket is balanced by an opening bracket. For ex. for '{{}}' the permutations are [{{}}, {}{}].
Solution: show
The hard question is how many such permutations exist. For N open and N close brackets, the number of such permutations is the Nth Catalan number = 1/(N+1) * factorial(2N)/factorial(N)^2. This result is derived in the problem: Number of paths in rectangular grid.
Permutation problems can be easily solved using recursion. The trick is to formulate all the possible cases at a particular level and code according to that formulation. Let's formulate the problem in a general sense: assume that at a given step of the recursion, we have n open brackets and m closed brackets not yet used. We can now formulate each and every case for a given (n,m) and code the possible sub-steps.
Problem Formulation:
(n,m) ---> NOT POSSIBLE,    if m < n
(n,m) ---> PRINT AND DONE,  if m = 0 and n = 0
(n,m) ---> (n-1, m)         if n > 0
         + (n, m-1)         if m > 0 and m > n

The above formulation is generic and takes into account all possible input values of (n,m). If we put the following constraints on the initial values of n and m, the actual implementation can omit some of the above cases:
Constraints:
n > 0
m == n

The actual implementation that follows the above constraints and problem formulation:
function permute(str, n, m)
    if m == 0
        print str
    else
        if n > 0
            permute(str + '{', n - 1, m)
        if m > n
            permute(str + '}', n, m - 1)

To enforce the above constraints, permute should be kept as an inner/private function; the function actually exposed is the following:
function GenerateCStyleBrackets(N)
    if N <= 0
        return
    str = new String[2 * N]
    permute(str, N, N)   // str is passed as pointer

The memory cost is O(N). The maximum stack depth is also O(N), so the total space cost is O(N).
Source Code: permute_brackets.py
January 22, 2009
Party Friends
Problem: In any party, there exist two people with the same number of friends. Is this true or false? Assume that the party has more than one person.
Solution: show
Yes, the statement is true. We can prove it by contradiction.
Let there be N people at the party.
Suppose all N people have a different number of friends.
A person can have a maximum of N-1 friends and a minimum of 0, so the possible numbers of friends are 0, 1, 2, ..., N-1. Since we assume no two people have the same number of friends, every value is taken: someone has 0 friends, someone has 1 friend, and so on up to someone with N-1 friends. But 0 and N-1 cannot coexist: a person with N-1 friends is friends with everyone, including the supposed person with 0 friends, which is a contradiction. So only N-1 distinct values are actually available for the N people, and by the pigeonhole principle at least two people must have the same number of friends.