These notes are derived from Dr. Ding-Zhu Du's lectures in his CS 6363: Design and Analysis of Computer Algorithms course, given in the Spring 2026 semester, as well as the assigned textbook for the course: Introduction to Algorithms (3rd edition) by T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. Stein.
These notes are organised exactly as presented in Dr. Du's course, with each topic within a chapter spanning a single lecture. They are longer than the lectures themselves because they have been augmented: they are a hodgepodge of my notes, Dr. Du's lectures, CLRS, personal anecdotes and other resources. Since Dr. Du's lecture slides were not comprehensive, I took the liberty of fortifying them with rigour, intuition, proofs, illustrations, and additional examples and content.
This course is organised into three parts: Algorithms with Self-Reduction, The Incremental Method, and Complexity and Approximation. This arrangement of topics is quite intentional. To start off, we will study Algorithms with Self-Reduction. Self-reduction is a structural property of a problem: the optimal solution of the problem at hand contains within it optimal solutions to smaller sub-problems. Dividing a problem into smaller parts and conquering the pieces, dynamically programming up to the solution using previously computed values, and greedily making what seems to be the most immediate optimal choice are all ways a problem can be solved recursively or iteratively. However the problem is solved, the crux of the solution involves some notion of shrinking the problem into smaller parts such that each part is still the same kind of problem. The problem contains itself!
At first blush, it might seem like self-reducibility is the same thing as optimal substructure. Though they are similar, they are not the same: self-reducibility enables optimal substructure. Optimal substructure is a characteristic of a problem, whereas self-reducibility is a process that involves repeatedly invoking the decision version of the problem.
From there, we will move to Part II of the course, The Incremental Method. Part I gave us ways to solve problems in a top-down or bottom-up fashion. This is nice, but there do come times when we do not start with a problem. Rather, we might start with a bad (that is, sub-optimal) solution. What we will do here is iteratively, incrementally, improve it until optimality is achieved. In fact, the topics of Part II are quite apropos. In a flow network, there is no reliance on the parts of the network being optimal. As a matter of fact, it is possible to have a sub-flow in the network that is good locally but terrible globally. Instead of an optimal substructure, we will use residual graphs to undo bad decisions and eventually attain optimality. Unlike Part I, Part II does not feature reducing the problem size: there is no need to when the system writ large matters more, and we improve the whole thing from there.
Parts I and II centred on developing techniques to solve problems. Part III, Complexity and Approximation, asks a different question: can we do this, and do it efficiently? Up to Part III, the algorithms learnt all ran in polynomial time. Part III introduces a class of problems for which no efficient algorithm is known to exist. This is where self-reducibility emerges again, but as a tool of complexity, not a feature of the problems. Formally, a problem is said to be self-reducible if its search problem reduces (via a Cook reduction) in polynomial time to its corresponding decision problem. In other words, there exists an algorithm to solve the search problem in a polynomial number of steps, where each step is either an ordinary (polynomial-time) computational step or a query to an oracle for the decision problem.
In Part III, two things will be discussed:
Taking from earlier, self-reducibility emerges once again. Instead of being synonymous with recursivity, we use it to form a critical link:
If you can answer the decision-version (True/False) of an NP-Complete problem, the actual solution can be found through searching.
Consider a theoretical scenario where a black box contains an oracle that whispers whether a Hamiltonian cycle (a cycle visiting each node exactly once) exists in a graph $G=(V,E)$. Using self-reduction, we can find the cycle itself by testing edges one by one. This shows that finding a solution is about as difficult as deciding whether one exists.
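To make this concrete, here is a minimal Python sketch of that self-reduction. The function names are my own, and the brute-force `has_ham_cycle` is only a stand-in for the oracle (a real oracle would be a black box): we repeatedly delete an edge and re-ask the decision question, keeping only the edges the answer depends on.

```python
from itertools import permutations

def has_ham_cycle(n, edges):
    # Brute-force stand-in for the decision oracle: does the graph on
    # vertices 0..n-1 with these undirected edges have a Hamiltonian cycle?
    es = set(edges) | {(v, u) for (u, v) in edges}
    for perm in permutations(range(1, n)):
        cycle = (0,) + perm
        if all((cycle[k], cycle[(k + 1) % n]) in es for k in range(n)):
            return True
    return False

def find_ham_cycle_edges(n, edges, oracle):
    # Self-reduction: answer the search problem using decision queries only.
    if not oracle(n, edges):
        return None
    kept = list(edges)
    for e in list(kept):
        trial = [f for f in kept if f != e]
        if oracle(n, trial):    # some Hamiltonian cycle survives without e,
            kept = trial        # so e is not needed; discard it
    return kept                 # exactly the n edges of one Hamiltonian cycle
```

On the complete graph $K_4$, the surviving edges form a single Hamiltonian cycle, and only one oracle query per edge was made: a polynomial number of queries, exactly as the Cook-reduction definition demands.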
Part III establishes that some problems, the NP-Hard ones, cannot be solved both perfectly and quickly. So, there are two possible resolutions.
An algorithm is a computational procedure that takes some value, or a set of values, as input and produces some values, or a set of values, as output. So, we say an algorithm is a sequence or combination of computational steps that transform the input into the output.
In sequential computation, an algorithm is a sequence of computational steps. However, that is not true in parallel computation, where a large problem is broken into smaller, constituent problems that are all solved simultaneously.
Remark. In the scope of this course, only algorithms in sequential computation will be studied.
We begin our study of sequential algorithms by immediately exploring the realm of sorting algorithms. In general, we express the input and output of a sorting algorithm as the following:
Input: A sequence of $n$ numbers $\{a_1, a_2, \dots, a_n\}$
Output: A permutation of the input sequence $\{a_1', a_2', \dots, a_n'\}$ such that $ a_1' \leq a_2' \leq \dots a_n'.$
For example, take the sequence $5, 2, 4, 6, 1, 3$. Feed that into a sorting algorithm and you should get $1, 2, 3, 4, 5, 6$ as the output.
$\text{INSERTION-SORT}$ is one of the most ubiquitous sorting algorithms. It sorts a small number of elements efficiently. The intuition goes like this. Imagine you begin with an empty hand and a pile of cards face down on the table. The goal is to end up holding all the cards sorted in ascending order. Easy: pick up a card to start yourself off. After that, pick up cards one at a time, comparing each against the cards you are currently holding to find where it belongs, until there are no cards left on the table. Eventually, the cards on the table run out and you are holding a fully sorted hand of cards.
You can find the pseudocode for $\text{INSERTION-SORT}$ below. Before you skip ahead, though,
let us preface what you will see. $\text{INSERTION-SORT}$ takes an array $A[1 \dots n]$ as
a parameter; that array is a sequence of length $n$ that is to be sorted. The number $n$ of elements
in $A$ is found using A.length. It is important to realise that $\text{INSERTION-SORT}$ sorts
the numbers in-place—that means it arranges the numbers within the array $A$ with at
most a constant number of them stored outside the array at any time. Put differently, an in-place
algorithm operates directly on the input data structure without needing a separate copy of the data structure
(like an auxiliary array) that needs extra space proportional to the input size. Once it is finished, $\text{INSERTION-SORT}$ will
output the array $A$ with the original input sequence fully sorted.
INSERTION-SORT(A)
1: for j = 2 to A.length do
2: key = A[j]
3: // Insert A[j] into the sorted sequence A[1..j-1]
4: i = j - 1
5: while i > 0 and A[i] > key do
6: A[i + 1] = A[i]
7: i = i - 1
8: end while
9: A[i+1] = key
10: end for
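The pseudocode translates almost line for line into Python. A sketch: with 0-based indexing, the outer loop starts at $j = 1$ rather than $j = 2$, and the while test becomes $i \geq 0$.

```python
def insertion_sort(A):
    # In-place insertion sort, mirroring the pseudocode above
    # with 0-based indices.
    for j in range(1, len(A)):
        key = A[j]
        # Insert A[j] into the sorted sequence A[0..j-1].
        i = j - 1
        while i >= 0 and A[i] > key:
            A[i + 1] = A[i]   # shift larger elements one slot right
            i -= 1
        A[i + 1] = key
    return A
```

Feeding in the example sequence from before, `insertion_sort([5, 2, 4, 6, 1, 3])` returns `[1, 2, 3, 4, 5, 6]`.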
Now why is it that $\text{INSERTION-SORT}$ starts at $j = 2$? Remember the card analogy from earlier? We needed to pick up a card to start ourselves off. That means a single card is itself sorted. In fact, that is a nice segue into a brief showing of the correctness of $\text{INSERTION-SORT}$, which will be done by way of a loop invariant.
Loop invariants help illustrate exactly why an algorithm is correct. You will need three things: initialization (the invariant is true prior to the first iteration of the loop), maintenance (if it is true before an iteration, it remains true before the next iteration), and termination (when the loop ends, the invariant gives a useful property that shows the algorithm is correct).
For our purposes, the loop invariant for $\text{INSERTION-SORT}$ is:
At the start of each iteration of the for loop of lines 1-10, the subarray $A[1..j-1]$ consists of the elements originally in $A[1..j-1]$, but in sorted order.
It is clear how all three criteria for the loop invariant are satisfied, right? Our colloquy on the single sorted card satisfies the first one (before the first iteration, $j = 2$ and the subarray $A[1..1]$ is trivially sorted); the fact that the body of the for loop moves $A[j-1]$, $A[j-2]$ and so on one position to the right until the proper position for $A[j]$ is found takes care of the second one; and once $j > $ A.length, $j = n + 1$, and plugging that into $A[1..j-1]$ (from the loop invariant above) gets us $A[1..n]$, which is exactly our input, just sorted.
At least as important as demonstrating correctness is being able to analyse an algorithm, which has come to mean predicting the resources the algorithm will require. Chief among such resources is the time—the running time. But before we get to that, we must first define the model of computation and of resources we will use. This course assumes the generic random-access machine (RAM) model where instructions are executed one after another (re: sequential computation from earlier).
So how long does it take for $\text{INSERTION-SORT}$ to sort its input? Well, it depends: sorting a handful of elements will always take less time than sorting millions of them. But there are nuances to that: $\text{INSERTION-SORT}$ takes longer on an array in which the elements are jumbled up than on one of the same size in which only the first two consecutive numbers are out of place. Generally speaking, the time taken by an algorithm grows with the size of the input. Hence, we say that the running time of a program is a function of the size of the input.
Now, the best notion for the input size depends on the problem being studied (and there are lots of those!). The most natural measure, of course, is simply the number of items in the input. But there are exceptions: take multiplying two integers. There, the number of items in the input is just two, which does not accurately represent the computational effort required; the total number of bits needed to represent the input in ordinary binary notation is a much better measure. And if your input is something more complex, like a graph $G = (V, E)$? In that instance, the size of the input is best expressed as two numbers: the number of vertices and the number of edges. The point is: there is no silver bullet to measuring input size.
If you are asked to calculate the running time for a different input size, you can treat the time complexity exactly like a standard mathematical function. You simply take the original running time formula and substitute the new input size wherever you see the original variable. So, if the running time of an algorithm is $T(n)$ on an input of size $n$, a new input of size $x$ would have that algorithm run in $T(x)$ time. I.e., just plug it in!
The running time of an algorithm on a particular input is the number of primitive operations executed. A constant $O(1)$ amount of time is required to execute each line of pseudocode. One line might take a different amount of time than another, but the point is: each execution of the $i$th line takes $c_i$ time, where $c_i$ is a constant. Comments are free—they are not executable statements so they take no time. Another thing to note is that loops and subroutines need to be examined closely as they do not necessarily take constant time. Also, all memory accesses take a constant amount of time.
Let us analyse the running time of $\text{INSERTION-SORT}$. We will use $c_i$ to represent the time needed to execute statement $i$ in the pseudocode above. Line 1 is simply the header of the for loop and it is executed $n$ times. Lines 2, 4 and 9 execute $n-1$ times. (Line 3 is a comment and lines 8 and 10 are end markers, so they cost nothing.) Line 5 is tricky since its number of executions is variable—it depends on the contents of the array $A$ and the value of $j$. So, we will let $x$ be the total number of times Line 5 is executed. Lines 6 and 7 execute only when the while test succeeds; since each of the $n-1$ iterations of the for loop ends with one failed test, they execute $x-(n-1)$ times. All of this means we can describe the running time of $\text{INSERTION-SORT}$ as a function $T(n)$ where $$T(n) = c_1 n + (c_2 + c_4 + c_9)(n-1) + c_5 x + (c_6 + c_7)\big(x-(n-1)\big).$$
But we don't know what $x$ is! Be that as it may, we can at least find an upper bound for it. Let $x_j$ be the maximum number of times Line 5 is executed for a given value of $j$. The while loop checks whether $i > 0$ and whether the element at index $i$ is greater than the key. The loop starts at $i = j - 1$ and, in the worst case, continues until the test at $i = 0$ fails. Counting backwards, we test $i = (j-1), (j-2), (j-3), \dots, 3, 2, 1, 0$. If you count all those values, that is exactly $j$ tests. Ergo, $x_j = j$ and an upper bound on $x$ is $$\sum^n_{j=2} x_j = \sum^n_{j=2} j = \left( \sum^n_{j=1} j \right) - 1 = \frac{n(n+1)}{2} - 1 = \frac{n^2+n-2}{2}.$$ Plugging that into the equation above and then
doing some algebraic calculations, you will get something which can be expressed as $$T(n) = an^2 + bn + c$$ where $a, b, c$ are constants that
depend on the statement costs $c_i$. It is a quadratic function of $n$.
In the worst case, it is clear that $T(n) \in O(n^2)$. Since $a > 0$, we also have $an^2 \leq T(n)$ for sufficiently large $n$, which gives us a lower bound expressible as $T(n) \in \Omega(n^2)$. Taken together, this means the worst-case running time of $\text{INSERTION-SORT}$ is tightly bounded: $T(n) \in \Theta(n^2)$.
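As a sanity check on the arithmetic, we can instrument the sort to count executions of the Line 5 test and compare against the bound $\sum_{j=2}^{n} j = \frac{n(n+1)}{2} - 1$ on a reverse-sorted (worst-case) input. A sketch; the counter and function name are my own:

```python
def line5_tests(A):
    # Insertion sort that counts every evaluation of the while test
    # (Line 5 in the pseudocode). 0-based indices.
    tests = 0
    for j in range(1, len(A)):
        key = A[j]
        i = j - 1
        tests += 1                  # the first test of this iteration
        while i >= 0 and A[i] > key:
            A[i + 1] = A[i]
            i -= 1
            tests += 1              # re-test after each shift
        A[i + 1] = key
    return tests

# On a reverse-sorted array the bound is attained exactly:
n = 6
assert line5_tests(list(range(n, 0, -1))) == n * (n + 1) // 2 - 1
```

For $n = 6$ that is $20$ tests, matching $\frac{6 \cdot 7}{2} - 1$ exactly, which confirms that the reverse-sorted input really is the worst case of the derivation above.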
Remark. In the Turing machine model, the input size of a sorting algorithm is $$\log_2 a_1 + \log_2 a_2 + \dots + \log_2 a_n.$$ In the
context of insertion sort, that tells us why the while loop has $x$ executions.
Another canonical sorting algorithm is called $\text{MERGE-SORT}$. It follows closely the divide and conquer paradigm, where to solve a given problem, the algorithm breaks it apart into smaller subproblems, solves each of them recursively, and then combines the solutions to form a solution to the original problem posed. Using $\text{MERGE-SORT}$, let us see the three steps the divide and conquer strategy employs at each level of the recursion: divide the $n$-element sequence into two subsequences of $n/2$ elements each; conquer by sorting the two subsequences recursively; and combine by merging the two sorted subsequences.
We say the recursion "bottoms out" when the sequence to be sorted has a length of 1, in which case there is no work to be done since every sequence of length one is itself sorted (re: the colloquy about the single sorted card above).
The key operation of $\text{MERGE-SORT}$ is the combine step, which uses a helper procedure called $\text{MERGE}(A, p, q, r)$. There, $p, q, r$ are indices into the array such that $p \leq q < r$. $\text{MERGE}$ assumes the subarrays $A[p \dots q]$ and $A[q + 1 \dots r]$ are sorted; merge them both and you get a final sorted array $A[p \dots r]$. To help understand why it is important, let us return to the card-playing example from earlier. If we have two piles of cards, face up and sorted, we want to combine the two stacks into a final, still sorted, stack which faces down on the table. The basic step is to choose the smaller of the two cards on top of the face-up piles, remove it (which exposes the next card underneath), and place it face down onto the table. Repeat that procedure until you have that final stack of cards. Picking up a card and putting it down is $O(1)$. Doing that for $n$ cards means merging takes $\Theta(n)$ time, where $n = r - p + 1$ is the total number of elements being merged. Take a look at the pseudocode for $\text{MERGE}(A, p, q, r)$ below.
MERGE(A, p, q, r)
1: n_1 = q - p + 1
2: n_2 = r - q
3: let L[1..n_1 + 1] and R[1..n_2 + 1] be new arrays
4: for i = 1 to n_1 do
5: L[i] = A[p + i - 1]
6: end for
7: for j = 1 to n_2 do
8: R[j] = A[q + j]
9: end for
10: // ꝏ is a sentinel value. Whenever a card with ꝏ
11: // is exposed, it cannot be the smaller card unless
12: // both piles have their sentinel cards exposed.
13: L[n_1 + 1] = ꝏ
14: R[n_2 + 1] = ꝏ
15: i = 1
16: j = 1
17: for k = p to r do
18: if L[i] ≤ R[j] then
19: A[k] = L[i]
20: i = i + 1
21: else A[k] = R[j]
22: j = j + 1
23: end if
24: end for
Line 1 computes the length of $A[p \dots q]$, and line 2 computes the length of
$A[q + 1 \dots r]$. We create arrays L and R (“left” and “right”), of lengths $n_1 + 1$
and $n_2 + 1$, respectively, in line 3; the extra one in each array will hold a
sentinel. The for loop of lines 4–6 copies the subarray $A[p \dots q]$ into $L[1 \dots n_1]$,
and the for loop of lines 7–9 copies the subarray $A[q + 1 \dots r]$ into $R[1 \dots n_2]$.
Lines 13-14 put the sentinels at the ends of the arrays $L$ and $R$. Lines 17-24 perform the
$r - p + 1$ basic steps by maintaining the following loop invariant (which explains exactly the magic of $\text{MERGE}$) directly from CLRS:
At the start of each iteration of the for loop of lines 17-24, the subarray $A[p \dots k - 1]$ contains the $k - p$ smallest elements of $L[1 \dots n_1 + 1]$ and $R[1 \dots n_2 + 1]$, in sorted order. Moreover, $L[i]$ and $R[j]$ are the smallest elements of their arrays that have not been copied back into $A$.
$\text{MERGE}$ runs in $\Theta(n)$ time, where $n = r - p + 1$ (you could say then that $\text{MERGE}$ runs in
$\Theta(r - p + 1)$ time). Can you tell why? Lines 1–3 and 13–16 take constant time, the for loops of
lines 4-9 take $\Theta(n_1 + n_2) = \Theta(n)$ time, and there are $n$ iterations of the for
loop of lines 17–24, each of which (the lines) takes constant time. So, we have $$O(1) + O(1) + \Theta(n) + \Theta(n) \in \Theta(n)$$
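The procedure above can be sketched in Python (0-based, inclusive indices assumed; `float('inf')` plays the role of the ∞ sentinel card):

```python
def merge(A, p, q, r):
    # Merge sorted runs A[p..q] and A[q+1..r] (inclusive, 0-based) in place.
    L = A[p:q + 1] + [float('inf')]     # left run plus sentinel
    R = A[q + 1:r + 1] + [float('inf')] # right run plus sentinel
    i = j = 0
    for k in range(p, r + 1):
        if L[i] <= R[j]:
            A[k] = L[i]
            i += 1
        else:
            A[k] = R[j]
            j += 1
```

Because each run ends in a sentinel, the loop never has to check whether a run is exhausted: a real element always compares smaller than ∞, exactly as the card analogy promised.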
Really, the bulk of the work in $\text{MERGE-SORT}$ is done by the $\text{MERGE}$ procedure. $\text{MERGE}$ is invoked as a subroutine by $\text{MERGE-SORT}$ as it sorts a subarray $A[p \dots r]$. If $p \geq r$, $A[p \dots r]$ has at most one element and is considered sorted. If not, the divide step computes an index $q$ that partitions $A[p \dots r]$ into two smaller arrays, $A[p \dots q]$ and $A[q + 1 \dots r]$, containing $\lceil \frac{n}{2} \rceil$ and $\lfloor \frac{n}{2} \rfloor$ elements respectively.
Remark. Why did we say "[i]f $p \geq r$, $A$ has at most one element and is considered sorted"? Well, when you calculate the number of elements in an array, the formula you would use is $$\text{value-of-ending-index} - \text{value-of-starting-index} + 1.$$ Based on that formula, the math breaks down like this:
With that out of the way, here is the star of the subchapter!
MERGE-SORT(A, p, r)
1: if p < r
2: q = ⌊(p + r) / 2⌋
3: MERGE-SORT(A, p, q)
4: MERGE-SORT(A, q + 1, r)
5: MERGE(A, p, q, r)
If you look carefully, we called $\text{MERGE-SORT}$ inside $\text{MERGE-SORT}$! That is called a recursive call.
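A self-contained Python sketch of the whole procedure (0-based, inclusive indices assumed; the merge step here is bounds-checked rather than sentinel-based):

```python
def merge_sort(A, p, r):
    # Sort A[p..r] (inclusive, 0-based) in place.
    if p < r:
        q = (p + r) // 2
        merge_sort(A, p, q)         # recursively sort the left half
        merge_sort(A, q + 1, r)     # recursively sort the right half
        # Combine: two-finger merge of the sorted halves.
        L, R = A[p:q + 1], A[q + 1:r + 1]
        i = j = 0
        for k in range(p, r + 1):
            if j >= len(R) or (i < len(L) and L[i] <= R[j]):
                A[k] = L[i]
                i += 1
            else:
                A[k] = R[j]
                j += 1
```

Calling `merge_sort(A, 0, len(A) - 1)` on `[5, 2, 4, 6, 1, 3]` leaves `A` as `[1, 2, 3, 4, 5, 6]`.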
When an algorithm contains a recursive call, we can often describe its running time using a recurrence equation or recurrence or recurrence relation, which describes the overall running time on a problem of size $n$ in terms of the running time on smaller inputs. From there, we get a neat formula we can solve.
A recurrence for the running time of a divide-and-conquer algorithm is derived directly from its three functional stages. Let $T(n)$ represent the total running time on an input of size $n$. Let us begin with the base case: if the problem size is sufficiently small (e.g., $n \leq c$ for some constant $c$), the algorithm executes a direct, straightforward solution in constant time, denoted $\Theta(1)$. Now the recursive step: if $n > c$, the division of the problem yields $a$ subproblems, each of which is $\frac{1}{b}$ the size of the original. (Note: in $\text{MERGE-SORT}$, $a = b = 2$, but these values differ in other algorithms.) The time to solve a single subproblem is $T(n/b)$; consequently, solving $a$ subproblems requires $aT(n/b)$ time. Let $D(n)$ be the time required to divide the problem and $C(n)$ be the time to combine the sub-solutions. This is our overhead. The resulting general recurrence is: $$T(n) = \begin{cases} \Theta(1) & \text{if } n \leq c, \\ aT(n/b) + D(n) + C(n) & \text{otherwise.} \end{cases}$$
To analyze $\text{MERGE-SORT}$ specifically, we assume the input size $n$ is a power of 2. While the algorithm functions correctly for all $n$, this assumption ensures that every Divide step produces two subsequences of exactly $n/2$, simplifying the recurrence-based analysis without loss of generality.
We define $T(n)$ as the worst-case running time for $\text{MERGE-SORT}$ on $n$ elements. For the base case of $n = 1$, the time is constant. For $n > 1$, the total time is the sum of: the divide step, which merely computes the middle index in $D(n) = \Theta(1)$ time; the conquer step, which solves two subproblems of size $n/2$, contributing $2T(n/2)$; and the combine step, whose $\text{MERGE}$ procedure runs in $C(n) = \Theta(n)$ time.
The solution to this recurrence is $T(n) = \Theta(n \lg n)$. To understand this intuitively without the Master Theorem, we can replace the $\Theta$ notation with a constant $c$ that represents both the base-case cost and the per-element cost of the divide/combine steps: $$T(n) = \begin{cases} c & \text{if } n = 1, \\ 2T(n/2) + cn & \text{if } n > 1. \end{cases}$$
By visualising this as a recursion tree, we can compute the total cost by summing the costs across all levels. The tree consists of $\lg n + 1$ levels. Each level $i$ (from $0$ to $\lg n$) contributes a total cost of $cn$, as the number of subproblems ($2^i$) multiplied by the cost per subproblem ($c \cdot n/2^i$) always equals $cn$. Summing these levels gives: $cn(\lg n + 1) = cn \lg n + cn$. By focusing on the asymptotic dominant term and discarding constants, we confirm the result $T(n) = \Theta(n \lg n)$. Here is the recursion tree we just described:
Remark. Earlier, we said the height of the recursion tree is $\lg n$. That is because the branching factor $a$ is equal to two, which means the recursion tree drawn from it is a binary tree. In the next section, you will find a proof of this fact about the height of a binary tree.
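The closed form $cn \lg n + cn$ can also be confirmed numerically by unrolling the recurrence for powers of two. A sketch, with the constant fixed at $c = 1$:

```python
import math

def T(n, c=1):
    # Unroll the merge-sort recurrence T(n) = 2T(n/2) + cn with T(1) = c,
    # for n a power of two.
    return c if n == 1 else 2 * T(n // 2, c) + c * n

# The recursion-tree sum cn(lg n + 1) = cn lg n + cn matches exactly:
for n in [1, 2, 4, 8, 1024]:
    assert T(n) == n * math.log2(n) + n
```

Every level of the tree contributes exactly $cn$, and there are $\lg n + 1$ levels, so the agreement is exact rather than merely asymptotic when $n$ is a power of two.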
Here's a question for you: is an array a data structure? Well, a data structure is a data organisation associated with a set of operations, standard algorithms, for efficiently using the data. So, yes, an array is a data structure! It contains a set of elements of the same memory size, each identified by an index, and of the same data type.
Dr. Du would have you believe that an array is not a data structure. It is quite unclear as to why that is, but in case it appears on an exam of his, his belief is included in these notes for posterity.
A data structure is a standard part in constructing algorithms. There are many data structures that actually build off the humble array: the stack, the queue, the list, and the heap.
A heap is a nearly complete binary tree which can easily be implemented on an array. For example, the binary tree on the keys $6, 5, 3, 2, 4, 1$ (with $6$ at the root) becomes

| $A[1]$ | $A[2]$ | $A[3]$ | $A[4]$ | $A[5]$ | $A[6]$ |
|---|---|---|---|---|---|
| 6 | 5 | 3 | 2 | 4 | 1 |
The fact that the heap is a nearly complete binary tree is critical: the tree is completely filled on all levels except possibly the lowest, which is filled from the left up to a point. Each node of the tree corresponds to an element of the array.
An array $A$ that represents
a heap is an object with two attributes: A.length, which (as usual) gives the number of elements in the array,
and A.heap-size, which represents how many elements in the heap are stored within array $A$.
That is, although $A[1 \dots$ A.length$]$ may contain numbers, only the elements in $A[1 \dots$ A.heap-size$]$,
where $0 \leq$ A.heap-size $\leq$ A.length, are valid elements of the heap.
The root of the tree is $A[1]$, and given the index $i$ of a node, we can easily compute the indices of its parent, left child, and right child using the following procedures:
PARENT(i)
1: return ⌊i/2⌋
LEFT(i)
1: return 2i
RIGHT(i)
1: return 2i + 1
On most computers, the $\text{LEFT}$ procedure can compute $2i$ in one instruction by simply shifting the binary representation of $i$ left by one bit position. Likewise, the $\text{RIGHT}$ procedure can quickly compute $2i + 1$ by shifting the binary representation of $i$ left by one bit position and then adding in a $1$ as the low-order bit. The $\text{PARENT}$ procedure can compute $\lfloor i/2 \rfloor$ by shifting $i$ right one bit position.
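In Python, those three index computations and their bit-shift forms look like this (1-based heap indices assumed, matching the pseudocode):

```python
def parent(i):
    return i >> 1        # ⌊i/2⌋: shift right one bit

def left(i):
    return i << 1        # 2i: shift left one bit

def right(i):
    return (i << 1) | 1  # 2i + 1: shift left, then set the low-order bit
```

For instance, the node at index 5 has parent 2, and the node at index 3 has children 6 and 7, which you can read straight off the array layout above.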
There are two kinds of binary heaps: max-heaps and min-heaps. In both kinds, the values in the nodes satisfy a heap property. In a max-heap, the max-heap property is that for every node $i$ other than the root, $$A[\text{PARENT}(i)] \geq A[i]$$ That is, the value of a node is at most the value of its parent. Thus, the largest element in a max-heap is stored at the root. A min-heap is organized in the opposite way; the min-heap property is that for every node $i$ other than the root, $$A[\text{PARENT}(i)] \leq A[i]$$ So, the smallest element in a min-heap is at the root.
The heap can be viewed as a tree. Through that lens, we define the height of a node in a heap to be the number of edges on the longest simple downward path from the node to a leaf, and we define the height of the heap itself to be the height of its root. Since a heap of $n$ elements is based on a complete binary tree, its height is $\Theta(\lg n)$. The basic operations on heaps run in time at most proportional to the height of the tree and thus take $O(\lg n)$ time.
Heaps are important because they allow us to implement things like the priority queue and $\text{HEAP-SORT}$, the latter of which combines the better running time of $\text{MERGE-SORT}$ (vs. $\text{INSERTION-SORT}$) with the in-place sorting of $\text{INSERTION-SORT}$ (vs. $\text{MERGE-SORT}$). Indeed, $\text{HEAP-SORT}$ gives us the best of both worlds (among the sorting algorithms discussed thus far): running time and space complexity! It should be noted that the $\text{HEAP-SORT}$ algorithm described here uses the max-heap; it is the direct opposite of its min-heap sister, so switching between the two is not difficult. The min-heap is often used to implement priority queues.
But first, let's pause for a moment. Is it clear why the height of a complete binary tree is $\Theta(\lg n)$? To prove that the height of a heap of $n$ elements is $\Theta(\lg n)$, we treat the heap as a complete binary tree. By definition, a heap of height $h$ has all levels $0, 1, \dots, h-1$ completely filled, while level $h$ is filled from left to right.
Proof. A complete binary tree of height $h$ has the fewest number of nodes when level $h$ contains exactly one node. In this case, the total number of nodes is the sum of a perfect binary tree of height $h-1$ plus one: $$n \geq (2^0 + 2^1 + \dots + 2^{h-1}) + 1 = (2^h - 1) + 1 = 2^h$$
The tree has the maximum number of nodes when level $h$ is completely full (making it a perfect binary tree). In this case: $$n \leq 2^0 + 2^1 + \dots + 2^h = 2^{h+1} - 1$$
Thus, for any complete binary tree of height $h$, we have the following inequality: $$2^h \leq n \leq 2^{h+1} - 1 < 2^{h+1}$$
Taking the logarithm base 2 ($\lg$) of all parts of the inequality $2^h \leq n < 2^{h+1}$ gives: $$\lg(2^h) \leq \lg n < \lg(2^{h+1})$$ $$h \leq \lg n < h + 1$$
Since $h$ must be an integer, the only value that satisfies $h \leq \lg n < h + 1$ is: $$h = \lfloor \lg n \rfloor$$
Because $h = \lfloor \lg n \rfloor$, we conclude that the height of a heap is $\Theta(\lg n)$. This confirms that the basic operations on heaps, which are proportional to the height, run in $O(\lg n)$ time.
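A quick computational check of $h = \lfloor \lg n \rfloor$ (the helper name is my own): grow $h$ until a perfect tree of height $h$ has at least $n$ nodes, and compare against the floor of the logarithm.

```python
import math

def heap_height(n):
    # Edges on the longest root-to-leaf path of a complete binary tree
    # on n nodes: a perfect tree of height h holds 2^(h+1) - 1 nodes.
    h = 0
    while (1 << (h + 1)) - 1 < n:
        h += 1
    return h

for n in range(1, 1000):
    assert heap_height(n) == math.floor(math.log2(n))
```

The loop condition is just the inequality $n \leq 2^{h+1} - 1$ from the proof, read in reverse: we stop at the first $h$ whose perfect tree is big enough to hold all $n$ nodes.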
Theorem: Any comparison sort algorithm requires $\Omega(n \lg n)$ comparisons in the worst case.
Proof. Let us model a comparison sort operating on an input sequence of $n$ elements as a full binary decision tree. In this tree, each internal node represents a comparison between two elements, and each leaf represents a possible sorted permutation of the input. Because a correct sorting algorithm must be capable of producing any valid permutation depending on the initial input, the decision tree must possess at least $n!$ reachable leaves.
Let $h$ represent the height of this decision tree, which corresponds to the maximum number of comparisons made in the worst-case scenario (the longest simple path from the root down to a leaf). A binary tree of height $h$ can have at most $2^h$ leaves. By combining this structural limitation with our requirement that the tree must contain at least $n!$ leaves, we establish the following inequality:
$$2^h \ge n!$$To solve for the height $h$, we take the base-2 logarithm of both sides. Since the logarithm is a monotonically increasing function, the direction of the inequality is preserved:
$$h \ge \lg(n!)$$Next, we expand the factorial into a summation of logarithms using the property that the logarithm of a product is the sum of the logarithms:
$$h \ge \sum_{k=1}^{n} \lg k$$To establish a tight mathematical lower bound for this summation without relying on Stirling's approximation, we can evaluate it by purposefully discarding the smaller half of the terms. Since all terms in the summation are nonnegative for $k \ge 1$, dropping the first half of the terms (from $k=1$ up to $k=\lfloor n/2 \rfloor - 1$) yields a sum that is no greater than the original:
$$\sum_{k=1}^{n} \lg k \ge \sum_{k=\lfloor n/2 \rfloor}^{n} \lg k$$Now, we can further bound this new, smaller sum from below. We do this by replacing every remaining term with the absolute smallest term in this restricted series, which is $\lg(n/2)$. Because there are at least $n/2$ terms remaining in the sum, multiplying the smallest term by the number of terms gives us:
$$\sum_{k=\lfloor n/2 \rfloor}^{n} \lg k \ge \frac{n}{2} \lg\left(\frac{n}{2}\right)$$Using the quotient rule for logarithms ($\lg(a/b) = \lg a - \lg b$), we can expand this expression:
$$\frac{n}{2} \lg\left(\frac{n}{2}\right) = \frac{n}{2}(\lg n - \lg 2) = \frac{n}{2}(\lg n - 1)$$Distributing the terms gives us $\frac{n}{2} \lg n - \frac{n}{2}$. For sufficiently large values of $n$, the linear term subtracted at the end becomes negligible compared to the dominant $\frac{n}{2} \lg n$ term. Therefore, the function grows at a rate strictly proportional to $n \lg n$.
By the transitive property, returning to our original inequality for the tree height $h$, we have formally established:
$$h = \Omega(n \lg n)$$This concludes the proof. The worst-case number of comparisons for any comparison sort is bounded below by a function proportional to $n \lg n$. $\blacksquare$
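The chain of bounds in the proof can be spot-checked numerically: $\lg(n!)$, computed as a sum of logarithms, always dominates $\frac{n}{2}\lg\frac{n}{2}$. A sketch (the helper name is my own):

```python
import math

def lg_factorial(n):
    # lg(n!) = lg 1 + lg 2 + ... + lg n, summed to avoid huge factorials.
    return sum(math.log2(k) for k in range(1, n + 1))

# The lower bound derived above holds for every n we try:
for n in range(2, 200):
    assert lg_factorial(n) >= (n / 2) * math.log2(n / 2)
```

This is only evidence, not a proof, but it is a useful way to convince yourself that no step in the derivation accidentally flipped an inequality.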
That out of the way, in order to maintain the max-heap property, we call the procedure $\text{MAX-HEAPIFY}$. Its inputs are an array $A$ and an index $i$ into the array. When it is called, $\text{MAX-HEAPIFY}$ assumes that the binary trees rooted at $\text{LEFT}(i)$ and $\text{RIGHT}(i)$ are max-heaps, but that $A[i]$ might be smaller than its children, thus violating the max-heap property. $\text{MAX-HEAPIFY}$ lets the value at $A[i]$ “float down” in the max-heap so that the subtree rooted at index $i$ obeys the max-heap property.
MAX-HEAPIFY(A, i)
1: l = LEFT(i)
2: r = RIGHT(i)
3: if l ≤ A.heap-size and A[l] > A[i]
4: largest = l
5: else largest = i
6: if r ≤ A.heap-size and A[r] > A[largest]
7: largest = r
8: if largest ≠ i
9: exchange A[i] with A[largest]
10: MAX-HEAPIFY(A, largest)
The $\text{MAX-HEAPIFY}$ procedure is an essential tool for maintaining the max-heap property by correcting potential violations at a specific node $i$. During each recursive step, the algorithm performs a three-way comparison between the parent element $A[i]$ and its immediate descendants, $A[\text{LEFT}(i)]$ and $A[\text{RIGHT}(i)]$, to identify which of the three contains the greatest value. The index of this maximum value is then stored in the variable $\text{largest}$. If the parent $A[i]$ is found to be the maximum, the local max-heap property is already satisfied for that subtree, and the procedure terminates. However, if one of the children holds a value greater than the parent, the algorithm executes a swap between $A[i]$ and $A[\text{largest}]$. While this exchange restores the required relationship at node $i$, the original value of $A[i]$ has now been displaced into a lower level of the tree. Because this value might be smaller than its new children, it could trigger a violation in the subtree now rooted at $\text{largest}$. To address this, $\text{MAX-HEAPIFY}$ is called recursively on that specific subtree, effectively "shaping" the heap from the top down until the element finds its correct, valid position.
The temporal complexity of the $\text{MAX-HEAPIFY}$ procedure, when applied to a subtree of size $n$ rooted at node $i$, is derived from two distinct components. First, there is a constant $\Theta(1)$ cost associated with performing the local comparisons between $A[i]$, $A[\text{LEFT}(i)]$, and $A[\text{RIGHT}(i)]$, followed by the potential swap of elements. Second, there is the recursive cost of performing the same operation on a subtree rooted at one of node $i$’s children. In the most lopsided case—specifically when the bottom level of the binary tree is exactly half-filled—the larger of the two subtrees can contain at most $2n/3$ elements. This imbalance leads to the following recurrence relation for the worst-case running time:
$$T(n) \leq T(2n/3) + \Theta(1)$$Applying Case 2 of the Master Theorem (where the cost of the "divide" and "combine" steps is constant relative to the subproblem size) yields a solution of $T(n) = O(\lg n)$. This logarithmic bound aligns with the height $h$ of the tree, allowing us to also characterize the running time as $O(h)$. This suggests that the cost is directly proportional to the number of levels the element must "sink" to find its valid position.
To transform an arbitrary array $A[1 \dots n]$ into a max-heap, we utilize the $\text{BUILD-MAX-HEAP}$ procedure in a bottom-up fashion. This strategy relies on the observation that the latter half of the array consists entirely of leaves, which are inherently trivial 1-element max-heaps.
Proof of Leaf Indices. To rigorously identify the leaf nodes, consider a node at index $i$ in a heap of size $n$. By the properties of a nearly complete binary tree, its left child is positioned at $2i$. A node is classified as a leaf if it possesses no children, which occurs when $2i > n$. Solving for $i$, we find the smallest integer satisfying this inequality is $i = \lfloor n/2 \rfloor + 1$. Consequently, all nodes indexed from $\lfloor n/2 \rfloor + 1$ through $n$ are leaves.
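The leaf-index claim can be checked by brute force. A small Python check, keeping the 1-based indexing of the proof:

```python
def leaf_indices(n):
    """1-based indices of nodes with no children in a heap of size n.

    A node i is a leaf exactly when its left child index 2i exceeds n.
    """
    return [i for i in range(1, n + 1) if 2 * i > n]

# The proof says the leaves are exactly floor(n/2)+1 .. n.
checks = all(leaf_indices(n) == list(range(n // 2 + 1, n + 1))
             for n in range(1, 200))
```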
Theorem: Any comparison sort algorithm requires $\Omega(n \lg n)$ comparisons in the worst case.
Proof:
Let us model a comparison sort operating on an input sequence of $n$ elements as a full binary decision tree. In this tree, each internal node represents a comparison between two elements, and each leaf represents a possible sorted permutation of the input. Because a correct sorting algorithm must be capable of producing any valid permutation depending on the initial input, the decision tree must possess at least $n!$ reachable leaves.
Let $h$ represent the height of this decision tree, which corresponds to the maximum number of comparisons made in the worst-case scenario (the longest simple path from the root down to a leaf). A binary tree of height $h$ can have at most $2^h$ leaves. By combining this structural limitation with our requirement that the tree must contain at least $n!$ leaves, we establish the following inequality:
$$2^h \ge n!$$To solve for the height $h$, we take the base-2 logarithm of both sides. Since the logarithm is a monotonically increasing function, the direction of the inequality is preserved:
$$h \ge \lg(n!)$$Next, we expand the factorial into a summation of logarithms using the property that the logarithm of a product is the sum of the logarithms:
$$h \ge \sum_{k=1}^{n} \lg k$$To establish a tight mathematical lower bound for this summation without relying on Stirling's approximation, we can evaluate it by purposefully discarding the smaller half of the terms. Since all terms in the summation are positive for $k \ge 1$, dropping the first half of the terms (from $k=1$ up to $k=\lfloor n/2 \rfloor - 1$) yields a sum that is less than or equal to the original sum:
$$\sum_{k=1}^{n} \lg k \ge \sum_{k=\lfloor n/2 \rfloor}^{n} \lg k$$Now, we can further bound this new, smaller sum from below. We do this by replacing every remaining term with the absolute smallest term in this restricted series, which is $\lg(n/2)$. Because there are at least $n/2$ terms remaining in the sum, multiplying the smallest term by the number of terms gives us:
$$\sum_{k=\lfloor n/2 \rfloor}^{n} \lg k \ge \frac{n}{2} \lg\left(\frac{n}{2}\right)$$Using the quotient rule for logarithms ($\lg(a/b) = \lg a - \lg b$), we can expand this expression:
$$\frac{n}{2} \lg\left(\frac{n}{2}\right) = \frac{n}{2}(\lg n - \lg 2) = \frac{n}{2}(\lg n - 1)$$Distributing the terms gives us $\frac{n}{2} \lg n - \frac{n}{2}$. For sufficiently large values of $n$, the linear term subtracted at the end becomes negligible compared to the dominant $\frac{n}{2} \lg n$ term. Therefore, the function grows at a rate proportional to $n \lg n$.
By the transitive property, returning to our original inequality for the tree height $h$, we have formally established:
$$h = \Omega(n \lg n)$$This concludes the proof. The worst-case number of comparisons for any comparison sort is bounded below by a function proportional to $n \lg n$. $\blacksquare$
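The chain of inequalities in this proof can also be sanity-checked numerically. A short Python check of $\lg(n!) \ge \frac{n}{2}\lg\frac{n}{2}$, computing $\lg(n!)$ as a sum of logarithms exactly as the proof does:

```python
import math

def lg_factorial(n):
    """lg(n!) computed as a sum of logarithms, as in the proof."""
    return sum(math.log2(k) for k in range(1, n + 1))

# The proof's bound: lg(n!) >= (n/2) * lg(n/2) for n >= 2.
holds = all(lg_factorial(n) >= (n / 2) * math.log2(n / 2)
            for n in range(2, 300))
```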
Since these leaf nodes already satisfy the max-heap property, $\text{BUILD-MAX-HEAP}$ iterates backward from the last non-leaf node (index $\lfloor n/2 \rfloor$) down to the root (index $1$), invoking $\text{MAX-HEAPIFY}$ at each step. This ensures that as we move upward, the subtrees rooted at each node are progressively converted into valid max-heaps.
BUILD-MAX-HEAP(A)
1: A.heap-size = A.length
2: for i = ⌊A.length/2⌋ downto 1
3: MAX-HEAPIFY(A, i)
To show why $\text{BUILD-MAX-HEAP}$ works correctly, we use the following loop invariant:
At the start of each iteration of the for loop of lines 2–3, each node $i + 1, i + 2, \dots, n$ is the root of a max-heap.
A simple upper bound on the running time of $\text{BUILD-MAX-HEAP}$ is $O(n \lg n)$, since each of the $O(n)$ calls to $\text{MAX-HEAPIFY}$ costs $O(\lg n)$, but we can derive a tighter bound. The time for $\text{MAX-HEAPIFY}$ depends on the height $h$ of the node. An $n$-element heap has height $\lfloor \lg n \rfloor$ and at most $\lceil n/2^{h+1} \rceil$ nodes of any height $h$. The total cost is:
$$\sum_{h=0}^{\lfloor \lg n \rfloor} \left\lceil \frac{n}{2^{h+1}} \right\rceil O(h) = O\left( n \sum_{h=0}^{\lfloor \lg n \rfloor} \frac{h}{2^h} \right)$$
Using the summation $\sum_{h=0}^{\infty} \frac{h}{2^h} = 2$, we find the running time is $O(n)$. Hence, we can build a max-heap from an unordered array in linear time.
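Putting the bottom-up construction together, here is a 0-based Python sketch of $\text{BUILD-MAX-HEAP}$ (the last non-leaf sits at index $\lfloor n/2 \rfloor - 1$ when indexing from 0); the input is the textbook's build example:

```python
def build_max_heap(a):
    """Convert list a into a max-heap in place (0-based indexing)."""
    def max_heapify(i):
        left, right = 2 * i + 1, 2 * i + 2
        largest = i
        if left < len(a) and a[left] > a[largest]:
            largest = left
        if right < len(a) and a[right] > a[largest]:
            largest = right
        if largest != i:
            a[i], a[largest] = a[largest], a[i]
            max_heapify(largest)

    # Iterate backward from the last non-leaf node down to the root.
    for i in range(len(a) // 2 - 1, -1, -1):
        max_heapify(i)

A = [4, 1, 3, 2, 16, 9, 10, 14, 8, 7]
build_max_heap(A)
# Every parent should now dominate both of its children.
is_max_heap = all(A[(i - 1) // 2] >= A[i] for i in range(1, len(A)))
```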
To build a min-heap, we use $\text{MIN-HEAPIFY}$. The logic is identical to $\text{MAX-HEAPIFY}$ but swaps the comparison to find the smallest value among the node and its children:
MIN-HEAPIFY(A, i)
1: l = LEFT(i)
2: r = RIGHT(i)
3: if l ≤ A.heap-size and A[l] < A[i]
4: smallest = l
5: else smallest = i
6: if r ≤ A.heap-size and A[r] < A[smallest]
7: smallest = r
8: if smallest ≠ i
9: exchange A[i] with A[smallest]
10: MIN-HEAPIFY(A, smallest)
The running time of $\text{MIN-HEAPIFY}$ is $O(h)$ or $O(\lg n)$, exactly the same as $\text{MAX-HEAPIFY}$.
Now, we will get into the $\text{HEAPSORT}$ algorithm. The heapsort algorithm starts by using $\text{BUILD-MAX-HEAP}$ to build a max-heap on the input array $A[1 \dots n]$, where $n = A.\text{length}$. Since the maximum element of the array is stored at the root $A[1]$, we can put it into its correct final position by exchanging it with $A[n]$. If we now discard node $n$ from the heap—and we can do so by simply decrementing $A.\text{heap-size}$—we observe that the children of the root remain max-heaps, but the new root element might violate the max-heap property. All we need to do to restore the max-heap property, however, is call $\text{MAX-HEAPIFY}(A, 1)$, which leaves a max-heap in $A[1 \dots n - 1]$. The heapsort algorithm then repeats this process for the max-heap of size $n - 1$ down to a heap of size $2$.
To argue correctness, let the loop invariant be:
At the start of each iteration of the for loop of lines 2–5, the subarray $A[1 \dots i]$ is a max-heap containing the $i$ smallest elements of $A[1 \dots n]$, and the subarray $A[i + 1 \dots n]$ contains the $n - i$ largest elements of $A[1 \dots n]$, sorted.
The pseudocode for the $\text{HEAPSORT}$ algorithm on an input array $A$ can be found below.
HEAPSORT(A)
1: BUILD-MAX-HEAP(A)
2: for i = A.length downto 2
3: exchange A[1] with A[i]
4: A.heap-size = A.heap-size - 1
5: MAX-HEAPIFY(A, 1)
The $\text{HEAPSORT}$ procedure takes time $O(n \lg n)$, since the call to $\text{BUILD-MAX-HEAP}$ takes time $O(n)$ and each of the $n - 1$ calls to $\text{MAX-HEAPIFY}$ takes time $O(\lg n)$.
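A self-contained Python sketch of the whole pipeline (0-based indexing; the input array is an arbitrary example):

```python
def heapsort(a):
    """In-place heapsort, following HEAPSORT(A) from the notes (0-based)."""
    def max_heapify(i, heap_size):
        left, right = 2 * i + 1, 2 * i + 2
        largest = i
        if left < heap_size and a[left] > a[largest]:
            largest = left
        if right < heap_size and a[right] > a[largest]:
            largest = right
        if largest != i:
            a[i], a[largest] = a[largest], a[i]
            max_heapify(largest, heap_size)

    for i in range(len(a) // 2 - 1, -1, -1):   # BUILD-MAX-HEAP
        max_heapify(i, len(a))
    for end in range(len(a) - 1, 0, -1):       # lines 2-5 of HEAPSORT
        a[0], a[end] = a[end], a[0]            # move the max into place
        max_heapify(0, end)                    # restore heap on a[0..end-1]

A = [5, 13, 2, 25, 7, 17, 20, 8, 4]
heapsort(A)
```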
If the array is already sorted in increasing order, $\text{BUILD-MAX-HEAP}$ takes $O(n)$ time. During $\text{HEAPSORT}$, we repeatedly swap the root with a leaf node and call $\text{MAX-HEAPIFY}$. Because the leaves are relatively small elements, $\text{MAX-HEAPIFY}$ will almost always sift the element all the way down to the bottom of the tree, taking $\Theta(\lg i)$ time per call. Thus, the total running time is $\Theta(n \lg n)$.
If the array is sorted in decreasing order, it is already a valid max-heap. $\text{BUILD-MAX-HEAP}$ is still $O(n)$. The subsequent sorting process is identical to the increasing order scenario, as we still swap small leaves to the root. The running time remains $\Theta(n \lg n)$.
In the worst case, the element swapped from the last position $i$ to the root during each iteration of the loop is small enough that $\text{MAX-HEAPIFY}$ must push it down to a leaf level. This requires $\lfloor \lg i \rfloor$ comparisons. Summing this over all $n-1$ iterations yields a total time proportional to $\sum_{i=2}^{n} \lg i = \lg(n!) = \Omega(n \lg n)$. Combined with the $O(n \lg n)$ upper bound, the worst-case running time is strictly $\Theta(n \lg n)$.
Even in the best case, when all elements are distinct, the heapsort algorithm takes $\Omega(n \lg n)$ time. When we exchange the root $A[1]$ with the leaf $A[i]$, we are placing a small element at the root. Because all elements are distinct, this element will generally have to sift down a significant portion of the tree height to restore the max-heap property. It can be shown that the sum of the depths these elements must travel bounds the best-case performance to $\Omega(n \lg n)$. Therefore, heapsort is uniformly $\Theta(n \lg n)$ across best, average, and worst cases for distinct elements.
Heapsort is nice because it is quite memory efficient as is, but there is a faster sorting algorithm out there: Quicksort. The quicksort algorithm has a worst-case running time of $\Theta(n^2)$ on an input array of $n$ numbers. Despite this slow worst-case running time, quicksort is often the best practical choice for sorting because it is remarkably efficient on the average: running $\Theta(n \lg n)$. It also has the advantage of sorting in place, and it works well even in virtual-memory environments.
Quicksort, like merge sort, applies the divide-and-conquer paradigm. Here are the three steps, applied to a typical subarray $A[p \dots r]$:
- Divide: Partition (rearrange) the array $A[p \dots r]$ into two (possibly empty) subarrays $A[p \dots q - 1]$ and $A[q + 1 \dots r]$ such that each element of $A[p \dots q - 1]$ is less than or equal to $A[q]$, which is, in turn, less than or equal to each element of $A[q + 1 \dots r]$. Compute the index $q$ as part of this partitioning procedure.
- Conquer: Sort the two subarrays $A[p \dots q - 1]$ and $A[q + 1 \dots r]$ by recursive calls to quicksort.
- Combine: Because the subarrays are already sorted, no work is needed to combine them; the entire array $A[p \dots r]$ is now sorted.
The following procedure implements quicksort:
QUICKSORT(A, p, r)
1: if p < r
2: q = PARTITION(A, p, r)
3: QUICKSORT(A, p, q - 1)
4: QUICKSORT(A, q + 1, r)
To sort an entire array $A$, the initial call is $\text{QUICKSORT}(A, 1, A.\text{length})$.
The key to the algorithm is the $\text{PARTITION}$ procedure, which rearranges the subarray $A[p \dots r]$ in place.
PARTITION(A, p, r)
1: x = A[r]
2: i = p - 1
3: for j = p to r - 1
4: if A[j] ≤ x
5: i = i + 1
6: exchange A[i] with A[j]
7: exchange A[i + 1] with A[r]
8: return i + 1
To further illustrate the operation of $\text{PARTITION}$, consider the array $A = \langle 13, 19, 9, 5, 12, 8, 7, 4, 21, 2, 6, 11 \rangle$. With the pivot $x = 11$, elements are sequentially compared and swapped. The smaller elements ($\le 11$) are accumulated at the front, resulting in the intermediate state $\langle 9, 5, 8, 7, 4, 2, 6, 12, 21, 13, 19, 11 \rangle$ just before the final swap. The pivot $11$ is then swapped with $12$, the first element of the greater-than region, correctly positioning the pivot at index $8$ and returning that index.
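This trace can be reproduced mechanically. A 0-based Python sketch of $\text{PARTITION}$, run on the same twelve-element array (index 8 in the 1-based pseudocode corresponds to index 7 here):

```python
def partition(a, p, r):
    """PARTITION from the notes, 0-based: the pivot is a[r]."""
    x = a[r]
    i = p - 1
    for j in range(p, r):
        if a[j] <= x:                     # line 4 comparison
            i += 1
            a[i], a[j] = a[j], a[i]       # grow the <= x region
    a[i + 1], a[r] = a[r], a[i + 1]       # place the pivot between regions
    return i + 1

A = [13, 19, 9, 5, 12, 8, 7, 4, 21, 2, 6, 11]
q = partition(A, 0, len(A) - 1)
```

After the call, everything left of the pivot is at most 11 and everything right of it is greater.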
$\text{PARTITION}$ always selects an element $x = A[r]$ as a pivot element around which to partition the
subarray $A[p \dots r]$. As the procedure runs, it partitions the array into four (possibly
empty) regions. At the start of each iteration of the for loop in lines 3–6, the regions
satisfy certain properties. We state these properties as a loop
invariant:
At the beginning of each iteration of the loop of lines 3–6, for any array index $k$,
- If $p \le k \le i$, then $A[k] \le x$.
- If $i + 1 \le k \le j - 1$, then $A[k] > x$.
- If $k = r$, then $A[k] = x$.
The indices between $j$ and $r - 1$ are not covered by any of the three cases, and the values in these entries have no particular relationship to the pivot $x$.
We need to show that this loop invariant is true prior to the first iteration, that each iteration of the loop maintains the invariant, and that the invariant provides a useful property to show correctness when the loop terminates.
The final two lines of $\text{PARTITION}$ finish up by swapping the pivot element with the leftmost element greater than $x$, thereby moving the pivot into its correct place in the partitioned array, and then returning the pivot’s new index. The output of $\text{PARTITION}$ now satisfies the specifications given for the divide step. In fact, it satisfies a slightly stronger condition: after line 2 of $\text{QUICKSORT}$, $A[q]$ is strictly less than every element of $A[q + 1 \dots r]$.
It is worth noting how $\text{PARTITION}$ behaves when all elements in the array $A[p \dots r]$ have the same value. In this scenario, the condition $A[j] \le x$ is always met, meaning $i$ increments continuously. The algorithm ends up swapping the pivot with itself, returning $q = r$. If we wanted $\text{PARTITION}$ to return the median index $\lfloor(p + r)/2\rfloor$ in this specific case, we would modify it to group elements strictly less than the pivot, and then handle equality explicitly to center the pivot. Furthermore, if we desired to modify $\text{QUICKSORT}$ to sort in nonincreasing order instead, we would simply change the condition in line 4 of $\text{PARTITION}$ from $A[j] \le x$ to $A[j] \ge x$.
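The all-equal case is easy to confirm in code; a short Python check using the same 0-based partition sketch (the array of 5s is an arbitrary example):

```python
def partition(a, p, r):
    x = a[r]
    i = p - 1
    for j in range(p, r):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[r] = a[r], a[i + 1]
    return i + 1

# With all-equal keys, every comparison a[j] <= x succeeds, so i tracks j
# and the pivot ends up "swapped" with itself at the last index r.
A = [5] * 8
q = partition(A, 0, len(A) - 1)
```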
The running time of $\text{PARTITION}$ on the subarray $A[p \dots r]$ is $\Theta(n)$, where $n = r - p + 1$. This linear time bound follows directly from the fact that the procedure's for loop iterates exactly $n - 1$ times, and each iteration executes only a constant amount of $O(1)$ work consisting of comparisons and variable updates.
The running time of quicksort depends on whether the partitioning is balanced or unbalanced, which in turn depends on which elements are used for partitioning. If the partitioning is balanced, the algorithm runs asymptotically as fast as merge sort. If the partitioning is unbalanced, however, it can run asymptotically as slowly as insertion sort. In this section, we shall informally investigate how quicksort performs under the assumptions of balanced versus unbalanced partitioning.
The worst-case behavior for quicksort occurs when the partitioning routine produces one subproblem with $n - 1$ elements and one with $0$ elements. (We prove this claim soon.) Let us assume that this unbalanced partitioning arises in each recursive call. The partitioning costs $\Theta(n)$ time. Since the recursive call on an array of size $0$ just returns, $T(0) = \Theta(1)$, and the recurrence for the running time is
$$T(n) = T(n - 1) + T(0) + \Theta(n) = T(n - 1) + \Theta(n)$$
Intuitively, if we sum the costs incurred at each level of the recursion, we get an arithmetic series, which evaluates to $\Theta(n^2)$. Indeed, it is straightforward to use the substitution method to prove that the recurrence $T(n) = T(n - 1) + \Theta(n)$ has the solution $T(n) = \Theta(n^2)$. We can guess that $T(n) \le cn^2$ for some constant $c$. Substituting this into the recurrence gives $T(n) \le c(n-1)^2 + dn = cn^2 - 2cn + c + dn$. For a sufficiently large choice of $c$, the $-2cn$ term dominates, proving the upper bound $O(n^2)$. A similar argument proves the lower bound $\Omega(n^2)$.
Thus, if the partitioning is maximally unbalanced at every recursive level of the algorithm, the running time is $\Theta(n^2)$. Therefore the worst-case running time of quicksort is no better than that of insertion sort. This worst-case running time occurs when the input array is already completely sorted (either in increasing or decreasing order, provided elements are distinct). If it is sorted in decreasing order, the pivot chosen at $A[r]$ is always the smallest element, perfectly isolating it from the rest and creating sizes $n-1$ and $0$. A similar degradation to $\Theta(n^2)$ occurs when all elements in the array have the same value, as discussed previously.
This has practical implications. For instance, converting time-of-transaction ordering to check-number ordering involves sorting an almost-sorted input. Because $\text{QUICKSORT}$ performs poorly on nearly sorted data (skewing towards $\Theta(n^2)$), the procedure $\text{INSERTION-SORT}$ would actually tend to beat $\text{QUICKSORT}$ on this specific problem, as it runs in $O(n)$ time when few elements are out of order.
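The gap between sorted and random inputs is easy to observe empirically. This Python sketch counts line-4 comparisons for last-element-pivot quicksort; the input size and random seed are arbitrary choices:

```python
import random

def quicksort_count(a):
    """Quicksort with last-element pivot, counting pivot comparisons."""
    count = 0
    def sort(p, r):
        nonlocal count
        if p < r:
            x, i = a[r], p - 1
            for j in range(p, r):
                count += 1                 # the line-4 comparison
                if a[j] <= x:
                    i += 1
                    a[i], a[j] = a[j], a[i]
            a[i + 1], a[r] = a[r], a[i + 1]
            q = i + 1
            sort(p, q - 1)
            sort(q + 1, r)
    sort(0, len(a) - 1)
    return count

n = 300
sorted_cost = quicksort_count(list(range(n)))   # already sorted: worst case
rng = random.Random(1)
shuffled = list(range(n))
rng.shuffle(shuffled)
random_cost = quicksort_count(shuffled)
```

On the sorted input every partition is maximally unbalanced, so the count is exactly $\sum_{k=1}^{n-1} k = n(n-1)/2$; the shuffled input comes in far lower, near $n \lg n$.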
In the most even possible split, $\text{PARTITION}$ produces two subproblems, each of size no more than $n/2$, since one is of size $\lfloor n/2 \rfloor$ and one of size $\lceil n/2 \rceil - 1$. In this case, quicksort runs much faster. The recurrence for the running time is then
$$T(n) = 2T(n/2) + \Theta(n)$$
where we tolerate the sloppiness from ignoring the floor and ceiling and from subtracting 1. By case 2 of the master theorem, this recurrence has the solution $T(n) = \Theta(n \lg n)$. By equally balancing the two sides of the partition at every level of the recursion, we get an asymptotically faster algorithm.
The average-case running time of quicksort is much closer to the best case than to the worst case, as the analyses later in these notes will show. The key to understanding why is to understand how the balance of the partitioning is reflected in the recurrence that describes the running time.
Suppose, for example, that the partitioning algorithm always produces a 9-to-1 proportional split, which at first blush seems quite unbalanced. We then obtain the recurrence
$$T(n) = T(9n/10) + T(n/10) + cn$$
on the running time of quicksort, where we have explicitly included the constant $c$ hidden in the $\Theta(n)$ term. Notice that every level of the tree has cost $cn$, until the recursion reaches a boundary condition at depth $\log_{10} n = \Theta(\lg n)$, and then the levels have cost at most $cn$. The recursion terminates at depth $\log_{10/9} n = \Theta(\lg n)$. The total cost of quicksort is therefore $O(n \lg n)$. Thus, with a 9-to-1 proportional split at every level of recursion, which intuitively seems quite unbalanced, quicksort runs in $O(n \lg n)$ time—asymptotically the same as if the split were right down the middle.
Indeed, even a 99-to-1 split yields an $O(n \lg n)$ running time. In fact, any split of constant proportionality $1 - \alpha$ to $\alpha$, where $0 < \alpha \le 1/2$, yields a recursion tree of minimum depth $-\lg n / \lg \alpha$ and maximum depth $-\lg n / \lg(1 - \alpha)$, both of which are $\Theta(\lg n)$; the cost at each level is $O(n)$. The running time is therefore $O(n \lg n)$ whenever the split has constant proportionality.
To develop a clear notion of the randomized behavior of quicksort, we must make an assumption about how frequently we expect to encounter the various inputs. The behavior of quicksort depends on the relative ordering of the values in the array elements given as the input, and not by the particular values in the array. As in our probabilistic analysis of the hiring problem, we will assume for now that all permutations of the input numbers are equally likely.
When we run quicksort on a random input array, the partitioning is highly unlikely to happen in the same way at every level, as our informal analysis has assumed. We expect that some of the splits will be reasonably well balanced and that some will be fairly unbalanced. For example, for any constant $0 < \alpha \le 1/2$, the probability is approximately $1 - 2\alpha$ that on a random input array, $\text{PARTITION}$ produces a split more balanced than $1 - \alpha$ to $\alpha$. This is because the split is well-balanced if the pivot chosen happens to fall within the middle $1 - 2\alpha$ fraction of the sorted elements. Consequently, about 80 percent of the time $\text{PARTITION}$ produces a split that is more balanced than 9 to 1, and about 20 percent of the time it produces a split that is less balanced than 9 to 1.
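Under the reading that a split is "more balanced than $1-\alpha$ to $\alpha$" when the pivot's rank falls in the middle $1 - 2\alpha$ fraction, the probability can be tabulated directly; the choice $n = 1000$ is an arbitrary illustration:

```python
# With a uniformly random pivot over n distinct elements, the split is
# more balanced than (1 - alpha) to alpha exactly when the pivot's rank
# lies in the middle (1 - 2*alpha) fraction of the sorted order.
n = 1000
alpha = 0.1                               # a 9-to-1 split
cutoff = int(alpha * n)
middle_ranks = [r for r in range(1, n + 1) if cutoff < r <= n - cutoff]
probability = len(middle_ranks) / n       # fraction of "good" pivot ranks
```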
In the average case, $\text{PARTITION}$ produces a mix of “good” and “bad” splits. In a recursion tree for an average-case execution of $\text{PARTITION}$, the good and bad splits are distributed randomly throughout the tree. Suppose, for the sake of intuition, that the good and bad splits alternate levels in the tree, and that the good splits are best-case splits and the bad splits are worst-case splits. Consider two levels of a recursion tree for quicksort. The partitioning at the root costs $n$ and produces a “bad” split: two subarrays of sizes $0$ and $n - 1$. At the next level, the subarray of size $n - 1$ undergoes best-case partitioning into subarrays of size $(n - 1)/2 - 1$ and $(n - 1)/2$. Let’s assume that the boundary-condition cost is $1$ for the subarray of size $0$.
The combination of the bad split followed by the good split produces three subarrays of sizes $0$, $(n - 1)/2 - 1$, and $(n - 1)/2$ at a combined partitioning cost of $\Theta(n) + \Theta(n - 1) = \Theta(n)$. Certainly, this situation is no worse than having a single level of a recursion tree that is very well balanced: in both scenarios, the partitioning cost is $\Theta(n)$, yet the subproblems remaining after the bad-then-good combination are no larger than those remaining after a single good split. Intuitively, the $\Theta(n - 1)$ cost of the bad split can be absorbed into the $\Theta(n)$ cost of the good split, and the resulting split is good. Thus, the running time of quicksort, when levels alternate between good and bad splits, is like the running time for good splits alone: still $O(n \lg n)$, but with a slightly larger constant hidden by the $O$-notation.
In exploring the average-case behavior of quicksort, we have made an assumption that all permutations of the input numbers are equally likely. In an engineering situation, however, we cannot always expect this assumption to hold. Hope is not lost, though: we can sometimes add randomization to an algorithm in order to obtain good expected performance over all inputs. Many people regard the resulting randomized version of quicksort as the sorting algorithm of choice for large enough inputs.
We analyze the expected running time of a randomized algorithm and not its worst-case running time because the worst-case behavior is no longer dependent on any specific adversarial input, but solely on an exceptionally rare sequence of poor random number choices, making the expected bound a much more realistic metric of performance.
Earlier, we randomized our algorithm by explicitly permuting the input. We could do so for quicksort also, but a different randomization technique, called random sampling, yields a simpler analysis. Instead of always using $A[r]$ as the pivot, we will select a randomly chosen element from the subarray $A[p \dots r]$. We do so by first exchanging element $A[r]$ with an element chosen at random from $A[p \dots r]$. By randomly sampling the range $p, \dots, r$, we ensure that the pivot element $x = A[r]$ is equally likely to be any of the $r - p + 1$ elements in the subarray. Because we randomly choose the pivot element, we expect the split of the input array to be reasonably well balanced on average.
The changes to $\text{PARTITION}$ and $\text{QUICKSORT}$ are small. In the new partition procedure, we simply implement the swap before actually partitioning:
RANDOMIZED-PARTITION(A, p, r)
1: i = RANDOM(p, r)
2: exchange A[r] with A[i]
3: return PARTITION(A, p, r)
The new quicksort calls $\text{RANDOMIZED-PARTITION}$ in place of $\text{PARTITION}$:
RANDOMIZED-QUICKSORT(A, p, r)
1: if p < r
2: q = RANDOMIZED-PARTITION(A, p, r)
3: RANDOMIZED-QUICKSORT(A, p, q - 1)
4: RANDOMIZED-QUICKSORT(A, q + 1, r)
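A Python sketch of both procedures together (0-based indexing; the fixed seed is only so the demo is reproducible):

```python
import random

rng = random.Random(42)   # seeded so the demo is reproducible

def randomized_partition(a, p, r):
    i = rng.randint(p, r)             # choose a random pivot index
    a[r], a[i] = a[i], a[r]           # move it into the last slot
    x, i = a[r], p - 1
    for j in range(p, r):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[r] = a[r], a[i + 1]
    return i + 1

def randomized_quicksort(a, p, r):
    if p < r:
        q = randomized_partition(a, p, r)
        randomized_quicksort(a, p, q - 1)
        randomized_quicksort(a, q + 1, r)

A = [13, 19, 9, 5, 12, 8, 7, 4, 21, 2, 6, 11]
randomized_quicksort(A, 0, len(A) - 1)
```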
When $\text{RANDOMIZED-QUICKSORT}$ runs, the number of calls made to the random-number generator $\text{RANDOM}$ scales with the number of partitioning steps. In the worst case (where the split is $n-1$ and $0$), it makes $n - 1$ calls, which is $\Theta(n)$. In the best case (perfectly balanced splits), the recursion tree still has $\Theta(n)$ internal nodes—one per call to $\text{RANDOMIZED-PARTITION}$—so it again makes $\Theta(n)$ calls.
Above, we gave some intuition for the worst-case behavior of quicksort and for why we expect it to run quickly. Now, we analyze the behavior of quicksort more rigorously. We begin with a worst-case analysis, which applies to either $\text{QUICKSORT}$ or $\text{RANDOMIZED-QUICKSORT}$, and conclude with an analysis of the expected running time of $\text{RANDOMIZED-QUICKSORT}$.
We saw that a worst-case split at every level of recursion in quicksort produces a $\Theta(n^2)$ running time, which, intuitively, is the worst-case running time of the algorithm. We now prove this assertion.
Using the substitution method, we can show that the running time of quicksort is $O(n^2)$. Let $T(n)$ be the worst-case time for the procedure $\text{QUICKSORT}$ on an input of size $n$. We have the recurrence
$$T(n) = \max_{0 \le q \le n - 1} (T(q) + T(n - q - 1)) + \Theta(n)$$
where the parameter $q$ ranges from $0$ to $n - 1$ because the procedure $\text{PARTITION}$ produces two subproblems with total size $n - 1$. We guess that $T(n) \le cn^2$ for some constant $c$. Substituting this guess into the recurrence, we obtain
$$T(n) \le \max_{0 \le q \le n - 1} (cq^2 + c(n - q - 1)^2) + \Theta(n)$$ $$= c \cdot \max_{0 \le q \le n - 1} (q^2 + (n - q - 1)^2) + \Theta(n)$$
The expression $q^2 + (n - q - 1)^2$ achieves its maximum over the parameter’s range $0 \le q \le n - 1$ at an endpoint. To verify this claim, note that the second derivative of the expression with respect to $q$ is $4$, which is strictly positive. A positive second derivative implies the function is convex, meaning its maximum on a closed interval must occur at a boundary ($q = 0$ or $q = n - 1$). This observation gives us the bound $\max_{0 \le q \le n - 1} (q^2 + (n - q - 1)^2) \le (n - 1)^2 = n^2 - 2n + 1$. Continuing with our bounding of $T(n)$, we obtain
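The endpoint claim can also be verified exhaustively for small $n$:

```python
def f(q, n):
    """The expression q^2 + (n - q - 1)^2 from the worst-case analysis."""
    return q * q + (n - q - 1) ** 2

# For each n, the maximum over q in 0..n-1 should sit at an endpoint,
# where the expression equals (n - 1)^2.
endpoint_max = all(max(f(q, n) for q in range(n)) == (n - 1) ** 2
                   for n in range(1, 100))
```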
$$T(n) \le cn^2 - c(2n - 1) + \Theta(n) \le cn^2$$
since we can pick the constant $c$ large enough so that the $c(2n - 1)$ term dominates the $\Theta(n)$ term. Thus, $T(n) = O(n^2)$. Earlier, we saw a specific case in which quicksort takes $\Omega(n^2)$ time: when partitioning is maximally unbalanced. Alternatively, if we substitute an $\Omega(n^2)$ lower-bound guess into the recurrence, the boundary cases once again dictate that $T(n)$ grows at least as fast as $c'n^2$, confirming $T(n) = \Omega(n^2)$. Thus, the worst-case running time of quicksort is tightly $\Theta(n^2)$. By contrast, quicksort's best-case running time, achieved when $q = \lfloor(n-1)/2\rfloor$ at every level, satisfies $T(n) = 2T(n/2) + \Theta(n)$, giving $\Omega(n \lg n)$.
We have already seen the intuition behind why the expected running time of $\text{RANDOMIZED-QUICKSORT}$ is $O(n \lg n)$: if, in each level of recursion, the split induced by $\text{RANDOMIZED-PARTITION}$ puts any constant fraction of the elements on one side of the partition, then the recursion tree has depth $\Theta(\lg n)$, and $O(n)$ work is performed at each level. Even if we add a few new levels with the most unbalanced split possible between these levels, the total time remains $O(n \lg n)$. We can analyze the expected running time of $\text{RANDOMIZED-QUICKSORT}$ precisely by first understanding how the partitioning procedure operates and then using this understanding to derive an $O(n \lg n)$ bound on the expected running time. This upper bound on the expected running time, combined with the $\Omega(n \lg n)$ best-case bound, yields a $\Theta(n \lg n)$ expected running time. We assume throughout that the values of the elements being sorted are distinct.
The $\text{QUICKSORT}$ and $\text{RANDOMIZED-QUICKSORT}$ procedures differ only in how they select pivot elements; they are the same in all other respects. We can therefore couch our analysis of $\text{RANDOMIZED-QUICKSORT}$ by discussing the $\text{QUICKSORT}$ and $\text{PARTITION}$ procedures, but with the assumption that pivot elements are selected randomly from the subarray passed to $\text{RANDOMIZED-PARTITION}$.
The running time of $\text{QUICKSORT}$ is dominated by the time spent in the $\text{PARTITION}$ procedure. Each time the $\text{PARTITION}$ procedure is called, it selects a pivot element, and this element is never included in any future recursive calls to $\text{QUICKSORT}$ and $\text{PARTITION}$. Thus, there can be at most $n$ calls to $\text{PARTITION}$ over the entire execution of the quicksort algorithm. One call to $\text{PARTITION}$ takes $O(1)$ time plus an amount of time that is proportional to the number of iterations of the for loop in lines 3–6. Each iteration of this for loop performs a comparison in line 4, comparing the pivot element to another element of the array $A$. Therefore, if we can count the total number of times that line 4 is executed, we can bound the total time spent in the for loop during the entire execution of $\text{QUICKSORT}$.
Lemma Let $X$ be the number of comparisons performed in line 4 of $\text{PARTITION}$ over the entire execution of $\text{QUICKSORT}$ on an $n$-element array. Then the running time of $\text{QUICKSORT}$ is $O(n + X)$.
Proof. By the discussion above, the algorithm makes at most $n$ calls to $\text{PARTITION}$, each of which does a constant amount of work and then executes the for loop some number of times. Each iteration of the for loop executes line 4 exactly once, so the total work performed across all for loops is proportional to $X$. Summing the $O(1)$ cost per call over at most $n$ calls and adding the $O(X)$ total loop cost gives a running time of $O(n + X)$. $\blacksquare$
Our goal, therefore, is to compute $X$, the total number of comparisons performed in all calls to $\text{PARTITION}$. We will not attempt to analyze how many comparisons are made in each call to $\text{PARTITION}$. Rather, we will derive an overall bound on the total number of comparisons. To do so, we must understand when the algorithm compares two elements of the array and when it does not. For ease of analysis, we rename the elements of the array $A$ as $z_1, z_2, \dots, z_n$, with $z_i$ being the $i$th smallest element. We also define the set $Z_{ij} = \{z_i, z_{i+1}, \dots, z_j\}$ to be the set of elements between $z_i$ and $z_j$, inclusive.
When does the algorithm compare $z_i$ and $z_j$? To answer this question, we first observe that each pair of elements is compared at most once. Why? Elements are compared only to the pivot element and, after a particular call of $\text{PARTITION}$ finishes, the pivot element used in that call is never again compared to any other elements.
Our analysis will use indicator random variables. We define
$$X_{ij} = I\{z_i \text{ is compared to } z_j\}$$
where we are considering whether the comparison takes place at any time during the execution of the algorithm, not just during one iteration or one call of $\text{PARTITION}$. Since each pair is compared at most once, we can easily characterize the total number of comparisons performed by the algorithm:
$$X = \sum_{i=1}^{n-1} \sum_{j=i+1}^n X_{ij}$$
Taking expectations of both sides, and then using linearity of expectation and the Lemma we proved above, we obtain
$$E[X] = E\left[ \sum_{i=1}^{n-1} \sum_{j=i+1}^n X_{ij} \right] = \sum_{i=1}^{n-1} \sum_{j=i+1}^n E[X_{ij}] = \sum_{i=1}^{n-1} \sum_{j=i+1}^n \text{Pr}\{z_i \text{ is compared to } z_j\} $$
It remains to compute $\text{Pr}\{z_i \text{ is compared to } z_j\}$. Our analysis assumes that the $\text{RANDOMIZED-PARTITION}$ procedure chooses each pivot randomly and independently. Let us think about when two items are not compared. Consider an input to quicksort of the numbers 1 through 10 (in any order), and suppose that the first pivot element is 7. Then the first call to $\text{PARTITION}$ separates the numbers into two sets: $\{1, 2, 3, 4, 5, 6\}$ and $\{8, 9, 10\}$. In doing so, the pivot element 7 is compared to all other elements, but no number from the first set (e.g., 2) is or ever will be compared to any number from the second set (e.g., 9).
In general, because we assume that element values are distinct, once a pivot $x$ is chosen with $z_i < x < z_j$, we know that $z_i$ and $z_j$ cannot be compared at any subsequent time. If, on the other hand, $z_i$ is chosen as a pivot before any other item in $Z_{ij}$, then $z_i$ will be compared to each item in $Z_{ij}$, except for itself. Similarly, if $z_j$ is chosen as a pivot before any other item in $Z_{ij}$, then $z_j$ will be compared to each item in $Z_{ij}$, except for itself. In our example, the values 7 and 9 are compared because 7 is the first item from $Z_{7,9}$ to be chosen as a pivot. In contrast, 2 and 9 will never be compared because the first pivot element chosen from $Z_{2,9}$ is 7. Thus, $z_i$ and $z_j$ are compared if and only if the first element to be chosen as a pivot from $Z_{ij}$ is either $z_i$ or $z_j$.
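This characterization lends itself to a quick empirical sanity check. Below is a minimal Python sketch (an assumed simplified recursive quicksort, not the in-place $\text{RANDOMIZED-PARTITION}$) that estimates the probability that $z_2$ and $z_9$ are compared when sorting the numbers 1 through 10; the analysis predicts $2/(9-2+1) = 1/4$.

```python
import random

random.seed(42)

def pair_compared(vals, zi, zj):
    """Simulate randomized quicksort on vals and report whether zi and zj
    are ever compared. Elements are compared only to the pivot, so zi and
    zj are compared iff a pivot drawn while both are still in the same
    sublist is one of them."""
    if len(vals) <= 1:
        return False
    pivot = random.choice(vals)
    if pivot in (zi, zj) and zi in vals and zj in vals:
        return True
    left = [v for v in vals if v < pivot]      # separated pairs can never
    right = [v for v in vals if v > pivot]     # be compared later
    return pair_compared(left, zi, zj) or pair_compared(right, zi, zj)

trials = 20_000
hits = sum(pair_compared(list(range(1, 11)), 2, 9) for _ in range(trials))
print(hits / trials)  # close to 0.25
```

The simulation separates the pair as soon as an intermediate pivot falls between them, mirroring the argument above.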
We now compute the probability that this event occurs. Prior to the point at which an element from $Z_{ij}$ has been chosen as a pivot, the whole set $Z_{ij}$ is together in the same partition. Therefore, any element of $Z_{ij}$ is equally likely to be the first one chosen as a pivot. Because the set $Z_{ij}$ has $j - i + 1$ elements, and because pivots are chosen randomly and independently, the probability that any given element is the first one chosen as a pivot is $1/(j - i + 1)$. Thus, we have
$$\text{Pr}\{z_i \text{ is compared to } z_j\} = \text{Pr}\{z_i \text{ or } z_j \text{ is first pivot chosen from } Z_{ij}\}$$ $$= \text{Pr}\{z_i \text{ is first pivot chosen from } Z_{ij}\} + \text{Pr}\{z_j \text{ is first pivot chosen from } Z_{ij}\}$$ $$= \frac{1}{j - i + 1} + \frac{1}{j - i + 1} = \frac{2}{j - i + 1}$$
The second line follows because the two events are mutually exclusive. Combining both equations, we get
$$E[X] = \sum_{i=1}^{n-1} \sum_{j=i+1}^n \frac{2}{j - i + 1}$$
We can evaluate this sum using a change of variables ($k = j - i$) and the bound on the harmonic series in equation (A.7):
$$E[X] = \sum_{i=1}^{n-1} \sum_{j=i+1}^n \frac{2}{j - i + 1} = \sum_{i=1}^{n-1} \sum_{k=1}^{n-i} \frac{2}{k + 1} < \sum_{i=1}^{n-1} \sum_{k=1}^n \frac{2}{k} = \sum_{i=1}^{n-1} O(\lg n) = O(n \lg n) $$
A more careful evaluation of the harmonic summation shows that it is also bounded below by a function in $\Omega(n \lg n)$, so the expected number of comparisons is in fact $\Theta(n \lg n)$. Thus we conclude that, using $\text{RANDOMIZED-PARTITION}$, the expected running time of quicksort is $O(n \lg n)$ when element values are distinct.
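As a sanity check on the bound, we can evaluate the double sum exactly for a moderate $n$ and confirm it sits between $n \ln n$ and $2n \ln n$ (a minimal sketch; natural logarithms are used here, which only shifts the constant factor).

```python
import math

# Evaluate E[X] = sum_{i=1}^{n-1} sum_{j=i+1}^{n} 2/(j-i+1) exactly
# and compare it against linearithmic bounds on both sides.
n = 1000
ex = sum(2.0 / (j - i + 1) for i in range(1, n) for j in range(i + 1, n + 1))
print(ex)  # roughly 2 n ln n minus lower-order terms
assert n * math.log(n) < ex < 2 * n * math.log(n)
```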
In practice, we can further improve the running time of quicksort by taking advantage of the fast running time of insertion sort when its input is “nearly” sorted. When quicksort is called on a subarray with fewer than $k$ elements, we let it simply return without sorting the subarray. After the top-level call to quicksort returns, we run insertion sort on the entire array to finish the sorting process. This hybrid algorithm runs in $O(nk + n \lg(n/k))$ expected time: the truncated recursion tree has depth $\lg(n/k)$, taking $O(n \lg(n/k))$ time, followed by $O(nk)$ time for insertion sort. In practice, $k$ should be chosen empirically; it depends on machine-specific factors such as cache behavior, and values between 10 and 20 are typical. Another practical enhancement modifies $\text{PARTITION}$ to randomly pick three elements and partition about their median. This median-of-three rule makes an unbalanced $\alpha$-to-$(1-\alpha)$ split much less likely, since the median of three samples is biased toward the middle of the array.
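The hybrid scheme described above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the cutoff of 16 is an arbitrary assumption within the commonly cited range, and the partition shown is a simple Lomuto-style one with a random pivot.

```python
import random

CUTOFF = 16  # hypothetical threshold; the best value is machine-dependent

def insertion_sort(a):
    """Standard insertion sort; fast when a is nearly sorted."""
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def _qs(a, lo, hi):
    # Stop recursing once the subarray is small; leave it for insertion sort.
    if hi - lo + 1 <= CUTOFF:
        return
    p = random.randint(lo, hi)
    a[lo], a[p] = a[p], a[lo]
    pivot, i = a[lo], lo
    for j in range(lo + 1, hi + 1):   # Lomuto-style partition
        if a[j] < pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[lo], a[i] = a[i], a[lo]
    _qs(a, lo, i - 1)
    _qs(a, i + 1, hi)

def hybrid_sort(a):
    _qs(a, 0, len(a) - 1)
    insertion_sort(a)  # the array is now "nearly" sorted
    return a

data = [random.randrange(1000) for _ in range(500)]
expected = sorted(data)
hybrid_sort(data)
print(data == expected)  # True
```

Once every unsorted run has length below the cutoff, the final insertion-sort pass does at most $O(nk)$ work, matching the analysis above.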
Before we go, the version of $\text{PARTITION}$ given in this lecture is not the original partitioning algorithm. Here is the original partition algorithm, which is due to C. A. R. Hoare:
HOARE-PARTITION(A, p, r)
1: x = A[p]
2: i = p - 1
3: j = r + 1
4: while TRUE
5: repeat
6: j = j - 1
7: until A[j] ≤ x
8: repeat
9: i = i + 1
10: until A[i] ≥ x
11: if i < j
12: exchange A[i] with A[j]
13: else return j
To demonstrate the operation of $\text{HOARE-PARTITION}$, let us trace it on the array $A = \langle 13, 19, 9, 5, 12, 8, 7, 4, 11, 2, 6, 21 \rangle$ with the pivot $x=13$ ($A[p]$). In the first pass of the while loop, $j$ scans left until it hits $6$, and $i$ scans right until it hits $13$. Since $i < j$, we swap them, yielding $\langle 6, 19, 9, 5, 12, 8, 7, 4, 11, 2, 13, 21 \rangle$. In the second pass, $j$ scans down to $2$, and $i$ scans up to $19$. We swap them, yielding $\langle 6, 2, 9, 5, 12, 8, 7, 4, 11, 19, 13, 21 \rangle$. In the third pass, $j$ scans down to $11$, and $i$ scans up to $19$. At this point, $i$ ($10$) is no longer less than $j$ ($9$). The loop terminates and returns $j = 9$.
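For concreteness, here is a direct Python transcription of $\text{HOARE-PARTITION}$. It is 0-indexed, so the trace's return value $j = 9$ corresponds to index 8 here.

```python
def hoare_partition(a, p, r):
    """Hoare's partition scheme (0-indexed): returns j such that every
    element of a[p..j] is <= every element of a[j+1..r]."""
    x = a[p]                 # pivot is the first element
    i, j = p - 1, r + 1
    while True:
        j -= 1
        while a[j] > x:      # repeat ... until A[j] <= x
            j -= 1
        i += 1
        while a[i] < x:      # repeat ... until A[i] >= x
            i += 1
        if i < j:
            a[i], a[j] = a[j], a[i]
        else:
            return j

A = [13, 19, 9, 5, 12, 8, 7, 4, 11, 2, 6, 21]
q = hoare_partition(A, 0, len(A) - 1)
print(q, A)  # 8 [6, 2, 9, 5, 12, 8, 7, 4, 11, 19, 13, 21]
```

Note that, unlike the lecture's $\text{PARTITION}$, Hoare's scheme does not place the pivot at its final position; it only guarantees that the two sides of the split index are correctly separated.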
By now, we have introduced several sorting algorithms that can be given a proper name: they are the comparison sorts, meaning the sorted order they determine is based only on comparisons between the input elements. That makes sense! All this time, through scanning and dividing and conquering, what we have done thus far is in essence pick up two things, compare them, and put them in their place. We also saw earlier that most of these algorithms (minus $\text{INSERTION-SORT}$) run in $O(n \lg n)$ time (well, $\text{MERGE-SORT}$ and $\text{HEAP-SORT}$ do in the worst case and $\text{QUICKSORT}$ does in its average case). It turns out that there is another interesting property of comparison sort algorithms: they all require $\Omega(n \lg n)$ comparisons in the worst case to sort $n$ elements. This means $\text{MERGE-SORT}$ and $\text{HEAP-SORT}$ are asymptotically optimal, and no comparison sort can beat them by more than a constant factor.
In a comparison sort algorithm, we determine the final ordering of an input sequence exclusively by evaluating elements against one another. When given two elements, we perform a test to determine their relative order, utilizing only basic comparisons without inspecting the raw values of the elements or gathering information through other means. For the sake of theoretical analysis, we can assume without loss of generality that all input elements are distinct. Because we assume no duplicate values exist, equality checks are entirely unnecessary, and we can model the algorithm using only greater-than or less-than comparisons.
We can abstractly visualize any comparison sort using a decision tree. This will help later on when we prove that property of comparison-based sorting algorithms. A decision tree is a full binary tree where every internal node represents a specific comparison between two elements, and each leaf node represents a final, sorted permutation of the original input. When the sorting algorithm executes, it traces a specific, simple path from the root node down to a single leaf. Because a strictly correct sorting algorithm must be capable of producing any valid ordering depending on the input, all $n!$ possible permutations of an $n$-element sequence must appear as reachable leaves at the bottom of this decision tree.
This tree structure provides profound insights into the theoretical limits of sorting. For instance, we can determine the absolute best-case scenario by looking for the smallest possible depth of a leaf in such a decision tree. For an input of size $n$, the minimum possible depth is $n-1$. This is because, at an absolute minimum, the algorithm must verify that every single element is correctly ordered relative to its adjacent neighbor (such as when validating an already sorted array). Doing this requires exactly $n-1$ consecutive comparisons, establishing a hard floor for the shortest path through the tree.
Conversely, the worst-case running time of the algorithm corresponds to the height of the decision tree, which is the longest path from the root to any reachable leaf. Because the tree must contain at least $n!$ leaves to account for all possible permutations, and a binary tree of height $h$ can house a maximum of $2^h$ leaves, we arrive at the fundamental inequality $n! \le 2^h$. Taking the base-2 logarithm of both sides simplifies this to $h \ge \lg(n!)$.
To rigorously establish asymptotically tight bounds on this height without relying on Stirling's approximation, we can evaluate the summation of $\lg(n!)$. Since $$\lg(n!) = \sum_{k=1}^n \lg k$$ we can easily deduce the upper bound by replacing all terms with the maximum value: $$\sum_{k=1}^n \lg k \le \sum_{k=1}^n \lg n = n \lg n$$ meaning it is $O(n \lg n)$.
For the lower bound, we can discard the smaller half of the terms to establish a baseline: $$\sum_{k=1}^n \lg k \ge \sum_{k=n/2}^n \lg k \ge \sum_{k=n/2}^n \lg(n/2) = \frac{n}{2}(\lg n - 1)$$ This proves that $\lg(n!)$ is bounded from below by a function proportional to $n \lg n$, meaning it is $\Omega(n \lg n)$. Because both bounds match, $\lg(n!) = \Theta(n \lg n)$, and so any comparison sort must make $\Omega(n \lg n)$ comparisons in the worst case. Consequently, algorithms like heapsort and merge sort, which run in $O(n \lg n)$ time, are asymptotically optimal comparison sorts.
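These two bounds are easy to verify numerically; the sketch below checks the sandwich $(n/2)(\lg n - 1) \le \lg(n!) \le n \lg n$ for several values of $n$.

```python
import math

# Compute lg(n!) as a sum of logarithms and compare it against the
# lower and upper bounds derived above.
for n in (10, 100, 1000, 10**5):
    lg_fact = sum(math.log2(k) for k in range(1, n + 1))
    lower = (n / 2) * (math.log2(n) - 1)
    upper = n * math.log2(n)
    assert lower <= lg_fact <= upper
    print(n, round(lower), round(lg_fact), round(upper))
```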
It helps to see this argument laid out as a rigorous, formal proof. It follows below.
Theorem: Any comparison sort algorithm requires $\Omega(n \lg n)$ comparisons in the worst case.
Proof: Let us model a comparison sort operating on an input sequence of $n$ elements as a full binary decision tree. In this tree, each internal node represents a comparison between two elements, and each leaf represents a possible sorted permutation of the input. Because a correct sorting algorithm must be capable of producing any valid permutation depending on the initial input, the decision tree must possess at least $n!$ reachable leaves.
Let $h$ represent the height of this decision tree, which corresponds to the maximum number of comparisons made in the worst-case scenario (the longest simple path from the root down to a leaf). A binary tree of height $h$ can have at most $2^h$ leaves. By combining this structural limitation with our requirement that the tree must contain at least $n!$ leaves, we establish the following inequality:
$$2^h \ge n!$$To solve for the height $h$, we take the base-2 logarithm of both sides. Since the logarithm is a monotonically increasing function, the direction of the inequality is preserved:
$$h \ge \lg(n!)$$Next, we expand the factorial into a summation of logarithms using the property that the logarithm of a product is the sum of the logarithms:
$$h \ge \sum_{k=1}^{n} \lg k$$To establish a tight mathematical lower bound for this summation without relying on Stirling's approximation, we can evaluate it by purposefully discarding the smaller half of the terms. Since all terms in the summation are positive for $k \ge 1$, dropping the first half of the terms (from $k=1$ up to $k=\lfloor n/2 \rfloor - 1$) yields a sum that is strictly less than or equal to the original sum:
$$\sum_{k=1}^{n} \lg k \ge \sum_{k=\lfloor n/2 \rfloor}^{n} \lg k$$Now, we can further bound this new, smaller sum from below. We do this by replacing every remaining term with the absolute smallest term in this restricted series, which is $\lg(n/2)$. Because there are at least $n/2$ terms remaining in the sum, multiplying the smallest term by the number of terms gives us:
$$\sum_{k=\lfloor n/2 \rfloor}^{n} \lg k \ge \frac{n}{2} \lg\left(\frac{n}{2}\right)$$Using the quotient rule for logarithms ($\lg(a/b) = \lg a - \lg b$), we can expand this expression:
$$\frac{n}{2} \lg\left(\frac{n}{2}\right) = \frac{n}{2}(\lg n - \lg 2) = \frac{n}{2}(\lg n - 1)$$Distributing the terms gives us $\frac{n}{2} \lg n - \frac{n}{2}$. For sufficiently large values of $n$, the subtracted linear term is negligible compared to the dominant $\frac{n}{2} \lg n$ term, so the sum is bounded below by a function proportional to $n \lg n$.
By the transitive property, returning to our original inequality for the tree height $h$, we have formally established:
$$h = \Omega(n \lg n) \quad \blacksquare$$

One might wonder if it is possible to design a comparison sort that bypasses this worst-case limit and runs in linear time, $O(n)$, for at least a significant fraction of the possible inputs. Mathematically, this is impossible. If an algorithm were to run in a linear time of $cn$ (where $c$ is a constant) for a fraction $f$ of the $n!$ total inputs, the decision tree would need to have $f \cdot n!$ leaves situated at a depth of $cn$ or less. Because a tree of depth $cn$ can hold at most $2^{cn}$ leaves, the inequality $f \cdot n! \le 2^{cn}$ must hold true. By taking the logarithm of both sides, we get $\lg f + \lg(n!) \le cn$. Since we know $\lg(n!)$ grows at a rate of $\Theta(n \lg n)$, this inequality quickly falls apart for large values of $n$. This holds true whether the fraction $f$ is $1/2$, $1/n$, or even $1/2^n$. For example, if $f = 1/2^n$, the inequality becomes $-n + n \lg n \le cn$, which simplifies to $n(\lg n - 1) \le cn$. Even with an exponentially small fraction of inputs, the $n \lg n$ growth on the left will always eventually overpower the linear $cn$ on the right, proving that no comparison sort can be linear for any meaningful fraction of inputs.
We can also apply this decision tree logic to restricted variants of the sorting problem. Suppose you are given a sequence of $n$ elements divided into $n/k$ separate subsequences, each containing exactly $k$ elements. If you are given the guarantee that all elements in a specific subsequence are strictly smaller than the elements in the following subsequence, you only need to sort the $k$ elements within each individual block. The total number of valid permutations for this specific setup is the product of the permutations of each independent block, which is $(k!)^{n/k}$. To find the lower bound on the number of comparisons needed for this variant, we again calculate the minimum height of the decision tree: $h \ge \lg((k!)^{n/k})$. Using standard logarithm exponent rules, we can pull the exponent down to get $h \ge \frac{n}{k} \lg(k!)$. Because we previously proved that $\lg(k!) = \Omega(k \lg k)$, we can substitute this into our formula to get $\frac{n}{k} \cdot \Omega(k \lg k)$. The $k$ variables cancel out, leaving us with a firm lower bound of $\Omega(n \lg k)$ comparisons required to solve this partially sorted variant.
To understand how algorithms like Counting Sort, Radix Sort (later), or Bucket Sort (also later) completely bypass the $\Omega(n \lg n)$ limit, we have to look at the fundamental rule they break: they do not rely on comparing elements against each other.
The decision tree model—which mathematically proves the $\Omega(n \lg n)$ bound—strictly assumes that the only way an algorithm can gain information about the input sequence is through binary questions (e.g., "Is $A[i] \le A[j]$?"). Every comparison forces a split down the left or right branch of the tree.
Non-comparison sorts cheat this system by using the actual numerical values of the elements directly as structural data, typically as array indices. For example, in Counting Sort, if the algorithm encounters the number 5, it doesn't ask "Is 5 greater than the previous number?" Instead, it simply increments a counter at index 5 of an auxiliary frequency array. By doing this, it completely circumvents the binary decision tree.
Because these algorithms map values to memory locations rather than comparing them, their time complexity isn't restricted by the height of a decision tree. Instead, their running time is usually $O(n + k)$, where $n$ is the number of elements and $k$ is the range of possible values. When $k$ is relatively small (e.g., sorting integers up to $O(n)$), these sorts achieve true linear $O(n)$ time.
And with that, let us dive into Counting Sort!
Unlike the comparison-based algorithms we have analyzed previously, Counting Sort operates under a specific assumption: each of the $n$ input elements is an integer within a known, finite range from $0$ to $k$, where $k$ is some integer. Because it relies on this assumption, Counting Sort does not need to compare elements against one another. Instead, it uses the actual numerical values of the inputs as indices to directly calculate their final positions. When the maximum value $k$ is roughly proportional to the number of elements—that is, $k = O(n)$—this algorithm achieves a remarkably efficient linear running time of $\Theta(n)$.
The core mechanism of Counting Sort involves determining, for every input element $x$, the exact number of elements that are strictly less than $x$. By knowing how many items belong before it, the algorithm can place $x$ directly into its correct slot in the sorted output array. For instance, if exactly 17 elements are smaller than $x$, we know with absolute certainty that $x$ belongs in the 18th position. However, we must slightly adjust this logic to account for duplicate values. If multiple elements share the same value, we cannot simply place them all into the exact same index.
To implement this, we assume our input is an array $A[1 \dots n]$, meaning $A.\text{length} = n$. We require two additional structures: an output array $B[1 \dots n]$ to hold the final sorted sequence, and an auxiliary storage array $C[0 \dots k]$ to keep track of our counts.
COUNTING-SORT(A, B, k)
1: let C[0..k] be a new array
2: for i = 0 to k
3: C[i] = 0
4: for j = 1 to A.length
5: C[A[j]] = C[A[j]] + 1
6: // C[i] now contains the number of elements equal to i.
7: for i = 1 to k
8: C[i] = C[i] + C[i - 1]
9: // C[i] now contains the number of elements less than or equal to i.
10: for j = A.length downto 1
11: B[C[A[j]]] = A[j]
12: C[A[j]] = C[A[j]] - 1
To truly understand how this code functions, let's manually trace the operation of $\text{COUNTING-SORT}$ on a specific input array: $A = \langle 6, 0, 2, 0, 1, 3, 4, 6, 1, 3, 2 \rangle$. Here, our maximum value is $k = 6$, and we have $n = 11$ elements.
First, the loop in lines 2–3 initializes our working array $C$ to all zeros. Next, the loop in lines 4–5 iterates through the input array $A$, tallying the frequencies of each number. After this step, $C$ looks like this: $C = \langle 2, 2, 2, 2, 1, 0, 2 \rangle$. (For example, $C[0] = 2$ because the number $0$ appears twice in $A$).
Lines 7–8 then transform $C$ into a prefix sum array. We add the value of the previous index to the current index, creating a running total. After this loop, $C = \langle 2, 4, 6, 8, 9, 9, 11 \rangle$. This is the crucial step: $C[3]$ is now $8$, which tells us there are exactly 8 elements less than or equal to the number 3.
Finally, the loop in lines 10–12 populates the output array $B$. We iterate backwards through $A$ (from index 11 down to 1). The last element is $A[11] = 2$. We look at $C[2]$, which is $6$. This tells us to place the $2$ at index $6$ in the output array $B$. We then decrement $C[2]$ to $5$. We do this so that the next time we encounter a $2$, it will be placed at index $5$, directly to the left of the first one, preventing overwrites.
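The procedure traced above translates directly to Python. The version below is 0-indexed, so the placement loop decrements the counter before writing rather than after; otherwise it mirrors the pseudocode line for line.

```python
def counting_sort(A, k):
    """Counting sort of a list A of integers in the range 0..k."""
    n = len(A)
    B = [0] * n
    C = [0] * (k + 1)
    for x in A:                  # lines 4-5: tally frequencies
        C[x] += 1
    for i in range(1, k + 1):    # lines 7-8: prefix sums
        C[i] += C[i - 1]
    for x in reversed(A):        # lines 10-12: place, scanning backwards
        C[x] -= 1                # decrement first (0-indexed output)
        B[C[x]] = x
    return B

A = [6, 0, 2, 0, 1, 3, 4, 6, 1, 3, 2]
print(counting_sort(A, 6))  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 6, 6]
```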
Let us break down the total running time of $\text{COUNTING-SORT}$. The initialization loop (lines 2–3) takes $\Theta(k)$ time. The frequency counting loop (lines 4–5) takes $\Theta(n)$ time. The prefix sum loop (lines 7–8) takes $\Theta(k)$ time. Finally, the placement loop (lines 10–12) iterates $n$ times, taking $\Theta(n)$ time. Adding these together yields an overall time complexity of $\Theta(k + n)$. As established, when the maximum value $k$ is $O(n)$, the $k$ term is absorbed, and the sort operates in strictly $\Theta(n)$ time.
By completely abandoning the comparison model—using element values strictly as memory addresses instead—Counting Sort beautifully circumvents the $\Omega(n \lg n)$ lower bound that constrains algorithms like Merge Sort and Heapsort.
The logic used to construct the prefix sum array $C$ is actually incredibly versatile. Suppose we are tasked with preprocessing $n$ integers (ranging from $0$ to $k$) so that we can answer arbitrary queries about how many integers fall within a specific range $[a \dots b]$ in pure $O(1)$ time. We can achieve this by simply running lines 1–9 of the Counting Sort algorithm, taking $\Theta(n + k)$ preprocessing time to build the $C$ array. Because $C[x]$ stores the total number of elements less than or equal to $x$, we can find the number of elements in the range $[a \dots b]$ by calculating $C[b] - C[a - 1]$. (We must handle the edge case where $a = 0$ by simply returning $C[b]$). The subtraction takes constant $O(1)$ time, demonstrating the raw utility of the prefix sum technique.
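A minimal sketch of this preprocess-then-query scheme, reusing the trace array from above (the function names are my own):

```python
def preprocess(A, k):
    """Lines 1-9 of COUNTING-SORT: build C with C[i] = #{x in A : x <= i}.
    Takes Theta(n + k) time."""
    C = [0] * (k + 1)
    for x in A:
        C[x] += 1
    for i in range(1, k + 1):
        C[i] += C[i - 1]
    return C

def range_count(C, a, b):
    """Number of preprocessed integers in [a, b], answered in O(1) time."""
    return C[b] - (C[a - 1] if a > 0 else 0)   # edge case: a = 0

A = [6, 0, 2, 0, 1, 3, 4, 6, 1, 3, 2]
C = preprocess(A, 6)
print(range_count(C, 1, 3))  # 6: the elements 1, 1, 2, 2, 3, 3
```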
One of the most important characteristics of Counting Sort is that it is stable. A sorting algorithm is stable if elements with identical values appear in the output array in the exact same relative order as they did in the input array. If we have two instances of the number $3$, the one that appeared first in $A$ will definitively appear first in $B$.
Proof of Stability. We can prove $\text{COUNTING-SORT}$ is stable by analyzing the final loop (lines 10–12). Because the loop iterates backwards (downto 1), the last occurrence of a duplicate value in array $A$ is processed first. It is placed at the index dictated by $C[A[j]]$. Crucially, the code then decrements $C[A[j]]$. This guarantees that the next time we process the same value (which, because we are moving backwards, will be an occurrence that appeared earlier in the original input sequence), it will be placed at the decremented index—meaning it is placed to the left of the previously placed duplicate. Thus, earlier elements in $A$ map to earlier positions in $B$, preserving relative order. $\quad \blacksquare$
If we were to rewrite line 10 to iterate forwards—for j = 1 to A.length—the algorithm would still successfully sort the array. It would place the elements in their correct clusters and successfully decrement the counters. However, the modified algorithm would lose its stability. By processing the first occurrence of a duplicate first and decrementing the counter, that first occurrence would end up in the rightmost slot of its cluster in the final array $B$. The relative order of duplicates would be completely reversed.
While stability might seem like a minor detail when sorting plain integers, it is absolutely essential when those integers are keys attached to larger packets of "satellite data" (like database records). More importantly, as we will explore next, Counting Sort is frequently utilized as a dedicated subroutine within Radix Sort. For Radix Sort to function correctly and sort multi-digit numbers digit-by-digit, its underlying sorting subroutine must absolutely be stable.
Radix Sort has a fascinating history, originating from the mechanical card-sorting machines found in early computer eras. These machines processed physical punch cards, each featuring 80 columns where holes could be punched in one of 12 distinct vertical places (with 10 places dedicated to the decimal digits 0 through 9). A mechanical sorter could be programmed to examine a single column at a time across an entire deck, distributing the cards physically into one of 12 corresponding bins based on where the hole was punched. The human operator would then gather the cards bin by bin, stacking them so that cards from the 0 bin sat perfectly on top of those from the 1 bin, and so forth. Because the machine could only process a single column (or digit) per pass, sorting an entire deck based on a multi-digit number required a structured, multi-pass algorithm.
Intuitively, human beings naturally sort numbers by looking at their most significant digit first, the largest place value. If you were handed a deck of cards, you might sort them by the hundreds digit, creating 10 separate piles. Then, you would recursively sort each of those individual piles by the tens digit, and finally the ones digit. While mathematically valid, the logistical overhead of this approach is a nightmare for an iterative machine or a human operator: every time you sort a pile by the next digit, you split it into 10 smaller sub-piles, so the number of piles in play multiplies at every level of the recursion.
To understand exactly why, consider the worst-case scenario where we sort $d$-digit decimal numbers containing every possible digit combination. During the first pass, we distribute the deck into 10 piles. To proceed recursively, the operator must set aside 9 of those piles and sort just the first one. Sorting that single pile creates 10 new sub-piles. Again, 9 are set aside, and 1 is sorted. This continues down to the $d$-th digit. At its peak, the operator must keep track of 9 deferred piles at every level from 1 to $d-1$, plus the 10 active piles currently being sorted at the bottom level. This means the operator is managing $9(d - 1) + 10 = 9d + 1$ distinct piles simultaneously! Furthermore, the total number of sorting passes (where a pass distributes a pile into 10 bins) expands exponentially at each level. The total passes required is the sum of a geometric series: $1 + 10 + 10^2 + \dots + 10^{d-1}$, which evaluates exactly to $\frac{10^d - 1}{9}$ total passes. Radix sort brilliantly avoids this chaotic fragmentation and exponential pile management by working entirely in reverse.
Radix Sort solves the multi-pass problem counterintuitively: it sorts the dataset based on the least significant digit first. The algorithm distributes the items into bins based on the lowest-order digit (the ones place), gathers them all back up into a single cohesive deck, and then repeats the process on the next digit up (the tens place). The deck is combined in strict order after every single pass. By the time the algorithm completes its pass on the highest-order digit, the entire array is perfectly sorted.
RADIX-SORT(A, d)
1: for i = 1 to d
2: use a stable sort to sort array A on digit i
To truly visualize how this works, let's trace the algorithm on an array of 3-letter English words, treating each letter as a "digit" (where the 3rd letter is the least significant).
Initial Pass 1 Pass 2 Pass 3
Values (Sort by 3rd) (Sort by 2nd) (Sort by 1st)
------- ------------- ------------- -------------
COW SE[A] T[A]B [B]AR
DOG TE[A] B[A]R [B]IG
SEA MO[B] E[A]R [B]OX
RUG TA[B] T[A]R [C]OW
ROW DO[G] S[E]A [D]IG
MOB RU[G] T[E]A [D]OG
BOX DI[G] D[I]G [E]AR
TAB BI[G] B[I]G [F]OX
BAR BA[R] M[O]B [M]OB
EAR EA[R] D[O]G [N]OW
TAR TA[R] C[O]W [R]OW
DIG CO[W] R[O]W [R]UG
BIG RO[W] N[O]W [S]EA
TEA NO[W] B[O]X [T]AB
NOW BO[X] F[O]X [T]AR
FOX FO[X] R[U]G [T]EA
Look closely at what happens between Pass 2 and Pass 3 in the diagram above. Notice the words BIG and BOX. During Pass 2, they were sorted into their relative positions (BIG came before BOX because 'I' comes before 'O'). When Pass 3 sorts by the first letter, both words start with 'B'. If the sorting algorithm used for Pass 3 simply threw them into the 'B' bin randomly, the work done in Pass 2 would be destroyed, and BOX might end up before BIG.
This is exactly why the subroutine algorithm used inside Radix Sort must be stable. Stability ensures that when two elements have a "tie" on the current digit being evaluated, they strictly retain their relative order from the previous passes. Among our common comparison sorts, Insertion Sort and Merge Sort are natively stable. Heapsort and Quicksort, however, are fundamentally unstable due to the long-distance swapping mechanics they employ. Fortunately, it is possible to make any inherently unstable sorting algorithm stable. You can achieve this by transforming the input array elements into paired tuples, appending the element's original starting index to it (e.g., modifying the value 42 at index 5 to become the tuple (42, 5)). If the unstable algorithm encounters a tie between two elements' primary values, you simply instruct it to break the tie by comparing their original indices. While this cleverly forces stability, it requires $\Theta(n)$ additional auxiliary space to store those indices, along with a slight constant-time overhead for the secondary comparisons.
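The tuple trick for forcing stability can be sketched as follows, using a heap-based sort (unstable, per the discussion above) as the subroutine:

```python
import heapq

def heapsort(items):
    """An unstable comparison sort built on a binary heap."""
    h = list(items)
    heapq.heapify(h)
    return [heapq.heappop(h) for _ in range(len(h))]

# Records are (key, satellite-data) pairs; two records tie on key 1
# and two tie on key 3.
records = [(3, 'a'), (1, 'b'), (3, 'c'), (1, 'd')]

# Decorate each record with its original index. Ties on the key are then
# broken by position, which forces any comparison sort to behave stably.
decorated = [(key, i, sat) for i, (key, sat) in enumerate(records)]
stable = [(key, sat) for key, i, sat in heapsort(decorated)]
print(stable)  # [(1, 'b'), (1, 'd'), (3, 'a'), (3, 'c')]
```

The decoration costs $\Theta(n)$ extra space for the indices, exactly as described above.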
We can formally prove that Radix Sort works correctly using mathematical induction on the column $i$ being sorted.
Proof. For the base case ($i = 1$), the algorithm uses a stable sort to order the elements strictly by their lowest-order digit, which trivially leaves the array sorted with respect to the first digit. For the inductive step, assume that after $i - 1$ passes, the array is perfectly sorted with respect to the lowest $i - 1$ digits. Now, the algorithm performs its $i$-th pass, sorting the elements by the $i$-th digit. If two elements have different values at digit $i$, the stable sort correctly places the smaller digit first, establishing the correct global order regardless of the lower digits. If two elements have the same value at digit $i$, the sort sees them as a tie. Because the intermediate sort is stable, it will leave these two tied elements in the exact same relative order they were in before this pass. By our inductive hypothesis, their relative order was already perfectly correct based on the lower $i - 1$ digits. Therefore, after the $i$-th pass, the array is fully and correctly sorted up to the $i$-th digit. $\blacksquare$
Lemma. (Radix Sort Time Complexity). Given $n$ numbers, each having $d$ digits, where each digit can take on up to $k$ possible values, $\text{RADIX-SORT}$ correctly sorts the numbers in $\Theta(d(n + k))$ time, provided the stable sorting subroutine takes $\Theta(n + k)$ time.
Proof. The overall correctness of the algorithm is established by the induction proof above. To analyze the running time, we depend entirely on the stable sorting algorithm utilized as the intermediate subroutine. Since each digit is an integer in the range from $0$ to $k - 1$ (meaning it can take on $k$ possible values), Counting Sort is the optimal choice when $k$ is not excessively large. Executing a single pass of Counting Sort over $n$ elements, where each element can take on $k$ values, requires exactly $\Theta(n + k)$ time. Because the $\text{RADIX-SORT}$ algorithm simply loops and executes this stable sort exactly $d$ times (once for each digit), the total running time is $\Theta(d(n + k))$. $\blacksquare$
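The lemma can be made concrete with a short Python sketch (the function names are mine). Each pass is a stable counting sort on one digit, and the outer loop runs once per digit, matching the $\Theta(d(n + k))$ bound; it assumes nonnegative integer keys:

```python
def counting_sort_by_digit(a, exp, base):
    # One stable counting-sort pass on digit (x // exp) % base: Theta(n + k).
    count = [0] * base
    for x in a:
        count[(x // exp) % base] += 1
    for d in range(1, base):          # prefix sums give final positions
        count[d] += count[d - 1]
    out = [0] * len(a)
    for x in reversed(a):             # reversed scan preserves stability
        d = (x // exp) % base
        count[d] -= 1
        out[count[d]] = x
    return out

def radix_sort(a, base=10):
    # d stable Theta(n + k) passes, least significant digit first.
    if not a:
        return a
    exp = 1
    while max(a) // exp > 0:
        a = counting_sort_by_digit(a, exp, base)
        exp *= base
    return a
```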
This reveals an incredible property: Radix sort can sort massively expanded ranges in linear time. For example, suppose you need to sort $n$ integers that range from $0$ all the way up to $n^3 - 1$. A comparison sort would take $\Theta(n \lg n)$ time. However, if we simply treat these integers as numbers written in "base $n$", any number up to $n^3 - 1$ can be perfectly represented using exactly 3 digits. Here, $d = 3$ and our range $k = n$. Applying Radix Sort with Counting Sort yields a time complexity of $\Theta(3(n + n))$, which simplifies down to an astoundingly fast $\Theta(n)$ linear time.
Lemma. (Optimal Bitwise Chunking for Radix Sort). Given $n$ machine words (numbers) each composed of $b$ bits, and any positive integer $r \le b$, $\text{RADIX-SORT}$ correctly sorts these numbers in $\Theta((b/r)(n + 2^r))$ time.
Proof. For any chosen integer $r \le b$, we can conceptually divide each $b$-bit key into smaller chunks of $r$ bits. This effectively treats the key as a number containing $d = \lceil b/r \rceil$ separate "digits". Because each digit is comprised of $r$ bits, the decimal value of each digit must be an integer in the range from $0$ to $2^r - 1$. Therefore, we can utilize Counting Sort for the intermediate sorting passes, with our maximum value parameter set to $k = 2^r - 1$.
Inside modern computers, we aren't limited to decimal digits. We can treat a 32-bit integer as a sequence of smaller binary chunks. If we divide the integer into chunks of $r$ bits, we effectively create a number with $d = \lceil b/r \rceil$ "digits".
Example: chunking a 32-bit integer ($b = 32$) into 8-bit digits ($r = 8$):

| | Pass 4 | Pass 3 | Pass 2 | Pass 1 |
|---|---|---|---|---|
| Binary chunk | 10110101 | 11000011 | 00001111 | 10101010 |
| Decimal value of chunk | 181 | 195 | 15 | 170 |

According to our previous lemma, each individual pass of Counting Sort takes $\Theta(n + k)$ time. Substituting $k = 2^r - 1$, this becomes $\Theta(n + 2^r)$. Because we have separated the number into $d = \lceil b/r \rceil$ digits, the algorithm must perform $\lceil b/r \rceil$ passes. Multiplying the number of passes by the cost per pass yields a total running time of $\Theta(d(n + 2^r)) = \Theta((b/r)(n + 2^r))$.
To achieve the absolute fastest sorting speed for given values of $n$ and $b$, we must strategically pick the chunk size $r$ (where $r \le b$) that minimizes the expression $(b/r)(n + 2^r)$.
If our word size $b$ is relatively small—specifically, if $b < \lfloor \lg n \rfloor$—then for any value of $r \le b$, the term $2^r$ will be strictly less than $n$. Consequently, the term $(n + 2^r)$ is simply dominated by $n$, becoming $\Theta(n)$. In this case, choosing $r = b$ perfectly minimizes the $(b/r)$ coefficient to $1$, yielding a running time of $(b/b)(n + 2^b) = \Theta(n)$, which is asymptotically optimal.
However, if our word size $b$ is larger ($b \ge \lfloor \lg n \rfloor$), we choose a chunk size of $r = \lfloor \lg n \rfloor$. This specific choice beautifully balances the formula: the $(n + 2^r)$ term becomes $(n + 2^{\lfloor \lg n \rfloor})$, which is exactly $2n$, or $\Theta(n)$. Plugging this back into our total time equation gives $\Theta(b/(\lg n) \cdot n) = \Theta(bn / \lg n)$. If we were to increase $r$ above $\lfloor \lg n \rfloor$, the $2^r$ term in the numerator would begin growing exponentially faster than the $r$ term in the denominator, devastating our time complexity. Conversely, if we decreased $r$ below $\lfloor \lg n \rfloor$, the $b/r$ multiplier would grow unnecessarily large while the $(n + 2^r)$ term remained stalled at $\Theta(n)$.
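The bitwise chunking above can be sketched in Python. This illustrative `radix_sort_bits` (my own naming) sorts nonnegative $b$-bit integers using $r$-bit digits, performing $\lceil b/r \rceil$ stable counting-sort passes over $k = 2^r$ buckets:

```python
def radix_sort_bits(a, b=32, r=8):
    # Treat each b-bit key as ceil(b/r) digits of r bits each. Each pass is
    # a stable counting sort over 2**r buckets: Theta((b/r)(n + 2**r)) total.
    mask = (1 << r) - 1
    for shift in range(0, b, r):
        count = [0] * (1 << r)
        for x in a:
            count[(x >> shift) & mask] += 1
        for d in range(1, 1 << r):          # prefix sums
            count[d] += count[d - 1]
        out = [0] * len(a)
        for x in reversed(a):               # reversed scan keeps stability
            d = (x >> shift) & mask
            count[d] -= 1
            out[count[d]] = x
        a = out
    return a
```

Choosing `r = 8` for 32-bit keys means four passes with 256 buckets each, one concrete point on the trade-off curve $(b/r)(n + 2^r)$ discussed above.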
If Radix sort can achieve strictly linear $\Theta(n)$ running times under the right conditions, why isn't it the default sorting algorithm used everywhere instead of Quicksort's $\Theta(n \lg n)$? The answer lies in the hardware realities hidden within asymptotic notation. The constant factors tucked away inside the $\Theta$-notation for Radix Sort are significantly larger. While Radix sort might execute fewer passes over the array than Quicksort makes recursive calls, each Radix pass involves heavy array allocations, prefix sum calculations, and memory hopping due to the Counting Sort scatter phases. Quicksort, conversely, operates strictly in-place. It runs incredibly tight, cache-friendly loops that modern CPU architectures execute blazingly fast. Therefore, when primary memory is tight or hardware cache utilization is prioritized, an in-place comparison sort like Quicksort often dramatically outperforms Radix sort in real-world wall-clock time.
Bucket Sort is a remarkably fast algorithm that operates under a specific assumption about its input data: it assumes that the elements are drawn from a uniform distribution over the half-open interval $[0, 1)$. While Counting Sort achieves its speed by assuming elements are integers confined to a narrow range, Bucket Sort achieves an average-case running time of $O(n)$ by assuming the input is generated by a random process that spreads the elements evenly and independently across the interval.
The underlying mechanism is intuitive. Bucket Sort divides the interval $[0, 1)$ into $n$ equally sized subintervals, which we call buckets. It then iterates through the $n$ input numbers, distributing each one into its corresponding bucket. Because the input numbers are uniformly distributed, we statistically expect that the numbers will spread out evenly, with no single bucket accumulating too many elements. To produce the final sorted output, the algorithm simply sorts the contents of each individual bucket and then concatenates the buckets together in sequential order.
To implement this, we assume our input is an $n$-element array $A$, where every element satisfies $0 \le A[i] < 1$. We use an auxiliary array $B[0 \dots n-1]$ composed of linked lists to serve as our buckets.
BUCKET-SORT(A)
1: n = A.length
2: let B[0 .. n - 1] be a new array
3: for i = 0 to n - 1
4: make B[i] an empty list
5: for i = 1 to n
6: insert A[i] into list B[⌊n · A[i]⌋]
7: for i = 0 to n - 1
8: sort list B[i] with INSERTION-SORT
9: concatenate the lists B[0], B[1], ..., B[n - 1] together in order
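The pseudocode translates almost line for line into Python. This sketch assumes inputs in $[0, 1)$ and substitutes Python's built-in list sort for the per-bucket $\text{INSERTION-SORT}$:

```python
import math

def bucket_sort(a):
    # Assumes the n inputs lie in [0, 1); average case Theta(n) under the
    # uniform-distribution assumption discussed above.
    n = len(a)
    buckets = [[] for _ in range(n)]
    for x in a:
        buckets[math.floor(n * x)].append(x)   # element x goes to bucket floor(n*x)
    out = []
    for b in buckets:
        b.sort()                               # stands in for INSERTION-SORT
        out.extend(b)                          # concatenate in bucket order
    return out
```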
To observe how this algorithm operates in practice, let's trace it on a specific 10-element array: $A = \langle .79, .13, .16, .64, .39, .20, .89, .53, .71, .42 \rangle$. Since $n = 10$, we multiply each element by $10$ and take the floor to find its bucket index.
Array B (Buckets after step 6, before sorting):
B[0]: empty
B[1]: .13 → .16
B[2]: .20
B[3]: .39
B[4]: .42
B[5]: .53
B[6]: .64
B[7]: .79 → .71
B[8]: .89
B[9]: empty
During step 8, $\text{INSERTION-SORT}$ quickly reorders the slightly jumbled buckets (like $B[7]$, swapping $.79$ and $.71$). Step 9 then concatenates them from $B[0]$ to $B[9]$, yielding the perfectly sorted array.
To prove that Bucket Sort is correct, consider any two elements $A[i]$ and $A[j]$ where $A[i] \le A[j]$. Because the floor function is monotonically increasing, it is guaranteed that $\lfloor n \cdot A[i] \rfloor \le \lfloor n \cdot A[j] \rfloor$. This means the smaller element $A[i]$ will either be placed in the exact same bucket as $A[j]$, or it will be placed in a bucket with a lower index. If they land in the same bucket, the inner Insertion Sort puts them in the correct order. If they land in different buckets, the final concatenation step natively places the contents of the lower-indexed bucket before the higher-indexed one. Thus, the relative order is preserved and correct.
While the average case is exceptional, we must address the algorithm's worst-case scenario. If the input data maliciously violates the uniform distribution assumption—for example, if all $n$ elements are clumped extremely close together (e.g., all starting with $.11...$)—they will all be mapped into the exact same bucket. Because Bucket Sort relies on Insertion Sort (which has a quadratic $\Theta(n^2)$ worst-case running time) to sort the internal buckets, the overall time complexity degenerates to $\Theta(n^2)$.
Fortunately, there is a simple modification to fix this vulnerability. If we simply replace $\text{INSERTION-SORT}$ in line 8 with an algorithm that guarantees a worst-case $O(n \lg n)$ time bound (such as Merge Sort or Heapsort), the overall worst-case running time of Bucket Sort improves to $O(n \lg n)$. Because a well-distributed average case would still leave the buckets mostly empty or containing very few elements, sorting them would still take effectively linear time, preserving the $\Theta(n)$ average-case bound while protecting against quadratic failure.
To mathematically prove the $O(n)$ average-case time, we must analyze the cost of the calls to $\text{INSERTION-SORT}$. Let $n_i$ be the random variable denoting the exact number of elements placed into bucket $B[i]$. Since Insertion Sort's time is quadratic relative to the number of elements it processes, the total running time of Bucket Sort is bounded by:
$$T(n) = \Theta(n) + \sum_{i=0}^{n-1} O(n_i^2)$$We want to find the expected value of this running time, $E[T(n)]$. Using the linearity of expectation, we can pass the expectation operator straight through the summation:
$$E[T(n)] = \Theta(n) + \sum_{i=0}^{n-1} O(E[n_i^2])$$To solve this, we must calculate $E[n_i^2]$, which is the expected value of the square of the bucket size.
It is a common mathematical pitfall to confuse the square of an expectation, $E^2[X]$, with the expectation of a square, $E[X^2]$. They are rarely equal. For example, let $X$ be the random variable representing the number of heads obtained in two flips of a fair coin. The possibilities are 0, 1, or 2 heads, with probabilities $1/4$, $1/2$, and $1/4$ respectively. The expected value is $E[X] = 0(1/4) + 1(1/2) + 2(1/4) = 1$. Therefore, the square of the expectation is $E^2[X] = 1^2 = 1$. However, the expectation of the square is $E[X^2] = 0^2(1/4) + 1^2(1/2) + 2^2(1/4) = 0 + 0.5 + 1 = 1.5$. In fact, $E[X^2] - E^2[X] = \text{Var}[X] \ge 0$, so $E[X^2]$ is always at least as large, and strictly larger whenever $X$ has nonzero variance.
Returning to Bucket Sort, to rigorously find $E[n_i^2]$, we define an indicator random variable $X_{ij}$ which equals $1$ if element $A[j]$ falls into bucket $i$, and $0$ otherwise. The total elements in bucket $i$ is $n_i = \sum_{j=1}^{n} X_{ij}$. We can now expand $E[n_i^2]$:
$$E[n_i^2] = E\left[ \left( \sum_{j=1}^{n} X_{ij} \right)^2 \right] = E\left[ \sum_{j=1}^{n} X_{ij}^2 + \sum_{1 \le j \le n} \sum_{1 \le k \le n, k \neq j} X_{ij} X_{ik} \right]$$Using linearity of expectation again, we split this into two parts. First, the expected value of an indicator variable squared is simply its probability. Since the input is uniform over $n$ buckets, the probability of falling into bucket $i$ is $1/n$. Thus, $E[X_{ij}^2] = 1/n$. Second, when $k \neq j$, the variables $X_{ij}$ and $X_{ik}$ are statistically independent. The expectation of their product is the product of their expectations: $E[X_{ij}X_{ik}] = E[X_{ij}]E[X_{ik}] = (1/n) \cdot (1/n) = 1/n^2$.
Substituting these values back in:
$$E[n_i^2] = \sum_{j=1}^{n} \frac{1}{n} + \sum_{1 \le j \le n} \sum_{1 \le k \le n, k \neq j} \frac{1}{n^2}$$There are exactly $n$ terms in the first sum, yielding $n \cdot (1/n) = 1$. There are $n(n-1)$ terms in the double summation, yielding $n(n-1) \cdot (1/n^2) = (n-1)/n$. Adding them together:
$$E[n_i^2] = 1 + \frac{n-1}{n} = 2 - \frac{1}{n}$$By proving that the expected square of any bucket size is a small constant (specifically less than 2), we can finalize our time complexity. Plugging this back into our primary equation yields $T(n) = \Theta(n) + n \cdot O(2 - 1/n) = \Theta(n)$. The average-case running time is rigorously linear.
Bucket Sort's logic relies entirely on finding a way to distribute elements evenly. But what if the data isn't uniformly distributed over $[0, 1)$?
Suppose we are given $n$ random variables drawn from any known continuous probability distribution function, defined as $P(x) = \text{Pr}\{X \le x\}$. As long as this cumulative distribution function (CDF) can be computed in $O(1)$ time, we can sort these numbers in linear average-case time. In statistics, the Probability Integral Transform states that if a random variable $X$ has a continuous CDF $P$, then the random variable $Y = P(X)$ is uniformly distributed exactly over $[0, 1]$. Therefore, to sort our inputs, we simply map each element $x_i$ to $p_i = P(x_i)$. We use these uniform $p_i$ values to determine the bucket indices ($\lfloor n \cdot p_i \rfloor$), place the original $x_i$ values into those buckets, and proceed with Bucket Sort exactly as normal.
We can apply this exact probabilistic mapping to real-world geometric problems. Imagine we are given $n$ points $(x_i, y_i)$ uniformly distributed within a 2D unit circle (where $0 < x_i^2 + y_i^2 \le 1$). We want to sort these points based on their distance from the origin, $d_i = \sqrt{x_i^2 + y_i^2}$, in $\Theta(n)$ time.
Because the points are distributed uniformly by area, the probability of a point falling within a certain distance $d$ from the origin is strictly proportional to the area of a circle with radius $d$. The area of the full unit circle is $\pi(1)^2 = \pi$. The area of the sub-circle up to distance $d$ is $\pi d^2$. Therefore, the CDF for the distance is $P(D \le d) = (\pi d^2) / \pi = d^2$.
By squaring the distance, we map the non-uniform distance metric into a perfectly uniform distribution over $[0, 1)$. To sort the points in expected linear time, we simply insert each point $p_i$ into the bucket at index $\lfloor n \cdot (x_i^2 + y_i^2) \rfloor$. The points will distribute evenly into the $n$ buckets, allowing Bucket Sort to operate efficiently and complete the task in average-case $\Theta(n)$ time.
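Here is a Python sketch of this geometric application (function name mine). The CDF value $x^2 + y^2$ picks the bucket; a `min` guard handles the boundary case $x^2 + y^2 = 1$, which the half-open interval convention would otherwise miss:

```python
import math

def sort_points_by_distance(points):
    # Points uniform in the unit disk: u = x^2 + y^2 = P(D <= d) is uniform
    # on (0, 1], so bucketing on u spreads points evenly (average Theta(n)).
    n = len(points)
    buckets = [[] for _ in range(n)]
    for (x, y) in points:
        u = x * x + y * y
        buckets[min(n - 1, math.floor(n * u))].append((x, y))
    out = []
    for b in buckets:
        b.sort(key=lambda p: p[0] ** 2 + p[1] ** 2)  # tiny per-bucket sort
        out.extend(b)
    return out
```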
The problem of sorting is fundamentally linked to the problem of selection. While sorting organizes an entire dataset, the Selection Problem asks us to find a single, specific element based on its relative rank.
Formally, the selection problem is defined by the following inputs and outputs:
- Input: A set $A$ of $n$ distinct numbers and an integer $i$ with $1 \le i \le n$.
- Output: The element $x \in A$ that is larger than exactly $i - 1$ other elements of $A$; that is, the $i$-th smallest element of $A$.
The most intuitive, naive approach to solve this is to simply apply a comparison sort (like Merge Sort or Heapsort) to the entire array to completely order the numbers, which takes $O(n \lg n)$ time, and then directly index the $i$-th element, which takes $O(1)$ time. However, we can perform selection much faster than sorting.
If we restrict ourselves strictly to a comparison-based model to find the element, we can evaluate a lower bound. Because there are $n$ possibilities for which element could be the $i$-th smallest, a decision tree modeling this process must have at least $n$ leaves. Because the height $h$ of a binary tree with $n$ leaves satisfies $2^h \ge n$, the height must be at least $\lg n$. Thus, $\Omega(\lg n)$ comparisons are required in the worst case. (This decision-tree bound is not tight: any correct selection algorithm must examine every element at least once, so $\Omega(n)$ is also a lower bound.)
We can achieve an expected running time of $O(n)$ by utilizing the partitioning logic from Quicksort. The Randomized-Select algorithm divides the array around a randomly chosen pivot, but unlike Quicksort, it only recurses into the single partition that is guaranteed to contain the $i$-th smallest element.
RANDOMIZED-SELECT(A, p, r, i)
1: if p == r
2: return A[p]
3: q = RANDOMIZED-PARTITION(A, p, r)
4: k = q - p + 1
5: if i ≤ k
6: return RANDOMIZED-SELECT(A, p, q, i)
7: else if i - 1 == k
8: return A[q + 1]
9: else
10: return RANDOMIZED-SELECT(A, q + 2, r, i - k - 1)
Nota Bene. The specific boundary logic in lines 5-10 adapts to the partition style utilized. The structure shown recursively searches the left partition if the target $i$ falls within the first $k$ elements, checks if the pivot itself is the target, and otherwise searches the right partition, correctly shifting the target rank to $i - k - 1$.
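As a concrete illustration, here is a Python sketch using the more common variant in which the pivot lands in its final position at index $q$, so the boundary checks differ slightly from the pseudocode above; the function name is mine:

```python
import random

def randomized_select(a, p, r, i):
    # Returns the i-th smallest (1-indexed) element of a[p..r].
    if p == r:
        return a[p]
    # Randomized Lomuto-style partition: pivot ends at its final index q.
    s = random.randint(p, r)
    a[s], a[r] = a[r], a[s]
    pivot, q = a[r], p
    for j in range(p, r):
        if a[j] <= pivot:
            a[q], a[j] = a[j], a[q]
            q += 1
    a[q], a[r] = a[r], a[q]
    k = q - p + 1                 # rank of the pivot within a[p..r]
    if i == k:
        return a[q]               # pivot is the answer
    elif i < k:
        return randomized_select(a, p, q - 1, i)
    else:
        # Only recurse into the side guaranteed to hold rank i.
        return randomized_select(a, q + 1, r, i - k)
```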
The expected running time $E[T(n)]$ is determined by the size of the partition we recurse into. Assuming the pivot is chosen uniformly at random, each partition split is equally likely. Therefore, we sum the expected times of the maximum possible subproblem for each split, weighted by the probability of that split ($1/n$), plus a linear cost $c_1 n$ for the partitioning overhead.
$$E[T(n)] \le \sum_{k=0}^{n-1} \frac{1}{n} \Big( E[T(\max(k, n-k-1))] + c_1 n \Big)$$Proof. We prove $E[T(n)] \le cn$ via substitution (induction).
Base Case: Choose a constant $c$ large enough such that $E[T(1)] \le c$.
Inductive Step: Assume $E[T(k)] \le ck$ holds for all $k < n$. As $k$ ranges over $0, \dots, n-1$, each value of $\max(k, n - k - 1)$ lies in the upper half $\{\lfloor n/2 \rfloor, \dots, n - 1\}$ and occurs at most twice, so the summation of maximums essentially counts the upper half of the partition sizes twice:
$$E[T(n)] \le \frac{2}{n} \sum_{k=\lfloor n/2 \rfloor}^{n-1} E[T(k)] + c_1 n \le \frac{2c}{n} \sum_{k=\lfloor n/2 \rfloor}^{n-1} k + c_1 n$$By evaluating the arithmetic series, we bound the sum:
$$E[T(n)] \le \frac{c}{n} \Big( \lfloor n/2 \rfloor + n - 1 \Big) \Big( n - \lfloor n/2 \rfloor \Big) + c_1 n$$This bounds the recurrence by roughly $c \cdot \frac{3}{4} n + c_1 n$ plus a lower-order term. For this to be at most $cn$, it suffices to choose $c > 4c_1$, taking $c$ also large enough to absorb the lower-order term for small $n$. This confirms the expected running time is linear. $\blacksquare$
While RANDOMIZED-SELECT is fast on average, it degrades to $O(n^2)$ if we get extremely unlucky with pivot selections. The deterministic SELECT algorithm guarantees $O(n)$ worst-case time by carefully computing a high-quality pivot known as the "median of medians."
The formal pseudocode for the deterministic SELECT algorithm directly mirrors the five-step process. We assume an initial call of $\text{SELECT}(A, 1, n, i)$.
SELECT(A, p, r, i)
1: n = r - p + 1
2: if n == 1
3: return A[p]
4:
5: // Step 1: Divide into groups of 5
6: divide A[p .. r] into ⌈n / 5⌉ groups (all but the last have exactly 5 elements)
7: let M[1 .. ⌈n / 5⌉] be a new array
8:
9: // Step 2: Find the median of each group
10: for j = 1 to ⌈n / 5⌉
11: sort the j-th group using insertion sort
12: M[j] = median of the j-th group
13:
14: // Step 3: Find the true median x of the group medians
15: x = SELECT(M, 1, ⌈n / 5⌉, ⌊⌈n / 5⌉ / 2⌋ + 1)
16:
17: // Step 4: Partition around x
18: find the index of x in A[p .. r] and exchange it with A[r]
19: q = PARTITION(A, p, r)
20: k = q - p // k is the number of elements on the low side
21:
22: // Step 5: Recurse on the appropriate partition
23: if i == k + 1
24: return A[q] // x is exactly the i-th smallest element
25: elseif i ≤ k
26: return SELECT(A, p, q - 1, i)
27: else
28: return SELECT(A, q + 1, r, i - k - 1)
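A compact Python sketch of the five steps follows. For brevity it uses list filtering in place of the in-place $\text{PARTITION}$ and a three-way split to handle duplicates; the function name and these simplifications are mine, and they preserve the $O(n)$ analysis:

```python
def select(a, i):
    # Deterministic linear-time selection: i-th smallest (1-indexed) of a.
    if len(a) <= 5:
        return sorted(a)[i - 1]
    # Steps 1-2: sort each group of 5 and take its median.
    medians = [sorted(a[j:j + 5])[len(a[j:j + 5]) // 2]
               for j in range(0, len(a), 5)]
    # Step 3: recursively find the median of the medians.
    x = select(medians, (len(medians) + 1) // 2)
    # Step 4: three-way partition around x.
    lo = [v for v in a if v < x]
    hi = [v for v in a if v > x]
    k = len(lo)
    eq = len(a) - len(lo) - len(hi)   # copies of x itself
    # Step 5: recurse only into the side that can contain rank i.
    if i <= k:
        return select(lo, i)
    elif i <= k + eq:
        return x
    else:
        return select(hi, i - k - eq)
```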
To prove this algorithm runs in $O(n)$ time, we must bound the size of the recursive calls. The pivot $x$ is the median of the $\lceil n/5 \rceil$ group medians. Therefore, at least half of these group medians, i.e., at least $\lceil \frac{1}{2} \lceil \frac{n}{5} \rceil \rceil$ of them, are greater than or equal to $x$. Because each group is sorted, every group whose median is at least $x$ contributes three elements that are at least $x$: the median itself and the two elements above it in that group.
Discounting the anomalous final group (which might have fewer than 5 elements) and the single group containing $x$ itself, the number of elements guaranteed to be strictly greater than $x$ is at least:
$$3 \left( \lceil \frac{1}{2} \lceil \frac{n}{5} \rceil \rceil - 2 \right) \ge \frac{3n}{10} - 6$$(Note: The lecture slides bound this deletion size simply as $3 \lceil \frac{1}{2} \lceil \frac{n}{5} \rceil \rceil - 2 \ge \frac{3n}{10} - 2$).
Because at least $\frac{3n}{10} - 2$ elements are definitively partitioned away from our recursive search, the maximum number of elements we can possibly recurse on in Step 5 is bounded by:
$$n - \left(\frac{3n}{10} - 2\right) = \frac{7n}{10} + 2$$This strict bound limits the maximum elements evaluated to $7n/10 + 2$.
We construct the overarching recurrence relation:
$$T(n) \le T(\lceil n/5 \rceil) + T\Big(n - \big(3\lceil \frac{1}{2}\lceil \frac{n}{5} \rceil \rceil - 2\big)\Big) + O(n)$$This is drawn directly from the $\lceil n/5 \rceil$ cost of finding the median of medians and the largest possible resulting partition.
We prove $T(n) = O(n)$ by substitution. Guess $T(n) \le cn$.
$$T(n) \le c\Big(\frac{n}{5} + 1\Big) + c\Big(\frac{7n}{10} + 2\Big) + c'n$$ $$T(n) \le \frac{9cn}{10} + 3c + c'n$$To satisfy $T(n) \le cn$, we must enforce that the terms above $9cn/10$ do not exceed $cn/10$. Thus, $cn/10 - 3c \ge c'n$, which factors to $c(n/10 - 3) \ge c'n$.
We make the math clean by assuming $n \ge 60$. Under this condition, $n/10 - 3 \ge n/20$. Substituting this into our requirement gives $c(n/20) \ge c'n$, which beautifully resolves to $c \ge 20c'$.
To formally lock in the base cases for induction, we must simply choose a constant $c$ that is large enough to satisfy the relation for all $n < 60$. Thus, we set:
$$c = \max\left(20c',\ \max\left\{\frac{T(1)}{1}, \frac{T(2)}{2}, \dots, \frac{T(59)}{59}\right\}\right)$$By meeting all these requirements, we mathematically guarantee that the algorithm terminates in $O(n)$ time in the absolute worst case.
Here's an interesting puzzle: if we use pure comparisons to solve the selection problem, what is the exact number of comparisons required in the worst-case scenario to find the median?
Finding the exact number of comparisons required to solve the selection problem for the median in the worst-case scenario is a classic open problem in computer science. While the problem is known to have a linear time complexity of \(O(n)\), the precise constant factor remains a subject of theoretical research. Currently, the most efficient algorithm for the upper bound, developed by Dor and Zwick, requires approximately \(2.95n\) comparisons.
On the other hand, the theoretical lower bound—the absolute minimum number of comparisons mathematically necessary—is approximately \(2.02n\) comparisons. This creates a known gap between \(2.02n\) and \(2.95n\); we know the optimal constant lies somewhere in this range, but its exact value for arbitrary \(n\) remains unknown. This difficulty arises because finding a median is significantly more complex than finding a minimum or maximum (which only requires \(n-1\) comparisons), as the algorithm must simultaneously prove that exactly half the elements are smaller and half are larger than the chosen value.
For very small, fixed values of \(n\), the exact numbers are known through exhaustive search. For instance, finding the median of 5 elements requires exactly 6 comparisons. However, as \(n\) grows, the structural overhead of the "Median-of-Medians" approach (or its modern derivatives) keeps the worst-case constant significantly higher than the average-case performance seen in algorithms like Quickselect.
The strategy of Divide and Conquer has been discussed at length throughout these notes, so we will spare the reader a full rehash.
The concept of self-reducibility within Divide-and-Conquer algorithms means that a problem can be decomposed into smaller instances of the exact same problem, allowing the same logic to be applied recursively until a base case is reached. This relationship creates a tree structure, often called a recursion tree, where the original problem acts as the root and each division creates branches leading to subproblems. As the algorithm breaks the input down, it moves from the root toward the leaves—the simplest, indivisible versions of the problem—and then climbs back up the tree to combine those partial solutions into a final result. This hierarchical structure is what allows us to calculate the total work performed using mathematical tools like the Master Theorem, which accounts for the cost of splitting and merging at each level of the tree.
The Master Theorem provides a systematic way of solving recurrence relations that arise in the analysis of divide-and-conquer algorithms. It gives asymptotic bounds on $T(n)$ in terms of standard notations like $\Theta$, $O$, and $\Omega$. It applies strictly to recurrences of the following form:
$$T(n) = aT(n/b) + f(n)$$where:
- $a \ge 1$ is the (constant) number of subproblems generated at each step,
- $b > 1$ is the (constant) factor by which the problem size shrinks in each subproblem, and
- $f(n)$ is the asymptotically positive cost of dividing the problem and combining the subproblem results.
We can illustrate the execution of the Master Theorem as a Solution Tree. In it, there is a node for each recursive call, with the children of that node being the other calls made from that call. The leaves of the tree are the base cases of the recursion, the subproblems (of size less than $k$) that do not recurse. Each node does an amount of work that corresponds to the size of the subproblem $n$ passed to that instance of the recursive call and given by $f(n)$. The total amount of work done by the entire algorithm is the sum of the work performed by all the nodes in the tree. Indeed, this concept of a solution tree ought to be familiar: it is the same concept as the recursion tree.
To solve the Master Theorem by hand, you must compare $f(n)$ to $n^{\log_b a}$. The exponent $\log_b a$ is called the critical exponent. It represents the amount of work done at the leaf nodes of the recursion tree.
To calculate $\log_b a$ in your head, ask yourself: "To what power must I raise $b$ to get $a$?" Here are the essential logarithm rules to help you solve this step:
- Change of base: $\log_b a = \frac{\lg a}{\lg b}$,
- Powers: $\log_b (b^k) = k$ and $b^{\log_b a} = a$,
- Identities: $\log_b 1 = 0$ and $\log_b b = 1$.
Whenever you face a recurrence, follow these exact steps to know which case applies:
| Case | Description | Comparison | Intuition | Result |
|---|---|---|---|---|
| Case 1 | Work to split/recombine a problem is dominated by the subproblems, i.e. the recursion tree is leaf-heavy. | $f(n) = O(n^{\log_b a - \epsilon})$ for some $\epsilon > 0$ (the leaves grow strictly faster than $f(n)$) | The work done at the leaf nodes dominates the total time. | $T(n) = \Theta(n^{\log_b a})$ |
| Case 2 | Work to split/recombine a problem is comparable to the subproblems. | $f(n) = \Theta(n^{\log_b a} \log^k n)$ for some $k \ge 0$ ($f(n)$ and the leaves grow at the same rate) | The work is distributed evenly across all levels of the tree; multiply the per-level work by the height ($\log n$). | $T(n) = \Theta(n^{\log_b a} \log^{k+1} n)$ |
| Case 3 | Work to split/recombine a problem dominates the subproblems, i.e. the recursion tree is root-heavy. | $f(n) = \Omega(n^{\log_b a + \epsilon})$ for some $\epsilon > 0$, plus the regularity condition $a f(n/b) \le c f(n)$ for some $c < 1$ | The work done at the root (dividing/combining) dominates the total time. | $T(n) = \Theta(f(n))$ |
The theorem fails if the recurrence does not fit the rigid structure:
- $a$ is not a constant (e.g., $T(n) = 2nT(n/2) + n^n$), or $a < 1$ (fewer than one subproblem);
- $f(n)$ is negative or not asymptotically positive (e.g., $f(n) = -n^2 \log n$);
- $f(n)$ differs from $n^{\log_b a}$ by less than a polynomial factor (e.g., $f(n) = n/\log n$ when $n^{\log_b a} = n$), so no $\epsilon$ works in Case 1 or Case 3;
- in Case 3, the regularity condition $a f(n/b) \le c f(n)$ fails (e.g., $f(n) = n(2 - \cos n)$).
Sometimes $f(n)$ takes the form $\Theta(n^k \log^p n)$. The advanced Master Theorem expands on the basic 3 cases to handle varying powers of $p$. Here is the breakdown where $a \geq 1$, $b > 1$, $k \geq 0$, and $p$ is a real number:
1. If $a > b^k$ (Equivalent to Case 1): $T(n) = \Theta(n^{\log_b a})$.
2. If $a = b^k$ (Equivalent to Case 2):
   - If $p > -1$: $T(n) = \Theta(n^k \log^{p+1} n)$;
   - If $p = -1$: $T(n) = \Theta(n^k \log \log n)$;
   - If $p < -1$: $T(n) = \Theta(n^k)$.
3. If $a < b^k$ (Equivalent to Case 3):
   - If $p \ge 0$: $T(n) = \Theta(n^k \log^p n)$;
   - If $p < 0$: $T(n) = O(n^k)$.
To help solidify these concepts, let us work through some examples.
Recurrence: $T(n) = T(n/2) + O(1)$ (Binary Search). Here $a = 1$, $b = 2$, so $n^{\log_2 1} = n^0 = 1$. Since $f(n) = O(1) = \Theta(n^0)$, Case 2 applies: $T(n) = \Theta(\log n)$.
Recurrence: $T(n) = 2T(n/2) + O(n)$ (Merge Sort). Here $a = 2$, $b = 2$, so $n^{\log_2 2} = n$. Since $f(n) = \Theta(n)$, Case 2 applies: $T(n) = \Theta(n \log n)$.
Recurrence: $T(n) = 3T(n/2) + n^2$ (Root Dominated). Here $n^{\log_2 3} \approx n^{1.585}$, and $f(n) = n^2 = \Omega(n^{1.585 + \epsilon})$; regularity holds since $3(n/2)^2 = \frac{3}{4}n^2$. Case 3 applies: $T(n) = \Theta(n^2)$.
Recurrence: $T(n) = 2T(n/2) + n \log^2 n$ (Advanced Case: logarithmic factor in $f(n)$). Here $a = b^k$ with $k = 1$ and $p = 2 > -1$, so $T(n) = \Theta(n \log^3 n)$.
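The basic case analysis for purely polynomial driving functions can be mechanized with a tiny helper. This sketch (name and output format mine) classifies $T(n) = aT(n/b) + \Theta(n^c)$; it deliberately ignores logarithmic factors in $f(n)$ and the Case 3 regularity check:

```python
import math

def master_theorem(a, b, c):
    # Classify T(n) = a*T(n/b) + Theta(n^c) into the three basic cases and
    # return the asymptotic bound as a string.
    crit = math.log(a, b)                      # critical exponent log_b(a)
    if c < crit - 1e-9:
        return f"Theta(n^{crit:.3f})"          # Case 1: leaf-heavy
    if abs(c - crit) <= 1e-9:
        return f"Theta(n^{c} log n)"           # Case 2: balanced
    return f"Theta(n^{c})"                     # Case 3: root-heavy
```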
In case more practice is needed, enjoy some more problems, solved.
$T(n) = 3T(n/2) + n^2$. Case 3 ($\log_2 3 \approx 1.58 < 2$): $T(n) = \Theta(n^2)$.
$T(n) = 4T(n/2) + n^2$. Case 2 ($\log_2 4 = 2$): $T(n) = \Theta(n^2 \log n)$.
$T(n) = T(n/2) + 2^n$. Case 3 ($2^n$ dominates $n^{\log_2 1} = 1$, regularity holds): $T(n) = \Theta(2^n)$.
$T(n) = 2nT(n/2) + n^n$. Does not apply: $a = 2n$ is not a constant.
$T(n) = 16T(n/4) + n$. Case 1 ($\log_4 16 = 2 > 1$): $T(n) = \Theta(n^2)$.
$T(n) = 2T(n/2) + n \log n$. Advanced Case 2 ($a = b^k$, $p = 1$): $T(n) = \Theta(n \log^2 n)$.
$T(n) = 2T(n/2) + n / \log n$. Basic theorem does not apply (the gap from $n^{\log_2 2} = n$ is sub-polynomial); Advanced Case 2 with $p = -1$ gives $T(n) = \Theta(n \log \log n)$.
$T(n) = 2T(n/4) + n^{0.51}$. Case 3 ($\log_4 2 = 0.5 < 0.51$): $T(n) = \Theta(n^{0.51})$.
$T(n) = 0.5T(n/2) + 1/n$. Does not apply: $a < 1$.
$T(n) = 16T(n/4) + n!$. Case 3: $T(n) = \Theta(n!)$.
$T(n) = \sqrt{2}T(n/2) + \log n$. Case 1 ($\log_2 \sqrt{2} = 0.5$): $T(n) = \Theta(\sqrt{n})$.
$T(n) = 3T(n/2) + n$. Case 1 ($\log_2 3 \approx 1.58 > 1$): $T(n) = \Theta(n^{\log_2 3})$.
$T(n) = 3T(n/3) + \sqrt{n}$. Case 1 ($\log_3 3 = 1 > 0.5$): $T(n) = \Theta(n)$.
$T(n) = 4T(n/2) + cn$. Case 1 ($\log_2 4 = 2 > 1$): $T(n) = \Theta(n^2)$.
$T(n) = 3T(n/4) + n \log n$. Case 3 ($\log_4 3 \approx 0.79 < 1$): $T(n) = \Theta(n \log n)$.
$T(n) = 3T(n/3) + n/2$. Case 2 ($\log_3 3 = 1$): $T(n) = \Theta(n \log n)$.
$T(n) = 6T(n/3) + n^2 \log n$. Case 3 ($\log_3 6 \approx 1.63 < 2$): $T(n) = \Theta(n^2 \log n)$.
$T(n) = 4T(n/2) + n / \log n$. Case 1 ($n/\log n = O(n^{2 - \epsilon})$): $T(n) = \Theta(n^2)$.
$T(n) = 64T(n/8) - n^2 \log n$. Does not apply: $f(n)$ is negative.
$T(n) = 7T(n/3) + n^2$. Case 3 ($\log_3 7 \approx 1.77 < 2$): $T(n) = \Theta(n^2)$.
$T(n) = 4T(n/2) + \log n$. Case 1 ($\log n = O(n^{2 - \epsilon})$): $T(n) = \Theta(n^2)$.
$T(n) = T(n/2) + n(2 - \cos n)$. Does not apply: this would be Case 3, but the regularity condition fails because $2 - \cos n$ oscillates.
A geometric series is the sum of a sequence of terms where the ratio between consecutive terms is constant. Finding a closed-form expression for the sum of a finite geometric series is a foundational technique in algebra, discrete mathematics, and the analysis of algorithm complexities (such as analyzing tree structures). The following theorem demonstrates how to derive this formula by exploiting the self-similar nature of the series.
Theorem: $1 + x + x^2 + x^3 + \dots + x^k = \frac{x^{k+1} - 1}{x - 1}$
Proof. Let $S$ represent the sum of the finite series. The proof relies on multiplying the entire series by the common ratio $x$. When we subtract the original series $S$ from this newly multiplied series $xS$, the intermediate terms elegantly cancel each other out—a phenomenon similar to a telescoping sum—leaving only the highest degree term and the constant.
$$ \begin{aligned} S &= 1 + x + x^2 + x^3 + \dots + x^k \\ xS &= \phantom{1 + } x + x^2 + x^3 + \dots + x^k + x^{k+1} \\ xS - S &= x^{k+1} - 1 \\ S(x - 1) &= x^{k+1} - 1 \\ S &= \frac{x^{k+1} - 1}{x - 1} \quad \blacksquare \end{aligned} $$While the algebraic telescoping sum provides a closed-form solution that can be computed in $O(1)$ time, computing the series incrementally via a standard loop takes $O(n)$ time. However, we can also approach this using a Divide and Conquer strategy. It exploits the self-similarity of the sequence. Instead of adding terms one by one, we split the series of $n$ terms into two halves. If $n$ is even, the right half of the series is exactly the left half multiplied by $r^{n/2}$.
$$S_n = (a + ar + \dots + ar^{n/2-1}) + r^{n/2}(a + ar + \dots + ar^{n/2-1}) = S_{n/2} + r^{n/2} S_{n/2}$$To avoid recalculating powers of $r$ from scratch (which would degrade our time complexity), our recursive function will return a tuple containing both the sum up to $k$ terms ($S_k$) and the common ratio raised to the $k$-th power ($R_k$).
DC-GEOMETRIC-SUM(a, r, n)
1: if n == 1
2: return (a, r)
3:
4: k = ⌊n / 2⌋
5: (S_k, R_k) = DC-GEOMETRIC-SUM(a, r, k)
6:
7: if n % 2 == 0
8: S_n = S_k + (R_k * S_k)
9: R_n = R_k * R_k
10: return (S_n, R_n)
11: else
12: S_n = S_k + (R_k * S_k) + (a * R_k * R_k)
13: R_n = R_k * R_k * r
14: return (S_n, R_n)
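The pseudocode above translates directly into Python; this sketch returns the tuple $(S_n, r^n)$ exactly as described:

```python
def dc_geometric_sum(a, r, n):
    # Returns (S_n, r**n) where S_n = a + a*r + ... + a*r**(n-1).
    # One half-size recursion per call: T(n) = T(n/2) + O(1) = Theta(log n).
    if n == 1:
        return (a, r)
    k = n // 2
    s_k, r_k = dc_geometric_sum(a, r, k)
    if n % 2 == 0:
        # n = 2k: right half is the left half scaled by r**k.
        return (s_k + r_k * s_k, r_k * r_k)
    # n = 2k + 1: append the final term a * r**(2k).
    return (s_k + r_k * s_k + a * r_k * r_k, r_k * r_k * r)
```

For $a = 1$ the first component agrees with the closed form $\frac{r^{n} - 1}{r - 1}$ derived in the theorem above.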
Proof. Let $T(n)$ represent the worst-case running time to compute the sum of $n$ terms. In our algorithm, the problem of size $n$ is divided into exactly one subproblem of size $\lfloor n/2 \rfloor$ (line 5). The work done outside the recursive call (lines 7 through 14) consists purely of basic arithmetic operations (addition, multiplication) and modulo checks, all of which execute in constant time $O(1)$.
This structure yields the following recurrence relation:
$$T(n) = T(n/2) + O(1)$$We solve this using the Master Theorem for divide-and-conquer recurrences of the form $T(n) = aT(n/b) + f(n)$. Here, $a=1$, $b=2$, and $f(n) = O(1) = O(n^0)$. We calculate the critical exponent $\log_b(a) = \log_2(1) = 0$. Since $f(n) = \Theta(n^{\log_b(a)})$, we are in Case 2 of the Master Theorem. Therefore, the running time resolves strictly to a logarithmic bound:
$$T(n) = \Theta(\log n) \quad \blacksquare$$We prove the correctness of DC-GEOMETRIC-SUM by strong mathematical induction on $n$, the number of terms.
Proof. Base Case ($n=1$): When $n=1$, the algorithm triggers the base condition at line 1 and returns $(a, r)$. The true sum of a geometric series with 1 term is simply $a \cdot r^0 = a$, and the ratio raised to the 1st power is $r^1 = r$. Thus, the base case holds trivially.
Inductive Hypothesis: Assume that for all integers $1 \le k < n$, the function call DC-GEOMETRIC-SUM(a, r, k) correctly returns the tuple $(S_k, r^k)$, where $S_k = \sum_{i=0}^{k-1} a r^i$.
Inductive Step: We must demonstrate that DC-GEOMETRIC-SUM(a, r, n) correctly evaluates to $(S_n, r^n)$. Let $k = \lfloor n/2 \rfloor$. Because $k < n$, our inductive hypothesis guarantees that the recursive call on line 5 accurately assigns $S_k = \sum_{i=0}^{k-1} a r^i$ and $R_k = r^k$. We examine the two mutually exclusive parity cases for $n$:
Case 1: $n$ is even. Then $n = 2k$. The algorithm computes the new sum as $S_n = S_k + R_k S_k$. Expanding this using our hypothesis:
$$S_n = \sum_{i=0}^{k-1} a r^i + r^k \sum_{i=0}^{k-1} a r^i = \sum_{i=0}^{k-1} a r^i + \sum_{i=k}^{2k-1} a r^i = \sum_{i=0}^{2k-1} a r^i = \sum_{i=0}^{n-1} a r^i$$The updated power is calculated as $R_k \cdot R_k = r^k \cdot r^k = r^{2k} = r^n$. Both tuple values are mathematically correct.
Case 2: $n$ is odd. Then $n = 2k + 1$. The algorithm computes $S_n = S_k + R_k S_k + a \cdot R_k \cdot R_k$. Expanding this substitution:
$$S_n = \sum_{i=0}^{k-1} a r^i + r^k \sum_{i=0}^{k-1} a r^i + a(r^k)^2 = \sum_{i=0}^{2k-1} a r^i + a r^{2k} = \sum_{i=0}^{n-1} a r^i$$The updated power is calculated as $R_k \cdot R_k \cdot r = r^k \cdot r^k \cdot r^1 = r^{2k+1} = r^n$. Again, both tuple values are mathematically correct.
Because the inductive step holds true for both possible cases, by the principle of strong mathematical induction, the algorithm is correct for all integers $n \ge 1$. $\blacksquare$
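To make the recursion concrete, here is a direct Python transcription of the DC-GEOMETRic-SUM pseudocode above (the snake_case function name is mine); it returns the pair $(S_n, r^n)$ exactly as the pseudocode does.

```python
def dc_geometric_sum(a, r, n):
    """Return (S_n, r**n) where S_n = a + a*r + ... + a*r**(n-1).

    One recursive call on floor(n/2) plus O(1) arithmetic,
    giving the Theta(log n) running time proven above.
    """
    if n == 1:                      # base case: a single term
        return (a, r)
    k = n // 2
    s_k, r_k = dc_geometric_sum(a, r, k)
    if n % 2 == 0:                  # n = 2k
        return (s_k + r_k * s_k, r_k * r_k)
    else:                           # n = 2k + 1
        return (s_k + r_k * s_k + a * r_k * r_k, r_k * r_k * r)
```

For instance, with $a = 3$, $r = 2$, $n = 10$ the sum is $3(2^{10} - 1) = 3069$, matching the closed form $a(r^n - 1)/(r - 1)$.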
The maximum subarray problem (or maximum segment sum) involves finding a contiguous subarray within a one-dimensional array \( A[1 \dots n] \) that yields the largest possible sum.
Formal Definition: Find indices \( i \) and \( j \) with \( 1 \le i \le j \le n \) to maximize: $$ S = \sum_{k=i}^{j} A[k] $$
This is not a novel problem; many researchers have attempted it already, each improving on the last:
| Researcher | Approach | Complexity |
|---|---|---|
| Ulf Grenander (1977) | Prefix Sums | \( O(n^2) \) |
| Michael Shamos | Divide and Conquer | \( O(n \log n) \) |
| Jay Kadane | Dynamic Programming | \( O(n) \) |
Kadane's algorithm uses a 1-pass scan. It tracks the maximum subarray ending at the current position and the overall maximum seen so far.
The divide and conquer approach takes \(O(n \log n)\). Furthermore, the problem exhibits self-reducibility and a tree structure. The maximum sum is the maximum of three cases: a subarray lying entirely within the left half, a subarray lying entirely within the right half, or a subarray crossing the midpoint.
The algorithm works by dividing the array into two halves. The maximum contiguous subarray must lie in one of three places: entirely within the left half, entirely within the right half, or crossing the midpoint between the two halves. We recursively find the maximums for the left and right halves, and use a helper function to find the maximum crossing subarray, ultimately returning the largest of the three.
FIND-MAXIMUM-SUBARRAY(A, low, high)
1: if high == low
2: return (low, high, A[low]) // Base case: only one element
3: else
4: mid = ⌊(low + high) / 2⌋
5: (left-low, left-high, left-sum) = FIND-MAXIMUM-SUBARRAY(A, low, mid)
6: (right-low, right-high, right-sum) = FIND-MAXIMUM-SUBARRAY(A, mid + 1, high)
7: (cross-low, cross-high, cross-sum) = FIND-MAX-CROSSING-SUBARRAY(A, low, mid, high)
8: if left-sum ≥ right-sum and left-sum ≥ cross-sum
9: return (left-low, left-high, left-sum)
10: elseif right-sum ≥ left-sum and right-sum ≥ cross-sum
11: return (right-low, right-high, right-sum)
12: else
13: return (cross-low, cross-high, cross-sum)
To make this work, we need the helper function FIND-MAX-CROSSING-SUBARRAY. This function does not need to be recursive; it simply radiates outward from the midpoint to find the maximum sum that straddles the divide.
FIND-MAX-CROSSING-SUBARRAY(A, low, mid, high)
1: left-sum = -∞
2: sum = 0
3: for i = mid downto low
4: sum = sum + A[i]
5: if sum > left-sum
6: left-sum = sum
7: max-left = i
8:
9: right-sum = -∞
10: sum = 0
11: for j = mid + 1 to high
12: sum = sum + A[j]
13: if sum > right-sum
14: right-sum = sum
15: max-right = j
16:
17: return (max-left, max-right, left-sum + right-sum)
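As a sanity check, the two procedures above can be transcribed into runnable Python (0-indexed; the snake_case names are mine). The three-way comparison at the end of FIND-MAXIMUM-SUBARRAY is expressed with `max` keyed on the sum component.

```python
def find_max_crossing(A, low, mid, high):
    """Scan outward from the midpoint to find the best straddling sum."""
    left_sum, s = float('-inf'), 0
    for i in range(mid, low - 1, -1):
        s += A[i]
        if s > left_sum:
            left_sum, max_left = s, i
    right_sum, s = float('-inf'), 0
    for j in range(mid + 1, high + 1):
        s += A[j]
        if s > right_sum:
            right_sum, max_right = s, j
    return (max_left, max_right, left_sum + right_sum)

def find_maximum_subarray(A, low, high):
    if low == high:
        return (low, high, A[low])          # base case: one element
    mid = (low + high) // 2
    left = find_maximum_subarray(A, low, mid)
    right = find_maximum_subarray(A, mid + 1, high)
    cross = find_max_crossing(A, low, mid, high)
    return max(left, right, cross, key=lambda t: t[2])
```

On the classic CLRS example array, the maximum subarray sum is 43.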
To establish the time complexity of FIND-MAXIMUM-SUBARRAY, we must build and solve its recurrence relation. Let $T(n)$ denote the running time of the algorithm on an array of $n$ elements.
Proof. Base case: when the subarray contains a single element (i.e., high == low), the algorithm simply executes lines 1 and 2 and returns immediately. This takes constant time. Thus, $T(1)=\Theta(1)$.

Recursive case: each call on a subarray of size $n$ spawns two recursive calls on subarrays of size $n/2$ (lines 5 and 6), plus a single call to FIND-MAX-CROSSING-SUBARRAY (line 7). If we look at the pseudocode for this helper function, it contains two non-nested for loops. The first loop iterates from mid down to low, executing $n/2$ times. The second loop iterates from mid + 1 up to high, also executing $n/2$ times. Each iteration takes constant $\Theta(1)$ time. Combined, the helper function processes every element in the array exactly once, making its running time strictly linear: $C(n)=\Theta(n)$.

Putting these pieces together yields the recurrence $$T(n) = 2T(n/2) + \Theta(n)$$ We can solve this recurrence rigorously using the Master Theorem for divide-and-conquer recurrences of the form $T(n)=aT(n/b)+f(n)$, with $a=2$, $b=2$, and $f(n)=\Theta(n)$.
First, we evaluate the watershed function $n^{\log_b a}$:
$$n^{\log_2 2} = n^1 = n$$Next, we compare our driving function $f(n)$ to the watershed function. We observe that $f(n)=\Theta(n)$ is asymptotically identical to $n^{\log_b a}$. Specifically, $f(n)=\Theta(n^{\log_b a} \log^0 n)$.
This perfectly matches Case 2 of the Master Theorem. When the root cost $f(n)$ is proportional to the number of leaves $n^{\log_b a}$, the total work is evenly distributed across all levels of the recursion tree. The solution dictates that we simply multiply the watershed function by a logarithmic factor:
$$T(n)=\Theta(n^{\log_b a} \lg n)=\Theta(n \lg n) \quad \blacksquare$$
On the other hand, Kadane's Algorithm relies on a beautifully simple mathematical observation: a maximum contiguous subarray ending at a specific index $i$ is either the element $A[i]$ itself, or it is the element $A[i]$ added to the maximum contiguous subarray ending at index $i - 1$. In other words, checking every possible combination of numbers in the array is not necessary; all that is needed is the best score from the previous step to know what to do at the present step.
Take for example the following array $A$.
| index | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| $A$ | 6 | -5 | 3 | 2 | -4 | -1 |
Starting at $A[1]$, the maximum subarray ending there is just 6. At index 2, there are two choices: extend the previous subarray, giving $6 + (-5) = 1$, or start fresh at $-5$. Since $1 > -5$, extending wins, so the best subarray ending at index 2 has sum 1.
Below, Kadane's Algorithm is implemented.
KADANE-MAX-SUBARRAY(A)
1: max-sum = -∞
2: current-sum = 0
3: start-index = 1
4: end-index = 1
5: temp-start = 1
6:
7: for i = 1 to A.length
8: if current-sum < 0
9: current-sum = A[i]
10: temp-start = i
11: else
12: current-sum = current-sum + A[i]
13:
14: if current-sum > max-sum
15: max-sum = current-sum
16: start-index = temp-start
17: end-index = i
18:
19: return (start-index, end-index, max-sum)
Notice how it follows the logic we outlined above. As we scan the array from left to right, we maintain a running current-sum. If our accumulated current-sum ever drops below zero, it means the sequence we have built so far is actively dragging our total value down. Therefore, it is mathematically optimal to completely abandon the previous sequence and start a fresh subarray directly at the current element $A[i]$. Throughout this single pass, we continuously update a global max-sum to remember the highest peak we have encountered.
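For reference, here is a 0-indexed Python transcription of KADANE-MAX-SUBARRAY (the snake_case names are mine):

```python
def kadane_max_subarray(A):
    """One-pass scan tracking the best subarray ending at each index."""
    max_sum = float('-inf')
    current_sum = 0
    start = end = temp_start = 0
    for i, x in enumerate(A):
        if current_sum < 0:        # the run so far drags us down: restart
            current_sum = x
            temp_start = i
        else:                      # otherwise extend the current run
            current_sum += x
        if current_sum > max_sum:  # record the highest peak seen so far
            max_sum = current_sum
            start, end = temp_start, i
    return (start, end, max_sum)
```

On the example array $\{6, -5, 3, 2, -4, -1\}$ from above, the maximum sum is 6 (attained by the first element alone, tied by the prefix of length 4).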
To prove that Kadane's algorithm outperforms the Divide-and-Conquer approach, we must formally analyze both its time and space complexities.
Lines 1–5 perform basic scalar variable initializations, requiring strictly $\Theta(1)$ constant time. Line 7 initiates a single for loop that iterates precisely $n$ times, where $n = A.\text{length}$. Inside this loop (lines 8–17), the algorithm performs only primitive operations: evaluating boolean conditions, performing basic addition, and reassigning scalar variables. Each of these internal operations resolves in $\Theta(1)$ time.
Because there are no nested loops, recursive calls, or complex subroutine invocations, the total time spent within the loop is the summation of a constant over $n$ iterations:
$$\sum_{i=1}^{n} \Theta(1) = \Theta(n)$$Adding the initialization overhead yields $T(n) = \Theta(1) + \Theta(n) = \Theta(n)$. The worst-case running time is rigorously bound at linear time.
The Divide-and-Conquer solution required $\Theta(\lg n)$ auxiliary space to maintain the call stack during recursion. Kadane's algorithm, however, operates entirely iteratively. It only ever allocates a fixed set of five integer variables (max-sum, current-sum, start-index, end-index, and temp-start). Because the memory footprint does not scale with the size of the input array $n$, the auxiliary space complexity is strictly $\Theta(1)$. This makes Kadane's an optimal in-place algorithm.
The objective of this problem is to find the largest possible rectangular area within a given histogram, assuming that the maximum rectangle can be formed by combining a number of contiguous bars. For simplicity in our calculations, we assume that all bars have a uniform width of $1$ unit.
Consider a histogram composed of $7$ bars with the following heights: $\{6, 2, 5, 4, 5, 1, 6\}$. By visual inspection of the histogram below, the largest possible contiguous rectangle that can be inscribed has an area of $12$ square units. This maximum area rectangle is bounded by the middle three bars (heights $5, 4, \text{and } 5$). Its height is constrained by the minimum height of $4$ in that specific range, and its width spans $3$ bars, yielding a total area of $3 \times 4 = 12$.
We can solve this problem elegantly using a Divide and Conquer approach. The fundamental idea is to locate the minimum height value within the given array segment. Once the index of the minimum value is found, the maximum rectangular area must inherently be the maximum of the following three potential values:
- the maximum area in the segment to the left of the minimum bar,
- the maximum area in the segment to the right of the minimum bar, and
- the area of the rectangle spanning the entire segment, whose height is the minimum value and whose width is the number of bars in the segment.
The areas to the left and right of the minimum value's bar are structured identically to the main problem and can therefore be calculated recursively.
If we implement a standard linear search to find the minimum value at each recursive step, the worst-case time complexity of this algorithm degrades to $O(n^2)$. This worst-case scenario emerges when the array elements are strictly increasing or strictly decreasing. In such a case, the recursive partition is highly unbalanced: we consistently have $(n-1)$ elements on one side of the minimum and $0$ elements on the other. Because finding the minimum takes $O(n)$ time during each call, the recurrence relation mirrors the worst-case performance of the Quick Sort algorithm.
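The recursion just described can be sketched in Python. This is the plain version with a linear minimum scan, so its worst case is the $O(n^2)$ bound discussed above; the function name `max_histogram_area` is my own.

```python
def max_histogram_area(h, lo=0, hi=None):
    """Largest rectangle in histogram h[lo..hi] via divide and conquer.

    Uses a linear scan to locate the minimum bar; replacing that scan
    with a segment-tree RMQ improves the bound to O(n log n).
    """
    if hi is None:
        hi = len(h) - 1
    if lo > hi:                       # empty segment contributes nothing
        return 0
    m = min(range(lo, hi + 1), key=lambda i: h[i])   # index of minimum bar
    spanning = h[m] * (hi - lo + 1)   # rectangle across the whole segment
    return max(spanning,
               max_histogram_area(h, lo, m - 1),     # left of the minimum
               max_histogram_area(h, m + 1, hi))     # right of the minimum
```

On the example histogram $\{6, 2, 5, 4, 5, 1, 6\}$ this returns the expected area of 12.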
To improve our efficiency from $O(n^2)$ to $O(n \log n)$, we need a faster way to find the minimum over any arbitrary subarray. We can achieve this by utilizing a Range Minimum Query (RMQ) powered by a Segment Tree.
By pre-processing the given histogram heights, we can construct a segment tree. Once this segment tree is built, all subsequent range minimum queries take strictly $O(\log n)$ time. The overall complexity of the algorithm is calculated as follows:
$$ \text{Overall Time} = \text{Time to build Segment Tree} + \text{Time to recursively find maximum area} $$The time required to build the initial segment tree is $O(n)$. Let the time required to recursively find the maximum area be denoted by the function $T(n)$. In the worst-case recursive scenario (highly unbalanced splits), this can be mathematically expressed as:
$$ T(n) = O(\log n) + T(n-1) $$The closed-form solution to the above recurrence relation evaluates to $O(n \log n)$. Therefore, combining the preprocessing and recursive phases, the overall time evaluates to $O(n) + O(n \log n)$, which simplifies asymptotically to strictly $O(n \log n)$.
Consider an array $arr[0 \dots n-1]$. The objective is to efficiently find the minimum value within a specific range from index $qs$ (query start) to $qe$ (query end), where $0 \le qs \le qe \le n-1$.
A simple solution is to run a loop from $qs$ to $qe$ and linearly search for the minimum element in the given range. This solution takes $O(n)$ time in the worst case per query.
Another solution involves precomputing and creating a 2D array where an entry at $[i, j]$ stores the minimum value in the range $arr[i \dots j]$. With this setup, the minimum of any given range can be calculated in $O(1)$ time. However, this preprocessing phase takes $O(n^2)$ time. Furthermore, this approach requires $O(n^2)$ extra space, which can easily become prohibitively huge for large input arrays.
A Segment tree provides a balanced compromise, allowing us to perform both preprocessing and querying in moderate time. Using a segment tree, the preprocessing time is $O(n)$ and the time for a range minimum query is reduced to $O(\log n)$. The extra space required to store the segment tree is $O(n)$.
Segment Trees are represented like this:
An array representation is typically used for Segment Trees. For each node located at index $i$ in the tree array (with the root at index $0$): the left child resides at index $2i + 1$, the right child at index $2i + 2$, and the parent at index $\lfloor (i-1)/2 \rfloor$.
We initiate the process with the full segment $arr[0 \dots n-1]$. At each step, we divide the current segment into two halves (provided it has not yet reached a segment of length $1$). We recursively call the same procedure on both halves. For each segment processed, we store the minimum value calculated from its halves into a segment tree node.
All levels of the resulting segment tree will be completely filled except possibly the last level. Furthermore, the tree will strictly be a Full Binary Tree because segments are consistently divided into exactly two halves at every level. Since the constructed tree is always a full binary tree containing $n$ leaves, there will be exactly $n-1$ internal nodes. Thus, the total number of nodes in the tree will be $2 \times n - 1$.
The height of the segment tree will be $\lceil\log_2 n\rceil$. Because the tree is represented using an array and we must maintain the strict mathematical relationship between parent and child indexes, the size of the memory allocated for the segment tree array must be $2 \times 2^{\lceil\log_2 n\rceil} - 1$.
Once the segment tree is constructed, we can perform range minimum queries. The algorithm to retrieve the minimum evaluates three main conditions for any given node relative to the query range $[qs, qe]$:
/* qs → query start index,
* qe → query end index */
RMQ(node, qs, qe)
1: if range of node is within qs and qe
2: return value in node
3: else if range of node is completely outside qs and qe
4: return ∞
5: else
6: return min(RMQ(node's left child, qs, qe),
7: RMQ(node's right child, qs, qe))
Using the Segment Tree illustrated above as an input, the expected output for a range given by $[1, 5]$ would be two.
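A minimal Python sketch of the construction and query (array-backed, 0-indexed, with my own function names; a production version would also support point updates):

```python
import math

def build_segment_tree(arr):
    """Array-backed min segment tree; node i has children 2i+1 and 2i+2."""
    n = len(arr)
    size = 2 * (1 << math.ceil(math.log2(n))) - 1 if n > 1 else 1
    tree = [math.inf] * size

    def build(node, lo, hi):
        if lo == hi:                  # leaf stores an array element
            tree[node] = arr[lo]
            return
        mid = (lo + hi) // 2
        build(2 * node + 1, lo, mid)
        build(2 * node + 2, mid + 1, hi)
        tree[node] = min(tree[2 * node + 1], tree[2 * node + 2])

    build(0, 0, n - 1)
    return tree

def rmq(tree, n, qs, qe, node=0, lo=0, hi=None):
    """Minimum of arr[qs..qe] using the prebuilt tree."""
    if hi is None:
        hi = n - 1
    if qs <= lo and hi <= qe:         # node range entirely inside the query
        return tree[node]
    if hi < qs or qe < lo:            # node range entirely outside the query
        return math.inf
    mid = (lo + hi) // 2              # partial overlap: recurse on both children
    return min(rmq(tree, n, qs, qe, 2 * node + 1, lo, mid),
               rmq(tree, n, qs, qe, 2 * node + 2, mid + 1, hi))
```

For example, with the array $[1, 3, 2, 7, 9, 11, 3, 5]$, the query over $[1, 5]$ returns $2$.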
The time complexity for tree construction is $O(n)$. As established, there is a total of $2n-1$ nodes, and the representative minimum value for every node is calculated exactly once during the initial tree construction. The time complexity to query is $O(\log n)$. To compute a range minimum query, the algorithm processes at most two nodes at every level of the tree. Since the total number of levels is bound by $O(\log n)$, the query efficiently operates in logarithmic time.
Incidentally, these complexities can be proven. We start with the first one, $O(n)$ for tree construction.
Proof. Let $n$ be the number of elements in the given array. The segment tree is constructed as a strictly full binary tree. By definition, the leaf nodes of this tree represent the individual elements of the array. Therefore, the number of leaf nodes is exactly $n$.
In any full binary tree, the number of internal nodes is always exactly one less than the number of leaf nodes. Thus, the number of internal nodes is $n - 1$. We can calculate the total number of nodes in the segment tree by summing the leaf and internal nodes:
$$ \text{Total Nodes} = n + (n - 1) = 2n - 1$$During the construction phase, the value of each internal node is computed by taking the minimum of its two direct children. This operation takes constant time, or $O(1)$. Since we perform this constant time operation exactly once for each of the $2n - 1$ nodes, the total time complexity $T(n)$ evaluates to:
$$T(n) = (2n - 1) \times O(1) = O(n)$$Alternatively, this can be proven using the recurrence relation for the construction, where the problem is divided into two halves of size $n/2$, and the merge step takes $O(1)$ time:
$$T(n) = 2T(n/2) + O(1)$$Applying the Master Theorem for divide-and-conquer recurrences of the form $T(n) = aT(n/b) + f(n)$, we have $a=2$, $b=2$, and $f(n) = O(1) = O(n^0)$. Because $\log_b(a) = \log_2(2) = 1$, and $0 < 1$, this falls into the first case of the Master Theorem, conclusively yielding a bound of $$O(n^{\log_b a}) = O(n^1) = O(n) \quad \blacksquare$$
Now, we move to a proof that the time complexity to query is $O(\log n)$.
Proof. To start, we must analyze the maximum number of nodes visited at any single depth (or level) of the segment tree.
Let the specified query range be $[qs, qe]$. The RMQ algorithm evaluates three distinct conditions for any given node's interval:
Case A: Completely outside the query range. Returns in $O(1)$ time.
Case B: Completely inside the query range. Returns in $O(1)$ time.
Case C: Partially overlapping the query range. Branches to evaluate both children.
The time complexity is determined entirely by how many times Case C (partial overlap) occurs, as this dictates further recursive calls. For a node's interval to partially overlap the query range $[qs, qe]$, the interval must contain the starting boundary $qs$ or the ending boundary $qe$ (or both, as happens when the node's interval strictly contains the whole query range) without lying entirely inside $[qs, qe]$.
Because the nodes at any specific level of a segment tree represent strictly mutually exclusive (disjoint) intervals of the original array, the query start index $qs$ can fall into at most one node's interval at that level. Similarly, the query end index $qe$ can fall into at most one node's interval at that level.
Consequently, at any arbitrary depth of the tree, there can be at most two nodes that exhibit partial overlap (Case C) with the query range $[qs, qe]$. All other nodes visited at that level will either be completely inside or completely outside the query range, thereby immediately terminating the recursive branch in $O(1)$ time.
Because the algorithm branches from at most two nodes per level, it processes a constant number of operations, $O(1)$, at each level. The total time complexity is bounded by the maximum depth, or height, of the segment tree. For a full binary tree containing $n$ leaves, the height $h$ is mathematically defined as:
$$ h = \lceil \log_2 n \rceil $$Multiplying the maximum number of levels by the constant work done per level yields the final time complexity bound:
$$ \text{Query Time} \le O(1) \times \lceil \log_2 n \rceil = O(\log n) \quad \blacksquare $$
Given $n$ points in the Euclidean plane, the objective is to find a pair of points that minimizes the distance between them.
This is yet another application of the divide and conquer strategy.
The algorithm draws a vertical line that splits the point set into two halves of roughly equal size, then recursively computes the closest-pair distance $\delta_1$ in the left half and $\delta_2$ in the right half. Let $\delta$ be the minimum of these two distances:
$$ \delta = \min(\delta_1, \delta_2) $$To efficiently perform this cross-boundary check, we extract the points that lie within the $2\delta$-wide strip and sort them by their $y$-coordinates. The crucial observation that ensures the efficiency of this algorithm is that we do not need to check every point in the strip against every other point. We only need to check each point against a constant number of subsequent points in the sorted list. Specifically, checking the next $11$ points is mathematically sufficient.
To understand why we only need to check at most $11$ neighboring points, we can conceptually divide the $2\delta$-wide strip into a grid of smaller boxes, each with dimensions $\frac{1}{2}\delta \times \frac{1}{2}\delta$.
Claim: No two points on the same side of the dividing line can lie in the same $\frac{1}{2}\delta \times \frac{1}{2}\delta$ box.
Proof. Suppose two points exist in the same $\frac{1}{2}\delta \times \frac{1}{2}\delta$ box. The maximum possible distance between them would be the length of the box's diagonal. Using the Pythagorean theorem, this distance evaluates to:
$$ \sqrt{\left(\frac{\delta}{2}\right)^2 + \left(\frac{\delta}{2}\right)^2} = \sqrt{\frac{2\delta^2}{4}} = \frac{\delta}{\sqrt{2}} \approx 0.7\delta $$Because $0.7\delta < \delta$, this contradicts our already established baseline that the minimum distance between any pair of points on the same side is at least $\delta$. Therefore, each $\frac{1}{2}\delta \times \frac{1}{2}\delta$ box can contain at most one point.
Let $s_i$ be the point with the $i$-th smallest $y$-coordinate in the strip. For any subsequent point $s_j$ to be within a distance of $\delta$ from $s_i$, it must lie within the neighboring grid boxes. Because there are only $11$ boxes that fall within a vertical distance of $+\delta$ from $s_i$, we conclude that if $|i - j| > 11$, the distance between $s_i$ and $s_j$ is strictly greater than $\delta$. $\blacksquare$
In the figure above, notice the selected point $s_i$; the box in which it resides is coloured blue. From there, the 11 boxes following it are all that we need to check to find the closest point to $s_i$; those boxes are highlighted in yellow.
This geometric constraint directly translates to the efficiency of our algorithm. Once the points within the $2\delta$-strip are sorted by their $y$-coordinates, any candidate point $s_j$ that could potentially form a closer pair with $s_i$ (i.e., having a distance strictly less than $\delta$) must be one of the immediate $11$ elements following $s_i$ in the sorted array.
Consequently, we do not need to compare $s_i$ against every other point in the strip. The inner loop of our cross-boundary check only needs to execute a maximum of $11$ times for each point. This realization guarantees that checking the strip takes strictly linear time, reducing the time complexity of the combine step from a brute-force $O(n^2)$ down to $O(n)$. Examine the pseudocode below.
CLOSEST-PAIR(P)
1: if P.length ≤ 3
2: return brute-force minimum distance
3:
4: Compute separation line L dividing points into left and right halves
5: δ1 = CLOSEST-PAIR(left half)
6: δ2 = CLOSEST-PAIR(right half)
7: δ = min(δ1, δ2)
8:
9: Delete all points further than δ from separation line L
10: Sort remaining points p[1...m] in the strip by y-coordinate
11:
12: for i = 1 to m
13: for k = 1 to 11
14: if i + k ≤ m
15: δ = min(δ, distance(p[i], p[i+k]))
16:
17: return δ
Proof of the Number of Distance Calculations. Let $D(n)$ represent the total number of pairwise distance calculations performed by the algorithm. Because we branch into two subproblems of half the size and perform at most $11n$ distance calculations in the combine step, our recurrence relation is defined as:
$$ D(n) \le 2D\left(\frac{n}{2}\right) + 11n $$For the base case, $D(1) = 0$. Solving this recurrence relation yields $D(n) = O(n \log n)$. $\blacksquare$
Proof of the Running Time. Let $T(n)$ be the total running time of the algorithm. If we naively sort the points in the strip by their $y$-coordinate from scratch during every recursive call, the sorting step takes $O(n \log n)$ time. The overall time complexity recurrence becomes:
$$ T(n) \le 2T\left(\frac{n}{2}\right) + O(n \log n) $$By solving this recurrence, the overall running time evaluates to $T(n) = O(n \log^2 n)$. $\blacksquare$
To achieve a strict $O(n \log n)$ running time, we must avoid sorting from scratch within the recursive branches. Instead, we can pre-sort the initial array of points by their $y$-coordinates at the very top level.
Each recursive call will return both the minimum distance $\delta$ and a list of its points sorted by $y$. During the combine step, we can construct the sorted list for the strip by simply merging the two pre-sorted lists from the left and right halves. Because merging takes strictly linear time, our improved recurrence relation becomes:
$$T(n) \le 2T\left(\frac{n}{2}\right) + O(n)$$Applying the Master Theorem, this optimization successfully bounds the total running time to $T(n) = O(n \log n)$.
Before diving into multiplication, it is important to note that adding two $n$-bit integers takes $O(n)$ bit operations. However, the standard "grade school" method for multiplying two $n$-bit integers requires $\Theta(n^2)$ bit operations. We can attempt to improve this bound using a divide-and-conquer approach.
To multiply two $n$-bit integers, $x$ and $y$, we can split each of them into two $\frac{n}{2}$-bit integers:
$$x = 2^{n/2} \cdot x_1 + x_0, \qquad y = 2^{n/2} \cdot y_1 + y_0$$where $x_1, y_1$ are the high-order halves and $x_0, y_0$ are the low-order halves.
When we multiply them, the equation expands as follows:
$$xy = (2^{n/2} \cdot x_1 + x_0)(2^{n/2} \cdot y_1 + y_0)$$ $$xy = 2^n \cdot x_1 y_1 + 2^{n/2}(x_1 y_0 + x_0 y_1) + x_0 y_0$$This approach requires $4$ recursive multiplications of half-sized integers, plus $\Theta(n)$ work for the addition and shifting. The recurrence relation is $T(n) = 4T(n/2) + \Theta(n)$. Unfortunately, solving this yields $T(n) = \Theta(n^2)$, which does not improve upon the grade school method. But there have been advancements made!
Introduced by Karatsuba and Ofman in 1962, the Karatsuba Method cleverly uses $3$ multiplications instead of $4$ to compute the same result. We define two new variables:
$$\alpha = x_1 + x_0, \qquad \beta = y_1 + y_0$$
If we multiply these together, we get $\alpha\beta = x_1 y_1 + x_1 y_0 + x_0 y_1 + x_0 y_0$. By algebraic manipulation, we can isolate the middle term needed for our original multiplication:
$$x_1 y_0 + x_0 y_1 = \alpha\beta - x_1 y_1 - x_0 y_0$$Because $x_1 y_1$ and $x_0 y_0$ are already being computed for the high and low parts of the final answer, computing $(x_1 + x_0)(y_1 + y_0)$ requires only one additional multiplication. This reduces our recurrence relation to $T(n) \le 3T(n/2) + O(n)$.
Solving this recurrence yields a significantly faster time complexity of $O(n^{\log_2 3})$, which is approximately $O(n^{1.585})$. For context, the Schönhage–Strassen algorithm uses the Fast Fourier Transform to achieve $\Theta(n \log n \log\log n)$, and even faster methods are now known.
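A compact Python sketch of the Karatsuba recursion on built-in integers (function name mine; the halves are extracted with bit shifts, so `half` plays the role of $n/2$):

```python
def karatsuba(x, y):
    """Multiply non-negative integers with 3 half-size products instead of 4."""
    if x < 10 or y < 10:                         # base case: small operand
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    x1, x0 = x >> half, x & ((1 << half) - 1)    # x = 2^half * x1 + x0
    y1, y0 = y >> half, y & ((1 << half) - 1)    # y = 2^half * y1 + y0
    hi = karatsuba(x1, y1)
    lo = karatsuba(x0, y0)
    # alpha*beta - hi - lo = x1*y0 + x0*y1, the middle term
    mid = karatsuba(x1 + x0, y1 + y0) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo
```

For example, `karatsuba(12, 34)` splits at 3 bits and recombines to 408.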
Given two $n \times n$ matrices, $A$ and $B$, the straightforward way to compute their product $C = A \times B$ computes the dot product of the $i$-th row of $A$ and the $j$-th column of $B$ for every element $C_{ij}$.
SIMPLE-MATRIX-MULTIPLY(A, B, n)
1: Let C be a new n × n matrix
2: for i = 1 to n
3: for j = 1 to n
4: C[i,j] = 0
5: for k = 1 to n
6: C[i,j] = C[i,j] + (A[i,k] * B[k,j])
7:
8: return C
We can mathematically justify the exact number of operations by looking at the loops: the two outer loops over $i$ and $j$ execute $n \times n = n^2$ times in total, once per entry of $C$, and for each entry the innermost loop over $k$ performs $n$ multiplications of the form A[i,k] * B[k,j]. Therefore, the total number of multiplications is $n^2 \times n = n^3$.

To see this in practice, let $n = 2$. Consider the multiplication of two $2 \times 2$ matrices, $A$ and $B$:
$$ A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \quad B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} $$The resulting matrix $C$ is computed as:
$$ C = \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{bmatrix} $$Let's count the operations for this $n = 2$ example: each of the $4$ entries of $C$ requires $2$ multiplications and $1$ addition, for a total of $8 = 2^3$ multiplications and $4$ additions.
Because the number of operations scales cubically, analyzing and improving this $O(n^3)$ bound paved the way for more advanced divide-and-conquer strategies like Strassen's Method.
Instead, we can divide the $n \times n$ matrices $A$ and $B$ into four smaller block matrices of size $\frac{n}{2} \times \frac{n}{2}$. Computing the final matrix involves calculating equations like $C_{11} = A_{11}B_{11} + A_{12}B_{21}$.
This naive block multiplication requires $8$ recursive calls to multiply the $\frac{n}{2} \times \frac{n}{2}$ submatrices, plus $O(n^2)$ time to add them. The recurrence relation is $T(n) = 8T(n/2) + n^2$. Using the Master Theorem, this results in $\Theta(n^3)$, giving us no asymptotic improvement.
Similar to Karatsuba's trick for integers, Strassen's algorithm multiplies $2 \times 2$ block matrices using $7$ multiplications instead of $8$, at the cost of requiring more additions. It computes the following $7$ intermediate products:
$$M_1 = (A_{11} + A_{22})(B_{11} + B_{22}) \qquad M_2 = (A_{21} + A_{22})B_{11}$$ $$M_3 = A_{11}(B_{12} - B_{22}) \qquad M_4 = A_{22}(B_{21} - B_{11})$$ $$M_5 = (A_{11} + A_{12})B_{22} \qquad M_6 = (A_{21} - A_{11})(B_{11} + B_{12})$$ $$M_7 = (A_{12} - A_{22})(B_{21} + B_{22})$$
These products are then strategically added and subtracted to form the four quadrants of the resulting matrix $C$:
$$C_{11} = M_1 + M_4 - M_5 + M_7, \qquad C_{12} = M_3 + M_5$$ $$C_{21} = M_2 + M_4, \qquad C_{22} = M_1 - M_2 + M_3 + M_6$$
Because we only perform $7$ recursive multiplications, the new recurrence relation becomes $T(n) = 7T(n/2) + c n^2$. Evaluating this gives a time complexity of $\Theta(n^{\log_2 7})$, which is approximately $O(n^{2.81})$. Note that the fastest known matrix multiplication algorithms use $O(n^{2.376})$ time.
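Here is a Python sketch of Strassen's recursion (my names; it assumes $n$ is a power of two and represents matrices as plain lists of lists):

```python
def strassen(A, B):
    """Multiply n x n matrices with 7 recursive products (n a power of 2)."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    k = n // 2
    def add(X, Y): return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]
    def sub(X, Y): return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]
    def quad(M):  # split M into four k x k blocks
        return ([r[:k] for r in M[:k]], [r[k:] for r in M[:k]],
                [r[:k] for r in M[k:]], [r[k:] for r in M[k:]])
    A11, A12, A21, A22 = quad(A)
    B11, B12, B21, B22 = quad(B)
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]   # stitch quadrants back
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

For the $2 \times 2$ case, each $M_i$ reduces to a scalar product, and the quadrant formulas recover the familiar result.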
A common problem in collaborative filtering—such as a music site attempting to match your song preferences with others—is determining the similarity between two ranked lists. Suppose you rank $n$ songs in a standard baseline order: $1, 2, \dots, n$. Another user ranks the same songs in the order $a_1, a_2, \dots, a_n$. The site consults its database to find people with similar tastes by utilizing a specific similarity metric: the number of inversions between the two rankings.
Mathematically, songs $i$ and $j$ are considered inverted if $i < j$, but their corresponding ranks are swapped such that $a_i > a_j$.
A brute force algorithm would check all $\Theta(n^2)$ pairs of $i$ and $j$ to count the inversions. However, we can achieve a much more efficient runtime using a divide-and-conquer strategy:
To count the split inversions efficiently during the Combine step, we can borrow from the $\text{MERGE-SORT}$ algorithm. We assume each individual half is already sorted. By merging the two sorted halves into a single sorted whole, we maintain the sorted invariant. During this merge process, every time an element from the right half is placed before an element from the left half, we mathematically know it is inverted with all remaining elements in the left half.
To implement this, we need to consider two invariants:
- Pre-condition: before the MERGE-AND-COUNT step, arrays $A$ and $B$ must already be sorted.
- Post-condition: after the SORT-AND-COUNT step executes, the resulting list $L$ is perfectly sorted.

Here is the formal pseudocode for the divide-and-conquer approach to counting inversions. This algorithm cleverly piggybacks on the Merge Sort algorithm, simultaneously sorting the list and counting the number of inversions to achieve an efficient $O(n \log n)$ time complexity.
SORT-AND-COUNT(L)
1: if L has one element
2: return (0, L)
3:
4: Divide list L into two halves A and B
5: (rA, A) = SORT-AND-COUNT(A)
6: (rB, B) = SORT-AND-COUNT(B)
7: (r, L) = MERGE-AND-COUNT(A, B)
8:
9: return (rA + rB + r, L)
The core logic of the inversion counting resides in the MERGE-AND-COUNT subroutine. By relying on the pre-condition that arrays $A$ and $B$ are already sorted, we can mathematically guarantee that if an element in $B$ is strictly smaller than an element in $A$, it must also be smaller than all remaining elements in $A$.
MERGE-AND-COUNT(A, B)
1: Maintain current pointers i and j, initialized to 1
2: inversions = 0
3: Let L be an empty list
4:
5: while i ≤ A.length and j ≤ B.length
6: if A[i] ≤ B[j]
7: append A[i] to L
8: i = i + 1
9: else
10: append B[j] to L
11: inversions = inversions + (A.length - i + 1)
12: j = j + 1
13:
14: append any remaining elements of A to L
15: append any remaining elements of B to L
16:
17: return (inversions, L)
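The two procedures translate directly to Python (0-indexed; the snake_case names are mine):

```python
def sort_and_count(L):
    """Return (inversion count, sorted copy of L), piggybacking merge sort."""
    if len(L) <= 1:
        return 0, list(L)
    mid = len(L) // 2
    ra, A = sort_and_count(L[:mid])
    rb, B = sort_and_count(L[mid:])
    r, merged = merge_and_count(A, B)
    return ra + rb + r, merged

def merge_and_count(A, B):
    """Merge two sorted lists, counting split inversions along the way."""
    i = j = inversions = 0
    merged = []
    while i < len(A) and j < len(B):
        if A[i] <= B[j]:
            merged.append(A[i])
            i += 1
        else:                           # B[j] jumps ahead of all of A[i:]
            merged.append(B[j])
            inversions += len(A) - i
            j += 1
    merged.extend(A[i:])                # append leftovers of either half
    merged.extend(B[j:])
    return inversions, merged
```

For example, $[1, 3, 5, 2, 4, 6]$ has exactly three inversions: $(3,2)$, $(5,2)$, and $(5,4)$.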
The Fast Fourier Transform is one of the most important algorithms of the 20th century. Its applications are remarkably broad, spanning optics, acoustics, quantum physics, telecommunications, control systems, signal processing, speech recognition, data compression, and image processing. It is the foundational math behind everyday technologies like DVDs, JPEGs, MP3s, MRI and CAT scans, and numerical solutions to Poisson's equation.
Polynomials can be represented in multiple ways, each offering different computational tradeoffs.
A polynomial is stored as an array of its coefficients: $a_0 + a_1 x + \dots + a_{n-1} x^{n-1}$.
According to the Fundamental Theorem of Algebra, a degree $n$ polynomial with complex coefficients has exactly $n$ complex roots. A critical corollary to this theorem states that a degree $n$ polynomial $A(x)$ is uniquely specified by its evaluation at $n+1$ distinct values of $x$. Thus, we can represent a polynomial as a set of point-value pairs.
Because coefficient representation offers fast evaluation and point-value representation offers fast multiplication, our goal is to achieve both by efficiently converting between the two representations.
Using brute force for these conversions is slow: evaluating the polynomial at $n$ distinct points one at a time (even with Horner's rule) takes $O(n^2)$ time, and interpolating $n$ point-value pairs back into coefficients (for example, via Lagrange's formula) also takes $O(n^2)$ time.
To convert from coefficient to point-value efficiently, we evaluate a polynomial $a_0 + a_1 x + \dots + a_{n-1} x^{n-1}$ at $n$ distinct points using a divide-and-conquer approach. We break the polynomial up into its even and odd powers:
$$ A(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6 + a_7 x^7 $$We can separate this into two smaller polynomials, one built from the even-indexed coefficients and one from the odd-indexed coefficients: $$ A_{even}(x) = a_0 + a_2 x + a_4 x^2 + a_6 x^3, \qquad A_{odd}(x) = a_1 + a_3 x + a_5 x^2 + a_7 x^3 $$
This allows us to rewrite the original polynomial fundamentally as:
$$ A(x) = A_{even}(x^2) + x A_{odd}(x^2) $$The core intuition of the FFT is to strategically choose pairs of evaluation points to be $\pm x$ (specifically, utilizing the complex roots of unity). This symmetry allows us to compute two evaluations for the price of one:
$$ A(1) = A_{even}(1) + 1 \cdot A_{odd}(1) $$ $$ A(-1) = A_{even}(1) - 1 \cdot A_{odd}(1) $$Notice that $A_{even}(1)$ and $A_{odd}(1)$ are exactly the same in both equations. By recursively applying this principle, the FFT brings the conversion time down from $O(n^2)$ to $O(n \log n)$.
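As a rough illustration of this recursion, the following Python sketch (my own, assuming the input length is a power of two, and with a sign convention chosen purely for illustration) evaluates a coefficient vector at the $n$-th roots of unity in $O(n \log n)$:

```python
import cmath

def fft(coeffs):
    """Evaluate the polynomial with these coefficients at the n-th roots
    of unity w^k = e^{2*pi*i*k/n}; n = len(coeffs) must be a power of 2."""
    n = len(coeffs)
    if n == 1:
        return coeffs[:]
    even = fft(coeffs[0::2])   # A_even evaluated at the (n/2)-th roots
    odd = fft(coeffs[1::2])    # A_odd evaluated at the (n/2)-th roots
    result = [0] * n
    for k in range(n // 2):
        w = cmath.exp(2j * cmath.pi * k / n)
        # Two evaluations for the price of one, using A(x) and A(-x):
        result[k] = even[k] + w * odd[k]
        result[k + n // 2] = even[k] - w * odd[k]
    return result
```

For $A(x) = 1 + 2x + 3x^2 + 4x^3$, the output contains $A(1) = 10$ at index 0 and $A(-1) = -2$ at index 2, matching a direct evaluation.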
Before diving into dynamic programming, let's briefly recall the core principles of the Divide and Conquer strategy.
True or false? Every algorithm that contains a divide step and a conquer step is a divide-and-conquer algorithm. It turns out the answer is false: a dynamic programming algorithm also contains a divide step and a conquer step, yet it is not considered a strict divide-and-conquer algorithm, because its subproblems overlap rather than remaining disjoint.
Dynamic programming is characterized by two hallmark properties:
This relies heavily on the concept of Self-reducibility. In the context of our algorithm design, this means a recursive structure reduces the overall problem to some smaller subproblems. However, it is important to note that this self-reduction may not always strictly adhere to a simple tree structure.
There are several algorithm paradigms that utilize self-reducibility:
Given a chain $\{A_1, A_2, ..., A_n\}$ of $n$ matrices, where for $i = 1, 2, ..., n$, matrix $A_i$ has dimension $p_{i-1} \times p_i$, fully parenthesize the product $A_1 A_2 \dots A_n$ in a way that minimizes the number of scalar multiplications.
A product is said to be fully parenthesized if it is either a single matrix or the product of two fully parenthesized products. For example:
As we saw in the last section regarding simple matrix multiplication, consider a $p \times q$ matrix $A = (a_{ij})$ and a $q \times r$ matrix $B = (b_{jk})$. Then their product is defined as:
$$ AB = \left(\sum_{j=1}^{q} a_{ij} b_{jk}\right)_{p \times r} $$Thus, the total number of scalar multiplications required to compute this matrix product is $pqr$. This is crucial because different fully parenthesized products may yield vastly different numbers of scalar multiplications.
For example, consider matrices $A_1, A_2, A_3$ with dimensions defined by $p_0, p_1, p_2, p_3$:
Suppose the optimal fully parenthesized product is the product of two fully parenthesized products: $A_1 \dots A_k$ and $A_{k+1} \dots A_n$. For the overall product to be optimal, these individual components must themselves be optimal fully parenthesized products of $A_1 \dots A_k$ and $A_{k+1} \dots A_n$, respectively.
Let $m[i,j]$ be the minimum number of scalar multiplications required for computing the product $A_i \dots A_j$. We can define this recursively:
$$ m[i, j] = \begin{cases} 0 & \text{if } i = j \\ \min_{i \le k < j} (m[i,k] + m[k+1,j] + p_{i-1} p_k p_j) & \text{if } i < j \end{cases} $$Here is a basic recursive function to compute $m[i, j]$:
function m[i,j]
1: if i = j
2: then return 0;
3: else begin
4: u ← ∞;
5: for k ← i to j - 1 do
6: u ← min(u, m[i,k] + m[k+1,j] + p_{i-1} · p_k · p_j);
7: return u;
8: end
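The overlapping-subproblem issue can be sidestepped even in this top-down style by caching results. The sketch below is a hypothetical Python rendering with memoization via `functools.lru_cache`; the function name `matrix_chain_cost` is my own.

```python
from functools import lru_cache

def matrix_chain_cost(p):
    """Top-down m[i, j] with memoization; A_i has dimensions p[i-1] x p[i]."""
    n = len(p) - 1

    @lru_cache(maxsize=None)
    def m(i, j):
        if i == j:
            return 0
        # Try every split point k; cached subresults are computed once.
        return min(m(i, k) + m(k + 1, j) + p[i - 1] * p[k] * p[j]
                   for k in range(i, j))

    return m(1, n)
```

Memoization keeps the $O(n^2)$ distinct subproblems from being recomputed exponentially many times; the CLRS chain with dimension vector $(30, 35, 15, 5, 10, 20, 25)$ yields a minimum of $15125$ scalar multiplications.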
However, running this naively results in many overlapping subproblems. A better approach is to use a bottom-up tabular DP algorithm to compute the optimal value:
MATRIX-CHAIN-ORDER(p)
1: n ← length(p) - 1;
2: for i ← 1 to n do m[i,i] = 0;
3: for l ← 2 to n do
4: for i ← 1 to n - l + 1 do
5: j ← i + l - 1
6: m[i,j] ← ∞
7: for k ← i to j - 1 do
8: q ← m[i,k] + m[k+1,j] + p_{i-1} · p_k · p_j
9: if q < m[i,j] then
10: m[i,j] ← q
11: s[i,j] ← k
12: return m and s
To actually determine the parentheses placement, you follow the values stored in the $s[i,j]$ table during the computation.
Here is an example of the computed $m$ table for a chain of 6 matrices with dimensions: $A_1 (30 \times 35)$, $A_2 (35 \times 15)$, $A_3 (15 \times 5)$, $A_4 (5 \times 10)$, $A_5 (10 \times 20)$, and $A_6 (20 \times 25)$. The table shows the optimal scalar multiplication cost for every subchain $A_i \dots A_j$.
When tracking the $s$ values along with the $m$ table, we can fully construct the optimal parenthesization. By tracing the splits, the optimal solution for this specific chain resolves to: $((A_1(A_2 A_3))((A_4 A_5)A_6))$.
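As an illustrative sketch (the names `matrix_chain_order` and `print_optimal_parens` are my own, assuming 1-indexed matrices as in the pseudocode), the table computation and the trace of the $s$ table can be rendered in Python as:

```python
def matrix_chain_order(p):
    """Bottom-up MATRIX-CHAIN-ORDER: returns cost table m and split table s."""
    n = len(p) - 1
    m = [[0] * (n + 1) for _ in range(n + 1)]
    s = [[0] * (n + 1) for _ in range(n + 1)]
    for l in range(2, n + 1):              # l is the chain length
        for i in range(1, n - l + 2):
            j = i + l - 1
            m[i][j] = float("inf")
            for k in range(i, j):
                q = m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                if q < m[i][j]:
                    m[i][j] = q
                    s[i][j] = k            # remember where the best split was
    return m, s

def print_optimal_parens(s, i, j):
    """Rebuild the parenthesization by following the stored split points."""
    if i == j:
        return f"A{i}"
    k = s[i][j]
    return f"({print_optimal_parens(s, i, k)}{print_optimal_parens(s, k + 1, j)})"
```

On the six-matrix chain above, tracing $s$ reproduces the cost $15125$ and the parenthesization $((A_1(A_2 A_3))((A_4 A_5)A_6))$.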
Let us analyze the time complexity of the MATRIX-CHAIN-ORDER(p) function:
Because there are three nested loops, each iterating proportionally to $n$, the overall time complexity is $O(n^3)$.
If we had used the naive recursive function, how many recursive calls would be made, and how many $m[i,j]$ values would be computed? There are $\binom{n}{2} + n$, or $O(n^2)$, distinct $m[i,j]$ subproblems to compute. However, the naive recursion would recompute them exponentially many times.
In general, the running time of a DP algorithm is evaluated by:
The analysis method for dynamic programming can also be applied to divide-and-conquer algorithms. Which is good for us (and the author of these notes)!
However, not every dynamic programming algorithm can be analyzed with the formula: $$\text{Running-time} = (\text{table size}) \times (\text{computation time of recursive formula})$$ A counterexample arises in the study of the shortest path problem (covered later).
In Matrix-Chain Multiplication, is it true that for $i < i'$, $s(i, i+l) \le s(i', i'+l)$? If this monotone property holds, can we improve the running time with it?
No, it is not true in general that $s(i, i+l) \le s(i', i'+l)$ for $i < i'$ in the Matrix-Chain Multiplication problem. The optimal split point $s(i, j)$ does not necessarily shift monotonically to the right as the window of matrices slides to the right, primarily because the scalar multiplication cost function (which depends on the matrix dimensions $p_{i-1}p_kp_j$) does not satisfy the quadrangle inequality required to guarantee such monotonic behavior.
If this monotone property did hold (specifically if it satisfied the standard form $s(i, j-1) \le s(i, j) \le s(i+1, j)$), we could significantly improve the algorithm's efficiency using Knuth's Optimization. By restricting the search for the optimal split point $k$ to this narrow, dynamically updated range rather than evaluating all possible points between $i$ and $j$, the amortized work done in the innermost loop would telescope, reducing the overall time complexity of the dynamic programming solution from $O(n^3)$ to $O(n^2)$.
1: MATRIX_CHAIN_ORDER_HYPOTHETICAL(p)
2: n = p.length - 1
3: let m[1..n, 1..n] and s[1..n, 1..n] be new tables
4:
5: // Base cases: single matrices have zero multiplication cost
6: for i = 1 to n
7: m[i, i] = 0
8: s[i, i] = i
9:
10: // l is the chain length
11: for l = 2 to n
12: for i = 1 to n - l + 1
13: j = i + l - 1
14: m[i, j] = ∞
15:
16: // Knuth's Optimization limits the search space for k
17: for k = s[i, j-1] to s[i+1, j]
18: // k must be strictly less than j to form a valid split
19: if k < j
20: q = m[i, k] + m[k+1, j] + p[i-1] * p[k] * p[j]
21: if q < m[i, j]
22: m[i, j] = q
23: s[i, j] = k
24:
25: return m, s
Given two sequences $X$ and $Y$, find a longest common subsequence $Z$ of $X$ and $Y$.
For example, given:
one possible $Z = 01010$.
Let $c[i,j]$ be the length of the longest common subsequence of prefixes $x_1, x_2, \dots, x_i$ and $y_1, y_2, \dots, y_j$. Recursively,
$$ c[i, j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ c[i-1, j-1] + 1 & \text{if } i, j > 0 \text{ and } x_i = y_j \\ \max(c[i, j-1], c[i-1, j]) & \text{if } i, j > 0 \text{ and } x_i \ne y_j \end{cases} $$
LCS(X, Y)
1: m = X.length
2: n = Y.length
3: Let L[0..m, 0..n] be a new 2D array filled with 0
4:
5: for i = 1 to m
6: for j = 1 to n
7: if X[i] == Y[j]
8: L[i][j] = L[i-1][j-1] + 1
9: else
10: L[i][j] = max(L[i-1][j], L[i][j-1])
11:
12: return RECONSTRUCT-LCS(L, X, Y, m, n)
RECONSTRUCT-LCS(L, X, Y, i, j)
1: if i == 0 or j == 0
2: return ""
3:
4: if X[i] == Y[j]
5: return RECONSTRUCT-LCS(L, X, Y, i-1, j-1) + X[i]
6: else
7: if L[i-1][j] > L[i][j-1]
8: return RECONSTRUCT-LCS(L, X, Y, i-1, j)
9: else
10: return RECONSTRUCT-LCS(L, X, Y, i, j-1)
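A compact Python sketch of both phases, assuming 0-indexed strings and an iterative (rather than recursive) walk back through the table, might look like this; the function name `lcs` is my own.

```python
def lcs(X, Y):
    """Build the LCS length table, then reconstruct one longest
    common subsequence of X and Y."""
    m, n = len(X), len(Y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:            # strings are 0-indexed
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])

    # Walk back from (m, n), mirroring RECONSTRUCT-LCS.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            out.append(X[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] > L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

For example, `lcs("ABCBDAB", "BDCABA")` returns a common subsequence of length 4, in line with the classic CLRS example.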
The total time complexity of the Longest Common Subsequence algorithm is the sum of the time required to build the dynamic programming table and the time needed to reconstruct the actual sequence, which is $O(mn)$.
Proof. The LCS(X, Y) function allocates a 2D array $L$ of dimensions $(m+1) \times (n+1)$. Initializing this table takes $O(mn)$ time. Following this, the algorithm utilizes two nested loops: the outer loop runs exactly $m$ times, and the inner loop runs exactly $n$ times. Within the innermost loop, the algorithm performs basic primitive operations—such as array lookups, character equality checks, additions, and calculating the maximum of two integers. Each of these operations executes in constant $O(1)$ time.
Mathematically, the total work done in the nested loops is evaluated as:
$$ \sum_{i=1}^{m} \sum_{j=1}^{n} O(1) = m \times n \times O(1) = O(mn) $$The RECONSTRUCT-LCS function begins at the bottom-right corner of the table at indices $(m, n)$. During each recursive call, the algorithm evaluates the matrix and must take one of three paths:
In every possible scenario, at least one of the indices ($i$ or $j$) is decremented by $1$. Because neither index can ever decrease below $0$ (triggering the base case), the maximum number of recursive steps the algorithm can possibly take is strictly bounded by the total number of decrements available, which is $m + n$. Since each individual recursive step takes $O(1)$ time to evaluate its conditional statements, the time complexity of the reconstruction phase is strictly $O(m + n)$.
Combining both phases, the overall time complexity evaluates to:
$$ T(m,n) = O(mn) + O(m + n) = O(mn) \quad \blacksquare$$The space complexity of the algorithm is determined by the auxiliary memory required to store the dynamic programming states and the recursion call stack. This is also $O(mn)$.
Proof. The algorithm explicitly allocates a 2D array $L$ of dimensions $(m+1) \times (n+1)$ to store the lengths of the longest common subsequences of all possible prefix combinations. Storing this grid requires exactly $(m+1)(n+1)$ memory cells, which imposes a spatial footprint of $O(mn)$.
The RECONSTRUCT-LCS function introduces memory overhead due to the system call stack. As proven in the time complexity analysis, the maximum depth of the recursion tree is $m + n$. Therefore, the recursion stack requires an additional auxiliary space of $O(m + n)$.
Adding the memory requirements together yields $O(mn) + O(m + n)$. Because $O(mn)$ asymptotically dominates $O(m + n)$ for large input strings, the overall worst-case space complexity mathematically simplifies to strictly $O(mn). \blacksquare$
Given two sequences $X$ and $Y$, find a longest consecutive common subsequence $Z$ of $X$ and $Y$. Consider the same sequences:
The longest consecutive common subsequence is $Z = 01$.
Let $s[i,j]$ be the length of the longest consecutive common subsequence of $x_1, \dots, x_i$ and $y_1, \dots, y_j$. Let $t[i,j]$ denote the length of the longest consecutive common tail between those sequences. The recursive formula for that is:
$$ t[i, j] = \begin{cases} 0 & \text{if } x_i \ne y_j \text{ or } i = 0 \text{ or } j = 0 \\ t[i-1, j-1] + 1 & \text{if } x_i = y_j \end{cases} $$ $$ s[i, j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ \max(s[i, j-1], s[i-1, j], t[i, j]) & \text{if } i, j > 0 \text{ and } x_i = y_j \\ \max(s[i, j-1], s[i-1, j]) & \text{if } i, j > 0 \text{ and } x_i \ne y_j \end{cases} $$For example, if you are comparing $X = 10110110$ and $Y = 00100110$, $t[i,j]$ would track the matching length at the end.
To compute the Longest Consecutive Common Subsequence efficiently, we can use a bottom-up dynamic programming approach. We will maintain two tables: $T$ to track the lengths of the consecutive matching tails, and $S$ to track the maximum consecutive matches found so far up to indices $i$ and $j$.
LCCS(X, Y)
1: m = X.length
2: n = Y.length
3: Let T[0..m, 0..n] be a new 2D array filled with 0
4: Let S[0..m, 0..n] be a new 2D array filled with 0
5:
6: for i = 1 to m
7: for j = 1 to n
8: if X[i] == Y[j]
9: T[i][j] = T[i-1][j-1] + 1
10: S[i][j] = max(S[i][j-1], S[i-1][j], T[i][j])
11: else
12: T[i][j] = 0
13: S[i][j] = max(S[i][j-1], S[i-1][j])
14:
15: return S[m][n]
In many practical implementations, the $S$ matrix is omitted entirely by simply keeping track of the global maximum value observed in $T$ during the loops. However, we explicitly construct $S$ here to mirror the mathematical recurrence relation defined above.
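For concreteness, here is a hypothetical Python rendering of the LCCS pseudocode above (the name `lccs` is mine), keeping both the $T$ and $S$ tables to mirror the recurrence:

```python
def lccs(X, Y):
    """Length of the longest consecutive (contiguous) common subsequence."""
    m, n = len(X), len(Y)
    T = [[0] * (n + 1) for _ in range(m + 1)]  # matching-tail lengths
    S = [[0] * (n + 1) for _ in range(m + 1)]  # best answer seen so far
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:           # strings are 0-indexed
                T[i][j] = T[i - 1][j - 1] + 1
                S[i][j] = max(S[i][j - 1], S[i - 1][j], T[i][j])
            else:
                # tail broken: T resets to 0 (already 0 by initialization)
                S[i][j] = max(S[i][j - 1], S[i - 1][j])
    return S[m][n]
```

On $X = 10110110$ and $Y = 00100110$ this returns 4, corresponding to the common consecutive run $0110$.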
Proof of Time Complexity. Let $m$ be the length of sequence $X$ and $n$ be the length of sequence $Y$. The algorithm begins by initializing two 2D arrays, $T$ and $S$, of dimensions $(m+1) \times (n+1)$. This initialization takes $O(mn)$ operations.
The core of the algorithm consists of two nested loops:
Inside the innermost loop, the algorithm performs a sequence of constant-time operations: an array index lookup, a character comparison (X[i] == Y[j]), basic arithmetic (addition), and evaluations of the max() function. Since all these operations take $O(1)$ time, the total work done inside the nested loops is strictly proportional to the number of iterations.
Mathematically, the time complexity evaluates to:
$$ \sum_{i=1}^{m} \sum_{j=1}^{n} O(1) = m \times n \times O(1) = O(mn) $$Thus, the overall time complexity of the algorithm is strictly bounded by $O(mn). \blacksquare$
The space complexity is dominated by the auxiliary memory allocated to store the dynamic programming tables. It is $O(mn)$.
Proof. The algorithm allocates two separate 2D arrays, $T$ and $S$, each with dimensions $(m+1) \times (n+1)$. The number of memory cells allocated is:
$$ \text{Total Space} = 2 \times (m+1) \times (n+1) $$Expanding this polynomial gives $2mn + 2m + 2n + 2$. When evaluating asymptotic space complexity, we drop the constants and lower-order terms. The dominant term is $mn$, yielding a worst-case spatial footprint of $O(mn). \quad \blacksquare$
Nota Bene. Because the computation of the current row $i$ in both $T$ and $S$ depends only on the values of the current row $i$ and the immediately preceding row $i-1$, we do not strictly need to keep the entire $(m+1) \times (n+1)$ matrices in memory. By storing only the "current" and "previous" rows, the space complexity can be optimized to $O(\min(m, n))$, though the unoptimized tabular version presented above clearly requires $O(mn)$ space.
Let $t[i,j]$ be the length of the longest consecutive common tail of the following two sequence prefixes:
For example, consider the sequences $X = 10110110$ and $Y = 00100110$. To borrow from our earlier two examples, the longest consecutive common tail $Z$ is $0110$.
The recursive relationship for the consecutive common tail strictly depends on whether the terminating elements of the current prefixes match:
$$ t[i, j] = \begin{cases} 0 & \text{if } x_i \ne y_j \text{ or } i = 0 \text{ or } j = 0 \\ t[i-1, j-1] + 1 & \text{if } x_i = y_j \end{cases} $$To compute the table of longest common tails efficiently, we can utilize a bottom-up dynamic programming approach. The code below computes the entire $T$ table and also tracks the maximum tail length found, which corresponds to the longest common substring between the two full sequences.
LONGEST-COMMON-TAIL(X, Y)
1: m = X.length
2: n = Y.length
3: Let T[0..m, 0..n] be a new 2D array filled with 0
4: max_tail_length = 0
5:
6: for i = 1 to m
7: for j = 1 to n
8: if X[i] == Y[j]
9: T[i][j] = T[i-1][j-1] + 1
10: max_tail_length = max(max_tail_length, T[i][j])
11: else
12: T[i][j] = 0
13:
14: return max_tail_length
The time complexity of LONGEST-COMMON-TAIL is $O(mn)$.
Proof. Let $m$ be the length of sequence $X$ and $n$ be the length of sequence $Y$. The algorithm first allocates and initializes a 2D array $T$ of size $(m+1) \times (n+1)$, which takes $O(mn)$ time.
The core computational work occurs inside two nested loops:
Inside the inner loop, the algorithm executes primitive operations: array indexing, checking character equality, addition, and assigning the maximum value. Each of these operations executes in constant $O(1)$ time. Because the recurrence $t[i, j]$ only ever looks back at the immediately preceding diagonal element $t[i-1, j-1]$, there is no need to iterate over previous states.
Mathematically, the total time evaluating the subproblems evaluates to:
$$ \sum_{i=1}^{m} \sum_{j=1}^{n} O(1) = m \times n \times O(1) = O(mn) $$Combining the initialization and the nested loops, the overall time complexity is strictly bounded by $O(mn). \quad \blacksquare$
The space complexity of LONGEST-COMMON-TAIL is $O(mn)$.
Proof. The space complexity is entirely dominated by the explicit allocation of the auxiliary dynamic programming table $T$.
The algorithm allocates a 2D array with $(m+1)$ rows and $(n+1)$ columns. The total number of memory cells allocated is $(m+1) \times (n+1) = mn + m + n + 1$. Dropping the constants and lower-order terms for asymptotic analysis leaves $mn$ as the dominant term.
Therefore, the worst-case space complexity is $O(mn). \quad \blacksquare$
Remark. Because computing the $i$-th row of the table $T$ only requires values from the $(i-1)$-th row, we could theoretically optimize the space complexity down to $O(\min(m, n))$ by only storing the two most recent rows in memory. However, the standard tabular implementation provided above requires $O(mn)$ space.
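The rolling-row idea in the remark can be sketched as follows. This hypothetical Python version (the name `longest_common_tail_max` is mine) keeps only two rows and swaps the inputs so the rows have length $\min(m, n)$:

```python
def longest_common_tail_max(X, Y):
    """Longest common substring length using only two rows of the T table."""
    # Orient so the inner dimension is the shorter string: O(min(m, n)) space.
    if len(Y) > len(X):
        X, Y = Y, X
    n = len(Y)
    prev = [0] * (n + 1)    # row i-1 of the T table
    best = 0
    for i in range(1, len(X) + 1):
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                curr[j] = prev[j - 1] + 1      # extend the diagonal tail
                best = max(best, curr[j])
        prev = curr
    return best
```

The answer is unchanged; only the $(i-1)$-th row is ever consulted, so discarding older rows is safe.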
Given a rectangle with point-holes inside, partition it into smaller rectangles without any holes inside to minimize the total length of cuts. This problem is NP-Hard!
To start, take what you are given (above) and make a Guillotine Cut, which is a straight, bisecting line going from edge-to-edge as below.
A Guillotine Partition is defined as a sequence of guillotine cuts. A canonical guillotine partition occurs when every single cut passes directly through a hole.
This leads quite nicely to the problem of finding a Minimum Length Guillotine Partition. Given a rectangle with point-holes inside, the objective is to partition it into smaller rectangles such that no hole is inside any of the resulting rectangles, while minimizing the total length of the cuts.
By using dynamic programming, we can find the minimum guillotine partition in time $O(n^5)$. This is because each cut has at most $2n$ choices, and there are $O(n^4)$ subproblems to evaluate. The minimum guillotine partition can serve as a polynomial-time approximation for the general NP-Hard problem.
A canonical guillotine partition is a restricted version of this problem where every cut must be a straight line that goes from one end of the current rectangle to the other, and every cut must pass strictly through at least one of the holes. Because the canonical version restricts the locations of the cuts, we can solve it efficiently using dynamic programming.
To implement the dynamic programming approach, we first extract the $x$-coordinates and $y$-coordinates of all $n$ holes, along with the boundaries of the original rectangle. Sorting these coordinates creates a non-uniform grid. Any valid canonical sub-rectangle can be uniquely identified by choosing two $x$-coordinates (for the left and right boundaries) and two $y$-coordinates (for the top and bottom boundaries).
MIN-GUILLOTINE-PARTITION(holes, R)
1: Let X be the sorted unique x-coordinates of R's boundaries and the holes
2: Let Y be the sorted unique y-coordinates of R's boundaries and the holes
3: Let DP be a 4D array initialized to ∞
4:
5: for width_idx = 1 to X.length - 1
6: for height_idx = 1 to Y.length - 1
7: for i = 0 to X.length - width_idx - 1
8: j = i + width_idx
9: for k = 0 to Y.length - height_idx - 1
10: l = k + height_idx
11:
12: if rectangle(X[i], X[j], Y[k], Y[l]) contains no holes
13: DP[i][j][k][l] = 0
14: continue
15:
16: // Try all valid vertical cuts passing through a hole
17: for each hole at x-coordinate X[v] strictly between X[i] and X[j]
18: cost = (Y[l] - Y[k]) + DP[i][v][k][l] + DP[v][j][k][l]
19: DP[i][j][k][l] = min(DP[i][j][k][l], cost)
20:
21: // Try all valid horizontal cuts passing through a hole
22: for each hole at y-coordinate Y[h] strictly between Y[k] and Y[l]
23: cost = (X[j] - X[i]) + DP[i][j][k][h] + DP[i][j][h][l]
24: DP[i][j][k][l] = min(DP[i][j][k][l], cost)
25:
26: return DP[0][X.length-1][0][Y.length-1]
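A memoized Python sketch of this dynamic program, under the simplifying assumptions that the rectangle has its lower-left corner at the origin and that holes are given as points strictly inside it (the function name `min_guillotine_partition` is mine), might look like:

```python
from functools import lru_cache

def min_guillotine_partition(holes, width, height):
    """Minimum total cut length of a canonical guillotine partition.

    holes: list of (x, y) points strictly inside the width x height rectangle.
    Every cut must pass through some hole's x- or y-coordinate."""
    xs = sorted({0, width} | {x for x, _ in holes})
    ys = sorted({0, height} | {y for _, y in holes})

    @lru_cache(maxsize=None)
    def solve(i, j, k, l):
        x_lo, x_hi, y_lo, y_hi = xs[i], xs[j], ys[k], ys[l]
        # Holes strictly inside this sub-rectangle (boundary holes are fine).
        inside = [(x, y) for x, y in holes
                  if x_lo < x < x_hi and y_lo < y < y_hi]
        if not inside:
            return 0.0
        best = float("inf")
        for x, y in inside:
            v = xs.index(x)   # vertical cut through this hole
            best = min(best,
                       (y_hi - y_lo) + solve(i, v, k, l) + solve(v, j, k, l))
            h = ys.index(y)   # horizontal cut through this hole
            best = min(best,
                       (x_hi - x_lo) + solve(i, j, k, h) + solve(i, j, h, l))
        return best

    return solve(0, len(xs) - 1, 0, len(ys) - 1)
```

For instance, a single hole at the center of a $2 \times 2$ square needs one edge-to-edge cut of length 2, while two holes at the same height in a $3 \times 2$ rectangle are best handled by one horizontal cut of length 3 through both.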
Now, we prove the time and space complexities of MIN-GUILLOTINE-PARTITION.
Proof. Let $n$ be the number of holes. In a canonical partition, any cut must pass through one of the holes. Therefore, the $x$-coordinates of the vertical boundaries must be chosen from the set of hole $x$-coordinates (plus the $2$ outer bounding lines), yielding at most $n+2$ possible $x$-coordinates. Similarly, there are at most $n+2$ possible $y$-coordinates.
A subproblem is defined by identifying a specific sub-rectangle. A sub-rectangle requires picking a left boundary, a right boundary, a bottom boundary, and a top boundary. Choosing $2$ horizontal coordinates and $2$ vertical coordinates out of $O(n)$ possibilities means there are $\binom{n}{2} \times \binom{n}{2} = O(n^4)$ total subproblems to evaluate.
For any given sub-rectangle, we must iterate through all possible valid guillotine cuts to find the optimal subdivision. A valid canonical cut must pass through one of the holes inside the current sub-rectangle. Because there are at most $n$ holes, there are at most $n$ possible vertical cuts and $n$ possible horizontal cuts. Thus, each subproblem has at most $2n$ choices to evaluate.
Multiplying the number of subproblems by the number of choices yields the overall time complexity:
$$ \text{Total Time} = O(n^4) \text{ subproblems} \times O(n) \text{ choices per subproblem} = O(n^5) $$Therefore, the dynamic programming approach runs in time $O(n^5)$. $\quad \blacksquare$
Proof. The spatial footprint of the algorithm is dominated by the size of the DP table. The table is indexed by the four boundary lines of the sub-rectangle: $i$, $j$, $k$, and $l$. Since each index ranges from $0$ to $O(n)$, a 4D array requires $O(n) \times O(n) \times O(n) \times O(n)$ memory cells. Thus, the space complexity is strictly bounded by $O(n^4)$. $\quad \blacksquare$
While the canonical guillotine partition can be solved in polynomial time and serves as a polynomial-time approximation, the general problem—where cuts are not restricted to be "guillotine" cuts and can instead stop at intersections in "windmill" formations—is fundamentally more complex.
Theorem. The general Minimum Length Rectangular Partition with Holes problem is NP-Hard.
Proof Sketch. The proof relies on a polynomial-time reduction from the known NP-complete problem, Planar 3-SAT (or variations of the exact cover problem). The general non-guillotine problem allows cuts to meet and terminate at T-junctions within the interior of the rectangle, permitting complex, interlocked topological structures (often referred to as "windmill" patterns).
By strategically placing points (holes) in the rectangle, one can construct "wire" gadgets, "splitters", and "logic gates" representing a boolean satisfiability formula. The holes force the minimum-length cuts to traverse these gadgets. The minimum length of the non-guillotine cuts corresponds directly to finding a valid boolean assignment that satisfies the formula. Because verifying all possible non-guillotine topological intersections effectively maps to testing boolean assignments for Planar 3-SAT, finding the absolute minimum total cut length without the guillotine restriction is NP-Hard. $\quad \blacksquare$
However, we can do better than a sketch. This proof of NP-Hardness is due to Lingas, A., R.Y. Pinter, R.L. Rivest, and A. Shamir.
Proof. The proof relies on a polynomial-time reduction from Planar Satisfiability (PLSAT), which is known to be NP-complete. A boolean formula $F$ in 3-Conjunctive Normal Form (3CNF), with variables $X$ and clauses $C$, is considered planar if its corresponding bipartite graph $G(F) = (X \cup C, E)$ can be drawn in a plane without any intersecting edges.
To transform PLSAT to Minimum Edge Length Rectangular Partitioning (MELRP), we systematically construct a rectilinear figure $\mathcal{H}$ such that there exists a rectangular partitioning of $\mathcal{H}$ with a total edge length not exceeding a specific threshold $k$ if and only if the planar formula $F$ is satisfiable. The figure $\mathcal{H}$ is laid out as a physical image of the planar circuit $G(F)$.
The construction uses several geometric "devices" or gadgets to simulate boolean logic:
To evaluate the cost of a partition, we conceptually cut the entire figure $\mathcal{H}$ into its individual constituent devices (wires, splits, inverters, junctions). For any given partitioning, the length of each edge is divided equally among the devices it passes through; this distributed length is called the "charge" assigned to the device.
A device is said to be "optimally charged" if its internal partitioning achieves the absolute mathematical minimum possible charge.
We define the threshold $k$ as the sum of the minimum possible charges for every single device in the entire figure $\mathcal{H}$. Therefore, the entire figure $\mathcal{H}$ can be partitioned with a total edge length $\le k$ if and only if every single device is optimally charged.
Because all clause junctions must be optimally charged to meet the threshold $k$, every clause must receive at least one $1$ (True) signal. Thus, finding a partition of length $\le k$ is mathematically equivalent to finding a satisfying truth assignment for the formula $F$.
Since this geometric construction can be performed in logarithmic space, the dimensions and threshold $k$ are polynomially bounded by the size of the formula $F$, proving that MELRP is strongly NP-complete. By simulating the boundaries of larger holes with dense, degenerate points, this NP-completeness proof identically holds for rectilinear figures with simple point-holes (MELRPP). $\quad \blacksquare$
The puzzle from class asks about the Monotonicity Property (also known as the Knuth Optimization) in Matrix-Chain Multiplication. Specifically, if $s[i, j]$ is the index $k$ that achieves the optimal split for the product $A_i \dots A_j$, we examine if the optimal split point moves monotonically as the sequence range shifts.
The Puzzle: Is it true that for $i < i'$, $s[i, i+l] \le s[i', i'+l]$? Can we improve the running time if this property holds?
Answer: As discussed earlier, this monotone property is not guaranteed to hold for Matrix-Chain Multiplication in general, since the cost terms $p_{i-1} p_k p_j$ need not satisfy the quadrangle inequality. Suppose, however, that the optimal split point $s[i, j]$ did satisfy the following monotonicity condition:
$$s[i, j-1] \le s[i, j] \le s[i+1, j]$$
This means that the optimal split point for a sequence of length $L$ is bounded by the optimal split points of its two sub-sequences of length $L-1$.
The standard dynamic programming algorithm for Matrix-Chain Multiplication has a time complexity of $O(n^3)$. This is because there are $O(n^2)$ entries in the table, and for each entry, we perform a linear search over $O(n)$ possible split points.
If the monotonicity property holds, we can reduce the search space for the split point $k$, effectively improving the complexity to $O(n^2)$.
Proof. Instead of searching for $k$ in the full range $[i, j-1]$, we restrict our search to the interval $[s[i, j-1], s[i+1, j]]$. To determine the total complexity, we sum the work done for each length $l$ from $2$ to $n$:
For a fixed length $l$, the total work across all sub-problems of that length is:
$$ \sum_{i=1}^{n-l} \left(s[i+1, i+l] - s[i, i+l-1] + 1\right) $$This is a telescoping sum. Notice how the terms cancel out:
$$ (s[2, l+1] - s[1, l]) + (s[3, l+2] - s[2, l+1]) + \dots + (s[n-l+1, n] - s[n-l, n-1]) + O(n) $$The sum simplifies to:
$$ s[n-l+1, n] - s[1, l] + O(n) $$Since the values of $s[i, j]$ are bounded by $n$, the total work for a single length (one diagonal of the DP table) is $O(n)$. Because there are $n$ such lengths to compute, the total time complexity is:
$$ \sum_{l=1}^{n} O(n) = O(n^2) $$Thus, by exploiting the monotonicity of the optimal split points, we reduce the cubic-time algorithm to quadratic time. $\quad \blacksquare$
Remember when we mentioned that the study of shortest paths offers a counterexample to the claim that every dynamic programming algorithm can be analysed with the formula: $$\text{Running-time} = (\text{table size}) \times (\text{computation time of recursive formula})$$ This is the section in which we go into more detail.
The shortest path problem can be stated as follows. Consider a network $G=(N,A)$ with an origin node $s$ and a destination node $t$. As is standard, we write $n=|N|$ and $m=|A|$, where $N$ is the set of nodes and $A$ is the set of arcs (or edges). Examine the figure below.
With these facts, what is the shortest path from \( s \) to \( t \)? Such a path is defined as the one that minimizes the total cost, represented as \( \min\left(\sum c_{ij}\right) \), where \( c_{ij} \) is the cost of arc \( (i, j) \).
Let $d^*(u)$ denote the length of the shortest path from the origin node $s$ to node $u$. Then
$$d^*(u) = \min_{v \in N^-(u)} \{d^*(v) + c_{vu}\}$$ where
$$N^-(u) = \{v | (v,u) \text{ exists}\}$$
The term $d^*(u)$ represents the optimal value of our objective function. In this context, it is the absolute shortest distance (or minimum cost) required to travel from a starting source node $s$ to a specific destination node $u$. The asterisk ($*$) is a common mathematical convention used to denote that this value is "optimal"—meaning that among all possible paths through the graph, none are shorter than $d^*(u)$.
The set $N^-(u) = \{v | (v,u) \text{ exists}\}$ defines the in-neighborhood or the set of immediate predecessors of node $u$. Essentially, this looks "backward" from $u$ to find every node $v$ that has a direct, one-step edge leading into $u$. If you think of the graph as a series of roads, $N^-(u)$ represents all the possible intersections you could have been at just one moment before arriving at $u$.
The expression $\{d^*(v) + c_{vu}\}$ represents a candidate path length. It calculates what the total distance to $u$ would be if you chose to arrive there specifically by way of node $v$. It breaks the journey into two parts: $d^*(v)$, which is the already-optimized shortest path from the source to the predecessor, and $c_{vu}$, which is the specific "edge cost" or weight of the final link between $v$ and $u$.
Finally, the $\min_{v \in N^-(u)}$ operator is the decision-making component of the equation, often referred to as the Bellman Equation or the Principle of Optimality. That is, the "value" of a decision problem is written in terms of the immediate reward of a present choice plus the "value" of the subsequent choices that remain. Since there might be multiple ways to reach node $u$ from different predecessors, the algorithm evaluates the candidate path length for every possible $v$ in the in-neighborhood. By selecting the minimum of these values, it ensures that $d^*(u)$ is the most efficient route possible, discarding any sub-optimal detours.
See how the Bellman Equation was applied? It mattered to consider all the different ways to reach $u$: the choices available now and those the algorithm could make in the future.
We can approach the problem of shortest paths with dynamic programming as we said. The pseudocode for the approach is below.
DAG-SHORTEST-PATH(V, E, c, s)
1: for each vertex u ∈ V
2: d*[u] = ∞
3: d*[s] = 0
4: S = {s}
5: T = V - {s}
6:
7: while T is not empty
8: find u ∈ T such that all predecessors N⁻(u) ⊆ S
9:
10: min_dist = ∞
11: for each v ∈ N⁻(u)
12: current_dist = d*[v] + c(v, u)
13: if current_dist < min_dist
14: min_dist = current_dist
15:
16: d*[u] = min_dist
17: S = S ∪ {u}
18: T = T - {u}
19:
20: return d*
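The pseudocode above translates directly into a short Python sketch. This is a minimal implementation under assumed conventions (an adjacency dictionary `E` mapping each node to its successors and a cost dictionary `c` keyed by edges); it uses the naive scan to find the next ready node, exactly as written above.

```python
import math

def dag_shortest_path(V, E, c, s):
    """DAG-SHORTEST-PATH sketch: E maps u -> list of successors,
    c maps (v, u) -> edge weight. Assumes the graph is acyclic."""
    preds = {u: [] for u in V}               # build in-neighborhoods N^-(u)
    for v in V:
        for u in E.get(v, []):
            preds[u].append(v)
    d = {u: math.inf for u in V}
    d[s] = 0
    S, T = {s}, set(V) - {s}
    while T:
        # find u in T with all predecessors in S (the lemma guarantees one exists)
        u = next(x for x in T if all(v in S for v in preds[x]))
        if preds[u]:
            d[u] = min(d[v] + c[(v, u)] for v in preds[u])
        S.add(u)
        T.remove(u)
    return d
```

On the diamond graph with edges $(s,a)=1$, $(s,b)=4$, $(a,t)=2$, $(b,t)=1$, this returns $d^*(t) = \min(1+2,\ 4+1) = 3$.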
The algorithm described above is specifically tailored for Directed Acyclic Graphs (DAGs). It incrementally builds the set of "known" shortest distances, $S$, through a sequence of vertex relaxations. Remember: the value $d^*(u)$ represents the optimal distance from the source $s$ to any vertex $u \in V$ (see above). Below, examine the DP algorithm on an example.
A critical feature of this algorithm is the selection criterion for the next vertex to be processed. By requiring that $N^-(u) \subseteq S$, the algorithm ensures that a vertex $u$ is only added to $S$ once all its immediate predecessors (the set of vertices with incoming edges to $u$) have had their shortest paths finalized. This effectively processes the graph in topological order, ensuring that no shorter path to $u$ can be discovered later in the execution.
The recurrence relation used to update the distance (which we saw above), $$d^*(u) = \min_{v \in N^-(u)} \{d^*(v) + c(v,u)\}$$ is actually quite special: it is an application of the Bellman Equation. It states that the shortest path to $u$ must consist of the shortest path to some predecessor $v$, plus the cost of the final edge $(v, u)$. By iterating until $T = \emptyset$, the algorithm guarantees that every reachable vertex from $s$ is assigned its true minimum cost.
Lemma. Consider an acyclic network $G=(V,E)$ with a source node s and a sink node t. Suppose (S,T) is a partition of V with $s \in S$. Then there exists $u \in T$ such that $N^-(u) \subseteq S$
This Lemma addresses a fundamental structural property of Directed Acyclic Graphs (DAGs). It essentially states that in any acyclic network where you have separated the vertices into two groups—a set $S$ containing the starting point and a set $T$ containing the remaining nodes—there must be at least one node in $T$ whose entire "history" (all incoming edges) comes exclusively from $S$.
The condition $N^-(u) \subseteq S$ is the most critical part of this statement. It means that for a specific node $u$ in the target set $T$, every single predecessor $v$ that has an edge $(v, u)$ pointing to $u$ must already be a member of the set $S$. In practical terms, this ensures that there are no "hidden" paths from other nodes in $T$ that could reach $u$ before $S$ does.
The logic behind this relies on the absence of cycles. If every node in $T$ had at least one predecessor that was also in $T$, you could trace these predecessors backward indefinitely. In a finite graph, you would eventually be forced to revisit a node you had already passed, creating a cycle. Since the lemma specifies the network is acyclic, this "infinite loop" is impossible, meaning at least one node in $T$ must have all its predecessors sitting safely within $S$.
In the context of shortest-path algorithms, this lemma provides the mathematical guarantee that we can always find a "next" node to process. Because $u$ only has predecessors in $S$, and we have already calculated the shortest paths for everything in $S$, we can finalize the distance $d^*(u)$ without worrying that some other path through $T$ will come along later and offer a shorter route.
Proof. Note that for any $u \in T$, $N^-(u) \neq \emptyset$. If $N^-(u) \not\subseteq S$, then there exist $v \in N^-(u)$ such that $v \in T$. If $N^-(v) \not\subseteq S$ then there exist $w \in N^-(v)$ such that $w \in T$. This process cannot go forever. Finally, we'll find $z \in T$ such that $N^-(z) \subseteq S$. $\blacksquare$
The proof begins with a logical "what if" scenario. It assumes that if we cannot find a node $u \in T$ where all its predecessors are in $S$, then every single node in $T$ must have at least one predecessor that is also in $T$. This sets up a chain of dependency: to process $u$, you first need $v$; to process $v$, you first need $w$, and so on.
The phrase "This process cannot go forever" is the pivot point of the argument. Because the graph $G$ is finite (it has a limited number of vertices) and, crucially, acyclic, you cannot move backward through $T$ indefinitely. If you could, you would eventually have to revisit a node you've already seen, which would create a directed cycle—violating the very definition of the network.
Therefore, as you trace this path backward ($u \leftarrow v \leftarrow w \dots$), you must eventually hit a "dead end" within the set $T$. This "dead end" is a node $z$ that has no more predecessors left in $T$. Since the graph is connected and starts at source $s \in S$, the only remaining place for $z$'s predecessors to be is within the set $S$.
The conclusion, $N^-(z) \subseteq S$, confirms that there is always at least one "entry point" into the set $T$ from $S$. In the context of the shortest path algorithm, this node $z$ is the one we are "ready" to calculate, because everything we need to know to find its shortest path is already contained in the solved set $S$.
Proposal. Dynamic programming works on any acyclic network (even with negative weight), as in the figure below.
The proposal is wrong! Dynamic programming may not work in a network with a cycle. Below, we have a counterexample!
The reason this figure serves as a counterexample is because of the presence of a cyclic dependency between the two middle nodes. In the first (acyclic) figure, there is a clear topological order. We can calculate the shortest path to the top node, then use that result to calculate the bottom node. Because there are no loops, once a value is computed, it is guaranteed to be final.
In the counterexample, however, we see two vertical edges between the center nodes pointing in opposite directions. This creates a deadlock for the DP algorithm. To calculate $d^*(\text{top node})$, the algorithm requires the finalized value of $d^*(\text{bottom node})$. Conversely, to calculate $d^*(\text{bottom node})$, it needs the finalized value of $d^*(\text{top node})$.
Recall the lemma we discussed: "There exists $u \in T$ such that $N^-(u) \subseteq S$." In this cyclic graph, if our set $S$ only contains the source node, the lemma fails. Neither of the two middle nodes has all its predecessors in $S$ because they are predecessors of each other. This means the algorithm cannot find a "next" node to process, effectively getting stuck in an infinite loop or requiring an infinite number of iterations to converge. Furthermore, if a cycle has a negative total weight, the concept of a "shortest path" may not even exist, as one could traverse the cycle infinitely to reach a cost of negative infinity ($-\infty$). While Dijkstra’s algorithm fails with negative edges even in acyclic graphs, DP specifically fails here because it cannot establish the subproblem ordering necessary to fill its table.
Earlier, we found the supposition that "Every algorithm of dynamic-programming type can use the following formula to estimate its running time: $running~time = (table~size) \times (computing~time$ for recursive formula)" to be false. To see why, take that formula and proceed to calculate the running time. What we would get is:
Table Size: one entry for each $d^*(u)$ calculation: $$\text{Size} = O(|V|) = O(n)$$

Per-Node Computation: worst-case search through the predecessors $N^-(u)$ of each node: $$\text{Time per node} = O(|V|) = O(n)$$

Total Running Time: $$\text{Total} = (\text{Number of Nodes}) \times (\text{Work per Node}) = O(n) \times O(n) = O(n^2)$$
Is the above calculation correct?
No! We also need to know how much time to find $u \in T : N^-(u) \subseteq S$! Examine the pseudocode for DAG-SHORTEST-PATH more closely.
Excluding the time to find $u \in T : N^-(u) \subseteq S$, how much time do we take? Since each edge is updated once, we need only $O(|A|) = O(|E|) = O(m)$ time!
Incidentally, how much time does it take to find a node $u \in T$ such that $N^-(u) \subseteq S$? In a naive implementation, we must scan the remaining nodes in the set $T$ and check the status of their predecessors. This scan takes $O(n)$ time per iteration, which—when repeated for all $n$ nodes—results in a total search time of $O(n^2)$.
However, we can optimize this step significantly. The requirement to only process a node $u$ after all its predecessors $N^-(u)$ have been processed is functionally equivalent to finding a topological ordering of the graph. A topological ordering is a linear sequence of vertices where, for every directed edge $(u, v)$, node $u$ strictly precedes node $v$. By processing nodes in this order, we ensure that $d^*(v)$ is finalized before we ever attempt to calculate $d^*(u)$.
Using Kahn's Algorithm, we can construct this ordering and identify the next valid node in constant time, $O(1)$. This is achieved by maintaining an active in-degree counter for each vertex. By storing "source" nodes (those with an in-degree of 0) in a queue and iteratively updating the dependency counts of their neighbors, we eliminate the need for redundant scanning.
When this optimized search is combined with the $O(m)$ time required to relax each edge $(v,u) \in E$, the overall complexity of the dynamic programming approach on an acyclic network becomes strictly linear: $$O(n + m)$$
The TOPOLOGICAL-SORT algorithm below is Kahn's. We first initialize a list $L$ to store the sorted order and a set $S$ of nodes with no incoming edges. Setting initial in-degrees to zero for all $n$ nodes takes $O(n)$ time. From there, we begin edge counting: traverse every edge $(u, v) \in E$ once to calculate the true in-degree of every node. This takes $O(m)$ time.
For every node $u$ removed from $S$, we examine its outgoing edges. Each edge is "removed" exactly once, and each node is inserted into $L$ exactly once. This phase takes $O(n + m)$ time.
TOPOLOGICAL-SORT(V, E)
1: L ← ∅ // List that will contain the sorted elements
2: S ← {u ∈ V | in-degree(u) = 0} // Set of all nodes with no incoming edges
3:
4: while S is not empty do
5: remove a node u from S
6: add u to tail of L
7:
8: for each node v with an edge e from u to v do
9: remove edge e from the graph
10: if v has no other incoming edges then
11: insert v into S
12:
13: if graph has edges then
14: return error (graph has at least one cycle)
15: else
16: return L (a topologically sorted order)
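The TOPOLOGICAL-SORT pseudocode above maps onto a short Python sketch. This is a minimal rendition under assumed conventions (adjacency dictionary `E` mapping each node to its successors); rather than destructively removing edges, it maintains the active in-degree counters described in the text.

```python
from collections import deque

def topological_sort(V, E):
    """Kahn's algorithm: E maps u -> list of successors."""
    indeg = {u: 0 for u in V}
    for u in V:                        # edge counting pass: O(m)
        for v in E.get(u, []):
            indeg[v] += 1
    S = deque(u for u in V if indeg[u] == 0)   # nodes with no incoming edges
    L = []                                     # will hold the sorted order
    while S:
        u = S.popleft()
        L.append(u)
        for v in E.get(u, []):
            indeg[v] -= 1              # "remove" edge (u, v) from the graph
            if indeg[v] == 0:
                S.append(v)
    if len(L) != len(V):               # some edges were never removed
        raise ValueError("graph has at least one cycle")
    return L
```

On the diamond graph $s \to \{a, b\} \to t$, the returned order starts with $s$ and ends with $t$; passing a two-node cycle raises the error, matching lines 13-14 of the pseudocode.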
Because this pre-processing step runs in $O(n + m)$ and each distance update within the DP also happens exactly once per edge, the total time complexity remains $O(n + m)$. This makes the algorithm optimal for any acyclic network, regardless of whether the edge weights are positive or negative.
Remark. A topological ordering is not necessarily a sorted ordering! It is an ordering of the vertices based on the direction of the edges.
Theorem. Dynamic Programming is a linear time algorithm for the shortest path problem in acyclic networks, possibly with negative weight.
Proof. Let $G = (V, E)$ be a directed acyclic graph (DAG) with $n = |V|$ nodes and $m = |E|$ edges, and let $c_{vu}$ be the weight of the directed edge from $v$ to $u$. The weights $c_{vu}$ can be positive, negative, or zero.
First, because $G$ is acyclic, we can perform a topological sort on $G$ in $O(n + m)$ time using Kahn's Algorithm. This gives us a linear ordering of the vertices such that for every directed edge $(v, u)$, vertex $v$ strictly precedes vertex $u$ in the sequence.
Let $d^*(u)$ be the shortest path distance from the source node $s$ to node $u$. We initialize $d^*(s) = 0$ and $d^*(u) = \infty$ for all $u \neq s$. We then process each vertex $u$ in the exact topological order, applying the dynamic programming recurrence:
$$ d^*(u) = \min_{v \in N^-(u)} \{d^*(v) + c_{vu}\} $$

Correctness: Because we process nodes in topological order, whenever we compute $d^*(u)$, every predecessor $v \in N^-(u)$ has already been fully processed. Therefore, the optimal shortest path distance $d^*(v)$ is already finalized and mathematically correct. Furthermore, because the graph is acyclic, it is impossible to have any negative weight cycles. Thus, the shortest path is always well-defined, and the optimal substructure property strictly holds regardless of the sign of the edge weights.
Time Complexity: The topological sort takes $O(n + m)$ time. During the dynamic programming phase, we visit each of the $n$ vertices exactly once. When visiting vertex $u$, we examine each of its incoming edges exactly once. Thus, every edge in $E$ is evaluated exactly once across the entire execution of the algorithm. The time taken for this phase is proportional to $\sum_{u \in V} (1 + |N^-(u)|) = n + m$, which is strictly $O(n + m)$.
Combining the two phases, the total running time is $O(n + m) + O(n + m) = O(n + m)$, which is linear with respect to the size of the network. $\quad \blacksquare$
Corollary. In acyclic networks, the longest path can be computed in linear time.
Proof. Let $G = (V, E)$ be a directed acyclic graph with edge weights $c_{vu}$. We wish to find the longest simple path from a source $s$ to a destination $t$.
We can prove this by reduction. We construct a transformed graph $G' = (V, E)$ which is identical to $G$, except we mathematically negate the weight of every single edge. Let the new weights be $c'_{vu} = -c_{vu}$. Since the topological structure of the graph is unchanged, $G'$ remains a directed acyclic graph.
Finding the maximum-weight path in $G$ is mathematically equivalent to finding the minimum-weight path in $G'$:
$$ \max \sum c_{vu} \iff \min \sum -c_{vu} $$

Because $G'$ is an acyclic network, we can apply the exact shortest path dynamic programming algorithm described in the theorem above to $G'$.
Crucially, as proven in the theorem, the shortest path algorithm for DAGs handles negative edge weights correctly because there are no cycles to trap the algorithm in an infinite negative loop. The algorithm will compute the shortest path in $G'$ in $O(n + m)$ linear time. Once the shortest path in $G'$ is found, we simply negate the final total distance to obtain the longest path distance in the original graph $G$.
Alternative Direct Proof: We can compute the longest path directly without transforming the graph by simply substituting the $\min$ function with a $\max$ function in our dynamic programming recurrence:
$$ l^*(u) = \max_{v \in N^-(u)} \{l^*(v) + c_{vu}\} $$

By initializing $l^*(s) = 0$ and $l^*(u) = -\infty$ for all $u \neq s$, and evaluating the nodes in topological order, we explicitly compute the longest path. Evaluating this modified recurrence takes $O(n + m)$ time for the exact same topological and arithmetic reasons outlined in the theorem. Therefore, the longest path in an acyclic network can be computed in linear time. $\quad \blacksquare$
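The direct proof's max-recurrence can be sketched in a few lines of Python. This assumes a topological ordering is already available (computable in $O(n + m)$ as above) and, equivalently to the backward recurrence, relaxes each node's outgoing edges forward once its own label is final; the function name and dictionary encodings are illustrative.

```python
import math

def dag_longest_path(topo_order, E, c, s):
    """Longest-path DP on a DAG: same algorithm with max in place of min.
    topo_order: vertices in topological order; E maps u -> successors."""
    l = {u: -math.inf for u in topo_order}   # l*(u) = -inf for all u != s
    l[s] = 0
    for u in topo_order:                     # predecessors are finalized first
        if l[u] == -math.inf:
            continue                         # u is unreachable from s
        for v in E.get(u, []):
            l[v] = max(l[v], l[u] + c[(u, v)])
    return l
```

On the diamond graph with edges $(s,a)=1$, $(s,b)=4$, $(a,t)=2$, $(b,t)=1$, this yields $l^*(t) = \max(1+2,\ 4+1) = 5$, whereas the shortest path was $3$.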
The proof of the theorem helps to explain why the Shortest Path is so fast (and handles negative numbers). Imagine a system of one-way mountain trails where water only flows downhill. It is physically impossible to walk in a circle and end up back where you started (this is what "acyclic" means). Because you can never loop back, you can safely arrange all the checkpoints into a straight, sequential checklist from top to bottom (this is the "Topological Sort"). To find the quickest route to the bottom, you just walk down your checklist. By the time you evaluate a specific checkpoint, you have already calculated the absolute best routes for every single trail leading into it from above. You just look at the incoming trails, pick the fastest one, lock it in, and move to the next checkpoint on the list. Because you never have to second-guess or hike back up the mountain, you only look at each trail exactly once.
Furthermore, if a trail has a "negative weight" (think of it as a magical slide that actually speeds up your total time), it doesn't break the math. In a normal network, a negative trail could trap an algorithm in an infinite loop of going down the slide, climbing back up, and going down again to rack up infinite speed. But on our strict downward-only mountain, you can only use that slide once. The algorithm just registers it as a great shortcut and moves on safely.
The proof of the corollary explains why the Longest Path goes from "Impossible" to "Easy". In a normal city where streets connect in all directions, asking a computer to find the longest path from A to B without crossing its own tracks is a famous mathematical nightmare (an "NP-Hard" problem). The computer has to guess and check millions of twisting, winding routes to make sure it didn't miss a longer detour, because the possibility of looping around blocks makes the options explode.
But remember, our network is a strict one-way, downhill mountain. You are physically forced to reach the bottom eventually; you cannot artificially inflate your mileage by walking in circles.
Since the threat of infinite loops is completely erased by the terrain, finding the longest path is literally the exact same effortless process as finding the shortest path. You just flip a single word in your instructions: instead of picking the "quickest" incoming trail at each checkpoint, you deliberately pick the "longest" one. The computer still just walks down the exact same checklist from top to bottom, checking each trail only once. The problem goes from taking millions of years to a fraction of a second.
Together, these proofs have great meaning in the broader context of algorithm design. In general graphs, finding a shortest path when negative edge weights are present is computationally expensive. Dijkstra's Algorithm fails completely, and the Bellman-Ford algorithm takes $O(nm)$ time. The first proof demonstrates that by simply recognizing a graph is acyclic (a DAG), we can exploit topological sorting to bypass those limitations and achieve the mathematically optimal linear time of $O(n + m)$. Furthermore, finding the longest simple path in a general graph is a famously NP-Hard problem (as it encompasses the Hamiltonian Path problem). However, the corollary proof reveals a massive structural loophole: if a network has no cycles, the problem instantly drops from NP-Hard down to linear time $O(n + m)$. The absence of cycles guarantees that the algorithm will never get trapped in an infinite loop of accumulating positive weights, allowing the dynamic programming optimal substructure to safely and efficiently compute the exact longest path.
As we saw in the previous section on the Shortest Path problem, we use a dynamic programming approach to find the shortest path from an origin node $s$ to all other nodes in a network $G=(V, E)$ (where $V$, the set of vertices, is the same as $N$, the set of nodes, and symmetrical logic applies to $A$ and $E$). While our previous approach required the graph to be a Directed Acyclic Graph (DAG) to establish a topological ordering, Dijkstra's Algorithm relaxes the acyclic constraint but requires all edge weights to be non-negative.
Before we proceed, let us first define the assumptions of our problem. First, we assume integral, non-negative data (cost coefficients $c_{ij} \ge 0$) and that there is a directed path from source node $s$ to all other nodes. Our objective is to find the shortest path from node $s$ to each other node (i.e., a single-source shortest path). As a matter of fact, solutions to this problem, and indeed its construction, find a great many applications in vehicle routing, communication systems, and beyond.
Just as before, we divide the nodes into two sets: $S$, the set of permanently labeled nodes where $d(j) = d^*(j)$ (the true shortest distance), and $T = V - S$, the set of temporarily labeled nodes where $d(j) \ge d^*(j)$.
Recall too that the recurrence relation for a node $u \in T$ is defined as: $$d(u) = \min_{v \in N^-(u) \cap S} \{d^*(v) + c(v,u)\}$$
Lemma. Consider a network $G=(V,E)$ with a source node $s$ and a sink node $t$. Suppose $(S,T)$ is a partition of $V$ with $s \in S$. If arc-weights are nonnegative, then:
$$d(u) = \min_{w \in T} d(w) \implies d^*(u) = d(u).$$

Visualised, see below.

Proof of Lemma. For contradiction, suppose $d(u) = \min_{v \in T} d(v) > d^*(u)$.
Then there exists a path $p$ from $s$ to $u$ such that $length(p) = d^*(u) < d(u)$.
Let $w$ be the first node in $T$ on path $p$. Then $d(w) \le length(p(s,w))$, where $p(s,w)$ is the piece of path $p$ from $s$ to $w$.
Since all arc-weights are nonnegative, adding more edges cannot decrease the path length. Therefore:
$$length(p) \ge length(p(s,w)) \ge d(w) \ge d(u) > d^*(u) = length(p)$$
This gives $length(p) > length(p)$, which is a contradiction. Thus, $d^*(u) = d(u)$. $\quad \blacksquare$
A key step in shortest path algorithms is the update procedure UPDATE. In this lecture, and in subsequent ones, we let $d(\cdot)$ denote a vector of temporary distance labels. $d(i)$ is the length of some path from the origin node 1 to node $i$. This is used in Dijkstra's algorithm and in general label-correcting algorithms.
UPDATE(i)
1: for each (i, j) ∈ A(i) do
2: if d(j) > d(i) + cij then
3: d(j) := d(i) + cij
4: pred(j) := i
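The UPDATE procedure is small enough to render directly in Python. This is a minimal sketch under assumed conventions: `adj` maps node $i$ to its successors (the arc set $A(i)$), `c` maps arcs to costs, and `d`/`pred` are mutated in place just as in the pseudocode.

```python
def update(i, adj, c, d, pred):
    """Relax every outgoing arc (i, j) of node i; labels only ever decrease."""
    for j in adj.get(i, []):
        if d[j] > d[i] + c[(i, j)]:
            d[j] = d[i] + c[(i, j)]
            pred[j] = i

# Example: node 1 with d(1) = 0 relaxes arcs to nodes 2 and 3.
d = {1: 0, 2: float("inf"), 3: 5}
pred = {1: 0, 2: None, 3: None}
update(1, {1: [2, 3]}, {(1, 2): 4, (1, 3): 7}, d, pred)
print(d)  # {1: 0, 2: 4, 3: 5} -- d(3) keeps 5, since 0 + 7 is no improvement
```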
Suppose $d(7)=6$ at some point in the algorithm, because of the path 1-8-2-7. Node 7 is incident to nodes 9, 5, and 3, with temporary distance labels as shown below. We now perform UPDATE(7).
Note: distance labels cannot increase in an update step. They can only decrease. We do not need to perform UPDATE(7) again unless $d(7)$ decreases. Updating sooner could not lead to further decreases in distance labels. In general, if we perform UPDATE(j), we do not do so again unless $d(j)$ has decreased.
These observations regarding the monotonic decrease of distance labels provide the foundational logic for Dijkstra's algorithm. By recognizing that a node's distance label only needs to be propagated to its neighbors when its own value improves, we can systematically finalize the shortest paths in a specific order.
Dijkstra's algorithm will determine $d^*(j)$ for each $j$, in order of increasing distance from the origin node 1. $S$ denotes the set of permanently labeled nodes. That is, $d(j) = d^*(j)$ for $j \in S$. $T$ denotes the set of temporarily labeled nodes. I.e., $d(j) \ge d^*(j)$ for $j \in T$.
DIJKSTRA(V, E, s)
1: S := {1}
2: T := V - {1}
3: d(1) := 0 and pred(1) := 0
4: for j = 2 to n do d(j) := ∞
5: Update(1)
6: while S ≠ V do
7: begin (node selection, also called FINDMIN)
8: let i ∈ T be a node for which d(i) = min{d(j) : j ∈ T}
9: S := S ∪ {i}
10: T := T - {i}
11: Update(i)
12: end
13: return d, pred
Because Dijkstra’s algorithm always chooses the "closest" vertex in $T = V - S$ to add to set $S$, it employs a greedy strategy. While greedy strategies do not always yield optimal results for every problem, Dijkstra's algorithm successfully computes exact shortest paths. A standard method for proving this algorithmic correctness involves establishing loop invariants.
To prove correctness, we maintain the following invariants at the start of each iteration of the while loop:

Invariant 1: for each node $j \in S$, the label is final and exact: $d(j) = d^*(j)$.

Invariant 2: for each node $j \in T$, the label is an upper bound: $d(j) \ge d^*(j)$.

Invariant 3 (restricted paths): for each node $j \in T$, $d(j)$ is the length of a shortest path from $s$ to $j$ whose intermediate nodes all lie in $S$ (or $\infty$ if no such path exists).
Because set $S$ increases by exactly one node at a time, at the end of the algorithm when $S = V$, Invariant 1 guarantees that all nodes have been assigned their mathematically optimal shortest paths.
Combined Proof of all Listed Invariants. Initialization: When $S = \{s\}$ initially, and after the first UPDATE(s), all invariants trivially hold. The source's label is 0, and all neighbors are updated with their direct edge costs.
Maintenance (Proof by Contradiction): We wish to show that in each iteration, the vertex $u$ added to $S$ has $d(u) = \delta(s, u)$. For the purpose of contradiction, let $u$ be the first vertex for which $d(u) \neq \delta(s, u)$ when it is added to $S$.
Because there is a path from $s$ to $u$, a true shortest path $p$ exists. Prior to adding $u$ to $S$, path $p$ connects a vertex in $S$ (specifically $s$) to a vertex in $T$ ($u$). Let us trace path $p$ and find the first vertex $y$ along $p$ that is in $T$. Let $x \in S$ be $y$'s immediate predecessor. We can decompose the shortest path into:
$$ s \xrightarrow{p_1} x \rightarrow y \xrightarrow{p_2} u $$

Because $x \in S$, and $u$ is the first vertex where our algorithm failed, we know $x$ was processed correctly: $d(x) = \delta(s, x)$. When $x$ was added to $S$, the edge $(x, y)$ was relaxed. By Invariant 3 (restricted paths), the label $d(y)$ correctly captured this distance. Therefore, $d(y) = \delta(s, y)$.
Now, because $y$ appears before $u$ on the true shortest path $p$, and all edge weights are non-negative (specifically those on subpath $p_2$), the distance to $y$ must be less than or equal to the total distance to $u$:
$$ d(y) = \delta(s, y) \le \delta(s, u) \le d(u) $$

However, both $u$ and $y$ were in $T$ when the algorithm explicitly chose $u$ as the minimum element. Our greedy choice implies $d(u) \le d(y)$. For both $d(y) \le d(u)$ and $d(u) \le d(y)$ to be true simultaneously, they must be strictly equal:
$$ d(y) = \delta(s, y) = \delta(s, u) = d(u) $$

Consequently, $d(u) = \delta(s, u)$, which directly contradicts our assumption that $u$ had the wrong distance. The algorithm is perfectly correct. $\quad \blacksquare$
The running time of Dijkstra's algorithm depends heavily on how the set $T$ (the unvisited nodes) is implemented as a min-priority queue.
With a binary heap, EXTRACT-MIN takes $O(\log V)$, and every successful edge relaxation requires a DECREASE-KEY, which also takes $O(\log V)$. For sparse graphs, this drastically outperforms the naive array. This implementation is due to J. W. J. Williams (1964). A Fibonacci heap instead allows DECREASE-KEY to run in amortized $O(1)$ time; thus, the $E$ edge relaxations take $O(E)$ time, leaving the $V$ extractions to take $O(V \log V)$. This provides the best asymptotic runtime for general weighted graphs.

As summarized previously, the choice of the data structure used to maintain the set of temporarily labeled nodes $T$ (the priority queue) heavily dictates the time complexity of Dijkstra's algorithm. While a simple array implementation yields $O(n^2)$ and Dial's bucket implementation provides a pseudo-polynomial $O(m + nC)$, advanced data structures reduce the time significantly. Here, we detail and formally analyze the implementations mentioned.
A standard binary heap organizes the nodes such that the node with the minimum distance label can be extracted efficiently. When a distance label is updated, its position in the heap is adjusted upward.
DIJKSTRA-BINARY-HEAP(V, E, s)
1: for each u ∈ V do
2: d(u) = ∞
3: pred(u) = NULL
4: d(s) = 0
5:
6: Let Q be a Min-Priority Queue (Binary Heap) containing all V, keyed by d
7:
8: while Q is not empty do
9: u = EXTRACT-MIN(Q)
10:
11: for each edge (u, v) ∈ E do
12: if d(v) > d(u) + cuv then
13: d(v) = d(u) + cuv
14: pred(v) = u
15: DECREASE-KEY(Q, v, d(v))
16:
17: return d, pred
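The heap-based pseudocode can be sketched in Python with the standard library's `heapq`. One caveat: `heapq` provides no DECREASE-KEY, so this version pushes a fresh entry on every successful relaxation and skips stale entries at extraction (the usual "lazy deletion" workaround); the asymptotics stay $O((m + n) \log n)$ since the heap holds at most $O(m)$ entries. The function name and graph encoding are illustrative choices.

```python
import heapq
import math

def dijkstra_heap(V, E, c, s):
    """Dijkstra with a binary heap: E maps u -> successors, c maps (u, v) -> weight."""
    d = {u: math.inf for u in V}
    pred = {u: None for u in V}
    d[s] = 0
    pq = [(0, s)]                      # min-priority queue of (distance, node)
    while pq:
        du, u = heapq.heappop(pq)      # EXTRACT-MIN
        if du > d[u]:
            continue                   # stale entry: u was already finalized
        for v in E.get(u, []):
            nd = du + c[(u, v)]
            if nd < d[v]:              # successful relaxation
                d[v] = nd
                pred[v] = u
                heapq.heappush(pq, (nd, v))   # stands in for DECREASE-KEY
    return d, pred
```

On the diamond graph with edges $(s,a)=1$, $(s,b)=4$, $(a,t)=2$, $(b,t)=1$, this finalizes $d(t) = 3$ with $pred(t) = a$.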
Time Complexity: $O((m + n) \log n)$
Proof. Let $n = |V|$ and $m = |E|$. Building the initial binary heap takes $O(n)$ time. The while loop executes $n$ times. Each EXTRACT-MIN operation takes $O(\log n)$ time to maintain the heap property, taking $O(n \log n)$ globally.
Inside the loop, the algorithm evaluates each of the $m$ edges exactly once across the entire execution. If an edge relaxation is successful, the DECREASE-KEY operation is called. In a binary heap, decreasing a key requires bubbling the element up the tree, which takes $O(\log n)$ time. In the worst-case scenario, every single edge evaluation results in a successful relaxation, invoking DECREASE-KEY $m$ times, taking $O(m \log n)$ time globally.
Combining these bounds, the total time complexity is $O(n \log n + m \log n) = O((m + n) \log n)$. $\quad \blacksquare$
Space Complexity: $O(n)$
Proof. The priority queue $Q$ stores exactly $n$ nodes. The arrays for $d()$ and $pred()$ also require $O(n)$ memory cells. Thus, the auxiliary space complexity is strictly $O(n)$. $\quad \blacksquare$
A Fibonacci heap is a collection of min-heap-ordered trees. It is theoretically superior to a standard binary heap because it lazily defers structural consolidation until an EXTRACT-MIN is called. This lazy property allows DECREASE-KEY operations to run in constant amortized time, which drastically accelerates Dijkstra's algorithm on dense graphs.
DIJKSTRA-FIBONACCI-HEAP(V, E, s)
1: for each u ∈ V do
2: d(u) = ∞
3: pred(u) = NULL
4: d(s) = 0
5:
6: Let F be a new empty Fibonacci Heap
7: for each u ∈ V do
8: INSERT(F, u, d(u))
9:
10: while F is not empty do
11: u = EXTRACT-MIN(F)
12:
13: for each edge (u, v) ∈ E do
14: if d(v) > d(u) + cuv then
15: d(v) = d(u) + cuv
16: pred(v) = u
17: FIB-DECREASE-KEY(F, v, d(v))
Time Complexity: $O(m + n \log n)$
Proof. As before, there are exactly $n$ EXTRACT-MIN operations and at most $m$ FIB-DECREASE-KEY operations.
Each EXTRACT-MIN triggers the consolidation of the tree root list, which takes an amortized time of $O(\log n)$. For $n$ extractions, this totals $O(n \log n)$ time. Each FIB-DECREASE-KEY operation simply cuts the node from its parent and adds it to the root list. This takes an amortized time of strictly $O(1)$. For up to $m$ edge relaxations, this phase totals $O(m)$ time.

Adding these together yields an overall time complexity of $O(m + n \log n)$. This represents a significant asymptotic improvement over the binary heap for dense graphs where $m$ heavily outweighs $n$. $\quad \blacksquare$
A Radix Heap is a sophisticated extension of Dial's Algorithm. While Dial's algorithm uses a simplistic array of $O(nC)$ buckets (where $C$ is the maximum edge weight), a Radix heap groups ranges of distances into a small logarithmic number of buckets. It exploits the property that edge weights are non-negative and the minimum label extracted from the heap is strictly monotonically increasing.
A Radix heap consists of an array of $K = \lfloor \log_2(nC) \rfloor + 2$ buckets. The size of the buckets grows exponentially. Specifically, Bucket $i$ (for $i \ge 1$) holds nodes whose temporary distances fall into a range of width $2^{i-1}$, while Bucket 0 holds exactly one distance value (the current minimum).
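To see the exponential bucket widths concretely, here is a small hypothetical helper (not from the lecture) that computes the initial value ranges of the $K$ buckets for a given $n$ and $C$, following the width rule just described.

```python
import math

def radix_bucket_ranges(n, C):
    """Initial (lo, hi) value range of each of the K = floor(log2(nC)) + 2
    radix-heap buckets: bucket 0 has width 1, bucket i >= 1 has width 2^(i-1)."""
    K = math.floor(math.log2(n * C)) + 2
    ranges, lo = [], 0
    for i in range(K):
        width = 1 if i == 0 else 2 ** (i - 1)
        ranges.append((lo, lo + width - 1))
        lo += width
    return ranges

print(radix_bucket_ranges(4, 4))
# [(0, 0), (1, 1), (2, 3), (4, 7), (8, 15), (16, 31)]
```

Note how a logarithmic number of buckets covers the whole range $[0, nC]$, which is what keeps the per-node redistribution cost at $O(\log C)$.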
DIJKSTRA-RADIX-HEAP(V, E, s, C)
1: Initialize K = floor(log2(n * C)) + 2 empty buckets B[0..K-1]
2: for each u ∈ V do d(u) = ∞
3: d(s) = 0
4: Place s into B[0]
5:
6: while there are unvisited nodes do
7: if B[0] is empty then
8: Find the lowest index i > 0 where B[i] is not empty
9: Find the minimum distance element u in B[i]
10: Update the lower bounds of buckets 0 to i-1 based on d(u)
11: Redistribute all elements from B[i] into the newly sized lower buckets B[0..i-1]
12:
13: u = Remove any node from B[0]
14:
15: for each edge (u, v) ∈ E do
16: if d(v) > d(u) + cuv then
17: d(v) = d(u) + cuv
18: Remove v from its current bucket
19: Insert v into the appropriate bucket B[k] based on its new distance
Time Complexity: $O(m + n \log C)$
Proof. Let $C$ be the maximum edge weight. The algorithm utilizes $O(\log(nC))$ distinct buckets, which is approximately $O(\log n + \log C)$. In much of the literature, the analysis focuses strictly on $C$, and this is denoted $O(\log C)$.
When an element's distance is relaxed via DECREASE-KEY, it can only move to a bucket with a lower index. Because there are at most $O(\log C)$ buckets, any single node $v$ can be shifted leftwards at most $O(\log C)$ times over the entire execution.
Summing the edge relaxations and the amortized cost of moving nodes between the logarithmically scaled buckets yields a total running time of $O(m + n \log C)$. $\quad \blacksquare$
Understanding the boundaries of Dijkstra's algorithm allows us to adapt it to unique scenarios.
If we introduce negative weight edges, the algorithm produces incorrect answers. Why doesn't the proof above hold? The flaw occurs at the inequality $\delta(s, y) \le \delta(s, u)$. If the subpath $p_2$ contains a massive negative weight, the true distance to $u$ could actually be less than the distance to $y$. Dijkstra assumes that adding edges can never make a path shorter (the monotonic property), locking $u$'s distance permanently before discovering the negative shortcut.

Suppose we have a graph where edges leaving the source $s$ are negative, but all other edges in the graph are non-negative (and there are no negative cycles). Dijkstra's algorithm will still work perfectly. Because $s$ is processed first, its outgoing negative edges are immediately relaxed. Since all subsequent edges encountered are non-negative, the monotonic property holds for the remainder of the graph's execution.

Consider changing the while loop from while |Q| > 0 to while |Q| > 1. This stops the algorithm when only one vertex remains in the unvisited queue. This is completely correct and safe. The final vertex remaining has the maximum distance in the entire graph. All of its incoming edges from $S$ have already been evaluated. Relaxing its outgoing edges cannot possibly improve the distance of any node currently in $S$ (due to non-negative weights), making the final extraction redundant.
If given an output array of distances $d$ and predecessors $pred$, we can verify if it represents a valid shortest-path tree in $O(V + E)$ time without re-running Dijkstra. We simply iterate through all edges $(u, v) \in E$ and check the triangle inequality: $d(v) \le d(u) + c(u, v)$. Furthermore, for every vertex $v \neq s$, we must verify that $d(v) = d(pred(v)) + c(pred(v), v)$. If both conditions hold, the output is mathematically verified.
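This verification can be sketched in a few lines (a minimal sketch; the edge-list representation, function name, and the assumption of at most one edge per ordered pair are my own choices):

```python
def verify_sssp(n, edges, s, d, pred):
    """Check in O(V + E) whether (d, pred) is a valid shortest-path
    tree rooted at s.  edges is a list of (u, v, cost) triples;
    assumes at most one edge per ordered pair (u, v)."""
    if d[s] != 0:
        return False
    # Triangle inequality: no edge may offer a shortcut.
    for u, v, c in edges:
        if d[v] > d[u] + c:
            return False
    # Tightness: each tree edge must realize the claimed distance.
    costs = {(u, v): c for u, v, c in edges}
    for v in range(n):
        if v == s:
            continue
        u = pred[v]
        if u is None or (u, v) not in costs or d[v] != d[u] + costs[(u, v)]:
            return False
    return True
```

Both loops touch each edge a constant number of times, so the whole check is linear in the size of the graph.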
Suppose edge weights $r(u,v)$ represent the probability (from $0$ to $1$) that a communication channel will NOT fail, and we want to find the path that maximizes the total product of these probabilities. We can map this directly to Dijkstra's algorithm. Because maximizing $\prod r(u,v)$ is mathematically equivalent to minimizing $\sum -\log(r(u,v))$, we can simply assign a new weight $w(u,v) = -\log(r(u,v))$ to every edge. Since $r \le 1$, $-\log(r)$ is guaranteed to be non-negative, flawlessly satisfying Dijkstra's constraints.
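A sketch of this reduction (the function name and edge format are illustrative; in place of an explicit DECREASE-KEY it uses Python's heapq with the standard re-insert-and-skip-stale-entries workaround):

```python
import heapq
import math

def most_reliable_path(n, edges, s):
    """Maximize the product of link reliabilities by running Dijkstra
    on weights w(u, v) = -log r(u, v), which are non-negative for
    0 < r <= 1.  edges: list of (u, v, r).  Returns, for each node,
    the best achievable reliability of a path from s."""
    adj = [[] for _ in range(n)]
    for u, v, r in edges:
        adj[u].append((v, -math.log(r)))   # the weight transformation
    dist = [math.inf] * n
    dist[s] = 0.0
    pq = [(0.0, s)]
    while pq:
        du, u = heapq.heappop(pq)
        if du > dist[u]:
            continue                        # stale heap entry
        for v, w in adj[u]:
            if du + w < dist[v]:
                dist[v] = du + w
                heapq.heappush(pq, (dist[v], v))
    # Convert the summed -log weights back into path reliabilities.
    return [math.exp(-d) if d < math.inf else 0.0 for d in dist]
```

Minimizing the sum of $-\log r$ values and then exponentiating recovers the maximized product.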
Proposal. Dijkstra's Algorithm works on any network with non-negative arc-weights.
Wrong! Dijkstra's algorithm may not work in a network with negative arc-weights. Consider the graph below:
If we run Dijkstra's on this graph starting from $s$:
If the graph edge weights are small integers, we can significantly speed up the FindMin operation using buckets.
Let $C = 1 + \max(c_{ij} : (i,j) \in A)$. Then $nC$ is a strict upper bound on the minimum length path from node 1 to node $n$. Recall: When we select nodes for Dijkstra's Algorithm we select them in strictly increasing order of distance from node 1.
Let us start with a simple storage rule: create an array of buckets from $0$ to $nC$. Let $\text{BUCKET}(k) = \{i \in T : d(i) = k\}$. This optimization technique is known as Dial's Algorithm.
Whenever $d(j)$ is updated, update the buckets so that the simple bucket scheme remains accurate (move node $j$ to its new, lower bucket). The FindMin operation simply looks for the minimum non-empty bucket. To find the minimum non-empty bucket efficiently, start where you last left off, and iteratively scan buckets with higher indices. You never need to scan backwards because distances only increase.
Buckets Array
Let $C$ be the largest arc length (cost) in the graph.
Total running time: $O(m + nC)$. This can be heavily improved in practice.
We can optimize memory by creating buckets only when needed, and stop creating them once each node has been assigned. Let $d^* = \max d^*(j)$. The maximum bucket index ever used is at most $d^* + C$.
Suppose node $j$ is relabeled by UPDATE(i). Then $d(j) = d(i) + c_{ij} \le d^* + C$, so no node is ever placed into a bucket with index greater than $d^* + C$; bucket indices never drift far ahead of the current maximum distance. Memory can therefore be optimized down to a circular array of size $C+1$ indexed modulo $C+1$.
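Dial's scheme can be sketched as follows. For clarity this sketch keeps the simple flat array of $nC + 1$ buckets rather than the modulo-$(C+1)$ circular refinement; the names and edge representation are my own:

```python
from math import inf

def dial_shortest_paths(n, edges, s):
    """Dial's algorithm: Dijkstra whose priority queue is an array of
    buckets indexed by tentative distance.  edges: list of (u, v, c)
    triples with non-negative integer costs c."""
    adj = [[] for _ in range(n)]
    C = 0
    for u, v, c in edges:
        adj[u].append((v, c))
        C = max(C, c)
    d = [inf] * n
    d[s] = 0
    buckets = [[] for _ in range(n * C + 1)]
    buckets[0].append(s)
    done = [False] * n
    k = 0                       # FindMin scan pointer; never moves back
    while k < len(buckets):
        if not buckets[k]:
            k += 1              # distances only increase: scan forward
            continue
        u = buckets[k].pop()
        if done[u] or d[u] != k:
            continue            # stale entry left behind by a relabel
        done[u] = True
        for v, c in adj[u]:
            if d[u] + c < d[v]:
                d[v] = d[u] + c
                buckets[d[v]].append(v)   # move v to its new, lower bucket
    return d
```

The scan pointer `k` advances at most $nC$ times in total and every edge is relaxed once, giving the $O(m + nC)$ bound.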
The title of this section is deceptive. This section is dedicated to the study of a specific priority queue as applied to Dijkstra's algorithm: the min-priority queue.
Before diving into priority queues, consider, again, a fundamental question: Is an array a data structure? At least now, Dr. Du provides an actual answer that can be reasoned with.
Strictly speaking, he says, no! An array is simply a method of data organization (a contiguous block of memory). A data structure is defined as a data organization strictly coupled with a defined set of operations. An array organization can be used to implement a wide variety of distinct data structures, including Stacks, Queues, Lists, Max-Heaps, Min-Heaps, and Priority Queues.
In the previous section, we explored Dial's Algorithm, which optimizes Dijkstra's using buckets, resulting in a running time of $O(m + nC)$. Is this a polynomial-time algorithm?
Answer: No. Rather, it is a pseudo-polynomial time algorithm. The running time depends on $C$ (the maximum edge weight). Because $C$ is a numeric value, its magnitude can be exponentially larger than the number of bits required to store it. True polynomial-time algorithms must run in polynomial time relative to the size of the input data (the number of nodes and edges), not the numeric magnitude of the edge weights.
A priority queue is a data structure for maintaining a set of elements where each element has an associated priority. Elements with high priority are served before elements with low priority. Each element is associated with a value called a key.
In a min-priority queue, a smaller key value dictates a higher priority (i.e., if element $s$ has a higher priority than element $t$, then $s.key \le t.key$). A Min-Heap is the standard binary tree implementation used to power a min-priority queue.
To achieve the $O((m+n)\log n)$ time complexity for Dijkstra's algorithm that we analyzed in the previous section, the min-priority queue must efficiently support four core operations: MINIMUM(S), EXTRACT-MIN(S), DECREASE-KEY(S, x, k), and INSERT(S, x).
MINIMUM(S) returns the element of $S$ with the smallest key. Because the smallest element is always at the root of the min-heap, this takes $O(1)$ time.
MINIMUM(A)
1: return A[1]
EXTRACT-MIN(S) removes and returns the element of $S$ with the smallest key. It replaces the root with the last element in the heap, shrinks the effective array size, and then calls MIN-HEAPIFY to bubble the new root down to its proper place. This procedure takes $O(\log n)$ due to the MIN-HEAPIFY operation traversing the tree's height.
EXTRACT-MIN(A)
1: if heap-size[A] < 1 then
2: error "heap underflow"
3: min ← A[1]
4: A[1] ← A[heap-size[A]]
5: heap-size[A] ← heap-size[A] - 1
6: MIN-HEAPIFY(A, 1)
7: return min
DECREASE-KEY(S, x, k) decreases the value of element $x$'s key to a new value $k$ (which is assumed to be strictly smaller than $x$'s current key). If a node's key becomes smaller, its priority increases, meaning it may need to bubble up the tree. This takes $O(\log n)$ as it traces a path straight up to the root.
DECREASE-KEY(A, i, key)
1: if key > A[i] then
2: error "new key is larger than current key"
3: A[i] ← key
4: while i > 1 and A[Parent(i)] > A[i] do
5: exchange A[i] with A[Parent(i)]
6: i ← Parent(i)
Note: In some literature or slides, you might see this operation labeled Increase-Key even when the numeric value drops (e.g., from 9 down to 1), on the reasoning that lowering a key's numeric value increases its priority in a min-heap. The algorithmic function name, however, strictly follows the numeric direction of the change. Hence the name DECREASE-KEY.
INSERT(S, x) inserts a new element $x$ into set $S$. It accomplishes this by expanding the heap size, dropping a dummy "infinity" node at the end, and then utilizing Decrease-Key to bubble the actual value up into its correct position. This runs in $O(\log n)$ time, bound by the DECREASE-KEY call.
HEAP-INSERT(A, key)
1: heap-size[A] ← heap-size[A] + 1
2: A[heap-size[A]] ← +∞
3: HEAP-DECREASE-KEY(A, heap-size[A], key)
As extensively detailed and proven in the previous sections, substituting these exact Min-Heap operations directly into Dijkstra's core logic yields strict algorithmic efficiencies.
The algorithm executes exactly $V$ Extract-Min operations (one for each node added to set $S$). The algorithm attempts up to $E$ Decrease-Key operations (when relaxing edges).
Applying the $O(\log n)$ time bounds defined in the pseudocode above to these steps establishes the $O((m+n)\log n)$ overall runtime for sparse networks. This mathematically confirms our summary from the previous section regarding the profound performance impact of utilizing standard binary min-priority queues.
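The pieces assemble into a complete implementation. Here is a sketch of Dijkstra's algorithm backed by a binary min-heap; since Python's heapq module has no DECREASE-KEY, a fresh entry is pushed on every relaxation and stale entries are skipped on extraction (the asymptotic bound is unchanged):

```python
import heapq
from math import inf

def dijkstra(n, adj, s):
    """Dijkstra with a binary min-heap.  adj[u] is a list of
    (v, cost) pairs with non-negative costs.  Returns the distance
    array and the predecessor array."""
    d = [inf] * n
    pred = [None] * n
    d[s] = 0
    pq = [(0, s)]                      # (key, vertex) pairs
    while pq:                          # EXTRACT-MIN
        du, u = heapq.heappop(pq)
        if du > d[u]:
            continue                   # stale entry: skip it
        for v, c in adj[u]:            # relax each outgoing edge
            if du + c < d[v]:
                d[v] = du + c          # "DECREASE-KEY" via re-insert
                pred[v] = u
                heapq.heappush(pq, (d[v], v))
    return d, pred
```

Each edge pushes at most one entry onto the heap, so there are $O(m)$ heap operations of cost $O(\log n)$ each, matching the $O((m+n)\log n)$ bound.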
Previously on CS 6363, we extensively analyzed the Single-Source Shortest Path problem using Dijkstra's algorithm and specialized priority queues to find the fastest routes from a single origin node to the rest of the network.
In case you missed it, Dijkstra's algorithm solves the Single-Source Shortest Path problem. It finds the absolute shortest path from one specific starting node (the source) to every other node in a network. A strict requirement for this algorithm is that it only works if all edge weights (costs/distances) in the graph are non-negative.
Before the algorithm takes its first step, it categorises the graph's nodes into two sets. Set $S$ (Permanently Labeled) is the set of nodes whose true shortest distance from the source has been finalized and locked in. Set $T$ (Temporarily Labeled) is the set of unvisited or partially evaluated nodes whose current distance is just an estimate (an upper bound). It also prepares a distance list, d(j), recording the shortest known distance from the source to node j. The distance to the source node is set to 0, and the distance to all other nodes is initialized to ∞ (infinity).
The algorithm runs in a continuous loop, progressively moving nodes from $T$ to $S$ until $T$ is empty. During each iteration, the first step is Node Selection (which we will call FindMin). The algorithm scans the set of temporary nodes $T$ and selects the node $u$ that currently has the minimum distance label $d(u)$. Because all edge weights are positive, there is no possible way a future path could somehow loop around and find a shorter route to this specific node. Its current temporary distance is guaranteed to be its true, final distance.
The second step is Transfer and Lock. The selected node $u$ is removed from $T$ and added to $S$, locking in its shortest path. The third step is the Update (Relaxation) of neighbours. Now that node $u$ is locked, the algorithm looks at all of $u$'s immediate neighbors that are still in the temporary set $T$. For each neighbour $v$, it asks if the distance to $v$ is shorter by traveling through the newly locked node $u$. It calculates the new potential distance: $d(u)$ plus the cost of the edge from $u$ to $v$. If this new distance is strictly less than $v$'s current distance label $d(v)$, it updates $v$'s label to this new, lower number. Note that distance labels can only decrease during this step, never increase.
For termination, the algorithm repeats this FindMin → Transfer → Update cycle until $T$ is completely empty (meaning $S$ contains all vertices). At this point, the distance list $d(j)$ contains the optimal shortest paths to every node in the graph, and the exact routes can be traced backwards using the recorded predecessors.
So, that is a small crash course on Dijkstra's Algorithm. It is nice to be able to find the single-source shortest path. However, what if we need to find the shortest path from every node to every other node in a digraph $G=(V,E)$? This expanded problem is known as the All-Pairs Shortest Paths problem.
Before calculating shortest paths, let us look at a foundational concept: counting the exact number of paths between nodes. Given a digraph $G=(V,E)$ and a positive integer $k$, how do we count the number of paths with exactly $k$ edges from node $s$ to node $t$ for each pair of nodes $\{s,t\}$?
We start with the Adjacency Matrix $A(G)=(a_{ij})_{n \times n}$ where $V=\{1,2,\dots,n\}$. The entries are defined as $a_{ij}=1$ if $(i,j) \in E$, and $0$ otherwise.
Here is a visual representation of the path counting example. The Directed Graph $G$ illustrates the connections between nodes (including a self-loop on node 1), and the corresponding Adjacency Matrix $A(G)$ maps those exact edges into a mathematical format.
Directed Graph $G$
Adjacency Matrix $A(G)$
Theorem. In $A(G)^k$, each element $a_{ij}^{(k)}$ represents the exact number of paths with exactly $k$ edges from $i$ to $j$.
Proof (by induction on $k$). The base case $k = 1$ holds by the definition of the adjacency matrix: $a_{ij}^{(1)} = a_{ij}$ counts the paths from $i$ to $j$ with exactly one edge. For the inductive step, every path with exactly $k+1$ edges from $i$ to $j$ decomposes uniquely into a path with exactly $k$ edges from $i$ to some node $h$, followed by the edge $(h, j)$. Summing over all intermediate nodes gives $a_{ij}^{(k+1)} = \sum_{h=1}^{n} a_{ih}^{(k)} \cdot a_{hj}$, which is exactly the $(i,j)$ entry of $A^k \cdot A = A^{k+1}$. $\quad \blacksquare$
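The theorem is easy to check numerically. A minimal sketch (plain-Python matrix product, no libraries assumed):

```python
def mat_mul(A, B):
    """Standard integer matrix product."""
    n, p = len(A), len(B[0])
    return [[sum(A[i][h] * B[h][j] for h in range(len(B)))
             for j in range(p)] for i in range(n)]

def count_paths(A, k):
    """Entry (i, j) of A^k counts the paths with exactly k edges
    from i to j (the theorem above)."""
    P = A
    for _ in range(k - 1):
        P = mat_mul(P, A)
    return P
```

For the two-node directed cycle, $A^2$ is the identity (every node returns to itself in two steps) and $A^3 = A$, exactly as the theorem predicts.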
We can adapt the logic of matrix multiplication to find the weight of the shortest path with at most $k$ edges from $i$ to $j$. Let $l_{ij}^{(k)}$ denote this minimum length. This is All-Pairs Shortest Paths with at most $k$ edges. The recursive formula is:
$$ l_{ij}^{(k+1)} = \min_{1 \le h \le n} (l_{ih}^{(k)} + l_{hj}^{(1)}) $$

Proof. We must prove this equality by demonstrating that both sides are less than or equal to each other (i.e., Part 1: $LHS \le RHS$, and Part 2: $LHS \ge RHS$).
Part 1: $l_{ij}^{(k+1)} \le \min_{1 \le h \le n} (l_{ih}^{(k)} + l_{hj}^{(1)})$
Consider any intermediate node $h$ in the graph (where $1 \le h \le n$). Suppose we take the shortest path from $i$ to $h$ using at most $k$ edges (which has length $l_{ih}^{(k)}$), and then we take the direct edge from $h$ to $j$ (which has length $l_{hj}^{(1)}$).
The concatenation of these two paths forms a valid path from $i$ to $j$ using at most $k+1$ edges. Its total length is $l_{ih}^{(k)} + l_{hj}^{(1)}$. Because $l_{ij}^{(k+1)}$ is defined as the absolute minimum length of all possible paths from $i$ to $j$ with at most $k+1$ edges, it must be less than or equal to this specific concatenated path going through $h$. Since this holds true for any choice of intermediate node $h$, it must be less than or equal to the minimum over all possible $h$:
$$ l_{ij}^{(k+1)} \le \min_{1 \le h \le n} \left( l_{ih}^{(k)} + l_{hj}^{(1)} \right) $$

Part 2: $l_{ij}^{(k+1)} \ge \min_{1 \le h \le n} (l_{ih}^{(k)} + l_{hj}^{(1)})$
Let $p$ be the actual shortest path from $i$ to $j$ that uses at most $k+1$ edges. The total length of $p$ is strictly $l_{ij}^{(k+1)}$. We analyze this path in two exhaustive cases:
Case 1. The path $p$ contains at most $k$ edges. Then $p$ itself is a candidate path with at most $k$ edges, so $l_{ij}^{(k)} \le \text{len}(p) = l_{ij}^{(k+1)}$. Choosing the intermediate node $h = j$ and recalling that $l_{jj}^{(1)} = 0$, we obtain $\min_{1 \le h \le n}(l_{ih}^{(k)} + l_{hj}^{(1)}) \le l_{ij}^{(k)} + l_{jj}^{(1)} = l_{ij}^{(k)} \le l_{ij}^{(k+1)}$.

Case 2. The path $p$ contains exactly $k+1$ edges. Let $h$ be the last node on $p$ before $j$, splitting $p$ into a prefix $p'$ from $i$ to $h$ with exactly $k$ edges and the final edge $(h, j)$. The prefix has length at least $l_{ih}^{(k)}$ and the final edge has length at least $l_{hj}^{(1)}$, so $l_{ij}^{(k+1)} = \text{len}(p') + c_{hj} \ge l_{ih}^{(k)} + l_{hj}^{(1)} \ge \min_{1 \le h \le n}(l_{ih}^{(k)} + l_{hj}^{(1)})$.
We have shown both $LHS \le RHS$ (Part 1) and $LHS \ge RHS$ (Part 2). The equality is proven to be strictly true. $\quad \blacksquare$
A key observation is that since any valid shortest path in a graph without negative-weight cycles cannot contain a cycle, it will contain at most $n-1$ edges. Therefore, $l_{ij}^{(n-1)}$ will be the absolute shortest path from $i$ to $j$.
If we design a dynamic programming algorithm using this recursive formula, we must compute $l_{ij}^{(k)}$ for $1 \le i \le n$, $1 \le j \le n$, and $1 \le k \le n-1$. Because each $l_{ij}^{(k)}$ calculation requires evaluating $n$ intermediate nodes, computing the entire matrix layer takes $O(n^3)$, and doing this $n-1$ times results in an overall time complexity of $O(n^4)$.
We can implement this by writing a subroutine that computes $L^{(k)}$ given $L^{(k-1)}$ and the original weight matrix $W$ (which represents $L^{(1)}$). This subroutine is functionally identical to matrix multiplication, but it substitutes addition for multiplication and the min function for addition.
EXTEND-SHORTEST-PATHS(L, W)
1: n ← L.rows
2: Let L' be a new n × n matrix
3: for i ← 1 to n do
4: for j ← 1 to n do
5: L'ij ← ∞
6: for h ← 1 to n do
7: L'ij ← min(L'ij, Lih + Whj)
8: return L'
Notice that the three nested loops (lines 3, 4, and 6) iterate $n$ times each, making the time complexity of a single EXTEND-SHORTEST-PATHS execution strictly $O(n^3)$.
To find the final all-pairs shortest paths, we must compute $L^{(n-1)}$. We do this by calling our extension subroutine iteratively, starting from our base case $L^{(1)} = W$.
SLOW-ALL-PAIRS-SHORTEST-PATHS(W)
1: n ← W.rows
2: L(1) ← W
3: for m ← 2 to n - 1 do
4: L(m) ← EXTEND-SHORTEST-PATHS(L(m-1), W)
5: return L(n-1)
The outer loop in SLOW-ALL-PAIRS-SHORTEST-PATHS runs exactly $n - 2$ times. Because it calls the $O(n^3)$ subroutine on each iteration, the total overall time complexity evaluates to $O(n^4)$.
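Both routines translate almost verbatim into code. A sketch (using math.inf as the absent-edge weight; function names mirror the pseudocode):

```python
from math import inf

def extend_shortest_paths(L, W):
    """One min-plus 'multiplication': grow every path by one more edge."""
    n = len(L)
    return [[min(L[i][h] + W[h][j] for h in range(n))
             for j in range(n)] for i in range(n)]

def slow_apsp(W):
    """Apply the O(n^3) extension n - 2 times: overall O(n^4)."""
    n = len(W)
    L = W
    for _ in range(2, n):
        L = extend_shortest_paths(L, W)
    return L
```

Note that Python's `inf` arithmetic behaves exactly as the algorithm needs: `inf + c` stays `inf`, so nonexistent edges never win the `min`.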
We can actually speed up this dynamic program. Our ultimate goal is only to compute $L^{(n-1)}$ (the final matrix), which means we can skip the intermediate $k$ values by doubling our path lengths at each step. We accomplish this using a "New Multiplication" operator (often referred to mathematically as Min-Plus matrix multiplication).
Our purpose is to compute $l_{ij}^{(n-1)}$ for all $i, j$. We may reach this purpose much faster by skipping some intermediate $k$ values. This idea can be realized by finding a different recursive formula that computes paths of length $2m$ by combining two paths of length $m$:
$$ l_{ij}^{(2m)} = \min_{1 \le h \le n} \left\{ l_{ih}^{(m)} + l_{hj}^{(m)} \right\} $$

To perform this matrix-based initialization, we define the weighted adjacency matrix $L(G) = (l_{ij})_{n \times n}$ where $V = \{1, 2, \dots, n\}$ and the edge weights are formulated as:
$$ l_{ij} = \begin{cases} 0 & \text{if } i = j \\ c_{ij} & \text{if } (i, j) \in E \\ \infty & \text{otherwise} \end{cases} $$

Directed Graph $G$
Weighted Matrix $L(G)$
Notice the striking algebraic similarity between standard matrix multiplication and our shortest path formulation:
$$ c_{ij} = \sum_{h=1}^{n} \left( a_{ih} \cdot b_{hj} \right) \quad \longleftrightarrow \quad l_{ij}^{(k+1)} = \min_{1 \le h \le n} \left( l_{ih}^{(k)} + l_{hj}^{(1)} \right) $$

We can map the mathematical operations directly to create a new algebraic framework:
We define the operation $\circ$ as:
$$ (a_{ih})_{m \times n} \circ (b_{hj})_{n \times p} = \left(\min_{1 \le h \le n} \{a_{ih} + b_{hj}\}\right)_{m \times p} $$

This "Min-Plus" matrix multiplication obeys the associative law, meaning $A \circ (B \circ C) = (A \circ B) \circ C$. By treating our weighted adjacency matrix under this new algebra, we can formalize the following theorem.
Theorem. In the matrix $L(G)^k$ (computed using the $\circ$ operator), each element $l_{ij}^{(k)}$ is exactly the length of the shortest path with at most $k$ edges from node $i$ to node $j$.
Proof (by induction on $k$). The base case $k = 1$ is the definition of $L(G)$: $l_{ij}^{(1)} = l_{ij}$ is the length of the shortest path with at most one edge (recalling $l_{ii} = 0$). For the inductive step, the $\circ$ operator gives $l_{ij}^{(k+1)} = \min_{1 \le h \le n} (l_{ih}^{(k)} + l_{hj})$, which is precisely the recursive formula proven earlier for shortest paths with at most $k+1$ edges; the zero diagonal lets any path with fewer than $k+1$ edges be carried forward unchanged. $\quad \blacksquare$
As defined at the start of this section, the All-Pairs Shortest Paths problem tasks us with finding the shortest path from $s$ to $t$ for all pairs of nodes $\{s,t\}$ in a digraph $G=(V,E)$. To definitively find these absolute shortest paths without arbitrarily capping the edge count, we rely on a fundamental structural property of networks.
Lemma. If a network has no negative-weight cycles, then for any two nodes $u$ and $v$ that are connected, there exists a simple shortest path between them.
Proof. A simple path is one that contains no cycles. If a valid shortest path contained a positive-weight cycle, we could simply remove the cycle to create a strictly shorter path, contradicting the premise that the original path was optimal. If the path contained a zero-weight cycle, we could remove it to create a simple path of the exact same minimum length. Therefore, as long as the network lacks negative-weight cycles (which would cause the path cost to plummet to $-\infty$), an optimal simple path must exist. $\quad \blacksquare$
Theorem. In the matrix $L(G)^{n-1}$, each element $l_{ij}^{(n-1)}$ is the length of the absolute shortest path from $i$ to $j$.
Proof. By the lemma above, if the graph lacks negative-weight cycles, the shortest path between any two nodes is a simple path. In a graph containing exactly $n$ vertices, a simple path can contain at most $n-1$ edges. Because the matrix $L(G)^{n-1}$ explicitly computes the shortest paths utilizing at most $n-1$ edges, it is mathematically guaranteed to encapsulate the true, absolute shortest path for every pair of nodes. $\quad \blacksquare$
Theorem. The Faster All-Pairs Shortest Paths algorithm (which relies on repeated squaring of the matrix) works correctly in any network without a negative-weight cycle.
Note: This is because the repeated squaring algorithm computes $L^{(m)}$ where $m \ge n-1$. Since paths cannot get any shorter by adding more edges in a graph without negative cycles, computing powers beyond $n-1$ safely yields the exact same optimal values as $L^{(n-1)}$. Of course, we can prove this too.
Proof. The Faster All-Pairs Shortest Paths algorithm computes the sequence of matrices $L^{(1)}, L^{(2)}, L^{(4)}, L^{(8)}, \dots, L^{(m)}$ by repeatedly squaring the matrix, where the final power $m = 2^{\lceil \log_2(n-1) \rceil}$. By definition of this sequence, the final computed power $m$ is guaranteed to be strictly greater than or equal to $n-1$.
As proven previously, if a network $G$ contains no negative-weight cycles, then the absolute shortest path between any two vertices must be a simple path. Because a simple path in a graph of $n$ vertices cannot visit any vertex more than once, it can contain at most $n-1$ edges. Therefore, the matrix $L^{(n-1)}$ correctly holds the absolute shortest path distances.
What happens when we compute $L^{(m)}$ where $m > n-1$? The matrix $L^{(m)}$ represents the shortest paths using at most $m$ edges. Suppose, for the sake of contradiction, that allowing up to $m$ edges yielded a path strictly shorter than the paths found using $n-1$ edges. This would mean there exists a path with $\ge n$ edges that is shorter than the optimal simple path.
However, by the Pigeonhole Principle, any path containing $n$ or more edges must visit at least one vertex twice, meaning it must contain a cycle. Because we are operating under the strict assumption that the network has no negative-weight cycles, the weight of this cycle must be greater than or equal to zero ($W_{cycle} \ge 0$).
If we take this longer path and simply remove the cycle, we create a new path with fewer edges. Since the removed cycle had a weight $\ge 0$, the total length of the path either decreases or remains exactly the same. Thus, it is mathematically impossible for a path with $m > n-1$ edges to be strictly shorter than the optimal simple path containing $\le n-1$ edges.
Consequently, for all $m \ge n-1$, the matrices are mathematically identical:
$$ L^{(m)} = L^{(n-1)} $$

Therefore, the repeated squaring algorithm, which terminates and returns $L^{(m)}$, correctly yields the exact shortest paths without requiring the calculation to stop precisely at $n-1$. $\quad \blacksquare$
What if the network does contain a negative-weight cycle? We can actually use our matrix multiplication framework to detect it natively.
Theorem. A network $G$ contains a negative-weight cycle if and only if the matrix $L(G)^n$ contains a negative element along its main diagonal.
Proof.
Forward Direction ($\Rightarrow$): Suppose the graph contains a negative-weight cycle. This cycle must consist of at most $n$ edges. Let node $i$ be any vertex situated on this cycle. The length of the path from $i$, around the cycle, and back to $i$ will be less than zero. Because the diagonal element $l_{ii}^{(n)}$ represents the minimum cost to travel from node $i$ to node $i$ using at most $n$ edges, it will capture this cycle's cost. Therefore, $l_{ii}^{(n)}$ will be strictly negative.
Reverse Direction ($\Leftarrow$): Suppose $L(G)^n$ contains a negative diagonal element, meaning $l_{ii}^{(n)} < 0$ for some node $i$. This implies there exists a path starting at $i$ and ending at $i$ with a total weight less than zero. A path from a node to itself is, by definition, a cycle (or a collection of cycles). For the total weight to drop below zero, at least one of those constituent cycles must possess a negative total weight. $\quad \blacksquare$
FASTER-ALL-PAIRS-SHORTEST-PATHS(L(G))
1: n ← |V|
2: m ← 1
3: L(1) ← L(G)
4: while n - 1 > m do
5: L(2m) ← L(m) ∘ L(m)
6: m ← 2m
7: return L(m)
This algorithm is guaranteed to work in any network as long as it does not contain a negative-weight cycle. Furthermore, a graph $G$ contains a negative-weight cycle if and only if $L(G)^n$ contains a negative element along its diagonal.
We can optimize this problem even further down to exactly $O(n^3)$ using the Floyd-Warshall algorithm. Instead of restricting the number of edges in the path, Floyd-Warshall dynamically restricts the set of allowed intermediate nodes.
Let $d_{ij}^{(k)}$ denote the length of the shortest path from $i$ to $j$ such that all internal nodes along the path are strictly drawn from the set $\{1,2,\dots,k\}$. The recursive formulation is as follows:
$$ d_{ij}^{(k)} = \begin{cases} l_{ij} & \text{if } k = 0 \\ \min(d_{ij}^{(k-1)}, d_{ik}^{(k-1)} + d_{kj}^{(k-1)}) & \text{if } k \ge 1 \end{cases} $$

We may design a dynamic programming algorithm with this recursive formula. This requires us to compute all states $d_{ij}^{(k)}$ for $1 \le i \le n$, $1 \le j \le n$, and $1 \le k \le n$. Because each $d_{ij}^{(k)}$ relies on simply comparing two previously computed values, it is computed in $O(1)$ time. Hence, the algorithm runs in $O(n^3)$ time.
FLOYD-WARSHALL(L(G))
1: n ← rows[L(G)]
2: D(0) ← L(G)
3: for k ← 1 to n do
4: for i ← 1 to n do
5: for j ← 1 to n do
6: d(k)ij ← min(d(k-1)ij, d(k-1)ik + d(k-1)kj)
7: return D(n) = (d(n)ij)
Like the repeated squaring method, the Floyd-Warshall algorithm works efficiently in any network lacking negative-weight cycles. Similarly, the graph contains a negative cycle if and only if the final matrix $D^{(n)}(G)$ possesses a negative diagonal element.
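A sketch of the Floyd-Warshall recursion. As a standard refinement, a single matrix can be updated in place: in a network without negative cycles, the entries $d_{ik}$ and $d_{kj}$ are unchanged during iteration $k$, so the layers $D^{(0)}, \dots, D^{(n)}$ need not be stored separately:

```python
from math import inf

def floyd_warshall(W):
    """Floyd-Warshall in O(n^3).  After iteration k, D[i][j] holds the
    shortest i -> j path whose intermediate vertices are drawn from
    {0, ..., k} (0-indexed).  W uses inf for absent edges."""
    n = len(W)
    D = [row[:] for row in W]          # copy so W is left untouched
    for k in range(n):                 # allow vertex k as intermediate
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D
```

A negative entry on the diagonal of the returned matrix signals a negative-weight cycle, exactly as stated above.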
Theorem. The Floyd-Warshall algorithm works correctly in any network without a negative-weight cycle.
Proof. We prove the correctness by induction on $k$. Let $d_{ij}^{(k)}$ be the weight of a shortest path from vertex $i$ to vertex $j$ for which all intermediate vertices are in the set $\{1, 2, \dots, k\}$. The base case $k = 0$ permits no intermediate vertices at all, so $d_{ij}^{(0)} = l_{ij}$ is correct by definition. For the inductive step, a shortest path whose intermediate vertices are drawn from $\{1, \dots, k\}$ either avoids vertex $k$ entirely, in which case its weight is $d_{ij}^{(k-1)}$, or passes through $k$ exactly once (revisiting $k$ cannot help, since the network has no negative cycles), in which case it decomposes into a shortest $i \to k$ path and a shortest $k \to j$ path with intermediates in $\{1, \dots, k-1\}$, of total weight $d_{ik}^{(k-1)} + d_{kj}^{(k-1)}$. Taking the minimum of the two cases is exactly the recursion, so by induction $D^{(n)}$ holds the true shortest-path weights. $\quad \blacksquare$
Theorem. A network $G$ contains a negative-weight cycle if and only if $D^{(n)}(G)$ contains a negative diagonal element.
Proof. We prove this biconditional statement in two parts, mirroring the matrix-power argument. ($\Rightarrow$) If $G$ contains a negative-weight cycle through some vertex $i$, then since $D^{(n)}$ permits every vertex as an intermediate node, $d_{ii}^{(n)}$ is at most the weight of that cycle, hence negative. ($\Leftarrow$) If $d_{ii}^{(n)} < 0$ for some $i$, there exists a closed walk from $i$ to itself of negative total weight, and at least one of its constituent cycles must have negative weight. $\quad \blacksquare$
Dr. Du gave us a special "treat" and gave us not one, not two, but three puzzles to play with after class.
Puzzle 1: Given a digraph $G=(V,E)$ and a positive integer $k$, count the number of paths with at most $k$ edges from $s$ to $t$ for each pair $\{s,t\}$ of nodes.
Puzzle 2: Given an undirected graph $G=(V,E)$ and a positive integer $k$, count the number of paths with at most $k$ edges from $s$ to $t$ for each pair $\{s,t\}$ of nodes.
Puzzle 3: Given a digraph $G=(V,E)$ without loops, and a positive integer $k$, count the number of paths with at most $k$ edges from $s$ to $t$ for each pair $\{s,t\}$ of nodes.
Approaching these puzzles, we already know that for an adjacency matrix $A$, the $ij$-th entry of $A^m$ gives the exact number of paths of length exactly $m$. To find the number of paths of length at most $k$, we need to evaluate the summation:
$$ \sum_{m=1}^{k} A^m = A^1 + A^2 + \dots + A^k $$

While we could compute each power individually and add them together, there is a much faster method using Block Matrices. We construct a new $2n \times 2n$ matrix $M$ using our original $n \times n$ adjacency matrix $A$ and an $n \times n$ identity matrix $I$:
$$ M = \begin{pmatrix} A & A \\ 0 & I \end{pmatrix} $$

Let's look at what happens when we square this matrix ($M^2$):
$$ M^2 = \begin{pmatrix} A & A \\ 0 & I \end{pmatrix} \begin{pmatrix} A & A \\ 0 & I \end{pmatrix} = \begin{pmatrix} A^2 & A^2 + A \\ 0 & I \end{pmatrix} $$

If we raise this block matrix to the power of $k$, the top-right quadrant accumulates the exact summation we are looking for:
$$ M^k = \begin{pmatrix} A^k & \sum_{m=1}^{k} A^m \\ 0 & I \end{pmatrix} $$

Therefore, by constructing $M$ and computing $M^k$ (which can be done efficiently in $O(n^3 \log k)$ time using repeated squaring), the top-right $n \times n$ block will precisely contain the number of paths of length at most $k$ between all pairs of nodes.
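A sketch of this construction (plain-Python matrix product; for brevity it computes $M^k$ by $k-1$ successive multiplications, though repeated squaring would bring the cost down to $O(n^3 \log k)$):

```python
def mat_mul(A, B):
    """Standard integer matrix product (works for rectangular blocks)."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][h] * B[h][j] for h in range(m)) for j in range(p)]
            for i in range(n)]

def count_paths_at_most(A, k):
    """Top-right n x n block of M^k, where M = [[A, A], [0, I]],
    equals A + A^2 + ... + A^k: paths of length at most k."""
    n = len(A)
    I = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    Z = [[0] * n for _ in range(n)]
    # Assemble the 2n x 2n block matrix row by row.
    M = [A[i] + A[i] for i in range(n)] + [Z[i] + I[i] for i in range(n)]
    P = M
    for _ in range(k - 1):
        P = mat_mul(P, M)
    return [row[n:] for row in P[:n]]   # extract the top-right block
```

For the two-node directed cycle, $A + A^2$ is the all-ones matrix: every ordered pair is connected by some path of length at most $2$.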
To mathematically prove why this block matrix operation is correct, we can look at its geometric equivalent. Constructing the matrix $M$ is identical to creating a new, augmented graph $G'$ from our original graph $G=(V, E)$.
Proof of Bijection. We want to prove that every path of length exactly $k$ from $u \in V$ to $v' \in V'$ in the augmented graph $G'$ corresponds to exactly one path of length $m \le k$ from $u$ to $v$ in the original graph $G$.
Because there are no edges returning from $V'$ to $V$, any path from $u$ to $v'$ must follow a strict sequence: first traverse some number of edges inside $V$ (copies of $G$'s edges, the top-left block $A$), then cross exactly once from $V$ into $V'$ (the top-right block $A$), and finally idle on the self-loop at $v'$ (the identity block $I$) until exactly $k$ edges have been used.
Because the self-loops in $V'$ act as a guaranteed "padding" mechanism without offering any alternative routing, every valid path of length $m \le k$ in $G$ maps perfectly 1-to-1 to a path of length exactly $k$ in $G'$. Thus, evaluating the $u \to v'$ entry of $M^k$ inherently counts all paths of length $\le k$ in the original graph. $\quad \blacksquare$
Applying the Solution to the Puzzles: the same construction answers all three. For Puzzle 1 (a general digraph), plug the digraph's adjacency matrix $A$ into $M$ and read off the top-right block of $M^k$. For Puzzle 2 (an undirected graph), $A$ is symmetric (each undirected edge contributes $a_{ij} = a_{ji} = 1$), and the computation is otherwise identical. For Puzzle 3 (a digraph without loops), the diagonal of $A$ is all zeros, and again nothing else changes.
A problem for which the greedy strategy obtains optimal solutions usually exhibits two primary characteristics: self-reducibility (optimal substructure) and a certain exchange property. Unlike Dynamic Programming, which exhaustively evaluates all subproblems to find the global optimum, a greedy algorithm makes a locally optimal choice at each step with the hope that these local choices will lead to a globally optimal solution. The exchange property is the mathematical key to proving that this localized "greed" does not sacrifice the global optimum.
Suppose we have a set of $n$ proposed activities (or intervals) $S = \{a_1, a_2, \dots, a_n\}$ that wish to use a resource, such as a lecture hall, which can serve only one activity at a time. Each activity $a_i$ has a start time $s_i$ and a finish time $f_i$, where $0 \le s_i < f_i < \infty$. If selected, activity $a_i$ takes place during the half-open time interval $[s_i, f_i)$.
Two activities $a_i$ and $a_j$ are non-overlapping (or compatible) if their intervals do not intersect: $[s_i, f_i) \cap [s_j, f_j) = \emptyset$, which implies either $s_i \ge f_j$ or $s_j \ge f_i.$ The objective is to find a maximum-size subset of mutually compatible activities.
Let us say we sort the activities in monotonically increasing order of finish time, such that $f_1 \le f_2 \le \dots \le f_n$. The greedy choice dictates that we should always pick the activity that finishes first, $a_1$, because it leaves the maximum possible remaining time for other activities.
Proof of the Exchange Property. Let $A^*$ be a maximum-size mutually compatible subset of activities. Let $a_k$ be the activity in $A^*$ with the earliest finish time. If $a_k = a_1$, then our greedy choice $a_1$ is already in the optimal solution.
If $a_k \neq a_1$, we construct a new subset $A^{**} = (A^* \setminus \{a_k\}) \cup \{a_1\}$. Because $f_1 \le f_k$, activity $a_1$ finishes no later than $a_k$ does. Since $a_k$ was compatible with all other activities in $A^*$, replacing it with an activity that finishes even earlier cannot possibly introduce any conflicts. Therefore, $A^{**}$ is also a set of mutually compatible activities. Furthermore, $|A^{**}| = |A^*|$, meaning $A^{**}$ is also a maximum-size optimal solution. Thus, we can always exchange the first interval of a maximum solution with our greedy choice $[s_1, f_1)$ without losing optimality. $\quad \blacksquare$
Suppose $A^* = \{I_1^*, I_2^*, \dots, I_k^*\}$ is an optimal solution containing our greedy choice $I_1^*$ (which is $a_1$). Then the subset $A^* \setminus \{I_1^*\} = \{I_2^*, \dots, I_k^*\}$ must be an optimal solution for the subproblem consisting of all original activities that do not overlap with $I_1^*$.
Proof. If there existed a better solution to the subproblem containing more than $k-1$ activities, we could simply add $I_1^*$ back into that better solution to create a valid set of activities with size strictly greater than $k$. This contradicts the assumption that $A^*$ is a maximum-size solution. Therefore, the problem exhibits optimal substructure. $\quad \blacksquare$
Now that we have shown the exchange property and optimal substructure hold, we can solve this problem elegantly with a purely greedy approach, running in strictly $O(n \log n)$ time (dominated by the sorting step).
GREEDY-ACTIVITY-SELECTOR(s, f)
1: Sort the activities so that f1 ≤ f2 ≤ … ≤ fn
2: S ← {a1}
3: k ← 1
4: for i ← 2 to n do
5: if si ≥ fk then
6: S ← S ∪ {ai}
7: k ← i
8: return S
Let us apply the GREEDY-ACTIVITY-SELECTOR algorithm to a concrete example. Suppose we are given an array of $n = 6$ activities, represented by their start and finish times $[s_i, f_i)$:
Activity Selection Timeline
Step 1: Sort by Finish Time
The greedy strategy requires us to evaluate the activities in monotonically increasing order of their finish times ($f_i$). Sorting our input array yields the following sequence:
| Activity | Start ($s_i$) | Finish ($f_i$) |
|---|---|---|
| $w_2$ | 1 | 2 |
| $w_3$ | 3 | 4 |
| $w_4$ | 0 | 6 |
| $w_5$ | 5 | 7 |
| $w_6$ | 8 | 9 |
| $w_1$ | 5 | 9 |
Step 2: Iterate and Apply the Greedy Choice
We initialize our optimal subset $S = \emptyset$ and track the finish time of the most recently added activity (call it $f_{current}$, initially set to $0$). We iterate through the sorted list and select any activity whose start time is greater than or equal to $f_{current}$.
The algorithm terminates having constructed the subset $S = \{w_2, w_3, w_5, w_6\}$. By making the greedy, localized choice to leave the timeline as open as possible at every single step, we have successfully found the maximum possible cardinality of non-overlapping intervals, which in this case is $4$.
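As a sanity check, the pseudocode above can be sketched in Python (the function name and tuple encoding are my own). Run on the sorted table above, it recovers exactly $S = \{w_2, w_3, w_5, w_6\}$.

```python
def greedy_activity_selector(activities):
    """activities: list of (name, start, finish), pre-sorted by finish time."""
    selected = [activities[0]]            # greedy choice: earliest finish time
    last_finish = activities[0][2]
    for name, start, finish in activities[1:]:
        if start >= last_finish:          # compatible with the last selection
            selected.append((name, start, finish))
            last_finish = finish
    return selected

# The sorted example from the table above
acts = [("w2", 1, 2), ("w3", 3, 4), ("w4", 0, 6),
        ("w5", 5, 7), ("w6", 8, 9), ("w1", 5, 9)]
# [a[0] for a in greedy_activity_selector(acts)] → ["w2", "w3", "w5", "w6"]
```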
Huffman coding is an optimal data compression algorithm that relies heavily on greedy logic.
Given $n$ characters $a_1, a_2, \dots, a_n$ with specific frequencies of occurrence $f_1, f_2, \dots, f_n$, we want to find a set of variable-length binary codes $c_1, c_2, \dots, c_n$ for these characters to minimize the total cost (the size of the encoded file). We restrict ourselves to prefix codes: no code word can be a prefix of any other code word, which ensures unambiguous decoding.
The objective is to minimize the total cost function: $$ B(T) = \sum_{i=1}^{n} |c_i| \cdot f_i $$
Every prefix code can be represented by a binary tree. Each code corresponds to a path from the root down to a leaf. Taking a left branch represents a '0' bit, and taking a right branch represents a '1' bit. Therefore, each leaf is labeled with a character $a_i$. The length of the code word $|c_i|$ is exactly the depth of the leaf $d(a_i)$ in the tree.
Prefix Code Binary Tree Example
From the tree above, we derive the codes: $c_1 = 000$, $c_2 = 001$, $c_3 = 01$, $c_4 = 100$, $c_5 = 101$, $c_6 = 11$.
Lemma. (Full Binary Tree Requirement). An optimal prefix code binary tree must be full; that is, each internal node must have exactly two children.
Proof. Suppose an optimal tree contains an internal node $x$ that has only one child $y$. We could completely remove node $x$ from the tree, replacing it with $y$. This operation would reduce the depth of $y$ and all descendant leaves of $y$ by exactly 1. Because the depth represents the code length, the cost function $\sum d(a_i) \cdot f_i$ would strictly decrease. This contradicts the assumption that the original tree was optimal. Therefore, no internal node can have only one child in an optimal tree. $\quad \blacksquare$
The greedy logic of Huffman codes dictates that the characters with the lowest frequencies should be placed at the greatest depths of the tree. This is its exchange property.
Proof of Exchange Property. Suppose we have a tree where $f_i > f_j$, but their depths are inverted such that $d(a_i) > d(a_j)$. Let's calculate the change in cost if we exchange their positions in the tree.
The original cost contribution is: $(d(a_i) \cdot f_i + d(a_j) \cdot f_j)$.
The new cost contribution after swapping is: $(d(a_j) \cdot f_i + d(a_i) \cdot f_j)$.
The difference (Original - New) is:
$$ (d(a_i) \cdot f_i + d(a_j) \cdot f_j) - (d(a_j) \cdot f_i + d(a_i) \cdot f_j) $$
$$ = d(a_i)(f_i - f_j) - d(a_j)(f_i - f_j) $$
$$ = (d(a_i) - d(a_j))(f_i - f_j) $$
Because we assumed $d(a_i) > d(a_j)$ and $f_i > f_j$, both terms in the product are strictly positive, meaning the difference is strictly positive. This proves that swapping them strictly decreases the total cost of the tree. Consequently, the two characters with the absolute lowest frequencies must exist at the maximum depth of the tree, and we can safely make them sibling nodes. $\quad \blacksquare$
If we assign the weight of each internal node to be the sum of the weights of its two children, we can iteratively reduce the problem. If we remove two sibling leaves $x$ and $y$ and treat their parent $z$ as a new leaf character with frequency $f_z = f_x + f_y$, finding the optimal tree for this reduced set of leaves mathematically guarantees an optimal tree for the original set. This establishes the self-reducibility (and hence the optimal substructure) of the problem of Huffman Codes.
We construct the tree bottom-up. We place all $n$ characters into a Min-Priority Queue $Q$ keyed by their frequencies $f_i$. We repeatedly extract the two lowest frequencies, merge them into a parent node, and insert the parent back into $Q$.
HUFFMAN(C)
1: n ← |C|
2: Q ← C
3: for i ← 1 to n - 1 do
4: allocate a new node z
5: z.left ← x ← EXTRACT-MIN(Q)
6: z.right ← y ← EXTRACT-MIN(Q)
7: z.freq ← x.freq + y.freq
8: INSERT(Q, z)
9: return EXTRACT-MIN(Q) // Return the root of the tree
Because each of the $n-1$ merge operations requires exactly two EXTRACT-MIN calls and one INSERT call on a priority queue of size at most $n$, the algorithm runs in $O(n \log n)$ time.
Huffman Tree Bottom-Up Construction
At each step, the two nodes with the lowest frequencies (or summed frequencies) are merged.
The construction of this specific Huffman tree perfectly illustrates the greedy strategy combined with the self-reducible optimality we discussed earlier. The algorithm utilizes a min-priority queue to repeatedly identify and merge the two nodes with the lowest frequencies.
Let's trace the exact steps using the given initial frequencies: $f_1=0.1, f_2=0.2, f_3=0.3, f_4=0.3, f_5=0.1$.
By strictly adhering to the greedy choice—always merging the two lowest available probabilities—the algorithm naturally pushes the most frequent characters (like $f_3$ and $f_4$) closer to the root, giving them shorter binary codes, while pushing the least frequent characters (like $f_1$ and $f_5$) deeper into the tree, giving them longer binary codes. This localized greedy decision guarantees the globally optimal prefix code for data compression.
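The merge loop above can be sketched in Python using the standard-library `heapq` module as the min-priority queue (the function and its leaf-depth bookkeeping are my own). Run on the traced frequencies, it gives $f_3$ and $f_4$ the shortest codes and the total cost $B(T) = \sum |c_i| \cdot f_i = 2.2$ bits per symbol.

```python
import heapq
import itertools

def huffman_code_lengths(freqs):
    """Build a Huffman tree bottom-up with a min-priority queue and
    return the code length (leaf depth) for each input symbol index."""
    tiebreak = itertools.count()  # avoids comparing equal-frequency entries
    # Each heap entry: (frequency, tiebreak, [(symbol_index, depth), ...])
    heap = [(f, next(tiebreak), [(i, 0)]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)   # two lowest frequencies
        fy, _, y = heapq.heappop(heap)
        # Merging pushes every leaf under the new parent one level deeper
        merged = [(i, d + 1) for i, d in x + y]
        heapq.heappush(heap, (fx + fy, next(tiebreak), merged))
    return dict(heap[0][2])

lengths = huffman_code_lengths([0.1, 0.2, 0.3, 0.3, 0.1])
```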
In the standard activity-selection problem, our goal was to maximize the pure count of non-overlapping intervals. We accomplished this gracefully with a Greedy Algorithm based on the earliest finish time.
The Puzzle: Suppose each interval now has a specific, non-negative weight attached to it, and we want to find the non-overlapping subset that maximizes the total weight. Does the exchange property still hold? If not, can we find an efficient way to find an optimal solution?
Part 1: The Failure of the Exchange Property
The exchange property fundamentally breaks when weights are introduced. The greedy choice of picking the earliest finishing interval may lock us out of a massive weight that conflicts with it. Let's demonstrate this with a counterexample (based on the class lecture puzzle).
Suppose we have the following intervals:
If we sort by finish time, $I_5 = [0, 1)$ finishes first, so the greedy strategy immediately selects it and rejects every remaining interval that conflicts with its selections. A greedy chain might construct Solution 1: select $I_5=[0,1) \text{ (wt 1)}$ and $I_4=[1,4) \text{ (wt 5)}$, yielding a total weight of $1 + 5 = 6$.
However, consider Solution 2: select $I_1=[0,2) \text{ (wt 2)}$ and $I_3=[2,4) \text{ (wt 4)}$. These intervals do not overlap, and their total weight is $2 + 4 = 6$; the earliest-finish choice buys us nothing here, and a small change to the weights would make it strictly worse.
Even more drastically, consider a simpler counterexample: $A=[0,3)$ with weight $1$, and $B=[2, 100)$ with weight $1000$. A purely greedy earliest-finish approach picks $A$, blocking $B$, leaving you with a weight of $1$ instead of $1000$. The greedy exchange property completely collapses.
Part 2: The Dynamic Programming Solution
Because the greedy choice fails, we must rely on Dynamic Programming to evaluate the substructures. We sort the intervals by finish time $f_1 \le f_2 \le \dots \le f_n$.
For every interval $j$, we define $p(j)$ as the largest index $i < j$ such that interval $i$ is compatible with interval $j$ (i.e., $f_i \le s_j$). If no such interval exists, $p(j) = 0$.
Let $OPT(j)$ denote the maximum weight of any mutually compatible subset drawn from the first $j$ intervals. For any interval $j$, the optimal solution either includes interval $j$, or it does not.
This yields the Bellman recurrence relation:
$$ OPT(j) = \max \Big( w_j + OPT(p(j)), \; OPT(j-1) \Big) $$
Assume we are given $n$ activities, where each activity $i$ is represented as an object with start, finish, and weight attributes.
WEIGHTED-ACTIVITY-SELECTION(A)
1: n ← A.length
2: Sort A in monotonically increasing order by finish time (f1 ≤ f2 ≤ … ≤ fn)
3: // Precompute p[j] for each activity
4: let p[1..n] be a new array
5: for j ← 1 to n do
6: p[j] ← 0
7: // Use Binary Search to find the largest index i < j such that A[i].finish ≤ A[j].start
8: p[j] ← BINARY-SEARCH-COMPATIBLE(A, j)
9:
10: // Initialize DP table
11: let OPT[0..n] be a new array
12: OPT[0] ← 0
13: // Fill the DP table
14: for j ← 1 to n do
15: include_j ← A[j].weight + OPT[p[j]]
16: exclude_j ← OPT[j-1]
17: OPT[j] ← max(include_j, exclude_j)
18: return OPT[n]
If we also need to output the actual sequence of chosen activities (rather than just the maximum weight), we can backtrack through the OPT table.
FIND-SOLUTION(j, OPT, p, A)
1: if j == 0 then
2: return ∅
3: else if A[j].weight + OPT[p[j]] > OPT[j-1] then
4: // Activity j was included
5: return {A[j]} ∪ FIND-SOLUTION(p[j], OPT, p, A)
6: else
7: // Activity j was excluded
8: return FIND-SOLUTION(j-1, OPT, p, A)
Proof of Time Complexity. Sorting the $n$ intervals initially takes $O(n \log n)$ time. To calculate $p(j)$ for every interval, we do not need to scan linearly backward. Because the finish times are sorted, we can use binary search to find the latest valid $f_i \le s_j$ in $O(\log n)$ time per interval, resulting in $O(n \log n)$ total to compute the $p$ array.
Finally, computing the $OPT$ table takes $O(1)$ time for each of the $n$ states, taking $O(n)$ overall. Adding these phases together yields a total running time of $O(n \log n)$. This elegantly and efficiently solves the Weighted Activity Selection puzzle. $\quad \blacksquare$
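The whole pipeline (sort, binary-search $p$, fill the DP table, backtrack) fits in a short Python sketch using the standard `bisect` module; the function name and tuple encoding are my own.

```python
import bisect

def weighted_activity_selection(activities):
    """activities: list of (start, finish, weight) tuples.
    Returns (max_total_weight, list of chosen activities)."""
    acts = sorted(activities, key=lambda a: a[1])     # sort by finish time
    finishes = [a[1] for a in acts]
    n = len(acts)
    # p[j]: number of activities finishing no later than acts[j] starts,
    # i.e. the p(j) of the recurrence, found by binary search
    p = [bisect.bisect_right(finishes, acts[j][0], 0, j) for j in range(n)]
    opt = [0] * (n + 1)
    for j in range(1, n + 1):
        opt[j] = max(acts[j - 1][2] + opt[p[j - 1]],  # include activity j
                     opt[j - 1])                      # exclude activity j
    chosen, j = [], n                                 # backtrack (FIND-SOLUTION)
    while j > 0:
        if acts[j - 1][2] + opt[p[j - 1]] > opt[j - 1]:
            chosen.append(acts[j - 1])
            j = p[j - 1]
        else:
            j -= 1
    return opt[n], chosen[::-1]
```

On the drastic counterexample above ($A=[0,3)$ weight $1$, $B=[2,100)$ weight $1000$), this returns weight $1000$ and picks $B$, as it should.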
The following problems from CLRS provide practical applications of both Dynamic Programming and Greedy Algorithms. They seemed useful to include, so here they are.
Consider a modification of the rod-cutting problem in which, in addition to a price $p_i$ for each rod, each cut incurs a fixed cost of $c$. The revenue associated with a solution is now the sum of the prices of the pieces minus the costs of making the cuts. Give a dynamic-programming algorithm to solve this modified problem.
In the standard rod-cutting problem, the maximum revenue $r_n$ for a rod of length $n$ is defined by $r_n = \max_{1 \le i \le n} (p_i + r_{n-i})$. To account for the cost of cutting, we must subtract $c$ whenever a cut is actually made.
If we do not make any cuts (meaning we sell the rod of length $n$ whole), the revenue is simply $p_n$. If we make a cut at length $i$ (where $1 \le i \le n-1$), we receive the price $p_i$ for the cut piece, the optimal revenue $r_{n-i}$ for the remaining piece, and we must pay the cut cost $c$. Thus, the modified Bellman equation is:
$$ r_n = \max \left( p_n, \max_{1 \le i \le n-1} (p_i + r_{n-i} - c) \right) $$
Here is the bottom-up dynamic programming implementation:
MODIFIED-CUT-ROD(p, n, c)
1: let r[0..n] be a new array
2: r[0] = 0
3: for j = 1 to n do
4: q = p[j] // Initialize with the cost of NO cuts
5: for i = 1 to j - 1 do
6: q = max(q, p[i] + r[j - i] - c) // Include cut cost c
7: r[j] = q
8: return r[n]
Because there are two nested loops iterating proportionally to $n$, this algorithm runs in $\Theta(n^2)$ time, identical to the standard rod-cutting algorithm.
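A direct Python transcription of MODIFIED-CUT-ROD (the function name is my own; the price table in the test below is the standard CLRS example, with the cut cost $c$ as a parameter):

```python
def modified_cut_rod(p, n, c):
    """p[i] = price of a rod of length i (p[0] = 0), c = fixed cost per cut."""
    r = [0] * (n + 1)
    for j in range(1, n + 1):
        q = p[j]                               # sell length j whole: no cuts
        for i in range(1, j):
            q = max(q, p[i] + r[j - i] - c)    # cut off length i, pay c once
        r[j] = q
    return r[n]
```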
Choose a recurrence relation denoted by $T(n)$. Design a recursive and a bottom-up dynamic programming algorithm for $T(n)$. The bottom-up DP solution should be faster than the recursive solution.
Let us define a recurrence relation that models a path-finding or combinatorics sequence: $$ T(n) = \begin{cases} 0 & \text{if } n = 0 \\ 1 & \text{if } n = 1 \\ T(n-1) + 2 \cdot T(n-2) & \text{if } n \ge 2 \end{cases} $$
The naive approach directly translates the mathematical recurrence into a recursive function. Due to the branching factor, this approach recomputes the same subproblems exponentially many times, resulting in a time complexity of $O(2^n)$.
RECURSIVE-T(n)
1: if n == 0 return 0
2: if n == 1 return 1
3: return RECURSIVE-T(n - 1) + 2 * RECURSIVE-T(n - 2)
We can optimize this significantly by building an array iteratively from $0$ to $n$. Each state relies only on the two previously computed states, resolving in $O(1)$ time per step. The total time complexity is $O(n)$, making it exponentially faster than the recursive approach.
BOTTOM-UP-T(n)
1: if n == 0 return 0
2: let dp[0..n] be a new array
3: dp[0] = 0
4: dp[1] = 1
5: for i = 2 to n do
6: dp[i] = dp[i - 1] + 2 * dp[i - 2]
7: return dp[n]
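As a cross-check, this recurrence is the Jacobsthal sequence, which has the closed form $T(n) = (2^n - (-1)^n)/3$; the Python sketch below (function name mine) compares the $O(n)$ bottom-up table against it.

```python
def bottom_up_t(n):
    """O(n) bottom-up evaluation of T(n) = T(n-1) + 2*T(n-2)."""
    if n == 0:
        return 0
    dp = [0] * (n + 1)
    dp[1] = 1
    for i in range(2, n + 1):
        dp[i] = dp[i - 1] + 2 * dp[i - 2]
    return dp[n]
```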
Consider the antithetical variant of the matrix-chain multiplication problem where the goal is to parenthesize the sequence of matrices so as to maximize, rather than minimize, the number of scalar multiplications. Does this problem exhibit optimal substructure?
Yes, it does exhibit optimal substructure.
Proof. Suppose we have an optimally parenthesized sequence of matrices $A_1 A_2 \dots A_n$ that strictly maximizes the total number of scalar multiplications. Let the final matrix multiplication in this optimal sequence split the chain at some index $k$, such that we multiply the resulting matrices $(A_1 \dots A_k)$ and $(A_{k+1} \dots A_n)$.
For the global parenthesization to be maximal, the internal parenthesization of the subchain $A_1 \dots A_k$ must also independently maximize the number of scalar multiplications. We can prove this by contradiction: if there existed a different way to parenthesize $A_1 \dots A_k$ that yielded a strictly greater number of scalar multiplications, we could substitute that superior sub-configuration into our global chain. This substitution would increase the total global number of multiplications, directly contradicting our initial assumption that the original global parenthesization was maximal. Therefore, the maximal solution is inherently built from maximal sub-solutions. $\quad \blacksquare$
The modified recurrence relation is simply inverted using the $\max$ operator:
$$ m[i, j] = \max_{i \le k < j} \Big( m[i,k] + m[k+1,j] + p_{i-1}p_kp_j \Big) $$
A palindrome is a nonempty string over some alphabet that reads the same forward and backward. Give an efficient algorithm to find the longest palindrome that is a subsequence of a given input string. What is the running time of your algorithm?
We can solve this problem in $O(n^2)$ time using dynamic programming. Let $L[i, j]$ represent the length of the longest palindromic subsequence within the substring $A[i \dots j]$.
LONGEST-PALINDROME-SUBSEQUENCE(A)
1: n ← A.length
2: let L[1..n][1..n] be a new 2D array initialized to 0
3: for i = 1 to n do
4: L[i][i] = 1 // Base case: length 1 substrings
5: for len = 2 to n do
6: for i = 1 to n - len + 1 do
7: j = i + len - 1
8: if A[i] == A[j] and len == 2 then
9: L[i][j] = 2
10: elseif A[i] == A[j] then
11: L[i][j] = 2 + L[i+1][j-1]
12: else
13: L[i][j] = max(L[i+1][j], L[i][j-1])
14: return L[1][n]
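The pseudocode translates directly into Python (0-indexed; the function name is my own). Note that when the length is $2$, the interior entry $L[i+1][j-1]$ lies below the diagonal and is $0$, which is why the pseudocode special-cases it.

```python
def longest_palindrome_subsequence(a):
    """Length of the longest palindromic subsequence of string a in O(n^2)."""
    n = len(a)
    if n == 0:
        return 0
    L = [[0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 1                        # base case: single characters
    for length in range(2, n + 1):         # 'len' in the pseudocode
        for i in range(n - length + 1):
            j = i + length - 1
            if a[i] == a[j]:
                # the interior entry is 0 when length == 2 (empty interior)
                L[i][j] = 2 + (L[i + 1][j - 1] if length > 2 else 0)
            else:
                L[i][j] = max(L[i + 1][j], L[i][j - 1])
    return L[0][n - 1]
```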
Running Time: The algorithm utilizes two nested loops. The outer loop controls the sliding length of the substring (running $n-1$ times), and the inner loop iterates through the valid starting indices for that length (running $O(n)$ times). Because the internal operations take $O(1)$ time, the overall time complexity is tightly bound at $\Theta(n^2)$.
Proof of Time Complexity. Let $n$ be the length of the input string $A$. We can rigorously prove the $\Theta(n^2)$ time complexity by evaluating the exact number of operations executed by the algorithm:
- The first for loop iterates exactly $n$ times to set the diagonal elements. Each assignment L[i][i] = 1 takes constant $O(1)$ time, so this loop takes $\Theta(n)$ time overall.
- The outer loop variable len (representing the substring length) iterates from $2$ up to $n$.
- For each value of len, the inner loop variable i (representing the starting index) iterates from $1$ up to $n - len + 1$, so the nested loops execute $\sum_{len=2}^{n}(n - len + 1) = \frac{n(n-1)}{2} = \Theta(n^2)$ iterations in total.
- The body of the inner loop performs only constant-time comparisons, additions, and calls to the max() function. Since none of these invoke further loops or recursive calls, the entire inner block resolves in $O(1)$ constant time.
Adding the independent phases together yields the overall time complexity relation: $$ T(n) = \Theta(n) + \Theta(n^2) $$ By the rules of asymptotic notation, we drop the lower-order linear term, leaving a tight, definitive running time of $\Theta(n^2)$. $\quad \blacksquare$
Proof of Correctness (Optimal Substructure). We can prove the algorithm correctly finds the Longest Palindromic Subsequence (LPS) using strong induction on the substring length $len$. Let $L[i][j]$ denote the optimal LPS length for the substring $A[i \dots j]$.
Base case: for $len = 1$, the single character $A[i]$ is itself a palindrome, so $L[i][i] = 1$ is correct. Inductive step: assume $L[i'][j']$ is correct for every substring of length less than $len$, and consider $A[i \dots j]$ with $j - i + 1 = len$. If $A[i] = A[j]$, some optimal palindromic subsequence can be assumed to use both endpoints (if either endpoint were unused, we could append the matching pair to the ends without breaking the palindrome), so $L[i][j] = 2 + L[i+1][j-1]$. If $A[i] \neq A[j]$, no palindrome can use both endpoints, so the optimum lies entirely within $A[i+1 \dots j]$ or $A[i \dots j-1]$, giving $L[i][j] = \max(L[i+1][j], L[i][j-1])$. In every case the entries on the right-hand side correspond to strictly shorter substrings, which are correct by the inductive hypothesis.
Since the formula correctly calculates the maximum palindrome length for length $len$ assuming all smaller lengths are correct, and correctly establishes the base case, by the principle of mathematical induction, the final cell $L[1][n]$ is guaranteed to contain the maximum palindromic subsequence length for the entire string $A$. $\quad \blacksquare$
Not just any greedy approach to the activity-selection problem produces a maximum-size set of mutually compatible activities. Give an example to show that the approach of selecting the activity of least duration does not work. Do the same for the approaches of always selecting the compatible activity that overlaps the fewest other remaining activities, and always selecting the compatible remaining activity with the earliest start time.
If we greedily select the activity with the shortest duration, we might select an activity that spans the exact boundary of two other, non-overlapping activities.
In the figure above, the greedy choice picks $A_2$ because it is the shortest. However, $A_2$ overlaps with both $A_1$ and $A_3$, forcing the algorithm to discard them. The greedy algorithm yields a size of 1, whereas the optimal solution is $\{A_1, A_3\}$ with a size of 2.
If we select the activity that overlaps with the fewest other remaining activities, we can be tricked by clustering dense "dummy" activities at the edges of the timeline, making the optimal choices look artificially bad.
In this scenario, the optimal path consists of the 4 blue blocks at the top. However, the outer blue blocks overlap with a massive stack of grey "dummy" intervals, inflating their overlap count. The green interval in the center only overlaps with two blocks (the two middle blue ones). Thus, the greedy algorithm picks the green block first. This immediately destroys the optimal configuration. The greedy algorithm yields a size of 3 (the green block plus the two outermost blue blocks), while the optimal is 4.
If we pick the activity with the earliest start time, we may pick an activity that runs for a massive duration, entirely blocking out numerous other valid, shorter activities that start slightly later.
The greedy algorithm picks the massive red block simply because it starts at $t=0$. This forces the algorithm to reject the two smaller blue blocks. The greedy solution size is 1, while the optimal solution is 2.
We are about to get into Spanning Trees. Before that, it is important to prove some theorems that are useful to know.
Theorem 1. For a tree $T$, there is one and only one path between every pair of vertices.
Proof. Since $T$ is a connected graph, there exists at least one path between every pair of vertices. Suppose, for the sake of contradiction, that there exist two distinct paths between a pair of vertices $u$ and $v$. The union of these two distinct paths must contain a cycle. However, by definition, a tree is an acyclic graph. This is a contradiction. Therefore, the path between any two vertices must be unique. $\blacksquare$
Theorem 2. If a graph $G$ has one and only one path between every pair of vertices, then $G$ is a tree.
Proof. Because there is a path between every pair of vertices, $G$ is by definition connected. A cycle in a graph implies that there are at least two distinct paths between some pair of vertices on that cycle. Since we are given that $G$ has exactly one path between every pair of vertices, it cannot contain any cycles. A connected graph with no cycles is a tree. $\blacksquare$
Theorem 3. A tree with $n$ vertices has exactly $n-1$ edges.
Proof. We proceed by mathematical induction on the number of vertices $n$.
Base Cases: If $n=1$, the number of edges is $0$ ($1-1=0$). If $n=2$, the number of edges is $1$ ($2-1=1$). The statement holds.
Inductive Step: Assume the statement is true for any tree with $k$ vertices, meaning it has $k-1$ edges. We must prove it holds for a tree $T$ with $n = k+1$ vertices.
Let $e = (u,v)$ be any edge in $T$. Because $T$ is a tree, $e$ is the unique path between $u$ and $v$. Deleting $e$ disconnects $T$ into exactly two disjoint subtrees, say $T_1$ and $T_2$. Let $n_1$ and $n_2$ be the number of vertices in $T_1$ and $T_2$, respectively, where $n_1 + n_2 = k+1$.
Since $n_1 \le k$ and $n_2 \le k$, our inductive hypothesis applies to both subtrees. Therefore, $T_1$ has $n_1 - 1$ edges and $T_2$ has $n_2 - 1$ edges.
The total number of edges in our original tree $T$ is the edges in $T_1$, plus the edges in $T_2$, plus the edge $e$ we removed:
$$(n_1 - 1) + (n_2 - 1) + 1 = n_1 + n_2 - 1 = (k+1) - 1$$
Thus, a tree with $n = k+1$ vertices has $n-1$ edges. By induction, the theorem holds for all $n \ge 1$. $\blacksquare$
Theorem 4. Any connected graph $G$ with $n$ vertices and $n-1$ edges is a tree.
Proof. The minimum number of edges required to keep a graph of $n$ vertices connected is $n-1$. Because $G$ has exactly $n-1$ edges, removing even a single edge will disconnect the graph. A graph where the removal of any edge causes disconnection cannot contain a cycle (since removing an edge from a cycle does not disconnect a graph). Since $G$ is connected and acyclic, it is a tree. $\blacksquare$
Theorem 5. A graph $G$ with $n$ vertices, $n-1$ edges, and no cycles is a connected graph (and therefore a tree).
Proof. Suppose, for contradiction, that $G$ is disconnected and consists of $k$ distinct connected components $C_1, C_2, \dots, C_k$ where $k > 1$. Because $G$ has no cycles, each component $C_i$ is a tree. Let $v_i$ be the number of vertices in component $C_i$. By Theorem 3, each component has $v_i - 1$ edges.
The total number of edges in $G$ is the sum of the edges in all components:
$$\sum_{i=1}^{k} (v_i - 1) = \left(\sum_{i=1}^{k} v_i\right) - k = n - k$$
We are given that $G$ has $n-1$ edges. Therefore, $n - k = n - 1$, which implies $k = 1$. This contradicts our assumption that $k > 1$. Thus, $G$ must be connected. $\blacksquare$
Theorem 6. A graph $G$ is a tree if and only if it is minimally connected.
Proof.
($\Rightarrow$): Let $G$ be a tree. By definition, there is a unique path between any two vertices. Removing any edge severs the only path between its endpoints, disconnecting the graph. Thus, $G$ is minimally connected.
($\Leftarrow$): Let $G$ be minimally connected. This means removing any edge disconnects the graph. If $G$ contained a cycle, we could remove an edge from that cycle without disconnecting the graph. Therefore, $G$ must be acyclic. A connected, acyclic graph is a tree. $\blacksquare$
Theorem 7. Every tree with $n \ge 2$ vertices has at least two pendant vertices (leaves).
Proof. Let $T$ be a tree with $n \ge 2$ vertices. By Theorem 3, $T$ has $n-1$ edges. The Handshaking Lemma states that the sum of the degrees of all vertices is twice the number of edges:
$$\sum_{i=1}^{n} \deg(v_i) = 2(n-1) = 2n - 2$$
Since $T$ is connected and $n \ge 2$, no vertex can have a degree of $0$. Thus, $\deg(v_i) \ge 1$ for all vertices.
Suppose, for contradiction, that $T$ has fewer than two pendant vertices (meaning $1$ or $0$ vertices have a degree of $1$). This means at least $n-1$ vertices have a degree of $2$ or more.
If we sum the degrees under this assumption, the sum must be at least:
$$1(1) + 2(n-1) = 2n - 1$$
However, we know the sum must be exactly $2n - 2$. Because $2n - 1 > 2n - 2$, we have reached a contradiction. Therefore, $T$ must have at least two vertices of degree 1. $\blacksquare$
Theorem 8. Every tree has either one or two centers.
Proof. A center of a tree is a vertex that minimizes the maximum distance to all other vertices. The maximum distance from any vertex $v$ always occurs at a pendant vertex (leaf).
If we delete all pendant vertices from a tree $T$ (where $n \ge 3$), the resulting subgraph $T'$ is still a connected tree. Removing the leaves symmetrically reduces the maximum distance from the central nodes to the extremities by $1$. Therefore, $T$ and $T'$ share the exact same centers.
We can iteratively repeat this process of deleting pendant vertices ($T \to T' \to T'' \dots$). Because the graph is finite, this process must eventually terminate. It will terminate either when a single vertex remains (meaning the original tree had 1 center) or when a single edge connecting two vertices remains (meaning the original tree had 2 centers). $\blacksquare$
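The leaf-stripping argument in this proof is directly algorithmic; a short Python sketch (function name and representation mine) finds the center(s) of a tree on vertices $0, \dots, n-1$ by repeatedly deleting pendant vertices.

```python
from collections import defaultdict

def tree_centers(n, edges):
    """Find the 1 or 2 centers of a tree by repeatedly stripping leaves."""
    if n == 1:
        return [0]
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = n
    leaves = [v for v in range(n) if len(adj[v]) == 1]
    while remaining > 2:
        remaining -= len(leaves)
        new_leaves = []
        for leaf in leaves:
            nbr = adj[leaf].pop()          # a pendant vertex has one neighbor
            adj[nbr].discard(leaf)
            if len(adj[nbr]) == 1:         # the neighbor just became a leaf
                new_leaves.append(nbr)
        leaves = new_leaves
    return sorted(leaves)                  # one or two vertices remain
```

For a path on five vertices the process terminates at the single middle vertex; for a path on four vertices it terminates at the middle edge, i.e. two centers.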
Theorem 9. The maximum number of vertices at level $L$ in a binary tree is $2^L$, where $L \ge 0$.
Proof. We prove this by mathematical induction on the level $L$.
Base Case: At the root level ($L=0$), there is only $1$ vertex. The formula gives $2^0 = 1$. The base case holds.
Inductive Step: Assume the statement is true for level $L=k$, meaning the maximum number of vertices at level $k$ is $2^k$.
By the definition of a binary tree, every vertex at level $k$ can have at most $2$ children. These children make up level $k+1$. Therefore, the maximum number of vertices at level $k+1$ is twice the maximum number of vertices at level $k$:
$$2 \times 2^k = 2^{k+1}$$
By mathematical induction, the maximum number of vertices at any level $L$ is $2^L$. $\blacksquare$
Theorem 10 (Handshaking Lemma). Let $deg_G(v)$ denote the degree of vertex $v$ in a graph $G = (V, E)$. Then $$\sum deg_G(v) = 2|E|.$$ In other words, the sum of the values of $deg_G(v)$ taken over all $v \in V$ is equal to twice the number of edges.
Proof. Let $S$ denote the subset of $V \times E$ consisting of those pairs $(v, e)$ for which $v$ belongs to $e$, where $v \in V$ and $e \in E$. For each $v \in V$, the 'row total' $r_v(S)$ is the number of edges containing $v$, so it is equal to $deg_G(v)$. For each $e \in E$, the 'column total' $c_e(S)$ is the number of vertices in $e$, which is $2$ (exactly two vertices are needed to form an edge). Counting $|S|$ column by column, $$\sum deg_G(v) = 2 + 2 + \dots + 2 = 2|E|.\quad \blacksquare$$
The Handshaking Lemma gives a nice corollary, which is written and proved below.
Corollary. Every graph has an even number of vertices of odd degree.
Proof. The right hand side of the equation in Theorem 10 is even, so on the left hand side, we must have an even number of odd numbers. $\blacksquare$
Beyond simple shortest paths, another foundational problem in network design is the Minimum Cost Spanning Tree (MST) problem. This problem arises naturally in various physical and logical architectures. For instance, consider a communications company (like AT&T) that needs to build a network connecting $n$ different users. If the cost of making a direct link between user $i$ and user $j$ is $c_{ij}$, the goal is to find the minimum total cost of connecting all users together. A common assumption is that the only links possible are those directly joining two nodes.
A similar scenario occurs in electronic circuitry, where pins of different components must be made electrically equivalent by connecting them with wires, and the objective is to minimize the total wire length.
We can also view the Traveling Salesman Problem (TSP) through the lens of spanning trees. TSP asks for a minimum cost tour linking $n$ cities. A valid TSP tour can be formulated as a spanning tree (specifically a path) plus exactly one additional edge closing the loop, subject to the strict constraint that every node has a degree of exactly two.
Let $G = (N, A)$ be an undirected network where $N$ is the set of nodes and $A$ is the set of arcs (edges). Because it is undirected, arc $(i, j)$ is identical to arc $(j, i)$. We associate a cost $c_{ij}$ with each arc $(i,j) \in A$.
A spanning tree $T$ of $G$ is defined as a connected, acyclic subgraph that spans (includes) all the nodes in $N$. A fundamental property of graph theory is that any connected graph with $n$ nodes and exactly $n - 1$ arcs is guaranteed to be a spanning tree. The minimum cost spanning tree problem asks us to find a spanning tree $T^*$ such that the sum of its arc costs is minimized.
Proposition (Tree Characterization Theorem). If $G = (V, E)$ is a connected undirected graph with $n$ vertices and exactly $n-1$ edges, then $G$ is a spanning tree.
Proof. To prove that $G$ is a spanning tree, we must show that $G$ is connected, acyclic, and includes all vertices. We are already given that $G$ is connected and spans all $n$ vertices in $V$. Therefore, we only need to prove that $G$ contains no cycles.
We proceed by contradiction. Assume that $G$ contains at least one cycle.
If we remove a single edge from a cycle in $G$, the graph remains connected. We can continue this process, removing one edge at a time from any existing cycles, until no cycles remain. The resulting subgraph, which we will call $T = (V, E')$, is both connected and acyclic. By definition, $T$ is a tree.
A fundamental, inductively proven property of any tree with $n$ vertices is that it contains exactly $n-1$ edges. Therefore, our newly formed tree $T$ must have $|E'| = n-1$ edges.
However, we established that our original graph $G$ started with exactly $n-1$ edges. Since we assumed $G$ had a cycle, we had to remove at least one edge from $G$ to form $T$. This implies that the number of edges in $T$ must be strictly less than the number of edges in $G$, meaning $|E'| < n-1$.
We have reached a contradiction: $|E'|$ cannot simultaneously be equal to $n-1$ and strictly less than $n-1$. Therefore, our initial assumption must be false, meaning $G$ cannot contain any cycles. $G$ is connected, acyclic, and contains all $n$ vertices of the original graph, $G$ is a spanning tree. $\blacksquare$
In the figure below, we see that in the connected graph, a spanning tree of minimum cost was found in red.
Let $T^*$ be a spanning tree. We refer to the arcs that belong to $T^*$ as tree arcs, and the arcs that belong to the original graph $G$ but not to $T^*$ as nontree arcs. In the figure below, the blue lines are all non-tree arcs.
The entire theoretical foundation of MST algorithms rests on how trees interact with cycles and cuts.
Lemma (Fundamental Cycle). Let $T$ be any spanning tree. Let $a$ be any nontree arc (an arc not in $T$). Then adding $a$ to $T$ creates exactly one cycle $C$. Deleting any arc of $C$ breaks the cycle and restores the graph to a valid spanning tree $T'$.
Proof. We can break this proof into two distinct parts: showing the creation of a unique cycle, and showing the restoration of the spanning tree. Let $G = (V, E)$ be the original graph, where $T$ is a spanning tree of $G$. First, by definition, a spanning tree $T$ is a connected, acyclic subgraph containing all vertices in $V$. Because $T$ is connected, there exists a path between any two vertices in $T$. Furthermore, because $T$ is acyclic, this path is strictly unique. Let the nontree arc be $a = (u, v)$. Since $T$ is a spanning tree, there already exists a unique path $P$ between node $u$ and node $v$ entirely within $T$. When we add the arc $a$ to $T$, we connect $u$ and $v$ directly, closing a loop with the existing path $P$. This combination $P \cup \{a\}$ forms a cycle $C$. Because the path $P$ in $T$ was unique, the resulting cycle $C$ is also unique.
To prove the restoration of the spanning tree, let the new graph be $T_{temp} = T \cup \{a\}$. $T_{temp}$ contains exactly one cycle ($C$). Let $e$ be any arc belonging to the cycle $C$. If we remove $e$, we form a new graph $T' = T_{temp} \setminus \{e\}$. Since $e$ was part of a cycle, removing it breaks that specific cycle. Because $T_{temp}$ only had one cycle, $T'$ is now acyclic. We must also ensure $T'$ remains connected. Suppose we want to travel between two arbitrary nodes $x$ and $y$. If the original path between $x$ and $y$ in $T$ did not use the removed edge $e$, that path still exists in $T'$. If the path did use $e$, we can route around the missing edge by taking the remaining path around the broken cycle $C \setminus \{e\}$. Thus, connectivity is preserved. A connected, acyclic subgraph spanning all vertices is a spanning tree. Therefore, $T'$ is a valid spanning tree. $\blacksquare$
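As a small concrete illustration of the lemma (a Python sketch, not part of the lecture material): since the path between the endpoints of a nontree arc is unique in a tree, the fundamental cycle is simply that path closed by the arc itself.

```python
from collections import defaultdict

def tree_path(tree_edges, u, v):
    """Return the unique path from u to v in a tree given by its edge list."""
    adj = defaultdict(list)
    for a, b in tree_edges:
        adj[a].append(b)
        adj[b].append(a)
    # Iterative DFS recording parents; the tree is acyclic, so the path is unique.
    parent = {u: None}
    stack = [u]
    while stack:
        x = stack.pop()
        if x == v:
            break
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                stack.append(y)
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]

# A spanning tree on 4 nodes, plus the hypothetical nontree arc (1, 3):
T = [(0, 1), (1, 2), (2, 3)]
cycle = tree_path(T, 1, 3) + [1]   # close the loop with the nontree arc
print(cycle)  # [1, 2, 3, 1] -- the unique fundamental cycle
```

Deleting any of the three arcs on this cycle from $T \cup \{(1,3)\}$ leaves a spanning tree again, as the lemma asserts.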
Lemma (Fundamental Cut). If $(i, j)$ is a tree arc, removing it partitions the tree $T^* - (i,j)$ into exactly two disjoint components, with node sets $S$ and $N - S$. The set of all arcs in the original graph $G$ that connect a node in $S$ to a node in $N - S$ constitutes a cut (or cutset). For instance, suppose $(i,j) = (4, 5)$.
Proof. Let $T$ be a spanning tree of a graph $G = (N, A)$, where $N$ is the set of nodes and $A$ is the set of arcs. By definition, a spanning tree $T$ is minimally connected, meaning that for any edge $e$ in $T$, removing $e$ will disconnect the graph. Let $e = (i, j)$ be an arc in $T$. In the tree $T$, the edge $(i, j)$ is the strictly unique path between node $i$ and node $j$ (if there were another path, $T$ would contain a cycle). When we remove $(i, j)$ to form $T \setminus \{(i, j)\}$, there is no longer any path between $i$ and $j$, disconnecting the tree. Because $T$ was connected without any cycles, removing a single edge splits the vertices into exactly two connected components.
Let $S$ be the set of nodes reachable from $i$ in $T \setminus \{(i, j)\}$. Let $N \setminus S$ be the remaining nodes, which must be exactly the set of nodes reachable from $j$. These two sets are disjoint ($S \cap (N \setminus S) = \emptyset$) and their union contains all nodes in the graph ($S \cup (N \setminus S) = N$). In graph theory, a cut is formally defined as a partition of the vertices of a graph into two disjoint subsets. Since we have established that $S$ and $N \setminus S$ form a valid partition of the node set $N$, the set of all arcs in the original graph $G$ that have one endpoint in $S$ and the other endpoint in $N \setminus S$ exactly satisfies the definition of a cutset. Note that $(i, j)$ is the only arc from the spanning tree $T$ that belongs to this specific cutset; all other arcs in this cutset must be nontree arcs. $\blacksquare$
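The construction in the proof can be sketched directly in Python (an illustrative sketch, assuming edges are given as vertex pairs): remove the tree arc, flood-fill from $i$ to obtain $S$, then collect every graph arc with exactly one endpoint in $S$.

```python
from collections import defaultdict

def fundamental_cutset(graph_edges, tree_edges, cut_edge):
    """Arcs of G crossing the two components of T minus cut_edge."""
    i, j = cut_edge
    adj = defaultdict(list)
    for a, b in tree_edges:
        if (a, b) != cut_edge and (b, a) != cut_edge:
            adj[a].append(b)
            adj[b].append(a)
    # S = nodes reachable from i once (i, j) is removed from the tree.
    S, stack = {i}, [i]
    while stack:
        for y in adj[stack.pop()]:
            if y not in S:
                S.add(y)
                stack.append(y)
    # An arc is in the cutset iff exactly one endpoint lies in S.
    return [(a, b) for a, b in graph_edges if (a in S) != (b in S)]

# Square graph 0-1-2-3-0 with spanning tree 0-1-2-3; delete tree arc (1, 2):
cut = fundamental_cutset([(0, 1), (1, 2), (2, 3), (3, 0)],
                         [(0, 1), (1, 2), (2, 3)], (1, 2))
print(cut)  # [(1, 2), (3, 0)]
```

Note that, exactly as the proof observes, $(1,2)$ is the only tree arc in the resulting cutset; $(3,0)$ is a nontree arc.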
Proposition. A spanning tree $T$ is a minimum spanning tree if and only if for any tree edge $(u,v)$ there exists a cut such that $(u,v)$ is the shortest edge in the cut.
This is false! The flaw in the proposition lies in the phrase "there exists a cut." Just because an edge is the minimum-weight edge in some arbitrary cut doesn't guarantee that those edges can be combined to form a globally optimal spanning tree. The cut cannot simply be any cut, it must be deliberate: it must be a fundamental cut.
Counterexample. Consider a complete graph on $4$ vertices: $A, B, C,$ and $D$. Let the inner edges $(A,C)$ and $(B,D)$ have a weight of $1$. Let the outer edges $(A,B), (B,C), (C,D),$ and $(D,A)$ have a weight of $2$.
Assume a spanning tree $T$ formed by the simple path $A - B - C - D$. The edges of this tree are $(A,B), (B,C),$ and $(C,D)$. The total weight of $T$ is $2 + 2 + 2 = 6$.
Now, consider the cut that separates the vertices into two sets: $\{A, C\}$ and $\{B, D\}$. The edges crossing this cut are $(A,B), (A,D), (B,C),$ and $(C,D)$. The cheaper edges $(A,C)$ and $(B,D)$ connect vertices within the same sets, so they do not cross this cut.
Notice that every single edge crossing this cut has a weight of $2$. Because all crossing edges have the same minimum weight of $2$, the tree edges $(A,B), (B,C),$ and $(C,D)$ are all tied as the "shortest" edge in this cut. Therefore, for every edge in $T$, there exists a cut where it is the minimum-weight edge.
However, the true Minimum Spanning Tree for this graph uses the two weight $1$ edges and one weight $2$ edge (for example, $(A,C), (B,D),$ and $(A,B)$). That tree has a total weight of $1 + 1 + 2 = 4$. Because $6 > 4$, $T$ is strictly heavier than the optimal tree, meaning it is not an MST. $\blacksquare$
We must make clear that the problem of the MST has certain conditions for cut optimality. The Cut Optimality Condition is defined thusly. For every tree arc $(i,j) \in T^*$, its cost $c_{ij}$ must be less than or equal to the cost $c_{kl}$ of every non-tree arc $(k,l)$ contained in the cut formed by deleting $(i,j)$ from the tree. Mathematically, $$c_{ij} \leq c_{kl}.$$ That "cut formed by deleting $(i,j)$ from the tree" is exactly the Fundamental Cut because that tree edge must be the most efficient way to bridge those two specific components. Otherwise, it is not fundamental! Removing $(i,j)$ guarantees the creation of exactly two components.
Theorem (Cut Optimality). A spanning tree $T^*$ is a minimum spanning tree if and only if it satisfies the cut optimality conditions.
Proof.
($\Rightarrow$): Suppose $T^*$ is a minimum spanning tree, but the cut optimality condition is violated. This means there exists a tree arc $(i,j) \in T^*$ and some nontree arc $(k,l)$ in the cutset induced by deleting $(i,j)$, such that $c_{ij} > c_{kl}$. If we delete $(i,j)$, the tree splits into two components. Because $(k,l)$ crosses this exact same cut, adding $(k,l)$ reconnects the two components, creating a new valid spanning tree $T' = T^* - (i,j) + (k,l)$. The cost of this new tree is $Cost(T^*) - c_{ij} + c_{kl}$. Since $c_{ij} > c_{kl}$, the new tree $T'$ has a strictly lower cost than $T^*$, which contradicts the premise that $T^*$ was optimal. Thus, the condition must hold.
($\Leftarrow$): Suppose $T^*$ satisfies the cut optimality conditions, and let $T$ be a minimum spanning tree sharing the maximum possible number of arcs with $T^*$. If $T = T^*$, we are done. Otherwise, pick a tree arc $(i,j) \in T^* \setminus T$. Deleting $(i,j)$ from $T^*$ induces a fundamental cut, and the unique path in $T$ between $i$ and $j$ must cross this cut via some arc $(k,l)$. Since $(i,j)$ is the only arc of $T^*$ in this cutset, $(k,l) \notin T^*$, and by cut optimality $c_{ij} \le c_{kl}$. Then $T' = T - (k,l) + (i,j)$ is a spanning tree with $Cost(T') \le Cost(T)$, so $T'$ is also a minimum spanning tree, yet it shares one more arc with $T^*$ than $T$ does, contradicting the choice of $T$. Hence $T^* = T$ is a minimum spanning tree. $\blacksquare$
The Path Optimality Condition, also known as the Cycle Optimality Condition, is a counterpart to cut optimality. For every nontree arc $(k,l)$ of $G$, its cost $c_{kl}$ must be greater than or equal to the cost $c_{ij}$ of every tree arc $(i,j)$ contained in the unique path of $T^*$ connecting nodes $k$ and $l$. Mathematically, $$c_{ij} \leq c_{kl}.$$ While the Cut Optimality Condition focuses on why a tree edge is good enough to be in the MST, the Path Optimality Condition explains why a non-tree edge was too expensive.
Think about it. Let $e$ be a nontree arc and let $P$ be the unique tree path between its endpoints, so that $P \cup \{e\}$ forms a cycle. For a tree $T^*$ to be an MST, the arc $e$ must be a maximum weight edge on that cycle. If there were any edge $f \in P$ such that $c_f > c_e$, then we could replace $f$ with $e$, because clearly $e$ is cheaper. That produces an entirely new spanning tree with a smaller total weight. This means $T^*$ was not the minimum!
Theorem (Path Optimality). A spanning tree $T^*$ is a minimum spanning tree if and only if it satisfies the path optimality conditions.
Proof. Suppose the path optimality conditions are not satisfied. This means there is a nontree arc $(k,l)$ and a tree arc $(i,j)$ on the path between $k$ and $l$ such that $c_{ij} > c_{kl}$. If we add $(k,l)$ to the tree, it forms a cycle. If we then remove $(i,j)$ to break the cycle, we get a new spanning tree $T' = T^* - (i,j) + (k,l)$. Because $c_{ij} > c_{kl}$, this new tree has a strictly lower cost than $T^*$, contradicting the optimality of $T^*$. $\quad \blacksquare$
Let $(k,l)$ be an arc not in the tree $T^*$. Observe that a tree arc $(i,j)$ is on the fundamental cycle created by adding $(k,l)$ if and only if the nontree arc $(k,l)$ is in the fundamental cutset induced by deleting $(i,j)$. Because these two structural definitions are perfectly symmetric, the path optimality conditions are mathematically equivalent to the cut optimality conditions.
Before exploring the specific algorithms, we must define the properties that allow greedy strategies to succeed on the MST problem.
Exchange Property. For an edge $e$ with the absolute smallest weight in a graph $G$, and any minimum spanning tree $T$ that does not contain $e$, there must exist some edge $e'$ in $T$ such that $(T \setminus \{e'\}) \cup \{e\}$ is still a valid minimum spanning tree. This implies that the globally cheapest edge can always safely be included in an optimal solution.
Proof of Exchange Property. Let $T$ be a minimum spanning tree of $G$ that does not contain the globally minimum edge $e$.
$T$ is a spanning tree, so by the definition of spanning trees, it is connected and acyclic. Ergo, adding the edge $e$ to $T$ must create exactly one fundamental cycle; call it $C$. Since $C$ is a cycle containing $e$, it must contain at least one other edge $e'$ that currently belongs to $T$.
$e$ is defined as the edge with the absolute smallest weight in the entire graph $G$, which means we know mathematically that $w(e) \le w(e')$.
If we remove $e'$ from $T$ and add $e$, we break the cycle $C$ and restore the acyclic, connected property of a spanning tree. Let this new tree be $T^* = (T \setminus \{e'\}) \cup \{e\}$.
The total weight of $T^*$ is: $$ w(T^*) = w(T) - w(e') + w(e) $$
$w(e) \le w(e')$, so it logically follows that $w(T^*) \le w(T)$. However, because $T$ is already defined as a minimum spanning tree, no valid spanning tree can have a weight strictly less than $w(T)$. Therefore, $w(T^*) = w(T)$, proving that $T^*$ is also a valid minimum spanning tree. $\quad \blacksquare$
Self-Reducibility: Suppose $T$ is an MST of graph $G$, and $e$ is an edge known to be in $T$. Let $G'$ and $T'$ be the graph and tree obtained by "shrinking" (contracting) edge $e$ into a single point. Then $T'$ is guaranteed to be a minimum spanning tree of the contracted graph $G'$. This optimal substructure allows us to iteratively build the tree and reduce the problem size.
Proof of Self-Reducibility. Let $w(T)$ be the total weight of the MST $T$. Because $T'$ is obtained by contracting the edge $e$, the weight of the contracted tree is exactly the weight of the original tree minus the weight of $e$: $$ w(T') = w(T) - w(e) $$
We can prove this by contradiction. Suppose $T'$ is not a minimum spanning tree of $G'$. Then there must exist some other valid spanning tree $T''$ in $G'$ that is strictly cheaper, such that $w(T'') < w(T')$.
If we take this hypothetical better tree $T''$ and "un-shrink" the contracted node back into the original edge $e$, we construct a new valid spanning tree for the original graph $G$. Let's call it $T^* = T'' \cup \{e\}$.
The weight of this newly constructed tree $T^*$ would be: $$ w(T^*) = w(T'') + w(e) $$
Substituting our assumption that $w(T'') < w(T')$, we get: $$ w(T^*) < w(T') + w(e) $$
Now, since $w(T') + w(e) = w(T)$, this implies: $$ w(T^*) < w(T) $$
This directly contradicts our initial, foundational premise that $T$ is a minimum spanning tree of $G$. Since $w(T^*)$ cannot possibly be less than the minimum weight $w(T)$, our assumption must be false. Therefore, $T'$ must indeed be an optimal minimum spanning tree of $G'$. $\quad \blacksquare$
Soon, we will examine two algorithms based on something called the Cut Property.
Kruskal's algorithm is a classic example of The Greedy Algorithm in action, building a minimum spanning tree (MST) by making locally optimal choices. The visual animation of the algorithm demonstrates how it iteratively evaluates edges from the lowest weight to the highest.
KRUSKAL(V, E, w)
1: Order the arcs a1, a2, ..., am in non-decreasing order of cost
2: T = ∅
3: for i = 1 to m do
4: if T ∪ {ai} does not have a cycle then
5: T = T ∪ {ai}
6: return T
To understand why this approach works, we must look at its Correctness. An edge $(l,k)$ is not selected if and only if the selected edges already form a path connecting $l$ and $k$. Because we process edges in non-decreasing order of weight, those selected edges were considered before $(l,k)$ and hence each has cost at most $c_{lk}$. Therefore, the path-optimality condition holds. This guarantees that we never reject an edge that is strictly required to keep the overall cost of the tree at an absolute minimum. We will prove this using mathematical induction on the number of edges added to the growing forest.
Proof. Let $F$ be the set of edges selected by the algorithm at any intermediate step. Our inductive hypothesis is that $F$ is always a subset of some minimum spanning tree $T$. The base case holds trivially, as an empty set of edges is a subset of any MST. Now, assume the algorithm considers the next edge $e=(u,v)$ in non-decreasing order of weight. If adding $e$ to $F$ creates a cycle, the algorithm correctly rejects it. This is justified because all edges currently in $F$ were processed before $e$ and thus have weights less than or equal to $w(e)$. Consequently, $e$ is the heaviest edge on the newly formed cycle. By the established cycle property of graphs, the uniquely heaviest edge on any cycle cannot belong to an MST. Conversely, if adding $e$ does not create a cycle, we must prove $F \cup \{e\}$ remains a subset of an MST. If $e \in T$, our inductive step is immediately satisfied. If $e \notin T$, adding $e$ to $T$ will create a cycle $C$. Because $e$ connects two distinct trees in the forest $F$, there must be another edge $e'$ on cycle $C$ that also bridges these two trees but is not in $F$. Since Kruskal's algorithm evaluates edges in sorted order, and $e'$ was not selected prior to $e$ (despite not forming a cycle in $F$), it must be that $w(e) \le w(e')$. By swapping $e'$ with $e$ in $T$, we construct a new tree $T'$ with weight $w(T') = w(T) - w(e') + w(e) \le w(T)$. Since $T$ is an MST, $T'$ must also be an MST, and it now contains $F \cup \{e\}$, completing the inductive proof. $\blacksquare$
The algorithm's mathematical foundation relies heavily on the Exchange Property. For an edge e with the smallest weight in a graph G and a minimum spanning tree T without e, there must exist an edge e' in T such that $(T\setminus e') \cup \{e\}$ is still a minimum spanning tree. This property proves that making a greedy, locally optimal choice does not prevent us from arriving at the globally optimal spanning tree.
Proof. Let $G=(V,E)$ be a connected, undirected graph with a weight function $w$ assigning real numbers to its edges. Let $e$ be an edge of strictly minimum weight in $G$. We want to prove that if $T$ is a minimum spanning tree (MST) of $G$ that does not contain $e$, there must exist an edge $e'$ in $T$ such that $T' = (T \setminus \{e'\}) \cup \{e\}$ is also a minimum spanning tree. Because $T$ is a spanning tree, it is connected and acyclic. Adding the edge $e = (u,v)$ to $T$ must therefore create exactly one simple cycle, which we will call $C$. Since $e$ is not in $T$, there must be at least one other edge on this cycle, say $e'$, that connects the remaining components if removed. Because $e$ is an edge of minimum weight in the entire graph $G$, the weight of $e$ must be less than or equal to the weight of $e'$, meaning $w(e) \le w(e')$. By removing $e'$ and inserting $e$, we break the cycle and re-establish a spanning tree, yielding $T' = (T \setminus \{e'\}) \cup \{e\}$. The total weight of this new tree is $w(T') = w(T) - w(e') + w(e)$. Because $w(e) \le w(e')$, it follows algebraically that $w(T') \le w(T)$. However, $T$ was defined as a minimum spanning tree, meaning no spanning tree can have a weight strictly less than $w(T)$. Therefore, $w(T')$ must exactly equal $w(T)$, proving that $T'$ is also a valid minimum spanning tree. $\blacksquare$
Furthermore, the graph exhibits Self-Reducibility. Suppose T is a minimum spanning tree of a graph G and e is an edge of T. Let G' and T' be obtained from G and T, respectively by shrinking e into a point. Then T' is a minimum spanning tree of G'. This recursive structure allows the algorithm to safely treat merged components as single, unified nodes as it progresses.
Theorem. If $T$ is an MST of $G$, and $e$ is an edge belonging to $T$, then contracting $e$ yields a new graph $G'$ for which $T' = T \setminus \{e\}$ is a minimum spanning tree.
Proof. We will prove this by contradiction. Let the total weight of the original MST be denoted as $w(T)$. By definition of the contraction, the weight of the tree is the sum of its parts, giving us $w(T) = w(T') + w(e)$. Suppose, for the sake of contradiction, that $T'$ is not a minimum spanning tree of the contracted graph $G'$. This implies there exists some other valid spanning tree $T''$ in $G'$ that has a strictly lower total weight, meaning $w(T'') < w(T')$. If we take $T''$ and expand the contracted vertex back into the original edge $e$, the union of edges $T'' \cup \{e\}$ will form a valid spanning tree in the original uncontracted graph $G$. The total weight of this newly formed spanning tree would be exactly $w(T'') + w(e)$. Because we assumed $w(T'') < w(T')$, we can add $w(e)$ to both sides of the inequality to yield $w(T'') + w(e) < w(T') + w(e)$. Substituting our earlier definition, this simplifies to $w(T'' \cup \{e\}) < w(T)$. This directly contradicts our initial, foundational premise that $T$ is a minimum spanning tree of $G$, as we have supposedly found a tree with a smaller weight. Therefore, our assumption must be false: no such tree $T''$ can exist, and $T'$ is undeniably a minimum spanning tree of $G'$. $\blacksquare$
When looking at the Running Time Analysis, we break the process down into phases. First, sort the arc costs. Depending on the sorting algorithm used, this takes $O(\min (m \log n, m \log C))$ time, where $C$ is the maximum edge weight (leveraging non-comparative sorts like Radix sort when applicable). The repeated step is to determine if T + ai has a cycle. When arc (j,k) is considered, we want to know if j and k are in the same connected component. If they are not, they result in two components being merged. To do this efficiently, we need a very simple way of keeping track of components.
We can manage the connected components of the nodes in the graph using a simple, yet fairly efficient data structure. Store each component $C$ as a set. Then, let First(C) denote the first node in the component $C$, acting as the set's representative element. For each element $j$ in a component $C$, let First(j) = First(C) = first node of $C$. Remark: adding $(i,j)$ to a forest $F$ creates a cycle if and only if $i$ and $j$ are in the same component; i.e., First(i) = First(j). To optimize the merging process, we employ a union-by-size heuristic. When merging components $C$ and $D$, put the larger component first: if $|C| > |D|$, set First(C ∪ D) := First(C).
This introduces the formal class of data structures known as Disjoint Sets. One way to implement this is via a linked-list representation of disjoint sets $\{S_{1},S_{2},...,S_{k}\}$, where each node has a direct pointer back to the head. The structure must support three primary operations. Make-Set(x) creates a new set containing only $x$. Union(x, y) merges the sets containing $x$ and $y$, say $S_{x}$ and $S_{y}$, into a single set $S_{x}\cup S_{y}$. Finally, Find-Set(x) returns a pointer to the representative of $S_{x}$, allowing us to check component membership in constant time.
Here is the formalized Kruskal Algorithm with Data Structure integrated together:
MST-KRUSKAL(G, w)
1: A ← ∅;
2: for each node v ∈ V(G)
3: Make-Set(v);
4: sort the edges of G into nondecreasing order by w;
5: for each (u,v) ∈ E(G) in the ordering
6: if Find-Set(u) ≠ Find-Set(v) then
7: A ← A ∪ {(u,v)}
8: Union(u, v);
9: return A
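The MST-KRUSKAL pseudocode above can be sketched as runnable Python. This is an illustrative sketch, not the course's reference implementation: the union-by-size merge mirrors the First(·) bookkeeping described earlier, and path compression is added as a common extra optimization.

```python
def kruskal(n, edges):
    """MST via Kruskal; edges is a list of (weight, u, v) with nodes 0..n-1."""
    parent = list(range(n))   # disjoint-set forest: parent[x] == x at a root
    size = [1] * n

    def find(x):              # Find-Set with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):          # union by size: attach smaller root under larger
        rx, ry = find(x), find(y)
        if rx == ry:
            return False      # same component -> adding (x, y) makes a cycle
        if size[rx] < size[ry]:
            rx, ry = ry, rx
        parent[ry] = rx
        size[rx] += size[ry]
        return True

    mst = []
    for w, u, v in sorted(edges):   # nondecreasing order of cost
        if union(u, v):
            mst.append((u, v, w))
    return mst

# The 4-vertex example from the false proposition: inner edges weight 1,
# outer edges weight 2. The MST should have total weight 1 + 1 + 2 = 4.
edges = [(1, 0, 2), (1, 1, 3), (2, 0, 1), (2, 1, 2), (2, 2, 3), (2, 3, 0)]
print(kruskal(4, edges))
```

Each accepted edge is exactly an edge for which Find-Set(u) ≠ Find-Set(v), matching line 6 of the pseudocode.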
In the final Analysis of the Running Time, we evaluate the cost of our set operations. Time to determine if First(i) = First(j) for $i,j$: $O(1)$ per arc, $O(m)$ in total. Time to merge components $S$ and $T$, assuming $|S|\ge|T|$: $O(1)$ per node of $T$ (the smaller side). Because we always merge the smaller set into the larger one, each node $i$ is on the smaller side of a merge at most $\log n$ times (because the number of nodes in the component containing $i$ at least doubles when it is merged). Therefore, the total time spent merging is $O(n \log n)$. Combining the sorting time with the disjoint-set operations gives a total running time of $O(\min(m \log n, m \log C) + n \log n)$.
Prim's Algorithm is a Dijkstra-like algorithm that builds a single, contiguous tree outward from an arbitrary starting node. It maintains a set $S$ of nodes already incorporated into the tree, and at each step, it greedily selects the minimum cost arc that bridges the cut from $S$ to the unvisited nodes $N - S$.
PRIM(V, E, w)
1: S = {1} // Start with an arbitrary node
2: T = ∅
3: while S ≠ N do
4: Find the minimum cost arc (i, j) from S to N - S
5: Add j to S
6: Add (i, j) to T
7: return T
Remark. This algorithm builds a minimum cost spanning tree from the first node. This inherently guarantees that the cut optimality conditions are strictly satisfied at the end.
As Prim's begins, the minimum cost arc from yellow nodes to green nodes can be found by placing arc values in a priority queue.
Throughout its execution, Prim's algorithm maintains a three-part loop invariant: (1) the arcs in $T$ connect exactly the vertices of $S$; (2) $T$ is acyclic, so it forms a spanning tree of $S$; and (3) $T$ is a subset of some minimum spanning tree of $G$.
It is easy to take as a given that Prim's Algorithm is correct, but we need a proof of correctness. We will use induction to show that the set of edges selected by the algorithm at any point is a subset of some MST.
Proof. Base Case: The algorithm starts with a single arbitrary vertex and no edges, which is a trivial subset of any existing MST. Inductive Step: Assume that after $k$ iterations, the set of edges $U_k$ is a subset of some MST, $T$. In the next iteration, Prim's algorithm adds a new edge, $e$, which is the minimum-cost edge crossing the cut between the current set of vertices in the tree ($S$) and the remaining vertices ($V \setminus S$). If $e \in T$, then $U_k \cup \{e\} \subseteq T$ and the invariant holds. Otherwise, adding $e$ to $T$ creates a cycle, and that cycle must contain some other edge $e'$ crossing the cut $(S, V \setminus S)$. Since $e$ is the minimum-cost edge crossing this cut, $w(e) \le w(e')$, so $T' = T - e' + e$ is a spanning tree with $w(T') \le w(T)$; hence $T'$ is also an MST, and it contains $U_k \cup \{e\}$. By induction, when the algorithm terminates, its selected edges form a spanning tree contained in some MST, and are therefore themselves an MST. $\blacksquare$
The correctness of Prim's algorithm was just proved using a variation of the cut-optimality condition, which states that a spanning tree $T$ is a minimum spanning tree if, for every edge $(i, j) \in T$, that edge is the minimum weight edge in the cutset induced by deleting $(i, j)$ from $T$. When you remove an edge from a tree, the tree splits into exactly two connected components; the set of all edges in the original graph that connect these two components is the "induced cut".
But that is not enough to prove the correctness of Prim's. The reason for that is, it is clear that for each tree edge $(i,j)$, there is a cut such that $(i,j)$ has the smallest cost in the cut. However, this cut is not the one induced by deleting $(i,j)$ from the spanning tree. In other words, in certain examples, a tree edge $(i,j)$ might be the smallest edge in a cut, but not necessarily the specific cut created by deleting that edge from the final spanning tree. So, while an edge might be chosen greedily (because it's the smallest available at that moment), the "induced cut" logic of the standard definition doesn't always align with the step-by-step growth of the tree in Prim's.
While the generic algorithm simply looks for the minimum cost arc, an efficient implementation requires a fast way to select this new edge. This is achieved by maintaining all vertices not yet in the tree in a min-priority queue, Q, based on a specific key attribute.
v.key: stores the minimum weight of any edge connecting vertex v to a vertex already established in the tree. By convention, if no such edge exists, v.key = ∞.
v.π: stores the parent node of v in the tree.
Instead of maintaining a distinct set of edges $T$, the algorithm implicitly maintains the set of edges in the minimum spanning tree, $A$, using parent pointers: $$A = \{ (v, v.\pi) : v \in V \setminus (\{r\} \cup Q) \}$$ When the algorithm terminates and $Q$ is empty, the final Minimum Spanning Tree is: $$A = \{ (v, v.\pi) : v \in V \setminus \{r\} \}$$
MST-PRIM(G, w, r)
1: for each u ∈ G.V
2: u.key = ∞
3: u.π = NIL
4: r.key = 0
5: Q = G.V
6: while Q ≠ ∅
7: u = EXTRACT-MIN(Q)
8: for each v ∈ G.Adj[u]
9: if v ∈ Q and w(u, v) < v.key
10: v.π = u
11: v.key = w(u, v)
The running time of Prim's algorithm depends entirely on how the min-priority queue Q is implemented:
With a binary min-heap, the EXTRACT-MIN operation runs $|V|$ times, taking $O(\log V)$ time per extraction, totaling $O(V \log V)$. The inner for loop executes $O(E)$ times altogether (the sum of the lengths of all adjacency lists is $2|E|$). The assignment in line 11 involves an implicit DECREASE-KEY operation, which takes $O(\log V)$ time.
In total, $O(V \log V + E \log V) = O(E \log V)$ time is taken.
With a Fibonacci heap, EXTRACT-MIN takes $O(\log V)$ amortized time, and the DECREASE-KEY operation takes $O(1)$ amortized time.
This gives $O(E + V \log V)$ total running time.
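A minimal Python sketch of the binary-heap version follows (illustrative, not the course's reference code). Python's heapq has no Decrease-Key, so this sketch uses the common lazy-deletion workaround: push duplicate entries and skip stale ones on extraction, which costs $O(E \log E) = O(E \log V)$.

```python
import heapq
from collections import defaultdict

def prim(n, edges, root=0):
    """MST via Prim with a binary heap; edges are (u, v, w) triples, nodes 0..n-1."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((w, v))
        adj[v].append((w, u))
    in_tree = [False] * n
    mst = []
    heap = [(0, root, -1)]            # (key, vertex, parent); -1 = no parent
    while heap:
        w, u, parent = heapq.heappop(heap)
        if in_tree[u]:
            continue                  # stale entry: u was extracted earlier
        in_tree[u] = True
        if parent != -1:
            mst.append((parent, u, w))
        for wv, v in adj[u]:
            if not in_tree[v]:
                heapq.heappush(heap, (wv, v, u))
    return mst

# Same 4-vertex example as before: inner edges weight 1, outer edges weight 2.
edges = [(0, 2, 1), (1, 3, 1), (0, 1, 2), (1, 2, 2), (2, 3, 2), (3, 0, 2)]
print(prim(4, edges))
```

Each pop of a non-stale entry corresponds to one EXTRACT-MIN of the pseudocode; each push stands in for a DECREASE-KEY.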
CLRS, practically the bible of computer science, offers a great many exercises and problems whose solutions would be useful to know.
Show that for each MST $T$ of $G$, there is a way to sort the edges in Kruskal's algorithm so that it returns $T$.
Kruskal's algorithm only branches in its decision-making when there are multiple edges with the exact same weight. To guarantee the algorithm produces a specific MST $T$, we simply adjust our sorting comparator: when two edges have the same weight, we resolve the tie by prioritizing the edge that belongs to $T$. Because all edges in $T$ belong to a valid MST, they will never form a cycle when added in weight order, ensuring every edge in $T$ is safely selected.
Give a simple implementation of Prim's algorithm for an adjacency matrix that runs in $O(V^2)$ time.
Instead of using a binary heap or Fibonacci heap, which are rather complex, we implement the priority queue $Q$ using a simple 1D array of size $|V|$, where key[v] stores the minimum edge weight to the tree. The following operations are used: Extract-Min scans the key array for the cheapest vertex not yet in the tree, costing $O(V)$ per call and $O(V^2)$ over all $|V|$ calls; Decrease-Key is a single array write, costing $O(1)$ per adjacency-matrix entry examined, or $O(V^2)$ in total.
Total time complexity: $O(V^2) + O(V^2) = O(V^2)$. This is actually optimal for dense graphs where $|E| \approx |V|^2.$
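The array-based scheme can be sketched in Python as follows (an illustrative sketch, assuming $W$ is a full adjacency matrix with float('inf') marking missing edges):

```python
INF = float('inf')

def prim_dense(W):
    """O(V^2) Prim on an adjacency matrix W (W[u][v] = weight, INF if no edge)."""
    n = len(W)
    key = [INF] * n        # cheapest known edge from each vertex to the tree
    parent = [-1] * n
    in_tree = [False] * n
    key[0] = 0
    total = 0
    for _ in range(n):
        # Extract-Min by linear scan: O(V) per iteration, O(V^2) overall.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: key[v])
        in_tree[u] = True
        total += key[u]
        # Decrease-Key is a plain array write: O(1) per matrix entry.
        for v in range(n):
            if not in_tree[v] and W[u][v] < key[v]:
                key[v] = W[u][v]
                parent[v] = u
    return total, parent

# The 4-vertex example (A=0, B=1, C=2, D=3): inner weight 1, outer weight 2.
W = [[INF, 2, 1, 2],
     [2, INF, 2, 1],
     [1, 2, INF, 2],
     [2, 1, 2, INF]]
print(prim_dense(W)[0])  # 4
```

No heap is needed because scanning an array of $|V|$ keys is already within the $O(V^2)$ budget.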
When is the Fibonacci heap implementation asymptotically faster?
The Fibonacci heap implementation runs in $O(E + V \log V)$ time versus $O(E \log V)$ for the binary heap. For sparse graphs with $|E| = \Theta(V)$, both bounds collapse to $O(V \log V)$, so there is no asymptotic gain. The point is: the Fibonacci heap is asymptotically faster precisely when $|E| = \omega(V)$, that is, when the number of edges grows strictly faster than the number of vertices (up to the maximum $|E| = O(V^2)$).
How fast can Kruskal's run if edge weights are integers in the range $1$ to $|V|$? What if the range is $1$ to $W$?
The bottleneck of Kruskal's algorithm is sorting the edges, which normally takes $O(E \log E)$. If weights are integers from $1$ to $|V|$, we can use Counting Sort, which sorts the edges in $O(V + E)$ time. The disjoint-set operations will then dominate the runtime, taking $O(E \alpha(V))$ time, where $\alpha$ is the inverse Ackermann function. If weights are in the range $1$ to $W$, Counting Sort takes $O(W + E)$ time. If $W$ is a constant or $O(E)$, the total running time remains bounded by the disjoint-set operations: $O(E \alpha(V))$.
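Under the stated assumption of small integer weights, the counting/bucket sort step can be sketched in Python (an illustrative sketch):

```python
def counting_sort_edges(edges, max_w):
    """Sort (u, v, w) edges by integer weight w in O(E + W) time,
    where W = max_w is the largest possible weight."""
    buckets = [[] for _ in range(max_w + 1)]   # one bucket per weight value
    for u, v, w in edges:
        buckets[w].append((u, v, w))
    # Concatenating the buckets in index order yields a stable sort by weight.
    return [e for bucket in buckets for e in bucket]
```

Feeding this ordering into Kruskal's main loop removes the $O(E \log E)$ sorting bottleneck, leaving the disjoint-set operations as the dominant cost.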
How fast can Prim's run if edge weights are integers in the range $1$ to $|V|$? What if the range is $1$ to $W$?
We can optimize Prim's algorithm by replacing the standard binary heap with an array of "buckets" (doubly linked lists). Since the maximum possible edge weight is $W$, a vertex's key can never exceed $W$ (except for the initial $\infty$). We create an array of size $W + 1$.
Decrease-Key: When a vertex's key is updated, we simply move it from its current bucket to the new bucket corresponding to its new key. This takes $O(1)$ time per update. Total time for all updates is $O(E)$.
Extract-Min: To find the minimum, we scan the array starting from the lowest index until we find a non-empty bucket. In the worst case, we might scan $O(W)$ buckets. Since we do this $|V|$ times, the total extraction time is $O(V + W)$ if we maintain a pointer to the current minimum bucket, or $O(VW)$ for a simpler implementation.
If edge weights are uniformly distributed over the half-open interval $[0, 1)$, which algorithm can you make run faster?
Kruskal's algorithm can be made faster in expectation. Since the edge weights are uniformly distributed in a known range, we can use Bucket Sort to sort the edges, which runs in $O(E)$ expected time. With the sorting bottleneck reduced to linear expected time, the overall running time of Kruskal's algorithm is dominated by the disjoint-set operations, yielding an expected time of $O(E \alpha(V))$. Prim's algorithm does not benefit as directly or simply from uniformly distributed weights, making Kruskal's the faster choice here.
How quickly can we update an existing MST if we add a new vertex and its incident edges to the graph?
Let $T$ be the existing MST, which has $|V| - 1$ edges. When a new vertex $v$ is added along with $k$ incident edges, we don't need to re-examine the entire original graph.
We can construct a new, much smaller graph $G'$ consisting only of the edges in $T$ plus the $k$ new edges. This graph has $|V| + 1$ vertices and $(|V| - 1) + k$ edges. Since $k \le |V|$, the number of edges in $G'$ is $O(V)$. We can simply run Prim's or Kruskal's algorithm on $G'$. Running Kruskal's on this reduced graph takes $O(V \log V)$ time, which is much faster than recalculating from scratch.
Evaluate Professor Borden's proposed algorithm that partitions vertices into two sets, recursively finds the MST of each, and unites them with the minimum weight edge crossing the cut.
The algorithm fails. It operates on the flawed assumption that an MST will only cross the arbitrary cut $(V_1, V_2)$ exactly once.
We present the following counterexample. Consider a graph with $4$ vertices: $A, B, C, D$.
Partition them into $V_1 = \{A, B\}$ and $V_2 = \{C, D\}$.
Let the edge weights be: $w(A,B) = 10$, $w(C,D) = 10$, $w(A,C) = 1$, and $w(B,D) = 1$ (these four edges form the cycle $A - B - D - C - A$).
Borden's algorithm will find the MST of $V_1$ as $(A, B)$ and the MST of $V_2$ as $(C, D)$. It then finds the minimum edge crossing the cut, say $(A, C)$ with weight $1$. The total tree returned is $\{(A, B), (C, D), (A, C)\}$ with a total weight of 21.
However, the true MST is $\{(A, C), (B, D), (A, B)\}$ with a total weight of $1 + 1 + 10 = \mathbf{12}$. The algorithm restricts the cut to a single edge, preventing it from utilizing multiple light edges that might cross the partition.
Professor Borden's algorithm fails because it ignores the Fundamental Cycle Lemma. In the counterexample provided, the true MST requires multiple edges to cross the cut $(V_1, V_2)$.
By the Fundamental Cycle Lemma, adding any edge crossing the cut—such as $(B, D)$—to the set of edges selected by Borden's algorithm would create a unique cycle. In the counterexample, adding $(B, D)$ creates the cycle $C = \{(B, D), (D, C), (C, A), (A, B)\}$.
The weights of these edges are: $w(B,D) = 1$, $w(D,C) = 10$, $w(C,A) = 1$, and $w(A,B) = 10$.
According to the lemma, we can break this cycle by deleting any edge and still maintain a spanning tree. To minimize the total weight, we should delete the heaviest edge in the cycle. By deleting one of the two weight-$10$ edges, say $(C,D)$, we are left with the true MST $\{(A,B), (A,C), (B,D)\}$ of weight $12$.
Borden's algorithm is fundamentally restricted because it only allows one edge to cross the cut, whereas the Lemma proves that cycles (and thus potential optimizations) can involve multiple crossing edges.
Let $G = (V, E)$ be an undirected, connected graph whose weight function is $w: E \to \mathbb{R}$, and suppose that $|E| \ge |V|$ and all edge weights are distinct. We define a second-best minimum spanning tree as follows. Let $\mathcal{T}$ be the set of all spanning trees of $G$, and let $T^*$ be the minimum spanning tree of $G$. Then a second-best minimum spanning tree is a spanning tree $T'$ such that: $$w(T') = \min_{T'' \in \mathcal{T} \setminus \{T^*\}} \{w(T'')\}$$
(a)
Since all edge weights are distinct, Kruskal's algorithm will only ever have one strict order to process the edges, and the cut property dictates a single unique choice for every cut. Therefore, the minimum spanning tree $T$ is unique.
However, the second-best MST does not have to be unique. Consider a graph with vertices $A, B, C, D$. Let the edges be $(A,B)=1, (B,C)=2, (C,D)=10, (A,D)=11$.
The unique MST uses edges of weights 1, 2, and 10 (total weight = 13).
To form a second-best MST, we could swap the edge of weight 10 for the edge of weight 11, resulting in a tree of weight 14. But what if we had another independent cycle in the graph where a swap also increased the total weight by exactly 1? We could have two completely different edge swaps that result in the exact same second-best total weight, making the second-best MST non-unique even with distinct edge weights.
(b)
Let $T$ be the unique MST, and $T'$ be a second-best MST. Because $T$ and $T'$ are different spanning trees, there must be at least one edge $(x,y) \in T'$ that is not in $T$. Adding $(x,y)$ to $T$ creates a fundamental cycle. To resolve this cycle and form a new spanning tree, we must remove an edge $(u,v)$ that is in $T$ but not in $T'$.
The new tree $T'' = T - \{(u,v)\} \cup \{(x,y)\}$ is a valid spanning tree distinct from $T$. Since $T$ is the unique MST and weights are distinct, $w(x,y) > w(u,v)$, so $w(T'') > w(T)$. Among all such single swaps, choosing the pair that minimizes $w(x,y) - w(u,v)$ yields the lightest spanning tree other than $T$; an exchange argument shows that any spanning tree differing from $T$ in two or more edges can be improved by undoing one of its swaps, so it cannot beat the best single swap. Therefore, a second-best MST can always be obtained from the MST by exchanging exactly one edge.
(c)
We can compute the maximum weight edge on the unique path between every pair of vertices in the MST $T$ using a simple traversal.
For each vertex $u \in V$, we perform a Depth First Search (DFS) or Breadth First Search (BFS) starting from $u$. As we traverse the tree, we keep track of the maximum edge seen so far. When we visit a new vertex $y$ from its parent $x$ in the search tree, we can calculate $\max[u, y]$ in constant time using the previously computed value:
$$\max[u,y] = \max(\max[u,x], w(x,y))$$
Since a full traversal from a single root takes $O(V)$ time (because $T$ has $V-1$ edges), doing this for all $V$ possible roots takes $O(V \cdot V) = O(V^2)$ time.
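The traversal described above can be sketched in Python. This is a minimal illustration rather than textbook code; the name `path_max_table` and the edge-list representation are my own choices.

```python
from collections import defaultdict, deque

def path_max_table(n, tree_edges):
    """For a tree on vertices 0..n-1 given as (u, v, w) edges, return
    best[u][v] = maximum edge weight on the unique u-v tree path,
    via one BFS per root: O(V) per root, O(V^2) total."""
    adj = defaultdict(list)
    for u, v, w in tree_edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    best = {u: {u: float("-inf")} for u in range(n)}
    for root in range(n):
        queue = deque([root])
        while queue:
            x = queue.popleft()
            for y, w in adj[x]:
                if y not in best[root]:
                    # the max on the root->y path extends the max on root->x
                    best[root][y] = max(best[root][x], w)
                    queue.append(y)
    return best
```

Each BFS touches each of the $V-1$ tree edges a constant number of times, matching the $O(V)$-per-root bound.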
(d) Combining the logic from the previous parts, we can find the second-best MST with the following steps:
1. Compute the minimum spanning tree $T$ of $G$.
2. Using part (c), compute $\max[u, v]$, the maximum-weight edge on the unique path in $T$ between $u$ and $v$, for all pairs of vertices, in $O(V^2)$ time.
3. For every non-tree edge $(u, v) \in E \setminus T$, compute the weight of the swapped tree, $w(T) - w(\max[u,v]) + w(u,v)$, and return the tree achieving the smallest such value.
Time Complexity: computing the MST with Prim's algorithm and a Fibonacci heap takes $O(E + V \log V) = O(V^2)$ time (since $E \le V^2$ and $V \log V = O(V^2)$), the all-pairs table takes $O(V^2)$, and scanning the non-tree edges takes $O(E) = O(V^2)$. The total running time is therefore $O(V^2)$.
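The full second-best-MST pipeline can be sketched in Python under the distinct-weight assumption. The function name `second_best_mst_weight` and the edge-list representation are my own; Kruskal's algorithm, the path-max table, and the best single swap are combined inline.

```python
from collections import defaultdict, deque

def second_best_mst_weight(n, edges):
    """edges: list of (u, v, w) with distinct weights.
    Returns (mst_weight, second_best_weight)."""
    # Kruskal's algorithm with a simple union-find
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree, mst_w = [], 0
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, w))
            mst_w += w
    # path-max table via one BFS per root (O(V^2))
    adj = defaultdict(list)
    for u, v, w in tree:
        adj[u].append((v, w))
        adj[v].append((u, w))
    best = [[float("-inf")] * n for _ in range(n)]
    for root in range(n):
        seen, queue = {root}, deque([root])
        while queue:
            x = queue.popleft()
            for y, w in adj[x]:
                if y not in seen:
                    seen.add(y)
                    best[root][y] = max(best[root][x], w)
                    queue.append(y)
    # best single swap over all non-tree edges
    in_tree = {(min(u, v), max(u, v)) for u, v, _ in tree}
    second = min(mst_w - best[u][v] + w
                 for u, v, w in edges
                 if (min(u, v), max(u, v)) not in in_tree)
    return mst_w, second
```

On the four-vertex example above (relabeled $A,B,C,D \to 0,1,2,3$), this returns an MST weight of 7 and a second-best weight of 8.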
For a very sparse connected graph $G = (V,E)$, we can further improve upon the $O(E + V \log V)$ running time of Prim's algorithm with Fibonacci heaps by preprocessing $G$ to decrease the number of vertices before running Prim's algorithm. In particular, we choose, for each vertex $u$, the minimum-weight edge $(u, v)$ incident on $u$, and we put $(u, v)$ into the minimum spanning tree under construction. We then contract all chosen edges.
Rather than contracting these edges one at a time, we first identify sets of vertices that are united into the same new vertex. Then we create the graph that would have resulted from contracting these edges one at a time, but we do so by "renaming" edges according to the sets into which their endpoints were placed. Several edges from the original graph may be renamed the same as each other. In such a case, only one edge results, and its weight is the minimum of the weights of the corresponding original edges.
Initially, we set the minimum spanning tree $T$ being constructed to be empty, and for each edge $(u, v) \in E$, we initialize the attributes $(u, v).orig = (u, v)$ and $(u, v).c = w(u, v)$. We use the $orig$ attribute to reference the edge from the initial graph that is associated with an edge in the contracted graph. The $c$ attribute holds the weight of an edge, and as edges are contracted, we update it according to the above scheme for choosing edge weights. The procedure MST-REDUCE takes inputs $G$ and $T$, and it returns a contracted graph $G'$ with updated attributes $orig'$ and $c'$. The procedure also accumulates edges of $G$ into the minimum spanning tree $T$.
MST-REDUCE(G, T)
1 for each v ∈ G.V
2 v.mark = FALSE
3 MAKE-SET(v)
4 for each u ∈ G.V
5 if u.mark == FALSE
6 choose v ∈ G.Adj[u] such that (u, v).c is minimized
7 UNION(u, v)
8 T = T ∪ {(u, v).orig}
9 u.mark = v.mark = TRUE
10 G'.V = {FIND-SET(v) : v ∈ G.V}
11 G'.E = ∅
12 for each (x, y) ∈ G.E
13 u = FIND-SET(x)
14 v = FIND-SET(y)
15 if (u, v) ∉ G'.E
16 G'.E = G'.E ∪ {(u, v)}
17 (u, v).orig' = (x, y).orig
18 (u, v).c' = (x, y).c
19 else if (x, y).c < (u, v).c'
20 (u, v).orig' = (x, y).orig
21 (u, v).c' = (x, y).c
22 construct adjacency lists G'.Adj for G'
23 return G' and T
(a) Let $T$ be the set of edges returned by MST-REDUCE, and let $A$ be the minimum spanning tree of the graph $G'$ formed by the call MST-PRIM(G', c', r), where $c'$ is the weight attribute on the edges of $G'.E$ and $r$ is any vertex in $G'.V$. Prove that $T \cup \{ (x, y).orig' : (x, y) \in A \}$ is a minimum spanning tree of $G$.
In MST-REDUCE, every vertex selects the minimum-weight edge incident to it. By the cut property (or Theorem 23.1 regarding safe edges), the lightest edge crossing the cut between a single vertex and the rest of the graph must belong to some Minimum Spanning Tree. Therefore, all edges added to $T$ during this phase are safe and belong to an MST. Contracting these edges preserves the MST property for the remaining graph. If $A$ is the MST of the contracted graph $G'$, then combining the edges of $A$ (mapped back to their original edges via the $orig'$ attribute) with the safely chosen edges in $T$ yields a valid MST for the original graph $G$.
(b) Argue that $|G'.V| \le |V| / 2$.
During the reduction phase, every vertex $u \in V$ selects exactly one incident edge to contract. Since each edge connects two vertices, the "worst-case" scenario for minimizing contractions is when vertices pair up exclusively (e.g., $u$ picks $(u,v)$ and $v$ picks $(v,u)$). In this scenario, every pair of vertices merges into a single new vertex, exactly halving the total number of vertices. If vertices form larger connected components, the reduction is even greater. Thus, the new number of vertices is at most half the original.
(c)
Show how to implement MST-REDUCE so that it runs in $O(E)$ time. (Hint: Use simple data structures.)
To achieve $O(E)$ time, we must avoid the $O(E \alpha(V))$ overhead of standard disjoint-set data structures. Because every vertex selects exactly one edge, the selected edges form a collection of trees (a forest).
We can implement this efficiently by:
1. Having each vertex record its minimum-weight incident edge in a single $O(E)$ scan.
2. Treating the chosen edges as an undirected graph and labeling its connected components with a BFS or DFS. Since only $O(V)$ edges were chosen, this takes $O(V)$ time, and a vertex's component label plays the role of FIND-SET.
3. Renaming each original edge by the component labels of its endpoints in one more $O(E)$ pass, discarding self-loops and, among parallel edges, keeping only the lightest copy (using a direct-access table keyed on the labeled endpoint pair).
Every step is linear, so the whole procedure runs in $O(E)$ time (recall that $|E| \ge |V|$).
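One phase of this reduction can be sketched in Python. This is an illustrative approximation of MST-REDUCE's behavior, not CLRS's code: the function name and tuple representation are mine, and BFS component labeling stands in for the disjoint-set forest.

```python
from collections import defaultdict, deque

def mst_reduce_phase(n, edges):
    """One O(V + E) contraction phase. edges: list of (u, v, w).
    Returns (chosen_edges, n2, new_edges): chosen_edges go into the MST,
    and (n2, new_edges) is the contracted graph with relabeled vertices."""
    # 1. every vertex picks its lightest incident edge (one O(E) scan)
    lightest = [None] * n
    for u, v, w in edges:
        for a in (u, v):
            if lightest[a] is None or w < lightest[a][2]:
                lightest[a] = (u, v, w)
    chosen = set(lightest) - {None}
    # 2. label the components of the chosen forest by BFS (O(V))
    adj = defaultdict(list)
    for u, v, _ in chosen:
        adj[u].append(v)
        adj[v].append(u)
    comp, n2 = [-1] * n, 0
    for s in range(n):
        if comp[s] == -1:
            comp[s], queue = n2, deque([s])
            while queue:
                x = queue.popleft()
                for y in adj[x]:
                    if comp[y] == -1:
                        comp[y] = n2
                        queue.append(y)
            n2 += 1
    # 3. rename edges; drop self-loops, keep lightest parallel copy (O(E))
    lightest_between = {}
    for u, v, w in edges:
        cu, cv = comp[u], comp[v]
        if cu == cv:
            continue  # self-loop after contraction: discard
        key = (min(cu, cv), max(cu, cv))
        if key not in lightest_between or w < lightest_between[key][2]:
            lightest_between[key] = (cu, cv, w)
    return list(chosen), n2, list(lightest_between.values())
```

Each of the three passes is linear, matching the $O(E)$ bound (given $E \ge V$).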
(d) Suppose that we run $k$ phases of MST-REDUCE, using the output $G'$ produced by one phase as the input $G$ to the next phase and accumulating edges in $T$. Argue that the overall running time of the $k$ phases is $O(kE)$.
Let $E_i$ be the number of edges in the graph during phase $i$. When we contract vertices, we might create self-loops (which are discarded) or parallel edges (which are resolved by keeping only the lightest edge). Therefore, the number of edges never increases: $E_k \le E_{k-1} \dots \le E_1 \le E$. Since each phase takes $O(E_i)$ time, and $E_i \le E$ for all $i$, running $k$ phases takes $\sum_{i=1}^k O(E_i) = O(kE)$ time.
(e) Suppose that after running $k$ phases of MST-REDUCE, as in part (d), we run Prim’s algorithm by calling MST-PRIM(G', c', r). Show how to pick $k$ so that the overall running time is $O(E \log \log V)$. Argue that your choice of $k$ minimizes the overall asymptotic running time.
After $k$ phases, the number of vertices is at most $V / 2^k$. If we then run Prim's algorithm with a Fibonacci heap on the contracted graph, the time taken is:
$$O(E + V_k \log V_k) \le O\left(E + \frac{V}{2^k} \log V\right)$$
To balance the reduction time $O(kE)$ and the Prim's execution time, we want $\frac{V}{2^k} \log V \le V$, which means we need $2^k \ge \log V$.
If we choose $k = \lceil \log \log V \rceil$, then $2^k \ge \log V$, so the $\log V$ factor is canceled out. The time for Prim's becomes $O(E + V) = O(E)$ (since $E \ge V-1$ for a connected graph), while the preprocessing time becomes $O(E \log \log V)$. Thus, the overall asymptotic running time is dominated by the preprocessing step: $O(E \log \log V)$. This choice of $k$ minimizes the overall asymptotic bound: a larger $k$ only inflates the $O(kE)$ preprocessing term, while a smaller $k$ leaves $\frac{V}{2^k} \log V = \omega(V)$, so Prim's phase would dominate instead.
(f) For what values of $|E|$ (in terms of $|V|$) does Prim’s algorithm with preprocessing asymptotically beat Prim’s algorithm without preprocessing?
Standard Prim's with a Fibonacci heap takes $O(E + V \log V)$. Our hybrid algorithm takes $O(E \log \log V)$.
We want to know when $E \log \log V$ is asymptotically smaller than $E + V \log V$. If $E = \Omega(V \log V)$, the standard bound is $\Theta(E)$, and preprocessing only makes things worse by a $\log \log V$ factor. If $E = o(V \log V)$, the standard bound is $\Theta(V \log V)$, and $E \log \log V = o(V \log V)$ exactly when $E = o\left(\frac{V \log V}{\log \log V}\right)$. Hence Prim's algorithm with preprocessing asymptotically beats Prim's algorithm without preprocessing whenever $E = o\left(\frac{V \log V}{\log \log V}\right)$ (while still satisfying $E \ge V$).
A bottleneck spanning tree $T$ of an undirected graph $G$ is a spanning tree of $G$ whose largest edge weight is minimum over all spanning trees of $G$. We say that the value of the bottleneck spanning tree is the weight of the maximum-weight edge in $T$.
(a)
Argue that a minimum spanning tree is a bottleneck spanning tree.
We can prove this by contradiction. Suppose $T$ is a Minimum Spanning Tree (MST) but not a Bottleneck Spanning Tree (BST). Let $T'$ be a valid BST. By definition, the maximum edge weight in $T$, let's call it $w_{max}$, must be strictly greater than the maximum edge weight in $T'$, which we'll call $w'_{max}$ (so $w_{max} > w'_{max}$).
Let $e = (u, v)$ be an edge in $T$ with weight $w_{max}$. If we remove $e$ from $T$, the tree splits into two disconnected components, $C_1$ and $C_2$. Since $T'$ is a valid spanning tree of the entire graph, it must contain at least one edge $e'$ that crosses the cut between $C_1$ and $C_2$.
Because $e'$ is in $T'$, its weight is at most $w'_{max}$, which means $w(e') < w(e)$. If we add $e'$ to $T - \{e\}$, we reconnect the two components and form a new spanning tree. The total weight of this new spanning tree is exactly $w(T) - w(e) + w(e')$. Since $w(e') < w(e)$, the new tree has a strictly smaller total weight than $T$. This directly contradicts our initial premise that $T$ is an MST. Therefore, every MST must also be a bottleneck spanning tree.
(b)
Give a linear-time algorithm that given a graph $G$ and an integer $b$, determines whether the value of the bottleneck spanning tree is at most $b$.
We can determine this by ignoring all edges that are "too heavy" and checking whether the remaining edges still connect the graph: discard every edge of weight greater than $b$, then run a single BFS or DFS. The bottleneck value is at most $b$ if and only if this filtered graph is connected, and the whole check takes $O(V + E)$ time.
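The check described above can be sketched in Python. This is a minimal illustration (names and representation are mine) that assumes the vertices are numbered $0$ through $n-1$.

```python
from collections import defaultdict, deque

def bottleneck_at_most(n, edges, b):
    """True iff the bottleneck-spanning-tree value of the graph is <= b.
    Equivalent to: the subgraph of edges with weight <= b is connected.
    One BFS over the filtered edges gives O(V + E) time."""
    adj = defaultdict(list)
    for u, v, w in edges:
        if w <= b:                      # ignore edges that are too heavy
            adj[u].append(v)
            adj[v].append(u)
    seen, queue = {0}, deque([0])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return len(seen) == n
```

For a triangle with weights $1$, $5$, and $10$, the bottleneck value is $5$: the check fails for $b = 4$ and succeeds for $b = 5$.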
(c)
Use your algorithm for part (b) as a subroutine in a linear-time algorithm for the bottleneck-spanning-tree problem. (Hint: You may want to use a subroutine that contracts sets of edges, as in the MST-REDUCE procedure described in Problem 23-2.)
We can find the bottleneck spanning tree recursively by leveraging linear-time median selection and edge contraction.
First, we find the median edge weight $m$ in $O(E)$ time using linear-time selection, and partition $E$ into $E_{\le m}$ and $E_{> m}$, each of size about $E/2$. Using the algorithm from part (b), we test in $O(V + E)$ time whether the bottleneck value is at most $m$. If it is, the heavy edges are irrelevant, so we recurse on the graph restricted to $E_{\le m}$, which has half as many edges. If it is not, then every edge of $E_{\le m}$ may be used freely, so we contract each connected component of $(V, E_{\le m})$ into a single vertex (exactly as in MST-REDUCE). This contraction takes $O(V + E)$ time. We then recursively call our algorithm on the new contracted graph, keeping only the remaining edges from $E_{> m}$. Again, the number of edges being passed to the recursive call is exactly half. In either case the recurrence is $T(E) = T(E/2) + O(E)$, which solves to $O(E)$ total time.

In this problem, we give pseudocode for three different algorithms. Each one takes a connected graph and a weight function as input and returns a set of edges $T$. For each algorithm, either prove that $T$ is a minimum spanning tree or prove that $T$ is not a minimum spanning tree. Also describe the most efficient implementation of each algorithm, whether or not it computes a minimum spanning tree.
(a)
MAYBE-MST-A(G, w)
1: sort the edges into nonincreasing order of edge weights w
2: T = E
3: for each edge e, taken in nonincreasing order by weight
4: if T - {e} is a connected graph
5: T = T - {e}
6: return T
This algorithm does compute a minimum spanning tree. It is commonly known as the Reverse-Delete Algorithm. It works by applying the cycle property in reverse. Because we examine edges in nonincreasing order (heaviest first), when we consider an edge $e$, if removing it does not disconnect the graph, $e$ must be part of some cycle in $T$. Furthermore, because we process the heaviest edges first, $e$ is guaranteed to be the heaviest edge on that cycle. The cycle property states that the heaviest edge on any cycle cannot be part of the MST, so it is safe to remove it.
Proof. We will prove that MAYBE-MST-A correctly computes a minimum spanning tree by relying on the cycle property. The algorithm initializes $T$ to include all edges in $E$ and processes them in nonincreasing order of weight. When an edge $e$ is examined, if $T - \{e\}$ remains connected, it implies that $e$ is part of a cycle within the current set of edges $T$. Because the algorithm considers edges from heaviest to lightest, all other edges currently in $T$ (including those completing the cycle with $e$) must have weights less than or equal to $w(e)$. Therefore, $e$ is guaranteed to be a maximum-weight edge on this cycle. By the standard cycle property of minimum spanning trees, the heaviest edge on any cycle can be safely excluded without preventing the formation of an MST. Removing $e$ preserves the invariant that $T$ contains the edges of at least one MST of $G$. After all edges are processed, $T$ remains connected and contains no cycles (otherwise, the heaviest edge of that cycle would have been removed). Thus, $T$ is a minimum spanning tree. $\blacksquare$
We can implement this as follows. Sorting the edges takes $O(E \log E)$ time. For each of the $E$ edges, we must check whether removing it disconnects the graph. We can do this by running a Breadth-First Search (BFS) or Depth-First Search (DFS) to see if we can still reach the endpoints of $e$ without using $e$, taking $O(V + E)$ time per edge. The total time complexity is $O(E \log E + E(V + E)) = O(E^2)$. (With dynamic graph connectivity data structures, reverse-delete can be made substantially faster, but the $O(E^2)$ DFS approach is the standard straightforward implementation.)
IMPLEMENT-MST-A(G, w)
1: Sort edges in E into an array A in nonincreasing order of weights w
2: Initialize an adjacency list for T containing all edges in E
3: for each edge e = (u, v) in A:
4: Remove e from T's adjacency list
5: Initialize a boolean array visited[1..V] to False
6: Run DFS(u, visited, T)
7: if visited[v] is False:
8: // Removing e disconnected u and v, so T - {e} is not connected
9: Add e back to T's adjacency list
10: return T
Proof of Correctness. The implementation directly mirrors the pseudocode. To check if $T - \{e\}$ remains connected, we temporarily remove $e=(u,v)$ from our working graph $T$ and attempt to find a path between $u$ and $v$ using Depth-First Search (DFS). If $v$ is reachable from $u$, the endpoints are still connected via some alternate path, meaning $e$ was part of a cycle. If $v$ is not reachable, $e$ is a bridge in the current graph, and removing it would disconnect the graph, so we restore it. As established in the algorithm's correctness proof, discarding the heaviest edge of any cycle iteratively leaves exactly the edges of a minimum spanning tree. $\blacksquare$
Proof of Running Time. Sorting the array of edges $A$ takes $O(E \log E)$ time. The algorithm then iterates exactly $E$ times. Inside the loop, it performs a DFS on the graph $T$. A standard DFS on an adjacency list takes time proportional to the number of vertices and current edges, which is bounded by $O(V + E)$. Therefore, the loop takes $E \cdot O(V + E) = O(E^2)$ time. The total running time is bounded by the loop, resulting in $O(E^2)$. $\blacksquare$
Proof of Space Complexity. Storing the graph $T$ as an adjacency list requires $O(V + E)$ space. The sorting step can be done in-place or with $O(E)$ auxiliary space. The DFS requires a visited array of size $O(V)$ and a call stack of size at most $O(V)$ in the worst case (a linear graph). Thus, the dominant space requirement is the adjacency list, giving a space complexity of $O(V + E)$. $\blacksquare$
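A compact runnable version of this reverse-delete implementation, sketched in Python (the function name and edge-tuple representation are my own):

```python
from collections import defaultdict, deque

def reverse_delete_mst(n, edges):
    """Reverse-delete: process edges heaviest-first and drop any edge
    whose removal keeps its endpoints connected. O(E^2) overall, with a
    BFS connectivity check per edge. edges: list of (u, v, w) tuples."""
    tree = set(edges)
    def connected(u, v):
        adj = defaultdict(list)
        for a, b, _ in tree:
            adj[a].append(b)
            adj[b].append(a)
        seen, queue = {u}, deque([u])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        return v in seen
    for e in sorted(edges, key=lambda e: e[2], reverse=True):
        u, v, _ = e
        tree.discard(e)
        if not connected(u, v):   # e was a bridge: put it back
            tree.add(e)
    return tree
```

On a four-vertex cycle with a chord, the surviving edges form the MST.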
(b)
MAYBE-MST-B(G, w)
1: T = ∅
2: for each edge e, taken in arbitrary order
3: if T ∪ {e} has no cycles
4: T = T ∪ {e}
5: return T
This algorithm does not compute a minimum spanning tree. It computes a valid spanning tree, but because it considers edges in an arbitrary order without looking at their weights, it will likely not be the minimum one.
Proof. We will prove by counterexample that MAYBE-MST-B does not necessarily compute a minimum spanning tree. Consider a graph $G = (V, E)$ with three vertices $V = \{A, B, C\}$ and three edges with the following weights: $w(A,B) = 1$, $w(B,C) = 2$, and $w(A,C) = 10$. The unique minimum spanning tree for this graph consists of the edges $(A,B)$ and $(B,C)$, yielding a total weight of $3$. Because the algorithm processes edges in an arbitrary order, suppose it processes them in the sequence: $(A,C)$, then $(B,C)$, then $(A,B)$. First, $(A,C)$ is added to $T$ since it forms no cycles. Next, $(B,C)$ is added to $T$ since it also forms no cycles. Finally, $(A,B)$ is considered, but adding it would form the cycle $A-B-C-A$. The algorithm rejects $(A,B)$. The resulting spanning tree $T = \{(A,C), (B,C)\}$ has a total weight of $12$, which is strictly greater than the optimal weight of $3$. Therefore, the algorithm fails to reliably compute a minimum spanning tree. $\blacksquare$
This is exactly Kruskal's algorithm, just without the initial sorting step. We can use a disjoint-set data structure to keep track of connected components and check for cycles. Processing each edge takes $O(\alpha(V))$ amortized time. Total running time is $O(E \alpha(V))$, which is nearly linear.
IMPLEMENT-MST-B(G, w)
1: T = ∅
2: for each vertex v in V:
3: MAKE-SET(v)
4: for each edge e = (u, v) in E (in any arbitrary order):
5: if FIND-SET(u) ≠ FIND-SET(v):
6: T = T ∪ {e}
7: UNION(u, v)
8: return T
Proof of Correctness. We previously proved this algorithm does not compute a minimum spanning tree, but rather an arbitrary spanning tree. This implementation correctly executes that logic using a Disjoint-Set data structure. MAKE-SET initializes each vertex in its own component. Processing edges arbitrarily, we use FIND-SET to check if $u$ and $v$ are in the same component. If they are, adding $e$ would create a cycle, so we skip it. If they are not, adding $e$ connects two disjoint components without forming a cycle, so we add $e$ to $T$ and merge the components using UNION. This strictly maintains a cycle-free graph and eventually spans all reachable vertices, perfectly matching the original pseudocode's intent. $\blacksquare$
Proof of Running Time. Initializing the disjoint sets takes $O(V)$ operations. We then iterate over the $E$ edges. Inside the loop, we perform two FIND-SET operations and at most one UNION operation. Using union-by-rank and path compression, a sequence of $m$ operations on $n$ elements takes $O(m \alpha(n))$ time, where $\alpha$ is the extremely slow-growing inverse Ackermann function. Here, we do $O(E)$ operations on $V$ elements. The total running time is $O(V + E \alpha(V)) = O(E \alpha(V))$. $\blacksquare$
Proof of Space Complexity. The algorithm requires storing the resulting tree $T$, which takes $O(V)$ space since a forest has at most $V-1$ edges. The Disjoint-Set data structure requires two arrays (parent and rank/size) of size $V$, taking $O(V)$ space. Thus, the total space complexity is $O(V)$. $\blacksquare$
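A runnable Python sketch of this implementation (names are mine; path compression by halving stands in for the full union-by-rank structure):

```python
def arbitrary_spanning_tree(n, edges):
    """IMPLEMENT-MST-B in Python: a disjoint-set forest accepts each edge
    iff it joins two different components. Returns *a* spanning tree of
    the processed edges, not necessarily a minimum one."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression (halving)
            x = parent[x]
        return x
    tree = []
    for u, v, w in edges:                   # arbitrary input order
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, w))
    return tree
```

Feeding it the counterexample's edge order $(A,C), (B,C), (A,B)$ (as vertices $0, 1, 2$) produces a spanning tree of weight $12$, not the MST of weight $3$.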
(c)
MAYBE-MST-C(G, w)
1: T = ∅
2: for each edge e, taken in arbitrary order
3: T = T ∪ {e}
4: if T has a cycle c
5: let e' be a maximum-weight edge on c
6: T = T - {e'}
7: return T
This algorithm does compute a minimum spanning tree. This is a dynamic approach that constantly maintains a Minimum Spanning Forest. When a new edge $e$ is added, it might form a cycle. If a cycle is formed, the algorithm finds the maximum-weight edge $e'$ on that specific cycle and removes it. By the exact definition of the cycle property, the heaviest edge on any cycle cannot be in the MST. Therefore, removing it is always safe, regardless of the order in which the edges are processed. At the end, $T$ will be connected (spanning) and cycle-free, leaving only the optimal MST edges.
Proof. We will prove that MAYBE-MST-C correctly computes a minimum spanning tree. The algorithm begins with an empty set $T$ and adds edges one by one. If the addition of an edge $e$ creates a cycle $c$ in $T$, the algorithm immediately identifies a maximum-weight edge $e'$ on $c$ and removes it. By the cycle property, a maximum-weight edge on any cycle in a graph is not required to form a minimum spanning tree. By removing $e'$, we resolve the cycle while maintaining the invariant that the edges in $T$ can always be extended to form a minimum spanning tree of the graph constructed from all edges considered so far. Since every cycle formed is immediately broken by discarding its heaviest edge, $T$ is always maintained as a forest. After all edges in $E$ have been processed, every edge of $G$ has either been included in $T$ or discarded because it was the heaviest edge on some cycle. Because $G$ is connected, the resulting cycle-free subgraph $T$ spans all vertices of $G$ and contains only edges necessary for the optimal weight. Thus, $T$ is a minimum spanning tree. $\blacksquare$
A simple implementation maintains $T$ as an adjacency list. For each added edge $e = (u, v)$, we first run a DFS/BFS in $T$ from $u$ to find the path to $v$ before actually adding $e$. This path, combined with $e$, forms the cycle. During the traversal, we track the maximum edge weight on the path. This search takes $O(V)$ time since $T$ is a forest and has at most $V-1$ edges. We do this for all $E$ edges, yielding a total time of $O(E V)$.
IMPLEMENT-MST-C(G, w)
1: Initialize an empty adjacency list for T
2: for each edge e = (u, v) in E (arbitrary order):
3: Initialize a boolean array visited[1..V] to False
4: path = FIND-PATH-DFS(u, v, visited, T)
5: if path is NOT empty:
6: // A cycle is formed by path + {e}
7: Find edge e' with the maximum weight w in path ∪ {e}
8: if e' ≠ e:
9: Add e to T's adjacency list
10: Remove e' from T's adjacency list
11: else:
12: Add e to T's adjacency list
13: return T
Proof of Correctness. The implementation correctly maintains the logic of adding an edge, detecting a cycle, and removing the heaviest edge on that cycle. Before formally adding $e=(u,v)$ to $T$, we check if a path already exists between $u$ and $v$ in $T$ using DFS. If a path exists, adding $e$ would close a cycle consisting of `path` and $e$. We find the maximum weight edge $e'$ on this explicit cycle. If $e'$ is the new edge $e$, we simply discard $e$. If $e'$ is an older edge in $T$, we insert $e$ and remove $e'$. By the cycle property, dropping $e'$ maintains the minimum spanning forest invariant, proving the implementation correctly achieves the MST. $\blacksquare$
Proof of Running Time. The outer loop runs $E$ times. Inside the loop, FIND-PATH-DFS traverses $T$. Because $T$ is strictly maintained as a forest (no cycles), it contains at most $V-1$ edges. A DFS on a forest takes $O(V)$ time. Finding the maximum weight edge on the path also takes $O(V)$ time, as the path length is bounded by $V$. Adding or removing an edge from an adjacency list takes $O(V)$ time in the worst case (or $O(1)$ with doubly-linked lists). The dominant operation inside the loop is the $O(V)$ DFS. Thus, the total running time is $O(EV)$. $\blacksquare$
Proof of Space Complexity. The graph $T$ is a forest, so its adjacency list stores at most $V-1$ edges, requiring $O(V)$ space. The DFS requires a "visited" array of size $O(V)$, a recursion stack of at most $O(V)$, and space to store the current path, also bounded by $O(V)$. The overall space complexity is tightly bounded to $O(V)$. $\blacksquare$
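The simple $O(EV)$ implementation can be realized in Python as follows. This is an illustrative sketch assuming distinct edge weights, with names of my own choosing; an iterative DFS recovers the cycle path.

```python
def incremental_mst(n, edges):
    """IMPLEMENT-MST-C in Python: tentatively add each edge; if a path
    between its endpoints already exists, delete the heaviest edge on
    the resulting cycle. DFS in the forest gives O(EV) overall."""
    adj = {v: [] for v in range(n)}          # forest as adjacency lists

    def path(u, v):
        """Edges on the u-v path in the forest, or None if disconnected."""
        stack, seen = [(u, [])], {u}
        while stack:
            x, trail = stack.pop()
            if x == v:
                return trail
            for y, w in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append((y, trail + [(x, y, w)]))
        return None

    tree = set()
    for u, v, w in edges:
        p = path(u, v)
        if p is not None:
            heaviest = max(p + [(u, v, w)], key=lambda e: e[2])
            if heaviest == (u, v, w):
                continue                     # new edge is heaviest: skip it
            a, b, hw = heaviest              # drop old heaviest cycle edge
            adj[a].remove((b, hw))
            adj[b].remove((a, hw))
            tree.discard((a, b, hw))
            tree.discard((b, a, hw))
        adj[u].append((v, w))
        adj[v].append((u, w))
        tree.add((u, v, w))
    return tree
```

Unlike part (b), the edge order does not matter here: every cycle is repaired as soon as it appears.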
Using advanced data structures like Link-Cut trees, we can find the maximum edge on a path and update the tree dynamically in $O(\log V)$ time per edge, reducing the total running time to $O(E \log V)$.
IMPLEMENT-MST-C-ADVANCED(G, w)
1: Initialize a Link-Cut Tree forest F containing all vertices in V
2: T = ∅
3: for each edge e = (u, v) in E (arbitrary order):
4: if CONNECTED(u, v) in F:
5: // e forms a cycle with the existing path between u and v
6: e' = FIND-MAX-EDGE(u, v) in F
7: if w(e) < w(e'):
8: CUT(e') in F
9: T = T - {e'}
10: LINK(u, v, e) in F
11: T = T ∪ {e}
12: else:
13: // e connects two disjoint components
14: LINK(u, v, e) in F
15: T = T ∪ {e}
16: return T
Proof of Correctness. The logic strictly follows the dynamic MST algorithm proven previously. The Link-Cut tree $F$ acts as an efficient oracle for the current state of our spanning forest $T$. Because standard Link-Cut trees maintain weights on vertices rather than edges, we implicitly represent each edge in $T$ as an auxiliary vertex connected to its endpoints. When considering a new edge $e = (u, v)$, the CONNECTED(u, v) operation checks if a path already exists. If so, adding $e$ would form a cycle. The FIND-MAX-EDGE(u, v) operation traverses the represented path in $F$ to find the maximum-weight edge $e'$ on this cycle. If the new edge $e$ is strictly lighter than $e'$, then $e'$ is the heaviest edge on the cycle. By the cycle property, $e'$ cannot be in the MST, so we remove it via CUT(e') and insert $e$ via LINK(u, v, e). If $e$ itself is the heaviest edge on the cycle ($w(e) \ge w(e')$), it is implicitly discarded without modifying $F$. Thus, the forest is correctly updated, and at the end, $T$ is exactly the minimum spanning tree. $\blacksquare$
Proof of Running Time. The algorithm processes each of the $E$ edges exactly once. For each edge, it performs a constant number of Link-Cut Tree operations: CONNECTED, FIND-MAX-EDGE, and potentially CUT and LINK. Using Sleator and Tarjan's Link-Cut Trees backed by splay trees, each of these operations takes $O(\log N)$ amortized time, where $N$ is the number of nodes in the forest. Because we only maintain tree edges in $F$, there are at most $V$ original vertices and $V-1$ auxiliary edge-vertices in the data structure at any time, giving $N \le 2V - 1$. Therefore, the amortized cost per edge processed is $O(\log V)$. Over $E$ edges, the total running time is exactly $O(E \log V)$. $\blacksquare$
Proof of Space Complexity. The set $T$ stores the edges of the current forest, which is at most $V-1$ edges, taking $O(V)$ space. The Link-Cut Tree $F$ must store objects for the graph's vertices and the edges currently in the tree. Since the tree never contains more than $V-1$ edges, $F$ contains at most $2V-1$ nodes. Each node in a Link-Cut Tree requires a constant amount of information (pointers for the parent, left child, right child, and path-maximum values). Consequently, the Link-Cut tree requires $O(V)$ space. Overall space complexity is tightly bounded at $O(V)$. $\blacksquare$
Prove that a spanning tree $T$ is minimum if and only if for every non-tree edge $(u, v)$, the unique path $P$ from $u$ to $v$ in $T$ satisfies $c(w, x) \leq c(u, v)$ for every edge $(w, x) \in P$, where $c$ denotes the edge-cost function.
To establish this fundamental cycle property, we must prove the statement in both directions. The function $c$ here represents the weight or cost of an edge.
Proof. ($\Rightarrow$): assuming $T$ is a minimum spanning tree, we must show that the path condition holds. Let $(u, v)$ be any edge not present in the tree $T$. Because $T$ is a spanning tree, adding the edge $(u, v)$ to $T$ inherently creates exactly one simple cycle. This cycle is formed by the edge $(u, v)$ itself and the unique path $p$ within $T$ that connects vertex $u$ to vertex $v$. Suppose for the sake of contradiction that there exists some edge $(w, x)$ on this path $p$ such that $c(w, x) > c(u, v)$. If we remove $(w, x)$ from $T$, the tree is split into two disconnected components. We can then reconnect these components by adding the edge $(u, v)$, yielding a new spanning tree $T'$. The total weight of $T'$ would be exactly the weight of $T$ minus $c(w, x)$ plus $c(u, v)$. Because we assumed $c(w, x) > c(u, v)$, the new tree $T'$ must have a strictly smaller total weight than $T$. However, this contradicts our initial premise that $T$ is a minimum spanning tree. Therefore, our assumption must be false, and it must be true that every edge $(w, x)$ on the path $p$ satisfies $c(w, x) \leq c(u, v)$.
($\Leftarrow$): assuming that for every non-tree edge $(u, v)$ the unique path $p$ in $T$ between $u$ and $v$ consists only of edges with weight less than or equal to $c(u, v)$, we must show that $T$ is a minimum spanning tree. Let $T^*$ be an actual minimum spanning tree of the graph. If $T$ and $T^*$ are identical, the proof is complete. If they are different, there must be at least one edge $(u, v)$ that is in $T^*$ but not in $T$. Adding this edge $(u, v)$ to $T$ creates a cycle containing a path $p$. Since $T^*$ cannot contain this entire cycle (as it is a tree), there must be some edge $(w, x)$ on $p$ that is in $T$ but absent from $T^*$. According to our assumed condition, $c(w, x) \leq c(u, v)$. If we take $T^*$ and swap the edge $(u, v)$ for the edge $(w, x)$, we create a new valid spanning tree. Because $c(w, x) \leq c(u, v)$, the weight of this new tree is less than or equal to the weight of $T^*$. Since $T^*$ is already a minimum spanning tree, its weight cannot decrease, meaning $c(w, x)$ must exactly equal $c(u, v)$, and the new tree is also a minimum spanning tree. By systematically repeating this swapping procedure for all differing edges, we can transform $T^*$ into $T$ without ever increasing the total weight. This demonstrates that $T$ shares the same minimum weight as $T^*$, confirming that $T$ is indeed a minimum spanning tree.
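The criterion just proved doubles as a practical MST verifier. Below is a Python sketch (names my own) that assumes `tree` really is a spanning tree of the graph and that tree edges appear in `edges` with the same orientation.

```python
from collections import defaultdict, deque

def is_mst_by_cycle_property(n, edges, tree):
    """Check the proved criterion: a spanning tree is minimum iff for
    every non-tree edge (u, v), each edge on the tree path u..v weighs
    at most c(u, v). Tree paths are found by BFS."""
    adj = defaultdict(list)
    for u, v, w in tree:
        adj[u].append((v, w))
        adj[v].append((u, w))
    def path_max(u, v):
        # BFS carrying the running maximum; the tree path is unique
        seen, queue = {u}, deque([(u, float("-inf"))])
        while queue:
            x, m = queue.popleft()
            if x == v:
                return m
            for y, w in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append((y, max(m, w)))
        return None
    tset = set(tree)
    return all(path_max(u, v) <= w
               for u, v, w in edges if (u, v, w) not in tset)
```

On a four-vertex cycle with a chord, the true MST passes the check while a heavier spanning tree fails it.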
Prove that a spanning tree $T$ is minimum if and only if for every tree edge $(u, v)$ there exists a cut $(S, V \setminus S)$ such that $u \in S$, $v \in V \setminus S$, and $(u, v)$ is a minimum-weight edge crossing the cut, where $c$ again denotes the edge-cost function.
This statement describes the cut property of minimum spanning trees. It can be rewritten as a theorem.
Theorem (Cut Property). Let $e = (u, v)$ be the unique minimum-weight edge crossing the cut $(S, V \setminus S)$ in the graph $G = (V, E)$. Then $e$ belongs to every minimum spanning tree of $G$.
Proof. ($\Rightarrow$): assuming $T$ is a minimum spanning tree, we must demonstrate that every edge within it satisfies the cut condition. Let $(u, v)$ be any edge that belongs to $T$. Removing $(u, v)$ from $T$ naturally partitions the vertices of the tree into two distinct, disconnected sets, which we will call $S$ and $V \setminus S$. Because $u$ and $v$ were directly connected by the removed edge, one vertex (say $u$) will belong to $S$, and the other ($v$) will belong to $V \setminus S$. This partition defines a cut $(S, V \setminus S)$ in the overall graph. We must prove that $c(u, v)$ is the minimum weight among all edges crossing this cut. Suppose, for the sake of contradiction, there is another edge $(x, y)$ crossing the cut such that $c(x, y) < c(u, v)$. If we insert $(x, y)$ into the partitioned tree, it reconnects the two components, forming a new spanning tree $T'$. The total weight of $T'$ equals the weight of $T$ minus $c(u, v)$ plus $c(x, y)$. Given that $c(x, y) < c(u, v)$, $T'$ possesses a strictly lower total weight than $T$. This directly contradicts the established fact that $T$ is a minimum spanning tree. Consequently, no such edge $(x, y)$ can exist, meaning $c(u, v)$ must be the smallest weight edge crossing the cut.
($\Leftarrow$): Assume $T$ is a spanning tree where, for every edge $e = (u, v)$ in $T$, there exists a cut separating $u$ and $v$ such that $e$ is the minimum weight edge crossing that cut. We must prove $T$ is a minimum spanning tree.
Let $T^*$ be an actual minimum spanning tree of the graph. If $T = T^*$, we are done.
Suppose $T \neq T^*$. Then there is at least one edge $e$ that belongs to $T$ but does not belong to $T^*$.
By our initial assumption, because $e \in T$, there is some cut $(S, V \setminus S)$ where $e$ is the lightest edge crossing the cut. Let's say $c(e) \le c(e')$ for all other edges $e'$ crossing this cut. If we add our edge $e$ to the known MST $T^*$, it must create exactly one cycle, because $T^*$ is already a connected spanning tree. Because $e$ connects a vertex in $S$ to a vertex in $V \setminus S$, this cycle must cross the cut boundary, loop around, and cross back over the boundary to close the loop. Therefore, there must be at least one other edge on this cycle, let's call it $e'$, that also crosses the cut $(S, V \setminus S)$ and belongs to $T^*$.
We know $e$ is the absolute lightest edge crossing this cut, which guarantees $c(e) \le c(e')$. If we remove $e'$ from $T^*$ and replace it with $e$, we break the cycle and form a new spanning tree, let's call it $T^{**}$. The total weight of our new tree is $w(T^{**}) = w(T^*) - c(e') + c(e)$. Since $c(e) \le c(e')$, it must be true that $w(T^{**}) \le w(T^*)$. $T^*$ is a minimum spanning tree, so no tree can weigh strictly less than it. Therefore, $w(T^{**})$ must equal $w(T^*)$, making our new tree $T^{**}$ also a valid MST.
We have successfully swapped an edge from $T$ into $T^*$ without increasing the total weight. If we repeat this exchange process for every edge in $T$ that isn't in $T^*$, we will eventually transform the MST $T^*$ entirely into $T$, proving that the total weight of $T$ is identical to the MST. Thus, $T$ is a minimum spanning tree. $\blacksquare$
Alternate Proof. Let $T$ be a spanning tree that doesn't include $e = (u, v)$, the lightest edge crossing the cut $(S, V \setminus S)$. We'll construct a different spanning tree $T'$ that contains $e$ and satisfies $w(T') \le w(T)$, with strict inequality whenever $e$ is the unique lightest crossing edge; hence excluding $e$ never helps, and when weights are distinct, $T$ can't be the MST.
Since $T$ is a spanning tree, there's a $u$-$v$ path $P$ in $T$. Since the path starts in $S$ and ends up outside $S$, there must be an edge $e' = (u', v')$ on this path where $u' \in S$, $v' \notin S$.
Let $T' = T - \{e'\} + \{e\}$. This is still connected, since any path in $T$ that needed $e'$ can be routed via $e$ instead, and it has no cycles, so it is a spanning tree.
But since $e$ was the lightest edge between $S$ and $V \setminus S$,
$$w(T') = w(T) - w(e') + w(e) \le w(T) - w(e') + w(e') = w(T) \quad \blacksquare$$
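To make the cut property concrete, here is a minimal brute-force check on a small hypothetical four-vertex graph (vertex names and weights are illustrative, not from the text): it enumerates all spanning trees, takes the cheapest, and confirms that the lightest edge crossing the cut $S = \{A\}$ does appear in it.

```python
from itertools import combinations

# Hypothetical example: a 4-cycle A-B-D-C-A with these weights
edges = {("A", "B"): 2, ("A", "C"): 3, ("B", "D"): 4, ("C", "D"): 5}
vertices = {"A", "B", "C", "D"}

def is_spanning_tree(es):
    # A spanning tree on |V| vertices has |V| - 1 edges and is connected
    if len(es) != len(vertices) - 1:
        return False
    adj = {v: [] for v in vertices}
    for u, v in es:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {"A"}, ["A"]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == vertices

trees = [set(t) for t in combinations(edges, 3) if is_spanning_tree(t)]
mst = min(trees, key=lambda t: sum(edges[e] for e in t))

# The lightest edge crossing the cut ({A}, {B, C, D}) is (A, B) with weight 2;
# the cut property says it must belong to the MST
assert ("A", "B") in mst
```

Brute force is exponential in general, of course; the point is only to see the property hold on an instance small enough to enumerate.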
Imagine you have a map of cities (vertices) connected by roads (edges), and you've found a way to connect them all together using the absolute minimum amount of pavement (a Minimum Spanning Tree, or MST). Now, imagine you draw a line completely dividing the map into two halves (a cut). The Cut Property states that the single shortest road crossing that dividing line must be part of your MST.
We began with a two-part proof, drawing a conclusion for each direction of the theorem.
Part 1 ($\Rightarrow$) asked why every edge in an MST is the cheapest across its cut.
Suppose you have a perfectly built Minimum Spanning Tree ($T$), which is the absolute cheapest way to connect all the points in your graph. If you take scissors and snip any single connection (let's call it edge $(u,v)$) in that tree, the tree will break into two completely separate chunks. The Cut is the gap between these two chunks.
The proof asks: what if the edge you just snipped wasn't actually the cheapest way to connect those two chunks? What if there was a cheaper edge, $(x,y)$, hiding somewhere else that also crosses that same gap? Therein lies the contradiction! So, if that cheaper edge existed, you could swap it in. You would leave $(u,v)$ snipped and use $(x,y)$ to reconnect the two chunks. Now you have a perfectly connected tree that costs less than your original tree. But wait—we started by assuming the original tree was the MST (the absolute cheapest possible)! You cannot get cheaper than the absolute cheapest, which means that hypothetical lighter edge $(x,y)$ simply cannot exist. Therefore, the edge you snipped must be the cheapest connection across that gap.
Part 2 ($\Leftarrow$) examined a different, related question: why a tree made entirely of minimum cut edges is an MST.
Now, flip the scenario. Assume you have a spanning tree where you already know that every single edge inside it is the absolute cheapest way to bridge some gap (cut) in the graph.
Since every single edge in your tree satisfies this rule, every single piece of your tree is a locally optimal choice. Local optimality alone does not automatically add up to global optimality, though; what bridges the gap is the exchange argument. Any edge of your tree that is missing from a true MST can be swapped into that MST without increasing its total weight, and repeating the swap transforms the MST into your tree. Since the weight never increased along the way, your tree weighs exactly as much as the MST, and is therefore an MST itself.
The alternate proof uses contradiction. What if the shortest road crossing the line, let's call it road $e$, is not in our supposed MST? Since the tree already connects all cities, there must already be some other, winding path connecting the two sides of our drawn line. This winding path must cross the dividing line somewhere, via some other road, let's call it $e'$.
If we add the short road $e$ to our map, we create a loop (a cycle). If we then break that loop by removing the other crossing road $e'$, our cities are all still completely connected! However, because $e$ was the lightest crossing edge, we swapped a heavier (or equally heavy) road for a lighter one. The total amount of pavement used is now no more than what we started with, and strictly less whenever $e$ was strictly the lightest. This shows our original setup was no better than one that includes $e$. Therefore, the shortest road crossing the cut is always safe to include.
Design an example such that there is an edge $(u, v)$ which is selected by Prim's Algorithm from a cut $(S, T)$. Suppose $T^*$ is a minimum spanning tree obtained by Prim's Algorithm. $(S', T')$ is the cut obtained from $T^*$ by deleting the edge $(u, v)$. We want $(S', T') \neq (S, T)$.
To fulfill this requirement, we need to construct a graph where the cut maintained by Prim's Algorithm during the selection of a specific edge dynamically shifts by the time the algorithm finishes building the entire minimum spanning tree. Let us consider a graph with four vertices: $A$, $B$, $C$, and $D$. We assign the following undirected edges and weights: $(A, B)$ with a weight of 2, $(A, C)$ with a weight of 3, $(B, D)$ with a weight of 4, and $(C, D)$ with a weight of 5. We will execute Prim's Algorithm starting from vertex $A$.
In the first iteration, the visited set is initialized as $S = \{A\}$. The unvisited set is $T_{cut} = \{B, C, D\}$. The edges crossing this initial cut are $(A, B)$ and $(A, C)$. Prim's Algorithm greedily selects the minimum weight edge crossing the cut, which is $(A, B)$ with a weight of 2. For this chosen edge, $u = A$ and $v = B$. The cut at the exact moment of this selection is defined by $S = \{A\}$ and $T_{cut} = \{B, C, D\}$. The algorithm then proceeds, adding $B$ to the visited set. In the second iteration, the visited set becomes $\{A, B\}$, and the cut is now between $\{A, B\}$ and $\{C, D\}$. The crossing edges are $(A, C)$ and $(B, D)$. The algorithm selects $(A, C)$ with a weight of 3. In the third and final iteration, the visited set is $\{A, B, C\}$, the cut is between $\{A, B, C\}$ and $\{D\}$, and the algorithm selects $(B, D)$ with a weight of 4.
The final minimum spanning tree $T^*$ obtained by Prim's Algorithm consists of the edges $(A, B)$, $(A, C)$, and $(B, D)$. We now analyze what happens when we delete the specific edge $(u, v) = (A, B)$ that was selected in the very first step. Removing $(A, B)$ from $T^*$ severs the tree into two disconnected components. By examining the remaining edges in the tree, $(A, C)$ and $(B, D)$, we can clearly see the composition of these components. Vertex $A$ is connected to $C$, forming the component $S' = \{A, C\}$. Vertex $B$ is connected to $D$, forming the component $T' = \{B, D\}$. The cut obtained from the final minimum spanning tree is therefore $(S', T') = (\{A, C\}, \{B, D\})$. Comparing this to the original cut at the time of selection, which was $S = \{A\}$ and $T_{cut} = \{B, C, D\}$, we can explicitly see that $S \neq S'$. The addition of vertex $C$ to the MST in a later step, via an alternate route that did not pass through $B$, fundamentally altered the final component structure. Therefore, we have successfully demonstrated an example where $(S', T') \neq (S, T_{cut})$.
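The walkthrough above can be replayed in code. Below is a sketch of a heap-based Prim's algorithm (function and variable names are mine) on the same four-vertex graph; it records the cut $S$ at the moment each edge is selected, then recomputes the component containing $A$ after deleting $(A, B)$ from the finished tree.

```python
import heapq

graph = {
    "A": {"B": 2, "C": 3},
    "B": {"A": 2, "D": 4},
    "C": {"A": 3, "D": 5},
    "D": {"B": 4, "C": 5},
}

def prim(start):
    """Return the MST edges and, for each one, the cut S at selection time."""
    visited = {start}
    tree, cuts = [], []
    heap = [(w, start, v) for v, w in graph[start].items()]
    heapq.heapify(heap)
    while heap:
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue
        cuts.append(set(visited))  # the cut (S, V \ S) when (u, v) is chosen
        tree.append((u, v))
        visited.add(v)
        for x, wx in graph[v].items():
            if x not in visited:
                heapq.heappush(heap, (wx, v, x))
    return tree, cuts

tree, cuts = prim("A")
assert tree[0] == ("A", "B") and cuts[0] == {"A"}  # selected across S = {A}

# Delete (A, B) from the finished tree and flood-fill A's component
rest = [e for e in tree if e != ("A", "B")]
S_prime, changed = {"A"}, True
while changed:
    changed = False
    for u, v in rest:
        if (u in S_prime) != (v in S_prime):
            S_prime |= {u, v}
            changed = True

assert S_prime == {"A", "C"}  # ({A,C}, {B,D}) differs from ({A}, {B,C,D})
```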
Sollin's algorithm (also known as Borůvka's algorithm) builds the MST by growing multiple trees simultaneously. It is particularly well-suited for parallel network implementations.
SOLLIN(V, E, w)
1: for each v ∈ V do Nv := {v} // Treat all nodes as singleton components
2: T* := ∅
3: while |T*| < |V| - 1 do
4:   for each tree Nk of the forest do
5:     find the least cost arc (ik, jk) from Nk to V \ Nk
6:     // In case of ties, use a consistent tie-breaking rule to avoid cycles
7:   T* := T* ∪ {(i1, j1), (i2, j2), ...}
8:   merge the components joined by the arcs just added
9: return T*
During the execution of Sollin's algorithm, at each full iteration (pass), every individual component in the forest simultaneously finds and adds the cheapest edge leaving it. Because every component pairs up with at least one other component, the total number of independent trees in the forest is reduced by at least half during every single pass.
Consequently, the size of the smallest component at least doubles every iteration. Starting with $V$ singleton components, it takes at most $O(\log V)$ iterations to merge them all into a single spanning tree.
During each pass, finding the minimum outgoing edge for all components requires scanning the adjacency lists, checking every edge in the graph. This takes $O(E)$ time per pass. Multiplying the $O(E)$ scan time by the $O(\log V)$ iterations yields a straightforward implementation running time of $O(E \log V)$.
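The passes described above can be sketched as follows: a minimal sequential Borůvka/Sollin implementation (function and variable names are mine) built on union-find. It assumes distinct edge weights, which doubles as the consistent tie-breaking rule from line 6 of the pseudocode.

```python
def sollin(n, edges):
    """edges: list of (w, u, v) with vertices 0..n-1 and distinct weights."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    while len(mst) < n - 1:
        # One pass: every component records its cheapest outgoing edge
        cheapest = {}
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # internal edge, ignore
            for r in (ru, rv):
                if r not in cheapest or (w, u, v) < cheapest[r]:
                    cheapest[r] = (w, u, v)
        # Add all selected edges at once, merging components
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append((w, u, v))
    return mst

mst = sollin(4, [(2, 0, 1), (3, 0, 2), (4, 1, 3), (5, 2, 3)])
assert len(mst) == 3
```

Each `while` pass scans every edge once ($O(E)$), and at most $O(\log V)$ passes occur, matching the $O(E \log V)$ bound above.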
While often called Sollin's algorithm, the first minimum-spanning-tree algorithm was actually published in a 1926 paper by O. Borůvka. In fact, Borůvka's algorithm is essentially just running $O(\log V)$ iterations of the MST-REDUCE procedure (which aggressively contracts edges and vertices) described earlier!
Similarly, the algorithm commonly known as Prim's (1957) was actually invented much earlier by V. Jarník in 1930. Kruskal published his algorithm in 1956. The fundamental reason these greedy approaches work so perfectly for MSTs is rooted in abstract algebra: the set of forests of a graph forms a mathematical structure known as a graphic matroid (which we will go into in detail later).
For highly dense graphs where $E = \Omega(V \log V)$, standard Prim's algorithm with a Fibonacci heap already runs in optimal $O(E)$ time. However, for sparser graphs, researchers have combined ideas from Prim, Kruskal, and Borůvka with highly advanced data structures to push the bounds closer to linear time. So a key computational question remains: can we achieve a true linear running time of $O(V + E)$? Randomized algorithms are known that achieve $O(V + E)$ expected time, and deterministic linear time is achievable when edge weights are integers on a word RAM; whether a deterministic linear-time comparison-based algorithm exists is still open.
While greedy algorithms do not work for every problem (such as the 0-1 knapsack or certain scheduling tasks), there is a beautiful combinatorial theory that perfectly describes many situations where the greedy method does yield optimal solutions. This theory revolves around mathematical structures known as matroids.
A matroid is defined as an ordered pair $M = (S, \mathcal{I})$ that strictly satisfies three specific conditions.
Condition 1 (Finite Set): $S$ is a finite set of elements.
Condition 2 (Hereditary Property): $\mathcal{I}$ is a nonempty family of subsets of $S$, which we call the independent subsets. This family is hereditary, meaning that if a subset $B$ is independent ($B \in \mathcal{I}$), then any subset $A$ contained entirely within $B$ ($A \subseteq B$) must also be independent ($A \in \mathcal{I}$). Because it is nonempty and hereditary, the empty set $\emptyset$ is always guaranteed to be a member of $\mathcal{I}$.
Condition 3 (Exchange Property): If $A \in \mathcal{I}$ and $B \in \mathcal{I}$ are two independent subsets, and $A$ is strictly smaller than $B$ ($|A| < |B|$), then there must exist some element $x$ that is in $B$ but not in $A$ ($x \in B \setminus A$) such that adding $x$ to $A$ keeps the new set independent ($A \cup \{x\} \in \mathcal{I}$).
The concept of a matroid was originally introduced by Hassler Whitney while studying matrices. In a "matric matroid," the elements of $S$ are the rows of a matrix, and a subset of rows belongs to $\mathcal{I}$ if those rows are linearly independent in the standard algebraic sense.
To better understand matroids, it is worthwhile to examine their roots in Linear Algebra and Graph Theory. A set of vectors is said to be independent if no vector in the set can be created by adding or scaling the others. For example, a vector pointing northward and another pointing eastward are independent, as they point in two completely different directions. However, a vector pointing northeast is not independent of them, because it can be produced by combining the north and east vectors. We say that the northeast-pointing vector is redundant. Graph theory has its own notion of independence too. In a graph, a set of edges is said to be independent if it does not form a cycle (a loop). Trees, and collections of them called forests, do not contain loops, so they are independent! Here, redundancy is determined by whether or not an edge closes a loop and forms a cycle (i.e., that edge is redundant).
As you can see, these two notions of independence and redundancy are very similar. In fact, they are structurally the same, and matroids are a single structure born to describe both. Simply put, a matroid is just a collection of items in which certain groups of items are designated independent. Per the conditions outlined above, the ground set must be finite, the independent groups must be hereditary, and they must satisfy the exchange property.
A highly practical example is the graphic matroid, denoted as $M_G = (S_G, \mathcal{I}_G)$, which is defined for a given undirected graph $G = (V, E)$.
In a graphic matroid, the finite set $S_G$ is simply defined as $E$, the complete set of edges in the graph.
A subset of edges $A$ belongs to the independent family $\mathcal{I}_G$ if and only if $A$ is acyclic. In other words, a set of edges is considered "independent" if and only if the subgraph $G_A = (V, A)$ forms a forest (a collection of valid, cycle-free trees).
This graphic matroid is the underlying mathematical engine that guarantees the correctness of the greedy Minimum Spanning Tree algorithms we covered in the previous section. If a system follows the three laws ("conditions") stipulated above, greedy algorithms will always work. Picking the best option in the moment does not always lead to the best result in the end. However, thanks to Condition 3 (the Exchange Property), we are guaranteed that picking the best independent item every step of the way will eventually lead to the best solution. When you apply Kruskal's algorithm, for example, you are effectively processing a graphic matroid. You start with an empty set (which is independent by Condition 2) and greedily add edges to your tree under construction, provided they don't form a cycle (maintaining independence in $\mathcal{I}_G$).
Because the sets of acyclic forests in a graph perfectly satisfy the matroid exchange property (Condition 3), a greedy strategy is mathematically guaranteed to safely grow the independent set until it becomes a "maximal" independent set. In the context of a connected graphic matroid, a maximal independent set is precisely a minimum spanning tree. The matroid structure essentially proves why the greedy choice never traps you in a local optimum during MST construction.
The same, however, cannot be said of Prim's algorithm directly. While its process is also greedy, it does not consider edges in globally sorted weight order; it grows a single connected tree outward from a start vertex, so its correctness is usually argued via the cut property rather than as a direct instance of the generic matroid greedy algorithm.
Theorem. The Graphic Matroid. If $G = (V, E)$ is an undirected graph, then $M_G = (S_G, \mathcal{I}_G)$ is a matroid.
Proof. Clearly, $S_G = E$ is a finite set. Furthermore, $\mathcal{I}_G$ is hereditary, since a subset of a forest is inherently a forest. Putting it another way, removing edges from an acyclic set of edges cannot possibly create cycles. Therefore, the first two properties of a matroid are satisfied.
It remains to be shown that $M_G$ satisfies the exchange property. Suppose that $G_A = (V, A)$ and $G_B = (V, B)$ are forests of $G$ and that $|B| > |A|$. That is, $A$ and $B$ are both acyclic sets of edges, and $B$ contains strictly more edges than $A$ does.
We claim that a forest $F = (V_F, E_F)$ contains exactly $|V_F| - |E_F|$ trees. To see why, suppose that $F$ consists of $t$ trees, where the $i$-th tree contains $v_i$ vertices and $e_i$ edges. Then, we have:
$$|E_F| = \sum_{i=1}^{t} e_i = \sum_{i=1}^{t} (v_i - 1) = \sum_{i=1}^{t} v_i - t = |V_F| - t$$
This implies that $t = |V_F| - |E_F|$. Applying this logic, the forest $G_A$ contains $|V| - |A|$ trees, and the forest $G_B$ contains $|V| - |B|$ trees.
Since $|B| > |A|$, the forest $G_B$ must contain fewer trees than forest $G_A$. Because forest $G_B$ has fewer trees, it must contain some tree $T$ whose vertices span across two different trees in forest $G_A$. Moreover, since tree $T$ is connected, it must contain an edge $(u, v)$ such that vertices $u$ and $v$ belong to two completely different trees in forest $G_A$.
Since the edge $(u, v)$ connects vertices in two different trees in forest $G_A$, we can safely add the edge $(u, v)$ to forest $G_A$ without creating a cycle. Therefore, $M_G$ satisfies the exchange property, completing the proof that $M_G$ is a matroid. $\blacksquare$
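Both facts used in the proof, the tree-count formula $t = |V_F| - |E_F|$ and the existence of an exchange edge, are easy to check numerically. A small sketch (helper names and the example forests are mine):

```python
def forest_components(n, edges):
    """Union-find over a forest; returns each vertex's component label."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return [find(v) for v in range(n)]

# A forest with |V_F| = 4 vertices and |E_F| = 1 edge has 4 - 1 = 3 trees
A = [(0, 1)]
comp_A = forest_components(4, A)
assert len(set(comp_A)) == 4 - len(A)

# Exchange property: B is a larger forest, so some edge of B joins two
# different trees of A and can be added to A without creating a cycle
B = [(0, 2), (2, 3)]
x = next((u, v) for (u, v) in B if comp_A[u] != comp_A[v])
assert x == (0, 2)
```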
Given a matroid $M = (S, \mathcal{I})$, we call an element $x \notin A$ an extension of $A \in \mathcal{I}$ if we can add $x$ to $A$ while preserving independence; that is, $x$ is an extension of $A$ if $A \cup \{x\} \in \mathcal{I}$. As an example, consider our graphic matroid $M_G$. If $A$ is an independent set of edges, then an edge $e$ is an extension of $A$ if and only if $e$ is not already in $A$ and the addition of $e$ to $A$ does not create a cycle.
If $A$ is an independent subset in a matroid $M$, we say that $A$ is maximal if it has no extensions. That is, $A$ is maximal if it is not contained in any larger independent subset of $M$. In the context of a graphic matroid, a maximal independent subset forms a spanning forest (or a spanning tree, if the original graph is connected).
Theorem. Uniform Size of Maximal Independent Subsets. All maximal independent subsets in a matroid have the exact same size.
Proof. Suppose to the contrary that $A$ is a maximal independent subset of a matroid $M$ and there exists another, strictly larger maximal independent subset $B$ of $M$. Because $|B| > |A|$, the matroid exchange property implies that for some element $x \in B \setminus A$, we can extend $A$ to a larger independent set $A \cup \{x\}$. However, this directly contradicts our initial assumption that $A$ is already maximal (meaning it has no extensions). Therefore, all maximal independent subsets must be of equal size. $\blacksquare$
As an illustration of this theorem, consider a graphic matroid $M_G$ for a connected, undirected graph $G$. Every maximal independent subset of $M_G$ must be a free tree with exactly $|V|-1$ edges that connects all the vertices of $G$. Such a tree is called a spanning tree of $G$.
We say that a matroid $M=(S, \mathcal{I})$ is weighted if it is associated with a weight function $w$ that assigns a strictly positive weight $w(x) > 0$ to each element $x \in S$. The weight function $w$ extends to subsets of $S$ by simple summation:
$$w(A) = \sum_{x \in A} w(x)$$
for any subset $A \subseteq S$. For example, if $w(e)$ denotes the weight of an edge $e$ in a graphic matroid $M_G$, then $w(A)$ is the total weight of the edges in the independent edge set $A$.
Many problems for which a greedy approach provides optimal solutions can be formulated in terms of finding a maximum-weight independent subset in a weighted matroid. We are given a weighted matroid $M=(S, \mathcal{I})$, and we wish to find an independent set $A \in \mathcal{I}$ such that $w(A)$ is maximized. We call such a subset an optimal subset of the matroid.
Because the weight $w(x)$ of any element $x \in S$ is strictly positive, an optimal subset is always a maximal independent subset: adding any available element strictly increases the total weight, so it always helps to make $A$ as large as possible.
In the minimum-spanning-tree problem, we want to find a subset of edges that connects all vertices with the minimum total length. To view this as a matroid problem (which looks for a maximum weight), we define a new weight function $w'$ for the graphic matroid $M_G$. Let $w'(e) = w_0 - w(e)$, where $w_0$ is a constant larger than the maximum length of any edge. In this new weighted matroid, all weights are strictly positive.
Each maximal independent subset $A$ corresponds to a spanning tree with exactly $|V|-1$ edges. The total weight of $A$ in our matroid is:
$$w'(A) = \sum_{e \in A} w'(e) = \sum_{e \in A} (w_0 - w(e)) = (|V| - 1)w_0 - \sum_{e \in A} w(e) = (|V| - 1)w_0 - w(A)$$
Because $(|V|-1)w_0$ is a fixed constant, any independent subset $A$ that maximizes the quantity $w'(A)$ must inherently minimize the original length $w(A)$. Thus, any algorithm capable of finding an optimal subset in an arbitrary matroid can automatically solve the minimum-spanning-tree problem.
The following greedy algorithm (from CLRS) works for any weighted matroid. It takes a weighted matroid $M$ and its positive weight function $w$, and returns an optimal subset $A$. It is greedy because it considers each element in monotonically decreasing order of weight, instantly adding it to $A$ if doing so maintains independence.
GREEDY(M, w)
1: A = ∅
2: sort M.S into monotonically decreasing order by weight w
3: for each x ∈ M.S, taken in monotonically decreasing order by weight w(x)
4: if A ∪ {x} ∈ M.I
5: A = A ∪ {x}
6: return A
The following analysis is due to CLRS. Line 4 checks whether adding element $x$ to $A$ would maintain $A$ as an independent set. If it would, Line 5 adds $x$ to $A$; otherwise, $x$ is discarded. Since the empty set is independent, and since each iteration of the for loop maintains $A$'s independence, the subset $A$ is always independent, by induction. Therefore, GREEDY always returns an independent subset $A$.
Let $n$ denote $|S|$. The sorting phase takes $O(n \log n)$ time. The independence check on line 4 executes exactly $n$ times. If each check takes $O(f(n))$ time, the total running time of the generic GREEDY algorithm is bounded by $O(n \log n + n f(n))$.
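Here is a sketch of the generic algorithm with the independence test abstracted into an oracle (the $O(f(n))$ factor in the analysis). It is instantiated with a graphic-matroid oracle, an acyclicity check via union-find, so the greedy returns a maximum-weight spanning forest; function names and the example graph are mine.

```python
def greedy(elements, weight, independent):
    """Generic matroid greedy (CLRS GREEDY): scan elements in decreasing
    weight order, keeping x whenever A ∪ {x} stays independent."""
    A = set()
    for x in sorted(elements, key=weight, reverse=True):
        if independent(A | {x}):
            A.add(x)
    return A

def acyclic(edge_set, n):
    """Graphic-matroid independence oracle: True iff edge_set is a forest."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edge_set:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # this edge closes a cycle
        parent[ru] = rv
    return True

# Hypothetical 4-cycle with vertices 0..3; find a maximum-weight forest
w = {(0, 1): 2, (0, 2): 3, (1, 3): 4, (2, 3): 5}
forest = greedy(w, lambda e: w[e], lambda s: acyclic(s, 4))
assert forest == {(0, 2), (1, 3), (2, 3)}
```

Each oracle call here rebuilds a union-find from scratch, so $f(n)$ is roughly linear in $|A|$; a production Kruskal would instead maintain one union-find incrementally.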
To prove GREEDY is correct, we first show that making the best local choice (picking the heaviest available independent element) is always safe. This is known as the greedy-choice property.
Lemma (Greedy-choice Property of Matroids). Suppose that $M = (S, \mathcal{I})$ is a weighted matroid with weight function $w$ and that $S$ is sorted into monotonically decreasing order by weight. Let $x$ be the first element of $S$ such that $\{x\}$ is independent. If such an $x$ exists, then there exists an optimal subset $A$ of $S$ that contains $x$.
Think of $x$ as the "premium" item. Since it’s the heaviest thing you can legally pick at the start, if you had an optimal collection $B$ that didn't include $x$, you could swap something in $B$ for $x$. Because $x$ is heavier than (or equal to) anything else in $B$, this swap won't hurt your total weight—it might even help. This proves you never "regret" picking the heaviest possible independent element first.
Proof. If no such $x$ exists, the only independent set is the empty set, and the claim is trivially true. Otherwise, let $B$ be any nonempty optimal subset. If $x \in B$, we are done. If $x \notin B$, we construct a set $A$ that contains $x$ and is at least as heavy as $B$. First, note that for any $y \in B$, $w(x) \ge w(y)$ because $x$ was the first independent element in the sorted list. Since $B$ is independent and $\mathcal{I}$ is hereditary, $\{y\}$ is independent for any $y \in B$.
Now, use the exchange property. Start with the set $\{x\}$. Since $|\{x\}| < |B|$, there exists an element in $B$ that can be added to $\{x\}$ to maintain independence. Repeatedly find new elements of $B$ to add until $|A| = |B|$. In this construction, $A$ will consist of $x$ and all but one element of $B$ (let's call the discarded element $y$). Thus, $A = B - \{y\} \cup \{x\}$. The weight of the new set is $w(A) = w(B) - w(y) + w(x)$. Since $w(x) \ge w(y)$, we have $w(A) \ge w(B)$. Because $B$ was an optimal subset, $A$ must also be optimal. $\blacksquare$
We need to be sure that if the algorithm skips an element because it's not "independent" at the beginning, that element won't suddenly become useful later. Matroids guarantee this stability.
Lemma (Hereditary Extensions). If $x$ is an element of $S$ that is an extension of some independent subset $A$, then $x$ is also an extension of $\emptyset$.
Proof. Since $x$ is an extension of $A$, $A \cup \{x\}$ is independent. Since $\mathcal{I}$ is hereditary, the subset $\{x\}$ must also be independent. Thus, $x$ is an extension of $\emptyset$. $\blacksquare$
Corollary (The "Never-Useful" Rule). If $x$ is not an extension of $\emptyset$, then $x$ is not an extension of any independent subset $A$.
This is the contrapositive of the lemma above. In a Graphic Matroid, this is very clear: an edge that isn't independent by itself is a self-loop (a cycle of length 1). No matter how many other edges you add to the graph, that self-loop will always be a cycle. It can never "become" acyclic. Therefore, the greedy algorithm makes no mistake by permanently passing over elements that fail the independence check initially.
Once you commit to your first best choice $x$, the "rest of the problem" is just another matroid problem on the remaining elements that don't conflict with $x$.
Lemma (Optimal-substructure Property of Matroids). Let $x$ be the first element of $S$ chosen by GREEDY. The problem of finding a maximum-weight independent subset containing $x$ reduces to finding a maximum-weight independent subset of the weighted matroid $M' = (S', \mathcal{I}')$, where:
$S' = \{y \in S : \{x, y\} \in \mathcal{I}\}$
$\mathcal{I}' = \{B \subseteq S - \{x\} : B \cup \{x\} \in \mathcal{I}\}$
and the weight function is $w$ restricted to $S'$. This $M'$ is called the contraction of $M$ by $x$.
Proof. If $A$ is any maximum-weight independent subset of $M$ containing $x$, then $A' = A - \{x\}$ is an independent subset of $M'$. Conversely, any independent subset $A'$ of $M'$ yields an independent subset $A = A' \cup \{x\}$ of $M$. Since $w(A) = w(A') + w(x)$ in both cases, an optimal solution for $M'$ must correspond to an optimal solution for $M$ that contains $x$. $\blacksquare$
Theorem (Correctness). If $M = (S, \mathcal{I})$ is a weighted matroid with weight function $w$, then GREEDY(M, w) returns an optimal subset.
Proof. By the Never-Useful Corollary, any elements GREEDY passes over initially can be forgotten; they can never be useful. Once GREEDY selects the first element $x$, the Greedy-Choice Lemma ensures there exists an optimal subset containing $x$. Finally, the Optimal-Substructure Lemma implies that the remaining steps of the algorithm are effectively finding an optimal subset in the contracted matroid $M'$. By induction, the entire sequence of choices leads to a maximum-weight independent subset. $\blacksquare$
The versatility of matroid theory is best demonstrated by examining various combinatorial and algebraic structures that satisfy the three defining axioms of matroids. Let us recall that they are: a finite ground set, the hereditary property, and the exchange property.
Consider a finite set $S$ and an integer $k \le |S|$. We can define a structure $M = (S, \mathcal{I}_k)$ where a subset $A$ is considered independent if and only if its cardinality does not exceed $k$ ($|A| \le k$). This is known as a uniform matroid.
To verify this is a matroid, we first observe that $\mathcal{I}_k$ is hereditary; if a set $B$ has at most $k$ elements, any subset $A \subseteq B$ must also have at most $k$ elements, thus preserving independence. The exchange property is equally straightforward: if $A, B \in \mathcal{I}_k$ and $|A| < |B|$, then $B$ contains at least one element $x$ not in $A$. Since $|A| < |B| \le k$, the set $A \cup \{x\}$ will have a size $|A| + 1$, which is still $\le k$. Therefore, $A \cup \{x\}$ remains independent, satisfying the exchange property.
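On a uniform matroid the independence check degenerates to a size comparison, so the generic greedy simply keeps the $k$ heaviest elements. A minimal sketch (names are mine):

```python
def uniform_greedy(S, w, k):
    """Greedy on the uniform matroid (independent iff |A| <= k):
    the optimum is simply the k heaviest elements."""
    A = []
    for x in sorted(S, key=w, reverse=True):
        if len(A) + 1 <= k:  # the independence check |A ∪ {x}| <= k
            A.append(x)
    return A

assert uniform_greedy([5, 1, 9, 3, 7], lambda x: x, 3) == [9, 7, 5]
```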
The very term "matroid" originates from the study of matrices. Given an $m \times n$ matrix $T$ over a field, let $S$ be the set of columns of $T$. We define a subset $A \subseteq S$ as independent if the columns in $A$ are linearly independent in the algebraic sense.
This structure, called a matric matroid, satisfies the hereditary property because any subset of a linearly independent set of vectors is itself linearly independent. The exchange property is a fundamental result of linear algebra: if we have a set of linearly independent vectors $A$ and a larger set of linearly independent vectors $B$, there must exist a vector $x \in B \setminus A$ that is not in the span of $A$. Adding this $x$ to $A$ results in a larger linearly independent set, fulfilling the matroid exchange axiom.
A fascinating property of matroids is their symmetry through duality. If $(S, \mathcal{I})$ is a matroid, we can define a dual structure $(S, \mathcal{I}^*)$ where the maximal independent sets (bases) of $\mathcal{I}^*$ are precisely the complements of the maximal independent sets of $\mathcal{I}$.
Specifically, a set $A'$ is in $\mathcal{I}^*$ if there exists some maximal independent set $A \in \mathcal{I}$ such that $A' \subseteq S \setminus A$. This duality ensures that every matroid has a corresponding "mirror" matroid. Proving that this dual structure satisfies the exchange property is more complex but relies on the fact that the rank function of a matroid and its dual are mathematically linked, preserving the greedy-choice infrastructure across both sets.
Independence can also be defined by how elements are distributed across a partition. Let $S$ be partitioned into nonempty disjoint subsets $S_1, S_2, \dots, S_k$. We define $A \in \mathcal{I}$ if and only if $|A \cap S_i| \le 1$ for all $i=1, \dots, k$. This is known as a partition matroid.
The hereditary property holds because if $B$ contains at most one element from each $S_i$, any subset $A$ of $B$ certainly cannot contain more than one from any $S_i$. For the exchange property, suppose $|A| < |B|$. Since each set $A$ and $B$ can have at most one element per partition, and $B$ is larger, there must be some partition $S_j$ that contains an element $x \in B$ but contains no elements from $A$. Adding $x$ to $A$ maintains the property that each partition has at most one member, thus $A \cup \{x\}$ is independent.
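For a partition matroid the greedy keeps the heaviest element it encounters from each block $S_i$. A small sketch, with the partition given as a labeling function (names and the example items are mine):

```python
def partition_greedy(items, block, weight):
    """Greedy on a partition matroid: keep at most one item per block."""
    chosen, used = [], set()
    for x in sorted(items, key=weight, reverse=True):
        if block(x) not in used:  # independence: block not yet represented
            used.add(block(x))
            chosen.append(x)
    return chosen

# Hypothetical (name, weight) items, partitioned by the name's first letter
items = [("a1", 5), ("a2", 3), ("b1", 4), ("b2", 6)]
chosen = partition_greedy(items, block=lambda x: x[0][0], weight=lambda x: x[1])
assert chosen == [("b2", 6), ("a1", 5)]
```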
The standard weighted-matroid problem seeks to find a maximum-weight maximal independent subset. However, many real-world problems, like the Minimum Spanning Tree (MST), require finding a minimum-weight solution. We can transform a minimum-weight problem into a standard maximum-weight matroid problem by modifying the weight function $w$.
Let $w_{max}$ be the maximum weight of any single element in $S$. We define a new weight function $w'(x) = w_{max} - w(x)$ for all $x \in S$ (to keep every transformed weight strictly positive, as the definition of a weighted matroid requires, one may instead subtract from any constant $w_0 > w_{max}$). Because all maximal independent sets in a matroid have the same size, as proven earlier (let this size be $h$), we can observe the relationship between the total weights of a maximal independent set $A$ under $w$ and $w'$:
$$w'(A) = \sum_{x \in A} (w_{max} - w(x)) = h \cdot w_{max} - \sum_{x \in A} w(x) = h \cdot w_{max} - w(A)$$
Since $h \cdot w_{max}$ is a constant value for all maximal independent sets, maximizing $w'(A)$ is mathematically identical to minimizing $w(A)$. This transformation allows us to use the same GREEDY algorithm for both optimization directions, provided all transformed weights $w'(x)$ remain non-negative.
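The transformation is easy to verify on a small hypothetical four-cycle graph (weights illustrative); this sketch uses $w_0 = w_{max} + 1$ so the transformed weights stay strictly positive, which leaves the argmax unchanged since all spanning trees share the same size $h$.

```python
# Hypothetical 4-cycle A-B-D-C-A; every spanning tree is the cycle minus one edge
edges = {("A", "B"): 2, ("A", "C"): 3, ("B", "D"): 4, ("C", "D"): 5}
w0 = max(edges.values()) + 1                     # any constant > w_max works
w_prime = {e: w0 - w for e, w in edges.items()}  # w'(x) = w0 - w(x) > 0

trees = [set(edges) - {e} for e in edges]        # the four spanning trees
min_tree = min(trees, key=lambda t: sum(edges[e] for e in t))
max_tree = max(trees, key=lambda t: sum(w_prime[e] for e in t))

# Maximizing w' is the same as minimizing w
assert min_tree == max_tree
```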
A highly practical application of matroid theory is optimally scheduling a set of unit-time tasks on a single processor. Each task comes with a specific deadline and a penalty that must be paid if the task is not completed by that deadline. While scheduling problems can often become highly complex, mapping this specific constraint system to a matroid allows us to find the optimal schedule using a straightforward greedy algorithm.
A unit-time task is a job that requires exactly one unit of continuous processing time. The problem provides the following inputs: a set $S = \{a_1, a_2, \dots, a_n\}$ of $n$ unit-time tasks; a set of $n$ integer deadlines $d_1, d_2, \dots, d_n$, where each $d_i$ satisfies $1 \le d_i \le n$ and task $a_i$ is expected to finish by time $d_i$; and a set of $n$ nonnegative weights, or penalties, $w_1, w_2, \dots, w_n$, where we incur the penalty $w_i$ if task $a_i$ is not finished by its deadline and incur no penalty otherwise.
Our objective is to create a schedule—a specific permutation of $S$ dictating the order of execution—that minimizes the total penalty incurred from missed deadlines. Alternatively, minimizing the penalties of the late tasks is mathematically identical to maximizing the sum of the penalties of the early tasks. This shifts our goal to finding a maximum-weight set of tasks that can all be completed on time.
To reduce the search space, we can categorise the tasks in a given schedule as either early (finishing on or before their deadline) or late (finishing after their deadline). We can then systematically rearrange any arbitrary schedule into a highly organized canonical form without ever changing which tasks are early and which are late.
First, we can transform any schedule into an early-first form, where every early task is executed before any late task. If a schedule has an early task $a_i$ positioned after a late task $a_j$, we simply swap them. The task $a_i$ is moved to an earlier time slot, so it effortlessly remains early. The task $a_j$ is pushed to a later time slot; since it was already missing its deadline in the earlier slot, it definitely still misses it in the later slot. By repeating this, all early tasks bubble to the front.
Second, we can organize the early-first schedule into canonical form, where the early tasks are strictly ordered by monotonically increasing deadlines. Suppose we have two adjacent early tasks, $a_i$ finishing at time $k$ and $a_j$ finishing at time $k+1$, but they are out of deadline order, meaning $d_j < d_i$. Because $a_j$ is currently early, we know its finish time $k+1 \le d_j$. If we swap their positions, $a_j$ now finishes at time $k$. Since $k < k+1 \le d_j$, task $a_j$ is still early. Task $a_i$ now finishes at time $k+1$. However, we know $k+1 \le d_j < d_i$, which guarantees $k+1 \le d_i$. Thus, $a_i$ also remains early.
This organization dictates that the entire search for an optimal schedule reduces to merely selecting the correct subset $A$ of tasks to be early. Once we choose $A$, we simply sort $A$ by increasing deadlines, schedule them sequentially, and then append the rejected tasks $S \setminus A$ at the end in any arbitrary order we want.
We define a subset of tasks $A$ as independent if there exists a valid schedule for $A$ such that no task in $A$ is late. Let $\mathcal{I}$ represent the family of all independent sets of tasks. To rapidly check if a set is independent without simulating a schedule, we use a counting metric. Let $N_t(A)$ denote the number of tasks in subset $A$ whose deadline is $t$ or earlier, for $t = 0, 1, \dots, n$. By definition, $N_0(A) = 0$.
The Counting Lemma. For any set of tasks $A$, the following statements are entirely equivalent:
(1) The set $A$ is independent.
(2) For $t = 0, 1, \dots, n$, we have $N_t(A) \le t$.
(3) If the tasks in $A$ are scheduled in order of monotonically increasing deadlines, then no task is late.
Proof of the Counting Lemma. To prove that (1) implies (2), consider the contrapositive. If $N_t(A) > t$ for some specific time $t$, it means the set $A$ contains more than $t$ tasks that absolutely must be finished by time $t$. Since the processor can only complete one task per unit of time, it is physically impossible to complete more than $t$ tasks in $t$ time slots. Thus, $A$ cannot possibly be scheduled without lateness. If (2) holds, (3) naturally follows. Scheduling tasks by increasing deadlines means you are fulfilling the most urgent tasks first. The condition $N_t(A) \le t$ guarantees that you will never encounter a bottleneck where a task's deadline arrives before its assigned execution slot. The $i$-th task in the sorted sequence will have a deadline of at least $i$. Finally, (3) implies (1) trivially, because if scheduling them in increasing deadline order yields zero late tasks, then an early schedule exists, fulfilling the definition of independence. $\blacksquare$
We can use (2) of the Counting Lemma to compute whether or not a given set of tasks is independent! A naive implementation of the greedy scheduling algorithm might repeatedly simulate schedules or scan up to the maximum possible deadline $n$ to verify the $N_t(A) \le t$ condition. This results in $O(n)$ time per check, leading to an overall runtime of $O(n^2)$. However, by leveraging the mathematical properties of the capacity constraints, we can determine whether a given subset $A$ is independent in only $O(|A|)$ time.
The underlying rationale for this optimization is the observation that a set of $k = |A|$ tasks requires exactly $k$ time slots. If we pack these $k$ tasks as early as possible, they will never occupy a time slot beyond $t = k$. Consequently, any task with a deadline strictly greater than $k$ imposes no tighter constraint on the schedule than a task with a deadline exactly equal to $k$. If the capacity constraint $N_t(A) \le t$ holds for all $t$ from $1$ up to $k$, then for any $t > k$, the maximum possible number of tasks in the entire set is $k$, ensuring that $N_t(A) \le k < t$. The condition is naturally satisfied for all $t > k$.
Thus, we only need to verify the prefix sums of the deadline counts up to $t = |A|$. We can achieve this by clamping any deadline greater than $|A|$ down to $|A|$ and using a frequency array.
INDEPENDENT(A)
1: k = |A|
2: let count[1..k] be a new array initialized to 0
3: for each task x ∈ A
4: // Clamp deadlines greater than k down to k
5: idx = min(x.deadline, k)
6: count[idx] = count[idx] + 1
7:
8: running_sum = 0
9: for t = 1 to k
10: running_sum = running_sum + count[t]
11: if running_sum > t
12: return FALSE
13: return TRUE
Proof of Correctness and Complexity. The algorithm initializes an array of size $k = |A|$, which takes $O(|A|)$ time. The first loop iterates exactly $|A|$ times, tallying the clamped deadlines. The variable running_sum acts precisely as $N_t(A)$ because it cumulatively counts all tasks in $A$ with a deadline of $t$ or earlier. The second loop iterates $k$ times, checking if $N_t(A) > t$. If the condition is violated at any step, the set cannot be scheduled, and the algorithm correctly returns FALSE. If the loop completes, it guarantees $N_t(A) \le t$ for all $1 \le t \le |A|$. As previously established, this is mathematically sufficient to guarantee $N_t(A) \le t$ for all $t > |A|$.
Because every loop runs at most $|A|$ times, the overall time complexity of this independence check is $O(|A|)$. When integrated into the broader greedy algorithm, evaluating the independence of $n$ tasks progressively takes $O(1) + O(2) + \dots + O(n) = O(n^2)$ total time for the checks, but utilizing disjoint-set data structures can further optimize the global scheduling process to near-linear time.
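The INDEPENDENT pseudocode above maps directly to Python; a minimal sketch, assuming a task set is represented simply as a list of integer deadlines:

```python
def independent(deadlines):
    """Return True iff the tasks with these deadlines can all run early.

    Implements the O(|A|) check: clamp each deadline to k = |A|,
    tally deadlines in a frequency array, and verify the prefix sums
    satisfy N_t <= t for t = 1..k.
    """
    k = len(deadlines)
    count = [0] * (k + 1)            # count[t] = tasks with clamped deadline t
    for d in deadlines:
        count[min(d, k)] += 1        # deadlines > k impose no tighter constraint
    running_sum = 0                  # running_sum plays the role of N_t
    for t in range(1, k + 1):
        running_sum += count[t]
        if running_sum > t:
            return False
    return True
```

For instance, `independent([4, 2, 4, 3])` is `True`, while `independent([4, 2, 4, 3, 1])` is `False` because $N_4 = 5 > 4$.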
As noted earlier, minimising the sum of the penalties of the late tasks is equivalent to maximising the sum of the penalties of the early tasks. We can therefore use GREEDY to find an independent set $A$ of tasks of greatest total penalty.
To deploy the greedy algorithm, we must mathematically establish that the system of tasks and independent subsets forms a matroid.
Theorem (Matroid Property of Task Scheduling). Let $S$ be a finite set of unit-time tasks, each with an associated deadline. Let $\mathcal{I}$ be the collection of all independent sets of tasks, where a set is independent if there exists a schedule such that no task is late. Then the system $(S, \mathcal{I})$ is a matroid.
Proof. The set $S$ of tasks is finite, so condition (1) is out of the way, and we can focus our proof on conditions (2) and (3) of matroids. First, the hereditary property clearly holds: if $A \in \mathcal{I}$ and $B \subseteq A$, then $B \in \mathcal{I}$. If the tasks in $A$ can be scheduled without any task being late, then removing tasks from that schedule to form $B$ cannot cause any of the remaining tasks to become late. Thus, $B$ is also independent.
Second, we must prove the exchange property. Suppose $A, B \in \mathcal{I}$ and $|B| > |A|$. We must show there exists some task $x \in B \setminus A$ such that $A \cup \{x\} \in \mathcal{I}$. To find this task, we utilize the Counting Lemma, which states that a set $X$ is independent if and only if for all $t \in \{0, 1, \dots, n\}$, the number of tasks in $X$ with deadlines $\le t$ (denoted $N_t(X)$) satisfies $N_t(X) \le t$.
Let $k$ be the largest time step such that $N_k(B) \le N_k(A)$. Such a $k$ must exist because at $t=0$, $N_0(B) = N_0(A) = 0$. However, we know that at the final time step $n$, $N_n(B) = |B|$ and $N_n(A) = |A|$. Since $|B| > |A|$, it follows that $N_n(B) > N_n(A)$, which implies that $k < n$.
By our choice of $k$ as the largest such value, it must be true that for all $j$ in the range $k+1 \le j \le n$, the inequality reverses: $N_j(B) > N_j(A)$. Specifically, for $j = k+1$, $B$ contains more tasks with deadlines $\le k+1$ than $A$ does. Since $N_k(B) \le N_k(A)$, the "extra" tasks in $B$ must have deadlines exactly equal to $k+1$. Therefore, there must exist at least one task $x \in B \setminus A$ whose deadline $d_x$ is $k+1$.
Now, let $A' = A \cup \{x\}$. We check that $A'$ satisfies the Counting Lemma constraint $N_t(A') \le t$ for all $t$:
- For $0 \le t \le k$: the new task $x$ has deadline $k + 1 > t$, so $N_t(A') = N_t(A) \le t$, since $A$ is independent.
- For $k < t \le n$: here $N_t(A') = N_t(A) + 1$, and since $N_t(B) > N_t(A)$ implies $N_t(A) + 1 \le N_t(B)$, the independence of $B$ gives $N_t(A') \le N_t(B) \le t$.
Since the capacity constraint holds for all $t$, $A \cup \{x\}$ is independent. This satisfies the exchange property. $\blacksquare$
The problem is a weighted matroid, so GREEDY is mathematically guaranteed to find the optimal schedule. The algorithm sorts the $n$ tasks into monotonically decreasing order by their penalties $w_i$. It iterates through the tasks, attempting to add each one to the growing independent set $A$. It accepts the task if the resulting set $A$ still passes the $N_t(A) \le t$ capacity check, and permanently rejects it otherwise.
Using a straightforward array-based check for the $N_t(A) \le t$ condition, determining independence takes $O(n)$ time. Since the greedy algorithm tests all $n$ elements, the total running time for the independence checks is $O(n^2)$.
Consider a processor scheduling scenario involving seven unit-time tasks. The deadlines are $d = \{4, 2, 4, 3, 1, 4, 6\}$ and their corresponding penalties are $w = \{70, 60, 50, 40, 30, 20, 10\}$. The tasks are already conveniently listed in monotonically decreasing order of their penalty weights.
The greedy algorithm evaluates them in weight order:
- $a_1$ ($d_1 = 4$): accepted; a single task with deadline 4 is trivially schedulable.
- $a_2$ ($d_2 = 2$): accepted; the deadlines $\{4, 2\}$ satisfy $N_t \le t$ for all $t$.
- $a_3$ ($d_3 = 4$): accepted; $N_4 = 3 \le 4$.
- $a_4$ ($d_4 = 3$): accepted; $N_3 = 2$ and $N_4 = 4 \le 4$.
- $a_5$ ($d_5 = 1$): rejected; adding it would make $N_4 = 5 > 4$.
- $a_6$ ($d_6 = 4$): rejected; again $N_4 = 5 > 4$.
- $a_7$ ($d_7 = 6$): accepted; $N_4 = 4$ and $N_6 = 5 \le 6$.
The optimal early set is established. We arrange these early tasks into monotonically increasing deadline order to form the canonical optimal schedule: $\langle a_2, a_4, a_1, a_3, a_7 \rangle$. We then append the late tasks at the end, resulting in a final execution order of $\langle a_2, a_4, a_1, a_3, a_7, a_5, a_6 \rangle$. The total penalty incurred is simply the sum of the rejected tasks: $w_5 + w_6 = 30 + 20 = 50$, which is mathematically proven to be the absolute minimum penalty possible.
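The entire greedy scheduler fits in a short Python sketch (tasks represented here as parallel deadline and penalty lists, an assumption of this example), which reproduces the acceptance decisions above:

```python
def greedy_schedule(deadlines, penalties):
    """Greedily build a maximum-penalty independent (early) set of tasks.

    Tasks are processed in decreasing penalty order; a task is accepted
    iff the early set stays independent (N_t <= t for all t).
    Returns (sorted early task indices, total penalty of the late tasks).
    """
    def independent(ds):
        k = len(ds)
        count = [0] * (k + 1)
        for d in ds:
            count[min(d, k)] += 1
        running = 0
        for t in range(1, k + 1):
            running += count[t]
            if running > t:
                return False
        return True

    order = sorted(range(len(deadlines)), key=lambda i: -penalties[i])
    early = []
    for i in order:
        if independent([deadlines[j] for j in early] + [deadlines[i]]):
            early.append(i)
    late_penalty = sum(penalties[i] for i in range(len(penalties))
                       if i not in early)
    return sorted(early), late_penalty

# The seven-task example: d = {4,2,4,3,1,4,6}, w = {70,60,50,40,30,20,10}.
early, penalty = greedy_schedule([4, 2, 4, 3, 1, 4, 6],
                                 [70, 60, 50, 40, 30, 20, 10])
```

Here `early` comes out as the indices of $\{a_1, a_2, a_3, a_4, a_7\}$ and `penalty` as 50, matching the analysis above.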
To fully grasp the mechanics of the greedy scheduling algorithm on the task matroid, let us try it on another example. Consider a modified instance of the scheduling problem involving seven unit-time tasks. The deadlines remain $d = \{4, 2, 4, 3, 1, 4, 6\}$, but we apply a transformation to the penalties. Let the new penalty for each task be $w'_i = 80 - w_i$, where $w_i$ represents the original penalty. This yields the following modified set of penalties: $w' = \{10, 20, 30, 40, 50, 60, 70\}$.
The generic greedy algorithm on a weighted matroid demands that we first sort the elements in monotonically decreasing order by their weight. Sorting the tasks by $w'_i$ yields the evaluation sequence $\langle a_7, a_6, a_5, a_4, a_3, a_2, a_1 \rangle$, with penalties $70, 60, 50, 40, 30, 20, 10$ respectively.
We start with an empty set of early tasks, $A = \emptyset$, and iteratively attempt to add each task, accepting it if and only if the capacity constraint $N_t(A) \le t$ holds for all $t$:
- $a_7$ ($d_7 = 6$): accepted.
- $a_6$ ($d_6 = 4$): accepted.
- $a_5$ ($d_5 = 1$): accepted; $N_1 = 1 \le 1$.
- $a_4$ ($d_4 = 3$): accepted; $N_1 = 1$, $N_3 = 2$, $N_4 = 3$.
- $a_3$ ($d_3 = 4$): accepted; $N_4 = 4 \le 4$.
- $a_2$ ($d_2 = 2$): rejected; adding it would make $N_4 = 5 > 4$.
- $a_1$ ($d_1 = 4$): rejected; again $N_4 = 5 > 4$.
The optimal independent set of early tasks is $A = \{a_7, a_6, a_5, a_4, a_3\}$. To construct the canonical schedule, we sort $A$ by monotonically increasing deadlines ($d_5=1, d_4=3, d_6=4, d_3=4, d_7=6$) and append the rejected late tasks ($a_1, a_2$) at the end. The final optimal schedule is $\langle a_5, a_4, a_6, a_3, a_7, a_1, a_2 \rangle$. The minimum total penalty incurred is the sum of the penalties of the late tasks: $w'_1 + w'_2 = 10 + 20 = 30$.
The problem of making change for a given value of $n$ cents using the fewest number of coins is a classic algorithmic challenge. While the greedy approach—taking as many of the largest denomination coins as possible, then moving to the next largest, and so forth—feels intuitive, its optimality depends entirely on the specific set of coin denominations available.
Consider the standard US coin denominations: quarters (25¢), dimes (10¢), nickels (5¢), and pennies (1¢). The greedy algorithm operates by repeatedly selecting the largest coin value that does not exceed the remaining amount to be changed. For this specific set of denominations, the greedy algorithm always yields the optimal (minimum) number of coins.
To prove this, we must show that any optimal solution must match the greedy choice. Let an optimal solution for an amount of $n$ cents consist of $q$ quarters, $d$ dimes, $k$ nickels, and $p$ pennies. We can establish bounds on these quantities based on simple exchange arguments:
- $p \le 4$, since five pennies could be exchanged for one nickel, reducing the coin count;
- $k \le 1$, since two nickels could be exchanged for one dime;
- $d \le 2$, since three dimes could be exchanged for a quarter and a nickel;
- $d = 2$ and $k = 1$ cannot hold simultaneously, since two dimes and a nickel could be exchanged for a single quarter.
Because of these bounds, the maximum value that can be constructed without using a quarter is $2 \times 10 + 0 \times 5 + 4 \times 1 = 24$ cents. Consequently, to make change for any amount $n \ge 25$, an optimal solution must include at least one quarter. By induction, this logic applies down the chain of denominations: any amount between 10¢ and 24¢ must use a dime, and amounts between 5¢ and 9¢ must use a nickel. This perfectly mirrors the greedy strategy, proving its optimality for these denominations.
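For reference, the cashier's greedy loop is only a few lines of Python (a sketch; denominations are assumed sorted in decreasing order):

```python
def greedy_change(n, denominations):
    """Repeatedly take as many of the largest coin as fits.
    denominations must be sorted in decreasing order."""
    counts = []
    for d in denominations:
        counts.append(n // d)    # how many of this coin we take
        n %= d                   # remaining amount to change
    return counts                # one count per denomination

# 67 cents with US coins: 2 quarters, 1 dime, 1 nickel, 2 pennies.
us = [25, 10, 5, 1]
```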
The greedy algorithm also optimally solves the coin change problem when the denominations are powers of some base $c > 1$. Let the available coins be $c^0, c^1, c^2, \dots, c^k$.
In an optimal solution, the number of coins of any denomination $c^i$ (where $i < k$) must be strictly less than $c$. If we had $c$ coins of denomination $c^i$, their total combined value would be $c \cdot c^i = c^{i+1}$. We could simply replace these $c$ coins with a single coin of denomination $c^{i+1}$, strictly reducing the total number of coins used.
The maximum value we can accumulate using exclusively coins of denomination $c^{i-1}$ or smaller, without hitting the threshold of $c$ coins for any single denomination, is given by the geometric series:
$$\sum_{j=0}^{i-1} (c-1)c^j = (c-1) \frac{c^i - 1}{c - 1} = c^i - 1$$
This reveals a crucial threshold: using all available smaller denominations to their maximum legal limit yields a total value of exactly $c^i - 1$. Therefore, to form any value $n \ge c^i$, it is mathematically impossible to do so using only coins smaller than $c^i$; the solution must utilize a $c^i$ coin. This forces the optimal solution to pick the largest possible coin, exactly as the greedy algorithm dictates.
This greedy strategy is not universally optimal for arbitrary coin systems. Consider a set of denominations $\{1, 3, 4\}$ and a target amount of $n = 6$.
In this example, the greedy algorithm would first select the largest coin, 4, leaving a remainder of 2. It would then fill the remainder with two 1-cent coins. The greedy solution is $4 + 1 + 1$, utilizing 3 coins. However, the optimal solution uses just two 3-cent coins ($3 + 3 = 6$), utilizing only 2 coins. The greedy choice of taking the heavily weighted '4' paints the algorithm into a corner, forcing it to use suboptimal smaller coins.
To compute the optimal change for any arbitrary set of $k$ denominations $D = \{d_1, d_2, \dots, d_k\}$ (assuming $1 \in D$ to guarantee a solution), we abandon the greedy approach and use dynamic programming. We define $C[j]$ as the minimum number of coins required to make change for exactly $j$ cents.
To find $C[j]$, we consider the last coin added to the optimal set. If that last coin has denomination $d_i$, then the remaining amount is $j - d_i$, and the total coins used is $1 + C[j - d_i]$. We iterate over all available denominations to find the choice that minimizes this count.
MAKE-CHANGE-DP(n, D)
1: Let C[0..n] be a new array
2: C[0] = 0
3: for j = 1 to n
4: C[j] = ∞
5: for i = 1 to k
6: if D[i] ≤ j and 1 + C[j - D[i]] < C[j]
7: C[j] = 1 + C[j - D[i]]
8: return C[n]
The outer loop runs $n$ times, and the inner loop evaluates $k$ denominations. Thus, the algorithm operates in $O(nk)$ time, successfully resolving the optimal coin count for any denomination set.
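A runnable Python version of MAKE-CHANGE-DP, using the $\{1, 3, 4\}$ counterexample above as a sanity check:

```python
def make_change_dp(n, denoms):
    """C[j] = fewest coins summing to exactly j cents.
    Assumes 1 is in denoms so every amount is reachable."""
    INF = float('inf')
    C = [0] + [INF] * n
    for j in range(1, n + 1):
        for d in denoms:                        # try each possible last coin
            if d <= j and 1 + C[j - d] < C[j]:
                C[j] = 1 + C[j - d]
    return C[n]
```

Calling `make_change_dp(6, [1, 3, 4])` returns 2 (two 3-cent coins), whereas the greedy $4 + 1 + 1$ uses 3.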
Suppose we have a single processor and a set of $n$ tasks $S = \{a_1, a_2, \dots, a_n\}$, where each task $a_i$ requires a known processing time $p_i$. Our goal is to order these tasks to minimize the average completion time, $\frac{1}{n} \sum_{i=1}^n c_i$, where $c_i$ is the time task $a_i$ finishes.
If all tasks are available immediately and must run to completion without interruption (non-preemptively), the optimal strategy is Shortest Processing Time First (SPT). The algorithm simply sorts the tasks in ascending order of their processing times $p_i$ and executes them in that order.
SPT-SCHEDULE(S)
1: n = S.length
2: Sort the tasks in S in monotonically increasing order of their processing time p
3: current_time = 0
4: total_completion_time = 0
5: for i = 1 to n do
6: current_time = current_time + S[i].p
7: S[i].c = current_time
8: total_completion_time = total_completion_time + S[i].c
9: average_completion_time = total_completion_time / n
10: return S, average_completion_time
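The SPT-SCHEDULE pseudocode above condenses to a few lines of Python (a sketch; tasks are represented just by their processing times):

```python
def spt_schedule(processing_times):
    """Shortest Processing Time first.
    Returns (execution order as task indices, average completion time)."""
    order = sorted(range(len(processing_times)),
                   key=lambda i: processing_times[i])
    current_time = 0
    total_completion = 0
    for i in order:
        current_time += processing_times[i]   # task i finishes here
        total_completion += current_time
    return order, total_completion / len(processing_times)
```

For example, `spt_schedule([3, 1, 2])` runs the tasks in order 1, 2, 0 with completion times 1, 3, 6, for an average of $10/3$.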
To rigorously prove why SPT minimizes the average completion time, consider the structure of the sum of completion times. If we schedule the tasks in the order $1, 2, \dots, n$, the completion time of the $i$-th task is the sum of its own processing time and the processing times of all preceding tasks: $c_i = \sum_{j=1}^i p_j$.
The sum of all completion times expands to:
$$\sum_{i=1}^n c_i = p_1 + (p_1 + p_2) + (p_1 + p_2 + p_3) + \dots + (p_1 + p_2 + \dots + p_n)$$
$$\sum_{i=1}^n c_i = \sum_{i=1}^n (n - i + 1) p_i$$
In this weighted sum, the processing time of the task scheduled first ($p_1$) is multiplied by $n$, the second by $n-1$, and so on. To minimize this sum, we must pair the largest multipliers with the smallest processing times. Therefore, $p_1 \le p_2 \le \dots \le p_n$ minimizes the total sum, and consequently, the average. If any task with a larger processing time were placed before a task with a smaller processing time, an exchange argument demonstrates that swapping them would strictly decrease the total sum. Since sorting is the bottleneck, this algorithm runs in $O(n \log n)$ time.
The problem becomes highly dynamic if tasks are not immediately available but instead arrive at specific release times $r_i$, and we are allowed to suspend (preempt) a running task to execute another.
The optimal algorithm for this scenario is Shortest Remaining Processing Time First (SRPT). At any given instant, the processor should execute the available task that requires the least amount of time to finish. If a new task arrives with a processing time smaller than the remaining processing time of the currently executing task, the processor immediately preempts the current task and switches to the new one.
The optimality of SRPT is rooted in a continuous application of the exchange argument. Whenever the processor runs a task $A$ while another available task $B$ has a shorter remaining time, the schedule is suboptimal. If we were to swap a small slice of $A$'s execution time with $B$'s execution time, task $B$ would complete earlier, significantly dropping its completion time $c_B$. While $A$ is delayed, pushing $c_A$ back, $B$ finishes so much earlier that the net sum of completion times decreases.
To implement this efficiently, we maintain a Min-Priority Queue (such as a binary min-heap) of all currently released tasks, ordered by their remaining processing time.
SRPT-SCHEDULE(T)
1: // T is an array of n tasks, where each task has:
2: // .release_time (r_i)
3: // .remaining_time (initially p_i)
4: // .completion_time (c_i, to be computed)
5:
6: Sort T in ascending order by release_time
7: Q = new Min-Priority Queue keyed by task.remaining_time
8: current_time = 0
9: i = 1 // Index to track which tasks have been released
10: n = T.length
11:
12: while i <= n or Q is not empty
13: // If no tasks are ready, advance time to the next task's release
14: if Q is empty and current_time < T[i].release_time
15: current_time = T[i].release_time
16:
17: // Release tasks: Insert all tasks arriving at or before current_time into Q
18: while i <= n and T[i].release_time <= current_time
19: INSERT(Q, T[i])
20: i = i + 1
21:
22: // Execute the task with the minimum remaining processing time
23: current_task = EXTRACT-MIN(Q)
24:
25: // Determine how long we can run it before the next task arrives
26: if i <= n
27: time_to_next_release = T[i].release_time - current_time
28: else
29: time_to_next_release = ∞
30:
31: // Check if the current task finishes before the next release
32: if current_task.remaining_time <= time_to_next_release
33: // Event: Task completes
34: current_time = current_time + current_task.remaining_time
35: current_task.completion_time = current_time
36: // (Task is not re-inserted into Q)
37: else
38: // Event: Task is preempted by a new release
39: current_task.remaining_time = current_task.remaining_time - time_to_next_release
40: current_time = current_time + time_to_next_release
41: INSERT(Q, current_task) // Push the unfinished task back into the queue
The algorithm begins its initialization phase between lines 6 and 10 by sorting the array of tasks according to their release times, ensuring they can be evaluated chronologically. It also establishes an empty min-priority queue, denoted as $Q$, which is specifically designed to automatically maintain the task with the smallest remaining processing time at its root. As the scheduling progresses, there may be periods where the processor has no available work. If the priority queue is empty but unreleased tasks still exist in the timeline (lines 14 and 15), the system avoids simulating empty cycles by efficiently fast-forwarding the current time directly to the exact moment the next scheduled task drops.
During the task release phase, handled between lines 18 and 20, any task whose release time has been reached is pushed into the priority queue. Because $Q$ is structured as a min-heap keyed by the remaining processing time, inserting a newly released task takes $O(\log n)$ time. The heap property is restored upon every insertion, ensuring that the task with the shortest remaining time is always accessible at the root.
The core execution and preemption logic governs the remainder of the loop from lines 23 through 41. The processor extracts the highest-priority task from $Q$ and immediately compares its remaining required processing time against the time window available before the very next task release. If the current task is small enough to finish entirely before any new task arrives (evaluated around line 32), the algorithm advances the global clock by the task's duration, permanently logs its final completion time, and allows it to cleanly exit the system. Conversely, if a new task is scheduled to arrive before the current one can finish (handled at line 37), a preemption event occurs. The system simulates running the current task right up until the exact moment of the new release, subtracts this executed duration from the task's remaining processing time, and tosses the unfinished task back into $Q$. Once back in the queue, it is immediately weighed against the newly released task, perfectly preserving the Shortest Remaining Processing Time policy without missing a beat.
Since each of the $n$ tasks is inserted into the priority queue once and extracted once, and preemption operations simply involve re-inserting the interrupted task, the priority queue operations dictate the time complexity. Managing the heap takes $O(\log n)$ per event. With at most $2n$ events (release and completion for each task), the entire SRPT scheduling algorithm executes in $O(n \log n)$ time.
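The SRPT-SCHEDULE pseudocode translates almost line-for-line into Python using `heapq` as the min-priority queue (a sketch; tasks are assumed to be `(release_time, processing_time)` pairs, with heap ties broken by task index):

```python
import heapq

def srpt_schedule(tasks):
    """Preemptive Shortest Remaining Processing Time scheduling.
    tasks: list of (release_time, processing_time) pairs.
    Returns completion times, indexed like the input."""
    n = len(tasks)
    by_release = sorted(range(n), key=lambda i: tasks[i][0])
    completion = [0] * n
    heap = []                                  # entries: (remaining_time, index)
    t, i = 0, 0
    while i < n or heap:
        if not heap and t < tasks[by_release[i]][0]:
            t = tasks[by_release[i]][0]        # fast-forward to next release
        while i < n and tasks[by_release[i]][0] <= t:
            idx = by_release[i]                # release all tasks due by time t
            heapq.heappush(heap, (tasks[idx][1], idx))
            i += 1
        remaining, idx = heapq.heappop(heap)   # shortest remaining time wins
        next_release = tasks[by_release[i]][0] if i < n else float('inf')
        if remaining <= next_release - t:
            t += remaining                     # event: task completes
            completion[idx] = t
        else:                                  # event: preempted by a release
            heapq.heappush(heap, (remaining - (next_release - t), idx))
            t = next_release
    return completion
```

With tasks `[(0, 3), (1, 1)]`, the long task is preempted at time 1; the short one finishes at time 2 and the long one at time 4, for a completion-time sum of 6 (any non-preemptive schedule gets 7 at best).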
The relationship between graph theory and linear algebra provides one of the most elegant proofs in computer science: demonstrating that acyclic edge sets form a matroid by mapping them to linearly independent vectors in a matrix. This connection is formalized using the incidence matrix of a graph.
Consider an undirected graph $G = (V, E)$ and its incidence matrix $M$. This matrix has dimensions $|V| \times |E|$, where the rows represent vertices and the columns represent edges. The entry $M_{v,e} = 1$ if edge $e$ is incident on vertex $v$, and $M_{v,e} = 0$ otherwise. We analyze this matrix over the field of integers modulo 2 (often denoted as GF(2)), where $1 + 1 = 0$.
Proof. To establish that a set of columns in $M$ is linearly independent over GF(2) if and only if the corresponding set of edges is acyclic, we must prove both directions of the implication.
First, suppose the set of edges contains a cycle. If we isolate the columns of $M$ that correspond exclusively to the edges in this cycle and sum them together, we evaluate the degree of each vertex within the cycle subgraph. By definition, every vertex in a cycle has exactly two incident edges from that cycle. Therefore, in the row for any vertex involved in the cycle, the sum of the entries will be $1 + 1 = 2$. Modulo 2, this is exactly $0$. Vertices not in the cycle will sum to $0$. The linear combination of these columns yields the zero vector, proving that if a cycle exists, the columns are linearly dependent.
Conversely, suppose the set of edges is acyclic (forming a forest). We want to show the corresponding columns are linearly independent. Any forest with at least one edge must contain at least one leaf (a vertex of degree 1). In the matrix, the row corresponding to this leaf vertex will contain exactly a single $1$ across all columns in our subset. Any linear combination that sums to the zero vector must therefore assign a coefficient of $0$ to the column representing the leaf's incident edge. If we remove this edge from the forest, we are left with a smaller forest, which again must have a leaf. By induction, every column must have a coefficient of $0$ to satisfy the zero vector equation. Thus, an acyclic set of edges corresponds to linearly independent columns.
The columns of any matrix over a field inherently form a matric matroid, and we have established a perfect one-to-one correspondence between linearly independent columns in $M$ and acyclic edge sets in $G$. Thus, it rigorously follows that the acyclic subsets of an undirected graph form a matroid. This serves as a pure algebraic proof of the Graphic Matroid theorem. $\quad \blacksquare$
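We can check this correspondence computationally. The sketch below (the helpers `gf2_independent` and `edge_column` are hypothetical, not from the text) encodes each incidence-matrix column as a bitmask of its two endpoints and tests linear independence over GF(2) with the standard XOR-basis reduction:

```python
def gf2_independent(columns):
    """True iff the given GF(2) vectors (integer bitmasks) are
    linearly independent. Maintains a basis keyed by leading bit."""
    basis = {}
    for v in columns:
        while v:
            high = v.bit_length() - 1
            if high not in basis:
                basis[high] = v          # v extends the basis: still independent
                break
            v ^= basis[high]             # eliminate v's leading bit
        else:
            return False                 # v reduced to 0: dependent
    return True

def edge_column(u, v):
    """Incidence-matrix column of undirected edge {u, v} over GF(2)."""
    return (1 << u) | (1 << v)
```

The triangle on vertices 0, 1, 2 is a cycle, and indeed its three columns are dependent; dropping any one edge leaves a forest, whose columns are independent.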
Suppose we are given that the acyclic edge sets of an undirected graph form a matroid and that a nonnegative weight $w(e)$ is associated with each edge. We can effortlessly find an acyclic subset of maximum total weight by deploying the generic greedy algorithm. This is functionally equivalent to Kruskal's algorithm, adapted to find a Maximum Spanning Forest.
MAXIMUM-SPANNING-FOREST(G, w)
1: A = ∅
2: Sort the edges of G.E in monotonically decreasing order of weight w
3: for each vertex v ∈ G.V do
4: MAKE-SET(v)
5: for each edge e = (u, v) ∈ G.E taken in decreasing order do
6: if FIND-SET(u) ≠ FIND-SET(v)
7: A = A ∪ {e}
8: UNION(u, v)
9: return A
MAXIMUM-SPANNING-FOREST leverages disjoint-set data structures with path compression and union by rank; checking for cycle creation (independence) takes near-constant time. The sorting step dominates the execution, yielding an overall time complexity of $O(E \log V)$.
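A Python sketch of MAXIMUM-SPANNING-FOREST, with a pared-down disjoint-set forest (path halving only, which suffices for this illustration):

```python
def maximum_spanning_forest(num_vertices, edges):
    """edges: list of (weight, u, v) triples.
    Returns (total weight, list of chosen edges)."""
    parent = list(range(num_vertices))

    def find(x):                        # FIND-SET with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, forest = 0, []
    for w, u, v in sorted(edges, reverse=True):   # decreasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                    # adding (u, v) keeps the set acyclic
            parent[ru] = rv             # UNION
            forest.append((u, v))
            total += w
    return total, forest
```

On a triangle with edge weights 3, 2, 1 this keeps the two heaviest edges, for a total weight of 5.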
The matroid framework behaves very differently when we shift our focus to directed graphs and redefine independence to mean "contains no directed cycles." While this family of sets is hereditary (removing edges from a Directed Acyclic Graph leaves a DAG), it completely fails the exchange property.
Consider a simple directed graph with three vertices: 1, 2, and 3. Let the independent family $\mathcal{I}$ consist of all edge subsets lacking directed cycles. Let $A = \{(1,2), (2,3)\}$. This set has size 2 and forms a directed path, so $A \in \mathcal{I}$. Let $B = \{(2,1), (3,2), (3,1)\}$. This set has size 3. The only paths are $3 \to 2 \to 1$ and $3 \to 1$. There are no directed cycles, so $B \in \mathcal{I}$.
According to the matroid exchange property, because $|B| > |A|$, there must exist some edge in $B \setminus A$ that we can add to $A$ without creating a directed cycle. Let us test every available edge in $B$:
- Adding $(2,1)$ to $A$ creates the directed cycle $1 \to 2 \to 1$.
- Adding $(3,2)$ to $A$ creates the directed cycle $2 \to 3 \to 2$.
- Adding $(3,1)$ to $A$ creates the directed cycle $1 \to 2 \to 3 \to 1$.
Every single extension fails. It is impossible to augment $A$ with an edge from $B$ while preserving independence. Therefore, the system of directed acyclic subgraphs is absolutely not a matroid.
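This counterexample is easy to verify mechanically with a standard DFS cycle check (a small sketch, not from the lecture):

```python
def has_directed_cycle(edges):
    """DFS with white/gray/black coloring; True iff the edge set
    contains a directed cycle."""
    adj, color = {}, {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, [])
        color[u] = color[v] = 0          # 0 = unvisited (white)

    def dfs(u):
        color[u] = 1                     # 1 = on the current DFS path (gray)
        for w in adj[u]:
            if color[w] == 1 or (color[w] == 0 and dfs(w)):
                return True              # back edge found: directed cycle
        color[u] = 2                     # 2 = finished (black)
        return False

    return any(color[v] == 0 and dfs(v) for v in list(color))

A = [(1, 2), (2, 3)]
B = [(2, 1), (3, 2), (3, 1)]
```

Both $A$ and $B$ are acyclic, yet every single-edge extension of $A$ by an edge of $B$ creates a directed cycle.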
We can construct an incidence matrix for a directed graph (without self-loops) over the real numbers. The matrix $M$ has dimensions $|V| \times |E|$, where $M_{v,e} = -1$ if edge $e$ leaves vertex $v$, $M_{v,e} = 1$ if edge $e$ enters vertex $v$, and $0$ otherwise.
If a subset of edges forms a directed cycle, summing their corresponding columns yields the zero vector. This happens because every vertex in the directed cycle has exactly one edge entering it ($+1$) and exactly one edge leaving it ($-1$), causing the row sums to cancel out to $0$. Consequently, if a set of columns in this directed incidence matrix is linearly independent, the corresponding set of edges cannot possibly contain a directed cycle.
This creates an apparent paradox: linear algebraic independence always defines a matroid (as established in matric matroids), yet we just proved that directed acyclic subgraphs do not form a matroid. How can both be true?
The resolution lies in understanding that the converse of the matrix property carries a stricter condition than just "no directed cycles." If the matrix columns are linearly independent, there are no directed cycles. However, the absence of directed cycles is not sufficient to guarantee linear independence in this matrix.
Suppose the edges form an undirected cycle—for instance, edges $(1,2)$, $(1,3)$, and $(2,3)$. This is a DAG, but we can still form the zero vector by manipulating the signs: column $(1,2)$ minus column $(1,3)$ plus column $(2,3)$ equals the zero vector. By multiplying columns by either $+1$ or $-1$ depending on whether the directed edge aligns with or opposes an arbitrary cyclic traversal, we can force the rows to cancel to zero for any underlying undirected cycle.
Therefore, a set of columns in the directed incidence matrix is linearly independent if and only if the corresponding edges contain no cycles of any kind (neither directed nor undirected). The matroid defined by this matrix corresponds to the standard acyclic forests of the undirected skeleton of the graph, which we already know is a valid graphic matroid. There is no contradiction because the independent family defined by the matrix is much stricter than the family of "directed acyclic subgraphs."
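The signed cancellation on an undirected cycle is easy to see numerically. Writing the columns of the directed incidence matrix for edges $(1,2)$, $(1,3)$, $(2,3)$ as plain lists (rows indexed by vertices 1, 2, 3; the helper `column` is hypothetical):

```python
def column(u, v, num_vertices=3):
    """Directed incidence column: -1 at the tail u, +1 at the head v."""
    col = [0] * num_vertices
    col[u - 1] = -1
    col[v - 1] = +1
    return col

c12, c13, c23 = column(1, 2), column(1, 3), column(2, 3)

# Traverse the underlying undirected cycle 1-2-3-1: edge (1,2) is taken
# forward (+1), (1,3) backward (-1), (2,3) forward (+1); every row cancels.
combo = [a - b + c for a, b, c in zip(c12, c13, c23)]
```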
The unit-time task scheduling problem from earlier relies on finding a maximum-weight independent set of tasks. The standard greedy algorithm accepts or rejects tasks based on a capacity check, and only after determining the optimal set does it construct the final schedule. An alternative, more direct approach constructs the schedule iteratively on the fly.
Suppose we have $n$ initially empty time slots, where slot $i$ represents the unit of time finishing at time $i$. We process the tasks in monotonically decreasing order of their penalty weights. For each task $a_j$ with deadline $d_j$, we attempt to assign it to the latest available time slot that is less than or equal to $d_j$. If such a slot exists, the task is scheduled early and incurs no penalty. If no such slot exists (meaning all slots prior to $d_j$ are filled), the task is doomed to be late, and we simply assign it to the latest of the remaining unfilled slots in the entire schedule to keep it out of the way.
The logic of this strategy is captured in the following straightforward procedural logic:
LATEST-SLOT-SCHEDULE(S)
1: Sort the tasks in S in monotonically decreasing order of penalty w
2: Initialize an array slot[1..n] to EMPTY
3: for each task a_j ∈ S, taken in sorted order do
4: Find the largest integer i ≤ d_j such that slot[i] == EMPTY
5: if such an i exists
6: slot[i] = a_j // Task is scheduled early
7: else
8: Find the largest integer k ≤ n such that slot[k] == EMPTY
9: slot[k] = a_j // Task is scheduled late
10: return slot
Proof of Optimality. To establish that this alternative algorithm always produces an optimal answer, we must prove that it selects the exact same independent set of early tasks as the standard matroid greedy algorithm. The standard algorithm evaluates a task $a_j$ and adds it to the early set $A$ if and only if the capacity constraint $N_t(A) \le t$ is maintained for all time steps $t$.
Our alternative algorithm attempts to place task $a_j$ in the latest possible valid slot $i \le d_j$. By placing the task as late as legally possible, the algorithm strategically preserves the maximum number of empty slots prior to time $i$. Reserving the earliest slots is critical because subsequent tasks may have very tight deadlines (e.g., $d = 1$ or $d = 2$). If we arbitrarily placed $a_j$ in slot 1 when its deadline was 4, we might unnecessarily block a future task that must be scheduled in slot 1.
If the algorithm successfully finds an empty slot $i \le d_j$, it means scheduling $a_j$ does not violate the capacity constraint $N_t \le t$ for any $t$. The task is legally independent and is accepted. If no such slot exists, it means slots $1$ through $d_j$ are completely filled with previously processed tasks. Because tasks are processed in decreasing order of weight, the tasks currently occupying those slots all possess a higher (or equal) penalty than $a_j$. Adding $a_j$ would force one of those heavier tasks to be late, violating independence. Therefore, the task $a_j$ is rightfully relegated to a late slot. Because this algorithm mirrors the exact acceptance and rejection logic of the matroid greedy algorithm, it is mathematically guaranteed to construct an optimal schedule. $\quad \blacksquare$
While the logic is sound, a naive implementation of finding the "latest available slot" requires scanning backward from $d_j$ step-by-step. In the worst case, this scanning takes $O(n)$ time per task, yielding an $O(n^2)$ overall time complexity. We can use the fast disjoint-set forest data structure (which utilizes path compression and union by rank) to dramatically accelerate this search to near-constant time.
We configure the disjoint-set forest to map contiguous blocks of filled time slots to their nearest available empty slot to the left. We initialize $n + 1$ sets, labeled $0$ through $n$. The set $0$ represents the "dummy" slot indicating that no time slots are available. Initially, every slot $i$ is empty, so each slot is its own set, and the "representative" (root) of set $i$ is simply $i$.
When we evaluate task $a_j$ with deadline $d_j$, we perform a FIND-SET(d_j). By design, this operation instantly returns the largest available empty slot $i \le d_j$.
If FIND-SET(d_j) returns $0$, all slots before $d_j$ are filled, and the task is late. If FIND-SET(d_j) returns $i > 0$, we assign task $a_j$ to slot $i$. Because slot $i$ is now filled, it can no longer be the available slot for future queries, so we must merge its set with the set immediately to its left: we perform UNION(i, i - 1), ensuring that the new root of this merged block correctly points to the representative of the left-hand set (the next available empty slot).

This transforms the scheduling logic into a sequence of highly optimized set operations:
DSU-SCHEDULE(S)
1: Sort the tasks in S in monotonically decreasing order of penalty w
2: Initialize array slot[1..n] to EMPTY
3: for i = 0 to n do
4: MAKE-SET(i)
5: root[i] = i // The root tracks the actual available slot index
6:
7: for each task a_j ∈ S, taken in sorted order do
8: available_slot = root[FIND-SET(d_j)]
9: if available_slot > 0
10: slot[available_slot] = a_j
11: // Merge the newly filled slot with the contiguous block to its left
12: UNION(available_slot, available_slot - 1)
13: // Ensure the new root points to the available slot of the left block
14: new_rep = FIND-SET(available_slot)
15: root[new_rep] = root[FIND-SET(available_slot - 1)]
16: else
17: // Task is late; handle via secondary disjoint-set or linear placement
18: Place a_j in the latest globally available slot
19: return slot
The initialization requires sorting the $n$ tasks, which consumes $O(n \log n)$ time. Creating the $n+1$ disjoint sets takes $O(n)$ time. For each of the $n$ tasks, we perform a constant number of FIND-SET and UNION operations. Using the standard heuristics of path compression and union by rank, a sequence of $m$ operations on a disjoint-set forest with $n$ elements executes in $O(m \alpha(n))$ time, where $\alpha(n)$ is the extremely slow-growing inverse Ackermann function.
Thus, the total time spent dynamically routing tasks to their correct slots is bounded by $O(n \alpha(n))$. Because $\alpha(n) \le 4$ for any practical input size, this phase is effectively linear. The entire algorithm's running time is therefore dominated strictly by the initial sorting step, resulting in a highly efficient overall time complexity of $O(n \log n)$.
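A runnable sketch of this idea follows. Rather than translating DSU-SCHEDULE line-for-line, this version (an assumption-laden simplification, not the canonical implementation) uses leftward parent pointers with path compression only: the representative of each merged block is then itself the block's empty slot, so the separate root[] array disappears. It also handles late tasks by querying FIND-SET(n), which by the same invariant yields the latest globally available slot. With compression alone the amortized cost is $O(\log n)$ per operation, still dominated by the initial sort.

```python
def dsu_schedule(tasks):
    """Disjoint-set version of the unit-time scheduler.
    tasks: list of (deadline, penalty); returns (slot, total_penalty)."""
    n = len(tasks)
    # parent[i] == i exactly when slot i is still empty; 0 is the dummy set.
    parent = list(range(n + 1))

    def find(x):
        # FIND-SET with path compression (halving): returns the latest
        # empty slot at or before x, or 0 if none remains.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    slot = [None] * (n + 1)
    total_penalty = 0
    for d, w in sorted(tasks, key=lambda t: -t[1]):
        i = find(min(d, n))
        if i == 0:                # slots 1..d_j are all full: the task is late
            total_penalty += w
            i = find(n)           # latest globally available slot
        slot[i] = (d, w)
        parent[i] = i - 1         # UNION with the contiguous block on the left
    return slot, total_penalty
```

On the same CLRS example the schedule and the total penalty of 50 agree with the naive backward-scanning version.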
The problem of caching involves maintaining a small, high-speed memory (the cache) to store a subset of data from a much larger, slower main memory. As a program executes, it generates a sequence of $n$ memory requests, $\langle r_1, r_2, \dots, r_n \rangle$. The cache has a strict capacity of $k$ elements. Upon requesting an element $r_i$, if $r_i$ is already in the cache, the system registers a cache hit. If $r_i$ is absent, a cache miss occurs. The system must fetch $r_i$ from main memory and, if the cache is already full (containing $k$ elements), it must evict exactly one element to make room for $r_i$. The primary objective of any cache-management algorithm is to strategically select which elements to evict to minimize the total number of cache misses over the entire sequence.
In real-world computer architecture, caching is an on-line problem; the system must make eviction decisions without knowing what requests are coming next. However, exploring the off-line caching problem—where the entire sequence of future requests is known in advance—is crucial. Finding the absolute minimum number of cache misses provides a theoretical lower bound, allowing us to evaluate how well practical on-line heuristics (like Least Recently Used, or LRU) perform compared to mathematical perfection.
The optimal strategy for off-line caching is a greedy approach known as furthest-in-future (also historically referred to as Belady's optimal algorithm). When forced to evict an element, this strategy scans the remaining sequence of future requests and evicts the cached item whose next access occurs furthest in the future (or an item that is never requested again).
To implement this algorithm efficiently, we must avoid scanning the entire remaining sequence every time a cache miss occurs. Instead, we can precompute the exact index of the "next use" for every request in a single backward pass. We can then maintain the cache using a Maximum Priority Queue (like a max-heap), where each element's priority key is the index of its next occurrence in the sequence. Elements that never appear again are assigned a priority of $\infty$.
FURTHEST-IN-FUTURE(r, k)
1: n = r.length
2: Let next_use[1..n] be a new array
3: Let last_seen be a dictionary/hash map mapping elements to their next index
4: // Precompute the next use index for every request
5: for i = n down to 1 do
6: if r[i] is in last_seen
7: next_use[i] = last_seen[r[i]]
8: else
9: next_use[i] = ∞
10: last_seen[r[i]] = i
11:
12: Let C be a Max-Priority Queue representing the cache, keyed by next_use
13: Let E be a sequence of eviction decisions
14: for i = 1 to n do
15: if r[i] is in C
16: // Cache hit: update the element's next use time
17: INCREASE-KEY(C, r[i], next_use[i])
18: Append NONE to E
19: else
20: // Cache miss
21: if C.size == k
22: // Cache is full, evict the item needed furthest in the future
23: evicted_item = EXTRACT-MAX(C)
24: Append evicted_item to E
25: else
26: Append NONE to E
27: INSERT(C, r[i] with key next_use[i])
28: return E
The precomputation phase (lines 5-10) scans the request sequence of length $n$ exactly once from right to left. Assuming $O(1)$ average time for dictionary lookups and insertions, this phase takes $O(n)$ time. During the main loop (lines 14-27), we process $n$ requests. For each request, we either perform an INCREASE-KEY operation (on a cache hit) or an EXTRACT-MAX and INSERT operation (on a cache miss). Because the priority queue never holds more than $k$ elements, each of these heap operations takes $O(\log k)$ time. Therefore, the total running time of this highly optimized furthest-in-future algorithm is $O(n \log k)$.
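The procedure above can be sketched in Python. One caveat: Python's heapq provides no native INCREASE-KEY, so this sketch substitutes the standard lazy-update idiom (push a fresh entry on every hit and discard stale entries at extraction time); the asymptotics are unchanged up to the stale entries. For brevity it returns only the miss count rather than the eviction sequence E, though recording evictions would be a one-line addition.

```python
import heapq

def furthest_in_future(requests, k):
    """Belady's furthest-in-future policy; returns the number of cache misses."""
    n = len(requests)
    INF = float('inf')
    # Precompute next_use[i]: index of the next occurrence of requests[i].
    next_use = [INF] * n
    last = {}
    for i in range(n - 1, -1, -1):
        if requests[i] in last:
            next_use[i] = last[requests[i]]
        last[requests[i]] = i

    cache = {}   # element -> its current next-use key
    heap = []    # max-heap via negated keys; may hold stale (lazy) entries
    misses = 0
    for i, x in enumerate(requests):
        if x in cache:
            cache[x] = next_use[i]                   # cache hit: refresh key
            heapq.heappush(heap, (-next_use[i], x))  # lazy "INCREASE-KEY"
        else:
            misses += 1
            if len(cache) == k:
                while True:                          # pop until a live entry surfaces
                    key, y = heapq.heappop(heap)
                    if y in cache and cache[y] == -key:
                        del cache[y]                 # evict furthest-in-future item
                        break
            cache[x] = next_use[i]
            heapq.heappush(heap, (-next_use[i], x))
    return misses
```

For example, on the request sequence a, b, c, b, a, d, a, b with $k = 2$, tracing the policy by hand gives 6 misses, which this sketch reproduces.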
To prove that a greedy strategy yields the optimal solution, we must first demonstrate that the problem exhibits optimal substructure. This means that an optimal sequence of decisions for the entire problem incorporates optimal decisions for the remaining subproblems.
Let $C_i$ represent the exact state of the cache (the specific $k$ elements it holds) immediately after processing the $i$-th request $r_i$. The remaining subproblem is to process the suffix of requests $\langle r_{i+1}, r_{i+2}, \dots, r_n \rangle$ starting from the initial cache state $C_i$, with the goal of minimizing misses.
Suppose an optimal global schedule $S^*$ processes the entire sequence $\langle r_1, \dots, r_n \rangle$ with a minimum total number of misses, $M^*$. After $i$ steps, $S^*$ leaves the cache in some state $C_i$ and accumulates $m_1$ misses. Let $m_2$ be the number of misses $S^*$ incurs on the remaining suffix, such that $M^* = m_1 + m_2$. If the subproblem (processing the suffix from state $C_i$) did not exhibit optimal substructure, there would exist an alternative valid sequence of eviction decisions $S'$ for the suffix that incurs $m'_2 < m_2$ misses. If such an $S'$ existed, we could graft $S'$ onto the first $i$ steps of $S^*$. The newly combined schedule would process the entire sequence with $m_1 + m'_2 < M^*$ misses. However, this strictly contradicts the premise that $S^*$ is globally optimal. Thus, the optimal global schedule must necessarily consist of optimal subproblem solutions, confirming the optimal substructure property.
Proof of Optimality: The Greedy-Choice Property. Having established optimal substructure, we must rigorously prove the greedy-choice property: evicting the element whose next access is furthest in the future never prevents us from achieving a global minimum of cache misses. We prove this using an exchange argument.
Let $S_G$ be the schedule produced by our furthest-in-future greedy algorithm, and let $S^*$ be an optimal schedule that minimizes total misses. If $S_G = S^*$, the greedy algorithm is optimal. Suppose they differ. Let $i$ be the very first step (request) where the eviction decisions of $S_G$ and $S^*$ diverge. At step $i$, a cache miss occurs for request $r_i$. The greedy algorithm $S_G$ evicts element $f$ (the one needed furthest in the future), while the optimal algorithm $S^*$ evicts element $e \neq f$. Note that because they agreed up to step $i$, their cache contents just before this eviction were identical.
We will construct a new schedule, $S^{**}$, which mimics $S^*$ completely, except at step $i$, it chooses to evict $f$ instead of $e$. We must show that $S^{**}$ is still a valid schedule and incurs no more misses than $S^*$.
Since $f$ is the element needed furthest in the future, we know definitively that the next request for $e$ (if it ever occurs) happens before the next request for $f$. Immediately after step $i$, the two caches are identical except for a single differing element: $S^{**}$ holds $e$ where $S^*$ holds $f$. Let's trace the execution of $S^{**}$ compared to $S^*$ from this point on. On a request for an element in both caches, both schedules hit and nothing changes. On a miss shared by both caches, $S^{**}$ copies $S^*$'s eviction, except that if $S^*$ evicts $f$, $S^{**}$ evicts its own differing element instead, making the caches identical. On the request for $e$, $S^{**}$ hits while $S^*$ misses; either $S^*$ evicts $f$ and the caches realign, or the caches continue to differ in a single element with $S^{**}$ having banked one saved miss. Finally, if $f$ is requested while the caches still differ, $S^{**}$ misses where $S^*$ hits, but it evicts its differing element to realign the caches, and this extra miss is exactly offset by the miss saved earlier on $e$.
In every possible scenario, the modified schedule $S^{**}$ realigns its cache to match $S^*$ perfectly before or at the moment $f$ is finally requested. Most importantly, $S^{**}$ never incurs more misses than $S^*$; in fact, if $S^*$ missed on $e$, $S^{**}$ actively saved a miss. Since $S^*$ was defined as optimal, $S^{**}$ must also be optimal, despite adopting the greedy choice at step $i$.
By repeatedly applying this exchange argument at every step where the greedy schedule diverges from the optimal schedule, we can progressively transform any optimal schedule into the furthest-in-future greedy schedule without ever increasing the total number of cache misses. Therefore, the furthest-in-future strategy unconditionally produces the minimum possible number of cache misses. $\blacksquare$
At the heart of many advanced algorithm design techniques lies the concept of self-reducibility. This mathematical property allows a problem to be expressed in terms of smaller or simpler instances of itself. Divide-and-Conquer, Dynamic Programming, and Greedy algorithms are three of the most popular classes of algorithms that inherently leverage self-reducibility to build global solutions from local evaluations. Beyond these foundational paradigms, self-reducibility is also the critical engine driving other sophisticated frameworks, such as the Local Ratio method. Within this ecosystem, Matroids occupy a highly specialized subset of the Greedy paradigm, relying on a strict Exchange Property to guarantee optimality.
Incidentally, on the topic of diagrams, we can illustrate the Greedy approach to algorithm design below.
The Local Ratio method provides a robust alternative to standard greedy approaches, particularly when dealing with optimization problems involving weights. In a Greedy approach, all that is done is actually quite simple: pick the best element, add it to a solution, and then reduce the problem. This paradigm can fail. Recall: a greedy approach makes decisions based on a static metric—like picking the item with the lowest cost or highest value-to-weight ratio. This is short-sighted, especially in complex problems (like finding a minimum weight Vertex Cover) where a node with a low cost might only cover one edge, while a slightly more expensive node covers fifty. A pure greedy algorithm gets trapped by these local illusions. The Local Ratio method is a bit more robust in that an element is not simply "picked," it is "paid for" by weights. The basic mathematical idea is surprisingly elegant. Consider an optimization problem where the goal is to evaluate $$\min_{x \in \Omega} c(x)$$ or $$\max_{x \in \Omega} c(x)$$ for an element $x.$ Instead of making a definitive "take it or leave it" decision based on the original weights, Local Ratio incrementally deconstructs the weights themselves. We isolate a local bottleneck, pay for it by splitting the weight function, and defer the final decision until the problem becomes trivial. We aren't breaking the graph into smaller pieces; we are breaking the cost function.
Like we said before, the Local Ratio technique is quite useful in the design and analysis of approximation algorithms for NP-hard optimisation problems. We will now begin defining more rigorously the notation and terms we will need. An optimisation problem comprises a family of problem instances, each associated with a set of feasible solutions and an objective (weight) function that assigns a value to every solution.
The minimum spanning tree (MST) problem is a classic example of an optimisation problem. There, we seek those solutions under the objective that the spanning tree is of, well, minimum total cost given a graph with edge weights. This problem is polynomial-time solvable, but there are others which abide by the same paradigm that are not. These problems are called NP-Hard problems, and for them, it suffices to find approximate solutions. A solution whose cost is within a factor of $r \geq 1$ of the optimum is said to be $r$-approximate. An $r$-approximation algorithm is guaranteed to return $r$-approximate solutions.
For a minimisation problem, a feasible solution $S$ is said to be $r$-approximate if $$w(S) \leq r \cdot w(S^*),$$ where $w(S)$ is the cost of $S$ and $S^*$ is the optimal solution. With some algebra, we can play around and get that a feasible solution is said to be $r$-approximate in the maximisation case if $$w(S) \geq w(S^*)/r.$$ In both cases, the closer $S$ is to $S^*$, the smaller the $r$-value. $r$ has been defined as the approximation factor, which gives rise to the approximation ratio $$\inf\{r \mid r \text{ is a performance guarantee of the algorithm}\}.$$
Historically, many approximation algorithms relied on localized payments—making a down payment on several items in each round and arguing that the optimal cost (OPT) must drop proportionally because every optimal solution involves some of those items. However, this weakness becomes apparent in problems where no single set of items is necessarily involved in every optimal solution (such as the Feedback Vertex Set problem). The Local Ratio Theorem elegantly bypasses this by focusing not on the items themselves, but on linearly decomposing the weight function.
Suppose the objective (cost or weight) function can be linearly decomposed into two constituent functions: $w(x) = w_1(x) + w_2(x)$.
The Local Ratio Theorem (Exact Case). Let $\Omega$ be a set of feasible solutions, and let $w, w_1, w_2$ be objective functions mapping $\Omega$ to the real numbers such that $w = w_1 + w_2$. If a feasible solution $x^* \in \Omega$ is an optimal solution for both objective functions $w_1$ and $w_2$ simultaneously, then $x^*$ is guaranteed to be an optimal solution for the original objective function $w$.
At its core, this theorem guarantees that optimization is additive. If you can find a solution that happens to be perfectly optimal for two separate, simpler cost landscapes ($w_1$ and $w_2$), you don't need to worry about the complex landscape of $w$. You are already at its global optimum. This gives us mathematical permission to solve a hard problem by slicing its weight function into easier, bite-sized pieces.
Proof. $x^*$ minimizes both sub-functions independently. This means that for any other feasible $x$, we know that $w_1(x) \ge w_1(x^*)$ and $w_2(x) \ge w_2(x^*)$. By summing these inequalities, we obtain $w(x) = w_1(x) + w_2(x) \ge w_1(x^*) + w_2(x^*) = w(x^*) \quad \blacksquare$.
While the algebra is trivially simple, the implication is massive. The proof shows that as long as we never make a sub-optimal choice in $w_1$ and $w_2$, our final combination is strictly protected. There are no "cross-terms" or hidden penalties when recombining the weights.
The true power of the theorem emerges when applied to approximations for NP-hard covering problems (like we said before). Using it, the theorem proves that approximation ratios are preserved under linear decomposition.
Theorem (Local Ratio—Minimization Problems). Let $\mathcal{F}$ be a set of feasibility constraints on vectors in $\mathbb{R}^n$. Let $w, w_1, w_2 \in \mathbb{R}^n$ be such that $w = w_1 + w_2$. Let $x \in \mathbb{R}^n$ be a feasible solution (with respect to $\mathcal{F}$) that is $r$-approximate with respect to $w_1$ and with respect to $w_2$. Then, $x$ is $r$-approximate with respect to $w$ as well.
Proof. Let $x^*, x_1^*$, and $x_2^*$ be the absolute optimal solutions with respect to $w, w_1$, and $w_2$, respectively. Because $x^*$ is a valid feasible solution overall, it cannot be better than the absolute optimums for the sub-functions. Therefore, $w_1 \cdot x_1^* \le w_1 \cdot x^*$ and $w_2 \cdot x_2^* \le w_2 \cdot x^*$. Thus,
$$w \cdot x = w_1 \cdot x + w_2 \cdot x \le r(w_1 \cdot x_1^*) + r(w_2 \cdot x_2^*) \le r(w_1 \cdot x^*) + r(w_2 \cdot x^*) = r(w \cdot x^*) \blacksquare$$
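To make the minimization theorem concrete with the Vertex Cover problem mentioned earlier, here is a minimal local-ratio sketch for 2-approximate weighted vertex cover (a standard application, presented here as an illustration rather than anything from the lecture slides). Each round's $w_1$ places weight $\varepsilon$ on the two endpoints of a single uncovered edge; every feasible cover must pay between $\varepsilon$ and $2\varepsilon$ under that slice, so every cover is automatically 2-approximate with respect to it, and the theorem lets the rounds compose.

```python
def vertex_cover_2approx(vertices, edges, weights):
    """Local-ratio 2-approximation for minimum-weight vertex cover.
    For each edge in turn, subtract eps = min endpoint weight from both ends;
    the vertices driven to weight zero form the cover."""
    w = dict(weights)
    for u, v in edges:
        eps = min(w[u], w[v])  # the w_1 slice for this edge
        w[u] -= eps
        w[v] -= eps
    # Every edge now has at least one zero-weight endpoint, so this is a cover.
    return {v for v in vertices if w[v] == 0}
```

On the path a–b–c–d with weights 4, 1, 3, 2, the sketch returns the cover {b, d} of weight 3, which happens to be optimal here; in general only the factor-2 guarantee holds.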
Algorithms based on the Local Ratio Theorem are typically recursive and follow a general structure: If a zero-cost solution can be found, return it. Otherwise, find a decomposition of $w$ into two weight functions $w_1$ and $w_2 = w - w_1$, and solve the problem recursively on $w_2$.
The true genius lies in how we deliberately engineer the function $w_1(x)$. We want to construct $w_1(x)$ such that it possesses a vastly large set of optimal (or $r$-approximate) solutions. To formalize this shared trait across all local ratio algorithms, we use the following definition:
Definition 1 (Fully r-effective). Given a set of constraints $\mathcal{F}$ on vectors in $\mathbb{R}^n$ and a number $r \ge 1$, a weight vector $w \in \mathbb{R}^n$ is said to be fully $r$-effective if there exists a number $b$ such that $b \le w \cdot x \le r \cdot b$ for all feasible solutions $x$.
If $w_1$ is fully $r$-effective, then every single valid solution in the entire problem space is automatically an $r$-approximation for $w_1$. When this is achieved, navigating the problem mathematically reduces to simply finding the optimum of the residual function, $\min_{x \in \Omega} w_2(x)$. Because $w_1$ is essentially "solved automatically," we ignore it and focus entirely on $w_2$. By repeating this successive weight deduction, we continuously shrink the numerical complexity until trivial (zero-cost) base cases emerge.
The classical Activity Selection problem asks us to evaluate a set of intervals $[s_1, f_1), [s_2, f_2), \dots, [s_n, f_n)$ and find a subset of nonoverlapping intervals to maximize the absolute cardinality (number of tasks scheduled). Two intervals are deemed nonoverlapping if their intersection is empty: $[s_i, f_i) \cap [s_j, f_j) = \emptyset.$
However, a theoretical puzzle arises: what if each interval is associated with a nonnegative weight, and we instead want to find a subset of nonoverlapping intervals that maximizes the total weight? Under these weighted conditions, the standard matroid exchange property does not hold, meaning a naive greedy approach will fail to find the optimal schedule.
The weighted scheduling problem forces us to schedule jobs on a single processor strictly with no preemption, and each job may be scheduled in one interval only. Mathematically, this is formulated as maximizing $\sum_{I} p(I) \cdot x_I$ subject to the binary constraint $x_I \in \{0,1\}$ for each instance $I$, and the capacity constraint $\sum_{I: s(I) \le t < f(I)} x_I \le 1$ for every slice of time $t$. This simply mandates that the processor handles at most one job simultaneously.
To solve this, we define Maximal Solutions. A feasible schedule is defined as $I$-maximal if it either currently contains instance $I$, or it does not contain $I$ but attempting to add $I$ to the schedule would render it infeasible due to overlaps.
We apply the Local Ratio technique by constructing an effective profit function. Let $\hat{I}$ be the interval in our set that ends first. We define the first profit decomposition $p_1(I)$ as equal to $p(\hat{I})$ if $I$ is in direct conflict with $\hat{I}$ (including $\hat{I}$ itself), and $0$ otherwise. The residual profit function is simply $p_2(I) = p(I) - p_1(I)$. Note that this subtraction means $p_2(I)$ can drop into negative values.
For every possible feasible solution $x$, the dot product $p_1 \cdot x \le p(\hat{I})$. Therefore, every $\hat{I}$-maximal solution automatically maximizes $p_1$, guaranteeing optimality for that slice of the objective function. By running this recursively, the problem reduces cleanly.
MAX-IS(S, p)
1: if S == ∅ then
2: return ∅
3: if ∃ I ∈ S such that p(I) ≤ 0 then
4: return MAX-IS(S - {I}, p)
5: Let Î ∈ S be the interval that ends first
6: for each I ∈ S do
7: p1(I) = p(Î) × (1 if I conflicts with Î, else 0)
8: IS = MAX-IS(S, p - p1)
9: if IS is Î-maximal then
10: return IS
11: else
12: return IS ∪ {Î}
To understand how the MAX-IS algorithm leverages the Local Ratio Theorem, we must first recognise the shift from minimisation to maximisation. The fundamental logic of the theorem remains identical: if we can decompose our profit function $p = p_1 + p_2$, and find a solution that is simultaneously optimal for both $p_1$ and $p_2$, it is guaranteed to be optimal for $p$.
In the unweighted Activity Selection problem, any maximal independent set (a schedule where no more jobs can fit) is just as good as another if they have the same size, which perfectly aligns with the Matroid Exchange Property. However, the introduction of weights breaks this. A single, highly profitable job might overlap with three low-profit jobs. A naive greedy algorithm (Matroid approach) cannot safely decide whether to take the one heavy job or the three light ones without looking ahead. Local Ratio bypasses this by decomposing the weights instead of agonizing over the intervals themselves.
Because we are maximizing profit, intervals with zero or negative profit actively harm or provide no value to our objective. Line 3 and Line 4 simply prune these out. This acts as a secondary base case alongside the empty set in Lines 1-2.
Recall that the Local Ratio is building a $w_1$ (or in this case, $p_1$) where almost every valid solution is optimal.
Line 5 selects $\hat{I}$, the interval that ends first. Line 6 and Line 7 define $p_1$ such that $\hat{I}$ and every interval that conflicts with it receives a profit equal to $p(\hat{I})$. All other intervals get a profit of $0$.
$\hat{I}$ is the interval that ends first, so any interval that conflicts with $\hat{I}$ must cross the exact moment in time when $\hat{I}$ ends. Because the processor can only handle one job at a time, it is physically impossible to schedule more than one of these conflicting intervals. Therefore, under the $p_1$ profit function, the absolute maximum profit any valid schedule can achieve is exactly $p(\hat{I})$. Any schedule that picks at least one of these intervals is mathematically optimal for $p_1$!
Once $p_1$ is defined, we subtract it from our total profit to find the residual: $p_2 = p - p_1$. We then trust the recursion in Line 8 to find an optimal solution for $p_2$. Note that because of the subtraction, some profits in $p_2$ might drop below zero—which is exactly why Lines 3-4 exist to clean them up in the next recursive frame.
When the recursion unwinds, we receive a schedule IS that is optimal for $p_2$. But to invoke the Local Ratio Theorem, IS must also be optimal for $p_1$.
To be optimal for $p_1$, the schedule IS must contain exactly one interval from our conflict zone (earning the $p(\hat{I})$ profit). This is what it means for the solution to be $\hat{I}$-maximal.
Line 9 checks if IS is already $\hat{I}$-maximal (i.e., it already includes an interval that conflicts with $\hat{I}$). If it does, we are safe! It is optimal for $p_1$ and $p_2$, so we just return IS (Line 10). Otherwise, IS missed out on the $p_1$ profit entirely. Because no intervals in IS conflict with $\hat{I}$, we can safely force it to be optimal by simply adding $\hat{I}$ to the schedule in Line 12 (return IS ∪ {Î}).

By guaranteeing the solution is maximal with respect to $\hat{I}$ at every step of the unwinding recursion, we ensure the solution remains perfectly optimal for every single $p_1$ slice we carved out. The Local Ratio Theorem then mathematically cements that the final assembled schedule is optimal for the original profit function $p$.
The concept of spanning trees is extended to directed graphs through an Arborescence. An arborescence with a designated root $r$ is a subgraph $T=(V,F)$ of a directed graph $G=(V,E)$ that completely lacks any pair of opposite edges (e.g., $(u,v)$ and $(v,u)$). Furthermore, if the directions of the edges are entirely ignored, $T$ forms a standard undirected spanning tree, and there must exist a valid directed path from the root $r$ to every other vertex $v \in V$.
The problem is defined as: Given a directed graph $G=(V,E)$ populated with nonnegative edge costs $c: E \to \mathbb{R}^+$ and a starting node $r \in V$, find an arborescence with the absolute minimum total cost.
The optimal algorithm requires observing three key properties:
Key Point 1: Let $\delta_{in}(v)$ represent the set of all incoming edges directed at node $v$. For every node $v \neq r$, greedily select the single cheapest edge $e_v$ from $\delta_{in}(v)$. If the collection of these edges $F^* = \{e_v \mid v \in V - \{r\}\}$ successfully forms an arborescence, then $F^*$ is mathematically proven to be optimal. If it does not form an arborescence, $F^*$ must contain at least one directed cycle $C$.
Key Point 2: We apply self-reducibility. For any edge $(u,v) \in \delta_{in}(v)$, define a base cost $c'(u,v) = c(e_v)$, mapping every incoming edge to the cost of the cheapest option. We then define a residual cost $c''(u,v) = c(u,v) - c'(u,v)$. Crucially, any arborescence $F$ is minimum cost with respect to the original cost $c$ if and only if it is minimum cost with respect to the residual $c''$. Why? Because the cost of any arborescence under $c'$ is strictly $c'(F) = \sum_{v \in V - \{r\}} c(e_v)$. Since every valid arborescence must have exactly one incoming edge to each vertex, this sum is a fixed constant across all possible arborescences, meaning every arborescence is an optimal solution for the cost function $c'$.
Key Point 3: If we identify a cycle $C$ (not containing root $r$) where the residual cost $c''(C) = 0$, there inherently exists a minimum-cost arborescence $T=(V,F)$ that enters this cycle $C$ exactly once. This allows us to collapse the entire cycle into a single super-node and recurse.
The problem of finding a Minimum-Cost Arborescence (often solved by the Chu-Liu/Edmonds' algorithm) is one of the most elegant applications of the Local Ratio Theorem. The algorithm succeeds by decomposing a complex cost function into a trivial base function and a residual function.
A naive greedy approach would simply look at every node $v$ and pick the absolutely cheapest incoming edge to satisfy the requirement that every node (except the root) must have exactly one parent. If you do this and miraculously form an arborescence (a directed tree with no cycles), you are done! It is mathematically optimal.
However, because the graph is directed, this "blindly greedy" approach frequently creates a directed cycle instead of a tree. This is where the standard greedy algorithm gets stuck, and where the Local Ratio Theorem steps in to save the day. To use the Local Ratio Theorem, we need to split our cost function $c = c_1 + c_2$. We want to engineer $c_1$ such that every single valid arborescence is equally optimal for $c_1$. How do we do this?
Let $m(v)$ be the cost of the cheapest incoming edge to node $v$. We define our base cost function $c_1$ such that every incoming edge to $v$ is assigned a cost of exactly $m(v)$.
An arborescence is essentially a directed spanning tree, which means the rules of the structure dictate that every node (except the root) must have exactly one incoming edge. Therefore, no matter which valid arborescence you build, when evaluated under the $c_1$ cost function, it will always pay exactly $\sum_{v \neq r} m(v)$.
Since the cost under $c_1$ is a fixed mathematical constant for all valid solutions, every valid arborescence is an optimal solution for $c_1$. The $c_1$ function is perfectly "fully effective." The Local Ratio Theorem guarantees that we can now completely ignore $c_1$ and focus entirely on finding the optimal solution for the residual cost, $c_2 = c - c_1$.
By subtracting $c_1$ from our original costs, our residual cost $c_2$ inherently possesses a beautiful property: for every node $v$, its cheapest incoming edge now has a cost of exactly $0$.
If our initial greedy choices formed a cycle $C$, every edge in that cycle now has a residual cost of $0$ under $c_2$. Because they are free, it is highly advantageous to use them. However, a valid arborescence cannot contain cycles.
We know an optimal arborescence must enter this cycle exactly once (to connect these nodes to the root) and then use the zero-cost edges to traverse the rest of the cycle. To solve this, we can take the entire cycle $C$ and collapse it into a single "super-node." We then recursively call the algorithm on this smaller, collapsed graph using the residual costs $c_2$. When the recursion unwinds and returns an optimal arborescence for the collapsed graph, we expand the super-node back into a cycle. Because one edge of the arborescence will enter the cycle, we simply drop the one internal edge of the cycle that points to that same entry node, breaking the cycle and forming a perfect tree.
Since the Local Ratio Theorem proves that optimizing $c_2$ simultaneously optimizes $c$, we are mathematically guaranteed that the final, unwound arborescence is the absolute minimum-cost solution for the original graph.
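The three key points assemble into the classic Chu-Liu/Edmonds procedure. The following compact sketch returns only the optimal cost (recovering the edge set takes extra bookkeeping) and reports None when no arborescence rooted at the given node exists; it is the standard $O(VE)$ formulation, not an attempt to reproduce any particular pseudocode from the lectures.

```python
def min_arborescence_cost(n, edges, root):
    """Total cost of a minimum-cost arborescence of nodes 0..n-1 rooted at root.
    edges: list of (u, v, cost). Returns None if no arborescence exists."""
    INF = float('inf')
    total = 0
    while True:
        # Key Points 1-2: the cheapest incoming edge per node is the base cost c1;
        # its sum is paid by every arborescence, so we bank it and work with c2.
        cheapest = [INF] * n
        pre = [-1] * n
        for u, v, c in edges:
            if u != v and v != root and c < cheapest[v]:
                cheapest[v], pre[v] = c, u
        if any(v != root and cheapest[v] == INF for v in range(n)):
            return None
        cheapest[root] = 0
        total += sum(cheapest)
        # Detect a directed cycle among the chosen cheapest edges.
        comp = [-1] * n
        seen = [-1] * n
        cnt = 0
        for v in range(n):
            u = v
            while u != root and seen[u] != v and comp[u] == -1:
                seen[u] = v
                u = pre[u]
            if u != root and comp[u] == -1:      # u lies on a fresh cycle
                comp[u] = cnt
                x = pre[u]
                while x != u:
                    comp[x] = cnt
                    x = pre[x]
                cnt += 1
        if cnt == 0:
            return total                          # cheapest edges already form an arborescence
        for v in range(n):
            if comp[v] == -1:
                comp[v] = cnt
                cnt += 1
        # Key Point 3: collapse each zero-residual cycle into a super-node and
        # recurse on the residual costs c2(u, v) = c(u, v) - c1(v).
        edges = [(comp[u], comp[v], c - cheapest[v])
                 for u, v, c in edges if comp[u] != comp[v]]
        n, root = cnt, comp[root]
```

As a sanity check, on the 4-node graph with edges (0,1,10), (1,2,1), (2,3,1), (3,1,1), (0,2,3) rooted at 0, the cheapest in-edges form the cycle 1→2→3→1; collapsing it and recursing on residual costs yields the arborescence 0→2→3→1 of total cost 5.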
The Minimum Spanning Tree (MST) holds a profound geometric property relative to any other spanning tree in the same graph. Let $T$ be an arbitrary spanning tree and $T^*$ be the minimum spanning tree of a network. There mathematically exists a one-to-one and onto mapping $\sigma: E(T) \to E(T^*)$ such that for every edge $e \in E(T)$, the weight is bounded by $||e|| \ge ||\sigma(e)||$.
This is proven by mapping shared edges to themselves ($\sigma(e) = e$ for every edge $e \in E(T) \cap E(T^*)$). For any differing edge $e \in E(T) \setminus E(T^*)$, adding it to the MST ($T^* \cup \{e\}$) forces the creation of a cycle $Q_e$. Because $T^*$ is minimal, every edge $e'$ within $Q_e$ must satisfy $||e'|| \le ||e||$; otherwise $T^* \cup \{e\} \setminus \{e'\}$ would be a spanning tree cheaper than $T^*$. Utilizing Hall's Marriage Theorem, we can orchestrate a perfect bipartite matching between $E(T) \setminus E(T^*)$ and $E(T^*) \setminus E(T)$ such that every edge maps to a no-heavier counterpart within its respective cycle.
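The construction can be made concrete on a toy instance. Below is a small sketch (helper names are my own) that builds a $\sigma$-mapping via Kuhn's augmenting-path matching: each differing edge $e$ is allowed to match only onto edges of its cycle $Q_e$, i.e. onto the $T^*$-path between its endpoints. Hall's condition, established in the proof above, is what guarantees every augmentation succeeds.

```python
def tree_path_edges(tree_edges, s, t):
    """Edges (as frozensets) on the unique s-t path in a tree."""
    adj = {}
    for u, v in tree_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    parent = {s: None}                      # DFS from s, recording parents
    stack = [s]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                stack.append(y)
    path = []
    x = t
    while parent[x] is not None:            # walk back from t to s
        path.append(frozenset((x, parent[x])))
        x = parent[x]
    return path

def sigma_mapping(T, Tstar):
    """Build sigma: E(T) -> E(T*) with ||e|| >= ||sigma(e)||, where edges
    are frozensets of endpoints. Differing edges are matched (Kuhn's
    augmenting paths) onto their cycle Q_e inside T*."""
    left = [e for e in T if e not in Tstar]
    allowed = {e: [f for f in tree_path_edges(Tstar, *tuple(e)) if f not in T]
               for e in left}
    match = {}                              # right edge -> left edge
    def augment(e, visited):
        for f in allowed[e]:
            if f not in visited:
                visited.add(f)
                if f not in match or augment(match[f], visited):
                    match[f] = e
                    return True
        return False
    for e in left:
        augment(e, set())                   # succeeds by Hall's condition
    sigma = {e: e for e in T if e in Tstar} # shared edges map to themselves
    sigma.update({e: f for f, e in match.items()})
    return sigma
```

For instance, on the 4-cycle with weights $||01||=1$, $||12||=2$, $||23||=3$, $||03||=4$, the MST is $\{01, 12, 23\}$, and the spanning tree $\{01, 12, 03\}$ receives the mapping $\sigma(03) = 23$, with the shared edges fixed.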
This $\sigma$ mapping acts as the backbone for several powerful corollaries:
Uniqueness of the MST: Suppose all edges of a connected graph $G=(V,E)$ possess pairwise distinct, nonnegative weights. The graph is guaranteed to have a unique MST. If a different spanning tree $T$ existed, the $\sigma$ mapping defined above guarantees $||e|| \ge ||\sigma(e)||$. Because the trees differ, there must be at least one edge where $e \neq \sigma(e)$, and since weights are distinct, $||e|| > ||\sigma(e)||$ there. Summing the weights yields $length(T) = \sum_{e\in E(T)} ||e|| > \sum_{e\in E(T)} ||\sigma(e)|| = length(T^*)$, proving $T$ cannot be minimal.
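This corollary is easy to sanity-check by brute force on a tiny graph. The sketch below (function names are my own) enumerates every spanning tree and counts how many achieve the minimum weight: with distinct weights the count is $1$, while tied weights can produce several minimum trees.

```python
from itertools import combinations

def spanning_trees(n, edges):
    """Yield every spanning tree (as a tuple of (u, v, w) triples) of a
    graph on nodes 0..n-1, by checking each (n-1)-edge subset for
    acyclicity with union-find (acyclic + n-1 edges = spanning tree)."""
    for cand in combinations(edges, n - 1):
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x
        acyclic = True
        for u, v, _ in cand:
            ru, rv = find(u), find(v)
            if ru == rv:
                acyclic = False
                break
            parent[ru] = rv
        if acyclic:
            yield cand

def count_minimum_trees(n, edges):
    """How many spanning trees achieve the minimum total weight."""
    costs = [sum(w for _, _, w in t) for t in spanning_trees(n, edges)]
    return costs.count(min(costs))
```

On a triangle with distinct weights $1, 2, 3$ the minimum tree is unique, whereas with weights $1, 1, 1$ all three spanning trees tie.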
Steinerized Spanning Trees: Consider a set of points $P$ in the Euclidean plane and a fixed positive distance $R$. A steinerized spanning tree places Steiner points on edges to fracture them into pieces of length at most $R$. The spanning tree that requires the absolute minimum number of Steiner points is always obtained from the MST. Let $n_s(T)$ denote the minimum number of Steiner points to partition tree $T$; for a single edge, $n_s(e) = \lceil ||e||/R \rceil - 1$, which is nondecreasing in $||e||$, so the $\sigma$ mapping property gives $n_s(e) \ge n_s(\sigma(e))$. Summing this over the whole tree gives $n_s(T) = \sum_{e \in E(T)} n_s(e) \ge \sum_{e \in E(T)} n_s(\sigma(e)) = n_s(T^*)$.
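The per-edge count is worth seeing numerically. A minimal sketch, assuming the standard formula $n_s(e) = \lceil ||e||/R \rceil - 1$ (the function names are my own):

```python
import math

def steiner_points(length, R):
    """Minimum number of Steiner points needed to break an edge of the
    given length into segments of length at most R."""
    return math.ceil(length / R) - 1

def steiner_points_of_tree(edge_lengths, R):
    """n_s(T): total Steiner points over all edges of a tree."""
    return sum(steiner_points(L, R) for L in edge_lengths)
```

For example, with $R = 2$ an edge of length $5$ needs $2$ Steiner points while an edge of length $3$ needs only $1$; since the count never decreases as an edge grows, $||e|| \ge ||\sigma(e)||$ immediately yields $n_s(e) \ge n_s(\sigma(e))$ edge by edge.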
Alpha-Power Minimization: For a graph $G=(V,E)$ with positive edge weights, the spanning tree $T$ that minimizes $\sum_{e \in E(T)} ||e||^\alpha$ for any fixed $\alpha > 0$ is precisely the minimum spanning tree. Once again, since the mapping maintains $||e|| \ge ||\sigma(e)|| \ge 0$ and $x \mapsto x^\alpha$ is nondecreasing on the nonnegative reals, applying the power $\alpha$ preserves the inequality, dictating that $cost(T) = \sum_{e \in E(T)} ||e||^\alpha \ge \sum_{e \in E(T)} ||\sigma(e)||^\alpha = cost(T^*)$.
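This corollary, too, can be verified by brute force on a small instance: the tree minimizing the $\alpha$-powered cost coincides with the ordinary MST. A self-contained sketch (my own helper names, with a union-find spanning-tree enumerator):

```python
from itertools import combinations

def spanning_trees(n, edges):
    """Yield every spanning tree (as a tuple of (u, v, w) triples) of a
    graph on nodes 0..n-1, filtering (n-1)-edge subsets by acyclicity."""
    for cand in combinations(edges, n - 1):
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x
        acyclic = True
        for u, v, _ in cand:
            ru, rv = find(u), find(v)
            if ru == rv:
                acyclic = False
                break
            parent[ru] = rv
        if acyclic:
            yield cand

def best_tree(n, edges, alpha):
    """Spanning tree minimizing the alpha-powered cost, by brute force."""
    return min(spanning_trees(n, edges),
               key=lambda t: sum(w ** alpha for _, _, w in t))
```

On the graph with edges $(0,1,1), (1,2,2), (2,3,3), (0,3,4), (0,2,5)$, the minimizer for $\alpha = 3$ is the same edge set as for $\alpha = 1$, namely the MST $\{(0,1), (1,2), (2,3)\}$.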
While the Local Ratio Theorem and the $\sigma$-mapping rely on different mathematical engines, they share a profound, unifying philosophy: proving global optimality through localized, undeniable inequalities.
In the Local Ratio Theorem, we simplify an optimization problem by slicing the cost function. We prove that a solution $x^*$ is globally no worse than $x$ by finding sub-functions where $w_1(x) \ge w_1(x^*)$ and $w_2(x) \ge w_2(x^*)$. We completely bypass the need to evaluate the entire complex cost landscape; if we win on every individual slice of the weight, we win globally.
The MST $\sigma$-mapping accomplishes the exact same goal, but it slices the solution structure instead of the weights. By using Hall's Marriage Theorem to orchestrate a perfect bipartite matching, we decompose the two massive, complex trees ($T$ and $T^*$) into distinct, one-to-one edge battles. Because $T^*$ is a minimum-weight basis of the graphic matroid, it is mathematically guaranteed to win or tie every single localized edge battle ($||e|| \ge ||\sigma(e)||$).
Just as Local Ratio algorithms sum their trivial sub-functions to guarantee a global minimum ($w_1 + w_2$), the MST proofs simply sum these localized edge battles. Whether we are summing standard weights, counting Steiner points, or applying non-linear $\alpha$-powers, the algebraic logic is identical: if you never lose a local, one-to-one comparison, you cannot possibly lose globally.