Median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value. So the median is the mean of the two middle value.
For example,[2,3,4]
, the median is 3
[2,3]
, the median is (2 + 3) / 2 = 2.5
Design a data structure that supports the following two operations:
- void addNum(int num) - Add a integer number from the data stream to the data structure.
- double findMedian() - Return the median of all elements so far.
Example:
addNum(1) addNum(2) findMedian() -> 1.5 addNum(3) findMedian() -> 2
Follow up:
- If all integer numbers from the stream are between 0 and 100, how would you optimize it?
- If 99% of all integer numbers from the stream are between 0 and 100, how would you optimize it?
class MedianFinder {
priority_queue<int> maxQ;
priority_queue<int, vector<int>, greater<int>> minQ;
public:
/** initialize your data structure here. */
MedianFinder() {
}
void addNum(int num) {
minQ.push(num);
maxQ.push(minQ.top());
minQ.pop();
// Balance.
if (minQ.size() < maxQ.size()) {
minQ.push(maxQ.top());
maxQ.pop();
}
}
double findMedian() {
// Odd.
if (minQ.size() > maxQ.size()) {
return minQ.top();
}
// Even
return (minQ.top() + maxQ.top())/ 2.0;
}
};
/**
* Your MedianFinder object will be instantiated and called as such:
* MedianFinder* obj = new MedianFinder();
* obj->addNum(num);
* double param_2 = obj->findMedian();
*/
==============================
==============================
Approach 1: Simple Sorting
Intuition
Do what the question says.
Algorithm
Store the numbers in a resize-able container. Every time you need to output the median, sort the container and output the median.
Complexity Analysis
- Time complexity: .
- Adding a number takes amortized time for a container with an efficient resizing scheme.
- Finding the median is primarily dependent on the sorting that takes place. This takes time for a standard comparative sort.
- Space complexity: linear space to hold input in a container. No extra space other than that needed (since sorting can usually be done in-place).
Approach 2: Insertion Sort
Intuition
Keeping our input container always sorted (i.e. maintaining the sorted nature of the container as an invariant).
Algorithm
Which algorithm allows a number to be added to a sorted list of numbers and yet keeps the entire list sorted? Well, for one, insertion sort!
We assume that the current list is already sorted. When a new number comes, we have to add it to the list while maintaining the sorted nature of the list. This is achieved easily by finding the correct place to insert the incoming number, using a binary search (remember, the list is always sorted). Once the position is found, we need to shift all higher elements by one space to make room for the incoming number.
This method would work well when the amount of insertion queries is lesser or about the same as the amount of median finding queries.
Complexity Analysis
- Time complexity: .
- Binary Search takes time to find correct insertion position.
- Insertion can take up to time since elements have to be shifted inside the container to make room for the new element.
Pop quiz: Can we use a linear search instead of a binary search to find insertion position, without incurring any significant runtime penalty?
- Space complexity: linear space to hold input in a container.
Approach 3: Two Heaps
Intuition
The above two approaches gave us some valuable insights on how to tackle this problem. Concretely, one can infer two things:
- If we could maintain direct access to median elements at all times, then finding the median would take a constant amount of time.
- If we could find a reasonably fast way of adding numbers to our containers, additional penalties incurred could be lessened.
But perhaps the most important insight, which is not readily observable, is the fact that we only need a consistent way to access the median elements. Keeping the entire input sorted is not a requirement.
Well, if only there were a data structure which could handle our needs.
As it turns out there are two data structures for the job:
- Heaps (or Priority Queues [1])
- Self-balancing Binary Search Trees (we'll talk more about them in Approach 4)
Heaps are a natural ingredient for this dish! Adding elements to them take logarithmic order of time. They also give direct access to the maximal/minimal elements in a group.
If we could maintain two heaps in the following way:
- A max-heap to store the smaller half of the input numbers
- A min-heap to store the larger half of the input numbers
This gives access to median values in the input: they comprise the top of the heaps!
Wait, what? How?
If the following conditions are met:
- Both the heaps are balanced (or nearly balanced)
- The max-heap contains all the smaller numbers while the min-heap contains all the larger numbers
then we can say that:
- All the numbers in the max-heap are smaller or equal to the top element of the max-heap (let's call it )
- All the numbers in the min-heap are larger or equal to the top element of the min-heap (let's call it )
Then and/or are smaller than (or equal to) almost half of the elements and larger than (or equal to) the other half. That is thedefinition of median elements.
This leads us to a huge point of pain in this approach: balancing the two heaps!
Algorithm
- Two priority queues:
- A max-heap
lo
to store the smaller half of the numbers - A min-heap
hi
to store the larger half of the numbers
- A max-heap
- The max-heap
lo
is allowed to store, at worst, one more element more than the min-heaphi
. Hence if we have processed elements:- If , then
lo
is allowed to hold elements, whilehi
can hold elements. - If , then both heaps are balanced and hold elements each.
This gives us the nice property that when the heaps are perfectly balanced, the median can be derived from the tops of both heaps. Otherwise, the top of the max-heaplo
holds the legitimate median. - If , then
- Adding a number
num
:- Add
num
to max-heaplo
. Sincelo
received a new element, we must do a balancing step forhi
. So remove the largest element fromlo
and offer it tohi
. - The min-heap
hi
might end holding more elements than the max-heaplo
, after the previous operation. We fix that by removing the smallest element fromhi
and offering it tolo
.
The above step ensures that we do not disturb the nice little size property we just mentioned. - Add
A little example will clear this up! Say we take input from the stream
[41, 35, 62, 5, 97, 108]
. The run-though of the algorithm looks like this:Adding number 41
MaxHeap lo: [41] // MaxHeap stores the largest value at the top (index 0)
MinHeap hi: [] // MinHeap stores the smallest value at the top (index 0)
Median is 41
=======================
Adding number 35
MaxHeap lo: [35]
MinHeap hi: [41]
Median is 38
=======================
Adding number 62
MaxHeap lo: [41, 35]
MinHeap hi: [62]
Median is 41
=======================
Adding number 4
MaxHeap lo: [35, 4]
MinHeap hi: [41, 62]
Median is 38
=======================
Adding number 97
MaxHeap lo: [41, 35, 4]
MinHeap hi: [62, 97]
Median is 41
=======================
Adding number 108
MaxHeap lo: [41, 35, 4]
MinHeap hi: [62, 97, 108]
Median is 51.5
Complexity Analysis
- Time complexity: .
- At worst, there are three heap insertions and two heap deletions from the top. Each of these takes about time.
- Finding the mean takes constant time since the tops of heaps are directly accessible.
- Space complexity: linear space to hold input in containers.
Approach 4: Multiset and Two Pointers
Intuition
Self-balancing Binary Search Trees (like an AVL Tree) have some veryinteresting properties. They maintain the tree's height to a logarithmic bound. Thus inserting a new element has reasonably good time performance. The median always winds up in the root of the tree and/or one of its children. Solving this problem using the same approach as Approach 3 but using a Self-balancing BST seems like a good choice. Except the fact that implementing such a tree is not trivial and prone to errors.
Why reinvent the wheel? Most languages implement a
multiset
class which emulates such behavior. The only problem remains keeping track of the median elements. That is easily solved with pointers! [2]
We maintain two pointers: one for the lower median element and the other for the higher median element. When the total number of elements is odd, both the pointers point to the same median element (since there is only one median in this case). When the number of elements is even, the pointers point to two consecutive elements, whose mean is the representative median of the input.
Algorithm
- Two iterators/pointers
lo_median
andhi_median
, which iterate over thedata
multiset. - While adding a number
num
, three cases arise:- The container is currently empty. Hence we simply insert
num
and set both pointers to point to this element. - The container currently holds an odd number of elements. This means that both the pointers currently point to the same element.
- If
num
is not equal to the current median element, thennum
goes on either side of it. Whichever side it goes, the size of that part increases and hence the corresponding pointer is updated. For example, ifnum
is less than the median element, the size of the lesser half of input increases by on insertingnum
. Thus it makes sense to decrementlo_median
. - If
num
is equal to the current median element, then the action taken is dependent on hownum
is inserted intodata
. NOTE: In our given C++ code example,std::multiset::insert
inserts an element after all elements of equal value. Hence we incrementhi_median
.
- If
- The container currently holds an even number of elements. This means that the pointers currently point to consecutive elements.
- If
num
is a number between both median elements, thennum
becomes the new median. Both pointers must point to it. - Otherwise,
num
increases the size of either the lesser or higher half of the input. We update the pointers accordingly. It is important to remember that both the pointers must point to the same element now.
- If
- Finding the median is easy! It is simply the mean of the elements pointed to by the two pointers
lo_median
andhi_median
.
Complexity Analysis
- Time complexity: .
- Inserting a number takes time for a standard
multiset
scheme. [4] - Finding the mean takes constant time since the median elements are directly accessible from the two pointers.
- Inserting a number takes time for a standard
- Space complexity: linear space to hold input in container.
Further Thoughts
There are so many ways around this problem, that frankly, it is scary. Here are a few more that I came across:
- Buckets! If the numbers in the stream are statistically distributed, then it is easier to keep track of buckets where the median would land, than the entire array. Once you know the correct bucket, simply sort it find the median. If the bucket size is significantly smaller than the size of input processed, this results in huge time saving. @mitbbs8080 has an interesting implementation here.
- Reservoir Sampling. Following along the lines of using buckets: if the stream is statistically distributed, you can rely on Reservoir Sampling. Basically, if you could maintain just one good bucket (or reservoir) which could hold a representative sample of the entire stream, you could estimate the median of the entire stream from just this one bucket. This means good time and memory performance. Reservoir Sampling lets you do just that. Determining a "good" size for your reservoir? Now, that's a whole other challenge. A good explanation for this can be found in this StackOverflow answer.
- Segment Trees are a great data structure if you need to do a lot of insertions or a lot of read queries over a limited range of input values. They allow us to do all such operations fast and in roughly the same amount of time, always. The only problem is that they are far from trivial to implement. Take a look at my introductory article on Segment Trees if you are interested.
- Order Statistic Trees are data structures which seem to be tailor-made for this problem. They have all the nice features of a BST, but also let you find the order element stored in the tree. They are a pain to implement and no standard interview would require you to code these up. But they are fun to use if they are already implemented in the language of your choice. [5]
Analysis written by @babhishek21.
- Priority Queues queue out elements based on a predefined priority. They are an abstract concept and can, as such, be implemented in many different ways. Heaps are an efficient way to implement Priority Queues. ↩︎