A word of advice
This problem is a companion problem to 295. Find Median From Data Stream. This means that a lot of approaches to solve this problem are based on the methods to solve 295. Find Median From Data Stream. Perhaps try that problem before you approach this one.
Solution
Approach 1: Simple Sorting
Intuition
Do what the question says.
Algorithm
Store the numbers in a window container of size . The following operations must take place:
- Inserting the incoming element.
- Deleting the outgoing element.
- Sorting the window to find the medians.
One primitive approach is to copy consecutive elements from the input to the window and keep sorting these every time. This constitutes duplication of effort.
We can do a bit better if we instead insert and delete one element per window shift. The challenge then is to maintain the window as sorted, before and after the insert and delete operations.
Python comes with an excellent
bisect module to help perform efficient insert operations on lists while maintaining their sorted property.
Complexity Analysis
- Time complexity: to .
- Copying elements into the container takes about time each. This happens about times.
- Sorting for each of the sliding window instances takes about time each.
- Bisected insertion or deletion takes about for searching and for actual shifting of elements. This takes place about times.
- Space complexity: extra linear space for the window container.
Approach 2: Two Heaps (Lazy Removal)
Intuition
The idea is the same as Approach 3 from 295. Find Median From Data Stream. The only additional requirement is removing the outgoing elements from the window.
Since the window elements are stored in heaps, deleting elements that are not at the top of the heaps is a pain.
Some languages (like Java) provide implementations of the
PriorityQueueclass that allow for removing arbitrarily placed elements. Generally, using such features is not efficient nor is their portability assured.
Assuming that only the tops of heaps (and by extension the
PriorityQueueclass) are accessible, we need to find a way to efficiently invalidate and remove elements that are moving out of the sliding window.
At this point, an important thing to notice is the fact that if the two heaps are balanced, only the top of the heaps are actually needed to find the medians. This means that as long as we can somehow keep the heaps balanced, we could also keep some extraneous elements.
Thus, we can use hash-tables to keep track of invalidated elements. Once they reach the heap tops, we remove them from the heaps. This is the lazy removal technique.
An immediate challenge at this point is balancing the heaps while keeping extraneous elements. This is done by actually moving some elements to the heap which has extraneous elements, from the other heap. This cancels out the effect of having extraneous elements and maintains the invariant that the heaps are balanced.
NOTE: When we talk about keeping the heaps balanced, we are not referring to the actual heap sizes. We are only concerned with valid elements and hence when we talk about balancing heaps, we are referring to count of such elements.
Algorithm
- Two priority queues:
- A max-heap
loto store the smaller half of the numbers - A min-heap
hito store the larger half of the numbers
- A max-heap
- A hash-map or hash-table
hash_tablefor keeping track of invalid numbers. It holds the count of the occurrences of all such numbers that have been invalidated and yet remain in the heaps. - The max-heap
lois allowed to store, at worst, one more element more than the min-heaphi. Hence if we have processed elements:- If , then
lois allowed to hold elements, whilehican hold elements. - If , then both heaps are balanced and hold elements each.
This gives us the nice property that when the heaps are perfectly balanced, the median can be derived from the tops of both heaps. Otherwise, the top of the max-heaploholds the legitimate median. - If , then
NOTE: As mentioned before, when we are talking about keeping the heaps balanced, the actual sizes of the heaps are irrelevant. Only the count of valid elements in both heaps matter.
- Keep a
balancefactor. It indicates three situations:balance: Both heaps are balanced or nearly balanced.balance:loneeds more valid elements. Elements fromhiare moved tolo.balance:hineeds more valid elements. Elements fromloare moved tohi.
- Inserting an incoming number
in_num:- If
in_numis less than or equal to the top element oflo, then it can be inserted tolo. However this unbalanceshi(hihas lesser valid elements now). Hencebalanceis incremented. - Otherwise,
in_nummust be added tohi. Obviously, nowlois unbalanced. Hencebalanceis decremented.
- Lazy removal of an outgoing number
out_num:- If
out_numis present inlo, then invalidating this occurrence will unbalanceloitself. Hencebalancemust be decremented. - If
out_numis present inhi, then invalidating this occurrence will unbalancehiitself. Hencebalancemust be incremented. - We increment the count of this element in the hash_table table.
- Once an invalid element reaches either of the heap tops, we remove them and decrement their counts in the hash_table table.
Complexity Analysis
- Time complexity: .
- Either (or sometimes both) of the heaps gets every element inserted into it at least once. Collectively each of those takes about time. That is such insertions.
- About removals from the top of the heaps take place (the number of sliding window instances). Each of those takes about time.
- Hash table operations are assumed to take time each. This happens roughly the same number of times as removals from heaps take place.
- Space complexity: extra linear space.
- The heaps collectively require space.
- The hash table needs about space.
Approach 3: Two Multisets
Intuition
One can see that
multisets are a great way to keep elements sorted while providing efficient access to the first and last elements. Inserting and deleting arbitrary elements are also fairly efficient operations in a multiset. (Refer to Approach 4 Intuition for 295. Find Median From Data Stream)
Thus, if the previous approach gives you too much heartburn, consider replacing the use of
priority_queue with multiset.
Algorithm
Inserting or deleting an element is straight-forward. Balancing the heaps takes the same route as Approach 3 of 295. Find Median From Data Stream.
Complexity Analysis
- Time complexity: .
- At worst, there are three set insertions and three set deletions from the start or end. Each of these takes about time.
- Finding the mean takes constant time since the start or ends of sets are directly accessible.
- Each of these steps takes place about times (the number of sliding window instances).
- Space complexity: extra linear space to hold contents of the window.
Approach 4: Multiset and Two Pointers
Intuition
This is same as Approach 4 for 295. Find Median From Data Stream.
But, we don't actually need two pointers.
Median elements are derived using a single iterator position (when the window size is odd) or two consecutive iterator positions (when is even). Hence keeping track of only one pointer is sufficient. The other pointer can be implicitly derived when required.
Algorithm
- A single iterator
mid, which iterates over thewindowmultiset. It is analogous toupper_medianin Approach 4 for 295. Find Median From Data Stream.lower_medianis implicitly derived frommid. It's either equal tomid(when the window size is odd) orprev(mid)[1] (when the window size is even). - We start with populating our multiset
windowwith the first elements. We setmidto the indexed element inwindow(0-based indexing; Multisets always maintain their sorted property). - While inserting an element
numintowindow, three cases arise:numis less than the value of upper medianmid.numis greater than the value of upper medianmid.numis equal to the value of upper medianmid. This situation is often handled as language-dependent. Since C++multisetinsert elements at the end of their equal range, this situation is essentially the same as the previous case.
- For the first case,
numis inserted before the upper median elementmid. Thusmidnow, no longer points to the indexed element. In fact it points to the indexed element. We fix that by decrementingmid. - For the second and third cases,
numis inserted after the upper median elementmidand hence does not spoil themiditerator. It still points to the indexed element. - Of course, the window size just increased to in all three cases. That will easily be fixed by removing the element that is about to exit the window.
- While removing an element
num, the same three cases arise as when we wanted to insert an element:numis less than the value of upper medianmid.numis greater than the value of upper medianmid.numis equal to the value of upper medianmid. Sincemidwill point to the first occurrence ofnumin the multisetwindowand we deterministically remove the first occurrence (take note that we usestd::multiset::lower_bound()[2] to find the correct occurrence to erase), this case is handled in the same manner as the first case.
- In the first and third cases, removing
numwill spoil themiditerator. Thus we need to fix that by incrementingmidbefore we remove that element. - For the second case, the
miditerator is not spoiled. No further action is required. - Once this element has been removed, the window size returns to being .
- After insertion of the incoming element and removal of the outgoing element, we are left again with some nice invariants:
- Window size is again .
- The window is still fully sorted.
midstill points to the indexed element.
- Finding the median of the window is easy! It is simply the mean of the elements pointed to by the two pointers
lo_medianandhi_median. In our case those aremidorprev(mid)(as decided by whether is odd or even) , andmidrespectively.
Complexity Analysis
- Time complexity: .
- Initializing
midtakes about time. - Inserting or deleting a number takes time for a standard
multisetscheme. [4] - Finding the mean takes constant time since the median elements are directly accessible from
miditerator. - The last two steps take place about times (the number of sliding window instances).
- Initializing
- Space complexity: extra linear space to hold contents of the window.
Further Thoughts
As noted before, this problem is essentially an extension to 295. Find Median From Data Stream. That problem had a lot of ways to go about, that frankly, are not of much use in an interview. But they are interesting to follow all the same. If you are interested take a look here. Try extending those methods to this problem.
No comments:
Post a Comment