Ah, my brain has just kicked into gear, I have a sensible suggestion now. Probably too late if this had been an interview, but never mind: Machine 1 shall be called the "control machine", and for the sake of argument either it starts with all the data, and sends it in equal parcels to the other 99 machines, or else the data starts evenly distributed between the machines, and it sends 1/99 of its data to each of the others. The partitions do not have to be equal, just close.
Each other machine sorts its data, and does so in a way which favours finding the lower values first. So for example a quicksort, always sorting the lower part of the partition first*. It writes its data back to the control machine in increasing order as soon as it can (using asynchronous IO so as to continue sorting, and probably with Nagle on: experiment a bit).
The control machine performs a 99-way merge on the data as it arrives, but discards the merged data, just keeping count of the number of values it has seen. It calculates the median as the mean of the 1/2 billionth and 1/2 billion plus oneth values. This suffers from the "slowest in the herd" problem.
The algorithm cannot complete until every value less than the median has been sent by a sorting machine. There's a reasonable chance that one such value will be quite high within its parcel of data. So once the initial partitioning of the data is complete, estimated running time is the combination of the time to sort 1/99th of the data and send it back to the control computer, and the time for the control to read 1/2 the data.
The "combination" is somewhere between the maximum and the sum of those times, probably close to the max. My instinct is that for sending data over a network to be faster than sorting it (let alone just selecting the median) it needs to be a pretty damn fast network. Might be a better prospect if the network can be presumed to be instantaneous, for example if you have 100 cores with equal access to RAM containing the data. Since network I/O is likely to be the bound, there might be some tricks you can play, at least for the data coming back to the control machine.
For example, instead of sending "1,2,3,.. 100", perhaps a sorting machine could send a message meaning "100 values less than 101". The control machine could then perform a modified merge, in which it finds the least of all those top-of-a-range values, then tells all the sorting machines what it was, so that they can (a) tell the control machine how many values to "count" below that value, and (b) resume sending their sorted data from that point. More generally, there's probably a clever challenge-response guessing game that the control machine can play with the 99 sorting machines.
This involves round-trips between the machines, though, which my simpler first version avoids. I don't really know how to blind-estimate their relative performance, and since the trade-offs are complex, I imagine there are much better solutions out there than anything I'll think of myself, assuming this is ever a real problem. * available stack permitting - your choice of which part to do first is constrained if you don't have O(N) extra space.
But if you do have enough extra space, you can take your pick, and if you don't have enough space you can at least use what you do have to cut some corners, by doing the small part first for the first few partitions.
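If it helps to see the counting merge concretely, here is a minimal single-process sketch of what the control machine does. The function name is mine, and the 99 network streams are stood in for by ordinary in-memory iterables yielding values in ascending order:

```python
import heapq

def median_via_counting_merge(sorted_streams, total_count):
    # sorted_streams: one iterable per sorting machine, each yielding its
    # values in ascending order. Merge them, keep only a running count, and
    # stop as soon as the two middle positions have been seen.
    lo_idx = (total_count - 1) // 2   # 0-based position of the lower middle value
    hi_idx = total_count // 2         # equal to lo_idx when total_count is odd
    lo_val = None
    for i, value in enumerate(heapq.merge(*sorted_streams)):
        if i == lo_idx:
            lo_val = value
        if i == hi_idx:
            return (lo_val + value) / 2

# e.g. median_via_counting_merge([[1, 3, 5], [2, 4, 6]], 6) == 3.5
```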
LOL. Does that really work or will the OOM killer nuke it before it completes? (on any reasonable computer) – Isak Savo May 28 '10 at 21:15
Should do. Sort knows how to do an out-of-core sort, so it won't run out of memory. – DrPizza May 29 '10 at 6:47.
How about this: each node can take 1 billion/100 numbers. At each node the elements can be sorted and the median found. Find the median of medians.
By aggregating the counts of numbers less than the median-of-medians on all nodes, we can find out the x%:y% split which the median-of-medians makes. Now ask all nodes to delete elements less than the median of medians (taking the example of a 30%:70% split). 30% of the numbers are deleted. 70% of 1 billion is 700 million.
Now all nodes which deleted fewer than 3 million numbers can send those extra numbers back to a main computer. The main computer redistributes them so that all nodes again have an almost equal number of numbers (7 million each). Now the problem is reduced to 700 million numbers... and this goes on until we have a set small enough to be computed on one computer.
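Here is a rough single-process sketch of one reduction step. This is my own reading of the idea: rather than always deleting a fixed percentage, the rank bookkeeping decides which side of the pivot to keep, which also addresses the concern about accidentally discarding the median raised in the comments below. The per-node chunks are plain lists here and the names are placeholders:

```python
import statistics

def reduction_step(node_chunks, target_rank):
    # One iteration: use the median of the per-node medians as a pivot,
    # count how many values across all nodes fall below it, then keep only
    # the side of the pivot that still contains the target rank (0-based).
    pivot = statistics.median(statistics.median(chunk) for chunk in node_chunks)
    below = sum(sum(1 for v in chunk if v < pivot) for chunk in node_chunks)
    if target_rank < below:
        # The wanted value is among the elements below the pivot.
        kept = [[v for v in chunk if v < pivot] for chunk in node_chunks]
        return kept, target_rank
    # Otherwise it is among the elements >= pivot; shift the rank accordingly.
    # (With many duplicates of the pivot this side may not shrink much; a
    # real implementation would special-case ties.)
    kept = [[v for v in chunk if v >= pivot] for chunk in node_chunks]
    return kept, target_rank - below
```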
In essence we are always reducing the problem set by at least 30% and we are achieving a lot of parallel computing through this. Each node starts with 10 million and reduces its data set by 30% in each iteration. – anony Apr 3 '10 at 15:15
In the first iteration we look for the 500 millionth number. In the second iteration, if the number of numbers deleted is 300 million, then we look for the 200 millionth number, and so on... – anony Apr 4 '10 at 3:39
This looks like it's on the right track, but you don't explain very clearly how to avoid throwing away the median by accident with your 30%/70% split.
Take the following counterexample: suppose your first 29% is all zeros, and all other blocks count up by 1000, and each set of blocks is one more than the last. The 30th percentile median will throw away all of 29% of the data, and just under half of 61% of the data, which is 29+30% = 59% of the data. Oops, we just threw out the true median!
So apparently you don't mean that, or at least you mean it more cleverly than I interpreted. – Rex Kerr Apr 4 '10 at 15:11.
Oddly enough, I think if you have enough computers, you're better off sorting than using O(n) median-finding algorithms. (Unless your cores are very, very slow, I'd just use one core and an O(n) median-finding algorithm for merely 1e9 numbers; if you had 1e12, though, that might be less practical.) Anyway, let's suppose we have more than log n cores to deal with this problem, and we don't care about power consumption, just getting the answer fast.
Let's further assume that this is an SMP machine with all the data already loaded in memory. (Sun's 32-core machines are of this type, for instance.) One thread chops the list up blindly into equal-sized pieces and tells the other M threads to sort them.
Those threads diligently do so, in (n/M) log (n/M) time. They then return not only their medians, but, say, their 25th and 75th percentiles as well (perverse worst cases are better if you choose slightly different numbers). Now you have 4M ranges of data.
You then sort these ranges and work upwards through the list until you find a number such that, if you throw out every range that is smaller than or contains the number, you will have thrown out half your data. That's your lower bound for the median. Do the same for the upper bound.
This takes something like M log M time, and all cores have to wait for it, so it's really wasting M^2 log M potential time. Now you have your single thread tell the others to toss all data outside the range (you should throw out about half on each pass) and repeat--this is a trivially fast operation since the data is already sorted. You shouldn't have to repeat this more than log(n/M) times before it's faster to just grab the remaining data and use a standard O(n) median finder on it. So, total complexity is something like O((n/M) log (n/M) + M^2 log M log (n/M)).
Thus, this is faster than O(n) median selection on one core if M >> log(n/M) and M^3 log M < n.
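As a hedged sketch of the "toss all data outside the range" step, assuming the lower and upper bounds for the median have already been derived from the reported percentiles (function and variable names are mine), the discard is just two binary searches per already-sorted chunk, plus a rank adjustment so the median's position is not lost:

```python
from bisect import bisect_left, bisect_right

def narrow_to_range(sorted_chunks, lower, upper, target_rank):
    # Because each chunk is already sorted, tossing values outside
    # [lower, upper] is cheap. Only the count of discarded values below
    # `lower` affects the rank of the median within the surviving data.
    dropped_below = sum(bisect_left(c, lower) for c in sorted_chunks)
    kept = [c[bisect_left(c, lower):bisect_right(c, upper)] for c in sorted_chunks]
    return kept, target_rank - dropped_below
```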
I hate to be the contrarian here, but I don't believe sorting is required, and I think any algorithm involving sorting a billion/100 numbers is going to be slow. Let's consider an algorithm on one computer. 1) Select 1000 values at random from the billion, and use them to get an idea of the distribution of the numbers, especially a range. 2) Instead of sorting the values, allocate them to buckets based on the distribution you just calculated.
The number of buckets is chosen so that the computer can handle them efficiently, but should otherwise be as large as convenient. The bucket ranges should be such that approximately equal numbers of values go in each bucket (this isn't critical to the algorithm, but it helps efficiency; 100,000 buckets might be appropriate). Note the number of values in each bucket.
This is an O(n) process. 3) Find out which bucket range the median lies in. This can be done by simply examining the running totals of the bucket counts. 4) Find the actual median by examining the values in that bucket.
You can use a sort here if you like, since you are only sorting maybe 10,000 numbers. This approach parallelizes trivially by dividing the values between the computers. Each computer reports the totals in each bucket to a 'control' computer which does step 3.
For step 4 each computer sends the (sorted) values in the relevant bucket to the control computer (you can do both of those algorithms in parallel too, but it probably isn't worth it). The total process is O(n), since both steps 3 and 4 are trivial, provided the number of buckets is large enough.
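A minimal single-machine sketch of steps 1-4, with placeholder names and bucket count. For brevity it keeps the bucket contents in memory rather than just the counts; the distributed version would ship only the per-bucket counts and, later, the one interesting bucket:

```python
import random

def bucket_median(values, n_buckets=100_000):
    # 1) Sample to estimate the range (values is assumed to be a list).
    sample = random.sample(values, min(1000, len(values)))
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / n_buckets or 1        # avoid zero width on degenerate samples
    # 2) Allocate every value to a bucket; this is the O(n) pass.
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        i = int((v - lo) / width)
        buckets[min(max(i, 0), n_buckets - 1)].append(v)   # clamp sampling outliers
    # 3) Walk the bucket counts to find the bucket holding the median.
    target = len(values) // 2                 # 0-based rank; upper middle for even counts
    seen = 0
    for b in buckets:
        if seen + len(b) > target:
            # 4) Only this one bucket needs sorting.
            return sorted(b)[target - seen]
        seen += len(b)
```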
One computer is more than enough to solve the problem. But let's assume that there are 100 computers. The only complex thing you should do is to sort the list.
Split it into 100 parts, send one part to each computer, let them be sorted there, and merge the parts after that. Then take the number from the middle of the sorted list (i.e. with index 500 000 000).
– Roman Apr 3 '10 at 13:55
Anyway now my rep is pretty round :) – Roman Apr 3 '10 at 13:57
-1 because sorting is not necessary for finding a median. – Pavel Shved Apr 3 '10 at 14:14
Merging is at best O(n), and you can find the median on a single core in O(n), so this seems to create a lot of extra work for no gain. – Rex Kerr Apr 3 '10 at 3:21.
Split the 10^9 numbers, 10^7 to each computer ~ 80MB on each. Each computer sorts its numbers. Then computer 1 merge-sorts its own numbers with those from computer 2, computer 3 and 4, etc... Then computer 1 writes half of the numbers back to 2, 3 to 4, etc. Then computer 1 merge-sorts the numbers from computers 1, 2, 3, 4 and writes them back.
And so on. Depending on the size of RAM on the computers you may get away with not writing all the numbers back to the individual computers at each step, you might be able to accumulate the numbers on computer 1 for several steps, but you do the maths. Oh, finally get the mean of the 500000000th and 500000001st values (but check there are enough 00s in there, I haven't).
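For what it's worth, here is a compressed single-process stand-in for that plan (assumed names; the repeated merge-shuffle between machines is collapsed into one heapq.merge, the network is elided, and an even count is assumed):

```python
import heapq

def median_by_sorting(numbers, machines=100):
    # Split into per-machine chunks, "sort locally", merge, and take the
    # mean of the two middle values.
    chunk_size = -(-len(numbers) // machines)       # ceiling division
    chunks = [sorted(numbers[i:i + chunk_size])
              for i in range(0, len(numbers), chunk_size)]
    merged = list(heapq.merge(*chunks))
    mid = len(merged) // 2
    return (merged[mid - 1] + merged[mid]) / 2      # even count assumed
```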
EDIT: @Roman -- well if you can't believe it even if it's true then there's no point in my revealing the truth or falsehood of the proposition. What I meant to state was that brute force sometimes beats smart in a race. It took me about 15 seconds to devise an algorithm which I am confident that I can implement, which will work, and which will be adaptable to a wide range of sizes of inputs and numbers of computers, and tunable to the characteristics of the computers and networking arrangements. If it takes you, or anyone else, say, 15 minutes to devise a more sophisticated algorithm I have a 14m45s advantage to code up my solution and start it running.
But I freely admit this is all assertion, I haven't measured anything.
Here we are just merge-sorting all the numbers. Can we do it in a better way using "we can find the median of two sorted lists in log n time, where n is the length of each list"? – anony Apr 3 '10 at 13:42
@anony -- while you answer your own question, I'll have my solution coded up, tested and done. I expect that there are better ways, but sometimes parallelising a simple way leaves me free to scratch my head on the really difficult problems. – High Performance Mark Apr 3 '10 at 13:46
Have you really done it in 7 minutes? I can't believe that even if it's true. I did a similar task (it was a university assignment) and it took about 2 hours to implement and test all the remoting stuff (I used Java RMI). – Roman Apr 3 '10 at 13:51
I see what you're saying, but by the same token DrPizza has an even quicker-to-think-of solution, which is to sort all the data on a single node and ignore the other 99. None of us knows how expensive data transfer should be considered, so we're all just picking a compromise that sounds vaguely plausible. Your solution transfers all the data multiple times, so I'm a bit suspicious of it, but it's certainly a solution. – Steve Jessop Apr 3 '10 at 15:18
'Vaguely plausible' -- that's good enough for me @Steve! Especially in response to a vaguely implausible question. – High Performance Mark Apr 3 '10 at 16:36.
Let's first work out how to find the median of n numbers on a single machine: I am basically using a partitioning strategy. Problem: selection(n, n/2): find the n/2-th number counting from the least. You pick, say, the middle element k and partition the data into 2 sub-arrays.
The 1st contains all elements less than k and the 2nd contains all elements greater than or equal to k. If sizeof(1st sub-array) >= n/2, you know that this sub-array contains the median. You can then throw off the 2nd sub-array.
Solve selection(1st sub-array, n/2). Otherwise, throw off this 1st sub-array and solve selection(2nd sub-array, n/2 - sizeof(1st sub-array)). Do it recursively. Time complexity is O(n) expected time.
Now if we have many machines, in each iteration we have to process an array to split, so we distribute the array across the different machines. Each machine processes its chunk of the array and sends back a summary to the controlling hub machine, i.e. the size of the 1st sub-array and the size of the 2nd sub-array.
The hub machine adds up the summaries, decides which sub-array (1st or 2nd) to process further and what the 2nd parameter of selection should be, and sends that back to each machine. And so on. This algorithm can be implemented very neatly using map-reduce?
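A single-machine sketch of selection(n, k) as described. It differs slightly from the text: a random pivot is used instead of the middle element, and a third partition for values equal to the pivot handles duplicates. The distributed version would only exchange the partition sizes each round:

```python
import random

def selection(a, k):
    # Expected O(n) selection of the k-th smallest element (0-based).
    pivot = random.choice(a)
    below = [v for v in a if v < pivot]
    equal = [v for v in a if v == pivot]
    above = [v for v in a if v > pivot]
    if k < len(below):
        return selection(below, k)
    if k < len(below) + len(equal):
        return pivot
    return selection(above, k - len(below) - len(equal))
```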
How does it look?
The median for this set of numbers 2, 3, 5, 7, 11, 13, 67, 71, 73, 79, 83, 89, 97 is 67. The median for this set of numbers 2, 3, 5, 7, 11, 13, 67, 71, 73, 79, 83, 89 is 40. Assuming the question was about 1,000,000,000 integers (x) where 0 >= x, and also assuming that all 100 computers were equal.
Using my laptop and GigE... What I found was that my laptop can sort 10,000,000 Int32's in 1.3 seconds. So a rough estimate would be that a billion-number sort would take 100 x 1.3 seconds (2 minutes 10 seconds) ;). An estimate of a one-way file transfer of a 40MB file on gigabit Ethernet is .32 seconds.
This means that the sorted results from all computers will be returned in approximately 32 seconds (computer 99 didn't get its file until 30 seconds after the start). From there it shouldn't take long to discard the lowest 499,999,998 numbers, add the next 2 and divide by 2.
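The two example medians at the top of this answer can be reproduced with a tiny helper (a hypothetical snippet, just to check the arithmetic):

```python
def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

primes = [2, 3, 5, 7, 11, 13, 67, 71, 73, 79, 83, 89, 97]
print(median(primes))        # 67   (13 values, the 7th is the median)
print(median(primes[:-1]))   # 40.0 ((13 + 67) / 2 for the 12-value set)
```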
It would help me understand how I can do better. – dbasnett Jul 1 at 13:37.