
find min, max, median and average


electroRF

Hi,

I ran into this question: write a function that receives an array of n numbers and finds the min, max, median, and average.

You can of course sort the array in O(n log n) and then read off the min, max, median, and average.

I want to do it in O(n), though.

I thought of finding the median using Quickselect (the Quicksort-based selection algorithm) to find the (n/2)-th element, which takes O(n) on average but O(n^2) in the worst case.

I also know of the median-of-medians algorithm, which finds the median in O(n) worst case.


Is there any better way you know of to find the min, max, median, and average?
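
For reference, a minimal sketch of the O(n log n) baseline (sort a copy, then read the answers off); the function names and the even-n median convention are illustrative choices, not part of the original question:

```c
/* Sketch: O(n log n) baseline via qsort on a copy. n must be > 0. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static void stats_sorted(const double *arr, size_t n,
                         double *min, double *max, double *median, double *avg)
{
    double *tmp = malloc(n * sizeof *tmp);
    double sum = 0.0;
    if (!tmp) return;                      /* out of memory: bail out */
    for (size_t i = 0; i < n; i++) {
        tmp[i] = arr[i];
        sum += arr[i];
    }
    qsort(tmp, n, sizeof *tmp, cmp_double);
    *min = tmp[0];
    *max = tmp[n - 1];
    /* Even n: average the two middle elements. */
    *median = (n % 2) ? tmp[n / 2] : (tmp[n / 2 - 1] + tmp[n / 2]) / 2.0;
    *avg = sum / (double)n;
    free(tmp);
}

int main(void)
{
    double a[] = { 4, 1, 3, 2, 5 };
    double mn, mx, med, avg;
    stats_sorted(a, 5, &mn, &mx, &med, &avg);
    printf("min=%g max=%g median=%g avg=%g\n", mn, mx, med, avg);
    return 0;
}
```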
 
Since you worry about the speed at large n, this must be a huge array.

The important question here is whether you're allowed to sort the array in place or must leave it intact, and, if you must leave it intact, whether you have enough memory to allocate a copy of the array to sort. When arrays are huge, memory can be a real problem.

It could be desirable to have the array sorted anyway. In that case, sorting it before calling your routine is the best solution regardless of speed.

You cannot select (or create) any particular algorithm without knowing how it is going to be used.
 
If you just combine "running average" and "running median" algorithms, you have the solution. Keeping track of min and max is very easy, and the average is easy too.
The median is a bit more complex, but not that difficult at all. Google "running/moving/rolling/streaming median algorithm".
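
A minimal sketch of the easy part of this suggestion, min, max, and average in one O(n) pass (names are illustrative; the running median is the harder piece and is sketched further down):

```c
/* Sketch: min, max, and average in a single O(n) pass. n must be > 0. */
#include <stdio.h>

static void min_max_avg(const double *arr, size_t n,
                        double *min, double *max, double *avg)
{
    double lo = arr[0], hi = arr[0], sum = arr[0];
    for (size_t i = 1; i < n; i++) {
        if (arr[i] < lo) lo = arr[i];
        if (arr[i] > hi) hi = arr[i];
        sum += arr[i];
    }
    *min = lo;
    *max = hi;
    *avg = sum / (double)n;
}

int main(void)
{
    double a[] = { 4, 1, 3, 2, 5 };
    double mn, mx, avg;
    min_max_avg(a, 5, &mn, &mx, &avg);
    printf("min=%g max=%g avg=%g\n", mn, mx, avg);
    return 0;
}
```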
 
Thank you, guys.

T, that answered the question :) Thanks a lot.

NorthGuy, actually it wasn't specified in the question whether you're allowed to sort in place; however, T suggested a good solution.
 
The median is a bit more complex, but not that difficult at all. Google "running/moving/rolling/streaming median algorithm".

QuickSort is way faster than rolling median and requires much less memory (none if we're allowed to sort in place; 20 to 50% less if we create a copy of the array for sorting).
 
QuickSort is way faster than rolling median and requires much less memory (none if we're allowed to sort in place; 20 to 50% less if we create a copy of the array for sorting).

Yes. Maybe the median should be solved using Quickselect (based on Quicksort), which is O(n) on average. I like the rolling median because I mostly deal with real-time data, and each update can be done in O(log n), but that does not mean it is faster than Quickselect.
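
A minimal sketch of the usual two-heap running median behind that O(log n)-per-update figure: a max-heap holds the lower half of the samples and a min-heap the upper half, so the median is always at the heap tops. The fixed capacity and all names are illustrative assumptions, and this version only grows the stream (a sliding window would also need deletions):

```c
/* Sketch: streaming median via two heaps, O(log n) per insertion. */
#include <stdio.h>

#define CAP 1024

static double lo[CAP]; static int nlo;  /* max-heap: lower half  */
static double hi[CAP]; static int nhi;  /* min-heap: upper half  */

static void swap_d(double *a, double *b) { double t = *a; *a = *b; *b = t; }

/* Sift a newly appended element up; dir = +1 for max-heap, -1 for min-heap. */
static void sift_up(double *h, int i, int dir)
{
    while (i > 0) {
        int p = (i - 1) / 2;
        if (dir * (h[i] - h[p]) <= 0) break;
        swap_d(&h[i], &h[p]);
        i = p;
    }
}

/* Restore the heap after the root has been replaced. */
static void sift_down(double *h, int n, int dir)
{
    int i = 0;
    for (;;) {
        int l = 2 * i + 1, r = l + 1, best = i;
        if (l < n && dir * (h[l] - h[best]) > 0) best = l;
        if (r < n && dir * (h[r] - h[best]) > 0) best = r;
        if (best == i) break;
        swap_d(&h[i], &h[best]);
        i = best;
    }
}

static void insert(double x)
{
    if (nlo == 0 || x <= lo[0]) { lo[nlo] = x; sift_up(lo, nlo++, +1); }
    else                        { hi[nhi] = x; sift_up(hi, nhi++, -1); }

    /* Rebalance so that nlo == nhi or nlo == nhi + 1. */
    if (nlo > nhi + 1) {
        hi[nhi] = lo[0]; sift_up(hi, nhi++, -1);
        lo[0] = lo[--nlo]; sift_down(lo, nlo, +1);
    } else if (nhi > nlo) {
        lo[nlo] = hi[0]; sift_up(lo, nlo++, +1);
        hi[0] = hi[--nhi]; sift_down(hi, nhi, -1);
    }
}

/* Valid after at least one insert(). */
static double median(void)
{
    return (nlo > nhi) ? lo[0] : (lo[0] + hi[0]) / 2.0;
}

int main(void)
{
    double data[] = { 5, 1, 4, 2, 3 };
    for (int i = 0; i < 5; i++) {
        insert(data[i]);
        printf("after %g: median = %g\n", data[i], median());
    }
    return 0;
}
```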
 
Hello there,

Many times a problem can be greatly simplified if there is a limit to the range of data that enters the system. For example, for the color channels in a normal 16-million-color display the range is only 0 to 255 per channel, so of course each channel can be stored in a single byte. If it weren't like that, it could take two or more bytes just to store that one channel, or channels would have to share bytes, and that makes the whole thing harder to implement.

So I just have to ask if there is a limit to the range of data that will be accepted as input. If the data only appears as integers from 1 to 100, for example, we can probably do this in very nearly O(n) time without too much trouble, even with a HUGE data set of over 2^32 items.
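
A minimal sketch of that range-limited idea, assuming integer inputs in a known 1..100 range: one counting pass builds a histogram, and min, max, median, and average all fall out in O(n + range). The range and names are illustrative:

```c
/* Sketch: stats for integers in a known small range.
   One O(n) counting pass plus an O(range) scan. n must be > 0. */
#include <stdio.h>

#define LO 1
#define HI 100

static void stats_histogram(const int *arr, size_t n)
{
    long count[HI - LO + 1] = { 0 };
    long long sum = 0;

    for (size_t i = 0; i < n; i++) {        /* one O(n) counting pass */
        count[arr[i] - LO]++;
        sum += arr[i];
    }

    int min = -1, max = -1;
    for (int v = 0; v <= HI - LO; v++)      /* O(range) scan for extremes */
        if (count[v]) { if (min < 0) min = v + LO; max = v + LO; }

    /* Walk the histogram until more than half the samples are covered
       (for even n this yields the upper middle element). */
    long long seen = 0;
    int median = min;
    for (int v = 0; v <= HI - LO; v++) {
        seen += count[v];
        if (2 * seen > (long long)n) { median = v + LO; break; }
    }

    printf("min=%d max=%d median=%d avg=%g\n",
           min, max, median, (double)sum / (double)n);
}

int main(void)
{
    int a[] = { 7, 3, 9, 3, 5 };
    stats_histogram(a, 5);
    return 0;
}
```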
 
Many times a problem can be greatly simplified if there is a limit to the range of data that enters the system. For example, for the color channels in a normal 16-million-color display the range is only 0 to 255 per channel, so of course each channel can be stored in a single byte. If it weren't like that, it could take two or more bytes just to store that one channel, or channels would have to share bytes, and that makes the whole thing harder to implement.

So I just have to ask if there is a limit to the range of data that will be accepted as input. If the data only appears as integers from 1 to 100, for example, we can probably do this in very nearly O(n) time without too much trouble, even with a HUGE data set of over 2^32 items.

I don't see how the datatype (or range) would affect the big-O classification of the algorithm. Of course it will affect the efficiency and overall speed, but not the big-O characteristics. And I think the original question is purely academic. I think electroRF is studying algorithms in general and therefore posting these questions, which are quite interesting in my opinion. ElectroRF can correct me if I'm wrong, but I don't think he has any specific limits to the dataset.
 
I don't see how the datatype (or range) would affect the big-O classification of the algorithm. Of course it will affect the efficiency and overall speed, but not the big-O characteristics.

In the simplest case, imagine 0/1 data. You only need to keep count of 0s and 1s; then if there are more 1s than 0s, the median is 1, otherwise 0. That's O(n). You can't do that with arbitrary float data.
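
That 0/1 case written out as code (a sketch; names are illustrative):

```c
/* Sketch: median of 0/1 data by counting, O(n). */
#include <stdio.h>

static int binary_median(const int *bits, size_t n)
{
    size_t ones = 0;
    for (size_t i = 0; i < n; i++)
        ones += (bits[i] != 0);
    /* More 1s than 0s -> median is 1, otherwise 0. */
    return (2 * ones > n) ? 1 : 0;
}

int main(void)
{
    int bits[] = { 1, 0, 1, 1, 0 };
    printf("median = %d\n", binary_median(bits, 5));
    return 0;
}
```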
 
In the simplest case, imagine 0/1 data. You only need to keep count of 0s and 1s; then if there are more 1s than 0s, the median is 1, otherwise 0. That's O(n). You can't do that with arbitrary float data.

Still O(n). I did not say that the datatype would not affect overall speed. I said that the datatype and range do not affect the big-O classification of an algorithm. There is an O(log n)-per-update algorithm for the running median, but that does not mean it is faster than Quickselect, which is O(n). Regardless of the datatype or range of the data.
 
Still O(n).

It's possible that an O(n) algorithm exists (although I doubt it), but it would certainly be different from the simple algorithm I described for binary data.

There is an O(log n)-per-update algorithm for the running median

The straightforward running median is O(n^2) overall: it's essentially insertion sort, paying O(n) to insert each new element into a sorted buffer.

but that does not mean it is faster than Quickselect, which is O(n).

Quickselect is the same as Quicksort except that you do not sort the partitions that cannot contain the median. It's still O(n log n) for random data.

Regardless of the datatype or range of the data.

If you have specific data you can use a data-specific algorithm, which may not be applicable to general data.
 
Quickselect is the same as Quicksort except that you do not sort the partitions that cannot contain the median. It's still O(n log n) for random data.

I thought about this, and you're right: Quickselect finds the median in close to O(n) on random data.

However, there's no corresponding O(n) algorithm for a full sort. The best comparison sort is O(n log n), while with binary data it is O(n).
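
A minimal sketch of Quickselect as discussed above, using Lomuto partitioning with the last element as the pivot for brevity (a random pivot is the usual guard against the O(n^2) worst case); names are illustrative:

```c
/* Sketch: Quickselect, expected O(n) on random data, O(n^2) worst case. */
#include <stdio.h>

static void swap_d(double *a, double *b) { double t = *a; *a = *b; *b = t; }

/* Lomuto partition over arr[lo..hi]; returns the pivot's final index. */
static int partition(double *arr, int lo, int hi)
{
    double pivot = arr[hi];
    int i = lo;
    for (int j = lo; j < hi; j++)
        if (arr[j] < pivot) swap_d(&arr[i++], &arr[j]);
    swap_d(&arr[i], &arr[hi]);
    return i;
}

/* Returns the k-th smallest (0-based) element; reorders arr in place. */
static double quickselect(double *arr, int lo, int hi, int k)
{
    while (lo < hi) {
        int p = partition(arr, lo, hi);
        if (k == p) return arr[p];
        if (k < p) hi = p - 1; else lo = p + 1;
    }
    return arr[lo];
}

int main(void)
{
    double a[] = { 4, 1, 3, 2, 5 };
    int n = 5;
    printf("median = %g\n", quickselect(a, 0, n - 1, n / 2));
    return 0;
}
```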
 
I don't see how the datatype (or range) would affect the big-O classification of the algorithm. Of course it will affect the efficiency and overall speed, but not the big-O characteristics. And I think the original question is purely academic. I think electroRF is studying algorithms in general and therefore posting these questions, which are quite interesting in my opinion. ElectroRF can correct me if I'm wrong, but I don't think he has any specific limits to the dataset.

Hi,

Ultimately these kinds of algorithms have to be implemented in computer code, and thus time becomes an important factor that overshadows any man-made notation used to compare things. In notation alone O(n) may be the same as O(8n), but I for one do not wish to wait 8 times longer for an algorithm to complete just because it happens to be classified the same as one that runs faster.

The main point was simplification through range limiting. If the range of the data can be limited (as in NorthGuy's cute example) then the algorithm often ends up being simpler, and simpler usually means faster time-wise.

For example, if we had to add integers we could do this pretty fast, one operation per integer. But if we had to add 4-vectors (for simplicity, with integer elements) we'd have to add each element of every vector, meaning it would take 4 times as long to complete. But if we examined the data and found that every one of these vectors had only one nonzero element, and that element was always in the same dimension, then we could simplify the problem by adding only that one element for each vector and ignoring the other three, and that would reduce the time back to almost what it took to add the single integers. So clearly there is a benefit in thinking about this even if we cannot formally deem it to be of some lower order.
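
A toy sketch of that 4-vector point, assuming a struct layout and names that are purely illustrative: the general sum does four additions per vector, while the range-limited version adds only the one component known to be nonzero:

```c
/* Sketch: full 4-vector sum vs. single-component sum. */
#include <stdio.h>

typedef struct { int v[4]; } vec4;

/* General case: 4 additions per vector. */
static vec4 sum_full(const vec4 *a, size_t n)
{
    vec4 s = { { 0, 0, 0, 0 } };
    for (size_t i = 0; i < n; i++)
        for (int d = 0; d < 4; d++)
            s.v[d] += a[i].v[d];
    return s;
}

/* Range-limited case: only component `dim` is ever nonzero. */
static long sum_one_dim(const vec4 *a, size_t n, int dim)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i].v[dim];
    return s;
}

int main(void)
{
    vec4 a[] = { { { 0, 2, 0, 0 } }, { { 0, 5, 0, 0 } } };
    vec4 s = sum_full(a, 2);
    printf("full: %d, one-dim: %ld\n", s.v[1], sum_one_dim(a, 2, 1));
    return 0;
}
```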

BTW, when I talk about algorithms like this I always like to specify whether the array has been sorted BEFORE entering the algorithm, because claiming an order for the algorithm when the data has already been sorted is a little like cheating <chuckle>, unless the data came from a previous operation that required sorting anyway. So if there is a sort involved because the algorithm needs sorted data to work properly, then I think it should be stated outright.
So far I have been assuming that the data coming into the algorithm has not yet been sorted, because if it was, there's almost no game. Thus what I have been talking about is based on an input data set that is NOT sorted, and we do not wish to take the time to sort it because no following algorithms will need it that way either. If there were other algorithms that did need a sorted data set, then we'd have to consider the placement of the sort in the sequence of operations, and that would probably change everything.
 
