How to calculate median of a Map<Int,Int>?

Developer https://www.devze.com 2023-01-03 11:32 Source: Internet
For a map where the key represents a number in a sequence and the value counts how often this number appeared in the sequence, what would a Java implementation of an algorithm to calculate the median look like?

For example:

1,1,2,2,2,2,3,3,3,4,5,6,6,6,7,7

in a map:

Map<Integer, Integer> map = ...
map.put(1, 2);
map.put(2, 4);
map.put(3, 3);
map.put(4, 1);
map.put(5, 1);
map.put(6, 3);
map.put(7, 2);

double median = calculateMedian(map);
print(median);

would result in:

> print(median);
3
>

So what I am looking for is a Java implementation of calculateMedian.


Linear time

If you know the total count n of the numbers (in your case it is 16), you can walk the map in key order from the beginning or the end and sum up the counts until you reach the middle of the sequence: for odd n the median is the ((n+1)/2)th element, and for even n it is the average of the (n/2)th and (n/2+1)th elements.

If you don't know the total count you will have to go through all of them at least once.

Sublinear time

If you can choose the data structure and can do some pre-processing, see Wikipedia on selection algorithms; you might even get a sublinear algorithm. You can also get sublinear time if you know something about the distribution of the data.

EDIT: So under the assumption that we have a sequence with counts, what we can do is:

  • while inserting the key -> count pairs, maintain another map: key -> running_total
  • this way you have a structure from which you can get total_count by looking at the last key's running_total
  • and you can do a binary search to locate the first key whose running total reaches total_count/2

This doubles the memory usage, but gives O(log n) performance for the median (n being the number of distinct keys) and O(1) for total_count.
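
The running-total idea could be sketched roughly as follows. This is only one possible reading of it: the class name RunningTotalMedian and its methods are made up for illustration, and the sketch builds the running totals in one go from a finished count map instead of maintaining them during insertion.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; names are illustrative and not taken from the answer.
class RunningTotalMedian {
    private final int[] keys;          // distinct values in ascending order
    private final long[] runningTotal; // runningTotal[i] = number of elements <= keys[i]

    RunningTotalMedian(Map<Integer, Integer> counts) {
        // Assumption: copying into a TreeMap gives us the keys in sorted order.
        TreeMap<Integer, Integer> sorted = new TreeMap<>(counts);
        keys = new int[sorted.size()];
        runningTotal = new long[sorted.size()];
        long sum = 0;
        int i = 0;
        for (Map.Entry<Integer, Integer> e : sorted.entrySet()) {
            sum += e.getValue();
            keys[i] = e.getKey();
            runningTotal[i] = sum;
            i++;
        }
    }

    // O(1): the last running total is the total count.
    long totalCount() {
        return keys.length == 0 ? 0 : runningTotal[keys.length - 1];
    }

    // Binary search for the first key whose running total reaches the 1-based rank.
    private int elementAt(long rank) {
        int lo = 0;
        int hi = keys.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (runningTotal[mid] >= rank) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return keys[lo];
    }

    // O(log k) for k distinct keys.
    double median() {
        long n = totalCount();
        if (n == 0) {
            throw new IllegalStateException("no elements");
        }
        long lowerRank = (n + 1) / 2; // for odd n both ranks coincide
        long upperRank = n / 2 + 1;
        return (elementAt(lowerRank) + (double) elementAt(upperRank)) / 2.0;
    }
}
```

For the example from the question, the running totals are 2, 6, 9, 10, 11, 14, 16, so ranks 8 and 9 both land on key 3.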


Using Guava:

Multiset<Integer> values = TreeMultiset.create();
Collections.addAll(values, 1,1,2,2,2,2,3,3,3,4,5,6,6,6,7,7);

Now the answer to your question is:

return Iterables.get(values, (values.size() - 1) / 2);

Really. That's it. (Or check if size is even and average the two central values, to be precise about it.)

If the counts are particularly large, it would be faster to use the multiset's entrySet and keep a running sum, but the simplest way is usually fine.


  • Use a SortedMap, i.e. a TreeMap
  • Iterate through the map once to calculate the total number of elements, i.e. the sum of all occurrences
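  • Iterate again and add up occurrences until you've reached half of the total. The number that caused the sum to exceed half of the total is the median (for an even total this is the upper of the two middle values; average it with the lower one if you need the exact sample median)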
  • Test extensively for off-by-one errors
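
The steps above might look roughly like this in Java. It is a minimal sketch, not a definitive implementation: the class name CountMapMedian is made up, the map is copied into a TreeMap in case the caller passes an unsorted map, and for an even total the two middle values are averaged.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class name; calculateMedian is the method asked for in the question.
class CountMapMedian {
    static double calculateMedian(Map<Integer, Integer> counts) {
        // Assumption: copy into a TreeMap so that iteration happens in key order.
        TreeMap<Integer, Integer> sorted = new TreeMap<>(counts);

        // First pass: total number of elements in the sequence.
        long total = 0;
        for (int count : sorted.values()) {
            total += count;
        }
        if (total == 0) {
            throw new IllegalArgumentException("empty sequence");
        }

        // 1-based ranks of the two middle elements (identical for an odd total).
        long lowerRank = (total + 1) / 2;
        long upperRank = total / 2 + 1;

        // Second pass: accumulate counts until both ranks are covered.
        long seen = 0;
        Integer lower = null;
        for (Map.Entry<Integer, Integer> e : sorted.entrySet()) {
            seen += e.getValue();
            if (lower == null && seen >= lowerRank) {
                lower = e.getKey();
            }
            if (seen >= upperRank) {
                return (lower + (double) e.getKey()) / 2.0;
            }
        }
        throw new AssertionError("unreachable for a non-empty sequence");
    }

    public static void main(String[] args) {
        Map<Integer, Integer> map = new TreeMap<>();
        map.put(1, 2);
        map.put(2, 4);
        map.put(3, 3);
        map.put(4, 1);
        map.put(5, 1);
        map.put(6, 3);
        map.put(7, 2);
        System.out.println(calculateMedian(map)); // prints 3.0
    }
}
```

Working with 1-based ranks keeps the odd and even cases in one loop, which leaves fewer places for an off-by-one error.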


For an easy but maybe not-so-efficient algorithm I'd do it like this:

1. expand the map to a list.

In practice: iterate through the map and add each key 'value-times' to the new list. Finally, sort the list.

//...
List<Integer> field = new ArrayList<Integer>();
for (Integer key : map.keySet()) {
  // add the key as often as it occurred in the sequence
  for (int i = 0; i < map.get(key); i++) {
    field.add(key);
  }
}
Collections.sort(field);

2. calculate the median

Now you have to implement a method int calculateMedian(List<Integer> sorted). This depends on the kind of median you need. If it's just the sample median, then the result is either the middlemost value (for lists with an odd number of elements) or the average of the two middlemost values (for lists with an even length). Note that the list needs to be sorted!

(Ref: Sample Median / wikipedia)
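
A minimal sketch of that second step, assuming the sample median is wanted. The class name SortedListMedian is made up, and the method returns double instead of int so that the average of two middle values is not truncated.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class SortedListMedian {
    // Assumes the list is already sorted in ascending order.
    static double calculateMedian(List<Integer> sorted) {
        if (sorted.isEmpty()) {
            throw new IllegalArgumentException("empty list");
        }
        int n = sorted.size();
        if (n % 2 == 1) {
            // odd length: the single middlemost value
            return sorted.get(n / 2);
        }
        // even length: average of the two middlemost values
        return (sorted.get(n / 2 - 1) + (double) sorted.get(n / 2)) / 2.0;
    }

    public static void main(String[] args) {
        List<Integer> field = Arrays.asList(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 7);
        System.out.println(calculateMedian(field)); // prints 3.0
    }
}
```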


OK, OK, even though Chris didn't mention efficiency, here's an idea how to calculate the sample median (!) without expanding the map...

Set<Integer> sortedKeys = new TreeSet<Integer>(map.keySet()); // just to be sure ;)
Double median = null;  // Using Double to have an 'invalid/not found/etc' state
int total = 0;
for (Integer key : sortedKeys) {
  total += map.get(key);
}
if (total % 2 == 1) {
  int counter = total / 2;  // index of the middle element, starting with 0
  for (Integer key : sortedKeys) {
    counter -= map.get(key);
    if (counter < 0) {
      // the sample median is in this bin
      median = key.doubleValue();
      break;
    }
  }
} else if (total > 0) {
  int lower = total / 2 - 1;  // index of the lower middle element, starting with 0
  int upper = total / 2;      // index of the upper middle element
  Integer lowerKey = null;
  for (Integer key : sortedKeys) {
    lower -= map.get(key);
    upper -= map.get(key);
    if (lowerKey == null && lower < 0) {
      lowerKey = key;  // the lower middle element is in this bin
    }
    if (upper < 0) {
      // same bin: lowerKey equals key; otherwise lower is in an earlier bin
      median = (lowerKey + key) / 2.0;  // now we need the average
      break;
    }
  }
}

(I have no compiler at hand - if it has too many syntax errors, treat it as pseudo code, please ;) )
