While profiling a Java application that computes hierarchical clustering of thousands of elements, I realized that ArrayList.get accounts for about half of the CPU time spent in the clustering part of the execution.
The algorithm searches for the two most similar elements (so it is O(n*(n+1)/2)); here is the pseudocode:
float currentMax = 0.0f
for (int i = 0 to n)
    for (int j = i to n)
        get content i-th and j-th
        if their similarity > currentMax
            update currentMax
merge the two clusters
So effectively there are a lot of ArrayList.get calls involved.
Is there a faster way? I thought that since an ArrayList should be a plain linear array of references it should be the quickest option, and maybe I can't do anything since there are simply too many gets... but maybe I'm wrong. I don't think using a HashMap would work, since I need to get them all on every iteration and map.values() should be backed by an ArrayList anyway.
Otherwise, should I try other collection libraries that are more optimized, like Google's or Apache's?
EDIT:
You are somewhat confirming my doubts :(
Would I obtain a boost in performance by trying to parallelize things? Maybe using a pool of executors that calculate the similarity of more than one pair at a time... but I don't know whether synchronization and locks on the data structures would end up slowing it down.
Similarity is calculated using the dot product of the tag maps of the two contents. The maps are two HashMap<Tag, Float>. In addition, I already cache similarities in a TLongFloatHashMap (from the Trove collections) to avoid recalculating them in later iterations; the long key is computed from the hash codes of both contents (and is unique for the pair, so that hash(c1, c2) == hash(c2, c1)), so everything else is already tuned up enough.
EDIT2:
I'll post a little bit of the code just to let you understand better.. This is used to calculate the hash used to store the similarity between two elements:
private long computeKey(int h1, int h2) {
    if (h1 < h2) {
        int swap = h1;
        h1 = h2;
        h2 = swap;
    }
    // mask h2 so sign extension of a negative hash code cannot clobber the upper 32 bits
    return ((long) h1) << 32 | (h2 & 0xFFFFFFFFL);
}
This is how correlation is calculated:
float correlation(Map<Tag, Float> map1, Map<Tag, Float> map2, HierarchNode n1, HierarchNode n2) {
    long key = computeKey(n1.hashCode, n2.hashCode);
    if (cache.contains(key)) {
        ++hitCounter;
        return cache.get(key);
    } else {
        float corr = 0.0f;
        Set<Map.Entry<Tag, Float>> entries;
        Map<Tag, Float> curMap;
        if (map1.size() < map2.size()) {
            entries = map1.entrySet();
            curMap = map2;
        } else {
            entries = map2.entrySet();
            curMap = map1;
        }
        for (Map.Entry<Tag, Float> ee : entries) {
            Float f2 = curMap.get(ee.getKey());
            if (f2 != null)
                corr += ee.getValue() * f2;
        }
        cache.put(key, corr);
        return corr;
    }
}
And this is how the algorithm scans the content:
for (int j = 0; j < clusters.size(); ++j) {
    skip = false;
    for (int k = j + 1; k < clusters.size(); ++k) {
        float r = correlation(clusters.get(k).tags, clusters.get(j).tags, clusters.get(k), clusters.get(j));
        if (r > max) {
            max = r;
            i1 = j;
            i2 = k;
        }
        if (max == 1.0f) {
            skip = true;
            break;
        }
    }
    if (skip)
        break;
}
I would have used just a matrix to store all the values, but on every iteration the two most similar items are removed from the list and a new item is added (which has a new tag map depending on the two that were chosen).
The algorithm you have is O(n²). Unless you have a way to make your algorithm do something significantly better than doing pairwise comparisons, performance is unlikely to appreciably improve. :-(
ArrayList.get is an if statement followed by an array access. There's not much to optimize there. ArrayList.get occupies half the execution time because you're not doing anything else: the important factor in the time taken is the number of iterations, not what's inside the for loop.
There's no such thing as O(n*(n+1)/2). Your algorithm is O(n²). See Plain English explanation of Big O for a more detailed explanation.
Ben is correct: you can reduce the get() calls by getting the i-th element outside the inner loop.
What you're really looking for is something that improves on O(n²), and this requires being able to make additional assumptions about the elements. That depends on what you mean by "similarity".
Two common approaches:
- Sort the lists and merge. Overall this is O(n log n) (a sketch follows after this list);
- Put one list into a Map of some kind that has (near-)constant lookup. This can reduce the algorithm to anywhere between O(n) and O(n log n), depending on the type of Map and the nature of the traversal.
But all this depends on what you mean by "similarity".
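For concreteness, here is a minimal sketch of the first approach, under the assumption that "similarity" can be reduced to counting the elements two lists share; the class and method names are mine, not the asker's actual correlation:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortedMergeSimilarity {
    // Counts the elements two lists have in common: O(n log n) for the sorts,
    // then a single O(m + n) merge-style pass, with no per-element lookups.
    static <T extends Comparable<T>> int commonElements(List<T> a, List<T> b) {
        List<T> x = new ArrayList<T>(a);
        List<T> y = new ArrayList<T>(b);
        Collections.sort(x);
        Collections.sort(y);
        int i = 0, j = 0, common = 0;
        while (i < x.size() && j < y.size()) {
            int cmp = x.get(i).compareTo(y.get(j));
            if (cmp < 0) i++;
            else if (cmp > 0) j++;
            else { common++; i++; j++; }
        }
        return common;
    }
}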
At the risk of stating the obvious, you might get some speed up by using this pseudo-code:
float currentMax = 0.0f
for (int i = 0 to n)
    get content i-th
    for (int j = i to n)
        get content j-th
        if their similarity > currentMax
            update currentMax
merge the two clusters
It's still O(n²), though. If you need to compare every element to every other element to find out which pair is closest, you can't beat O(n²).
That said, if you're calling this multiple times, then there's an optimization to be found in caching these results in a sortable map.
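One way to make those cached results "sortable" is a priority queue of pair scores that is polled lazily after each merge instead of rescanning all pairs; this is only a sketch, and the PairScore class and the lazy-deletion idea are assumptions layered on top of the asker's existing cache:

import java.util.PriorityQueue;

// Sketch: each entry remembers which two clusters it scored and how similar they were.
class PairScore implements Comparable<PairScore> {
    final int id1, id2;   // stable ids of the two clusters (hypothetical)
    final float score;

    PairScore(int id1, int id2, float score) { this.id1 = id1; this.id2 = id2; this.score = score; }

    // Reversed so the head of the queue is the MOST similar pair.
    public int compareTo(PairScore o) { return Float.compare(o.score, this.score); }
}

// Usage sketch (lazy deletion): fill the queue once with all pair scores, then on each
// iteration poll entries until one refers to two clusters that still exist, merge them,
// and only add scores for pairs involving the newly created cluster.
// PriorityQueue<PairScore> queue = new PriorityQueue<PairScore>();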
EDIT: If the similarity is something rather simple (e.g., a 1-dimensional value such as height), you might be able to sort the items in the array first, such that element[0] is most similar to element[1], which is most similar to element[0] or element[2]. In that case, you can get a speed-up to O(n lg n).
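A tiny sketch of that special case, assuming similarity is just closeness of a single float value (a purely hypothetical setup): sort once, and only adjacent elements can form the closest pair.

import java.util.Arrays;

class ClosestPair1D {
    // O(n log n): after sorting, the most similar (closest) values must be adjacent.
    // Assumes at least two elements; returns the two adjacent positions in the
    // sorted array with the smallest gap.
    static int[] closestPair(float[] values) {
        float[] sorted = values.clone();
        Arrays.sort(sorted);
        int best = 0;
        for (int i = 1; i < sorted.length - 1; i++) {
            if (sorted[i + 1] - sorted[i] < sorted[best + 1] - sorted[best]) {
                best = i;
            }
        }
        return new int[] { best, best + 1 };
    }
}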
EDIT2: Given your correlation code, your benchmark results are very suspect. I cannot imagine the situation in which those two gets take more time than the call to the correlation code (even assuming the cache is hit the vast majority of the time), which is also called O(n²) times. Also, spong makes a very good point about converting these to arrays first if get() is the bottleneck.
I got the following idea after reading chapter 6 from http://nlp.stanford.edu/IR-book/information-retrieval-book.html
public class WHN implements Comparable<WHN> {
    private HierarchNode node;
    private float weight;

    public WHN(HierarchNode node, float weight) { this.node = node; this.weight = weight; }

    public HierarchNode getNode() { return node; }
    public float getWeight() { return weight; }

    public int compareTo(WHN o) { return Float.compare(this.weight, o.weight); }
}

Map<Tag, List<WHN>> map = new HashMap<Tag, List<WHN>>();
for (HierarchNode n : cluster) {
    for (Map.Entry<Tag, Float> tw : n.tags.entrySet()) {
        Tag tag = tw.getKey();
        Float weight = tw.getValue();
        if (!map.containsKey(tag)) {
            map.put(tag, new ArrayList<WHN>());
        }
        map.get(tag).add(new WHN(n, weight));
    }
}
// sort each per-tag list by weight once all nodes have been added
for (List<WHN> l : map.values()) {
    Collections.sort(l);
}
Then for each node you could limit the search to the union of the elements with the N highest weights for each tag (called champion lists),
or you could keep a running dot product for each node and update it tag by tag, looping only through the nodes with weight higher than some fraction of the original node's weight (you can find the starting point using Collections.binarySearch).
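A rough sketch of the champion-list idea, assuming the WHN class and the per-tag map built above (sorted ascending by weight, so champions sit at the end of each list); the method name candidates and the topN cutoff are invented for illustration:

// Sketch only: restrict the search for a given node to nodes that appear among
// the top-N weights ("champions") of at least one of its tags.
static Set<HierarchNode> candidates(HierarchNode n, Map<Tag, List<WHN>> map, int topN) {
    Set<HierarchNode> result = new HashSet<HierarchNode>();
    for (Tag tag : n.tags.keySet()) {
        List<WHN> byWeight = map.get(tag);
        if (byWeight == null) continue;
        // walk from the end, where the highest weights are after Collections.sort
        for (int i = byWeight.size() - 1, taken = 0; i >= 0 && taken < topN; i--, taken++) {
            HierarchNode candidate = byWeight.get(i).getNode();
            if (candidate != n) result.add(candidate);
        }
    }
    return result;
}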
I suggest you read the rest of the book, as it may contain a better algorithm.
Algorithmic efficiency aside, you are calling get too many times. Currently get is called (on the order of) 2*size*size times; it should be called size + size*size/2 times. This only changes constants, but it looks to me like you only need to call get about one quarter as much as you currently do.
Try:
for (int j = 0; j < clusters.size(); ++j) {
    skip = false;
    HierarchNode jnode = clusters.get(j);
    for (int k = j + 1; k < clusters.size(); ++k) {
        HierarchNode knode = clusters.get(k);
        float r = correlation(knode.tags, jnode.tags, knode, jnode);
        ... etc ...
Depending on the magnitude of clusters.size(), you might be able to further reduce constants by doing:
HierarchNode[] clusterArr = clusters.toArray(new HierarchNode[clusters.size()]);
and then using clusterArr[j] and clusterArr[k] instead of clusters.get(k) etc.
(names mangled slightly to avoid line wrap)
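For completeness, a sketch of the same loop rewritten against that array (identical logic, just without the get() calls):

HierarchNode[] clusterArr = clusters.toArray(new HierarchNode[clusters.size()]);
for (int j = 0; j < clusterArr.length; ++j) {
    HierarchNode jnode = clusterArr[j];
    for (int k = j + 1; k < clusterArr.length; ++k) {
        HierarchNode knode = clusterArr[k];
        float r = correlation(knode.tags, jnode.tags, knode, jnode);
        // ... same max / skip handling as in the original loop ...
    }
}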
There are not many complex operations in your code above. Mainly simple number reads/checks/writes, and they are amazingly fast.
The issue is that .get() is a function call: it will be much slower compared to a simple +, = or <= operation. If it's too slow for you, you should start to use real arrays or (as stated by the others) optimize your algorithm first.
If you're iterating this process, finding the next-most-similar pair each time, you might do well to create a map from (i, j) pairs to similarity measures, depending on how processor-intensive calculating the similarity is, how many items you've got, and how much memory you've got.
Local optimization won't count for much compared to changing the algorithm. We are not sure what you wanted to do in the first place, so we can't give you the best/good answer.
From what I see, it seems you have quite a lot of elements, and each one contains a list of (Tag, Weight) pairs. So there are some unclear things here:
- Is "Weight" calculated somewhere else, or is it static for its corresponding Tag?
- Can we use Integer instead of Float so that the numeric computation is faster? By the way, there is a bug in your program: you compare a float for exact equality (I mean max == 1.0f). You should instead use > or < with a precision range in this case (a small sketch follows below).
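A minimal sketch of that tolerance-based comparison; the EPSILON value and the helper name are placeholders to tune for your data:

// Placeholder tolerance; tune it for your data.
private static final float EPSILON = 1e-6f;

private static boolean approximatelyOne(float value) {
    return Math.abs(value - 1.0f) < EPSILON;
}

// in the scan loop, instead of (max == 1.0f):
// if (approximatelyOne(max)) { skip = true; break; }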
If the answers are "yes", we have some room for local optimization. If not, please consider the following techniques (which depend on your real data and real problem as well):
- Will sorting all the elements help? It would increase the chance of cutting off the calculation early in some cases...
- Use a dynamic programming technique, e.g. incrementally update the correlations. You would have to iterate through all elements once per update, but it saves a lot of effort later.
- Parallelize your code (run it on several threads) so it takes advantage of multicore environments; a minimal sketch follows below. If you are worried about locking, try using a lock-free Map/List instead.
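A minimal sketch of the parallel variant the asker was wondering about: split the outer loop across a fixed thread pool, let each task find its own best pair over a read-only snapshot, and reduce the partial results at the end, so there is no shared mutable state to lock. The BestPair holder, the Similarity interface, and the interleaved chunking are assumptions made up for this example; the shared similarity cache would also need to be thread-safe (or per-thread) in a real run.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BestPair {
    final int i, j;
    final float score;
    BestPair(int i, int j, float score) { this.i = i; this.j = j; this.score = score; }
}

class ParallelPairSearch {
    // Hypothetical adapter for whatever correlation(...) really computes.
    interface Similarity { float of(HierarchNode a, HierarchNode b); }

    // Splits the rows of the upper triangle across a thread pool and reduces
    // the per-thread winners; the cluster list is only read, never modified.
    static BestPair findBestPair(final List<HierarchNode> clusters,
                                 final Similarity similarity,
                                 int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<BestPair>> futures = new ArrayList<Future<BestPair>>();
        final int n = clusters.size();
        for (int t = 0; t < threads; t++) {
            final int offset = t;
            final int step = threads;
            futures.add(pool.submit(new Callable<BestPair>() {
                public BestPair call() {
                    BestPair best = new BestPair(-1, -1, Float.NEGATIVE_INFINITY);
                    for (int j = offset; j < n; j += step) {   // interleaved rows balance the work
                        HierarchNode jnode = clusters.get(j);
                        for (int k = j + 1; k < n; k++) {
                            float r = similarity.of(jnode, clusters.get(k));
                            if (r > best.score) best = new BestPair(j, k, r);
                        }
                    }
                    return best;
                }
            }));
        }
        BestPair best = new BestPair(-1, -1, Float.NEGATIVE_INFINITY);
        for (Future<BestPair> f : futures) {
            BestPair candidate = f.get();
            if (candidate.score > best.score) best = candidate;
        }
        pool.shutdown();
        return best;
    }
}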