High Level Java Optimization

There are many questions and answers and opinions about how to do low level Java optimization, with for, while, and do-while loops, and whether it's even necessary.

My question is more about high-level optimization in design. Let's assume I have to do the following:

For a given string input, count the occurrence of each letter in the string.

This is not a major problem when the string is a few sentences, but what if instead we want to count the occurrence of each word in a 900,000-word file? Building loops just wastes time.

So what is the high-level design pattern that can be applied to this type of problem?

I guess my major point is that I tend to use loops to solve many problems, and I would like to get out of the habit of using loops.

Thanks in advance,

Sam

P.S. If possible, can you produce some pseudocode for solving the 900,000-word file problem? I tend to understand code better than English, which I assume is the same for most visitors of this site.


The word count problem is one of the most widely covered problems in the Big Data world; it's kind of the Hello World of frameworks like Hadoop. You can find ample information throughout the web on this problem.

I'll give you some thoughts on it anyway.

First, 900,000 words might still be small enough to build a hashmap for, so don't discount the obvious in-memory approach. You said pseudocode is fine, so:

h = new HashMap<String, Integer>();
for each word w picked up while tokenizing the file {
  h[w] = w in h ? h[w] + 1 : 1
}
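
If you want actual Java rather than pseudocode, here is a minimal runnable sketch of the same idea, assuming the file fits in memory; the class name WordCount and the file-path argument are illustrative:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class WordCount {
    public static void main(String[] args) throws FileNotFoundException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        // Tokenize the file on whitespace, counting each lowercased word.
        Scanner in = new Scanner(new File(args[0]));
        while (in.hasNext()) {
            String w = in.next().toLowerCase();
            Integer n = counts.get(w);
            counts.put(w, n == null ? 1 : n + 1);
        }
        // Print each distinct word with its count.
        for (Map.Entry<String, Integer> e : counts.entrySet())
            System.out.println(e.getKey() + "\t" + e.getValue());
    }
}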

Now once your dataset is too large to build an in-memory hashmap, you can do your counting like so:

1. Tokenize into words, writing each word to a single line in a file.
2. Use the Unix sort command to produce the next file.
3. Count as you traverse the sorted file.

These three steps go in a Unix pipeline. Let the OS do the work for you here.
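
As a sketch, assuming a tokenizer like the WordGenerator class shown in the addendum below (the file names are illustrative):

java WordGenerator < big.txt | sort | uniq -c | sort -rn > counts.txt

Here uniq -c stands in for the counting traversal by prefixing each distinct word with its count, and the final sort -rn just lists the most frequent words first.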

Now, as you get even more data, you want to bring in map-reduce frameworks like Hadoop to do the word counting on clusters of machines.
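
To give a flavor of what that looks like, here is a sketch along the lines of the canonical Hadoop word-count example (a mapper emits (word, 1) pairs and a reducer sums them; treat it as an outline rather than a complete job, since the driver that wires these into a Job is omitted):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (word, 1) for every token in its slice of the input.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: the framework groups the pairs by word, so summing gives the count.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values)
            sum += val.get();
        result.set(sum);
        context.write(key, result);
    }
}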

Now, I've heard that when you get into obscenely large datasets, doing things in a distributed environment stops helping, because the transmission time overwhelms the counting time, and in the case of word counting everything has to "be put back together anyway". At that point you have to use some very sophisticated techniques that I suspect you can find in research papers.

ADDENDUM

The OP asked for an example of tokenizing the input in Java. Here is the easiest way:

import java.util.Scanner;
public class WordGenerator {
    /**
     * Tokenizes standard input into words, writing each word to standard output,
     * one per line.  Because it reads from standard input and writes to standard
     * output, it can easily be used in a pipeline combined with sort, uniq, and
     * any other such application.
     */
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        while (input.hasNext()) {
            System.out.println(input.next().toLowerCase());
        }
    } 
}

Now here is an example of using it:

echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator

This outputs

hey
moe!
woo
woo
woo
nyuk-nyuk
why
soitenly.
hey.

You can combine this tokenizer with sort and uniq like so:

echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator | sort | uniq

Yielding

hey
hey.
moe!
nyuk-nyuk
soitenly.
why
woo

Now if you only want to keep letters and throw away all punctuation, digits, and other characters, add import java.util.regex.Pattern; to the top of the file and change your scanner definition line to:

Scanner input = new Scanner(System.in).useDelimiter(Pattern.compile("\\P{L}"));

And now

echo -e "Hey Moe! Woo\nwoo woo^nyuk-nyuk why#2soitenly. Hey." | java WordGenerator | sort | uniq

Yields

hey
moe
nyuk
soitenly
why
woo

There is a blank line in the output; I'll let you figure out how to whack it. :)


The fastest solution to this is O(n), AFAIK: use a loop to iterate over the string, get each character, and update the count in a HashMap accordingly. At the end the HashMap contains all the characters that occurred and a count of the occurrences.

Some pseudo-code (may not compile):

HashMap<Character, Integer> map = new HashMap<Character, Integer>();
for (int i = 0; i < str.length(); i++)
{
    char c = str.charAt(i);
    if (map.containsKey(c)) map.put(c, map.get(c) + 1);
    else map.put(c, 1);
}


It's hard to do much better than using a loop to solve this problem. IMO, the best way to speed up this sort of operation is to split the workload into different units of work and process them with different processors (using threads, for example, if you have a multiprocessor computer).


You shouldn't assume 900,000 is a lot of words. If you have a CPU with 8 threads at 3 GHz, that's 24 billion clock cycles per second. ;)

However, for counting characters, using an int[] will be much faster. There are only 65,536 possible char values.

// Build about 140 million characters of random base-36 "words" as test data.
StringBuilder words = new StringBuilder();
Random rand = new Random();
for (int i = 0; i < 10 * 1000 * 1000; i++)
    words.append(Long.toString(rand.nextLong(), 36)).append(' ');
String text = words.toString();

long start = System.nanoTime();
// One counter slot for every possible char value (0 to 65,535).
int[] charCount = new int[Character.MAX_VALUE + 1];
for (int i = 0; i < text.length(); i++)
    charCount[text.charAt(i)]++;
long time = System.nanoTime() - start;
System.out.printf("Took %,d ms to count %,d characters%n", time / 1000 / 1000, text.length());

prints

Took 111 ms to count 139,715,647 characters

Even 11x the number of words takes only a fraction of a second.

A much longer parallel version is a little faster.

public static void main(String... args) throws InterruptedException, ExecutionException {
    StringBuilder words = new StringBuilder();
    Random rand = new Random();
    for (int i = 0; i < 10 * 1000 * 1000; i++)
        words.append(Long.toString(rand.nextLong(), 36)).append(' ');
    final String text = words.toString();

    long start = System.nanoTime();
    // start a thread pool to generate 4 tasks to count sections of the text.
    final int nThreads = 4;
    ExecutorService es = Executors.newFixedThreadPool(nThreads);
    List<Future<int[]>> results = new ArrayList<Future<int[]>>();
    int blockSize = (text.length() + nThreads - 1) / nThreads;
    for (int i = 0; i < nThreads; i++) {
        final int min = i * blockSize;
        final int max = Math.min(min + blockSize, text.length());
        results.add(es.submit(new Callable<int[]>() {
            @Override
            public int[] call() throws Exception {
                int[] charCount = new int[Character.MAX_VALUE + 1]; // one slot per possible char value
                for (int j = min; j < max; j++)
                    charCount[text.charAt(j)]++;
                return charCount;
            }
        }));
    }
    es.shutdown();
    // combine the results.
    int[] charCount = new int[Character.MAX_VALUE + 1];
    for (Future<int[]> resultFuture : results) {
        int[] result = resultFuture.get();
        for (int i = 0, resultLength = result.length; i < resultLength; i++) {
            charCount[i] += result[i];
        }
    }
    long time = System.nanoTime() - start;
    System.out.printf("Took %,d ms to count %,d characters%n", time / 1000 / 1000, text.length());
}

prints

Took 45 ms to count 139,715,537 characters

But for a String with less than a million words it's not likely to be worth it.


As a general rule, you should just write things in a straightforward way, and then do performance tuning to make it as fast as possible. If that means putting in a faster algorithm, do so, but at first, keep it simple. For a small program like this, it won't be too hard.

The essential skill in performance tuning is not guessing. Instead, let the program itself tell you what to fix. This is my method.

For more involved programs like this one, experience will show you how to avoid the kind of over-thinking that ends up causing much of the poor performance it is trying to avoid.


You have to use a divide-and-conquer approach and avoid racing for resources. There are different approaches and/or implementations of that, but the idea is the same: split the work and parallelize the processing.

On a single machine you can process chunks of the data in separate threads, although having the chunks on the same disk will slow things down considerably. Having more threads also means more context switching; for throughput it is IMHO better to have a smaller number of them and keep them busy.

You can split the processing into stages and use SEDA or something similar, and with really big data you go for map-reduce, just accepting the expense of distributing the data across the cluster.
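
As a rough single-machine illustration of the staged idea (not SEDA itself; the class name, queue size, and sentinel are all illustrative), here is one tokenizing stage feeding one counting stage through a bounded queue:

import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StagedCount {
    // Scanner.next() never returns an empty string, so "" is a safe stop sentinel.
    static final String POISON = "";

    public static void main(String[] args) throws InterruptedException {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);

        // Stage 1: tokenize standard input and enqueue words.
        Thread reader = new Thread(new Runnable() {
            public void run() {
                Scanner in = new Scanner(System.in);
                try {
                    while (in.hasNext())
                        queue.put(in.next().toLowerCase());
                    queue.put(POISON);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        reader.start();

        // Stage 2: dequeue words and count them.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String w = queue.take(); !w.equals(POISON); w = queue.take()) {
            Integer n = counts.get(w);
            counts.put(w, n == null ? 1 : n + 1);
        }
        reader.join();
        System.out.println(counts.size() + " distinct words");
    }
}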

I'll be glad if somebody points to another widely used API.
