开发者

Performance intensive string splitting and manipulation in java

开发者 https://www.devze.com 2023-01-01 20:33 出处:网络
What is the most efficient way to split a string by a very simple separator? Some background: I am porting a function I wrote in C with a bunch of pointer arithmetic to java and it is incredibly slo

What is the most efficient way to split a string by a very simple separator?

Some background:

I am porting a function I wrote in C with a bunch of pointer arithmetic to java and it is incredibly slow(After some optimisation still 5* slower). Having profiled it, it turns out a lot of that overhead is in String.split

The function in question takes a host name or ip address and makes it generic:

123.123.123.123->*.123.123.123

a.b.c.example开发者_如何学C.com->*.example.com

This can be run over several million items on a regular basis, so performance is an issue.

Edit: the rules for converting are thus:

  • If it's an ip address, replace the first part
  • Otherwise, find the main domain name, and make the preceding part generic.

foo.bar.com-> *.bar.com foo.bar.co.uk-> *.bar.co.uk

I have now rewritten using lastIndexOf and substring to work myself in from the back and the performance has improved by leaps and bounds.

I'll leave the question open for another 24 hours before settling on the best answer for future reference

Here's what I've come up with now(the ip part is an insignificant check before calling this function)

private static String hostConvert(String in) {
    final String [] subs = { "ac", "co", "com", "or", "org", "ne", "net", "ad", "gov", "ed" };

    int dotPos = in.lastIndexOf('.');
    if(dotPos == -1)
        return in;
    int prevDotPos = in.lastIndexOf('.', dotPos-1);
    if(prevDotPos == -1)
        return in;
    CharSequence cs = in.subSequence(prevDotPos+1, dotPos);
    for(String cur : subs) {
        if(cur.contentEquals(cs)) {
            int start = in.lastIndexOf('.', prevDotPos-1);
            if(start == -1 || start == 0)
                return in;
            return "*" + in.substring(start);
        }
    }

    return "*" + in.substring(prevDotPos);
}

If there's any space for further improvement it would be good to hear.


Something like this is about as fast as you can make it:

static String starOutFirst(String s) {
    final int K = s.indexOf('.');
    return "*" + s.substring(K);
}
static String starOutButLastTwo(String s) {
    final int K = s.lastIndexOf('.', s.lastIndexOf('.') - 1);
    return "*" + s.substring(K);
}

Then you can do:

    System.out.println(starOutFirst("123.123.123.123"));
    // prints "*.123.123.123"

    System.out.println(starOutButLastTwo("a.b.c.example.com"));
    // prints "*.example.com"

You may need to use regex to see which of the two method is applicable for any given string.


I'd try using .indexOf("."), and .substring(index)

You didn't elaborate on the exact pattern you wanted to match but if you can avoid split(), it should cut down on the number of new strings it allocates (1 instead of several).


It's unclear from your question exactly what the code is supposed to do. Does it find the first '.' and replace everything up to it with a '*'? Or is there some fancier logic behind it? Maybe everything up to the nth '.' gets replaced by '*'?

If you're trying to find an instance of a particular string, use something like the Boyer-Moore algorithm. It should be able to find the match for you and you can then replace what you want.

Keep in mind that String in Java is immutable. It might be faster to change the sequence in-place. Check out other CharSequence implementations to see what you can do, e.g. StringBuffer and CharBuffer. If concurrency is not needed, StringBuilder might be an option.

By using a mutable CharSequence instead of the methods on String, you avoid a bunch of object churn. If all you're doing is replacing some slice of the underlying character array with a shorter array (i.e. {'*'}), this is likely to yield a speedup since such array copies are fairly optimized. You'll still be doing an array copy at the end of the day, but it may be faster than new String allocations.

UPDATE

All the above is pretty much hogwash. Sure, maybe you can implement your own CharSequence that gives you better slicing and lazily resizes the array (aka doesn't actually truncate anything until it absolutely must), returning Strings based on offsets and whatnot. But StringBuffer and StringBuilder, at least directly, do not perform as well as the solution poly posted. CharBuffer is entirely inapplicable; I didn't realize it was an nio class earlier: it's meant for other things entirely.

There are some interesting things about poly's code, which I wonder whether he/she knew before posting it, namely that changing the "*" on the final lines of the methods to a '*' results in a significant slowdown.

Nevertheless, here is my benchmark. I found one small optimization: declaring the '.' and "*" expressions as constants adds a bit of a speedup as well as using a locally-scoped StringBuilder instead of the binary infix string concatenation operator.

I know the gc() is at best advisory and at worst a no-op, but I figured adding it with a bit of sleep time might let the VM do some cleanup after creating 1M Strings. Someone may correct me if this is totally naïve.

Simple Benchmark

import java.util.ArrayList;
import java.util.Arrays;

public class StringSplitters {
    private static final String PREFIX = "*";
    private static final char DOT = '.';

    public static String starOutFirst(String s) {
        final int k = s.indexOf(DOT);
        return PREFIX + s.substring(k);
    }

    public static String starOutFirstSb(String s) {
        StringBuilder sb = new StringBuilder();
        final int k = s.indexOf(DOT);
        return sb.append(PREFIX).append(s.substring(k)).toString();
    }

    public static void main(String[] args) throws InterruptedException {
        double[] firstRates = new double[10];
        double[] firstSbRates = new double[10];
        double firstAvg = 0;
        double firstSbAvg = 0;
        double firstMin = Double.POSITIVE_INFINITY;
        double firstMax = Double.NEGATIVE_INFINITY;

        double firstSbMin = Double.POSITIVE_INFINITY;
        double firstSbMax = Double.NEGATIVE_INFINITY;

        for (int i = 0; i < 10; i++) {
            firstRates[i] = testFirst();

            firstAvg += firstRates[i];

            if (firstRates[i] < firstMin)
                firstMin = firstRates[i];
            if (firstRates[i] > firstMax)
                firstMax = firstRates[i];

            Thread.sleep(100);
            System.gc();
            Thread.sleep(100);
        }

        firstAvg /= 10.0d;

        for (int i = 0; i < 10; i++) {
            firstSbRates[i] = testFirstSb();

            firstSbAvg += firstSbRates[i];

            if (firstSbRates[i] < firstSbMin)
                firstSbMin = firstSbRates[i];
            if (firstSbRates[i] > firstSbMax)
                firstSbMax = firstSbRates[i];

            Thread.sleep(100);
            System.gc();
            Thread.sleep(100);
        }

        firstSbAvg /= 10.0d;

        System.out.printf("First:\n\tMin:\t%07.3f\tMax:\t%07.3f\tAvg:\t%07.3f\n\tRates:\t%s\n\n", firstMin, firstMax,
                firstAvg, Arrays.toString(firstRates));
        System.out.printf("FirstSb:\n\tMin:\t%07.3f\tMax:\t%07.3f\tAvg:\t%07.3f\n\tRates:\t%s\n\n", firstSbMin,
                firstSbMax, firstSbAvg, Arrays.toString(firstSbRates));

    }

    private static double testFirst() {
        ArrayList<String> strings = new ArrayList<String>(1000000);

        for (int i = 0; i < 1000000; i++) {
            int first = (int) (Math.random() * 128);
            int second = (int) (Math.random() * 128);
            int third = (int) (Math.random() * 128);
            int fourth = (int) (Math.random() * 128);

            strings.add(String.format("%d.%d.%d.%d", first, second, third, fourth));
        }

        long before = System.currentTimeMillis();
        for (String s : strings) {
            starOutFirst(s);
        }
        long after = System.currentTimeMillis();

        return 1000000000.0d / (after - before);
    }

    private static double testFirstSb() {
        ArrayList<String> strings = new ArrayList<String>(1000000);

        for (int i = 0; i < 1000000; i++) {
            int first = (int) (Math.random() * 128);
            int second = (int) (Math.random() * 128);
            int third = (int) (Math.random() * 128);
            int fourth = (int) (Math.random() * 128);

            strings.add(String.format("%d.%d.%d.%d", first, second, third, fourth));
        }

        long before = System.currentTimeMillis();
        for (String s : strings) {
            starOutFirstSb(s);
        }
        long after = System.currentTimeMillis();

        return 1000000000.0d / (after - before);
    }
}

Output

First:
    Min:    3802281.369 Max:    5434782.609 Avg:    5185796.131
    Rates:  [3802281.3688212926, 5181347.150259067, 5291005.291005291, 5376344.086021505, 5291005.291005291, 5235602.094240838, 5434782.608695652, 5405405.405405405, 5434782.608695652, 5405405.405405405]

FirstSb:
    Min:    4587155.963 Max:    5747126.437 Avg:    5462087.511
    Rates:  [4587155.963302752, 5747126.436781609, 5617977.528089887, 5208333.333333333, 5681818.181818182, 5586592.17877095, 5586592.17877095, 5524861.878453039, 5524861.878453039, 5555555.555555556]
0

精彩评论

暂无评论...
验证码 换一张
取 消