What is the most efficient way to detect duplicate characters in a String in Java?


Using a data structure (a HashMap), I was able to do it.

This is the code:

import java.util.*;

class unique {
    public static void main(String[] args) {
        // Count occurrences of each character; any count above 1 means a duplicate.
        HashMap<Character, Integer> charMap = new HashMap<Character, Integer>();
        boolean isNotUnique = false;
        for (char loop : args[0].toCharArray()) {
            Integer freq = charMap.get(loop);
            charMap.put(loop, (freq == null) ? 1 : freq + 1);
            if (charMap.get(loop) > 1) {
                isNotUnique = true;
            }
        }
        System.out.println(isNotUnique);
    }
}

Without data structures, I came up with a brute-force approach. This is O(n^2):

class unique
{
    public static void main(String[] args)
    {
        String inputString = args[0];
        System.out.println( isUnique( inputString ) );
    }

    private static boolean isUnique(String inputString) {
        // Compare each character against every later character.
        for ( int i = 0; i < inputString.length(); i++ )
        {
            for ( int j = i + 1; j < inputString.length(); j++ )
            {
                if ( inputString.charAt( i ) == inputString.charAt( j ) )
                {
                    return false;
                }
            }
        }
        return true;
    }
}

I was wondering if it was possible to solve this in O(n) time complexity.


If you need to support Unicode characters that aren't represented by surrogate char pairs (i.e. the Basic Multilingual Plane), this will do it:

private static boolean isUnique(String inputString) {
    long[] used = new long[1024]; // 65536 possible char values / 64 bits per long
    for (char c : inputString.toCharArray()) {
        if ((used[c >>> 6] & (1L << c)) != 0) {
            return false;
        }
        used[c >>> 6] |= 1L << c; // mark this character as seen
    }
    return true;
}

It uses bit flags to save memory: each long holds 64 seen/not-seen bits. It's essentially the same thing as if you used an array of booleans:

private static boolean isUnique2(String inputString) {
    boolean[] used = new boolean[65536]; // one flag per possible char value
    for (char c : inputString.toCharArray()) {
        if (used[c]) {
            return false;
        }
        used[c] = true;
    }
    return true;
}

If you only need to support ASCII characters, you could shrink used in either case to reduce the memory required (so long[4] and boolean[256]). Below a certain length of inputString, though, it's probably faster to do the n^2 check than to allocate and zero that memory. So ideally you do a combination of the two based on the length.
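
A hedged sketch of that combination (the 16-character cutoff and the method name are assumptions for illustration; the real crossover point would have to be benchmarked):

private static boolean isUniqueHybrid(String inputString) {
    // Assumed threshold; benchmark to find the real crossover point.
    if (inputString.length() <= 16) {
        // Short input: the O(n^2) scan avoids allocating anything.
        for (int i = 0; i < inputString.length(); i++) {
            for (int j = i + 1; j < inputString.length(); j++) {
                if (inputString.charAt(i) == inputString.charAt(j)) {
                    return false;
                }
            }
        }
        return true;
    }
    // Longer input: fall back to the bit-array version above.
    return isUnique(inputString);
}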

If you need to support all possible Unicode characters you'll have to modify this to handle surrogate char pairs. You can detect them with Character.isHighSurrogate(c); the java.lang.Character documentation covers surrogate handling in more detail.


What is the definition of a character here: a-z, A-Z, or all of Unicode?

If the string is quite long, say one million characters, you could build an int array whose length is the size of your character set, initialized to zero.

Then go through the string and increment array[(int) c] for each character c; afterwards you can read each character's frequency straight from the array.

A short string won't need such a method, however.

This method is O(n). A minimal sketch of the idea follows.
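
A sketch of that counting approach (the method name and the 65536 array size for the full char range are assumptions for illustration):

// One int counter per possible char value. Shrink the array if the
// character set is known to be smaller.
private static boolean isUniqueByCounting(String inputString) {
    int[] counts = new int[65536];
    for (char c : inputString.toCharArray()) {
        if (++counts[c] > 1) {
            return false; // this character has appeared before
        }
    }
    return true;
}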


"I was wondering if it was possible to solve in O(n) time complexity"

There are two simple solutions that are O(N) in time:

  • The HashSet approach is O(N) in time and O(N) in space, where N is the String length. (A regular Java HashSet containing N distinct characters will take O(N) space, with a relatively large constant of proportionality.) A sketch follows this list.

  • The bit array approach is O(N) in time and O(1) in space, but the O(1) is 8K bytes (or 64K bytes if you use a boolean[]) and there is an associated cost of zeroing that much memory added to the time.
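
For reference, a minimal sketch of the HashSet approach (Set.add returns false when the element is already present, which gives an early exit on the first duplicate):

import java.util.HashSet;
import java.util.Set;

private static boolean isUniqueHashSet(String inputString) {
    Set<Character> seen = new HashSet<>();
    for (char c : inputString.toCharArray()) {
        if (!seen.add(c)) {
            return false; // add() is false when c was already in the set
        }
    }
    return true;
}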

Neither of these is the best answer in all cases.

  • For small enough N, a simple O(N^2) nested loop will be fastest. (And it uses no extra memory.)

  • For medium-sized N, a custom hash table that uses rehashing on collision will be better than HashSet or the bit array approach; a sketch appears below. The HashSet approach will in turn be better than the bit array approach.

  • For large enough N, the bit array approach will be fastest and use the least memory.

(For the above, I'm assuming that input strings don't contain any duplicate characters. If they do, then the actual value of N will be less than the string length.)
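
As an illustration of the custom hash table idea, here is a hedged sketch using open addressing with linear probing on collision (the sizing policy and method name are assumptions, not a tuned implementation):

private static boolean isUniqueOpenAddressing(String inputString) {
    // Capacity: a power of two at least twice the input length (assumed policy).
    int capacity = Integer.highestOneBit(Math.max(inputString.length() * 2, 16)) * 2;
    char[] slots = new char[capacity];
    boolean[] occupied = new boolean[capacity];
    for (char c : inputString.toCharArray()) {
        int i = c & (capacity - 1);       // initial bucket
        while (occupied[i]) {
            if (slots[i] == c) {
                return false;             // duplicate found
            }
            i = (i + 1) & (capacity - 1); // probe the next slot
        }
        occupied[i] = true;
        slots[i] = c;
    }
    return true;
}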


If duplicate character detection needs to cope with UTF-16 surrogate pairs, then the simple approach is to transcode on the fly to Unicode codepoints, and change the data structures to use a HashSet<Integer>, bigger bit arrays, etc.
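
A hedged sketch of the codepoint variant, using String.codePoints() (Java 8+), which decodes surrogate pairs into full code points on the fly:

import java.util.HashSet;
import java.util.Set;

private static boolean isUniqueCodePoints(String inputString) {
    // allMatch stops at the first add() that returns false, i.e. the first duplicate.
    Set<Integer> seen = new HashSet<>();
    return inputString.codePoints().allMatch(seen::add);
}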

Conversely, if you can limit the size of the input character set, you can reduce the size of the bit arrays, etc.

These tweaks would make a big difference to where the small / medium / large thresholds are likely to fall.
