I have this code that determines whether a word (ignoring case) is included in a wordList text file. However, the wordList text file may have 65,000+ lines, and searching for a single word with my implementation below takes nearly a minute. Can you think of a better implementation?
Thanks!
import java.io.*;
import java.util.*;

public class WordSearch
{
    LinkedList<String> lxx;
    FileReader fxx;
    BufferedReader bxx;

    public WordSearch(String wordlist) throws IOException
    {
        fxx = new FileReader(wordlist);
        bxx = new BufferedReader(fxx);
        lxx = new LinkedList<String>();
        String word;
        while ((word = bxx.readLine()) != null)
        {
            lxx.add(word);
        }
        bxx.close();
    }

    public boolean inTheList(String theWord)
    {
        for (int i = 0; i < lxx.size(); i++)
        {
            if (theWord.compareToIgnoreCase(lxx.get(i)) == 0)
            {
                return true;
            }
        }
        return false;
    }
}
Use a HashSet, into which you put a lowercase version of each word. Checking if a HashSet contains a specified string is, on average, a constant-time (read: extremely fast) operation.
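A minimal sketch of that idea (the class and sample words below are illustrative, not taken from the question):

import java.util.*;

class LowercaseSetDemo
{
    public static void main(String[] args)
    {
        // Build the set once, storing each word in lowercase.
        Set<String> words = new HashSet<String>();
        for (String line : Arrays.asList("Apple", "Banana", "Cherry"))
        {
            words.add(line.toLowerCase());
        }
        // Membership checks are then constant time on average.
        System.out.println(words.contains("BANANA".toLowerCase())); // true
    }
}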
Since you're searching, you may want to consider sorting the list before searching; then you can do a binary search, which is much faster than a linear search. That helps if you'll perform multiple searches on the same list; otherwise the penalty you pay to sort the list isn't worth it for a single search.
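For example, a small sketch of the sort-then-binary-search approach (the class name and sample data are made up for illustration):

import java.util.*;

public class SortedLookup
{
    // Sort once with a case-insensitive comparator, then reuse the sorted list for many lookups.
    static boolean inSortedList(List<String> sortedWords, String theWord)
    {
        // binarySearch must use the same comparator the list was sorted with;
        // a non-negative result means the word was found.
        return Collections.binarySearch(sortedWords, theWord, String.CASE_INSENSITIVE_ORDER) >= 0;
    }

    public static void main(String[] args)
    {
        List<String> words = new ArrayList<String>(Arrays.asList("Apple", "banana", "Cherry"));
        Collections.sort(words, String.CASE_INSENSITIVE_ORDER);
        System.out.println(inSortedList(words, "BANANA")); // true
    }
}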
Also, doing linear search on a linked list using "lxx.get(i)" is asking for trouble. LinkedList.get() is O(n). You can either use an Iterator (easy way: for (String s : lxx)) or switch to a list type that has O(1) access time, such as ArrayList.
Each search through lxx is an O(n) operation, so this will get quite costly when you have thousands of words. Instead, use a HashSet:
Set<String> lxx;
...
lxx = new HashSet<String>();
while ((word = bxx.readLine()) != null) {
    lxx.add(word.toLowerCase());
}
bxx.close();
Then use lxx.contains(theWord.toLowerCase()) to check if the word is in the file.
Each lookup in the HashSet is an O(1) operation, so the time it takes is (nearly) independent of the size of your file.
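Concretely, the lookup method from the question then shrinks to something like this (a sketch that reuses the original names):

public boolean inTheList(String theWord)
{
    return lxx.contains(theWord.toLowerCase()); // average O(1), regardless of file size
}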
First off, don't declare your variable to be a LinkedList; declare it to be a List (parts of the code not concerned with the List have been omitted):
public class WordSearch
{
    List<String> lxx;

    public WordSearch(String wordlist) throws IOException
    {
        lxx = new LinkedList<String>();
    }
}
Next, do not call get on the list; with a LinkedList, get will be VERY slow. Instead use an Iterator... better yet, use the new-style for loop, which uses an iterator for you:
public boolean inTheList(String theWord)
{
    for (String word : lxx)
    {
        if (theWord.compareToIgnoreCase(word) == 0)
        {
            return true;
        }
    }
    return false;
}
Next, change the new LinkedList to a new ArrayList:
lxx = new ArrayList<String>();
This code should be faster, but you can still do better.
Since you do not care about duplicate words, use a Set instead of a List and a HashSet instead of an ArrayList.
Doing that will speed the program up significantly.
Your original code, using a LinkedList with get, has to walk the list from the beginning each time it looks up the next word. Using an Iterator (via the new-style for-each loop) stops that from happening.
Using a LinkedList also means that each step to the next element involves following a node reference; an ArrayList doesn't have that overhead.
Using a HashSet means lookups go through a hash table, which gives very fast, on average constant-time lookups.
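Putting those steps together, a minimal sketch of the end result might look like the following (it reuses the question's names; this illustrates the advice above rather than reproducing anyone's exact code):

import java.io.*;
import java.util.*;

public class WordSearch
{
    Set<String> lxx;

    public WordSearch(String wordlist) throws IOException
    {
        BufferedReader bxx = new BufferedReader(new FileReader(wordlist));
        lxx = new HashSet<String>();
        String word;
        while ((word = bxx.readLine()) != null)
        {
            lxx.add(word.toLowerCase()); // normalize case once, at load time
        }
        bxx.close();
    }

    public boolean inTheList(String theWord)
    {
        return lxx.contains(theWord.toLowerCase()); // hash lookup, O(1) on average
    }
}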
Here's my implementation; it searches in under 50 ms.
First you have to load the file and keep it sorted in memory.
You may load it however you want, but loading it in big chunks makes this easier.
My input was the "byte into python" book (I downloaded the single-file HTML version) and the Java Language Specification (I downloaded the HTML and created a single file out of all the pages).
To build the word list as one big file I used this same program (see the commented-out code).
Once I had a big file with about 300k words, I ran the program with this output:
C:\Users\oreyes\langs\java\search>dir singlelineInput.txt
Volume in drive C has no label.
Volume Serial Number is 22A8-203B
Directory of C:\Users\oreyes\langs\java\search
04/03/2011 09:37 p.m. 3,898,345 singlelineInput.txt
1 File(s) 3,898,345 bytes
C:\Users\oreyes\langs\java\search>javac WordSearch.java
C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "great"
Loaded 377381 words in 2844 ms
true
in 31 ms
C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "great"
Loaded 377381 words in 2812 ms
true
in 31 ms
C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "awesome"
Loaded 377381 words in 2813 ms
false
in 47 ms
C:\Users\oreyes\langs\java\search>gvim singlelineInput.txt
C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "during"
Loaded 377381 words in 2813 ms
true
in 15 ms
C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "specification"
Loaded 377381 words in 2875 ms
true
in 47 ms
C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "<href"
Loaded 377381 words in 2844 ms
false
in 47 ms
C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "<br>"
Loaded 377381 words in 2829 ms
true
in 15 ms
Always under 50 ms.
Here's the code:
import java.io.*;
import java.util.*;
class WordSearch {

    String inputFile;
    List<String> words;

    public WordSearch(String file) {
        inputFile = file;
    }

    public void initialize() throws IOException {
        long start = System.currentTimeMillis();

        File file = new File(inputFile);
        ByteArrayOutputStream baos = new ByteArrayOutputStream((int) file.length());
        FileInputStream in = new FileInputStream(file);
        copyLarge(in, baos, (int) file.length());

        Scanner scanner = new Scanner(new ByteArrayInputStream(baos.toByteArray()));
        words = new LinkedList<String>();
        while (scanner.hasNextLine()) {
            String l = scanner.nextLine().trim();
            //for( String s : l.split("\\s+")){
            //System.out.println( s );
            words.add(l.toLowerCase());
            //}
        }
        Collections.sort(words);
        for (String s : words) {
            //System.out.println( s );
        }
        System.out.println("Loaded " + words.size() + " words in " + (System.currentTimeMillis() - start) + " ms");
    }

    public boolean contains(String aWord) {
        return words.contains(aWord.toLowerCase());
    }

    // taken from: http://stackoverflow.com/questions/326390/how-to-create-a-java-string-from-the-contents-of-a-file/326413#326413
    public static long copyLarge(InputStream input, OutputStream output, int size)
            throws IOException {
        byte[] buffer = new byte[size]; // something biggie
        long count = 0;
        int n = 0;
        while (-1 != (n = input.read(buffer))) {
            output.write(buffer, 0, n);
            count += n;
        }
        return count;
    }

    public static void main(String... args) throws IOException {
        WordSearch ws = new WordSearch(args[0]);
        ws.initialize();
        long start = System.currentTimeMillis();
        System.out.println(ws.contains(args[1]));
        System.out.println("in " + (System.currentTimeMillis() - start) + " ms ");
    }
}
The hard part was to get a sample input :P
Guess what: using a HashSet, it returns in no time.
Here's the modified version; it always finishes in 0 ms:
import java.io.*;
import java.util.*;
class WordSearch {

    String inputFile;
    //List<String> words;
    Set<String> words;

    public WordSearch(String file) {
        inputFile = file;
    }

    public void initialize() throws IOException {
        long start = System.currentTimeMillis();

        File file = new File(inputFile);
        ByteArrayOutputStream baos = new ByteArrayOutputStream((int) file.length());
        FileInputStream in = new FileInputStream(file);
        copyLarge(in, baos, (int) file.length());

        Scanner scanner = new Scanner(new ByteArrayInputStream(baos.toByteArray()));
        words = new HashSet<String>();
        while (scanner.hasNextLine()) {
            String l = scanner.nextLine().trim();
            //for( String s : l.split("\\s+")){
            //System.out.println( s );
            words.add(l.toLowerCase());
            //}
        }
        //Collections.sort( words );
        //for (String s : words) {
        //    System.out.println(s); // debug output; left commented out
        //}
System.out.println("Loaded " + words.size() + " words in "+ ( System.currentTimeMillis() - start ) + " ms" );
}
public boolean contains( String aWord ) {
return words.contains( aWord.toLowerCase() );
}
public static long copyLarge(InputStream input, OutputStream output, int size )
throws IOException {
byte[] buffer = new byte[size];// something biggie
long count = 0;
int n = 0;
while (-1 != (n = input.read(buffer))) {
output.write(buffer, 0, n);
count += n;
}
return count;
}
public static void main( String ... args ) throws IOException {
WordSearch ws = new WordSearch( args[0] );
ws.initialize();
long start = System.currentTimeMillis();
System.out.println( ws.contains( args[1] ) );
System.out.println("in "+ ( System.currentTimeMillis() - start ) +" ms ");
}
}
Now I know for sure :)
Two suggestions; both data structures give you better lookup performance than a plain list (a small trie sketch follows below):
- Directed acyclic word graph (DAWG)
- Dictionary data structure such as a trie (n-ary tree)
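As an illustration of the dictionary-structure suggestion, here is a very small trie sketch for case-insensitive lookups (purely illustrative; a real DAWG or a production trie would be far more memory-efficient):

import java.util.*;

class Trie
{
    private static class Node
    {
        Map<Character, Node> children = new HashMap<Character, Node>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Insert a word, lowercased, one character per level.
    void add(String word)
    {
        Node node = root;
        for (char c : word.toLowerCase().toCharArray())
        {
            Node next = node.children.get(c);
            if (next == null)
            {
                next = new Node();
                node.children.put(c, next);
            }
            node = next;
        }
        node.isWord = true;
    }

    // Lookup cost depends only on the length of the word, not on how many words are stored.
    boolean contains(String word)
    {
        Node node = root;
        for (char c : word.toLowerCase().toCharArray())
        {
            node = node.children.get(c);
            if (node == null)
            {
                return false;
            }
        }
        return node.isWord;
    }

    public static void main(String[] args)
    {
        Trie t = new Trie();
        t.add("Great");
        System.out.println(t.contains("great"));  // true
        System.out.println(t.contains("grease")); // false
    }
}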